Statistical Genetics Analysis

The following web pages provide a rudimentary set of guidelines to help researchers initiate a linkage study. Specifically, this guide advises on how to organise data, under the assumption that the samples have been genotyped by the Australian Genome Research Facility (AGRF). Most references to pre-LINKAGE format assume files in AGRF standard formats.

Contemplating a Linkage Study?

Researchers and clinicians routinely come across patients with a particular disease who report further family members affected with the same or similar symptoms. This suggests that the disease in question is likely to be of genetic origin rather than having environmental causes, as it is unlikely that all members of the family will have been exposed to the same environmental triggers. The temptation is to try to identify the gene, and the allele of that gene, responsible for the disease.

To identify the gene responsible, its location within the genome must first be found. This is done by linkage analysis: genotypes are collected for a series of markers across the human genome and tested for co-inheritance with the disease pattern in the family. Linkage analysis calculates the likelihood of the genetic data on the pedigrees to gauge the evidence that a marker is linked to the disease.

Diseases can be caused by single genes (examples are Huntington's disease or Marfan's syndrome) or by a complex interaction of several genes which may also need environmental triggers (examples are diabetes and multiple sclerosis).

Linkage studies for Mendelian or single gene disorders are most likely to succeed with one or several large pedigrees. Affected individuals who are related are much more likely to share the same disease locus and, furthermore, the same disease-causing allele at that locus, thus avoiding complications such as locus and allelic heterogeneity. For complex diseases, allelic and locus heterogeneity are the norm, which makes their analysis much more difficult.

Whilst it is possible to carry out power studies, they usually reflect only best case scenarios and are often unrealistic, especially for complex diseases. Unless you have a highly penetrant, simple Mendelian disorder, it is unlikely that one small family will provide enough power to achieve a LOD score of 3. Thus, if you are contemplating a study of a disease, inform yourself about its genetic aetiology. Is the disease likely to be complex? If so, do you have a unique clinical subtype that may allow you to map a Mendelian subtype of the disease? If not, you will be looking at a very large study with many affected sibling pairs or a very large number of pedigrees, or even an association study if it is difficult to find pedigrees with a sufficient number of affecteds. If the phenotype is distinctive and severe, it is more likely that the disease will be Mendelian. In this case, an examination of the pedigree may immediately reveal the mode of inheritance.
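To make the LOD score threshold of 3 concrete, the sketch below computes a two-point LOD score for the simplest textbook case: phase-known, fully informative meioses with complete penetrance. These simplifications, and the function names, are assumptions made for illustration; real pedigrees need a linkage program to evaluate the likelihood.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score at recombination fraction theta.

    Assumes every meiosis is phase-known and fully informative, so the
    likelihood is simply theta^k * (1 - theta)^(n - k).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")

    non_recombinants = meioses - recombinants
    log_l_theta = 0.0
    if recombinants > 0:
        if theta == 0.0:
            # A recombinant is impossible under complete linkage.
            return float("-inf")
        log_l_theta += recombinants * math.log10(theta)
    if non_recombinants > 0:
        log_l_theta += non_recombinants * math.log10(1.0 - theta)

    # Null hypothesis: no linkage, theta = 0.5 for every meiosis.
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


def meioses_needed(threshold: float = 3.0) -> int:
    """Smallest number of non-recombinant meioses reaching the threshold at theta = 0."""
    # Each non-recombinant meiosis contributes log10(2) to the LOD score.
    return math.ceil(threshold / math.log10(2.0))


if __name__ == "__main__":
    print(meioses_needed(3.0))
    print(round(lod_score(0, 10, 0.0), 3))
```

Under these idealised assumptions each informative meiosis adds at most log10(2), about 0.3, to the LOD score, so at least ten are needed to reach 3. Reduced penetrance, unknown phase and uninformative markers all lower the information per meiosis, which is why one small family rarely suffices.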

Literature searches pertaining to the disease may reveal that possible causal genes have already been identified. If this is the case then a mutation screen of the probable disease gene(s) is a logical first step, starting with the known polymorphisms. If this fails to identify a causal mutation then a genome-wide screen may be called for. This is a systematic search of the genome for the location of the gene using genetic signposts (the markers).

For genome-wide scans we advise using the Australian Genome Research Facility (AGRF) to carry out the genotyping. The AGRF has years of experience in generating high quality genotyping data, giving you the best chance of locating the disease-causing gene. The AGRF has several commercial marker sets that enable rapid genome-wide scanning of individuals. These include a high density 5 cM set and a lower density 10 cM set (that is, the markers are spaced approximately 5 or 10 cM apart). The 10 cM set is usually sufficient for a genome-wide scan for a Mendelian disorder. Please contact AGRF for pricing and further details. Recent and exciting additions to the linkage mapper's toolkit are the 10K and 2 x 50K SNP chips from Affymetrix. The AGRF now also provides this service. With a cost per individual similar to that of a 5 cM scan but with more information available, this will become a more common alternative from 2004 onwards. Caution: there are currently some methodological challenges in linkage mapping with such data, because the assumption of linkage equilibrium between markers is violated for many SNP markers on the chip.
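As a rough illustration of what a 5 cM or 10 cM marker spacing means, the sketch below converts a map distance into a recombination fraction using Haldane's map function. Choosing Haldane's function (which assumes no crossover interference) is an assumption made here for simplicity; a given linkage program or map may use a different function such as Kosambi's. The function names are hypothetical.

```python
import math


def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance in centimorgans (Haldane)."""
    if distance_cm < 0:
        raise ValueError("map distance cannot be negative")
    distance_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


def worst_case_theta(spacing_cm: float) -> float:
    """Recombination fraction between a gene and its nearest marker in the
    worst case, i.e. when the gene sits exactly midway between two markers."""
    return haldane_theta(spacing_cm / 2.0)


if __name__ == "__main__":
    for spacing in (10.0, 5.0):
        print(f"{spacing:g} cM set: worst-case theta = {worst_case_theta(spacing):.4f}")
```

With evenly spaced markers, a disease gene is never more than half the spacing from its nearest marker, so even the 10 cM set keeps the worst-case recombination fraction below 0.05 under this model.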

Most of the data preparation programs we have generated have stemmed from genotyping data from the AGRF and thus are tailored to the output from the AGRF, whether it be from human or murine linkage or association studies.

Starting the study

In general, the more pedigree members from whom blood samples can be collected, the more likely it is that the study will successfully identify the disease gene. Both unaffecteds and affecteds are important in the analysis. If the disease is of late onset, the analysis will be more difficult and more samples need to be collected. In general, the more complex the phenotype (for example, uncertainty about the affectedness status of individuals), the more samples are needed to be successful.

Preparation of Data

You will receive the genotyping data in an EXCEL spreadsheet format from the AGRF once your samples have been genotyped. This data cannot be immediately analysed as it needs to undergo cleaning and preparation so that it can be analysed with a linkage mapping program such as GENEHUNTER, MERLIN or VITESSE.

The data will contain pedigree and genotyping errors, some of which can be detected and removed. This cleaning process is part of the data preparation. In general, the genotyping error rate is low for genome-wide scan data because the markers have been chosen carefully for reliability: they are easy to call and have high heterozygosity.

There are three main steps involved in the preparation of the data:

  1. Making the .dat and .pre files (creating LINKAGE style input files)
  2. Identification of pedigree errors
  3. Identification of genotyping errors

Making the .dat and .pre files

Download the C program LINKPREP or use the LINKAGE software. Read the installation and help guide. You will require several other sources of information such as a pedigree file and a file containing the map distances.
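For orientation, the sketch below writes pedigree lines in the conventional pre-LINKAGE (.pre) column layout: pedigree ID, individual ID, father ID, mother ID, sex (1 = male, 2 = female), affection status (0 = unknown, 1 = unaffected, 2 = affected), followed by two alleles per marker with 0 for a missing allele. This is not LINKPREP and does not read the AGRF spreadsheet; the in-memory structures, marker names and function names are hypothetical and stand in for whatever your own pedigree and genotype files contain.

```python
from typing import Dict, List, Tuple

Genotype = Tuple[int, int]  # two numbered alleles; 0 means missing


def pre_line(ped: int, person: int, father: int, mother: int,
             sex: int, affection: int, genotypes: List[Genotype]) -> str:
    """Format one individual as a whitespace-separated pre-LINKAGE line."""
    fields = [ped, person, father, mother, sex, affection]
    for allele1, allele2 in genotypes:
        fields.extend([allele1, allele2])
    return " ".join(str(field) for field in fields)


def build_pre(pedigree: List[dict],
              genotypes: Dict[int, Dict[str, Genotype]],
              marker_order: List[str]) -> str:
    """Build the text of a .pre file, with markers written in map order."""
    lines = []
    for person in pedigree:
        calls = genotypes.get(person["id"], {})
        # Untyped markers (or untyped individuals) are written as 0 0.
        ordered = [calls.get(marker, (0, 0)) for marker in marker_order]
        lines.append(pre_line(person["ped"], person["id"], person["father"],
                              person["mother"], person["sex"], person["aff"],
                              ordered))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical trio: unaffected father, affected mother, affected son.
    pedigree = [
        {"ped": 1, "id": 1, "father": 0, "mother": 0, "sex": 1, "aff": 1},
        {"ped": 1, "id": 2, "father": 0, "mother": 0, "sex": 2, "aff": 2},
        {"ped": 1, "id": 3, "father": 1, "mother": 2, "sex": 1, "aff": 2},
    ]
    genotypes = {
        1: {"M1": (1, 2), "M2": (3, 3)},
        2: {"M1": (2, 4), "M2": (1, 3)},
        3: {"M1": (1, 4)},  # M2 failed for the son
    }
    print(build_pre(pedigree, genotypes, ["M1", "M2"]), end="")
```

The marker order passed in should match the order of the loci in the .dat file and the map file, since linkage programs match genotype columns to loci by position rather than by name.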

Identification of pedigree errors

We like to use PREST to identify pedigree errors. This helpful program also identifies the correct relationships and carries out the relevant statistical tests. Alternatively, you could proceed immediately to the identification of genotyping errors, which often gives clues to pedigree errors: in genome-wide scans, individuals involved in a pedigree error are easily identified because they display abnormally high numbers of genotyping errors. These individuals can be identified manually and new relationships tested.

Identification of genotyping errors

We like using PEDCHECK for this task. This program identifies genotyping errors on both autosomes and sex chromosomes by checking Mendelian inheritance rules. It summarises the genotyping errors found by pedigree and by marker at the end of its output. If the number of genotyping errors is particularly large for a pedigree, it is likely that a pedigree error has caused most of the Mendelian genotyping errors. In this case you should use PREST to identify the cause. Chris Wilkinson wrote a useful Perl script that summarises the PEDCHECK output and adds marker names to it.
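The kind of rule being checked can be illustrated with a minimal sketch for the simplest case: an autosomal marker in a fully genotyped father, mother and child trio. This is not PEDCHECK's algorithm, which is more thorough and also handles sex chromosomes; the function names and data structures here are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Genotype = Tuple[int, int]  # two numbered alleles; 0 means missing


def is_mendelian_consistent(child: Genotype, father: Genotype,
                            mother: Genotype) -> bool:
    """True if the child could have inherited one allele from each parent.

    Autosomal markers only. Trios with any missing allele cannot be tested
    by this simple rule and are reported as consistent.
    """
    if 0 in child or 0 in father or 0 in mother:
        return True
    first, second = child
    return ((first in father and second in mother) or
            (second in father and first in mother))


def count_trio_errors(trios: List[Tuple[int, int, int]],
                      genotypes: Dict[int, Dict[str, Genotype]],
                      markers: List[str]) -> Tuple[Counter, Counter]:
    """Count Mendelian inconsistencies per marker and per child.

    Each trio is (child ID, father ID, mother ID).
    """
    per_marker: Counter = Counter()
    per_child: Counter = Counter()
    for child, father, mother in trios:
        for marker in markers:
            calls = [genotypes.get(person, {}).get(marker, (0, 0))
                     for person in (child, father, mother)]
            if not is_mendelian_consistent(*calls):
                per_marker[marker] += 1
                per_child[child] += 1
    return per_marker, per_child


if __name__ == "__main__":
    genotypes = {
        1: {"M1": (1, 2), "M2": (3, 3)},   # father
        2: {"M1": (2, 4), "M2": (1, 3)},   # mother
        3: {"M1": (1, 4), "M2": (1, 1)},   # child: M2 is inconsistent
    }
    per_marker, per_child = count_trio_errors([(3, 1, 2)], genotypes,
                                              ["M1", "M2"])
    print(dict(per_marker), dict(per_child))
```

Tallying by child as well as by marker mirrors the reasoning above: an error concentrated at one marker points to a genotyping problem, whereas one individual failing at many markers points to a pedigree error.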

Mendelian inheritance checks require genotyping data from relatives. Hence Mendelian error checking is not possible in case-control studies (association studies) or in sib pair studies where no parental genotyping data exist. In these cases another type of genotyping error check may be carried out. Genotyping errors will in some cases appear as recombination events, so an excess of apparent recombination events that stretches the map distance beyond that of the given map may indicate their presence. This type of genotyping error detection has to be carried out with extreme caution but can be very useful in fine mapping. It is in general not very useful or sensible for 10 cM genome-wide scans, but the denser the map, the more certainty there is in determining the recombination events and the likelihood of these being in error. The programs MERLIN and SIBMED will carry out such tests.

Useful Websites for Statistical Genetics

The most useful website for statistical genetics is the Rockefeller linkage site. You will find all the programs mentioned above (apart from some of the in-house software written by our group) at this website.
http://linkage.rockefeller.edu/soft/list.html

The Gene CRC website also provides further details and some references:
http://www.genecrc.org/index.htm

Last modified: 15 March 2004