Introduction to Linkage Analysis March 2002
3 Stages of Genetic Mapping Are there genes influencing this trait? Epidemiological studies Where are those genes? Linkage analysis What are those genes? Association analysis
Where are those genes?
Outline How is genetic information organized? Chromosomes Sequence Examples of genetic variation Changes that have observable effects Genetic markers Linkage analysis Strategy for surveying variation in families
Genetic Information Human Genome 22 autosomes X and Y Sequence of 3 x 10^9 base-pairs ~17-20 bp can identify unique sequence in the genome Variation Most sequence is conserved across individuals 1 in 10^3 base-pairs differs between chromosomes
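The "~17-20 bp" figure can be checked with rough arithmetic. The sketch below assumes every base is independent and equally likely, which real genomic sequence is not (repeats and base composition push the practical length up); the function names are illustrative only.

```python
GENOME_SIZE = 3e9  # base-pairs, as quoted on the slide


def expected_chance_matches(k: int, genome_size: float = GENOME_SIZE) -> float:
    """Expected number of genome positions matching a given k-mer by chance.

    Assumes independent, equally likely bases (a simplification).
    """
    return genome_size / 4 ** k


def shortest_unique_length(genome_size: float = GENOME_SIZE) -> int:
    """Smallest k for which fewer than one chance match is expected."""
    k = 1
    while expected_chance_matches(k, genome_size) >= 1:
        k += 1
    return k


if __name__ == "__main__":
    print("shortest k:", shortest_unique_length())
    for k in (17, 20):
        print(k, "bp ->", expected_chance_matches(k), "expected chance matches")
```

Under this idealized model the bound is 16 bp, a little below the 17-20 bp quoted above, which leaves a margin for non-random sequence.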
DNA Polymer of 4 bases Purines (A) – Adenine (G) – Guanine Pyrimidines (C) – Cytosine (T) – Thymine Double Helix Complementary Strands Hydrogen Bonds
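Complementary strands pair A with T and C with G. As a minimal sketch (the helper name is hypothetical), the opposite strand of a sequence can be derived like this, read in the conventional 5' to 3' direction:

```python
# Watson-Crick base pairing: purines pair with pyrimidines
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, reversed so it reads 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.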
Some Types of DNA Sequence Genes ~30,000 in humans Exons, translated into protein Introns, transcribed into RNA, but not protein Promoters Enhancers Repeat DNA Pseudogenes
Genetic Code DNA RNA Protein DNA: 4 bases (A,T,C,G) RNA: 4 bases (A,U,C,G) Proteins: 20 amino-acids Universal Genetic Code Translation between DNA/RNA and protein Three bases code for one amino-acid
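Because three bases code for one amino-acid, there are 4^3 = 64 possible codons for 20 amino-acids plus stop signals. The sketch below uses a deliberately partial codon table (only a handful of standard-code entries, written in the DNA alphabet) to illustrate reading a sequence three bases at a time; codons missing from the partial table are shown as "X".

```python
from itertools import product

BASES = "ACGT"
# Every possible three-base codon: 4**3 = 64 of them
ALL_CODONS = ["".join(c) for c in product(BASES, repeat=3)]

# Partial table for illustration only; the full code has 64 entries.
PARTIAL_CODE = {
    "ATG": "M",  # methionine (also the usual start codon)
    "TTT": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop
    "TAG": "*",  # stop
    "TGA": "*",  # stop
}


def translate(dna: str) -> str:
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino_acid = PARTIAL_CODE.get(dna[i:i + 3].upper(), "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(ALL_CODONS), translate("ATGTTTGGCTAA"))
```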
Genetic Code
Example of CFTR Variants
Phenotype vs. Genotype Genotype Underlying genetic constitution Phenotype Observed manifestation of a genotype Different changes within CFTR all lead to cystic fibrosis phenotype
Common types of DNA variants Tandem repeats Microsatellites Single nucleotide polymorphisms Insertions Deletions
Repeat Length Polymorphisms Variable Number Tandem Repeats VNTRs Typical repeat units of 10 – 100s bp E.g.: ~110 bp repeat in IL1RN gene Microsatellites Simple repeat sequences Most popular are 2, 3 or 4 bp E.g.: ACACACAC … D naming scheme (e.g., D2S160)
Microsatellites Most popular markers for linkage analysis Large number of alleles (10 is common) Can distinguish and track individual chromosomes in families Relatively abundant ~15,000 mapped loci
SNPs Single Nucleotide Polymorphisms Change one nucleotide Insert Delete Replace it with a different nucleotide Many have no phenotypic effect Some can disrupt or affect gene function
A little more on SNPs Most SNPs have only two alleles Easy to automate their scoring Becoming extremely popular Typing Methods Sequencing Restriction Site Hybridization
Classifying Genotypes Each individual carries two alleles If there are n alternative alleles … … there will be n (n + 1) / 2 possible genotypes 3 possible genotypes for SNPs, typically more for microsatellites and VNTRs Homozygotes The two alleles are the same Heterozygotes The two alleles are different
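The n (n + 1) / 2 count follows from choosing an unordered pair of alleles with repetition allowed. A short sketch that enumerates the genotypes and checks the formula (function names are illustrative):

```python
from itertools import combinations_with_replacement


def genotype_count(n_alleles: int) -> int:
    """Number of unordered genotypes for n alternative alleles."""
    return n_alleles * (n_alleles + 1) // 2


def enumerate_genotypes(alleles):
    """List every unordered pair of alleles, homozygotes included."""
    return list(combinations_with_replacement(alleles, 2))


if __name__ == "__main__":
    print(enumerate_genotypes(["A", "G"]))  # a SNP with two alleles
    print(genotype_count(10))               # a microsatellite with 10 alleles
```

For a two-allele SNP this gives the 3 genotypes mentioned above; a 10-allele microsatellite gives 55, of which 10 are homozygotes.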
Genes in an individual Sexual reproduction One copy inherited from father One copy inherited from mother Each individual has 2 copies of each chromosome 2 copies of each gene These copies may be similar or different
Meiosis Leads to formation of haploid gametes from diploid cells Assortment of genetic loci Recombination or crossover
What happens in meiosis …
Recombination [Figure: gamete probabilities labeled θ and 1 - θ, where θ is the recombination fraction]
Recombination Actual No. of recombinants between two locations An average of one per Morgan Observed Usually, only whether the number of crossovers between two locations is odd or even can be established
Recombination and Map Distance
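The slides do not say which map function relates distance to recombination fraction. One common choice is Haldane's map function, which assumes crossovers occur as a Poisson process with no interference. Under that assumption, the recombination fraction is the probability of an odd number of crossovers in an interval of d Morgans. A minimal sketch:

```python
import math


def haldane_theta(d_morgans: float) -> float:
    """Recombination fraction for a map distance in Morgans (Haldane).

    Equals P(odd number of crossovers) when the crossover count is
    Poisson with mean d_morgans; this is an assumed model.
    """
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def haldane_distance(theta: float) -> float:
    """Inverse map: distance in Morgans for a recombination fraction."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must be in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * theta)


if __name__ == "__main__":
    for d in (0.01, 0.1, 0.5, 2.0):
        print(d, "Morgans ->", haldane_theta(d))
```

For short intervals the recombination fraction is close to the map distance, while for distant loci it approaches but never exceeds 1/2.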
Intuition for Linkage Analysis Millions of variations that could be responsible for disease Impractical to investigate individually Within families, they are organized into a limited number of haplotypes Sample a modest number of markers to determine whether each stretch of chromosome is shared
Tracing Chromosomes
IBD At each location, try to establish whether siblings (or twins) share 0, 1 or 2 chromosomes Inference may be probabilistic
Example of Scoring IBD Parental genotypes are available Siblings are IBD = 2 Share maternal and paternal chromosomes
Example of Scoring IBD II Parental genotypes unavailable IBD between siblings may be 0, 1 or 2 Likelihood of each outcome depends on frequency of allele A
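To see why the answer depends on the frequency of allele A, consider a hypothetical configuration (not necessarily the one pictured on the slide) in which both siblings have genotype A/A and no parents are typed. Assuming Hardy-Weinberg proportions and prior IBD probabilities of 1/4, 1/2 and 1/4, Bayes' rule gives the posterior for each IBD state:

```python
def sib_ibd_posterior_both_AA(p: float) -> dict:
    """Posterior P(IBD = 0, 1, 2) for two sibs who are both A/A.

    p is the population frequency of allele A (0 < p <= 1).
    Assumes random mating and untyped parents.
    """
    prior = {0: 0.25, 1: 0.5, 2: 0.25}
    # Both sibs A/A requires 4, 3 or 2 independent copies of A
    # when they share 0, 1 or 2 chromosomes IBD.
    likelihood = {0: p ** 4, 1: p ** 3, 2: p ** 2}
    joint = {k: prior[k] * likelihood[k] for k in prior}
    total = sum(joint.values())
    return {k: joint[k] / total for k in joint}


if __name__ == "__main__":
    for freq in (0.5, 0.1, 0.01):
        print(freq, sib_ibd_posterior_both_AA(freq))
```

The rarer allele A is, the more strongly two A/A siblings point to IBD = 2; for a common allele, sharing it says little about descent.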
Example of IBD scoring III Looking at multiple consecutive markers helps infer IBD Especially without parental genotypes IBD = 2 may be quite likely
Notation π - IBD sharing (0, ½ and 1) Z0 - probability π = 0 Z1 - probability π = ½ Z2 - probability π = 1
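When IBD is inferred only probabilistically, a common summary is the expected sharing proportion, which weights the sharing values 0, ½ and 1 by their probabilities. The sketch below uses z0, z1 and z2 for the three probabilities; the function name is illustrative.

```python
def expected_ibd_sharing(z0: float, z1: float, z2: float) -> float:
    """Expected proportion shared IBD: 0 * z0 + 0.5 * z1 + 1 * z2."""
    if abs(z0 + z1 + z2 - 1.0) > 1e-9:
        raise ValueError("IBD probabilities must sum to 1")
    return 0.5 * z1 + z2


if __name__ == "__main__":
    print(expected_ibd_sharing(0.25, 0.5, 0.25))  # sibling pair, no marker data
    print(expected_ibd_sharing(0.0, 0.0, 1.0))    # known to share both chromosomes
```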
Typical IBD information
Model
No Linkage
Linkage
Hypothesis Test evidence for linked genetic effect Fit two models Full model (Q,A,C,E) Restricted model (A,C,E) Maximum likelihood test Compare likelihoods using χ²
Analysis Estimate IBD sharing along chromosome For example, using Genehunter or Merlin Test hypothesis at each location Summarize results in linkage curve Test statistic follows a 50:50 mixture of chi-squared with 1 df and a point mass at zero
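The 50:50 mixture halves the usual 1 df tail probability. The sketch below computes a pointwise p-value from the likelihood-ratio statistic using only the standard library; it relies on the identity P(χ² with 1 df > x) = erfc(sqrt(x / 2)), and the function names are illustrative.

```python
import math


def chi2_1df_tail(x: float) -> float:
    """Upper tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def mixture_p_value(statistic: float) -> float:
    """P-value under a 50:50 mixture of chi-squared (1 df) and a point mass at 0."""
    if statistic <= 0.0:
        return 1.0  # every outcome is at least as extreme as zero
    return 0.5 * chi2_1df_tail(statistic)


if __name__ == "__main__":
    for stat in (0.0, 2.71, 3.84, 13.8):
        print(stat, mixture_p_value(stat))
```

With this mixture, a statistic of about 2.71 rather than 3.84 corresponds to a pointwise p-value of 0.05.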
Lod scores Often, report results as lod scores Genome is large, many locations tested Threshold for significance is usually LOD > ~3
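A lod score is the base-10 logarithm of the likelihood ratio, so it differs from the likelihood-ratio chi-squared only by a constant factor of 2 ln 10 (about 4.6). A small conversion sketch, with illustrative function names:

```python
import math

LN10 = math.log(10.0)


def lod_from_loglik(lnl_full: float, lnl_restricted: float) -> float:
    """Lod score from the natural-log likelihoods of the two models."""
    return (lnl_full - lnl_restricted) / LN10


def chi2_to_lod(chi2: float) -> float:
    """Convert a likelihood-ratio chi-squared statistic to a lod score."""
    return chi2 / (2.0 * LN10)


def lod_to_chi2(lod: float) -> float:
    """Convert a lod score back to the chi-squared scale."""
    return 2.0 * LN10 * lod


if __name__ == "__main__":
    print("chi-squared equivalent of LOD 3:", lod_to_chi2(3.0))
```

A lod score of 3 means the data are 1000 times more likely under the full model, which corresponds to a chi-squared of roughly 13.8.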
Sample Linkage Curve [Figure; axis label: LOD]