Calculation of IBD probabilities
David Evans
University of Oxford, Wellcome Trust Centre for Human Genetics



This Session …
- Identity by Descent (IBD) vs Identity by State (IBS)
- Why is IBD important?
- Calculating IBD probabilities
  - Lander-Green Algorithm (MERLIN)
    - Single locus probabilities
    - Hidden Markov Model => Multipoint IBD
  - Other ways of calculating IBD status
    - Elston-Stewart Algorithm
    - MCMC approaches
- MERLIN Practical Example
  - IBD determination
  - Information content mapping
  - SNPs vs micro-satellite markers?

Identity By Descent (IBD)
Two alleles are IBD if they are descended from the same ancestral allele.
[Figure panels: Identical by Descent; Identical by state only]

Example: IBD in Siblings
Consider a mating between mother AB x father CD. Number of alleles shared IBD for each pair of offspring genotypes:

Sib 1 \ Sib 2    AC   AD   BC   BD
AC                2    1    1    0
AD                1    2    0    1
BC                1    0    2    1
BD                0    1    1    2

IBD 0 : 1 : 2 = 25% : 50% : 25%
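The sibling table above can be reproduced by brute-force enumeration. The following is a minimal sketch, assuming (as on the slide) that all four parental alleles are distinct, so that sharing an allele label is the same as sharing it by descent; the function and variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Mother AB x father CD: each child receives one maternal and one paternal allele.
maternal, paternal = "AB", "CD"
offspring = [m + p for m, p in product(maternal, paternal)]  # AC, AD, BC, BD

def ibd_count(sib1, sib2):
    """Alleles shared IBD: compare the maternal and the paternal allele in turn.
    Valid here only because all four parental alleles are distinct."""
    return sum(a == b for a, b in zip(sib1, sib2))

# Full 4 x 4 table of IBD counts, as on the slide.
table = {(s1, s2): ibd_count(s1, s2) for s1 in offspring for s2 in offspring}

# Each of the 16 sib-pair combinations is equally likely.
counts = Counter(table.values())
distribution = {k: counts[k] / len(table) for k in (0, 1, 2)}
print(distribution)
```

The 16 equally likely combinations give the 25% : 50% : 25% split quoted on the slide.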

Why is IBD Sharing Important?
[Figure: pedigree annotated with marker genotypes]
- Affected relatives not only share disease alleles IBD, but also tend to share marker alleles close to the disease locus IBD more often than chance
- IBD sharing forms the basis of non-parametric linkage statistics

Crossing over between homologous chromosomes

Cosegregation => Linkage
[Figure: a parental genotype carrying alleles A1, A2 and Q1, Q2 gives rise to non-recombinant (parental) genotypes (many, 1 - θ) and recombinant genotypes (few, θ)]
Alleles close together on the same chromosome tend to stay together in meiosis; therefore they tend to be co-transmitted.

Segregating Chromosomes
[Figure labels: MARKER, DISEASE GENE]

Marker Shared Among Affecteds
Genotypes for a marker with alleles {1,2,3,4}
[Figure: pedigree with genotypes 1/2, 3/4, 4/4, 1/4, 2/4, 1/3, 3/4, 4/4, 1/4]

Linkage between QTL and marker
[Figure: marker and QTL, with pairs grouped as IBD 0, IBD 1, IBD 2]

NO Linkage between QTL and marker
[Figure: marker]

IBD can be trivial…
[Figure: pedigree with marker genotypes]
IBD=0

Two Other Simple Cases…
[Figure: two pedigrees with marker genotypes]
IBD=2

A little more complicated…
[Figure: pedigree with marker genotypes]
IBD=1 (50% chance)
IBD=2 (50% chance)

And even more complicated…
[Figure: sib pair, both with genotype 1/1]
IBD=?

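Bayes Theorem for IBD Probabilities

$$P(\mathrm{IBD}=i \mid G) = \frac{P(\mathrm{IBD}=i,\ G)}{P(G)} = \frac{P(\mathrm{IBD}=i)\,P(G \mid \mathrm{IBD}=i)}{P(G)} = \frac{P(\mathrm{IBD}=i)\,P(G \mid \mathrm{IBD}=i)}{\sum_{j=0,1,2} P(\mathrm{IBD}=j)\,P(G \mid \mathrm{IBD}=j)}$$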

P(Genotype | IBD State)
P(observing genotypes | k alleles IBD), with allele frequencies p1 and p2:

Sib 1    Sib 2    k=0            k=1          k=2
A1A1     A1A1     p1^4           p1^3         p1^2
A1A1     A1A2     2 p1^3 p2      p1^2 p2      0
A1A1     A2A2     p1^2 p2^2      0            0
A1A2     A1A1     2 p1^3 p2      p1^2 p2      0
A1A2     A1A2     4 p1^2 p2^2    p1 p2        2 p1 p2
A1A2     A2A2     2 p1 p2^3      p1 p2^2      0
A2A2     A1A1     p1^2 p2^2      0            0
A2A2     A1A2     2 p1 p2^3      p1 p2^2      0
A2A2     A2A2     p2^4           p2^3         p2^2

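Worked Example
[Figure: sib pair, both with genotype 1/1]

$$p_1 = 0.5$$
$$P(G \mid \mathrm{IBD}=0) = p_1^4 = \tfrac{1}{16}, \quad P(G \mid \mathrm{IBD}=1) = p_1^3 = \tfrac{1}{8}, \quad P(G \mid \mathrm{IBD}=2) = p_1^2 = \tfrac{1}{4}$$
$$P(G) = \tfrac{1}{4}p_1^4 + \tfrac{1}{2}p_1^3 + \tfrac{1}{4}p_1^2 = \tfrac{9}{64}$$
$$P(\mathrm{IBD}=0 \mid G) = \frac{\tfrac{1}{4}p_1^4}{P(G)} = \tfrac{1}{9}, \quad P(\mathrm{IBD}=1 \mid G) = \frac{\tfrac{1}{2}p_1^3}{P(G)} = \tfrac{4}{9}, \quad P(\mathrm{IBD}=2 \mid G) = \frac{\tfrac{1}{4}p_1^2}{P(G)} = \tfrac{4}{9}$$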
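The Bayes calculation for a sib pair without parental genotypes can be sketched in a few lines. This is an illustrative sketch, not MERLIN's implementation: it assumes Hardy-Weinberg proportions and the sib-pair prior 1/4 : 1/2 : 1/4, and the helper names (`genotype_likelihoods`, `ibd_posterior`) are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Prior IBD distribution for a sib pair.
PRIOR = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def genotype_likelihoods(g1, g2, freq):
    """P(G | IBD = k) for unordered genotypes g1, g2 (tuples of two alleles).
    Assumes Hardy-Weinberg proportions with allele frequencies `freq`."""
    def geno_prob(g):
        a, b = g
        return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]

    same = sorted(g1) == sorted(g2)
    # k = 1: one shared allele s, plus one independent allele in each sib.
    one_shared = sum(
        freq[s] * freq[x] * freq[y]
        for s, x, y in product(freq, repeat=3)
        if sorted((s, x)) == sorted(g1) and sorted((s, y)) == sorted(g2)
    )
    return {
        0: geno_prob(g1) * geno_prob(g2),     # two independent genotypes
        1: one_shared,
        2: geno_prob(g1) if same else 0,      # both sibs carry the same two alleles
    }

def ibd_posterior(g1, g2, freq):
    """Bayes theorem: posterior is prior times likelihood, normalised by P(G)."""
    lik = genotype_likelihoods(g1, g2, freq)
    p_g = sum(PRIOR[k] * lik[k] for k in PRIOR)
    return {k: PRIOR[k] * lik[k] / p_g for k in PRIOR}

# Worked example from the slide: both sibs 1/1, p1 = 0.5.
freq = {1: Fraction(1, 2), 2: Fraction(1, 2)}
posterior = ibd_posterior((1, 1), (1, 1), freq)
print(posterior)
```

Exact fractions are used so that the result can be compared directly with the hand calculation.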

For ANY PEDIGREE the inheritance pattern at any point in the genome can be completely described by a binary inheritance vector of length 2n:

$$v(x) = (p_1, m_1, p_2, m_2, \ldots, p_n, m_n)$$

whose coordinates describe the outcome of the paternal and maternal meioses giving rise to the n non-founders in the pedigree.
- $p_i$ ($m_i$) is 0 if the grandpaternal allele is transmitted
- $p_i$ ($m_i$) is 1 if the grandmaternal allele is transmitted
[Figure: pedigree with genotypes a/c, b/d, a/b, c/d]
v(x) = [0,0,1,1]
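For a sib pair (n = 2) the inheritance vector determines the IBD state directly: the sibs share a paternal allele when p1 = p2 and a maternal allele when m1 = m2. A minimal sketch, with an illustrative function name:

```python
from itertools import product

def sib_pair_ibd(v):
    """IBD count implied by an inheritance vector v = (p1, m1, p2, m2)."""
    p1, m1, p2, m2 = v
    return int(p1 == p2) + int(m1 == m2)

# Group all 2^(2n) = 16 vectors for n = 2 non-founders by the IBD state they imply.
vectors = list(product((0, 1), repeat=4))
by_ibd = {
    k: ["".join(map(str, v)) for v in vectors if sib_pair_ibd(v) == k]
    for k in (0, 1, 2)
}
print(by_ibd)
```

The slide's example vector [0,0,1,1] falls in the IBD = 0 group, and the group sizes 4 : 8 : 4 recover the 25% : 50% : 25% prior.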

Inheritance Vector
[Table: the 16 possible inheritance vectors, each with prior probability 1/16, and their posterior probabilities given the observed genotypes]
[Figure: pedigree with genotyped individuals and meioses labelled p1, m1, p2, m2]
In practice, it is not possible to determine the true inheritance vector at every point in the genome; rather, we represent partial information as a probability distribution over the possible inheritance vectors.

Computer Representation
- At each marker location ℓ
  - Define inheritance vector $v_\ell$
  - Meiotic outcomes specified in index bit
- Likelihood for each gene flow pattern
  - Conditional on observed genotypes at location ℓ
  - $2^{2n}$ elements!!!
[Figure: array of likelihoods L, one per inheritance vector]

Abecasis et al (2002) Nat Genet 30:97-101

Multipoint IBD
- IBD status cannot always be determined with certainty, e.g. because the mating is not informative or parental information is not available
- IBD information at uninformative loci can be made more precise by examining nearby linked loci

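Multipoint IBD
[Figure: sib-pair pedigree typed at two linked markers. Parents: a/c, 1/1 and b/d, 1/2. Sibs: a/b, 1/1 and c/d, 1/2]
First marker: IBD = 0
Second marker: IBD = 0 or IBD = 1?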

Complexity of the Problem in Larger Pedigrees
- For each person
  - 2n meioses in pedigree with n non-founders
  - Each meiosis has 2 possible outcomes
  - Therefore $2^{2n}$ possibilities for each locus
- For each genetic locus
  - One location for each of m genetic markers
  - Distinct, non-independent meiotic outcomes
- Up to $4^{nm}$ distinct outcomes!!!

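Example: Sib-pair Genotyped at 10 Markers
[Figure: grid of inheritance vectors (rows) against markers 1 … m = 10 (columns); one step along a path is labelled $P(G \mid 0000)(1-\theta)^4$]
$(2^{2n})^m = (2^{2 \times 2})^{10} \approx 10^{12}$ possible paths!!!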

P(IBD = 2) at Marker Three
[Figure: grid of inheritance vectors (rows) against markers 1 … m = 10 (columns), with the IBD state of each vector shown alongside]
P(IBD = 2) = (L[0000] + L[0101] + L[1010] + L[1111]) / L[ALL]

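P(IBD = 2) at an arbitrary position on the chromosome
[Figure: grid of inheritance vectors (rows) against markers 1 … m = 10 (columns), with an extra column at the position of interest]
P(IBD = 2) = (L[0000] + L[0101] + L[1010] + L[1111]) / L[ALL]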
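The ratio on these slides generalises to all three IBD states: sum the likelihoods of the inheritance vectors consistent with each state and divide by the total. The sketch below assumes a sib pair with vectors written as 4-character bit strings in the order p1, m1, p2, m2; the likelihood values are made up for illustration.

```python
from fractions import Fraction

def sib_pair_ibd_probabilities(L):
    """L maps inheritance vectors ('p1 m1 p2 m2' as a bit string) to likelihoods.
    Returns P(IBD = k) for k = 0, 1, 2 as (sum of matching L) / L[ALL]."""
    total = sum(L.values())
    probs = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
    for v, lik in L.items():
        p1, m1, p2, m2 = v
        k = int(p1 == p2) + int(m1 == m2)   # alleles shared IBD under vector v
        probs[k] += Fraction(lik) / total
    return probs

# Hypothetical likelihoods: every vector has L = 1, except the four IBD = 2
# vectors, which are given L = 2.
L = {format(i, "04b"): 1 for i in range(16)}
for v in ("0000", "0101", "1010", "1111"):
    L[v] = 2

print(sib_pair_ibd_probabilities(L))
```

With equal likelihoods for all 16 vectors the function returns the prior 1/4 : 1/2 : 1/4.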

Lander-Green Algorithm
- The inheritance vector at a locus is conditionally independent of the inheritance vectors at all preceding loci given the inheritance vector at the immediately preceding locus (“Hidden Markov chain”)
- The conditional probability of an inheritance vector $v_{i+1}$ at locus i+1, given the inheritance vector $v_i$ at locus i, is $\theta_i^{\,j}(1-\theta_i)^{2n-j}$, where $\theta_i$ is the recombination fraction and j is the number of changes in elements of the inheritance vector
Example: Locus 1 [0000], Locus 2 [0001]: conditional probability $= (1-\theta)^3\theta$
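The transition probability is a function of the number of positions at which two inheritance vectors differ. A minimal sketch, with vectors written as bit strings and an illustrative function name:

```python
import math
from itertools import product

def transition_probability(v_from, v_to, theta):
    """theta^j * (1 - theta)^(2n - j), where j counts the meioses whose outcome
    changes between the two loci and 2n is the vector length."""
    if len(v_from) != len(v_to):
        raise ValueError("inheritance vectors must have the same length")
    j = sum(a != b for a, b in zip(v_from, v_to))
    return theta ** j * (1 - theta) ** (len(v_from) - j)

theta = 0.1  # hypothetical recombination fraction
example = transition_probability("0000", "0001", theta)  # (1 - theta)^3 * theta

# Sanity check: from any vector, the probabilities of all 16 destinations sum to 1.
row_sum = sum(
    transition_probability("0000", "".join(v), theta)
    for v in product("01", repeat=4)
)
print(example, row_sum)
```

The row sum equals 1 because the terms are the binomial expansion of $(\theta + (1-\theta))^{2n}$.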

Lander-Green Algorithm
[Figure: grid of inheritance vectors (rows) against markers 1 … m = 10 (columns)]
$m(2^{2n})^2 = 10 \times 16^2 = 2560$ calculations
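The two counts quoted for the sib-pair example (n = 2 non-founders, m = 10 markers) can be checked with a few lines of arithmetic; the function names are illustrative.

```python
def path_count(n_nonfounders, n_markers):
    """Number of distinct paths through the grid: (2^(2n))^m."""
    return (2 ** (2 * n_nonfounders)) ** n_markers

def lander_green_operations(n_nonfounders, n_markers):
    """Approximate operation count for the Lander-Green recursion: m * (2^(2n))^2."""
    return n_markers * (2 ** (2 * n_nonfounders)) ** 2

paths = path_count(2, 10)                   # 16^10
operations = lander_green_operations(2, 10)  # 10 * 16^2
print(paths, operations)
```

Enumerating paths grows exponentially in the number of markers, whereas the recursion grows only linearly in it.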

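Total Likelihood $= \mathbf{1}' Q_1 T_1 Q_2 T_2 \cdots T_{m-1} Q_m \mathbf{1}$

$$Q_i = \begin{pmatrix} P(G \mid [0000]) & 0 & \cdots & 0 \\ 0 & P(G \mid [0001]) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P(G \mid [1111]) \end{pmatrix}$$

$2^{2n} \times 2^{2n}$ diagonal matrix of single locus probabilities at locus i

$$T_i = \begin{pmatrix} (1-\theta)^4 & (1-\theta)^3\theta & \cdots & \theta^4 \\ (1-\theta)^3\theta & (1-\theta)^4 & \cdots & (1-\theta)\theta^3 \\ \vdots & \vdots & \ddots & \vdots \\ \theta^4 & (1-\theta)\theta^3 & \cdots & (1-\theta)^4 \end{pmatrix}$$

$2^{2n} \times 2^{2n}$ matrix of transitional probabilities between locus i and locus i+1

$\sim m(2^{2n})^2$ operations = 2560 for this case!!!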
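The matrix product can be evaluated left to right as a row vector, which is what keeps the cost linear in the number of markers. The sketch below is a plain-Python illustration under stated assumptions, not MERLIN's code: the single-locus probabilities in `Q` are arbitrary made-up numbers, and the brute-force sum over all paths is included only to check the recursion on a small case.

```python
import math
from itertools import product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def transition_matrix(states, theta):
    """T[u][v] = theta^j (1 - theta)^(bits - j), j = number of differing bits."""
    bits = len(states[0])
    return [[theta ** hamming(u, v) * (1 - theta) ** (bits - hamming(u, v))
             for v in states] for u in states]

def lander_green_likelihood(Q, thetas):
    """Evaluate 1' Q_1 T_1 Q_2 ... T_{m-1} Q_m 1 from left to right.
    Q[i][s] = P(G_i | state s); thetas[i] links marker i and marker i + 1."""
    n_states = len(Q[0])
    bits = n_states.bit_length() - 1
    states = list(product((0, 1), repeat=bits))
    row = list(Q[0])                                   # 1' Q_1
    for q, theta in zip(Q[1:], thetas):
        T = transition_matrix(states, theta)
        row = [sum(row[u] * T[u][v] for u in range(n_states)) * q[v]
               for v in range(n_states)]               # (row T_i) Q_{i+1}
    return sum(row)                                    # final multiplication by 1

def brute_force_likelihood(Q, thetas):
    """Sum over every path of inheritance vectors (exponential in m)."""
    n_states = len(Q[0])
    bits = n_states.bit_length() - 1
    states = list(product((0, 1), repeat=bits))
    total = 0.0
    for path in product(range(n_states), repeat=len(Q)):
        term = Q[0][path[0]]
        for i in range(1, len(Q)):
            j = hamming(states[path[i - 1]], states[path[i]])
            term *= thetas[i - 1] ** j * (1 - thetas[i - 1]) ** (bits - j)
            term *= Q[i][path[i]]
        total += term
    return total

# Hypothetical inputs: sib pair (16 states), 3 markers, arbitrary probabilities.
Q = [[((7 * s + 3 * i) % 5) / 4 for s in range(16)] for i in range(3)]
thetas = [0.1, 0.05]
fast = lander_green_likelihood(Q, thetas)
slow = brute_force_likelihood(Q, thetas)
print(fast, slow)
```

Following the slide's formula, no prior over inheritance vectors is applied; a uniform prior would only rescale the result by a constant.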

Further speedups…
- Trees summarize redundant information
  - Portions of vector that are repeated
  - Portions of vector that are constant or zero
- Speeding up convolution
  - Use sparse-matrix by vector multiplication
  - Use symmetries in divide and conquer algorithm (Idury & Elston, 1997)

Lander-Green Algorithm Summary
- Factorize likelihood by marker
  - Complexity $\propto m \cdot e^{n}$
- Strengths
  - Large number of markers
  - Relatively small pedigrees

Elston-Stewart Algorithm
- Factorize likelihood by individual
  - Complexity $\propto n \cdot e^{m}$
- Small number of markers
- Large pedigrees
  - With little inbreeding
- VITESSE, FASTLINK etc.

Other methods
- Number of MCMC methods proposed
  - ~Linear on # markers
  - ~Linear on # people
- Hard to guarantee convergence on very large datasets
  - Many widely separated local minima
- E.g. SIMWALK

MERLIN: Multipoint Engine for Rapid Likelihood Inference

Capabilities
- Linkage Analysis
  - NPL and K&C LOD
  - Variance Components
- Haplotypes
  - Most likely
  - Sampling
  - All
- IBD and info content
- Error Detection
  - Most SNP typing errors are Mendelian consistent
- Recombination
  - No. of recombinants per family per interval can be controlled
- Simulation

MERLIN Website
- Reference
- FAQ
- Source
- Binaries
- Tutorial
  - Linkage
  - Haplotyping
  - Simulation
  - Error detection
  - IBD calculation

Input Files
- Pedigree File
  - Relationships
  - Genotype data
  - Phenotype data
- Data File
  - Describes contents of pedigree file
- Map File
  - Records location of genetic markers

Example Pedigree File
… x 3 3 x x
… x 4 4 x x
… x 1 2 x x
… x 4 3 x x
Encodes family relationships, marker and phenotype information

Data File Field Codes

Code    Description
M       Marker Genotype
A       Affection Status
T       Quantitative Trait
C       Covariate
Z       Zygosity
S[n]    Skip n columns

Example Data File
T some_trait_of_interest
M some_marker
M another_marker
Provides information necessary to decode pedigree file

Example Map File
CHROMOSOME MARKER POSITION
2 D2S… …
… D2S… …
…
Indicates location of individual markers, necessary to derive recombination fractions between them

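Worked Example
[Figure: sib pair, both with genotype 1/1]
$p_1 = 0.5$
$P(\mathrm{IBD}=0 \mid G) = \tfrac{1}{9}$
$P(\mathrm{IBD}=1 \mid G) = \tfrac{4}{9}$
$P(\mathrm{IBD}=2 \mid G) = \tfrac{4}{9}$
merlin -d example.dat -p example.ped -m example.map --ibd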

Application: Information Content Mapping
- Information content: provides a measure of how well a marker set approaches the goal of completely determining the inheritance outcome
- Based on concept of entropy
  - $E = -\sum_i P_i \log_2 P_i$, where $P_i$ is the probability of the ith outcome
- $I_E(x) = 1 - E(x)/E_0$
  - Always lies between 0 and 1
  - Does not depend on test for linkage
  - Scales linearly with power
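The entropy-based measure is straightforward to compute from a probability distribution over inheritance vectors. The slide does not define $E_0$; the sketch below assumes it is the entropy when all inheritance vectors are equally likely (no genotype information), and the function names are illustrative.

```python
import math

def entropy(probs):
    """E = -sum P_i log2 P_i, treating 0 * log2(0) as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(posterior):
    """I_E = 1 - E / E_0, with E_0 assumed to be the entropy of the uniform
    distribution over the same set of inheritance vectors."""
    e0 = math.log2(len(posterior))
    return 1 - entropy(posterior) / e0

no_information = information_content([1 / 16] * 16)        # nothing learned
fully_informative = information_content([1.0] + [0.0] * 15)  # vector known exactly
half_excluded = information_content([1 / 8] * 8 + [0.0] * 8)  # 8 of 16 vectors ruled out
print(no_information, fully_informative, half_excluded)
```

Ruling out half of the 16 vectors removes one of the four bits of uncertainty, giving an information content of 0.25.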

Application: Information Content Mapping
- Simulations
  - ABI (1 microsatellite per 10 cM)
  - deCODE (1 microsatellite per 3 cM)
  - Illumina (1 SNP per 0.5 cM)
  - Affymetrix (1 SNP per 0.2 cM)
- Which panel performs best in terms of extracting marker information?
merlin -d file.dat -p file.ped -m file.map --information

SNPs vs Microsatellites
[Figure: information content against position (cM) for four series: SNPs + parents, microsat + parents, SNP, microsat. Densities: 0.2 cM, 0.5 cM, 3 cM, 10 cM]