Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.

Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website

Roadmap Definition: haplotype and haplotype inference Why infer haplotypes Infer haplotypes from pedigree data –Most probable haplotype configurations –Haplotype configurations with minimum recombinations Infer haplotypes from population data –Combinatorial: Clark’s, Perfect Phylogeny –Statistical methods: EM, Bayesian (MCMC) Infer haplotypes from pooled samples Haplotype block partition Tag SNP selection

Genotype and Haplotype {1 2}...AATGCCGCAA...GTC......AATGCCGCAA...GTC......AGTGCCGCAA...TAC......AGTGCCGCAA...TAC... PaternalMaternal

Typical Genotype Data Two alleles for each individual –Chromosome origin for each allele is unknown Multiple haplotype pairs can fit observed genotype Molecular haplotyping is expensive Observation: AC GA TC Marker1 Marker2 Marker3 Possible haplotypes: AC GA TC AC GA CT AC AG TC AC AG CT

Haplotypes are important! Phase may determine phenotype Phase helps exploit linkage disequilibrium Infer state of neighboring alleles Phase clarifies identity-by-descent status

Common Uses of Haplotypes Linkage disequilibrium studies –Summarize genetic variation Selecting markers to genotype –Identify haplotype tag SNPs Candidate gene association studies –Test haplotype associations –Help interpret single marker associations Understanding evolution of human populations

The problem… Haplotypes are hard to measure directly –X-chromosome in males –Sperm typing –Other molecular techniques Often, statistical or combinatorial methods for reconstruction required

Haplotype Inference on population data m=6, m’=4 {1 2} {1 1} {1 2} {2 2} m 1|2 1|1 1|2 2|2 2|1 1|1 1|2 2|2 1|2 1|1 2|1 1|2 2|2 2|1 1|1 2|1 1|2 2|2 2|1 1|1 2|1 2|2 …… 2 m’

Information on Relatives Number of ambiguous individuals increases rapidly with number of markers Family information can help, but many ambiguities remain

Haplotype Inference on Pedigrees, Mendelian Law {1 2}{1 1} {1 2} {2 2} 2|11|1 2|12 {1 2} {1 *} {1 1}{1 2}

Haplotype inference on pooled samples The input contain n pools Each pool contains k individuals, thus 2k haplotypes and m markers At each marker, we are given the number of alleles for the k individuals for each pool The goal is to find the haplotype frequencies Example: n=3, k=2, m=5 24322 02312 12223

Haptotyping pedigree data: statistical formulation Statistical formulation: find the most probable haplotype configuration Need to calculate the probability of a pedigree on every haplotype configuration Recall for linkage analysis, we need to calculate the probability of a pedigree, that sums over all possible haplotype configs

Haptotyping pedigree data : statistical formulation Thus the linkage programs like Genehunter, Allegro, Merlin could compute the most probable haplotypes But, it is time consuming…. In addition to exact computation, there are some approximation algorithms, mainly based on important sampling, e.g. SimWalk. Still very time consuming, may consider many configurations with very small probabilities

Recombination and combinatorial formulation {1 2} {1 1} {1 2} 1|2 {1 2} 1|1 {1 2} 1|2 {1 2} 1|2 1|1 1|2 1|1 1|2 2|1 1|2

MRHC Problem Find a minimum recombinant haplotype configuration from a given pedigree with genotype data. Assumptions: Mendelian law (no mutations); Recombination events are rare. Well supported from real data. {1 2} … {1 2} {2 2} … {1 1} {1 2} {2 2}... {1 2}... {1 2} {2 2} … {1 1} {1 2} {2 2} … {1 2}... {1 1} {1 2} {2 2}... Input

MRHC Problem (cont’d) PS: parental source of the two alleles at the locus (i.e. phase) GS: grandparental source of an allele Haplotype configuration = assignment of PS and GS values. Output PS=0 PS=1 GS 2 =1 1|2 2|1 … 1|2 2|1 2|2 … 1|1 1|2 2|2... 1|2 2|1... 1|2 2|2 … 1|1 2|1 2|2 … 1|2 2|1... 1|1 1|2 2|2... A B GS 2 =1 GS 2 =0

Previous Results Genotype elimination (O’Connell’00). –For data requiring no recombinant, exhaustive elimination. Genetic algorithm (Tapadar et al.’00). –Time consuming. MRH (Qian & Beckmann’02). –Six step rule-based algorithm. –Locus by locus at every step, extremely slow for biallelic (e.g. SNP) markers.

Thm. MRHC is NP-Hard.  Idea: Reduction from a variant of set cover.  First complexity result.  Remains hard for two loci.  Remains hard when no loops. Li & Jiang’03, Doi, Li & Jiang’03

Block-Extension Algorithm Iterative, heuristic, five steps. Rules are derived from Mendelian law, MR principle, block concept and some greedy ideas based on the following observations: Block structures are common in haplotypes. Double recombination events are rare. Common haplotype blocks shared in siblings. … Advantages/Disadvantages Time complexity (BE: O(dmn) / MRH: O(2 d m 3 n 2 )) Li & Jiang’03

Block-Extension Algorithm 1 3 4 2 2 4 3 2 1 2 3 2 2 1 3 4 3 3 2 2 4 3 4 1 * 4 2 2 4 3 2 * 12 345 6 1 1 2 2 3 3 4 1 3 4 2 2 4 3 2 1 2 3 2 2 1 3 4 2 3 * 12 345 6 1 1 2 2 3 3 4 3 3 2 2 4 3 4 1 3 4 2 2 4 3 2 1 3 4 2 2 4 3 2 1 2 3 2 2 1 3 4 2 3 3 4 1 4 2 * 12 345 6 1 1 2 2 3 3 4 3 3 2 2 4 3 4 1 3 4 2 2 4 3 2

Block-Extension Algorithm 1|1 1|2 2 3 3 4 1|2(-1,0) 2|3(1,-1) 2|1(-1,-1) 3 4 1|3(-1,1) 2|4(1,-1) 2|4(-1,-1) 3|2(-1,-1) 1|3 3 2 2 4 3 4 1 3 4|2(1,-1) 2 4 2|3(1,-1) 2|3 3 4 1 4 2 * 12 345 6 1|1 1|2 2 3 3 4 1|2(-1,0) 2|3(1,-1) 2|1(-1,-1) 3 4 1|3(-1,1) 2|4(1,-1) 3|2(-1,-1) 1|3 3 2 2 4 3 4 3|1(1,0) 4|2(1,-1) 2|3(1,-1) 2|3 3 4 1 4 2 * 12 345 6

Dynamic Programming Algorithms Locus-based dynamic programming algorithm –Linear time in the number of the members –Applicable to only tree pedigrees Member-based dynamic programming algorithm –Linear time in the number of the loci –Applicable to general pedigrees with small sizes Doi, Li & Jiang’03

root Locus-Based Dynamic Programming 21 5 43 6 7 8

Constraint-Finding Algorithm Assumptions: –No missing alleles, no errors. –Zero recombinants. Idea: finding all feasible (i.e. 0-recombinant) haplotype configurations is equivalent to reducing the degree of freedom in PS/GS assignment. Li & Jiang’03

Four Levels of Constraints PS=0 GS 2 =1 1|2 2|1 … 1|2 2|1 2|2 … 1|1 1|2 2|2... 1|2 2|1... 1|2 2|2 … 1|1 2|1 2|2 … 1|2 2|1... 1|1 1|2 2|2... A B Based on 0-recombinant (for a pair of loci): Level 3: Haplotype constraint Level 4: Grouping constraint Based on Mendelian law (on single locus) : Level 1: GS constraint Level 2: PS constraint

Level 3 and Level 4 Constraints {1 2} {1 1} {1 2} 1 2 345 6 1 2 2 1 1 2 2 1 45 6 1 2 2 1 1 2 2 1 45 6 {1 2} 45 6

Level 3 and Level 4 Constraints The variables represent PS values and the equations are over Z 2

Analysis of Constraint-Finding Algorithm Thm. Every solution consistent with the constraint equations is a feasible solution and vice versa. Steps: –find all constraints, in the form of linear equations over Z 2 –solve the equations by Gaussian elimination –enumerate all feasible haplotype configurations Exact polynomial time (O(n 3 m 3 ); genotype elimination: exponential)

Integer Linear Programming Combines missing data imputation and haplotype inference. Regardless of the pedigree structure, number of recombinants, number of variables are linear of problem size. Implicitly checks the Mendelian consistency for pedigree genotype data with missing alleles, which is also an NPC problem. Could find all possible optimal solutions. Solved by a branch-and-bound algorithm. Effective for practical size problems in terms of time efficiency. Accurate in terms of missing alleles imputation and haplotype inference. Li & Jiang’04a

ILP for MRHC with Missing Data 1.Define variables. 2.Define linear constraints. 3.Define a linear objective function of the variables. 4.Preprocess constraints. 5.Apply branch-and-bound strategy to find solutions. (a partial order relationship and some other special relationships). 6.Estimate bounds. 7.Apply a maximum likelihood approach to multiple optimal solutions.

Formulation M j :={m k } set of all possible alleles at marker locus j and let t j = |M j |. M 1 = {1, 2}, M 2 = {1,2} {1 2} {1 1} {1 2} {1 0} {1 2} 1234 Individual 4: … Define t j f vars for each paternal allele and t j m vars for each maternal allele at locus j of individual i:

Formulation: Variables Define 2 g vars for each paternal allele and maternal allele at locus j for individual i Var g 1 = 0 (or 1) iff paternal allele is copied from father’s paternal (or maternal) allele. Var g 2 defined similarly. Define r vars:

Formulation: Objective Function Subject to Genotype constraints: Objective function:

Formulation: Constraints Mendelian law of inheritance constraints (a child i and its father f ): Constraints for r vars:

A Partial Order Relationship Denote: Inequalities with 2 variables: 1 2 3 4 5 8 6 9 11 10 7 1 2 3’ 8 9 11 10

Forced Variables Rule 1: Rule 2: Rule 3:

Lower and Upper Bounds Lower bounds –Linear relaxation. –Summation of the number of recombinants in each nuclear family. –Effective for data with large number of recombinants. Upper bound –Obtained by block-extension algorithm. –Effective for data with small number of recombinants.

Statistical Assessment E-M algorithm to estimate haplotype frequencies for data that consist of multiple pedigrees.

PedPhase software Simulated data were generated to compare our algorithms, as well as MRH in terms of efficiency, accuracy. Three different pedigree structures. Multiallelic and biallelic data. Numbers of loci: 10, 25 and 50. Number of recombinants: 0-4. 100 runs per data set.

Pedigree Structures

Accuracy Results of BE Algorithm

Efficiency Results

More Results from ILP

Real Data Analysis Data set (Gabriel et al.’02) 93 members, 12 pedigrees (each with 7-8 members); chromosome 3, 4 regions, each region 1-4 blocks.

Common Haplotypes & Frequencies

Results From ILP on the Whole Dataset 3.82 4.00 0.45 0.034

What if there are no relatives? Rely on linkage disequilibrium Assume that population consists of small number of distinct haplotypes Haplotypes tend to be similar

Clark’s Haplotyping Algorithm Clark (1990) Mol Biol Evol 7:111-122 One of the first haplotyping algorithms –Computationally efficient –Very fast Today, more accurate alternatives are often available

Clark’s Haplotyping Algorithm Find homozygous individuals –Initialize a list of known haplotypes Resolve ambiguous individuals –If possible, use two haplotypes from list –Otherwise, use one known haplotype and augment list If unphased individuals remain –Assign phase randomly to one individual –Augment haplotype list and continue from previous step

Haplotyping via Perfect Phylogeny - Model, Algorithms, Empirical studies Dan Gusfield, Ren Hua Chung U.C. Davis Cocoon 2003

00000 1 2 4 3 5 10100 10000 01011 00010 01010 12345 sites Ancestral haplotype Extant haplotypes at the leaves Site mutations on edges The Perfect Phylogeny Model of Haplotype Evolution

The Perfect Phylogeny Model We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed. In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.

Justification for Perfect Phylogeny Model In the absence of recombination each haplotype of any individual has a single parent, so tracing back the history of the haplotypes in a population gives a tree. Recent strong evidence for long regions of DNA with no recombination. Key to the NIH haplotype mapping project. (See NYT October 30, 2002) Mutations are rare at selected sites, so are assumed non-recurrent. Connection with coalescent models.

Perfect Phylogeny Haplotype (PPH) Given a set of genotypes S, find an explaining set of haplotypes that fits a perfect phylogeny. 12 a22 b02 c10 sites A haplotype pair explains a genotype if the merge of the haplotypes creates the genotype. Example: The merge of 0 1 and 1 0 explains 2 2. Genotype matrix S

The PPH Problem Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny 12 a22 b02 c10 12 a10 a01 b00 b01 c10 c10

The Haplotype Phylogeny Problem Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny 12 a22 b02 c10 12 a10 a01 b00 b01 c10 c10 1 c c a a b b 2 10 01 00

The Alternative Explanation 12 a22 b02 c10 12 a11 a00 b00 b01 c10 c10 No tree possible for this explanation

Efficient Solutions to the PPH problem - n genotypes, m sites Reduction to a graph realization problem (GPPH) - build on Bixby-Wagner or Fushishige solution to graph realization O(nm alpha(nm)) time. Reduction to graph realization - build on Tutte’s graph realization method O(nm^2) time. Direct, from scratch combinatorial approach - O(nm^2) Bafna et al. Berkeley (EHK) approach - specialize the Tutte solution to the PPH problem - O(nm^2) time.

The DPPH Method Bafna et al. O(nm^2) time Based on deeper combinatorial observations about the PPH problem. A matrix-centric approach (rather than tree-centric), although a graph is used in the algorithm. First, we need to understand why some sets of haplotypes have a perfect phylogeny, and some do not.

When does a set of haplotypes fit a perfect phylogeny? Arrange the haplotypes in a matrix, two haplotypes for each individual. Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all three pairs: 0,1 and 1,0 and 1,1 This is the 3-Gamete Test

The Alternative Explanation 12 a22 b02 c10 12 a11 a00 b00 b01 c10 c10 No tree possible for this explanation

12 a22 b02 c10 12 a10 a01 b00 b01 c10 c10 1 c c a a b b 2 0 0 1 0 The Tree Explanation Again

PPH: The Combinatorial Problem Input: A ternary matrix (0,1,2) M with 2N rows partitioned into N pairs of rows, where the two rows in each pair are identical. Def: If a pair of rows (r,r’) in the partition have entry values of 2 in a column j then positions (r,j) and (r’,j) are called Mates.

Output: A binary matrix M’ created from M by replacing each 2 in M with either 0 or 1, such that a)A position is assigned 0 if and only if its Mate is assigned 1. b) M’ passes the 3-Gamete Test, i.e., does not contain a 3x2 submatrix (after row and column permutations) with all three combinations 0,1; 1,0; and 1,1

Initial Observations If two columns of M contain the following rows 2 0 2 0 mates then M’ will contain a row with 1 0 and a row with 0 1 in those columns. This is a forced expansion.

Initial Observations Similarly, if two columns of M contain the mates 2 1 then M’ will contain a row with 1 1 and a row with 0 1 in those columns. This is a forced expansion.

If a forced expansion of two columns creates 0 1 in those columns, then any 2 2 1 0 2 2 in those columns must be set to be 0 1 1 0 We say that two columns are forced out-of-phase. If a forced expansion of two columns creates 1 1 in those columns, then any 2 2 2 2 in those columns must be set to be 1 0 We say that two columns are forced in-phase.

1 2 2 1 2 2 2 0 2 2 0 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 0 2 2 0 a a b b c c d e d e 1 2 3 Columns 1 and 2, and 1 and 3 are forced in-phase. Columns 2 and 3 are forced out-of-phase. Example: 10 11 10 00 1 3 a a e e 10 11 00 10 1 2 a a b b 00 01 10 00 2 3 b b e e

Immediate Failure It can happen that the forced expansion of cells creates a 3x2 submatrix that fails the 3-Gamete Test. In that case, there is no PPH solution for M. Example: 20 11 02 Will fail the 3-Gamete Test

An O(nm^2)-time Algorithm Find all the forced phase relationships by considering columns in pairs. Find all the inferred, invariant, phase relationships. Find a set of column pairs whose phase relationship can be arbitrarily set, so that all the remaining phase relationships can be inferred. Result: An implicit representation of all solutions to the PPH problem.

1 2 2 2 0 0 0 1 2 2 2 0 0 0 2 0 2 0 0 0 2 2 0 2 0 0 0 2 1 2 2 2 0 2 0 1 2 2 2 0 2 0 1 2 2 0 2 0 0 1 2 2 0 2 0 0 2 2 0 0 0 2 0 2 2 0 0 0 2 0 a a b b c c d e d e 1 2 3 4 5 6 7 A running example.

Overview of Bafna et al. algorithm First, represent the forced phase relationships, and the needed decisions, in a graph G.

1 7 2 5 3 4 6 Each node represents a column in M, and each edge indicates that the pair of columns has a row with 2’s in both columns. The algorithm builds this graph, and then checks whether any pair of nodes is forced in or out of phase. Graph G

1 7 2 5 3 4 6 Each Red edge indicates that the columns are forced in-phase. Each Blue edge indicates that the columns are forced out-of-phase. Let Gf be the subgraph of Gc defined by the red and blue edges. Graph Gc

1 7 2 5 3 4 6 Graph Gf has three connected components.

The Central Theorem There is a solution to the PPH problem for M if and only if there is a coloring of the dashed edges of Gc with the following property: For any triangle (i,j,k) in Gc, where there is one row containing 2’s in all three columns i,j and k (any triangle containing at least one dashed edge will be of this type), the coloring makes either 0 or 2 of the edges blue (out-of-phase). Nice, but how do we find such a coloring?

1 7 2 5 3 4 6 Theorem 1: If there are any dashed edges whose ends are in the same connected component of Gf, at least one edge is in a triangle where the other edges are not dashed, and in every PPH solution, it must be colored so that the triangle has an even number of Blue (out of Phase) edges. This is an “inferred” coloring. Graph Gf Triangle Rule

1 7 2 5 3 4 6

Corollary Inside any connected component of Gf, ALL the phase relationships on edges (columns of M) are uniquely determined, either as forced relationships based on pairwise column comparisons, or by triangle-based inferred colorings. Hence, the phase relationships of all the columns in a connected component of Gf are INVARIANT over all the solutions to the PPH problem.

The dashed edges in Gf can be ordered so that the inferred colorings can be done in linear time. Modification of DFS. See the paper for details, or assign it as a homework exercise.

Finishing the Solution Problem: A connected component C of G may contain several connected components of Gf, so any edge crossing two components of Gf will still be dashed. How should they be colored?

1 7 2 5 3 4 6 How should we color the remaining dashed edges in a connected component C of Gc?

Answer For a connected component C of G with k connected components of Gf, select any subset S of k-1 dashed edges in C, so that S together with the red and blue edges span all the nodes of C. Arbitrarily, color each edge in S either red or blue. Infer the color of any remaining dashed edges by successive use of the triangle rule.

1 7 2 5 3 4 6 Pick and color edges (2,5) and (3,7) The remaining dashed edges are colored by using the triangle rule.

1 7 2 5 3 4 6

Theorem 2 Any selected S works (allows the triangle rule to work) and any coloring of the edges in S determines the colors of any remaining dashed edges. Different colorings of S determine different colorings of the remaining dashed edges. Each different coloring of S determines a different solution to the PPH problem. All PPH solutions can be obtained in this way, i.e. using just one selected S set, but coloring it in all 2^(k-1) ways.

Comparing the programs - R.H. Chung All three are fast and practical (under one second) on problem instances of size 50 x 30. DPPH is the fastest, followed by HPPH and GPPH. HPPH encounters memory problems with large input.

30500.650.02060.0215 3001509.33.04.49 5002503611.521.5 2000100023316401866 sites individ GPPH DPPH HPPH times shown are in seconds on an 800 Mhz machine.

A Phase-Transition Problem, as the ratio of sites to genotypes changes, how does the probability that the PPH solution is unique change? For greatest utility, we want genotype data where the PPH solution is unique. Intuitively, as the ratio of genotypes to sites increases, the probability of uniqueness increases.

Extension With recombination The papers: See wwwcsif.cs.ucdavis.edu/~gusfield

The E-M Haplotyping Algorithm Excoffier and Slatkin (1995) Mol Biol Evol 12:921-927 Provide a clear outline of how the algorithm can be applied to genetic data Combination of two strategies –E-M statistical algorithm for missing data –Counting algorithm for allele frequencies

E-M Algorithm For Haplotyping 1. “Guesstimate” haplotype frequencies 2. Use current frequency estimates to replace ambiguous genotypes with fractional counts of phased genotypes 3. Estimate frequency of each haplotype by counting 4. Repeat steps 2 and 3 until frequencies are stable

E-M Algorithm for Haplotyping Cost grows rapidly with number of markers Typically appropriate for < 25 SNPs –Fewer microsatellites More accurate than Clark’s method Fully or partially phased individuals contribute most of the information

Enhancements to E-M List only haplotypes present in sample – Gradually expand subset of markers under consideration, eliminating haplotypes with low estimated frequency from consideration at each stage SNPHAP [Clayton (2001)] HAPLOTYPER [Qin et al. (2002)]

Divide-And-Conquer Approximation No. of potential haplotypes increases exponentially –Actual no. of haplotypes doesn’t Approximation –Successively divide marker set –Run E-M assuming segments associate randomly –Proceed, ignoring composites of segments with zero frequency Order: ~ m log m Exact E-M is order ~ 2m

Other Recent Developments … Newer methods try to further improve haplotype estimation by favoring sets of similar haplotypes Stephens et al. (2001) Am J Hum Genet 68:978-89 Genealogical approach, which implies haplotypes are similar to each other…

Method based on Gibbs sampler MCMC method –Stochastic, random procedure –Improves solution gradually Given initial set of haplotypes Sample haplotypes for one individual at a time, assuming other haplotypes are true Repeat a few million times…

Update Procedure I Pick individual U to update at random Calculate haplotype frequencies F in all other individuals –Since everyone is “phased”, this is done by counting Sample new haplotypes for U from conditional distribution of U’s haplotypes given F

Update Procedure I This procedure would produce an estimate of haplotype frequencies that equivalent to the E-M algorithm… Stephens et al (2001) suggested an alternative estimate of F…

Update Procedure II Estimate F from the other individuals Construct F* to include haplotypes in F and also other similar (possibly differing at a few sites, due to mutations) Update U’s haplotypes conditional on F*

Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.

Similar presentations

Presentation on theme: "Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.

Similar presentations

Presentation on theme: "Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website."— Presentation transcript:

Similar presentations

About project

Feedback