Presentation is loading. Please wait.

Presentation is loading. Please wait.

Efficient Haplotype Inference on Pedigrees and Applications

Similar presentations


Presentation on theme: "Efficient Haplotype Inference on Pedigrees and Applications"— Presentation transcript:

1 Efficient Haplotype Inference on Pedigrees and Applications
11/28/2018 Efficient Haplotype Inference on Pedigrees and Applications Previous talks either work on population data, or work directly from SNP alignment, my talk today will focus on the algorithms issues for pedigree data. Jing Li Tao Jiang Dept of Computer Science University of California, Riverside November 28, 2018

2 Outline Background The MRHC problem and complexity
11/28/2018 After briefly introduce some 1, we formally define the Min Rec Hap Cof. Prob, and show that it’s NP-hard. A heuristic algo. Called 3 is developed for such a prob. For a special case, that data require 0 recob., we give an exact alog. Called const-finding. At last I will show some experimental results and discuss some future work. Outline Background The MRHC problem and complexity A heuristic algorithm (block-extension) An exact algorithm for 0-recombinant data Two dynamic programming algorithms Integer linear programming formulation and solution for MRHC with missing alleles Experimental results and application in disease gene association mapping November 28, 2018

3 Terms Diploid Polymorphisms, marker, allele, and SNP
11/28/2018 From previous talks, most of us already know some terms related to haplotype reconsturction, such From a math pt of view, a genotype is an unordered pair at each marker locus on 2 chromosomes, while a haplotype is a vector on one chromosome. I believe most of you know the terminologies, not rigorous definitions, but from a math pt view. One of the central problems in genetics is to understand inheritance. The hereditary information of an organism is organized into a number of {\em chromosomes}, which mainly consist of {\em DNAs}. DNA variants, important for association study, short, highly linked, distinct configurations is often far lower than one would expect from independent sampling, haplotypes, might be a more biologically meaningful unit, frequency SNP is one of the DNA polymorphisms in population, property With the finishing of the draft of human genome DNA sequence, more and more researchers are interested in the difference of DNA sequences among different persons (DNA polymorphisms in population). The hope is that by examining this kind of difference, we could understand the genetic reasons for many complex diseases like cancer, heart disease and diabetes etc., and find a way to treat them. As we know that humans are diploid, which means each chromosome has two copies, one is from father and the other is from mother. These two copies are not exactly identical in terms of DNA sequences. markers are landmarks along chromosomes that can label the positions of a chromosome. Markers are not genes, but for simplicity, we can think of markers are part of genes. Marker states are called alleles. Single Nucleotide Polymorphism (SNP) is one kind of markers that at each position, only one nucleotide. The two alleles at one position is called genotypes and all alleles from one copy of chromosome is called haplotype. The haplotype information is what we need. But currently the techniques collect the genotypes without specifying which allele is from which parent. Pedigree is an extended family and can be used to assist the processing of haplotyping. Terms Diploid Polymorphisms, marker, allele, and SNP Genotype, homozygous & heterozygous Haplotype, paternal & maternal haplotypes Genotype A G C G 1 1 Haplotype 2 1 Biallelic Multiallelic 3 6 2 3 Maker locus Paternal Maternal November 28, 2018

4 Mendelian law of inheritance and recombination
11/28/2018 Mendelian law of inheritance and recombination Two more useful concepts in our problem setting are title. When consider one locus, each allele in a child is from one parent. In other words, if parents have such genotypes, child can only have one of the 4 possible genotype. When consider 2 loci, if the mother has haplotypes AC and BD, child may inherit one of the 4 possible haplotypes from the mother. The haplotypes AD and BC are products of crossover from the two original haplotypes of the mother. Such event is called recomb event. One of the central problems in genetics is to understand inheritance. If only consider one gene, people already know that the two alleles, each must from one parent. In another word, if the parents have such genotypes, there are total 4 possible genotypes in their children. But when considering two genes, for each parent, there are 4 possible haplotypes that can occur in a child, so there are 16 possible combinations from both parents. But the experimental results show that not all the possible combinations are equally distributed. For a long segment of chromosome, only a small number of common haplotypes occur in a whole population. If the haplotype in a child is an exact copy of a parent, which means the two chromosomes in the parent do not involve a crossover; otherwise there are some crossover occur, which is called recombination event. Father Mother Parent A B C D A C B D Child: C1 C2 C3 C4 A C B D A D B C A C A D B C B D November 28, 2018

5 Pedigree Pedigree, nuclear family, founder November 28, 2018
11/28/2018 Pedigree To study inheritance pattern in humans, {\em Pedigree analysis} is a very important method, because controlled crosses cannot be made. On the left, is a traditional illustration of a pedigree. Square stands for male and circle stand for female. Children are put under their parents. On the right, we put individual ID and their genotypes under each individual. Pedigree, nuclear family, founder November 28, 2018

6 Pedigree Pedigree, nuclear family, founder Mother Father ID Num
11/28/2018 Pedigree Founders, loop, family trio, nuclear family. In our formal definition of pedigree graph, we have a mating node that connects parents with their children. To study inheritance in humans, {\em Pedigree analysis} is a very important method in human genetics. The gender, marriage relation, parent-offspring relation and disease status can be represented by symbols in a pedigree graph. For example in Figure~\ref{fig:f15}, a square stands for a male, a circle stands for a female, the numbers below the squares or circles are the identification number of the corresponding individuals, a horizontal line connecting two opposite gender persons stands for marriage, and a vertical line from the marriage line to a member stands for a parent-offspring relation. %We call the members without parents information {\em The founders} of a pedigree are members (like members 3-1, 3-2, 3-4, 3-6 and 3-8 in Figure~\ref{fig:f15}) that do not have parents in the pedigree. {\em A nuclear family} in a pedigree is only comprised of parents and their children. (for example, the top nuclear family in Figure~\ref{fig:f15} Problem: given pedigree and marker genotype info, find haplotype configuration Standard method to use the pedigree structure, Pedigree, nuclear family, founder Father Mother ID Num Mating node Founders Genotypes Family trio loop Children Nuclear family November 28, 2018

7 Haplotyping from genotypes: The problem & methods
11/28/2018 Haplotyping from genotypes: The problem & methods Given a pedigree and the genotype data of each member in the pedigree, how to reconstruct the haplotypes. Given the genotype information of each individual with possible missing data, how to reconstruct the haplotypes. Two kinds of input data: population data and pedigree data. We are interested in pedigree data. And from now on we focus on pedigree data. Before we formally define the problem, let us take a look at some pedigree structures and biological significance of the problem. Two kinds of methods: statistical : find the most likely haplotypes based on genotype data. rule-based. Define some rules based on biological assumptions and find those haplotypes by the guide of those rules. Although statistical methods have solid theoretical bases, they are very computation intensive. Rule-based methods are usually very fast and with some reasonable biological assumptions, they are very powerful. Our methods are rule-based, for pedigree data and not restrict to SNP marker. Problem: Input: genotype data (possibly with missing alleles). Output: haplotypes. Input data: Data with pedigree (dependent individuals). Data without pedigree info (independent individuals). Statistical methods Find the most likely haplotypes based on genotype data. Adv: solid theoretical bases Disadv: computation intensive Rule-based methods Define rules based on some plausible assumptions and find those haplotypes consistent with these rules. Adv: usually simple thus very fast Disadv: no numerical assessment of the reliability of the results November 28, 2018

8 11/28/2018 As we all know the haplotype reconstruction currently is a very important problem One of the central problems in genetics is to understand inheritance. From a hereditary point of view, haplotype is more biologically meaningful than genotype since children inherit haplotypes instead of genotypes from each parent. Haplotype data are more informative and more valuable. With the dense SNP haplotype map is being built, various new methods have been proposed to use haplotype information for gene mapping. Some existing statistical methods have also shown increased power by incorporating SNP haplotype information. However, due to the fact that humans are diploid and current widely used polymorphism screen techniques collect genotype data rather than haplotype data because of low costs. Computational methods deriving haplotype from genotype are highly demanded. While the dense SNP haplotype map is being built, various new methods~\cite{Lam2000, Liu2001, Service1999, Toivonen2000} have been proposed to use haplotype information for linkage disequilibrium mapping. Some existing statistical methods have also shown increased power by incorporating SNP haplotype information~\cite{Seltman2001,Zhang2002}. Motivations Haplotype is more biologically meaningful than genotype since each haplotype of a child is inherited from one parent. Haplotype data are more informative and more valuable in determining the association between diseases and genes and in study of human histories. The human genome project gave us the consensus genotype sequence of humans, but in order to understand the genetic effects on many complex diseases such as cancers, diabetes, osteoporoses, the genetic variations are more important, which can be represented by haplotypes. Current techniques collect genotype data. Computational methods deriving haplotypes from genotypes are highly demanded. The ongoing international HapMap project. November 28, 2018

9 11/28/2018 Motivations (cont’d) We are interested in rule based haplotyping algorithm for pedigree data, because, It’s generally believed that with parents/pedigree information, we could get more accurate haplotype and frequency estimations than from data w/o such information. Family-based association studies have been widely used. We would expect more family-based gene mapping methods that assume accurate haplotype information. Not only computation intensive, model-based statistical methods may use assumptions that may not hold in real datasets. November 28, 2018

10 MRHC Problem Assumptions:
11/28/2018 MRHC Problem The Min-Rec-Hap-Conf. is that given a pedigree and genotypes for each member, figure out a haplotype configuration with min recombinant. Here the assumption we used are 1,2. The problem is not as easy as it first sounds, especially for biallelic data like SNPs, because of the ambiguous case like this. (1 2) (2 2) (1 1) ... Find a minimum recombinant haplotype configuration from a given pedigree with genotype data. Assumptions: Mendelian law (no mutations); Recombination events are rare. Input November 28, 2018

11 11/28/2018 We observe that haplotype configuration can be described using PS/GS. The output of the example like this, on the graph, we use the bar to indicate that left allele is from father and right allele is from mother. This info can be encoded using a binary variable PS. In order to count the number of recombinants, we further use a binary variable GS to indicate an allele is from parent’s paternal or maternal haplotype (like the first allele 1 in the mother is from grandfather’s paternal). If two alleles on a haplotype at adjacent loci have different GS values, there must be a recombination event between these two loci. MRHC Problem PS: parental source of the two alleles at the locus (i.e. phase) GS: grandparental source of an allele Haplotype configuration = assignment of PS and GS values. 1|2 2|1 2|2 1|1 ... PS = GS(1) = 0 Output November 28, 2018

12 Previous Results Genotype elimination (O’Connell & Weeks’99).
11/28/2018 Previously, there are some heuristic algorithms like genotype elimination & Genetic algorithm. More recently, Qian & Beckmann proposed a Six step rule-based algorithm that works Locus by locus in each nuclear family. Although the authors claim that it’s faster that the previous ones, experimental results show it extremely slow for biallelic (e.g. SNP) markers. Our block extension algorithm is similar with the MRH, but by using the block concept, we achieve a very high speed. Previous Results Genotype elimination (O’Connell & Weeks’99). Can only find haplotype configurations requiring no recombinant in the pedigree, exhaustive elimination. Genetic algorithm (Tapadar et al.’00). Still time consuming, needs many iterations before convergence. MRH (Qian & Beckmann’02). Six step rule-based algorithm. Locus by locus at every step, extremely slow for biallelic (e.g. SNP) markers. November 28, 2018

13 Thm. MRHC is NP-complete.
11/28/2018 Thm. MRHC is NP-complete. We prove that in general the MRHC problem is NP-hard. We use a Reduction from a variant of set cover problem. This is the First complexity result concerning the problem. It Remains hard when there are only two loci. On the right is a gadget pedigree we used in our proof. Idea: Reduction from a variant of set cover. First complexity result concerning the problem. Remains hard when there are only two loci. Remains hard when no loops in a pedigree. November 28, 2018

14 Block-Extension Algorithm
11/28/2018 Block-Extension Algorithm Add a few words, fast Although it is in general NP-hard, it does not imply there is no efficient algorithm for real datasets. Our first algorithm is an Iterative, heuristic alg. That has five steps. …., Haplotype block structure is common in human genomes (Daly et al, Gabriel et al.) Double recombination events are rare within a small region. Shared haplotype blocks in siblings are strong evidence that no recombination should occur on those siblings. It is possible to determine that some individuals must involve recombination. Resolve these members at last. Step 2: two algorithms Block-extend algorithm Nuclear family dynamic programming algorithm Iterative, heuristic, five steps. Rules are derived from Mendelian law, MR principle, block concept and some greedy ideas based on the following observations: Block structures are common in haplotypes. Double recombination events are rare. Common haplotype blocks shared in siblings. November 28, 2018

15 Steps in BE algorithm Missing genotype data imputation by the Mendelian Law of inheritance PS and GS assignment by Mendelian Law Locus by locus, member by member, in a top-down scan Greedy assignment of PS/GS Bottom-up, find a child with known GS values and try to set the GS values of adjacent alleles. Sandwich rule. Block-Extension Iteratively extend the longest block to the same region of other members. Finishing the gaps between blocks by enumeration. November 28, 2018

16 Analysis of the Block-Extension Algorithm
11/28/2018 Analysis of the Block-Extension Algorithm As we can see the algorithm works in a way that try to locally optimize the assignment. It has the potential that more efficient the MRH because it works on the blocks instead of locus by locus. Its accuracy could decrease with the increase of the number of recombination events since some rules are based on the assumption that recombination are rare. Adv: Simple and efficient. Accurate when the number of recombination events is small. Disadv: Potential errors in steps 3 and 4. Accuracy could decrease with the increase of the number of recombination events. November 28, 2018

17 An Exact Algorithm for 0-Recombinant Data Based on Constraint-Finding
11/28/2018 The input data could be interpreted with 0 recombinants. As a heuristic, the block-extension algorithm does not always compute an optimal solution for MRHC, especially when the numberof required recombination events in the pedigree increases. One ofour main objectives is to find an algorithm that runs fast enoughon most real data and always gives an optimal haplotypeassignment, even if its running time is exponential in the worstcase. Our first attempt is for an efficient (polynomial time)algorithm to enumerate all possible haplotype assignments involving no recombinations. The algorithm may be useful for solving 0-recombinant data (i.e. data that can be interpreted with 0 recombinants) or serving as a subroutine in a general algorithm. We assume that the input data has no missing genotypes An Exact Algorithm for 0-Recombinant Data Based on Constraint-Finding Assumptions: Zero recombinants. No missing alleles, no errors. Idea: finding all feasible (i.e. 0-recombinant) haplotype configurations is equivalent to reducing the degree of freedom in PS/GS assignment. Steps: find all constraints, in the form of linear equations over Z2 solve the equations by Gaussian elimination enumerate all feasible haplotype configurations November 28, 2018

18 Four Levels of Constraints
11/28/2018 Alg. Process this constraints level by level, either fix, or reduce, concern Four Levels of Constraints Based on Mendelian law (on single locus) : Level 1: GS constraint Level 2: PS constraint Based on 0-recombinant assumption (for a pair of loci): Level 3: Haplotype constraint Level 4: Grouping constraint November 28, 2018

19 Level 3 and Level 4 Constraints
11/28/2018 Level 3 and Level 4 Constraints force 4 5 1 2 1 2 1 2 1 2 1 2 6 1 2 3 4 5 1 2 1 2 1 2 4 5 4 5 6 1 2 1 2 1 2 2 1 1 2 2 1 1 1 6 6 2 1 1 2 2 1 November 28, 2018

20 Level 3 and level 4 constraints
11/28/2018 Level 3 and level 4 constraints Mention meaning of x1 Note: The variables represent PS values and the equations are over Z2 November 28, 2018

21 Constraint-Finding Algorithm
Thm. Every solution consistent with the constraint equations is a feasible solution and vice versa. We can adapt the classical Gaussian elimination algorithm to find all consistent solutions in O(n3m3) time. Previously, only an exponential time algorithm is known. The algorithm is useful for solving 0-recombinant data and may serve as a subroutine in a general haplotyping algorithm. November 28, 2018

22 More Exact Algorithms Locus-based dynamic programming algorithm
Linear time in the number of the members Applicable to only tree pedigrees Member-based dynamic programming algorithm Linear time in the number of the loci Applicable to general pedigrees with small size Integer linear programming (ILP) with branch-and-bound Combines missing data imputation and haplotype inference together; implicitly checks the Mendelian consistency for pedigree genotype data with missing alleles, which is also an NPC problem. Effective for practical size problems, regardless of the pedigree structure November 28, 2018

23 ILP for MRHC with Missing Data
Alleles are represented as binary variables. Genotype info and the Mendelian law of inheritance are enforced by linear constraints. The objective of minimizing the total number of recombinants is encoded as a linear function of the variables. Effective preprocessing of constraints by taking advantage of special properties in our ILP formulation to reduce the number of variables. Branch-and-bound strategy to find solutions. The branch step guided by a partial order relationship (and some other special relationships) identified during the preprocessing step. Non-trivial bounds are estimated to prune the search tree. A maximum likelihood approach is used to select the best haplotype configuration from multiple optimal solutions. November 28, 2018

24 Formulation: variables
11/28/2018 Formulation: variables Explain the meaning of variables: g1=0(1) iff paternal allele copied from father paternal (maternal) allele. g2=0(1) iff paternal allele copied from matherpaternal (maternal) allele. Number of total loci: m. r2 is similarly defined. Possible alleles (total tj) at marker locus j: Define 2tj (f and m) vars and 2 g vars for each paternal allele and maternal allele at locus j for individual i Var fk (or mk)=1 iff the allele is mk. Var g1 = 0 (or 1) iff paternal allele is copied from father’s paternal (or maternal) allele. Var g2 defined similarly. Define r vars: November 28, 2018

25 Formulation Objective function: Subject to
Genotype constraints: (0 means missing allele) November 28, 2018

26 11/28/2018 Formulation Mendelian law of inheritance constraints (for a child i and its father f ): Constraints for r vars: November 28, 2018

27 ILP Solved by branch & bound Practical in terms of time efficiency
Could find all possible optimal solutions Very effective in terms of missing alleles imputation. November 28, 2018

28 11/28/2018 Mention MRH Simulation Studies Algorithms have been implemented in a program called PedPhase in C++. Simulated data were generated to compare our algorithms, as well as MRH. Three different pedigree structures. Multiallelic and biallelic data. Numbers of loci: 10, 25 and 50. Number of recombinants: 0-4. 100 runs per data set. November 28, 2018

29 Pedigree structures November 28, 2018

30 Accuracy Results of BE algorithm
November 28, 2018

31 Efficiency Results November 28, 2018

32 More results from ILP November 28, 2018

33 Real data analysis Data set (Gabriel et al.’02)
11/28/2018 Real data analysis Gabriel et al Data set (Gabriel et al.’02) 93 members, 12 pedigrees (each with 7-8 members); chromosome 3, 4 regions, each region 1-4 blocks. November 28, 2018

34 11/28/2018 Reconstruction of common haplotypes and estimation of their frequencies 5%, November 28, 2018

35 Results from ILP on the whole dataset
November 28, 2018

36 An application of haplotype inference
We have developed a new haplotype association mapping method based on density-based clustering for case-control data. The method regards haplotype segments as data points in a high dimensional space, and defines a new pairwise haplotype distance measure. Clusters are then identified by a density-based clustering algorithm. Z-scores based on the number of cases and controls in a cluster can be used as an indicator of the degree of association between a cluster and the disease under study. Results are very promising. But it needs haplotypes as input. November 28, 2018

37 An application of haplotype inference
Haplotypes are inferred by computational methods that we mentioned earlier. For example: a real data set that we analyzed consists of 385 nuclear families of size 4 (2 parents with 2 affected children). We do haplotype inference first using our ILP algorithm. The haplotypes transmitted to (affected) children are treated as cases and un-transmitted haplotypes as controls. The haplotype association method was applied then. November 28, 2018

38 Summary Li, J. and T. Jiang. Efficient Rule-Based Haplotyping Algorithm for Pedigree Data. RECOMB’03 NP-completeness proof for general pedigrees. An efficient heuristic algorithm: block-extension. An efficient exact algorithm for 0-recombinant data. Doi, K., J. Li and T. Jiang. Minimum Recombinant Haplotype Configuration on Tree Pedigrees. WABI’03 NP-completeness proof for loopless pedigrees. Two dynamic programming algorithms Li, J. and T. Jiang. An Exact Solution for Finding Minimum Recombinant Haplotype Configurations on Pedigrees with Missing Data by Integer Linear Programming. RECOMB’04, to appear. Li, J. and T. Jiang. Haplotype Association Mapping by Density-Based Clustering in Case-Control Studies. RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotypes, CMU, 2004. November 28, 2018

39 11/28/2018 Report all minimum recombinant configurations and then do statistical analysis to select the best haplotype configuration. Methods for population samples Methods of whole genome association study using haplotype information Future Work Comparison between our algorithms and statistical algorithms on the accuracy of inferred haplotypes. Incorporating the likelihood of recombination into the objective function of the ILP formulation. Haplotype inference without pedigree information. November 28, 2018

40 Acknowledgements Whitehead/MIT Center for Genome Research
11/28/2018 Qian Acknowledgements Dr. Dajun Qian from City of Hope. Whitehead/MIT Center for Genome Research Drs. David Altshuler, Mark Daly, Stacey Gabriel, and Stephen Schaffner, and their entire group. Financial support from NSF and CMOST 973 project. November 28, 2018


Download ppt "Efficient Haplotype Inference on Pedigrees and Applications"

Similar presentations


Ads by Google