Half-Sibling Reconstruction A Theoretical Analysis Saad Sheikh Department of Computer Science University of Illinois at Chicago Brothers! ? ?

Slides:



Advertisements
Similar presentations
Minimum Clique Partition Problem with Constrained Weight for Interval Graphs Jianping Li Department of Mathematics Yunnan University Jointed by M.X. Chen.
Advertisements

Lecture 24 MAS 714 Hartmut Klauck
Combinatorial Algorithms for Haplotype Inference Pure Parsimony Dan Gusfield.
Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.
. Exact Inference in Bayesian Networks Lecture 9.
1 EE5900 Advanced Embedded System For Smart Infrastructure Static Scheduling.
METHODS FOR HAPLOTYPE RECONSTRUCTION
Wavelength Assignment in Optical Network Design Team 6: Lisa Zhang (Mentor) Brendan Farrell, Yi Huang, Mark Iwen, Ting Wang, Jintong Zheng Progress Report.
Polynomial Time Approximation Schemes Presented By: Leonid Barenboim Roee Weisbert.
Probabilistically Checkable Proofs (and inapproximability) Irit Dinur, Weizmann open day, May 1 st 2009.
Effective Heuristics for NP-Hard Problems Arising in Molecular Biology Richard M. Karp Bangalore, January 5, 2011.
Tuesday, May 14 Genetic Algorithms Handouts: Lecture Notes Question: when should there be an additional review session?
Amplicon-Based Quasipecies Assembly Using Next Generation Sequencing Nick Mancuso Bassam Tork Computer Science Department Georgia State University.
Introduction to PCP and Hardness of Approximation Dana Moshkovitz Princeton University and The Institute for Advanced Study 1.
PCPs and Inapproximability Introduction. My T. Thai 2 Why Approximation Algorithms  Problems that we cannot find an optimal solution.
Approximate Counting via Correlation Decay Pinyan Lu Microsoft Research.
Complexity 16-1 Complexity Andrei Bulatov Non-Approximability.
Introduction to Approximation Algorithms Lecture 12: Mar 1.
Computational problems, algorithms, runtime, hardness
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
DATA ANALYSIS Module Code: CA660 Lecture Block 2.
IGES 2003 How many markers are necessary to infer correct familial relationships in follow-up studies? Silvano Presciuttini 1,3, Chiara Toni 2, Fabio Marroni.
NP-Complete Problems Reading Material: Chapter 10 Sections 1, 2, 3, and 4 only.
Haplotyping via Perfect Phylogeny Conceptual Framework and Efficient (almost linear-time) Solutions Dan Gusfield U.C. Davis RECOMB 02, April 2002.
1 Combinatorial Dominance Analysis Keywords: Combinatorial Optimization (CO) Approximation Algorithms (AA) Approximation Ratio (a.r) Combinatorial Dominance.
1 CSE 417: Algorithms and Computational Complexity Winter 2001 Lecture 25 Instructor: Paul Beame.
Packing Element-Disjoint Steiner Trees Mohammad R. Salavatipour Department of Computing Science University of Alberta Joint with Joseph Cheriyan Department.
CSE 589 Applied Algorithms Spring Colorability Branch and Bound.
Reconstructing Kinship Relationships in Wild Populations I do not believe that the accident of birth makes people sisters and brothers. It makes them siblings.
Combinatorial Reconstruction of Sibling Relationships in Absence of Parental Data Tanya Y Berger-Wolf (DIMACS and UIC CS) Bhaskar DasGupta (UIC CS) Wanpracha.
Approximation Algorithms Department of Mathematics and Computer Science Drexel University.
Nattee Niparnan. Easy & Hard Problem What is “difficulty” of problem? Difficult for computer scientist to derive algorithm for the problem? Difficult.
Genetic Mapping Oregon Wolfe Barley Map (Szucs et al., The Plant Genome 2, )
Approximating Minimum Bounded Degree Spanning Tree (MBDST) Mohit Singh and Lap Chi Lau “Approximating Minimum Bounded DegreeApproximating Minimum Bounded.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
RESOURCES, TRADE-OFFS, AND LIMITATIONS Group 5 8/27/2014.
National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Computational Molecular Biology Non-unique Probe Selection via Group Testing.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
1 Approximate Algorithms (chap. 35) Motivation: –Many problems are NP-complete, so unlikely find efficient algorithms –Three ways to get around: If input.
National Taiwan University Department of Computer Science and Information Engineering Pattern Identification in a Haplotype Block * Kun-Mao Chao Department.
Experimental Design and Data Structure Supplement to Lecture 8 Fall
Tutorial #10 by Ma’ayan Fishelson. Classical Method of Linkage Analysis The classical method was parametric linkage analysis  the Lod-score method. This.
On Approximating Four Covering/Packing Problems Bhaskar DasGupta, Computer Science, UIC Mary Ashley, Biological Sciences, UIC Tanya Berger-Wolf, Computer.
1 B-b B-B B-b b-b Lecture 2 - Segregation Analysis 1/15/04 Biomath 207B / Biostat 237 / HG 207B.
NP-COMPLETE PROBLEMS. Admin  Two more assignments…  No office hours on tomorrow.
NP-Complete problems.
Computational Molecular Biology Non-unique Probe Selection via Group Testing.
Errors in Genetic Data Gonçalo Abecasis. Errors in Genetic Data Pedigree Errors Genotyping Errors Phenotyping Errors.
Practical With Merlin Gonçalo Abecasis. MERLIN Website Reference FAQ Source.
CPS Computational problems, algorithms, runtime, hardness (a ridiculously brief introduction to theoretical computer science) Vincent Conitzer.
Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.
Algorithms for hard problems Introduction Juris Viksna, 2015.
Approximating Buy-at-Bulk and Shallow-Light k-Steiner Trees Mohammad T. Hajiaghayi (CMU) Guy Kortsarz (Rutgers) Mohammad R. Salavatipour (U. Alberta) Presented.
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
Approximation Algorithms by bounding the OPT Instructor Neelima Gupta
The NP class. NP-completeness Lecture2. The NP-class The NP class is a class that contains all the problems that can be decided by a Non-Deterministic.
National Taiwan University Department of Computer Science and Information Engineering An Approximation Algorithm for Haplotype Inference by Maximum Parsimony.
International Workshop on Bioinformatics Research and Applications, May 2005 Phasing and Missing data recovery in Family Trios D. Brinza J. He W. Mao A.
Lecture 17: Model-Free Linkage Analysis Date: 10/17/02  IBD and IBS  IBD and linkage  Fully Informative Sib Pair Analysis  Sib Pair Analysis with Missing.
ICS 353: Design and Analysis of Algorithms NP-Complete Problems King Fahd University of Petroleum & Minerals Information & Computer Science Department.
Gonçalo Abecasis and Janis Wigginton University of Michigan, Ann Arbor
Approximate Algorithms (chap. 35)
Computability and Complexity
ICS 353: Design and Analysis of Algorithms
Haplotype Reconstruction
Introduction to PCP and Hardness of Approximation
IBD Estimation in Pedigrees
Instructor: Aaron Roth
Parsimony population haplotyping
Presentation transcript:

Half-Sibling Reconstruction A Theoretical Analysis Saad Sheikh Department of Computer Science University of Illinois at Chicago Brothers! ? ?

Many Problems exist where No way to ascertain the ground-truth Correlate naturally with theoretical problems E.g. Finding Communities, Sequence Alignment and: Sibling Reconstruction Given genetic information on a cohort of individuals determine the sibling relationships in the population. Theoretically linked to classical problems including graph coloring, triangle packing and Raz’s Parallel Repetition Theorem Introduction

Used in: conservation biology, animal management, molecular ecology, genetic epidemiology Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness. But: hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier Lemon sharks, Negaprion brevirostris 2 Brown-headed cowbird (Molothrus ater) eggs in a Blue-winged Warbler's nest Biological Motivation

Gene Unit of inheritance Allele Actual genetic sequence Locus Location of allele in entire genetic sequence Diploid 2 alleles at each locus Basic Genetics IndividualLocus 1Locus 2 I15/1020/30 I21/450/60

Microsatellites (STR) Advantages: Codominant (easy inference of genotypes and allele frequencies) Many heterozygous alleles per locus Possible to estimate other population parameters Cheaper than SNPs But: Few loci And: Large families Self-mating … CACACACA 5’ Alleles CACACACA CACACACACACA CACACACACACACA #1 #2 #3 Genotypes 1/12/2 3/3 1/21/32/3

Sibling Reconstruction Problem Sibling Groups: 2, 4, 5, 6 1, 3 7, 8 22/221/68 88/221/57 1/36 33/441/35 77/661/34 33/551/43 33/441/32 11/221/21 allele1/allele2 Locus2Locus1Animal S={P1={2,4,5,6},P2={1,3},P3={7,8}} 33/77

Existing Methods MethodApproachAssumptions Almudevar & Field (1999,2003) Minimal Sibling groups under likelihood Minimal sibgroups, representative allele frequencies KinGroup (2004)Markov Chain Monte Carlo/ML Allele Frequencies etc. are representative Family Finder(2003) Partition population using likelihood graphs Allele Frequencies etc. are representative Pedigree (2001)Markov Chain Monte Carlo/ML Allele Frequencies etc are representative COLONY (2004)Simulated AnnealingMonogamy for one sex Fernandez & Toro (2006) Simulated AnnealingCo-ancestry matrix is a good measure, parents can be reconstructed or are available

Objective Find the minimum Full Sibgroups necessary to explain the cohort Algorithm [Berger-Wolf et al. ISMB 2007] Enumerate all maximal feasible full sibgroups Determine the minimum number of full sibgroups necessary to explain the cohort Complexity [Ashley et al. JCSS 2009] NP-Hard (Graph Coloring) Inapproximable ILP [Chaovalitwongse et al. 09 INFORMS JoC] Full Sibling Reconstruction

4-allele rule: siblings have at most 4 different alleles in a locus Yes: 3/3, 1/3, 1/5, 1/6 No: 3/3, 1/3, 1/5, 1/6, 3/2 2-allele rule: In a locus in a sibling group: a + R ≤ 4 Yes:3/3, 1/3, 1/5 No: 3/3, 1/3, 1/5, 1/6 Mendelian Constraints Num distinct alleles Num alleles that appear with 3 others or are homozygote

Minimum Set Cover Given: universe U = {1, 2, …, n} collection of sets S = {S 1, S 2,…,S m } where S i subset of U Find:the smallest number of sets in S whose union is the universe U Minimum Set Cover is NP-hard (1+ln n)-approximable (sharp)

Min number of sibgroups is just ONE (effective) way to interpret parsimony Alternate Objectives Sibship that minimizes number of parents Sibship that minimizes number of matings Sibship that maximizes family size Sibship that tries to satisfy uniform allele distributions Parsimony: Alternate Objectives

Generate candidate sets by all pairs of individuals Compare every set to every individual x if x can be added to the set without any affecting “accomodability” or violating 2-allele: add it If the “accomodability” is affected, but the 2-allele property is still satisfied: create a new copy of the set, and add to it Otherwise ignore the individual, compare the next 2-Allele Algorithm Overview

Add New Group Add (won’t accommodate (2,2)) Can’t add (a+R =4) Examples 1,4 1, 2 3, 4 3, 2 1,4 1,2 3,2 1,4 1,2 1,1 1,5

Problem Statement: Given a population U of individuals, partition the individuals into groups G such that the parents (mothers+fathers) necessary for G are minimized Observations and Challenges: MinParents: intractable, inapproximable Reduction from Min-Rep Problem (Raz’s Parallel Repetition Theorem) There may be O(2 |loci| ) potential parents for a sibgroup Self-mating (plants) may or may not be allowed Parsimony: Minimize Parents

Objective Minimize the number of parents necessary to generate the sibling reconstruction Algorithm Enumerate all (closed) maximal feasible full sibgroups Generate all possible parents for full sibgroups Use a special vertex cover to determine the minimum number of parents Complexity [Ashley et al. AAIM 2009] NP-Hard (Raz’s Parallel Repetition Theorem) Inapproximable Full Sibling Reconstruction MP

2-prover 1-round proof system label cover problem for bipartite graphs small inapproximability boosting (Raz’s parallel repetition theorem) parallel repetition of 2-prover 1-round proof system label cover problem for some kind of “graph product” for bipartite graphs larger inapproximability Unique games conjecture restriction Raz’s Parallel Repetition Theorem a parallel repetition of any two-prover one-round proof system (MIP(2,1)) decreases the probability of error at an exponential rate.

Inapproximability for MINREP (Raz’s parallel repetition theorem) Let L  NP and x be an input instance of L L MINREP O(n polylog(n) ) time xLxL xLxL OPT ≤ α+β 0 < ε < 1 is any constant OPT  (α+β) 2 log  |A| +|B|

MINREP (minimum representative) problem α partitions all of equal size β partitions all of equal size … A B A1A1 A2A2 AαAα B1B1 B2B2 BβBβ B3B3 … … A1A1 A2A2 AαAα B1B1 B2B2 B3B3 BβBβ B “super”-nodes A “super”-nodes associated “super”-graph H input graph G … (A 1,B 2 )  H if  u  A 1 and v  B 2 such that (u,v)  G In this case, edge (u,v)  G a witness of the super-edge (A 1,B 2 )  H α partitions all of equal size … A B A1A1 A2A2 AαAα B1B1 B2B2 BβBβ B3B3

MINREP goal Valid solution: A’  A and B’  B such that A’  B’ contains a witness for every super-edge Objective: minimize the size of the solution |A’  B’|

Informally, given a set of children given a candidate set of parents assuming we believe in Mendelian inheritance law assuming that the parents tried to be as much monogamous as possible can we partition the children into a set of full siblings (full sibling group has the same pair of parents) Can reduce MINREP to show that this problem is hard

1. Generate M a set of covering groups 2. Select S, a subset of M 3. For each group x in S 1. Generate Parent Pairs for x 2. Insert parent vertices into graph G (if needed) 3. Connect the parents in each parent pair 4. Cover the minimum vertices necessary to (doubly) cover all the individuals Min Parents Sib Reconstruction M={{1,2},{3,6,7},{3,5}, {2,4},{1,6},{2,5},{6,7}} S={{1,2,4},{3,5},{6,7}} X={3,5} {F=5/10, M=2/20},{F=12/44.M=1/49} 5/ 10 2/ 20 12/ 44 1/ 49 X={3,5}

Objective Minimum number of half-sibgroups necessary to explain the cohort Algorithm Enumerate all maximal feasible half-sibgroups Use min set cover to determine the minimum number of groups Complexity NP-Hard Inapproximable (Exact Cover By 3-Sets) Min Half-Sibs Reconstruction

Half-sibs rule: siblings have 2 alleles at each locus from which one allele must be present in each individual Yes: 3/3, 1/3, 1/5, 1/6,8/3,10/3,29/3 (3/1) No: 31/3, 1/6, 29/10 Mendelian Constraints

Half-Sibs Enumeration 22/221/68 88/221/57 1/36 33/441/35 77/661/34 33/551/43 33/441/32 11/221/21 allele1/allele2 Locus2Locus1Animal 33/77 Alleles at Locus 1={1,2,3,4,5,6} All Pairs: (1,2) =>{1,2,3,4,5,6,7,8} (1,3),(1,4), (1,5),(1,6) (2,3)=>{1,2,4,5,6} … Alleles at Locus 2 ={11,22,33,44,55,66,77,88} All Pairs: (11,33)=>{1,2,3,5,6} (11,22)=>{1,7,8} (33,66)=>{2,3,4,5,6} …. Common: {1,2,3,5,6} {1,7,8} ….

Enumeration Algorithm

Results

Biologically Correct Reconstructions { {1,2,4,5},{7,8,10,11},{13,14,15,16},{ 17,18,19,20} } { {1,2,7,8},{4,5,10,11 } {13,14,17,18} { 15,16,19,20} } { {1,2,7,8},{4,5,10,11 } {13,14,15,16} { 17,18,19,20} } { {1,2,4,5},{7,8,10,11 } {13,14,17,18} { 15,16,19,20} } Inherent Problem in Half-Sibs Reconstruction

Reconstruct both paternal and maternal half- sibgroups! What does a half-sibgroup represent? A parent Intersection of Half-Sibgroups give us full-sibgroups Sibling Reconstruction by Minimizing Parents Raz’s Parallel Repetition theorem, Inapproximability, again! Solution

Sibling Reconstruction problem is NP-Hard and Inapproximable for following objectives Minimum Full Sibs Reconstruction Minimum Half-Sibs Reconstruction Minimum Parents Full-Sibs Reconstruction Minimum Half-Sibs Reconstruction with double cover We need to think more about Half-Sibs Problem Conclusion

T. Y. Berger-Wolf, S. I. Sheikh, B. DasGupta, M. Ashley, W. Chaovalitwongse and S. P. Lahari, Reconstructing Sibling Relationships in Wild Populations In Proceedings of 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) 2007 and Bioinformatics, 23(13). S. I. Sheikh, T. Y. Berger-Wolf, Mary V. Ashley, Isabel C. Caballero, Wanpracha Chaovalitwongse and B. DasGupta " Error-Tolerant Sibship Reconstruction for Wild Populations In Proceedings of 7th Annual International Conference on Computational Systems Biology (CSB 2008). Mary. Ashley, Tanya Y. Berger-Wolf, Isabel Caballero, Wanpracha Chaovalitwongse, Chun-An Chou, Bhaskar DasGupta and Saad Sheikh, Full Sibling Reconstructions in Wild Populations From Microsatellite Genetic Markers, to appear in Computational Biology: New Research, Nova Science Publishers. M. V. Ashley, I. C. Caballero, W. Chaovalitwongse, B. DasGupta, P. Govindan, S. Sheikh and T. Y. Berger-Wolf. KINALYZER, A Computer Program for Reconstructing Sibling Groups, Molecular Ecology Resources References

M. Ashley, T. Berger-Wolf, W. Chaovalitwongse, B. DasGupta, A Khokhar S. Sheikh On Approximating An Implicit Cover Problem in Biology, Proceedings of 5th International Conference on Algorithmic Aspects of Information and Management 2009 (to appear) W. Chaovalitwongse, C-A Chou, T. Y. Berger-Wolf, B. DasGupta, S. Sheikh, M. V. Ashley, I. C. Caballero. New Optimization Model and Algorithm for Sibling Reconstruction from Genetic Markers INFORMS Journal of Computing (to appear) M. Ashley, T. Berger-Wolf, P. Berman, W. Chaovalitwongse, B. DasGupta, and M.-Y. Kao. On Approximating Four Covering and Packing Problems Journal of Computer and System Science (to appear) S. I. Sheikh, T. Y. Berger-Wolf, Mary V. Ashley, Isabel C. Caballero, Wanpracha Chaovalitwongse and B. DasGupta " Combinatorial Reconstruction of Half-Sibling Groups References

Mary Ashley UIC W. Art Chaovalitwongse Rutgers Isabel Caballero UIC Sibship Reconstruction Project Ashfaq Khokhar UIC Tanya Berger-Wolf UIC Priya Govindan UIC Bhaskar DasGupta UIC Thank You!! Questions? Chun-An (Joe) Chou Rutgers