Reconstructing Kinship Relationships in Wild Populations I do not believe that the accident of birth makes people sisters and brothers. It makes them siblings.

Presentation transcript:

Reconstructing Kinship Relationships in Wild Populations
"I do not believe that the accident of birth makes people sisters and brothers. It makes them siblings. Gives them mutuality of parentage." Maya Angelou
Isabel Caballero (UIC), Priya Govindan (Rutgers), Chun-An (Joe) Chou (Rutgers), Saad Sheikh (Ecole Polytechnique), Alan Perez-Rathke (UIC), Mary Ashley (UIC), W. Art Chaovalitwongse (Rutgers), Ashfaq Khokhar (UIC), Bhaskar DasGupta (UIC), Tanya Berger-Wolf (UIC)

Microsatellites (STR)
Advantages:
- Codominant (easy inference of genotypes and allele frequencies)
- Many heterozygous alleles per locus
- Possible to estimate other population parameters
- Cheaper than SNPs
But:
- Few loci
And:
- Large families
- Self-mating
- …
Example (CA repeat, read from the 5' end):
Alleles: #1 CACACACA, #2 CACACACACACA, #3 CACACACACACACA
Genotypes: 1/1, 2/2, 3/3, 1/2, 1/3, 2/3

Siblings: two children with the same parents
Question: given a set of children, find the sibling groups
Diploid siblings (each genotype is a list of loci; each locus holds two alleles):
father: (.../...),(a/b),(.../...),(.../...)
mother: (.../...),(c/d),(.../...),(.../...)
child: (.../...),(e/f),(.../...),(.../...), with one allele from the father and one from the mother
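The "one from father, one from mother" condition can be sketched as a small check. This is only an illustration: the `Genotype` type and the function name `is_possible_child` are assumptions made here, not part of the presentation, and genotyping errors are ignored.

```python
from typing import List, Tuple

# A genotype is one (allele, allele) pair per locus (hypothetical encoding).
Genotype = List[Tuple[int, int]]


def is_possible_child(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """Return True if, at every locus, one child allele can come from the
    father and the other from the mother."""
    for (x, y), dad, mom in zip(child, father, mother):
        # Try both ways of assigning the two child alleles to the parents.
        if not ((x in dad and y in mom) or (y in dad and x in mom)):
            return False
    return True


father = [(1, 2), (3, 4)]
mother = [(1, 3), (5, 5)]
print(is_possible_child([(2, 3), (4, 5)], father, mother))  # True
print(is_possible_child([(2, 2), (4, 5)], father, mother))  # False: mother has no allele 2
```

The sibling reconstruction problem is harder than this check because the parents are not sampled; the rules on the later slides constrain sibling groups without them.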

Why Reconstruct Sibling Relationships?
- Used in: conservation biology, animal management, molecular ecology, genetic epidemiology
- Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness
- But: it is hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier

The Problem
Genotypes are given as allele 1/allele 2:
Ind | Locus 1 | Locus 2
1 | 1/2 |
2 | 1/3 | 3/4
3 | 1/4 | 3/5
4 | 3/3 | 7/6
5 | 1/3 | 3/4
6 | 1/3 | 3/7
7 | 1/5 | 8/2
8 | 1/6 | 2/2
Sibling groups: {2, 4, 5, 6}, {1, 3}, {7, 8}

Existing Methods
Method | Approach | Error detection | Assumptions
Almudevar & Field (1999, 2003) | Minimal sibling groups under likelihood | No | Minimal sibgroups, representative allele frequencies
KinGroup (2004) | Markov Chain Monte Carlo/ML | No | Allele frequencies etc. are representative
Family Finder (2003) | Partition population using likelihood graphs | No | Allele frequencies etc. are representative
Pedigree (2001) | Markov Chain Monte Carlo/ML | No | Allele frequencies etc. are representative
COLONY (2004) | Simulated Annealing/ML | Yes | Monogamy for one sex
Fernandez & Toro (2006) | Simulated Annealing/ML | No | Co-ancestry matrix is a good measure; parents can be reconstructed or are available

Inheritance Rules
father: (.../...),(a/b),(.../...),(.../...)
mother: (.../...),(c/d),(.../...),(.../...)
child 1: (.../...),(e1/f1),(.../...),(.../...)
child 2: (.../...),(e2/f2),(.../...),(.../...)
child 3: (.../...),(e3/f3),(.../...),(.../...)
…
child n: (.../...),(en/fn),(.../...),(.../...)
4-allele rule: siblings have at most 4 distinct alleles in a locus
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where a is the number of distinct alleles and R is the number of alleles that appear with 3 others or are homozygous

Our Approach: Mendelian Constraints
4-allele rule: siblings have at most 4 different alleles in a locus
Yes: 3/3, 1/3, 1/5, 1/6
No: 3/3, 1/3, 1/5, 1/6, 3/2
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where a is the number of distinct alleles and R is the number of alleles that appear with 3 others or are homozygous
Yes: 3/3, 1/3, 1/5
No: 3/3, 1/3, 1/5, 1/6
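The two counting rules above can be checked on the slide's own examples with a short sketch. The function names are hypothetical, and reading "appear with 3 others" as "paired with at least three different other alleles" is an assumption made for this illustration.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]  # the two alleles of one individual at one locus


def four_allele_ok(genotypes: Iterable[Pair]) -> bool:
    """4-allele rule: at most 4 distinct alleles in the group at this locus."""
    return len({allele for g in genotypes for allele in g}) <= 4


def two_allele_count_ok(genotypes: Iterable[Pair]) -> bool:
    """Counting form of the 2-allele rule: a + R <= 4."""
    genotypes = list(genotypes)
    alleles = {allele for g in genotypes for allele in g}
    partners = {allele: set() for allele in alleles}
    homozygous = set()
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    # R: alleles that are homozygous somewhere or paired with >= 3 other alleles.
    r = sum(1 for a in alleles if a in homozygous or len(partners[a]) >= 3)
    return len(alleles) + r <= 4


print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))          # True
print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6), (3, 2)]))  # False
print(two_allele_count_ok([(3, 3), (1, 3), (1, 5)]))             # True  (a=3, R=1)
print(two_allele_count_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))     # False (a=4, R=2)
```

In the last example allele 3 is homozygous and allele 1 is paired with 3, 5 and 6, so R = 2 and a + R = 6.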

Our Approach: Sibling Reconstruction
Given: n diploid individuals sampled at l loci
Find: minimum number of 2-allele sets that contain all individuals
- NP-complete even when we know sibsets are at most …; approximation gap … (Ashley et al. '09)
- ILP formulation (Chaovalitwongse et al. '07, '10)
- Minimum Set Cover based algorithm with optimal solution, using CPLEX (Berger-Wolf et al. '07)
- Parallel implementation (Sheikh, Khokhar, BW '10)

Canonical families
Example individuals at one locus (ID: alleles): 1: 1/2, 2: 2/3, 3: 2/1, 4: 1/3, 5: 3/2, 6: 1/4
[Figure: the example individuals matched against the canonical families over alleles 1 to 4]

Aside: Minimum Set Cover
Given: universe U = {1, 2, …, n} and a collection of sets S = {S1, S2, …, Sm}, where each Si is a subset of U
Find: the smallest number of sets in S whose union is the universe U
Minimum Set Cover is NP-hard and (1 + ln n)-approximable (sharp)
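The (1 + ln n) guarantee mentioned above is achieved by the standard greedy heuristic, sketched below. This is not the method used in the talk, which solves the set cover to optimality with CPLEX; the function name and the example sets are made up for illustration.

```python
from typing import List, Set


def greedy_set_cover(universe: Set[int], sets: List[Set[int]]) -> List[int]:
    """Return indices of chosen sets; each step picks the set that covers
    the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


universe = {1, 2, 3, 4, 5, 6}
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
print(greedy_set_cover(universe, sets))  # [0, 2]
```

In the sibling setting the universe would be the sampled individuals and the candidate sets would be groups that satisfy the 2-allele rule.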

Are we done?
Challenges:
- No ground truth available
- Growing number of methods
- Biologists need (one) reliable reconstruction
- Genotyping errors
Answer: Consensus
"Consensus is what many people say in chorus but do not believe as individuals." Abba Eban, Israeli diplomat, in "The New Yorker," 23 Apr 1990

Consensus Methods
Combine multiple solutions to a problem to generate one unified solution
- C: S* → S
- Based on Social Choice Theory
- Commonly used where the real solution is not known, e.g. phylogenetic trees
[Figure: solutions S1, S2, ..., Sk → Consensus → S]

Error-Tolerant Approach (Sheikh et al. '08)
[Figure: Locus 1, Locus 2, Locus 3, ..., Locus l → Sibling Reconstruction Algorithm → solutions S1, S2, ..., Sk → Consensus → S]

Distance-based Consensus
[Figure: solutions S1, S2, ..., Sk → Consensus → Ss → Search → S, with quality function f_q and distance function f_d]
Algorithm:
- Compute a consensus solution S = {g1, ..., gk}
- Search for a good solution near S
NP-hard for any f_d, f_q or an arbitrary linear combination (Sheikh et al. '08)

A Greedy Approach: Algorithm
- Compute a strict consensus
- While total distance is not too large: merge two sibgroups with minimal (total) distance
- Quality: f_q = n - |C|
- Distance function from solution C to C': f_d(C, C') = sum of costs of merging groups in C to obtain C' = sum of costs of assigning individuals to groups
Cost of assigning an individual to a group:
- Benefit: alleles and allele pairs shared
- Cost: minimum edit distance

Auto Greedy Consensus
- Change costs to average per-locus costs
- Compare max group error on a per-locus basis
- Treat cost and benefit independently
- In order to qualify a merge:
  Cost <= maxcost
  Benefit >= minbenefit
  Benefit = max benefit among possible merges

A Greedy Approach
S1 = { {1,2,3}, {4,5}, {6,7} }
S2 = { {1,2,3}, {4}, {5,6,7} }
S3 = { {1,2}, {3,4,5}, {6,7} }
Strict consensus: S = { {1,2}, {3}, {4}, {5}, {6,7} }
After merging {3} with {6,7}: S = { {1,2}, {3,6,7}, {4}, {5} }
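The strict consensus in this example keeps two individuals together only if every input solution puts them in the same group. A minimal sketch of that step follows; the function name is hypothetical, each individual is assumed to appear exactly once in every solution, and the subsequent greedy merging with its cost and benefit functions is not covered.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List


def strict_consensus(solutions: List[List[Iterable[Hashable]]]) -> List[list]:
    """Group individuals that share a group in every input solution."""
    signature = defaultdict(list)
    for solution in solutions:
        for index, group in enumerate(solution):
            for individual in group:
                # Record which group this individual belongs to in this solution.
                signature[individual].append(index)
    merged = defaultdict(set)
    for individual, sig in signature.items():
        merged[tuple(sig)].add(individual)
    return sorted(sorted(group) for group in merged.values())


s1 = [{1, 2, 3}, {4, 5}, {6, 7}]
s2 = [{1, 2, 3}, {4}, {5, 6, 7}]
s3 = [{1, 2}, {3, 4, 5}, {6, 7}]
print(strict_consensus([s1, s2, s3]))  # [[1, 2], [3], [4], [5], [6, 7]]
```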

Testing and Validation: Protocol
1. Get a dataset with known sibgroups (real or simulated)
2. Find sibgroups using our algorithm
3. Compare the solutions
- Partition distance (Gusfield '03) = assignment problem
- Compare to other sibship methods: Family Finder, COLONY
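The partition distance is the smallest number of individuals that must be removed so that the two partitions become identical, and it reduces to matching groups so that total overlap is maximal. The sketch below solves that assignment by brute force over permutations, which is only practical for a handful of groups; a real evaluation would use a polynomial-time assignment solver. The function name is an assumption made for this example.

```python
from itertools import permutations
from typing import Iterable, List


def partition_distance(p: List[Iterable[int]], q: List[Iterable[int]]) -> int:
    """Number of elements to delete so that partitions p and q coincide."""
    p_groups = [set(g) for g in p]
    q_groups = [set(g) for g in q]
    n = sum(len(g) for g in p_groups)
    # Pad with empty groups so both sides have the same number of groups.
    size = max(len(p_groups), len(q_groups))
    p_groups += [set() for _ in range(size - len(p_groups))]
    q_groups += [set() for _ in range(size - len(q_groups))]
    # Best one-to-one matching of groups by total overlap (brute force).
    best = max(
        sum(len(p_groups[i] & q_groups[j]) for i, j in enumerate(perm))
        for perm in permutations(range(size))
    )
    return n - best


s1 = [{1, 2, 3}, {4, 5}, {6, 7}]
s2 = [{1, 2, 3}, {4}, {5, 6, 7}]
print(partition_distance(s1, s2))  # 1 (removing individual 5 makes them equal)
```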

Test Data
- Salmon (Salmo salar), Herbinger et al., … individuals, 6 families, 4 loci. No missing alleles
- Shrimp (Penaeus monodon), Jerry et al., … individuals, 13 families, 7 loci. Some missing alleles
- Ants (Leptothorax acervorum), Hammond et al., 2001. Ants are a haplodiploid species. The data consists of 377 diploid worker ants
- Simulated populations of juveniles for a range of values of number of parents, offspring per parent, alleles per locus, number of loci, and the distributions of those

Experimental Protocol
1. Generate F females and M males (F = M = 5, 10, 20)
   - Each with l loci (l = 2, 4, 6, 8, 10)
   - Each locus with a alleles (a = 10, 15)
2. Generate f families (f = 5, 10, 20)
   - For each family select a female and a male uniformly at random
   - For each parent pair generate o offspring (o = 5, 10)
   - For each offspring, for each locus, choose the allele outcome uniformly at random
3. Introduce random errors
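The first two steps of this protocol can be sketched as follows. Several details are assumptions, since the slide does not state them: parental alleles are drawn uniformly from 1..a, the same parent pair may be drawn for more than one family, and the error-introduction step is left out. The function and parameter names are hypothetical.

```python
import random
from typing import List, Tuple

Genotype = List[Tuple[int, int]]  # one (allele, allele) pair per locus


def simulate(num_females: int, num_males: int, num_loci: int, num_alleles: int,
             num_families: int, offspring_per_family: int,
             seed: int = 0) -> Tuple[List[Genotype], List[int]]:
    """Return offspring genotypes and the family index of each offspring."""
    rng = random.Random(seed)

    def random_parent() -> Genotype:
        # Assumption: alleles are uniform over 1..num_alleles.
        return [(rng.randint(1, num_alleles), rng.randint(1, num_alleles))
                for _ in range(num_loci)]

    females = [random_parent() for _ in range(num_females)]
    males = [random_parent() for _ in range(num_males)]
    offspring, family_of = [], []
    for family in range(num_families):
        mother, father = rng.choice(females), rng.choice(males)
        for _ in range(offspring_per_family):
            # One allele from each parent at every locus, chosen uniformly.
            child = [(rng.choice(f), rng.choice(m)) for f, m in zip(father, mother)]
            offspring.append(child)
            family_of.append(family)
    return offspring, family_of


children, families = simulate(5, 5, num_loci=4, num_alleles=10,
                              num_families=5, offspring_per_family=5)
print(len(children), len(set(families)))  # 25 5
```

The `family_of` labels play the role of the known sibgroups against which a reconstruction would be compared.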

Results

Results

Conclusions
- Combinatorial algorithms with minimal assumptions
- Behaves well on real and simulated data
- Better than others with few loci, few large families
- Error tolerant
- Useful, high demand
New and improved:
- Efficient implementation (Perez-Rathke et al., in submission)
- Other objectives, bio vs math (Ashley et al. '10)
- Other genealogical relationships (Sheikh et al. '09, '10)
- Different combinatorial approach (Brown & B-W, '10)
- Pedigree amalgamation