Reconstructing Kinship Relationships in Wild Populations
  "I do not believe that the accident of birth makes people sisters and brothers. It makes them siblings. Gives them mutuality of parentage." (Maya Angelou)
  Isabel Caballero (UIC), Priya Govindan (Rutgers), Chun-An (Joe) Chou (Rutgers), Saad Sheikh (Ecole Polytechnique), Alan Perez-Rathke (UIC), Mary Ashley (UIC), W. Art Chaovalitwongse (Rutgers), Ashfaq Khokhar (UIC), Bhaskar DasGupta (UIC), Tanya Berger-Wolf (UIC)
Microsatellites (STR)
  Advantages:
  - Codominant (easy inference of genotypes and allele frequencies)
  - Many heterozygous alleles per locus
  - Possible to estimate other population parameters
  - Cheaper than SNPs
  But: few loci
  And: large families, self-mating, …
  Example (CA repeats):
    Alleles: #1 = CACACACA, #2 = CACACACACACA, #3 = CACACACACACACA
    Genotypes: 1/1, 2/2, 3/3, 1/2, 1/3, 2/3
Diploid Siblings
  Siblings: two children with the same parents
  Question: given a set of children, find the sibling groups
  At each locus a child carries two alleles, one from the father and one from the mother:
    father: (.../...), (a/b), (.../...), (.../...)
    mother: (.../...), (c/d), (.../...), (.../...)
    child:  (.../...), (e/f), (.../...), (.../...)
Why Reconstruct Sibling Relationships?
  - Used in: conservation biology, animal management, molecular ecology, genetic epidemiology
  - Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness
  - But: parent/offspring pairs are hard to sample; sampling cohorts of juveniles is easier
The Problem
  Genotypes (allele 1/allele 2):
    Ind | Locus 1 | Locus 2
    1   | 1/2     |
    2   | 1/3     | 3/4
    3   | 1/4     | 3/5
    4   | 3/3     | 7/6
    5   | 1/3     | 3/4
    6   | 1/3     | 3/7
    7   | 1/5     | 8/2
    8   | 1/6     | 2/2
  Sibling groups: {2, 4, 5, 6}, {1, 3}, {7, 8}
Existing Methods
    Method | Approach | Error detection | Assumptions
    Almudevar & Field (1999, 2003) | Minimal sibling groups under likelihood | No | Minimal sibgroups, representative allele frequencies
    KinGroup (2004) | Markov Chain Monte Carlo / ML | No | Allele frequencies etc. are representative
    Family Finder (2003) | Partition population using likelihood graphs | No | Allele frequencies etc. are representative
    Pedigree (2001) | Markov Chain Monte Carlo / ML | No | Allele frequencies etc. are representative
    COLONY (2004) | Simulated Annealing / ML | Yes | Monogamy for one sex
    Fernandez & Toro (2006) | Simulated Annealing / ML | No | Co-ancestry matrix is a good measure; parents can be reconstructed or are available
Inheritance Rules
    father:  (.../...), (a/b), (.../...), (.../...)
    mother:  (.../...), (c/d), (.../...), (.../...)
    child 1: (.../...), (e1/f1), (.../...), (.../...)
    child 2: (.../...), (e2/f2), (.../...), (.../...)
    child 3: (.../...), (e3/f3), (.../...), (.../...)
    …
    child n: (.../...), (en/fn), (.../...), (.../...)
  4-allele rule: siblings have at most 4 distinct alleles in a locus
  2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
    a = number of distinct alleles
    R = number of alleles that appear with 3 others or are homozygous
Our Approach: Mendelian Constraints
  4-allele rule: siblings have at most 4 different alleles in a locus
    Yes: 3/3, 1/3, 1/5, 1/6
    No:  3/3, 1/3, 1/5, 1/6, 3/2
  2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
    a = number of distinct alleles
    R = number of alleles that appear with 3 others or are homozygous
    Yes: 3/3, 1/3, 1/5
    No:  3/3, 1/3, 1/5, 1/6
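The two rules can be checked mechanically for the genotypes of one candidate sibling group at one locus. The sketch below is only an illustration, not the authors' implementation: the function names are hypothetical, and "appears with 3 others" is read here as "is paired with at least three distinct other alleles".

```python
from collections import defaultdict

def four_allele_ok(genotypes):
    """4-allele rule: at most 4 distinct alleles at the locus."""
    return len({allele for pair in genotypes for allele in pair}) <= 4

def two_allele_ok(genotypes):
    """2-allele rule: a + R <= 4 (assumed reading of R, see lead-in)."""
    alleles = set()
    homozygous = set()
    partners = defaultdict(set)  # allele -> distinct alleles it is paired with
    for x, y in genotypes:
        alleles.update((x, y))
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    a = len(alleles)
    r = sum(1 for al in alleles if al in homozygous or len(partners[al]) >= 3)
    return a + r <= 4

# The slide's own examples at a single locus
print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))          # True
print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6), (3, 2)]))  # False
print(two_allele_ok([(3, 3), (1, 3), (1, 5)]))                   # True
print(two_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))           # False
```

In the last example a = 4 and R = 2 (allele 3 is homozygous, allele 1 is paired with 3, 5 and 6), so a + R = 6 and the group is rejected even though it passes the 4-allele rule.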
Our Approach: Sibling Reconstruction
  Given: n diploid individuals sampled at l loci
  Find: the minimum number of 2-allele sets that contain all individuals
  - NP-complete, even when an upper bound on sibset size is known; an approximation gap is also shown (Ashley et al. '09)
  - ILP formulation (Chaovalitwongse et al. '07, '10)
  - Minimum Set Cover based algorithm with optimal solution, using CPLEX (Berger-Wolf et al. '07)
  - Parallel implementation (Sheikh, Khokhar, BW '10)
  [Figure: Canonical families. Example individuals (ID: alleles at one locus): 1: 1/2, 2: 2/3, 3: 2/1, 4: 1/3, 5: 3/2, 6: 1/4. The panels list canonical families over allele labels 1 to 4, alongside a second example table with raw allele values (e.g. 55/43, 55/78).]
Aside: Minimum Set Cover
  Given: a universe U = {1, 2, …, n} and a collection of sets S = {S_1, S_2, …, S_m}, where each S_i is a subset of U
  Find: the smallest number of sets in S whose union is the universe U
  Minimum Set Cover is NP-hard and (1 + ln n)-approximable (sharp)
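The (1 + ln n) guarantee is the one achieved by the classic greedy heuristic, which repeatedly picks the set covering the most still-uncovered elements. The sketch below shows that heuristic only; it is not the exact CPLEX-based solver mentioned in the talk, and the function name and example sets are made up for illustration.

```python
def greedy_set_cover(universe, sets):
    """Return indices of chosen sets; greedy (1 + ln n)-approximation."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that covers the most uncovered elements
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# Hypothetical example: in the sibling setting, the universe would be the
# individuals and the sets would be candidate 2-allele sibling sets.
universe = {1, 2, 3, 4, 5, 6}
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
print(greedy_set_cover(universe, sets))  # [0, 2]
```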
Are we done?
  Challenges:
  - No ground truth available
  - Growing number of methods
  - Biologists need (one) reliable reconstruction
  - Genotyping errors
  Answer: Consensus
  "Consensus is what many people say in chorus but do not believe as individuals." (Abba Eban, Israeli diplomat, in "The New Yorker," 23 Apr 1990)
Consensus Methods
  - Combine multiple solutions to a problem to generate one unified solution: C: S* → S
  - Based on social choice theory
  - Commonly used where the real solution is not known, e.g. phylogenetic trees
  [Diagram: S1, S2, …, Sk → Consensus → S]
Error-Tolerant Approach (Sheikh et al. '08)
  [Diagram: Locus 1, Locus 2, Locus 3, …, Locus l → Sibling Reconstruction Algorithm → S1, S2, …, Sk → Consensus → S]
Distance-based Consensus (Sheikh et al. '08)
  [Diagram: S1, S2, …, Sk → Consensus → Ss → Search (using f_q and f_d) → S]
  Algorithm:
  - Compute a consensus solution S = {g_1, …, g_k}
  - Search for a good solution near S
  NP-hard for any f_d, f_q, or an arbitrary linear combination
A Greedy Approach - Algorithm
  - Compute a strict consensus
  - While the total distance is not too large: merge the two sibgroups with minimal (total) distance
  Quality: f_q = n - |C|
  Distance function from solution C to C': f_d(C, C') = sum of costs of merging groups in C to obtain C' = sum of costs of assigning individuals to groups
  Assigning an individual to a group:
  - Benefit: alleles and allele pairs shared
  - Cost: minimum edit distance
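The merge loop can be sketched as follows, starting from a strict consensus and a budget on total distance. This is a minimal illustration under stated assumptions: `merge_cost` and `max_total_cost` are hypothetical parameters, and the toy cost used in the example (product of group sizes) merely stands in for the edit-distance-based cost described on the slide.

```python
from itertools import combinations

def greedy_consensus(groups, merge_cost, max_total_cost):
    """Repeatedly merge the cheapest pair of groups while the budget allows."""
    groups = [frozenset(g) for g in groups]
    total = 0
    while len(groups) > 1:
        # Cheapest pair of groups under the supplied (hypothetical) cost
        cost, i, j = min(
            (merge_cost(groups[i], groups[j]), i, j)
            for i, j in combinations(range(len(groups)), 2)
        )
        if total + cost > max_total_cost:
            break  # total distance would become too large
        total += cost
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups, total

# Toy placeholder cost, NOT the allele-based cost from the talk
toy_cost = lambda a, b: len(a) * len(b)
start = [{1, 2}, {3}, {4}, {5}, {6, 7}]
result, spent = greedy_consensus(start, toy_cost, max_total_cost=2)
print(sorted(sorted(g) for g in result), spent)  # [[1, 2], [3, 4], [5], [6, 7]] 1
```

Passing the cost as a function keeps the loop independent of how costs and benefits are actually scored.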
Auto Greedy Consensus
  - Change costs to average per-locus costs
  - Compare max group error on a per-locus basis
  - Treat cost and benefit independently
  - To qualify, a merge must satisfy:
    - Cost <= maxcost
    - Benefit >= minbenefit
    - Benefit = max benefit among possible merges
A Greedy Approach
  S1 = { {1,2,3}, {4,5}, {6,7} }
  S2 = { {1,2,3}, {4}, {5,6,7} }
  S3 = { {1,2}, {3,4,5}, {6,7} }
  Strict consensus: S = { {1,2}, {3}, {4}, {5}, {6,7} }
  After merging {3} with {6,7}: S = { {1,2}, {3,6,7}, {4}, {5} }
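A strict consensus keeps two individuals together only if every input solution puts them in the same group. One way to compute it, sketched here with a hypothetical function name and assuming every individual appears exactly once in every solution, is to group individuals by the tuple of group indices they receive across the solutions.

```python
from collections import defaultdict

def strict_consensus(solutions):
    """Individuals share a consensus group iff they share a group everywhere."""
    signature = defaultdict(list)  # individual -> group index in each solution
    for solution in solutions:
        for group_index, group in enumerate(solution):
            for individual in group:
                signature[individual].append(group_index)
    consensus = defaultdict(set)
    for individual, sig in signature.items():
        consensus[tuple(sig)].add(individual)
    return sorted(sorted(g) for g in consensus.values())

s1 = [{1, 2, 3}, {4, 5}, {6, 7}]
s2 = [{1, 2, 3}, {4}, {5, 6, 7}]
s3 = [{1, 2}, {3, 4, 5}, {6, 7}]
print(strict_consensus([s1, s2, s3]))  # [[1, 2], [3], [4], [5], [6, 7]]
```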
Testing and Validation: Protocol
  1. Get a dataset with known sibgroups (real or simulated)
  2. Find sibgroups using our algorithm
  3. Compare the solutions
     - Partition distance (Gusfield '03) = assignment problem
  Compare to other sibship methods: Family Finder, COLONY
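Partition distance is the minimum number of individuals that must be removed so that two partitions become identical; it equals n minus the best one-to-one matching of groups by overlap, which is an assignment problem. The sketch below solves that matching by brute force over permutations, which is only practical for a handful of groups; a real evaluation would use a proper assignment solver. The function name is hypothetical.

```python
from itertools import permutations

def partition_distance(p, q):
    """Partition distance between two partitions of the same individuals."""
    p = [set(g) for g in p]
    q = [set(g) for g in q]
    n = sum(len(g) for g in p)
    # Pad the shorter partition with empty groups so the matching is square
    k = max(len(p), len(q))
    p += [set()] * (k - len(p))
    q += [set()] * (k - len(q))
    # Brute-force assignment: maximise total overlap of matched groups
    best_overlap = max(
        sum(len(p[i] & q[perm[i]]) for i in range(k))
        for perm in permutations(range(k))
    )
    return n - best_overlap

truth = [{1, 2, 3}, {4, 5}, {6, 7}]
found = [{1, 2}, {3, 4, 5}, {6, 7}]
print(partition_distance(truth, found))  # 1 (removing individual 3 makes them equal)
```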
Test Data
  Real data:
  - Salmon (Salmo salar), Herbinger et al.: 6 families, 4 loci. No missing alleles
  - Shrimp (Penaeus monodon), Jerry et al.: 13 families, 7 loci. Some missing alleles
  - Ants (Leptothorax acervorum), Hammond et al., 2001: ants are a haplodiploid species; the data consist of 377 diploid worker ants
  Simulated data:
  - Populations of juveniles generated for a range of values of the number of parents, offspring per parent, alleles per locus, and number of loci, and the distributions of those
Experimental Protocol
  - Generate F females and M males (F = M = 5, 10, 20)
  - Each with l loci (l = 2, 4, 6, 8, 10)
  - Each locus with a alleles (a = 10, 15)
  - Generate f families (f = 5, 10, 20)
  - For each family, select a female and a male uniformly at random
  - For each parent pair, generate o offspring (o = 5, 10)
  - For each offspring, for each locus, choose the allele outcome uniformly at random
  - Introduce random errors
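The protocol can be sketched as a small simulator. Several details here are assumptions, since the slide does not specify them: parental alleles are drawn uniformly from the a alleles, an "error" replaces the paternally inherited allele with a uniformly random one at rate `error_rate`, and all names are hypothetical.

```python
import random

def simulate_offspring(num_females, num_males, num_loci, num_alleles,
                       num_families, offspring_per_family,
                       error_rate=0.0, rng=None):
    """Return a list of (family_id, genotype); genotype is one allele pair per locus."""
    rng = rng or random.Random()

    def random_parent():
        # Assumption: parental alleles are uniform over the allele labels
        return [(rng.randrange(num_alleles), rng.randrange(num_alleles))
                for _ in range(num_loci)]

    females = [random_parent() for _ in range(num_females)]
    males = [random_parent() for _ in range(num_males)]

    offspring = []
    for family_id in range(num_families):
        mother = rng.choice(females)  # parents picked uniformly at random
        father = rng.choice(males)
        for _ in range(offspring_per_family):
            genotype = []
            for locus in range(num_loci):
                paternal = rng.choice(father[locus])  # Mendelian inheritance
                maternal = rng.choice(mother[locus])
                if rng.random() < error_rate:
                    # Assumed error model: overwrite one allele at random
                    paternal = rng.randrange(num_alleles)
                genotype.append((paternal, maternal))
            offspring.append((family_id, genotype))
    return offspring

# One parameter combination from the protocol: F = M = 5, l = 4, a = 10, f = 5, o = 10
juveniles = simulate_offspring(5, 5, 4, 10, 5, 10, rng=random.Random(1))
print(len(juveniles))  # 50
```

Because the family label is kept with each genotype, the output doubles as the known ground truth against which a reconstruction can be scored.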
Results
Conclusions
  - Combinatorial algorithms with minimal assumptions
  - Behaves well on real and simulated data
  - Better than others with few loci, few large families
  - Error tolerant
  - Useful, high demand
  New and improved:
  - Efficient implementation (Perez-Rathke et al., in submission)
  - Other objectives (bio vs math) (Ashley et al. '10)
  - Other genealogical relationships (Sheikh et al. '09, '10)
  - Different combinatorial approach (Brown & B-W '10)
  - Pedigree amalgamation