

Combinatorial Reconstruction of Sibling Relationships in Absence of Parental Data
Tanya Y Berger-Wolf (DIMACS and UIC CS)
Bhaskar DasGupta (UIC CS)
Wanpracha Chaovalitwongse (DIMACS and Rutgers IE)
Mary Ashley (UIC Biology)

The Problem
Each entry is allele1/allele2; values lost in extraction are shown as "…".

Animal  Locus 1  Locus 2
1       149/167  243/…
2       …/155    245/…
3       …/177    245/…
4       …/155    253/…
5       …/155    245/…
6       …/155    245/…
7       …/151    251/…
8       …/173    255/255

Sibling Groups:
2, 3, 4, 5
2, 3, 4, 6
1, 7, 8

Why Reconstruct Sibling Relationships?
Used in: conservation biology, animal management, molecular ecology, genetic epidemiology.
Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness.
But: parent/offspring pairs are hard to sample. Sampling cohorts of juveniles is easier.

Previous Work:
Statistical estimates of pairwise distance, followed by maximum likelihood clustering into family groups (Blouin et al. 1996; Thomas and Hill 2002; Painter 1997; Smith et al. 2001; Wang 2004).
Graph clustering algorithms that form groups from a graph of pairwise likelihood distances (Beyer and May, 2003).
Brute-force search for groups (not necessarily optimal) that satisfy the 4-allele Mendelian constraint (Almudevar and Field, 1999).

Our Approach: Mendelian Constraints
4-allele rule: a group of siblings can have no more than 4 different alleles at any given locus.
Example: 155/155, 149/155, 149/151, 149/173
2-allele rule: let a be the number of distinct alleles present at a given locus, and let R be the number of distinct alleles that either appear with three different alleles at this locus or are homozygous. Then a group of siblings must satisfy a + R ≤ 4.
Example: 155/155, 149/155, 149/151
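The two rules above can be checked mechanically for one locus. The following is a minimal sketch, not the authors' code. The function names and the tuple representation of a genotype are assumptions made for illustration. It also assumes that "appears with three different alleles" means the allele is paired, in heterozygous genotypes, with at least three distinct other alleles.

```python
from typing import List, Tuple

Genotype = Tuple[int, int]  # (allele1, allele2) at a single locus


def satisfies_4_allele(genotypes: List[Genotype]) -> bool:
    """True if the group shows at most 4 distinct alleles at this locus."""
    alleles = {a for g in genotypes for a in g}
    return len(alleles) <= 4


def satisfies_2_allele(genotypes: List[Genotype]) -> bool:
    """True if a + R <= 4 at this locus, reading the slide's rule literally."""
    alleles = {a for g in genotypes for a in g}
    partners = {a: set() for a in alleles}  # distinct other alleles seen alongside each allele
    homozygous = set()
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    # R counts alleles that are homozygous somewhere or have >= 3 distinct partners
    # (assumed interpretation of "appear with three different alleles").
    r = sum(1 for a in alleles if a in homozygous or len(partners[a]) >= 3)
    return len(alleles) + r <= 4


four = [(155, 155), (149, 155), (149, 151), (149, 173)]
three = [(155, 155), (149, 155), (149, 151)]
print(satisfies_4_allele(four), satisfies_2_allele(four))
print(satisfies_4_allele(three), satisfies_2_allele(three))
```

Under this reading, the four-genotype example passes the 4-allele check but fails the 2-allele check (a = 4, R = 2), while the three-genotype example passes both (a = 3, R = 1).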

Our Algorithm: Template
1. Construct possible sets S_1, S_2, …, S_m that satisfy the 2-allele rule (or the weaker 4-allele rule).
2. For each individual x, assign it to the sets S_j it is compatible with.
3. Find a minimum set cover of all the individuals from the sets S_1, S_2, …, S_m. Return the sets in the cover as sibling groups.

Aside: Minimum Set Cover
Given: a universe U = {1, 2, …, n} and a collection of sets S = {S_1, S_2, …, S_m}, where each S_i is a subset of U.
Find: the smallest number of sets in S whose union is the universe U.
Minimum Set Cover is NP-hard. It is (1 + ln n)-approximable, and this bound is sharp.
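One standard way to reach the (1 + ln n) approximation factor is the greedy heuristic: repeatedly pick the set that covers the most still-uncovered elements. The sketch below illustrates that heuristic only. It is not the exact method used in the evaluation, and the function name and the example sets (including the extra set "D") are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(universe: Set[int], sets: Dict[Hashable, Set[int]]) -> List[Hashable]:
    """Return names of sets chosen greedily until every element is covered."""
    uncovered = set(universe)
    chosen: List[Hashable] = []
    while uncovered:
        # Pick the set that covers the largest number of uncovered elements.
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        gained = sets[best] & uncovered
        if not gained:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical candidate groups over individuals 1..8.
candidate_groups = {
    "A": {2, 3, 4, 5},
    "B": {2, 3, 4, 6},
    "C": {1, 7, 8},
    "D": {5, 6},
}
cover = greedy_set_cover(set(range(1, 9)), candidate_groups)
print(cover)
```

The greedy choice gives no optimality guarantee beyond the approximation factor, so an exact solver can return a smaller cover on some inputs.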

Our Algorithm: 2-allele
1. Construct possible sets S_1, S_2, …, S_m that satisfy the 2-allele rule: for each locus independently, create all sets that satisfy a + R ≤ 4, then combine the loci.
2. (All the individuals are already assigned to sets in step 1.)
3. Find a minimum set cover of all the individuals from the sets S_1, S_2, …, S_m. Return the sets in the cover as sibling groups.

Our Algorithm: 4-allele
1. Construct possible sets S_1, S_2, …, S_m that satisfy the 4-allele rule. Such sets must exist, since each pair of individuals forms a valid set. For example:

        loc1    loc2
ind1    1/1     2/3
ind2    1/4     5/6
set(1,2) = {1,4} at loc1, {2,3,5,6} at loc2

2. For each individual x, add it to S_j only if its alleles at each locus are in the set of alleles for that locus in S_j.
3. Find a minimum set cover of all the individuals from the sets S_1, S_2, …, S_m. Return the sets in the cover as sibling groups.
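Steps 1 and 2 above can be sketched as follows: seed one candidate set per pair of individuals, then admit every individual whose alleles fall inside the pair's allele sets at all loci. This is an illustrative reading of the slide, not the authors' implementation. The function name is an assumption, and ind3 and ind4 are hypothetical individuals added to the slide's two-individual example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Tuple

Genotype = Tuple[int, int]        # (allele1, allele2) at one locus
Individual = Sequence[Genotype]   # one genotype per locus


def candidate_sets_4_allele(individuals: Dict[str, Individual]) -> List[FrozenSet[str]]:
    """Build one candidate sibling set per pair of individuals."""
    candidates = set()
    for i, j in combinations(individuals, 2):
        # Allowed alleles per locus: the union of the pair's alleles (at most 4).
        allowed = [set(gi) | set(gj)
                   for gi, gj in zip(individuals[i], individuals[j])]
        # Admit every individual whose alleles lie inside the allowed set at every locus.
        members = frozenset(
            name for name, loci in individuals.items()
            if all(set(g) <= allele_set for g, allele_set in zip(loci, allowed))
        )
        candidates.add(members)
    return sorted(candidates, key=sorted)


population = {
    "ind1": [(1, 1), (2, 3)],   # from the slide
    "ind2": [(1, 4), (5, 6)],   # from the slide
    "ind3": [(1, 4), (2, 6)],   # hypothetical
    "ind4": [(7, 8), (2, 3)],   # hypothetical
}
for group in candidate_sets_4_allele(population):
    print(sorted(group))
```

Because every admitted individual only uses alleles already present in the seeding pair, each candidate set keeps at most 4 alleles per locus. The resulting sets can then be passed to a set cover solver.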

Experimental Protocol:
Create females and males, randomly pair them into couples, and produce offspring, giving each juvenile one randomly chosen allele from each parent at each locus.
The parameter ranges for the study:
Number of adult females F = 10, males M = 10
Number of loci sampled l = 2, 4, 6, 10
Number of alleles per locus a = 2, 5, 10, 20
Number of juveniles as a multiple of the number of females j = 1, 2, 5, 10
Maximum number of offspring per couple o = 2, 5, 10, 30, 50
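A simulation along the lines of this protocol might look like the sketch below. The slides do not say how parental alleles are drawn or how offspring are distributed among couples, so uniform random choices are assumed here. All function and parameter names are hypothetical.

```python
import random
from typing import List, Tuple

Genotype = Tuple[int, int]      # (allele1, allele2) at one locus
Individual = List[Genotype]     # one genotype per locus


def random_adult(num_loci: int, num_alleles: int, rng: random.Random) -> Individual:
    # Assumption: parental alleles are drawn uniformly at random.
    return [(rng.randrange(num_alleles), rng.randrange(num_alleles))
            for _ in range(num_loci)]


def make_offspring(mother: Individual, father: Individual, rng: random.Random) -> Individual:
    # At each locus the juvenile inherits one random allele from each parent.
    return [(rng.choice(m), rng.choice(f)) for m, f in zip(mother, father)]


def simulate(num_females=10, num_males=10, num_loci=4, num_alleles=10,
             juvenile_factor=2, max_offspring=5, seed=0):
    rng = random.Random(seed)
    females = [random_adult(num_loci, num_alleles, rng) for _ in range(num_females)]
    males = [random_adult(num_loci, num_alleles, rng) for _ in range(num_males)]
    rng.shuffle(males)                      # random pairing into couples
    couples = list(zip(females, males))
    counts = [0] * len(couples)
    juveniles = []                          # list of (couple index, genotype)
    target = juvenile_factor * num_females
    while len(juveniles) < target:
        # Assumption: each new juvenile goes to a uniformly chosen couple
        # that has not yet reached the per-couple maximum.
        available = [c for c, k in enumerate(counts) if k < max_offspring]
        if not available:
            break
        c = rng.choice(available)
        mother, father = couples[c]
        juveniles.append((c, make_offspring(mother, father, rng)))
        counts[c] += 1
    return couples, juveniles


couples, juveniles = simulate()
print(len(juveniles), "juveniles from", len(couples), "couples")
```

The couple index stored with each juvenile is the ground-truth sibling group that a reconstruction can later be compared against.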

Algorithm Evaluation:
1. Run the 4-allele algorithm on the simulated juvenile population (using the CPLEX 9.0 MIP solver to solve Min Set Cover optimally).
2. Compare the results to the true, known sibling groups.
3. Evaluate accuracy using a generalization of Gusfield's partition distance (Information Processing Letters, 2002).
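For orientation, the basic partition distance between two partitions of the same elements is the minimum number of elements that must be removed so that the two partitions become identical. The sketch below computes that basic distance by brute force over block matchings, which is only practical for a handful of groups. The generalization used in the evaluation is not described in the slides and is not reproduced here. The function name and the example partitions are hypothetical.

```python
from itertools import permutations
from typing import List, Set


def partition_distance(p: List[Set[int]], q: List[Set[int]]) -> int:
    """Minimum number of elements to delete so that p and q coincide."""
    n = sum(len(block) for block in p)
    k = max(len(p), len(q))
    # Pad the shorter partition with empty blocks so blocks can be matched one-to-one.
    p_pad = list(p) + [set()] * (k - len(p))
    q_pad = list(q) + [set()] * (k - len(q))
    # Try every one-to-one matching of blocks and keep the largest total overlap.
    best_overlap = max(
        sum(len(p_pad[i] & q_pad[perm[i]]) for i in range(k))
        for perm in permutations(range(k))
    )
    return n - best_overlap


true_groups = [{1, 7, 8}, {2, 3, 4, 5}, {6}]
found_groups = [{1, 7, 8}, {2, 3, 4, 5, 6}]
print(partition_distance(true_groups, found_groups))
```

In this example, removing individual 6 makes the two partitions identical, so the distance is 1. For larger inputs the matching step is an assignment problem and would normally be solved with a polynomial-time algorithm instead of enumerating permutations.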

Results
As expected, the error increases as the number of juveniles increases.

Results
Surprisingly, and unlike statistical and likelihood methods, the error does not depend on the number of loci or on allele frequency.

Results
The error decreases as the number of true siblings increases. (When there are few siblings per group, we underestimate the number of sibling groups.)

Conclusions
Ours is a fully combinatorial method. It uses simple Mendelian constraints, with no statistical estimates or a priori knowledge about the data.
Even the very weak 4-allele constraint shows good trends (no dependence on the number of loci sampled or on allele frequency).
Next: evaluate the 2-allele algorithm on simulated and real data, and compare it to other sibship reconstruction algorithms.