A Data Compression Problem The Minimum Informative Subset.



Informativeness-based Tagging SNPs Algorithm

Outline:
- Brief background to SNP selection
- A block-free tag SNP selection algorithm that maximizes ‘informativeness’
  - Halldorsson et al. 2004

What does it mean to tag SNPs?
- SNP = Single Nucleotide Polymorphism
  - Caused by a mutation at a single position in the human genome, passed along through heredity
  - Accounts for much of the genetic difference between humans
  - Most SNPs are bi-allelic
  - An estimated several million common SNPs exist (minor allele frequency > 10%)
- To tag = to select a subset of SNPs to work with

Why do we tag SNPs?
- Disease association studies
  - Goal: find genetic factors correlated with disease
  - Look for discrepancies in haplotype structure
  - Statistical power: determined by sample size
  - Cost: determined by the overall number of SNPs typed
- To keep cost down, reduce the number of SNPs typed
- Choose a subset of SNPs (tag SNPs) that can predict the other SNPs in the region with a small probability of error
  - Remove redundant information

What do we know?
- SNPs physically close to one another tend to be inherited together
  - Long stretches of the genome (absent mutational events) would be perfectly correlated if not for recombination
- Recombination breaks apart haplotypes and slowly erodes the correlation between neighboring alleles
  - Tends to blur the boundaries of LD blocks
- Since SNPs are bi-allelic, each SNP defines a partition of the population sample
  - If this partition can be reconstructed from other SNPs, there is no need to type this SNP
  - For any single SNP, this reconstruction is not difficult…

Complications:
- Finding the globally minimum number of tag SNPs is NP-hard
- The predictions made will not be perfect
  - Correlation between neighboring tag SNPs is not as strong as correlation between neighboring (not necessarily tagged) SNPs
- Haplotype information is usually not available for technical reasons
  - Need for phasing

Tagging SNPs can be partitioned into the following three steps:
- Determining neighborhoods of LD: which SNPs can infer each other
- Tagging quality assessment: defining a quality measure that specifies how well a set of tag SNPs captures the variance observed
- Optimization: minimizing the number of tag SNPs

Haplotype-based tagging SNPs: htSNPs
- Block-based:
  - Define a block as a set of SNPs that are in strong LD with each other, but not with neighboring blocks
  - Requires inferring the exact location of haplotype blocks
    - Recombination occurs between the blocks but not within them
  - Within each block, choose a subset of SNPs sufficiently rich to reconstruct the diversity of the block
- Many algorithms exist for creating blocks… few select the same boundaries!

How do we create Haplotype Blocks?
1. Recombination-based block building algorithm:
   1. Infinite sites assumption (each site mutates at most once)
   2. Assume no recombination within a block
   3. Implies each block should satisfy the four-gamete condition for any pair of sites (see Hudson and Kaplan)
2. Diversity-based test: a region is a block if at least 80% of the sequences occur in more than one chromosome.
   1. The test does not scale well to large sample sizes (see Patil et al. (2001))
   2. To generalize this notion, one could look for sequences within a region that account for 80% of the sampled population and that each occur in at least 10% of the sample.
3. LD-based test:
   1. The D’ value of every pair of SNPs within the block shows significant LD given the individual SNP frequencies, with a P-value of [value missing]

Two SNPs are considered to have a useful level of correlation if they occur in the same haplotype block (i.e. they are physically close with little evidence of recombination).
- The set of SNPs that can be used to predict SNP s can be found by taking the union of all putative haplotype blocks that contain SNP s.
  1. It is possible that many overlapping block decompositions will meet the rules defined by a rule-based algorithm for finding haplotype blocks
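The LD-based test relies on the pairwise D’ statistic. A minimal sketch of the standard normalized |D’| computation follows, assuming phased haplotypes coded as sequences of 0/1 alleles; the function name `d_prime` and the coding are illustrative assumptions, and the significance test itself is not shown.

```python
def d_prime(haplotypes, i, j):
    """Return |D'| between bi-allelic sites i and j.

    Assumption: each haplotype is a sequence of 0/1 integers.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n           # frequency of allele 1 at site i
    p_b = sum(h[j] for h in haplotypes) / n           # frequency of allele 1 at site j
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                              # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0                                    # a monomorphic site carries no LD signal
    return abs(d) / d_max
```

Whenever only three of the four gametes are observed at a pair of polymorphic sites, this function returns 1, which is why D’ is a natural companion to the four-gamete condition.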

Methods for inferring haplotype blocks

Hypothesis: Haplotype Blocks?
- The genome consists largely of blocks of common SNPs with relatively little recombination shuffling within the blocks
  - Patil et al., Science, 2001; Jeffreys et al., Nature Genetics; Daly et al., Nature Genetics, 2001
- Compare block detection methods.
  - How well can we detect haplotype blocks?
  - Are the detection methods consistent?

Block detection methods
- Four gamete test, Hudson and Kaplan, Genetics, 1985, 111,
  - A segment of SNPs is a block if between every pair of SNPs (with alleles a/A and b/B) at most 3 of the 4 gametes (ab, aB, Ab, AB) are observed.
- P-value test
  - A segment of SNPs is a block if for 95% of the pairs of SNPs we can reject the hypothesis (with P-value 0.05 or 0.001) that they are in linkage equilibrium.
- LD-based, Gabriel et al., Science, 2002, 296:
  - Next slide

Gabriel et al. method
- For every pair of SNPs we calculate an upper and lower confidence bound on D’ (call these D’u and D’l)
- We then split the pairs of SNPs into 3 classes:
  - Class I: two SNPs are in ‘strong LD’ if D’u > 0.98 and D’l > 0.7.
  - Class II: two SNPs show ‘strong evidence for recombination’ if D’u < 0.9.
  - Class III: the remaining SNP pairs; these are “uninformative”.
- A contiguous set of SNPs is a block if (Class II)/(Class I + Class II) < 5%.
- Special rules determine whether 2, 3 or 4 SNPs are a block.
- Furthermore, there are distance requirements on the chromosome to determine whether the SNPs are a block.
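The three classes and the 5% rule can be sketched as follows. This is a simplified illustration, assuming the confidence bounds on D’ have already been computed elsewhere; the names `classify_pair` and `is_block` are hypothetical, and the special rules for 2 to 4 SNPs and the distance requirements are deliberately left out.

```python
def classify_pair(d_upper, d_lower):
    """Assign a SNP pair to one of the three classes from its D' confidence bounds."""
    if d_upper > 0.98 and d_lower > 0.7:
        return "strong_ld"          # Class I
    if d_upper < 0.9:
        return "recombination"      # Class II
    return "uninformative"          # Class III


def is_block(pair_bounds, max_recomb_fraction=0.05):
    """Apply the (Class II)/(Class I + Class II) < 5% rule.

    pair_bounds: iterable of (d_upper, d_lower) for every SNP pair in the segment.
    """
    classes = [classify_pair(u, l) for u, l in pair_bounds]
    strong = classes.count("strong_ld")
    recomb = classes.count("recombination")
    informative = strong + recomb
    if informative == 0:
        # Assumption: with no informative pairs there is no evidence for a block.
        return False
    return recomb / informative < max_recomb_fraction
```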

One definition of block
- Based on the Four Gamete test.
- Intuition: when all four gametes are present between two SNPs, there is a recombination point somewhere in between the two sites

Four Gamete Block Test
- Hudson and Kaplan 1985
- A segment of SNPs is a block if between every pair of SNPs at most 3 out of the 4 gametes (00, 01, 10, 11) are observed
[Figure: two example segments, one labeled BLOCK and one labeled VIOLATES THE BLOCK DEFINITION]
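The block definition above translates almost directly into code. A minimal sketch, assuming phased haplotypes given as equal-length strings over two alleles per site; the function names are illustrative.

```python
from itertools import combinations


def violates_four_gamete(haplotypes, i, j):
    """True if all four gametes are observed at the pair of sites (i, j)."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4


def is_four_gamete_block(haplotypes, start, end):
    """True if no pair of sites in [start, end) shows all four gametes."""
    return not any(
        violates_four_gamete(haplotypes, i, j)
        for i, j in combinations(range(start, end), 2)
    )
```

For example, the haplotypes `000`, `011` and `110` form a block over all three sites, while adding `101` introduces pairs of sites at which all four gametes appear.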

Finding Recombination Hotspots: Many Possible Partitions into Blocks
[Figure: aligned haplotype sequences; caption: All four gametes are present]

[Figure: the same aligned haplotype sequences]
- Find the left-most right endpoint of any constraint and mark the site before it as a recombination site.
- Eliminate any constraints crossing that site.
- Repeat until all constraints are gone.
- The final result is a minimum-size set of sites crossing all constraints.
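The greedy procedure described here can be sketched as follows. The representation is an assumption made for illustration: each constraint is a pair of site indices `(left, right)` at which all four gametes were observed, and a breakpoint `b` denotes the boundary just before site `b`, so it crosses a constraint when `left < b <= right`.

```python
def min_recombination_breakpoints(constraints):
    """Greedily choose a minimum set of breakpoints crossing every constraint.

    constraints: iterable of (left, right) site-index pairs with left < right.
    Returns the chosen breakpoints in increasing order.
    """
    breakpoints = []
    # Processing constraints by right endpoint always exposes the
    # left-most right endpoint among the constraints not yet crossed.
    for left, right in sorted(constraints, key=lambda c: c[1]):
        if breakpoints and left < breakpoints[-1] <= right:
            continue                  # already crossed by the latest breakpoint
        breakpoints.append(right)     # boundary just before site `right`
    return breakpoints
```

Checking only the most recent breakpoint is sufficient: because constraints arrive in order of right endpoint, any earlier breakpoint that crosses a constraint implies that the latest one crosses it too.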

Tagging SNPs
Haplotypes:        Tag SNPs only:
ACGATCGATCATGAT    A------A---TG--
GGTGATTGCATCGAT    G------G---CG--
ACGATCGGGCTTCCG    A------G---TC--
ACGATCGGCATCCCG    A------G---CC--
GGTGATTATCATGAT    G------A---TG--
- An example of a real data set and its haplotype block structure. Colors refer to the founding population, one color for each founding haplotype
- Only 4 SNPs are needed to tag all the different haplotypes
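The claim that the four marked positions tag every haplotype can be checked directly on the five sequences shown. This is only a verification sketch: the zero-based indices in `tag_sites` are read off the dashed patterns above, and the helper names are illustrative.

```python
haplotypes = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
tag_sites = [0, 7, 11, 12]  # zero-based positions of the non-dash columns


def tag_signature(hap, sites):
    """Alleles of one haplotype restricted to the tag sites."""
    return "".join(hap[s] for s in sites)


def tags_distinguish_all(haps, sites):
    """True if distinct haplotypes always have distinct tag signatures."""
    signatures = {tag_signature(h, sites) for h in haps}
    return len(signatures) == len(set(haps))
```

With these tag sites the signatures are AATG, GGCG, AGTC, AGCC and GATG, all different, so each full haplotype can be recovered from its four tag alleles.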

Optimal Haplotype Block-Free Selection of Tagging SNPs for Genome-Wide Association Studies Halldorsson, Bafna, Lippert, Schwartz, Clark, Istrail (2004)

Finding Neighborhoods:
- The goal is to select SNPs in the sample that characterize regions of common recent ancestry, which will contain conserved haplotypes
- Recent common ancestry means that there has been little time for recombination to break apart haplotypes
- Constructing fixed-size neighborhoods in which to look for SNPs is not desirable because of the variability of recombination rates and historical LD across the genome
- In fact, the size of informative neighborhoods is highly variable precisely because of variable recombination rates and SNP density
- The authors avoid block-building by recursively creating neighborhoods with the help of the ‘informativeness’ measure

Defining Informativeness: A measure of tagging quality assessment
- Assume all SNPs are bi-allelic
- Notation: I(s,t) = informativeness of a SNP s with respect to a SNP t
  - i, j are two haplotypes drawn at random from the uniform distribution on the set of distinct haplotype pairs.
  - Note: I(s,t) = 1 implies complete predictability; I(s,t) = 0 when t is monomorphic in the population.
- I(s,t) is easily estimated through the bipartite clique that defines each SNP
  - We can write I(s,t) in terms of an edge set
- The definition of I is easily extended to a set of SNPs S by taking the union of edge sets
- Assumes the availability of haplotype phases
- The new measure avoids some of the difficulties traditional LD measures have experienced when applied to tagging SNP selection
  - The concept of pairwise LD fails to reliably capture the higher-order dependencies implied by haplotype structure
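The formula itself did not survive in this transcript, so the following is a reconstruction rather than a quotation. It assumes that the edge set of a SNP is the set of haplotype pairs carrying different alleles at that SNP, and that I(S,t) is the fraction of t's edges covered by the union of the edge sets of the SNPs in S. This reading is consistent with the statements above (union of edge sets, value 0 for a monomorphic t) and with the fractions shown on later slides. The haplotypes in the usage note are toy data, not the slide's figure.

```python
from itertools import combinations


def edge_set(haplotypes, site):
    """Pairs of haplotype indices whose alleles differ at `site`."""
    return {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if haplotypes[i][site] != haplotypes[j][site]
    }


def informativeness(haplotypes, tag_sites, target):
    """Assumed form of I(S, t): covered fraction of the target's edge set."""
    target_edges = edge_set(haplotypes, target)
    if not target_edges:
        return 0.0                      # t is monomorphic in the sample
    covered = set()
    for s in tag_sites:
        covered |= edge_set(haplotypes, s)   # union of edge sets for a SNP set
    return len(covered & target_edges) / len(target_edges)
```

On the toy haplotypes `["000", "001", "011", "110"]`, site 0 alone covers 2 of the 4 edges of site 1, and sites 0 and 2 together cover 3 of them.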

Bounded-Width Algorithm: k Most Informative SNPs (k-MIS)
- Input: a set of n SNPs S
- Output: a subset S’ of k SNPs such that I(S’,S) is maximal
- In its most general form, k-MIS is NP-hard by reduction of the set cover problem to MIS
- The algorithm optimizes informativeness, although it is easily adapted to other measures
- Define the distance between two SNPs as the number of SNPs in between them
- k-MIS can be solved as long as the distance between adjacent tag SNPs is not too large
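To make the k-MIS objective concrete, here is an exhaustive-search baseline. It is not the bounded-width algorithm of the talk: it enumerates every k-subset and is only usable on tiny inputs. It assumes that informativeness is the covered fraction of each target's edge set and that the score of a tag set against a set of targets is the sum of the per-target values; both are assumptions, and all function names are illustrative.

```python
from itertools import combinations


def edge_set(haps, site):
    """Pairs of haplotype indices whose alleles differ at `site`."""
    return {(i, j) for i, j in combinations(range(len(haps)), 2)
            if haps[i][site] != haps[j][site]}


def set_informativeness(haps, tags, targets):
    """Assumed I(S', S): sum over targets of the covered edge fraction."""
    covered = set().union(*(edge_set(haps, s) for s in tags))
    total = 0.0
    for t in targets:
        target_edges = edge_set(haps, t)
        if target_edges:                       # monomorphic targets contribute 0
            total += len(covered & target_edges) / len(target_edges)
    return total


def brute_force_k_mis(haps, k):
    """Try every k-subset of sites; exponential in k, for illustration only."""
    n_sites = len(haps[0])
    best_tags, best_score = (), -1.0
    for tags in combinations(range(n_sites), k):
        score = set_informativeness(haps, tags, range(n_sites))
        if score > best_score:
            best_tags, best_score = tags, score
    return best_tags, best_score
```

On the toy haplotypes `["000", "001", "011", "110"]` with k = 2, the search selects sites 1 and 2, which together separate every pair of haplotypes.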

Define  Assignment A s [i]  S(A s )  Recursion function I w (s,l, S(A)) = score of the most informative subset of l SNPs chosen from SNPs 1 through s such that A s described the assignment for SNP s. Pseudocode Complexity: O(nk2 w ) in time and O(k2 w ) in space, assuming maximal window w

Evaluation
- Algorithm evaluated by leave-one-out cross-validation
  - Accuracy accumulated over all haplotypes gives a global measure of the accuracy for the given data set.
- SNPs not typed were predicted by a majority vote among all haplotypes in the training set that were identical (at the typed SNPs) to the one being inferred
  - If no such haplotypes existed, the majority vote is taken among all training haplotypes that have the same allele call on all but one of the typed SNPs
  - etc.
- When compared to the block-based method of Zhang:
  - Presumably, the advantage is due to the cost imposed by artificially restricting the range of influence of the few SNPs chosen by block boundaries
- ‘Informativeness’ was shown to be a “good” measure
  - aligned well with the leave-one-out cross-validation results
  - extremely close to the results of optimizing for haplotype r^2
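The leave-one-out procedure with progressively relaxed majority voting might look like the sketch below. It assumes phased haplotypes given as equal-length strings; the function names are illustrative, and ties in the vote are broken by first occurrence in the training set, a detail the slides do not specify.

```python
from collections import Counter


def impute(training, typed_values, typed_sites, target_site):
    """Predict the allele at target_site by majority vote.

    Voters are the training haplotypes matching the typed alleles exactly;
    if there are none, one mismatch is allowed, then two, and so on.
    """
    for allowed_mismatches in range(len(typed_sites) + 1):
        votes = Counter(
            hap[target_site]
            for hap in training
            if sum(hap[s] != v for s, v in zip(typed_sites, typed_values))
            <= allowed_mismatches
        )
        if votes:
            return votes.most_common(1)[0][0]
    return None  # empty training set


def leave_one_out_accuracy(haps, typed_sites):
    """Fraction of untyped alleles predicted correctly, holding out each haplotype in turn."""
    untyped = [s for s in range(len(haps[0])) if s not in typed_sites]
    correct = total = 0
    for idx, held_out in enumerate(haps):
        training = haps[:idx] + haps[idx + 1:]
        typed_values = [held_out[s] for s in typed_sites]
        for t in untyped:
            total += 1
            correct += impute(training, typed_values, typed_sites, t) == held_out[t]
    return correct / total if total else 1.0
```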

A Data Compression Problem
- Select SNPs to use in an association study
  - Would like to associate single nucleotide polymorphisms (SNPs) with disease.
- Very large number of candidate SNPs
  - Chromosome-wide studies, whole-genome scans
  - For cost effectiveness, select only a subset.
- Closely spaced SNPs are highly correlated
  - It is less likely that there has been a recombination between two SNPs if they are close to each other.

Association studies
[Figure: samples grouped as Disease/Responder and Control/Non-responder, marked by Allele 0 or Allele 1 at Marker A; caption: Marker A is associated with Phenotype]

Association studies
- Evaluate whether nucleotide polymorphisms associate with phenotype
[Figure: example genotype sequences for the study samples]

SNP-Selection Axiom: Hypothesis-free associations
- Due to the many unknowns regarding the nature of common or complex disease, we should aim at SNP selection that confers maximal resolution power, i.e., genome-wide SNP scans that allow hypothesis-free disease association studies, as opposed to hypothesis-driven candidate gene or region studies.

A New Measure: Informativeness

SNP-Selection Axiom: Multi-allelic measure
- The tagging quality of the selected SNPs should be described by a multi-allelic measure; sets of SNPs have combined information about predicting other SNPs

SNP-Selection Axioms: LD consistency and Block-freeness
- The highly concordant results of the block detection methods make the interior of LD blocks adequate for sparse SNP selection.
- However, block boundaries defined by these methods are not sharp, with no single “true” block partition.
- SNP selection should avoid dependence on particular definitions of “haplotype block.”

A New SNP Selection Measure: Informativeness
It satisfies the following six axioms:
1. Multi-allelic measure
2. LD consistency: compares well with measures of LD
3. Block-freeness: independence from any particular block definition
4. Hypothesis-free associations: optimization achieves maximum haplotype resolution
5. Algorithmically sound: practical for genome-wide computations
6. Statistically sound: passes overfitting and imputation tests

Informativeness
[Figure: SNP s and two haplotypes h1 and h2]

Informativeness
- I(s1, s2) = 2/4 = 1/2
[Figure: SNPs s1 to s5]

Informativeness
- I({s1, s2}, s4) = 3/4
[Figure: SNPs s1 to s5]

Informativeness
- I({s3, s4}, {s1, s2, s5}) = 3
- S = {s3, s4} is a Minimal Informative Subset
[Figure: SNPs s1 to s5]

Informativeness: Graph theory insight
- Minimum Set Cover = Minimum Informative Subset
[Figure: bipartite graph linking SNPs s1 to s5 with edges e1 to e6]

Informativeness: Graph theory insight
- Minimum Set Cover {s3, s4} = Minimum Informative Subset
[Figure: bipartite graph linking SNPs s1 to s5 with edges e1 to e6]
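The set-cover view can be illustrated with the classic greedy approximation: treat every pair of haplotypes that differ somewhere as an element to cover, and every SNP as the set of pairs it separates. This sketch is a stand-in for illustration only; the greedy rule gives no guarantee of a minimum cover in general, the toy data in the usage note is not the graph from the slide, and the function names are hypothetical.

```python
from itertools import combinations


def edge_set(haps, site):
    """Pairs of haplotype indices whose alleles differ at `site`."""
    return {(i, j) for i, j in combinations(range(len(haps)), 2)
            if haps[i][site] != haps[j][site]}


def greedy_informative_subset(haps):
    """Greedy set cover: repeatedly pick the SNP separating the most uncovered pairs."""
    n_sites = len(haps[0])
    edges = {s: edge_set(haps, s) for s in range(n_sites)}
    uncovered = set().union(*edges.values())
    chosen = []
    while uncovered:
        best = max(range(n_sites), key=lambda s: len(edges[s] & uncovered))
        chosen.append(best)
        uncovered -= edges[best]
    return sorted(chosen)
```

On the toy haplotypes `["000", "001", "011", "110"]` the greedy choice is sites 1 and 2, which cover every distinguishable pair.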

Connecting Informativeness with Measures of LD

The Minimum Informative SNPs in a Block of Complete LD

(k,w)-MIS Problem

(k,w)-MIS: O(nk2^w) solution
[Figure: assignments A_s extended with 0 or 1, undetermined entries marked “?”, and the optimum marked Opt]

Validation Tests on Publicly-Accessible Data
- We performed tests using two publicly available datasets:
  - LPL dataset of Nickerson et al. (2000): 142 chromosomes typed at 88 SNPs
  - Chromosome 21 dataset of Patil et al. (2001): 20 chromosomes typed at 24,047 SNPs
- We also performed tests on an AB dataset
  - Most of Chromosome [number missing]: [count missing] chromosomes typed at 4102 SNPs

[Figure: a region of Chr [number missing], Caucasian samples, comparing two different runs of the Gabriel et al. block detection method + Zhang et al. SNP selection algorithm with our block-free algorithm]

Block-free tagging: Minimum informative SNPs
- Perlegen data set, Chromosome 21: 20 individuals, [count missing] SNPs
[Figure: informativeness versus number of SNPs for the block-free method and the block method]

Block-free tagging: Minimum informative SNPs
- Lipoprotein Lipase gene, 71 individuals, 88 SNPs
[Figure: information fraction versus #SNPs; series labeled Block-free 21, Block-free 15, With blocks]

Correct imputation: block vs. block-free
- Perlegen dataset
[Figure: # correct imputations versus #SNPs typed for Zhang et al. and the block-free method]

Correlations of informativeness with imputation in leave-one-out studies
- Perlegen dataset, block-free method
[Figure: informativeness and leave-one-out results versus #SNPs]

Conclusions

- Existing LD-based measures are not adequate for SNP subset selection, and do not extend easily to multiple SNPs
- The Informativeness measure for SNPs is block-free, and extends easily to multiple SNPs.
- Practically feasible algorithms exist for computing minimum informative SNP subsets in genome-wide studies
- We are able to show that by typing only 20-30% of the SNPs, we retain 90% of the informativeness.