Clustering and optimization in genetic data: the problem of Tag-SNPs selection
Paola Bertolazzi, Serena D'Aguanno, Giovanni Felici, Paola Festa


Clustering and optimization in genetic data: the problem of Tag-SNPs selection
Paola Bertolazzi, Serena D'Aguanno, Giovanni Felici*, Paola Festa**
* Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti", CNR
** Dipartimento di Matematica e Applicazioni "R.M. Caccioppoli", Università degli Studi di Napoli "Federico II"

Summary
Biological background
– DNA
– Chromosomes
– Haplotypes and Genotypes
– SNPs
Haplotype analysis
Tag SNPs selection
– Problem definition
– State of the art
– Reconstruction function and linkage disequilibrium
– Clustering techniques
– Set covering techniques
– Computational results
– Conclusions and future work

DNA Structure
Double helix (Watson-Crick) of two sequences of nucleotides A, T, C, G
Base pairs (A-T, G-C) are complementary
One DNA sequence contains regions (e.g. genes, introns, exons) located at the same position of the sequence in each individual of a species

Chromosomes
The genome of an individual is organized in chromosomes, i.e. large DNA macromolecules packaged in linear or circular shape
In polyploid organisms multiple copies of each chromosome exist
In diploid organisms (e.g. humans) there are two copies of each chromosome, packaged in linear shape
Each chromosome includes hundreds of different genes
Four-arm structure during meiosis and mitosis

Haplotypes and genotypes
A single 'copy' of a chromosome is called a haplotype, while a description of the mixed data on the two 'copies' is called a genotype.
H1 AATCGCCTTA (maternal chrom.)
H2 ACACGTCTCA (paternal chrom.)
G(H1,H2) A A/C T/A C G C/T C T T/C A
For disease association studies, haplotype data is more valuable than genotype data.
Haplotype data is hard to collect; genotype data is easy to collect.
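The relation between two haplotypes and the genotype they produce can be sketched in a few lines of Python. This is only an illustration: the function name `genotype` and the `X/Y` notation for heterozygous sites are assumptions chosen to mirror the slide's example, not part of the original presentation.

```python
def genotype(h1: str, h2: str) -> list[str]:
    """Combine two haplotypes into an unphased genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # A site is homozygous when both copies agree, heterozygous otherwise.
    return [a if a == b else f"{a}/{b}" for a, b in zip(h1, h2)]


h1 = "AATCGCCTTA"  # maternal chromosome (from the slide)
h2 = "ACACGTCTCA"  # paternal chromosome (from the slide)
print(" ".join(genotype(h1, h2)))  # A A/C T/A C G C/T C T T/C A
```

The genotype records which values occur at each site but not which chromosome carries them, which is why haplotypes cannot be read off directly from genotype data.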

SNPs
All humans are 99.99% identical. Diversity? Polymorphism.
A SNP is a Single Nucleotide Polymorphism: a site in the genome where two different nucleotides appear with sufficient frequency in the population (say, each with 5% frequency or more).
[Figure: aligned DNA sequences from several individuals, identical except at the SNP sites]
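The 5% criterion in the SNP definition can be made concrete with a small sketch. The function name `snp_sites`, the representation of sequences as equal-length strings, and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def snp_sites(sequences: list[str], min_freq: float = 0.05) -> list[int]:
    """Return 0-based positions where at least two nucleotides each reach min_freq."""
    n = len(sequences)
    sites = []
    for pos in range(len(sequences[0])):
        counts = Counter(seq[pos] for seq in sequences)
        # Nucleotides that are frequent enough at this position.
        common = [nt for nt, c in counts.items() if c / n >= min_freq]
        if len(common) >= 2:
            sites.append(pos)
    return sites


# Hypothetical toy sample of four aligned sequences.
sample = ["AATCG", "AATCG", "ACTCG", "ACTTG"]
print(snp_sites(sample))        # [1, 3]
print(snp_sites(sample, 0.3))   # [1]
```

Raising the threshold removes position 3, where the rarer nucleotide appears in only one of the four sequences.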

Haplotype analysis 1/2
Haplotype analysis focuses on haplotypes and genotypes that are sequences of SNPs.
[Figure: the SNP columns extracted from the aligned sequences]

Haplotype analysis 2/2
To reduce prohibitively expensive haplotyping costs, a two-stage methodology has been proposed [1].
Pilot study
– All SNPs of interest are genotyped in a small sample of the population
– Common haplotypes are inferred using statistical methods
– A set of tag SNPs is selected for the population study
Population study
– Tag SNPs are genotyped in the remaining population
– Statistical methods are used to infer haplotypes over the tag SNPs
– Haplotypes over the tag SNPs are extrapolated to full haplotypes
Two problems:
– Find a set of tag SNPs of minimum cardinality
– Find a reconstruction function

Tag SNPs Selection: methods and models
1. Methods that find a minimum set of clusters of SNPs in high correlation (e.g. linkage disequilibrium) with each other; the clusters are called blocks. SNP prediction should be easier within a block.
2. Methods that, given the block structure (based on correlation or on proximity), find a minimum set of SNPs able to distinguish each pair of haplotypes in a block; or that assume the number of tag SNPs is given and find a set of tags that can reconstruct the haplotype of an unknown sample with high accuracy.

Tag SNPs Selection: Problem definition
Given a population of N haplotypes over M SNPs, find a small set of SNPs (tag SNPs) such that the values of all the other SNPs can be derived, with some reconstruction rule, from the values of the tag SNPs.
Two aspects:
(1) Find a reconstruction function
(2) Find a set of minimum cardinality that can reconstruct the other SNPs using (1)
And also:
(3) Given (1) and (2), is there a proper way to identify blocks?

Tag SNPs Selection: Problem definition
The approaches
Use a reconstruction function based on SNP similarity.
Method 1: cluster the SNPs according to a proper metric; select the centroid of each cluster as a TAG SNP.
Method 2: select a subset of SNPs able to differentiate each pair of haplotypes (Set Covering formulation).
Both methods are coherent with the adopted reconstruction function.
The reconstruction performance can be used to derive the blocks ex post.

The reconstruction function: the "Majority Vote"
1. Given:
– the set of TAG SNPs
– a training set T of haplotypes for which we know the value of all the SNPs
– a new haplotype H for which we know only the value of the TAG SNPs
2. Let S be the set of haplotypes in T that have the same values as H on the TAG SNPs
3. For each non-TAG SNP, determine its most frequent value in S and use it as the prediction of the value of that SNP in H
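The three steps of the majority vote can be sketched as follows. This is a minimal illustration, not the authors' implementation: haplotypes are assumed to be equal-length strings over a 0/1 alphabet, the names `majority_vote`, `training`, `tags` and `partial` are hypothetical, and returning "?" when S is empty is an added assumption (the slide does not say what happens in that case).

```python
from collections import Counter


def majority_vote(training, tags, partial):
    """Reconstruct a haplotype from the values of its TAG SNPs.

    training: fully known haplotypes (the set T), as equal-length strings
    tags:     0-based positions of the TAG SNPs
    partial:  {tag position: observed value} for the new haplotype H
    """
    # Step 2: S = training haplotypes that agree with H on every TAG SNP.
    s = [h for h in training if all(h[t] == partial[t] for t in tags)]
    result = []
    for pos in range(len(training[0])):
        if pos in partial:
            result.append(partial[pos])  # TAG SNP: value is observed
        elif s:
            # Step 3: most frequent value of this SNP among the haplotypes in S.
            result.append(Counter(h[pos] for h in s).most_common(1)[0][0])
        else:
            result.append("?")  # assumption: no matching haplotype, undetermined
    return "".join(result)


# Hypothetical toy training set with a single TAG SNP at position 0.
T = ["0010", "0011", "1100", "0010"]
print(majority_vote(T, [0], {0: "0"}))  # 0010
print(majority_vote(T, [0], {0: "1"}))  # 1100
```

In the first call three training haplotypes match on the TAG SNP, and the last position is predicted as "0" because two of the three carry that value.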

The reconstruction function
The majority vote rule is based on the assumption that the TAG SNPs characterize the haplotype almost completely: if two haplotypes are equal on the TAG SNPs, then they are also equal on the other SNPs.

Method 1: SNPs Clustering
Clustering: find groups of elements with high dissimilarity between groups and small dissimilarity within each group, w.r.t. a chosen distance function.
Main assumption: TAG SNPs are those that are very similar to many other SNPs in the training data.
– Cluster the SNPs in the haplotype space using the Hamming distance (HD) with the k-means algorithm, for a proper value of k
– Select k TAG SNPs as those closest to the HD-centroid of each cluster
– Use the TAG SNPs to reconstruct the non-TAG SNPs of new haplotypes using the majority rule
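A compact sketch of the clustering step is given below. It is one possible reading of the method, under stated assumptions: each SNP is represented by its column of values across the training haplotypes, centroids are updated as per-coordinate modes (a k-modes style variant of k-means suited to the Hamming distance), and the starting centroids are drawn from the distinct SNP columns so that no two start identical. All function names are hypothetical.

```python
import random
from collections import Counter


def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def assign(snps, centroids):
    """Put each SNP index into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for j, s in enumerate(snps):
        nearest = min(range(len(centroids)), key=lambda c: hamming(s, centroids[c]))
        clusters[nearest].append(j)
    return clusters


def select_tags_by_clustering(haplotypes, k, max_iter=50, seed=0):
    """Cluster SNP columns and return one TAG SNP index per non-empty cluster."""
    n = len(haplotypes)
    # Each SNP is described by its values across all haplotypes.
    snps = [tuple(h[j] for h in haplotypes) for j in range(len(haplotypes[0]))]
    distinct = sorted(set(snps))
    if k > len(distinct):
        raise ValueError("k exceeds the number of distinct SNP columns")
    centroids = random.Random(seed).sample(distinct, k)  # random start
    for _ in range(max_iter):
        clusters = assign(snps, centroids)
        updated = []
        for c, members in enumerate(clusters):
            if not members:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
            else:
                # Modal centroid: most frequent value in each coordinate.
                updated.append(tuple(
                    Counter(snps[j][i] for j in members).most_common(1)[0][0]
                    for i in range(n)))
        if updated == centroids:
            break
        centroids = updated
    clusters = assign(snps, centroids)
    # The TAG SNP of a cluster is the member closest to its centroid.
    return sorted(min(m, key=lambda j: hamming(snps[j], centroids[c]))
                  for c, m in enumerate(clusters) if m)


# Hypothetical toy data: SNPs 0-2 are identical columns, and so are SNPs 3-5.
haps = ["000000", "000111", "111000", "111111"]
print(select_tags_by_clustering(haps, 2))  # [0, 3]
```

In the toy data one representative per group of identical SNPs is enough, so two TAG SNPs recover the remaining four exactly.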

Method 2: Set Covering Model
The "classical" model: find a minimal subset of TAG SNPs such that each pair of haplotypes in the training set differs in the value of at least one TAG SNP.
– Select the SNPs associated with x_i = 1 in the solution of the SC problem
– Use the TAG SNPs to reconstruct the non-TAG SNPs of new haplotypes using the majority rule
The above problem cannot be solved optimally for realistic sizes.
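Since the exact set covering problem is reported as intractable at realistic sizes, a simple way to get a feel for the model is the textbook greedy heuristic for set cover, sketched below. This is explicitly not the authors' method (they describe a reduced-cost heuristic solved with CPLEX); the function name `greedy_tag_cover` and the toy data are assumptions.

```python
from itertools import combinations


def greedy_tag_cover(haplotypes):
    """Greedy set cover: pick SNPs until every pair of distinct haplotypes is separated."""
    n_snps = len(haplotypes[0])
    # Pairs to separate; identical haplotypes cannot be separated and are skipped.
    uncovered = {(p, q) for p, q in combinations(range(len(haplotypes)), 2)
                 if haplotypes[p] != haplotypes[q]}
    # separates[i] = pairs of haplotypes that differ at SNP i.
    separates = [{(p, q) for (p, q) in uncovered
                  if haplotypes[p][i] != haplotypes[q][i]}
                 for i in range(n_snps)]
    tags = []
    while uncovered:
        # Choose the SNP that separates the most still-unseparated pairs.
        best = max(range(n_snps), key=lambda i: len(separates[i] & uncovered))
        tags.append(best)
        uncovered -= separates[best]
    return sorted(tags)


# Hypothetical toy training set: two SNPs suffice to tell all four haplotypes apart.
haps = ["000", "011", "101", "110"]
print(greedy_tag_cover(haps))  # [0, 1]
```

The greedy choice gives a feasible set of TAG SNPs quickly but carries no guarantee of minimality, which is the quantity the set covering model optimizes.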

Variants of the Set Covering Model
The SC problem has a number of constraints quadratic in the number of haplotypes.
We use variations of the SC model (SCV) that make it possible to control the number of TAGs and their quality more effectively, with an iterative heuristic based on reduced costs:
– Minimize the number of TAGs for a given level […] of differentiation between haplotypes
– Maximize the capacity to differentiate between haplotypes for a given number […] of TAG SNPs
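The formulas of this slide did not survive in the transcript, so the following is only one plausible way to write the models down. All symbols are assumptions: x_i is 1 when SNP i is selected as a TAG, a_i^{pq} is 1 when haplotypes p and q differ at SNP i, and alpha and beta stand for the required level of differentiation and the allowed number of TAGs.

```latex
% Classical set covering model: one constraint per pair of haplotypes,
% hence N(N-1)/2 constraints for N haplotypes.
\begin{align*}
\min \quad & \sum_{i=1}^{M} x_i \\
\text{s.t.} \quad & \sum_{i=1}^{M} a_i^{pq}\, x_i \ge 1 && \forall\, 1 \le p < q \le N \\
& x_i \in \{0,1\} && i = 1,\dots,M
\end{align*}

% Variant 1 (assumed reading): minimum number of TAGs for a given level alpha.
\begin{align*}
\min \quad & \sum_{i=1}^{M} x_i
\quad \text{s.t.} \quad \sum_{i=1}^{M} a_i^{pq}\, x_i \ge \alpha \quad \forall\, p < q,
\qquad x_i \in \{0,1\}
\end{align*}

% Variant 2 (assumed reading): best level of differentiation with at most beta TAGs.
\begin{align*}
\max \quad & \alpha
\quad \text{s.t.} \quad \sum_{i=1}^{M} a_i^{pq}\, x_i \ge \alpha \quad \forall\, p < q,
\qquad \sum_{i=1}^{M} x_i \le \beta,
\qquad x_i \in \{0,1\}
\end{align*}
```

Under this reading, the classical model is the special case of Variant 1 with alpha equal to 1.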

Some Remarks
A good estimate of the number of TAG SNPs to be used in the model can be found efficiently by measuring the quality of the clusters for different values of […].
The quality of the two methods (Clustering and Set Covering) can be compared directly using TAG SNP sets of the same size.
SC is still not tractable if all SNPs are used (most of the literature uses only the first SNPs):
– Start with the centroids of the clustering
– Add columns by pricing until the LP is optimal
– Add columns using a metric on SNPs until the objective function increases
– Solve the IP

Computational results
Data from the International HapMap Project on chromosome 21 of the human genome.
YRI: Yoruba in Ibadan, Nigeria
JPT: Japanese in Tokyo, Japan
CHB: Han Chinese in Beijing, China
CEU: Utah residents with Northern and Western European ancestry
[Table: # haplotypes and # SNPs for YRI, JPT+CHB and CEU; values missing]

Computational results: Experiments Setting
a) Limited to the first block of 1500 SNPs (as in related literature), or
b) Using all SNPs ([…])
c) Clustering with standard HD, with modal centroids and random starting centroids
d) SCR with fixed […], using iterative heuristics based on reduced costs, solved with CPLEX
e) Reconstruction with the majority rule
f) Quality of reconstruction: if the SNP value is coherent in more than 70% of the matching haplotypes (set S), then predict; else declare undetermined
g) 2/3 of the haplotypes used for training, 1/3 for testing
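The 70% coherence rule of item f) can be sketched for a single SNP as follows. The function name `predict_site`, the use of `None` for "undetermined", and the treatment of an empty set S as undetermined are assumptions made for illustration.

```python
from collections import Counter


def predict_site(values, threshold=0.7):
    """Predict one SNP from its values in the matching haplotypes (set S).

    Returns the majority value if its share exceeds the threshold,
    otherwise None, meaning the SNP is declared undetermined.
    """
    if not values:
        return None  # assumption: no matching haplotype, nothing to predict
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) > threshold else None


print(predict_site(["0", "0", "0", "1"]))  # 0     (75% agreement)
print(predict_site(["0", "0", "1", "1"]))  # None  (50% agreement)
```

Declaring a SNP undetermined instead of guessing trades coverage for accuracy: fewer SNPs receive a prediction, but those that do are backed by a clear majority in S.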

Computational results

Observations
Reconstruction error in the range of 20% of the SNPs, improving on previous results (where comparable)
1. The SCV method performs better than clustering, especially when all SNPs are used
2. Best results are obtained with approx. 30 TAG SNPs. Larger values do not reduce the reconstruction error and slow down the computation
3. First time so many SNPs are treated simultaneously
4. Completely correct SNPs are in the range 10-20%
With […] 30 TAGs we can correctly reconstruct […] 6000 SNPs…

Computational results: Work in Progress
Use the proposed method to identify the blocks:
– Use all SNPs on the training set
– Apply SCV to select […] TAG SNPs
– Apply the majority rule to the test set and select those SNPs that are predicted correctly over the whole test set
– Create one block with these SNPs, associate them with the TAG set, and remove these SNPs from the samples
– Iterate until the sample contains only TAG SNPs or no improvement is obtained
Preliminary results are encouraging.
Larger data sets are needed in order to test the method properly.