Clustering and optimization in genetic data: the problem of Tag-SNPs selection
Paola Bertolazzi, Serena D'Aguanno, Giovanni Felici*, Paola Festa**
* Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti", CNR
** Dipartimento di Matematica e Applicazioni "R.M. Caccioppoli", Università degli Studi di Napoli "Federico II"

Summary
Biological background
  - DNA
  - Chromosomes
  - Haplotypes and genotypes
  - SNPs
Haplotype analysis
Tag SNPs selection
  - Problem definition
  - State of the art
  - Reconstruction function and linkage disequilibrium
  - Clustering techniques
  - Set covering techniques
  - Computational results
  - Conclusions and future work

DNA Structure
- Double helix (Watson-Crick) of two sequences of nucleotides A, T, C, G
- Base pairs (A-T, G-C) are complementary
- A DNA sequence contains regions (e.g. genes, introns, exons) located at the same position of the sequence in each individual of a species

Chromosomes
- An individual's genome is organized in chromosomes, i.e. large DNA macromolecules packaged in linear or circular shape
- In polyploid organisms multiple copies of each chromosome exist
- In diploid organisms (e.g. humans) there are two copies of each chromosome, packaged in linear shape
- Each chromosome includes hundreds of different genes
- Four-arm structure during meiosis and mitosis

Haplotypes and genotypes
- A single 'copy' of a chromosome is called a haplotype, while a description of the mixed data on the two 'copies' is called a genotype
  H1: AATCGCCTTA (maternal chromosome)
  H2: ACACGTCTCA (paternal chromosome)
  G(H1,H2): A A/C T/A C G C/T C T T/C A
- For disease association studies, haplotype data is more valuable than genotype data
- Haplotype data is hard to collect; genotype data is easy to collect
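As a small illustration of the haplotype/genotype distinction, the sketch below derives a genotype from two haplotypes position by position. The function name `genotype` and the `a/b` notation for sites where the two copies differ are illustrative choices, not part of the original slides.

```python
def genotype(h1: str, h2: str) -> list[str]:
    """Combine two haplotypes into an (unphased) genotype.

    Sites where both copies agree keep the shared nucleotide;
    sites where they differ are reported as 'a/b'.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return [a if a == b else f"{a}/{b}" for a, b in zip(h1, h2)]


H1 = "AATCGCCTTA"  # maternal chromosome
H2 = "ACACGTCTCA"  # paternal chromosome
G = genotype(H1, H2)
print(" ".join(G))
```

The genotype records which nucleotides are present at each site but not which copy carries them, which is why haplotypes cannot be read off directly from genotype data.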

SNPs
- All humans are 99.99% identical. Diversity? Polymorphism.
- A SNP is a Single Nucleotide Polymorphism: a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
[Figure: six aligned DNA sequences that are identical except at four SNP sites]
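The SNP definition above can be sketched as a scan over aligned sequences: a site counts as a SNP when at least two different nucleotides each reach the frequency threshold. The function name `find_snps`, the 0-based positions and the toy sequences are assumptions made for illustration only.

```python
from collections import Counter


def find_snps(sequences, min_freq=0.05):
    """Return the 0-based positions where at least two nucleotides
    each occur with frequency >= min_freq among the sequences."""
    n = len(sequences)
    snps = []
    for pos in range(len(sequences[0])):
        counts = Counter(seq[pos] for seq in sequences)
        frequent = [allele for allele, c in counts.items() if c / n >= min_freq]
        if len(frequent) >= 2:
            snps.append(pos)
    return snps


# Hypothetical toy alignment: sites 3 and 5 vary, all others are constant.
toy = ["AATAGGT", "AATGGGT", "AATGGCT", "AATAGCT"]
print(find_snps(toy))
```

A variant seen in only one of 25 sequences (4%) falls below the 5% threshold and is not reported as a SNP.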

Haplotype analysis 1/2
[Figure: six haplotypes shown only at their four SNP sites]
- Haplotype analysis* focuses on haplotypes and genotypes that are sequences of SNPs
* http://www.hapmap.org/

Haplotype analysis 2/2
To reduce prohibitively expensive haplotyping costs, a two-stage methodology has been proposed [1].
Pilot study
- All SNPs of interest are genotyped in a small sample of the population
- Common haplotypes are inferred using statistical methods
- A set of tag SNPs is selected for the population study
Population study
- Tag SNPs are genotyped in the remaining population
- Statistical methods are used to infer haplotypes over the tag SNPs
- Haplotypes over the tag SNPs are extrapolated to full haplotypes
Two problems:
- Find a set of tag SNPs of minimum cardinality
- Find a reconstruction function

Tag SNPs Selection: methods and models
1. Methods that find a minimum set of clusters of SNPs in high correlation (e.g. linkage disequilibrium) with each other (the clusters are called blocks). SNP prediction should be easier within a block.
2. Methods that, given the block structure (based on correlation or on proximity), find a minimum set of SNPs able to distinguish each pair of haplotypes in a block; or that assume the number of tag SNPs is given and find a set of tags that can reconstruct the haplotype of an unknown sample with high accuracy.

Tag SNPs Selection: Problem definition
Given a population of N haplotypes over M SNPs, find a small set of SNPs (Tag SNPs) such that all the values of the other SNPs can be derived, with some reconstruction rule, from the values of the selected Tag SNPs.
Two aspects:
(1) Find a reconstruction function
(2) Find a set of minimum cardinality that can reconstruct the other SNPs using (1)
And also:
(3) Given (1) and (2), is there a proper way to identify blocks?

Tag SNPs Selection: Problem definition
The approaches
- Use a reconstruction function based on SNP similarity
- Method 1: cluster the SNPs according to a proper metric; select the centroid of each cluster as a TAG SNP
- Method 2: select a subset of SNPs that is able to differentiate each pair of haplotypes (Set Covering formulation)
- Both methods are consistent with the adopted reconstruction function
- The performance in reconstruction can be used to derive the blocks ex post

The reconstruction function
The "Majority Vote"
1. Given:
   - the set of TAG SNPs
   - a training set T of haplotypes for which we know the value of all the SNPs
   - a new haplotype H for which we know only the values of the TAG SNPs
2. Let S be the set of haplotypes in T that have the same values as H on the TAG SNPs
3. For each non-TAG SNP, determine its most frequent value in S and use it as the prediction of the value of this SNP in H
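The three steps of the majority vote can be sketched as follows. The function name `majority_vote`, the dictionary used to pass the known TAG values, and the `"?"` placeholder returned when no training haplotype matches are assumptions for illustration; the slides do not say how an empty set S is handled.

```python
from collections import Counter


def majority_vote(training, tags, known):
    """Reconstruct a full haplotype from its TAG SNP values.

    training: list of full haplotypes (strings of equal length), the set T
    tags:     list of 0-based TAG SNP positions
    known:    dict {position: value} giving H on the TAG SNPs
    """
    # Step 2: S = training haplotypes that agree with H on every TAG SNP.
    matches = [h for h in training if all(h[i] == known[i] for i in tags)]
    result = []
    for j in range(len(training[0])):
        if j in known:
            result.append(known[j])          # TAG SNP: value is given
        elif not matches:
            result.append("?")               # assumption: no match, no prediction
        else:
            # Step 3: most frequent value of SNP j within S.
            result.append(Counter(h[j] for h in matches).most_common(1)[0][0])
    return "".join(result)


# Hypothetical training set with a single TAG SNP at position 0.
T = ["ACGT", "ACGT", "ACGA", "TCCA", "TCCA"]
print(majority_vote(T, [0], {0: "A"}))
print(majority_vote(T, [0], {0: "T"}))
```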

The reconstruction function
- The majority vote rule is based on the assumption that the TAG SNPs characterize the haplotype almost completely
- If two haplotypes are equal on the TAG SNPs, then they are also equal on the other SNPs

Method 1: SNPs Clustering
- Clustering: find groups of elements with high dissimilarity between groups and small dissimilarity within each group, w.r.t. a chosen distance function
- Main assumption: TAG SNPs are those that are very similar to many other SNPs in the training data
- Cluster the SNPs in the haplotype space using Hamming distance (HD) with the k-means algorithm, for a proper value of k
- Select k TAG SNPs as those closest to the HD-centroids of the clusters
- Use the TAG SNPs to reconstruct the non-TAG SNPs of new haplotypes using the majority rule
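A minimal sketch of the clustering step, under stated assumptions: each SNP is represented by the vector of its values across the training haplotypes, centroids are taken as per-coordinate modes, and the TAG of each cluster is the member SNP closest to its centroid. The function names, the `init` parameter for fixing the starting centroids, and the 0/1 encoding of the toy data are illustrative and not taken from the slides.

```python
import random
from collections import Counter


def hamming(u, v):
    """Number of coordinates where two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def modal_centroid(vectors):
    """Per-coordinate most frequent value (the mode) of a group of vectors."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*vectors))


def select_tags_by_clustering(haplotypes, k, init=None, seed=0, max_iter=100):
    # One vector per SNP: its value in every haplotype.
    snps = [tuple(col) for col in zip(*haplotypes)]
    start = init if init is not None else random.Random(seed).sample(range(len(snps)), k)
    centroids = [snps[i] for i in start]
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: hamming(s, centroids[c])) for s in snps]
        if new_assign == assign:
            break  # assignments are stable: converged
        assign = new_assign
        for c in range(k):
            members = [s for s, a in zip(snps, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = modal_centroid(members)
    tags = []
    for c in range(k):
        members = [j for j, a in enumerate(assign) if a == c]
        if members:
            tags.append(min(members, key=lambda j: hamming(snps[j], centroids[c])))
    return sorted(tags), assign


# Hypothetical data: SNPs 0-2 behave alike, SNPs 3-5 behave alike.
haps = ["000000", "000111", "001000", "111111", "111000", "111110"]
tags, assign = select_tags_by_clustering(haps, k=2, init=[0, 3])
print(tags, assign)
```

With random starting centroids the result can depend on the seed, so in practice several restarts would be compared.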

Method 2: Set Covering Model
- The "classical" model: find a minimal subset of TAG SNPs such that each pair of haplotypes in the training set differs in the value of at least one TAG SNP
- Select the SNPs associated with x_i = 1 in the solution of the SC problem, where x_i is the binary variable that selects SNP i
- Use the TAG SNPs to reconstruct the non-TAG SNPs of new haplotypes using the majority rule
- The above problem cannot be solved optimally for realistic sizes
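To make the covering requirement concrete, here is a sketch that builds a feasible TAG set with the standard greedy heuristic for set covering. This is a stand-in for illustration, not the reduced-cost heuristic or the CPLEX-based solution used in the talk, and it gives no optimality guarantee. The function name `greedy_tag_cover` and the toy data are assumptions.

```python
from itertools import combinations


def greedy_tag_cover(haplotypes):
    """Greedily pick SNPs until every pair of distinct haplotypes
    differs on at least one selected SNP."""
    m = len(haplotypes[0])
    # Pairs that still need a distinguishing TAG SNP
    # (identical haplotypes can never be told apart, so they are skipped).
    uncovered = {(p, q) for p, q in combinations(range(len(haplotypes)), 2)
                 if haplotypes[p] != haplotypes[q]}
    tags = []
    while uncovered:
        def gain(i):
            # Number of still-uncovered pairs that SNP i would separate.
            return sum(haplotypes[p][i] != haplotypes[q][i] for p, q in uncovered)
        best = max(range(m), key=gain)
        tags.append(best)
        uncovered = {(p, q) for p, q in uncovered
                     if haplotypes[p][best] == haplotypes[q][best]}
    return sorted(tags)


# Hypothetical training haplotypes over 6 SNPs (0/1 encoding).
haps = ["000000", "000111", "001000", "111111", "111000", "111110"]
tags = greedy_tag_cover(haps)
print(tags)
```

The number of pairs grows quadratically with the number of haplotypes, which is one reason the exact model becomes hard at realistic sizes.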

Variants of the Set Covering Model
- The SC problem has a number of constraints quadratic in the number of haplotypes
- We use variations of the SC model (SCV) that make it possible to control the number of TAGs and their quality more effectively:
  - Minimize the number of TAGs for a given level of differentiation between haplotypes
  - Maximize the capacity to differentiate between haplotypes for a given number of TAG SNPs
- Solved with an iterative heuristic based on reduced costs
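One plausible way to write the classical model and the two variants is sketched below; the formulas shown on the original slides are not part of this transcript, so this is a reading of the text rather than the authors' exact formulation. The symbols are assumptions introduced here: x_i selects SNP i, D_{pq} is the set of SNPs on which haplotypes p and q differ, \alpha is the required level of differentiation and \beta the allowed number of TAG SNPs. All haplotypes are assumed distinct.

```latex
% Classical set covering model
\min \sum_{i=1}^{M} x_i
\quad \text{s.t.} \quad
\sum_{i \in D_{pq}} x_i \ge 1 \;\; \forall\, p < q,
\qquad x_i \in \{0,1\}

% Variant A: fewest TAGs for a given differentiation level \alpha
\min \sum_{i=1}^{M} x_i
\quad \text{s.t.} \quad
\sum_{i \in D_{pq}} x_i \ge \alpha \;\; \forall\, p < q,
\qquad x_i \in \{0,1\}

% Variant B: largest differentiation level for a given number \beta of TAGs
\max \; \alpha
\quad \text{s.t.} \quad
\sum_{i \in D_{pq}} x_i \ge \alpha \;\; \forall\, p < q,
\qquad \sum_{i=1}^{M} x_i \le \beta,
\qquad x_i \in \{0,1\}
```

With N haplotypes there is one covering constraint per pair, that is N(N-1)/2 constraints, which matches the quadratic growth noted above.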

Some Remarks
- A good estimate of the number of TAG SNPs to be used in the model can be found efficiently by measuring the quality of the clusters for different numbers of clusters
- The quality of the two methods (Clustering and Set Covering) can be compared directly using TAG SNP sets of the same size
- SC is still not tractable if all SNPs are used (most of the literature uses the first 1000-1500 SNPs):
  - Start with the centroids of the clustering
  - Add columns with pricing until the LP is optimal
  - Add columns with a metric on SNPs until the objective function increases
  - Solve the IP

Computational results
International HapMap Project data on chromosome 21 of the human genome
- YRI: Yoruba in Ibadan, Nigeria
- JPT: Japanese in Tokyo, Japan
- CHB: Han Chinese in Beijing, China
- CEU: Utah residents with Northern and Western European ancestry
Population   # haplotypes   # SNPs
YRI          120            38,852
JPT+CHB      180            33,878
CEU          120            34,103

Computational results: experimental setting
a) Limited to the first block of 1500 SNPs (as in related literature), or
b) Using all SNPs (about 40,000)
c) Clustering with standard HD, modal centroids and random starting centroids
d) SCV with a fixed parameter value, using iterative heuristics based on reduced costs, solved with CPLEX
e) Reconstruction with the majority rule
f) Quality of reconstruction: if the SNP value is coherent in more than 70% of the matching haplotypes (set S), then predict; otherwise declare it undetermined
g) 2/3 of the haplotypes used for training, 1/3 for testing
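Items e) and f) can be sketched as a thresholded majority rule plus a simple tally over a test set. The function names `predict_snp` and `evaluate`, the choice to count a SNP as undetermined when no training haplotype matches, and the toy data are assumptions for illustration; the 0.7 threshold is the one stated above.

```python
from collections import Counter


def predict_snp(values, threshold=0.7):
    """Return the majority value if it occurs in more than `threshold`
    of the matching haplotypes, otherwise None (undetermined)."""
    if not values:
        return None  # assumption: no matching haplotype means undetermined
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) > threshold else None


def evaluate(train, test, tags, threshold=0.7):
    """Count correct, wrong and undetermined non-TAG SNP predictions."""
    correct = wrong = undetermined = 0
    for h in test:
        # Set S: training haplotypes equal to h on all TAG SNPs.
        matches = [t for t in train if all(t[i] == h[i] for i in tags)]
        for j in range(len(h)):
            if j in tags:
                continue
            pred = predict_snp([t[j] for t in matches], threshold)
            if pred is None:
                undetermined += 1
            elif pred == h[j]:
                correct += 1
            else:
                wrong += 1
    return correct, wrong, undetermined


# Hypothetical split: six training haplotypes, two test haplotypes, TAG SNP 0.
train = ["000", "000", "000", "001", "111", "110"]
test = ["000", "110"]
print(evaluate(train, test, tags=[0]))
```

In the toy run, the last SNP of the second test haplotype is left undetermined because the two matching training haplotypes disagree on it.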

Computational results

Observations
Reconstruction error in the range of 20% of the SNPs, improving on previous results (where comparable)
1. The SCV method performs better than clustering, especially when all SNPs are used
2. Best results are obtained with approx. 30 TAG SNPs. Larger values do not reduce the reconstruction error and slow down the computation
3. First time so many SNPs are treated simultaneously
4. Completely correct SNPs are in the range 10-20%
With about 30 TAGs we can correctly reconstruct about 6000 SNPs

Computational results: work in progress
Use the proposed method to identify the blocks:
- Use all SNPs on the training set
- Apply SCV to select a set of TAG SNPs
- Apply the majority rule to the test set and select those SNPs that are predicted correctly over the whole test set
- Create one block with these SNPs, associate them with the TAG set, and remove them from the samples
- Iterate until the sample contains only TAG SNPs or no improvement is obtained
Preliminary results are encouraging.
Larger data sets are needed in order to test the method properly.
