CSE182-L17 Clustering Population Genetics: Basics.


Unsupervised Clustering Given a set of points (in n dimensions) and a number k, compute the k “best” clusters. In k-means, clustering is done by choosing k centers (means). Each point is assigned to the closest center. The notion of “best” is defined by the distances from the points to their centers. Question: How can we compute the k best centers?

Distance Given a data point v and a set of points X, define the distance from v to X, $d(v, X)$, as the (Euclidean) distance from v to the closest point in X. Given a set of n data points $V = \{v_1, \ldots, v_n\}$ and a set of k points X, define the Squared Error Distortion $d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$.
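The two definitions above translate almost directly into code. The following is a minimal sketch, assuming points are tuples of floats; the function names `dist_to_set` and `squared_error_distortion` and the toy data are illustrative and do not come from the slides.

```python
import math

def dist_to_set(v, X):
    """d(v, X): Euclidean distance from v to the closest point of X."""
    return min(math.dist(v, x) for x in X)

def squared_error_distortion(V, X):
    """d(V, X): mean of the squared distances from each data point to X."""
    return sum(dist_to_set(v, X) ** 2 for v in V) / len(V)

# Hypothetical toy data: two pairs of points, one candidate center per pair.
V = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
X = [(0.0, 1.0), (10.0, 1.0)]
print(squared_error_distortion(V, X))
```

Every point in this toy set lies at distance 1 from its nearest center, so the distortion is the mean of four ones.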

K-Means Clustering Problem: Formulation Input: A set V of n points and a parameter k. Output: A set X of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X. This problem is NP-hard in general.

1-Means Clustering Problem: an Easy Case Input: A set V of n points. Output: A single point X that minimizes d(V,X) over all possible choices of X. This problem is easy: the optimum is the centroid (mean) of V. However, the problem becomes very difficult for more than one center. An efficient heuristic for k-means clustering is the Lloyd algorithm.
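A short derivation shows why the single-center case is easy. It assumes Euclidean distance, as in the distortion defined earlier; the lowercase symbol x for the single center is introduced here only for illustration.

```latex
d(V, x) = \frac{1}{n}\sum_{i=1}^{n} \lVert v_i - x \rVert^2,
\qquad
\nabla_x\, d(V, x) = -\frac{2}{n}\sum_{i=1}^{n} (v_i - x) = 0
\;\Longrightarrow\;
x = \frac{1}{n}\sum_{i=1}^{n} v_i .
```

Because the distortion is convex in x, this stationary point is the minimum, so the optimal single center is the centroid of V.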

K-means: Lloyd’s algorithm
Choose k centers at random: $X' = \{x_1, x_2, x_3, \ldots, x_k\}$
Repeat
–$X = X'$
–Assign each $v \in V$ to the closest cluster j: $d(v, x_j) = d(v, X) \Rightarrow C_j = C_j \cup \{v\}$
–Recompute $X'$: $x'_j \leftarrow \left(\sum_{v \in C_j} v\right) / |C_j|$
until ($X' = X$)
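The loop above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation, and it makes several assumptions that the slide does not state: the initial centers are sampled from the data points, an empty cluster keeps its previous center, and a hypothetical `max_iter` cap guards against long runs.

```python
import math
import random

def lloyd(V, k, seed=0, max_iter=100):
    """Lloyd's heuristic for k-means on a list of equal-length float tuples."""
    rng = random.Random(seed)
    centers = rng.sample(V, k)  # assumption: initial centers are data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for v in V:
            j = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[j].append(v)
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(tuple(
                    sum(p[d] for p in members) / len(members) for d in range(dim)))
            else:
                new_centers.append(centers[j])  # assumption: keep old center
        if new_centers == centers:  # until (X' = X)
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy data: two well-separated groups on a line.
V = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers, clusters = lloyd(V, k=2)
print(sorted(centers))
```

Lloyd's algorithm only guarantees a local optimum, so in general the result depends on the random initial centers; running it from several seeds and keeping the lowest-distortion result is a common remedy.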

[Figure sequence: successive iterations of Lloyd's algorithm on an example with three centers x1, x2, x3.]

Conservative K-Means Algorithm The Lloyd algorithm is fast, but in each iteration it moves many data points, not necessarily causing better convergence. A more conservative method would be to move one point at a time, and only if the move improves the overall clustering cost. The smaller the clustering cost of a partition of the data points, the better the clustering. Different methods can be used to measure this clustering cost (for example, the Lloyd algorithm above uses the squared error distortion).
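One way to read the conservative method is as a greedy search that, in each round, applies the single point move that lowers the cost the most and stops when no move helps. The sketch below follows that reading under stated assumptions: the cost is the squared error distortion with cluster means as centers, the starting partition is supplied by the caller, and the names `cluster_cost` and `conservative_kmeans` are hypothetical.

```python
def cluster_cost(V, labels, k):
    """Squared error distortion, using each cluster's mean as its center."""
    total = 0.0
    for j in range(k):
        members = [v for v, lab in zip(V, labels) if lab == j]
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - center[d]) ** 2 for p in members for d in range(dim))
    return total / len(V)

def conservative_kmeans(V, labels, k):
    """Repeatedly apply the single best cost-reducing move of one point."""
    labels = list(labels)
    best_cost = cluster_cost(V, labels, k)
    while True:
        best_move = None
        for i in range(len(V)):
            original = labels[i]
            for j in range(k):
                if j == original:
                    continue
                labels[i] = j  # tentatively move point i to cluster j
                cost = cluster_cost(V, labels, k)
                if cost < best_cost - 1e-12:
                    best_cost, best_move = cost, (i, j)
            labels[i] = original  # undo the tentative move
        if best_move is None:
            return labels, best_cost
        i, j = best_move
        labels[i] = j

# Hypothetical toy data with a deliberately poor starting partition.
V = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(conservative_kmeans(V, [0, 1, 0, 1], k=2))
```

Each round recomputes the full cost for every candidate move, which keeps the sketch short but is quadratic in the number of points; an incremental cost update would be the natural optimization.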

Microarray summary Microarrays (like MS) are a technology for probing the dynamic state of the cell. We answered questions like the following:
–Which genes are coordinately regulated (that is, have similar expression patterns in different conditions)?
–How can we reduce the dimensionality of the system?
–Using gene expression values from a sample, can you predict whether the sample is normal (state A) or diseased (state B)?
The techniques employed for classification, clustering, etc. are general and can be employed in a number of contexts.

Microarray non-summary We did not cover:
–How are the gene expression values measured (the technology)? (CSE183)
–How do you control variability across different experiments (normalization)? (CSE183)
–What controls the expression of a gene (gene regulation), or a set of genes? (CSE 181)

Population Genetics The sequence of an individual does not say anything about the diversity of a population. Small individual genetic differences can have a profound impact on “phenotypes”:
–Response to drugs
–Susceptibility to diseases
Soon, we will have sequences of many individuals from the same species. Studying the differences will be a major challenge.

Population Structure 377 locations (loci) were sampled in 1000 people from 52 populations. 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al., Science 2002). [Figure: inferred clusters labeled Africa, Eurasia, East Asia, America, Oceania.]

Population Genetics What is it about our genetic makeup that makes us measurably different? These genetic differences are correlated with phenotypic differences. With cost reductions in sequencing and genotyping technologies, we will know the sequences of entire populations of individuals. Here, we will study the basics of this polymorphism data and the tools that are being developed to analyze it.

What causes variation in a population?
–Mutations (may lead to SNPs)
–Recombinations
–Other genetic events (e.g., microsatellite repeats)
–Deletions, inversions

Single Nucleotide Polymorphisms Infinite Sites Assumption: Each site mutates at most once

Short Tandem Repeats
GCTAGATCATCATCATCATTGCTAG
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC

STR can be used as a DNA fingerprint Consider a collection of regions with variable-length repeats. Variable-length repeats lead to variable-length DNA. The vector of lengths is a fingerprint. [Figure: matrix of repeat lengths indexed by positions and individuals.]

Recombination

What if there were no recombinations? Life would be simpler. Each sequence would have a single parent. The relationship is expressed as a tree.

The Infinite Sites Assumption The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa. Some phenotypes could be linked to the polymorphisms. Some of the linkage is “destroyed” by recombination.

Infinite sites assumption and Perfect Phylogeny Each site is mutated at most once in the history. All descendants must carry the mutated value, and all others must carry the ancestral value. [Figure: a mutation at site i on a tree edge; sequences below the edge have 1 in position i, all others have 0 in position i.]

Perfect Phylogeny Assume an evolutionary model in which no recombination takes place, only mutation. The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny. How can one reconstruct such a tree?

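The 4-gamete condition A column i partitions the set of species into two sets, $i_0$ and $i_1$ (the species with value 0 and value 1 in column i, respectively). A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous. Example: with column i given by A=0, B=0, C=0, D=1, E=1, F=1, we have $i_0$ = {A,B,C} and $i_1$ = {D,E,F}, and i is heterogeneous w.r.t. {A,D,E}.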

4 Gamete Condition
–There exists a perfect phylogeny if and only if for all pairs of columns (i,j), j is homogeneous w.r.t. $i_0$ or w.r.t. $i_1$.
–Equivalently, there exists a perfect phylogeny if and only if for all pairs of columns (i,j), the four rows (0,0), (0,1), (1,0), (1,1) do not all occur.
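The second form of the condition is easy to check directly on a 0/1 matrix. The following is a minimal sketch, assuming the data is a list of rows (one per species) of equal length; the function name `four_gamete_violations` and the example matrices are hypothetical.

```python
from itertools import combinations

def four_gamete_violations(M):
    """Return the column pairs (i, j) in which all four gametes occur."""
    n_cols = len(M[0])
    bad = []
    for i, j in combinations(range(n_cols), 2):
        gametes = {(row[i], row[j]) for row in M}
        if len(gametes) == 4:  # (0,0), (0,1), (1,0) and (1,1) all present
            bad.append((i, j))
    return bad

# Hypothetical example matrices (rows = species, columns = sites).
compatible = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
conflicting = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(four_gamete_violations(compatible))
print(four_gamete_violations(conflicting))
```

An empty result means every pair of columns passes the test; any returned pair is a witness that no perfect phylogeny exists for the matrix.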

4-gamete condition: proof (only if) Every perfect phylogeny satisfies the 4-gamete condition: depending on the edge on which mutation j occurs, column j must be homogeneous w.r.t. either $i_0$ or $i_1$. (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist? [Figure: tree in which the edge carrying mutation i separates $i_0$ from $i_1$.]

An algorithm for constructing a perfect phylogeny We will consider the case where 0 is the ancestral state and 1 is the mutated state. This restriction will be removed later. In any tree, each node (except the root) has a single parent, so it is sufficient to construct a parent for every node. In each step, we add a column and refine some of the nodes containing multiple children. Stop when all columns have been considered.

Inclusion Property For any pair of columns i, j: i < j if and only if $i_1 \supseteq j_1$. Note that if i < j, then the edge containing i is an ancestor of the edge containing j.

Example Initially, there is a single clade r, and each node (A, B, C, D, E) has r as its parent. [Figure: character matrix with rows A, B, C, D, E and a star tree rooted at r.]

Sort columns Sort the columns according to the inclusion property (note that the columns are already sorted in this example). This can be achieved by treating each column as the binary representation of a number (most significant bit in row 1) and sorting in decreasing order.
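The sorting step can be sketched as follows, assuming the matrix is a list of 0/1 rows with row 1 first. The function name `sort_columns` and the example matrix are hypothetical and not taken from the slide's figure.

```python
def sort_columns(M):
    """Return column indices ordered by decreasing binary value.

    Each column is read top to bottom as a binary number, with the
    first row as the most significant bit.
    """
    def as_number(j):
        value = 0
        for row in M:
            value = 2 * value + row[j]
        return value
    return sorted(range(len(M[0])), key=as_number, reverse=True)

# Hypothetical matrix (rows A..E); its columns read as 4, 10, 21, 20.
M = [
    [0, 0, 1, 1],  # A
    [0, 1, 0, 0],  # B
    [1, 0, 1, 1],  # C
    [0, 1, 0, 0],  # D
    [0, 0, 1, 0],  # E
]
print(sort_columns(M))
```

If the 1-entries of one column strictly contain those of another, the first column has the larger binary value, so it comes earlier in this order.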

Add first column In adding column i:
–Check each edge and decide which side of it each node belongs to.
–Finally, add a node if you can resolve a clade.
[Figure: the star tree on A, B, C, D, E rooted at r, before and after adding the new internal node u.]

Adding other columns Add the other columns on edges using the ordering property. [Figure: resulting tree rooted at r with leaves A, B, C, D, E.]
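The construction steps above can be condensed into a short sketch for the rooted case (0 ancestral, 1 mutated). It is one possible realization rather than the slides' exact procedure: each column with at least one 1 becomes a tree node, every species is walked through its 1-columns in sorted order, and the walk must always assign the same parent to a column. The function name `perfect_phylogeny`, the node labels `"r"` and `"colJ"`, and the example matrix are assumptions made for illustration.

```python
def perfect_phylogeny(M, names):
    """Return a {child: parent} map, or None if no rooted perfect phylogeny exists.

    Assumes 0 is ancestral and that species names do not clash with the
    hypothetical node labels "r" (root) and "col0", "col1", ...
    """
    n_cols = len(M[0])
    # Decreasing lexicographic order of columns matches decreasing binary value.
    order = sorted(range(n_cols), key=lambda j: [row[j] for row in M], reverse=True)
    parent = {}
    for r, row in enumerate(M):
        prev = "r"
        for j in order:
            if row[j] == 1:
                node = f"col{j}"
                # Every species carrying mutation j must reach it from the same node.
                if parent.setdefault(node, prev) != prev:
                    return None
                prev = node
        parent[names[r]] = prev  # the species hangs below its last mutation
    return parent

# Hypothetical matrix (rows A..E, four sites).
M = [
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]
print(perfect_phylogeny(M, ["A", "B", "C", "D", "E"]))
```

Returning a parent map mirrors the observation that it is sufficient to construct a parent for every node. Columns that are all 0 carry no mutation and therefore do not appear in the result.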

Unrooted case Switch the values in each column so that 0 is the majority element. Then apply the algorithm for the rooted case.
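The column-switching step might look like the sketch below. It assumes a list-of-rows 0/1 matrix and leaves a column unchanged when 0 and 1 are tied, which is a choice made here rather than something the slide specifies; the name `make_zero_majority` is hypothetical.

```python
def make_zero_majority(M):
    """Flip every column in which 1 is the strict majority value."""
    n_rows = len(M)
    flip = [2 * sum(row[j] for row in M) > n_rows for j in range(len(M[0]))]
    return [[1 - x if flip[j] else x for j, x in enumerate(row)] for row in M]

# Hypothetical example: column 0 has three 1s out of four rows, so it is flipped.
M = [[1, 0], [1, 1], [1, 0], [0, 0]]
print(make_zero_majority(M))
```

The recoded matrix can then be handed to any rooted perfect phylogeny procedure that treats 0 as the ancestral state.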

Handling recombination A tree is not sufficient, as a sequence may have 2 parents. Recombination leads to loss of correlation between columns.

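Linkage (Dis)-equilibrium (LD) Consider sites A & B. Under linkage equilibrium, the joint frequency of a pair of alleles equals the product of their individual frequencies.
Case 1: No recombination –Pr[(A,B)=(0,1)] = 0.25 (linkage disequilibrium)
Case 2: Extensive recombination –Pr[(A,B)=(0,1)] = 0.125 (linkage equilibrium)
[Figure: haplotypes at sites A and B for the two cases.]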
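A small numeric check makes the two cases concrete. The haplotype lists below are hypothetical toy data chosen to reproduce the probabilities quoted on the slide; they are not the data from the slide's figure. The helper name `joint_and_expected` is also an assumption.

```python
def joint_and_expected(haplotypes, a=0, b=1):
    """Return (Pr[(A,B)=(a,b)], Pr[A=a] * Pr[B=b]) for a list of (A, B) pairs."""
    n = len(haplotypes)
    p_a = sum(1 for x, _ in haplotypes if x == a) / n
    p_b = sum(1 for _, y in haplotypes if y == b) / n
    p_ab = sum(1 for x, y in haplotypes if (x, y) == (a, b)) / n
    return p_ab, p_a * p_b

# Hypothetical toy data: 8 haplotypes with Pr[A=0] = 0.5 and Pr[B=1] = 0.25.
no_recombination = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]
extensive_recombination = [(0, 1), (0, 0), (0, 0), (0, 0), (1, 1), (1, 0), (1, 0), (1, 0)]

print(joint_and_expected(no_recombination))
print(joint_and_expected(extensive_recombination))
```

In the first data set the joint frequency (0.25) differs from the product of the allele frequencies (0.125), which indicates disequilibrium; in the second the two values agree, which is what equilibrium means.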