CSE182-L17 Clustering Population Genetics: Basics.

Unsupervised Clustering
Given a set of points (in n dimensions) and a number k, compute the k “best” clusters. In k-means, clustering is done by choosing k centers (means), and each point is assigned to the closest center. The notion of “best” is defined by the distances to the centers. Question: how can we compute the k best centers?

Distance
Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point in X. Given a set of n data points V = {v_1, …, v_n} and a set of k points X, define the squared error distortion as
$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
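
The two definitions on this slide translate almost directly into code. The following is a minimal sketch, assuming points are given as tuples of floats; the function names `dist_to_set` and `squared_error_distortion` are illustrative and not from the lecture.

```python
import math

def dist_to_set(v, X):
    """Euclidean distance from point v to the closest point in X."""
    return min(math.dist(v, x) for x in X)

def squared_error_distortion(V, X):
    """Mean of the squared distances from each data point to its closest center."""
    return sum(dist_to_set(v, X) ** 2 for v in V) / len(V)

# Three data points and two candidate centers (made-up values).
V = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
X = [(0.0, 1.0), (4.0, 0.0)]
print(squared_error_distortion(V, X))  # (1 + 1 + 0) / 3
```

Lower values mean the centers in X summarize V more tightly, which is the quantity the k-means problem tries to minimize.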

K-Means Clustering Problem: Formulation
Input: a set V consisting of n points, and a parameter k.
Output: a set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V, X) over all possible choices of X.
This problem is NP-hard in general.

1-Means Clustering Problem: an Easy Case
Input: a set V consisting of n points.
Output: a single point x that minimizes d(V, {x}) over all possible choices of x.
This problem is easy: the optimal point is the centroid (mean) of V. However, the problem becomes very difficult for more than one center. An efficient heuristic for k-means clustering is the Lloyd algorithm.
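
One way to see why the single-center case is easy is to write the distortion for one center x and set its gradient to zero. This is a short derivation sketch, assuming Euclidean distance as in the definition of d(v, X); because the objective is a convex quadratic in x, the stationary point is the minimum.

```latex
\begin{aligned}
d(V, \{x\}) &= \frac{1}{n} \sum_{i=1}^{n} \lVert v_i - x \rVert^2 \\
\nabla_x\, d(V, \{x\}) &= -\frac{2}{n} \sum_{i=1}^{n} (v_i - x) = 0
\;\Longrightarrow\; x = \frac{1}{n} \sum_{i=1}^{n} v_i
\end{aligned}
```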

K-means: Lloyd's algorithm
Choose k centers at random: X' = {x_1, x_2, x_3, …, x_k}
Repeat
–X = X'
–Assign each v ∈ V to the closest cluster j: if d(v, x_j) = d(v, X), then C_j = C_j ∪ {v}
–Recompute X': x'_j ← (∑_{v ∈ C_j} v) / |C_j|
until (X' = X)
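
The loop above can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not the lecture's implementation: initial centers are drawn from the data points themselves, an empty cluster keeps its previous center, and `max_iter` is an added safety bound. The names `closest` and `lloyd` are hypothetical.

```python
import random

def closest(v, centers):
    """Index of the center nearest to v (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(v, centers[j])))

def lloyd(V, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(V, k)            # assumption: start from k random data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in V:                       # assignment step
            clusters[closest(v, centers)].append(v)
        new_centers = []
        for j, C in enumerate(clusters):  # update step: centroid of each cluster
            if C:
                new_centers.append(tuple(sum(c) / len(C) for c in zip(*C)))
            else:
                new_centers.append(centers[j])  # assumption: keep an empty cluster's center
        if new_centers == centers:        # X' = X, so the assignment is stable
            break
        centers = new_centers
    return centers, clusters

V = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers, clusters = lloyd(V, 2)
print(sorted(centers))
```

Because Lloyd's algorithm is a heuristic, the result in general depends on the random start; on less well-separated data it can stop at a local optimum.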

[Figure: clustering example with three centers x_1, x_2, x_3.]

Conservative K-Means Algorithm
The Lloyd algorithm is fast, but in each iteration it moves many data points, which does not necessarily lead to better convergence. A more conservative method is to move one point at a time, and only if the move improves the overall clustering cost. The smaller the clustering cost of a partition of the data points, the better the clustering. Different measures can be used for the clustering cost (for example, the Lloyd algorithm uses the squared error distortion).
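
A minimal sketch of the conservative idea follows, assuming the cost is the squared error distortion with each cluster scored against its own centroid, and assuming a move is never allowed to empty a cluster. It accepts the first improving single-point move it finds; the slide does not specify the move order, so that choice is an assumption, as are the function names.

```python
def centroid(C):
    return tuple(sum(c) / len(C) for c in zip(*C))

def cost(clusters):
    """Squared error distortion of a partition (every cluster must be non-empty)."""
    n = sum(len(C) for C in clusters)
    total = 0.0
    for C in clusters:
        m = centroid(C)
        total += sum(sum((a - b) ** 2 for a, b in zip(v, m)) for v in C)
    return total / n

def improving_move(clusters):
    """Return a new partition that moves one point and lowers the cost, or None."""
    base = cost(clusters)
    for i, Ci in enumerate(clusters):
        if len(Ci) < 2:
            continue                      # assumption: never empty a cluster
        for v in Ci:
            for j in range(len(clusters)):
                if j == i:
                    continue
                trial = [list(C) for C in clusters]
                trial[i].remove(v)
                trial[j].append(v)
                if cost(trial) < base - 1e-12:
                    return trial
    return None

def conservative_kmeans(clusters):
    clusters = [list(C) for C in clusters]
    while True:
        better = improving_move(clusters)
        if better is None:
            return clusters
        clusters = better

start = [[(0.0,), (10.0,)], [(1.0,), (11.0,)]]
result = conservative_kmeans(start)
print(result, cost(result))
```

Each accepted move strictly lowers the cost and there are finitely many partitions, so the loop terminates.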

Microarray summary
Microarrays (like MS) are a technology for probing the dynamic state of the cell. We answered questions like the following:
–Which genes are coordinately regulated (that is, have similar expression patterns in different conditions)?
–How can we reduce the dimensionality of the system?
–Using gene expression values from a sample, can you predict whether the sample is normal (state A) or diseased (state B)?
The techniques employed for classification, clustering, etc. are general and can be employed in a number of contexts.

Microarray non-summary
We did not cover:
–How are the gene expression values measured (the technology)? (CSE183)
–How do you control variability across different experiments (normalization)? (CSE183)
–What controls the expression of a gene (gene regulation), or of a set of genes? (CSE181)

Population Genetics
The sequence of an individual does not say anything about the diversity of a population. Small individual genetic differences can have a profound impact on “phenotypes”:
–Response to drugs
–Susceptibility to diseases
Soon, we will have sequences of many individuals from the same species. Studying the differences will be a major challenge.

Population Structure
377 locations (loci) were sampled in 1000 people from 52 populations. 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al., Science 2003).
[Figure: clusters labeled by region: Africa, Eurasia, East Asia, America, Oceania.]

Population Genetics
What is it about our genetic makeup that makes us measurably different? These genetic differences are correlated with phenotypic differences. With cost reductions in sequencing and genotyping technologies, we will know the sequences of entire populations of individuals. Here, we will study the basics of this polymorphism data and the tools that are being developed to analyze it.

What causes variation in a population?
–Mutations (may lead to SNPs)
–Recombinations
–Other genetic events (e.g., microsatellite repeats)
–Deletions, inversions

Single Nucleotide Polymorphisms
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
Infinite Sites Assumption: each site mutates at most once.

Short Tandem Repeats
GCTAGATCATCATCATCATTGCTAG
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC
[Figure labels (repeat counts): 4 3 5 3 3 5]
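
The repeat count of each sequence can be read off programmatically. The sketch below assumes the repeated unit is `CAT` (the slide does not name the unit) and counts the longest run of consecutive copies; `max_tandem_repeats` is an illustrative name.

```python
import re

def max_tandem_repeats(seq, unit):
    """Number of copies in the longest run of consecutive `unit` copies in seq."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(r) // len(unit) for r in runs), default=0)

sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]
counts = [max_tandem_repeats(s, "CAT") for s in sequences]
print(counts)
```

The resulting list of counts per individual is the kind of length vector that the fingerprinting idea relies on.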

STR can be used as a DNA fingerprint
Consider a collection of regions with variable-length repeats. Variable-length repeats lead to variable-length DNA, so the vector of lengths is a fingerprint.
[Figure: matrix of repeat lengths indexed by individuals and positions; values shown: 4 2 3 5 1 3 2 3 1 5 3]

Recombination
[Figure: parental sequences 00000000 and 11111111, and the recombinant 00011111.]
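
Reading the three sequences on this slide as two parents and one recombinant, a single crossover can be sketched as taking a prefix from one parent and the remaining suffix from the other. The function name `recombine` and the single-breakpoint model are assumptions made for illustration.

```python
def recombine(parent_a, parent_b, breakpoint):
    """Single crossover: prefix of parent_a up to breakpoint, then suffix of parent_b."""
    return parent_a[:breakpoint] + parent_b[breakpoint:]

child = recombine("00000000", "11111111", 3)
print(child)
```

The child now has two parents, which is why a tree alone can no longer describe the history.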

What if there were no recombinations?
Life would be simpler. Each sequence would have a single parent, and the relationship would be expressed as a tree.

The Infinite Sites Assumption
[Figure: binary haplotypes with mutations labeled at sites 3, 8, and 5.]
The different sites are linked. A 1 in position 8 implies a 0 in position 5, and vice versa. Some phenotypes could be linked to the polymorphisms. Some of the linkage is “destroyed” by recombination.

Infinite sites assumption and Perfect Phylogeny
Each site is mutated at most once in the history. All descendants must carry the mutated value, and all others must carry the ancestral value.
[Figure: a mutation at site i on a tree edge; sequences below the edge have 1 in position i, all others have 0 in position i.]

Perfect Phylogeny
Assume an evolutionary model in which no recombination takes place, only mutation. The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. For each mutated site, all the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny. How can one reconstruct such a tree?

The 4-gamete condition
A column i partitions the set of species into two sets, i_0 and i_1. A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous. Example: i is heterogeneous w.r.t. {A, D, E}.
   i
A  0
B  0
C  0
D  1
E  1
F  1
Here i_0 = {A, B, C} and i_1 = {D, E, F}.

4 Gamete Condition
–There exists a perfect phylogeny if and only if, for all pairs of columns (i, j), j is homogeneous w.r.t. i_0 or i_1.
–Equivalently, there exists a perfect phylogeny if and only if, for all pairs of columns (i, j), the four rows (0,0), (0,1), (1,0), (1,1) do not all occur.
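
The second form of the condition is easy to test directly: for every pair of columns, collect the distinct (row[i], row[j]) pairs and check whether all four appear. The sketch below assumes the matrix is a list of rows of 0/1 integers with 0-based column indices; `four_gamete_violations` is an illustrative name.

```python
from itertools import combinations

def four_gamete_violations(M):
    """Return the column pairs (i, j) in which all four gametes 00, 01, 10, 11 occur."""
    ncols = len(M[0])
    bad = []
    for i, j in combinations(range(ncols), 2):
        gametes = {(row[i], row[j]) for row in M}
        if len(gametes) == 4:
            bad.append((i, j))
    return bad

# The 5 x 5 example matrix used later in the lecture (rows A to E).
M = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
print(four_gamete_violations(M))                          # no violating pair
print(four_gamete_violations([[0, 0], [0, 1], [1, 0], [1, 1]]))
```

An empty result means every pair of columns passes the test, so a perfect phylogeny exists for the matrix.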

4-gamete condition: proof
(only if) Every perfect phylogeny satisfies the 4-gamete condition: depending on the edge on which mutation j occurs, column j is homogeneous w.r.t. either i_0 or i_1.
(if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?
[Figure: the edge carrying mutation i splits the tree into i_0 and i_1.]

An algorithm for constructing a perfect phylogeny
We will consider the case where 0 is the ancestral state and 1 is the mutated state. This restriction will be removed later. In any tree, each node (except the root) has a single parent, so it is sufficient to construct a parent for every node. In each step, we add a column and refine some of the nodes containing multiple children. Stop when all columns have been considered.

Inclusion Property
For any pair of columns i, j:
–i < j if and only if i_1 ⊇ j_1
Note that if i < j, then the edge containing i is an ancestor of the edge containing j.

Example
   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0
Initially, there is a single clade r, and each node has r as its parent.
[Figure: root r with leaves A, B, C, D, E.]

Sort columns
Sort columns according to the inclusion property (note that the columns are already sorted here). This can be achieved by treating each column as the binary representation of a number (most significant bit in row 1) and sorting in decreasing order.
   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0
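
The sorting step can be sketched as follows, reading each column top to bottom as a binary number and ordering the column indices by decreasing value. Indices are 0-based here, and `sort_columns` is an illustrative name.

```python
def sort_columns(M):
    """Column indices ordered by decreasing binary value (row 0 is the most significant bit)."""
    ncols = len(M[0])

    def value(j):
        return int("".join(str(row[j]) for row in M), 2)

    return sorted(range(ncols), key=value, reverse=True)

M = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
print(sort_columns(M))  # the example columns are already in sorted order
```

For the example, the column values are 21, 20, 10, 4, and 2, so the order is unchanged. If the set of 1s of one column strictly contains that of another, the first column has the larger value and therefore comes first.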

Add first column
In adding column i:
–Check each edge and decide which side each species belongs to.
–Finally, add a node if you can resolve a clade.
[Figure: tree over A, B, C, D, E rooted at r after adding column 1 of the example matrix, with a new node u.]

Adding other columns
Add the other columns on edges using the ordering property.
[Figure: final tree rooted at r with leaves A, B, C, D, E and edges labeled by columns 1 to 5 of the example matrix.]
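
The whole rooted construction can be sketched compactly. The version below is one possible implementation, not necessarily the lecture's exact procedure: it sorts the columns in decreasing binary order, then walks each species through its 1-columns in that order, creating one internal node per mutated column. It assumes 0 is the ancestral state; the node names `r` and `m1`, `m2`, … (the node below the edge carrying mutation 1, 2, …) are hypothetical labels, and species names are assumed not to clash with them.

```python
def perfect_phylogeny(M, names):
    """Return {node: parent} for a rooted perfect phylogeny, or None if none exists."""
    ncols = len(M[0])
    order = sorted(range(ncols),
                   key=lambda j: int("".join(str(row[j]) for row in M), 2),
                   reverse=True)
    parent = {}    # node -> parent node
    col_node = {}  # column -> internal node created by that mutation
    for name, row in zip(names, M):
        node = "r"
        for j in order:
            if row[j] == 0:
                continue
            child = col_node.get(j)
            if child is None:
                child = f"m{j + 1}"
                col_node[j] = child
                parent[child] = node
            elif parent[child] != node:
                return None  # mutation j would have to sit on two different edges
            node = child
        parent[name] = node  # attach the species as a leaf
    return parent

M = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
tree = perfect_phylogeny(M, ["A", "B", "C", "D", "E"])
for node, par in tree.items():
    print(node, "<-", par)
```

On the example matrix this places mutations 1 and 3 directly below the root, mutation 2 below 1, mutation 4 below 2, and mutation 5 below 3, with E hanging off the node for mutation 1.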

Unrooted case
Switch the values in each column so that 0 is the majority element, then apply the algorithm for the rooted case.
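
The column switch can be sketched as below. It flips every column in which 1 is the strict majority; how ties are handled is not stated on the slide, so leaving tied columns unchanged is an assumption, as is the function name.

```python
def normalize_to_majority_zero(M):
    """Flip each column in which 1 is the strict majority, so that 0 becomes the majority value."""
    nrows = len(M)
    flip = [2 * sum(row[j] for row in M) > nrows for j in range(len(M[0]))]
    return [[1 - x if flip[j] else x for j, x in enumerate(row)] for row in M]

M = [
    [1, 0],
    [1, 1],
    [1, 0],
    [0, 0],
]
print(normalize_to_majority_zero(M))
```

The normalized matrix can then be passed to a rooted construction that treats 0 as the ancestral state.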

Handling recombination
A tree is not sufficient, as a sequence may have 2 parents. Recombination leads to loss of correlation between columns.

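Linkage (Dis)-equilibrium (LD)
Consider sites A and B.
Case 1: No recombination
–Pr[(A,B) = (0,1)] = 0.25: linkage disequilibrium
Case 2: Extensive recombination
–Pr[(A,B) = (0,1)] = 0.125: linkage equilibrium
Under linkage equilibrium, the joint frequency equals the product of the marginal frequencies, Pr[A = 0] · Pr[B = 1].
[Figure: two tables of (A, B) haplotypes, one per case.]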
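
The comparison behind the two cases can be sketched numerically: compute the observed joint frequency of (A, B) = (0, 1), the product of the marginal frequencies, and their difference. The eight two-site haplotypes below are illustrative values chosen so that the frequencies match the 0.25 and 0.125 quoted on the slide; `ld_summary` is a hypothetical helper name.

```python
def ld_summary(haplotypes, a="0", b="1"):
    """Observed joint frequency of (a, b), its expectation under independence, and the gap."""
    n = len(haplotypes)
    joint = sum(h == a + b for h in haplotypes) / n
    p_a = sum(h[0] == a for h in haplotypes) / n   # marginal frequency at site A
    p_b = sum(h[1] == b for h in haplotypes) / n   # marginal frequency at site B
    expected = p_a * p_b
    return joint, expected, joint - expected

haps = ["01", "01", "00", "00", "10", "10", "10", "10"]
joint, expected, D = ld_summary(haps)
print(joint, expected, D)
```

A gap of zero corresponds to linkage equilibrium; a nonzero gap, as in this sample, indicates linkage disequilibrium.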
