Presentation transcript:

Clustering Overview

Algorithm:
Begin with all sequences in one cluster.
While splitting some cluster improves the objective function: {
    Split each cluster in two so that the objective function has the greatest possible improvement at that step.
    Reassign individual sequences to a different cluster while doing so improves the objective function.
}

The Objective Function

$$\sum_{i<j} I(X_i; X_j \mid C) \;-\; \beta \, I(X; C)$$

We want to successively minimize this expression at each iteration.

First term, I(X_i; X_j | C): measures the mutual information between pairs of positions within each cluster.
Second term, I(X; C): measures the mutual information between a sample of each position within a cluster and the overall distribution of values of these positions.
β is a factor that adjusts the relative importance of the two terms.
The two terms work against each other, so the clustering approaches a steady state after several iterations.

Independent Sites

If several populations were placed in one group, then knowing the value of one position would provide information about the value of another position (there would be mutual information between positions). This is because each subpopulation has certain sets of variants that are more common in it than in other populations.

[Figure: samples from populations A and B mixed together, with values shown at positions i and j.]

If samples from two populations were mixed together, knowing that position i has value A or C tells us that position j is probably A, and knowing that i is T or G tells us that position j is probably T.

If individuals were separated into their own subpopulations, knowing the value of one position would not provide any more information about the value at another position (so there is no mutual information between positions).

[Figure: samples from population A only, with values shown at positions i and j.]

Knowing the value at position i does not tell us anything new about the value at position j, because the value at position j is always the same.
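The mixed and separated cases described above can be checked numerically. The following is a minimal Python sketch, not the poster's implementation: the four samples and the function name `mutual_information` are assumptions chosen to mirror the example, and mutual information is estimated directly from empirical frequencies, in bits.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Empirical mutual information (in bits) between the two coordinates of `pairs`."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # Sum of p(a, b) * log2(p(a, b) / (p(a) * p(b))) over observed pairs.
    return sum(
        (count / n) * log2(count * n / (left[a] * right[b]))
        for (a, b), count in joint.items()
    )


# Hypothetical samples: (value at position i, value at position j).
population_a = [("A", "A"), ("C", "A")]
population_b = [("T", "T"), ("G", "T")]

# Mixed together, position i predicts position j.
mixed = mutual_information(population_a + population_b)
# Within population A alone, position j is constant.
separated = mutual_information(population_a)

print(f"mixed: {mixed:.3f} bits, population A alone: {separated:.3f} bits")
```

With these made-up samples the mixed group carries one full bit of mutual information between positions i and j, while population A on its own carries none.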
So removing mutual information between positions increases the likelihood that the samples are sorted into their respective subpopulations.

[Figure: a mixed cluster (A and B together) compared with separate clusters A and B.
Mixed cluster: higher mutual information between positions; lower mutual information between position values and cluster.
Separate clusters: lower mutual information between positions; higher mutual information between position values and cluster.]

Homogeneous Clusters

Maximize I(X;C) by increasing the correlation between the values at each position and the cluster assignments. Taken alone, this term favors many clusters (in the extreme, each sequence having its own cluster). Maximizing I(X;C) increases the chance that any sample of a variant at a position is highly representative of the entire distribution of values for that position in the cluster.

[Figure: rearranging cluster assignments for sequences GG and GT. Before: the probability that sequence GG is in population A is 75%, and the probability that GT is in A is 25%. After: the probability that GG is in population B is 100%, and the probability that GT is in population C is 100%.]

Population Substructure using Information Theory

Edward Shyu, Computer Science and Engineering, University of California, San Diego

Applications: Population Substructure and Disease Association; HIV Evolution; Alu Phylogeny

Abstract

Population substructure arises when subgroups of organisms evolve separately from other subgroups, resulting in genetic variation that is common within subgroups and different across subgroups. Finding these subpopulations based on genetic variation can take many approaches. Distance-based clustering reaches its limits when subgroups are highly overlapping and the mutation rate equals or exceeds the mutation distance between groups. Methods based on information theory (mutual information) make it possible to find substructure in these cases.
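As a rough illustration of how the two terms of the objective trade off, the sketch below evaluates the objective for a few candidate cluster assignments. It is a sketch under stated assumptions, not the authors' implementation: I(X;C) is taken to be the sum over positions of I(X_i;C), all quantities are estimated from empirical frequencies in bits, β defaults to 1, and the four two-position sequences and the names `objective` and `mutual_information` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(pairs):
    """Empirical mutual information (in bits) between the two coordinates of `pairs`."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (count / n) * log2(count * n / (left[a] * right[b]))
        for (a, b), count in joint.items()
    )


def objective(sequences, labels, beta=1.0):
    """sum_{i<j} I(X_i; X_j | C) - beta * sum_i I(X_i; C); lower is better."""
    n = len(sequences)
    length = len(sequences[0])

    # Group the sequences by their cluster label.
    clusters = {}
    for seq, label in zip(sequences, labels):
        clusters.setdefault(label, []).append(seq)

    # Conditional term: per-cluster pairwise information, weighted by p(c).
    within = 0.0
    for members in clusters.values():
        weight = len(members) / n
        for i, j in combinations(range(length), 2):
            within += weight * mutual_information([(s[i], s[j]) for s in members])

    # Assumption: I(X; C) is the sum over positions of I(X_i; C).
    between = sum(
        mutual_information([(seq[i], label) for seq, label in zip(sequences, labels)])
        for i in range(length)
    )
    return within - beta * between


# Hypothetical data: two sequences from each of two subpopulations.
sequences = ["AA", "CA", "TT", "GT"]

one_cluster = objective(sequences, [0, 0, 0, 0])
correct_split = objective(sequences, [0, 0, 1, 1])
wrong_split = objective(sequences, [0, 1, 0, 1])

print(one_cluster, correct_split, wrong_split)
```

On this toy data the split that matches the two subpopulations gives the lowest objective, the mismatched split is in between, and the single cluster is highest. The mismatched split still beats the single cluster because the I(X;C) term rewards any additional clusters, which is the reason β is needed to balance the two terms.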
Sponsored by the California Institute of Telecommunications and Information Technology

Sean O'Rourke, Computer Science and Engineering, University of California, San Diego
Eleazar Eskin, Computer Science and Engineering, University of California, San Diego

Population Substructure and Disease Association

Disease association studies find correlations between genetic variants (such as single nucleotide polymorphisms, SNPs) and phenotypes such as disease traits. These studies assume that the population sample being studied is homogeneous. A mixture of different subpopulations skews association analysis: if any subpopulation has a higher incidence of disease, any variant specific to that subpopulation will appear to correlate with the disease. Finding substructure allows disease association analysis to be performed within subpopulations, reducing the chance of false positives in the results.

The algorithm was run on a set of 1598 SNP positions from 23 African Americans, 24 Asian Americans and 24 European Americans. All individuals were correctly assigned to their original subgroup. On a reduced set of 80 SNPs the algorithm achieved 91.8% accuracy; another algorithm by Price et al. achieved 90.1% accuracy on the same data.

HIV Evolution

HIV consists of three major groups (M, N, O), with 9 genetic subtypes (A, B, C, D, F, G, H, J, K) within group M. Since HIV has a high mutation rate (6 times that of typical DNA) and high recombination rates, finding substructure using conventional methods is difficult. The polymerase subset of 442 HIV-1 sequences from the Los Alamos HIV database was run through the algorithm, and the resulting subgroups successfully separated sequences based on geographic location. More subgroups were found in the African continent, where HIV is particularly diverse.

Data results from Separation of Overlapping Subpopulations by Mutual Information, by Sean O'Rourke, Gal Chechik and Eleazar Eskin.
Alu Phylogeny

Alus are short interspersed nucleotide elements (SINEs) that, like viruses, copy their DNA and reinsert themselves elsewhere in the genome (but, unlike viruses, do not form a protein coat that would let their copies escape the organism). Several active elements have the ability to duplicate themselves, which results in groups of Alus that descended from a particular ancestral Alu. Constructing the phylogeny (family tree) of Alus is difficult because the subgroups overlap extensively. For example, some subgroups are separated by an average Hamming distance of 12.8 mutations, while the average member of each population differs from the consensus by 34.8 mutations. The algorithm found the same subgroups as a previous method by Price et al.

[Figure: Alu element duplication and insertion of the copy elsewhere in the genome. Picture from Alu Repeats and Human Genomic Diversity by Mark A. Batzer and Prescott L. Deininger. Nature Reviews Genetics 3, (2002); doi: /nrg798.]

Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Price AL, Eskin E, Pevzner PA. Genome Res Nov;14(11):