
©CMBI 2006 The WWW of clustering in Bioinformatics, or: how Homo sapiens thinks in little boxes when clustering.

©CMBI 2006 Disclaimer I know nothing about the actual clustering techniques; for that you must ask Lutgarde, or Ron, or any of their colleagues. I will talk today about fields, mainly bioinformatics, where clustering is being used.

©CMBI 2006 My daily clustering

©CMBI 2006 Remember bioinformatics 1? Why cluster sequences? The main reason for aligning sequences is that it allows us to transfer information. If there are many sequences available, clustering can help us figure out from which of those sequences we can best transfer that information.

©CMBI 2006 My daily clustering Take, for example, the three sequences:
1 ASWTFGHK
2 GTWSFANR
3 ATWAFADR
and you see immediately that 2 and 3 are close, while 1 is further away. So the tree will look roughly like: [Figure: 2 and 3 grouped, 1 on a longer branch.]

©CMBI 2006 Clustering sequences; start with distances [Figure: matrix of pair-wise distances between five sequences A–E.] D and E are the closest pair. Take them, and collapse the matrix by one row/column.
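A minimal sketch in Python of this collapse step. The slide's actual matrix values are not reproduced here, so the distances below are made up, and the averaging rule (UPGMA-style) is an assumption:

```python
import numpy as np

# Hypothetical pair-wise distances between five sequences A..E.
labels = ["A", "B", "C", "D", "E"]
D = np.array([
    [0.0, 2.0, 6.0, 7.0, 7.5],
    [2.0, 0.0, 6.5, 7.2, 7.7],
    [6.0, 6.5, 0.0, 4.0, 4.2],
    [7.0, 7.2, 4.0, 0.0, 1.0],
    [7.5, 7.7, 4.2, 1.0, 0.0],
])

# Find the closest pair, ignoring the zero diagonal.
masked = D.copy()
np.fill_diagonal(masked, np.inf)
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print("closest pair:", labels[i], labels[j])   # -> D E

# Collapse the matrix by one row/column: replace i and j by their
# average distance to everything else (a UPGMA-style update).
keep = [k for k in range(len(D)) if k not in (i, j)]
new_labels = [labels[k] for k in keep] + [labels[i] + labels[j]]
merged = (D[i, keep] + D[j, keep]) / 2.0
new_D = np.zeros((len(keep) + 1, len(keep) + 1))
new_D[:len(keep), :len(keep)] = D[np.ix_(keep, keep)]
new_D[-1, :len(keep)] = merged
new_D[:len(keep), -1] = merged
print(new_labels)   # ['A', 'B', 'C', 'DE']
print(new_D)
```

Repeating this step until one cluster remains builds the tree shown on the next slides.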

©CMBI 2006 Clustering sequences [Figure: D and E merged; A and B are the next closest pair.]

©CMBI 2006 Clustering sequences [Figure: C joins the DE cluster.]

©CMBI 2006 Clustering sequences [Figure: finished tree over A–E.] So it really looks as if we have two clusters, AB and CDE. But feel free to call it three clusters…

©CMBI 2006 So, nice tree, but what did we actually do?
1) We determined a distance measure
2) We measured all pair-wise distances
3) We collapsed everything into ~1½ dimensions
4) We used an algorithm to visualize things
5) We decided on the number of clusters
And that, ladies and gentlemen, is called clustering…
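The five steps map directly onto standard library calls. A hedged sketch with scipy, reusing the illustrative (not the slide's) distance matrix from before:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform

# Steps 1-2: a distance measure and all pair-wise distances.
labels = ["A", "B", "C", "D", "E"]
square = np.array([
    [0.0, 2.0, 6.0, 7.0, 7.5],
    [2.0, 0.0, 6.5, 7.2, 7.7],
    [6.0, 6.5, 0.0, 4.0, 4.2],
    [7.0, 7.2, 4.0, 0.0, 1.0],
    [7.5, 7.7, 4.2, 1.0, 0.0],
])
condensed = squareform(square)

# Step 3: collapse everything into a tree (average linkage = UPGMA).
Z = linkage(condensed, method="average")

# Step 4: visualize (drop no_plot=True to draw it, needs matplotlib).
dendrogram(Z, labels=labels, no_plot=True)

# Step 5: decide on the number of clusters.
clusters = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(labels, clusters)))   # e.g. {'A': 1, 'B': 1, 'C': 2, 'D': 2, 'E': 2}
```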

©CMBI 2006 Back to my daily clustering 1 ASWTFGHK 2 GTWSFANR 3 ATWAFADR Actually I cheated a bit: 1 is closer to 3 than to 2, because of the A at position 1. How can we express this in the tree? For example, by flipping branches so that 1 is drawn next to 3. I will call this tree-flipping.
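As an aside: scipy can automate this kind of tree-flipping. optimal_leaf_ordering flips subtrees of a dendrogram, without changing the tree topology, so that adjacent leaves are as similar as possible. A hedged sketch with toy 2-D points standing in for the three sequences (an illustration, not the slide's method):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import pdist

# Toy points standing in for sequences 1, 2, 3: points 0 and 2 are
# close together, like sequences 1 and 3 with their shared leading A.
X = np.array([[0.0, 0.0],    # "sequence 1"
              [5.0, 1.0],    # "sequence 2"
              [1.0, 0.5]])   # "sequence 3"
d = pdist(X)
Z = linkage(d, method="average")
Z = optimal_leaf_ordering(Z, d)  # flip subtrees so similar leaves sit side by side
print(leaves_list(Z))            # left-to-right leaf order after flipping
```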

©CMBI 2006 Can we generalize tree-flipping? To generalize tree-flipping, sequences must be placed 'distance-correct' in one dimension, and then connected, as we did before. So now most of the information sits in the horizontal dimension. Can we use the vertical dimension usefully?

©CMBI 2006 The problem is actually bigger 1 ASWTFGHK 2 GTWSFANR 3 ATWAFADR d(i,j) is the distance between sequences i and j. d(1,2)=6; d(1,3)=5; d(2,3)=3. But what if a 4th sequence is added with d(1,4)=4, d(2,4)=5, d(3,4)=4? Where would that sequence sit? [Figure: a perfect representation of 1, 3 and 2 with all pair-wise distances correct.]
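The d(i,j) used here simply counts mismatched positions, i.e. a Hamming distance. A minimal sketch that reproduces the slide's numbers:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

seqs = {1: "ASWTFGHK", 2: "GTWSFANR", 3: "ATWAFADR"}
for i in (1, 2):
    for j in range(i + 1, 4):
        print(f"d({i},{j}) = {hamming(seqs[i], seqs[j])}")
# Prints d(1,2) = 6, d(1,3) = 5, d(2,3) = 3, matching the slide.
```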

©CMBI 2006 Projection to visualize clusters [Map projections, source: Wikipedia — Fuller projection (unfolded Dymaxion map); gnomonic projection (correct distances); political projection; Mercator projection.]

©CMBI 2006 Back to sequences:
1 ASASDFDFGHKMGHS
2 ASASDFDFRRRLRHS
5 ASASDFDFRRRLRIT
3 ASLPDFLPGHSIGHS
6 ASLPDFLPGHSIGIT
4 ASLPDFLPRRRVRIT
The more dimensions we retain, the less information we lose. The tree is now in 3D…

©CMBI 2006 Projection to visualize clusters We want to reduce the dimensionality with minimal distortion of the pair-wise distances. One way is eigenvector determination, i.e. principal component analysis (PCA).
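A minimal PCA sketch with plain numpy and synthetic data (both assumptions): project the points onto the eigenvectors of their covariance matrix, so that the retained dimensions carry as much of the variance, and hence of the pair-wise distance structure, as possible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 1] = 3.0 * X[:, 0] + 0.1 * X[:, 1]   # make the data nearly one-dimensional

Xc = X - X.mean(axis=0)                   # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # largest-variance directions first
components = eigvecs[:, order]

scores = Xc @ components[:, :2]           # keep only the top 2 dimensions
print("projected shape:", scores.shape)
explained = eigvals[order] / eigvals.sum()
print("variance explained per PC:", explained.round(3))
```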

©CMBI 2006 PCA to the rescue Now we have made the data one-dimensional, while the second, vertical, dimension is noise. If we did this correctly, we retained as much information as possible.

©CMBI 2006 One problem, though… Can we actually draw a straight line through the points? It looks to me as if the best line is not straight but bent; and if that is true, we have lost some kind of information! But that is a data-modelling / clustering problem, not a PCA problem…

©CMBI 2006 Back to sequences: If we have N sequences, we can only draw their distance matrix exactly in an (N-1)-dimensional space. By the time it is a tree, how many dimensions, and how much information, have we lost? Perhaps we should cluster in a different way?

©CMBI 2006 Cluster on critical residues?
QWERTYAKDFGRGH
AWTRTYAKDFGRPM
SWTRTNMKDTHRKC
QWGRTNMKDTHRVW
Gray = conserved (no information for clustering). Red = variable. Green = correlated (no noise).
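One way to make "cluster on critical residues" concrete is to score each alignment column, so that clustering can ignore the conserved, information-free positions. A hedged sketch using per-column Shannon entropy (the entropy score is my illustration, not the slide's method):

```python
import numpy as np
from collections import Counter

alignment = [
    "QWERTYAKDFGRGH",
    "AWTRTYAKDFGRPM",
    "SWTRTNMKDTHRKC",
    "QWGRTNMKDTHRVW",
]
columns = list(zip(*alignment))

def column_entropy(col):
    """Shannon entropy of one column: 0 bits = fully conserved."""
    counts = np.array(list(Counter(col).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Conserved columns carry no information for clustering; keep the rest.
for pos, col in enumerate(columns):
    tag = "conserved" if len(set(col)) == 1 else "variable"
    print(pos, "".join(col), f"H={column_entropy(col):.2f}", tag)
```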

©CMBI 2006 Cluster based on correlated residues

©CMBI 2006 Conclusion about sequence clustering Important topics:
1. Distance measure (~ data selection)
2. Data-modelling / algorithm
3. Visualization
4. Dimensionality reduction

©CMBI 2006 We don’t only cluster sequences… Other cluster procedures are found in e.g.: 1.Structure prediction, analysis, etc; 2.Cell-sorting; 3.HIV medication choice; 4.Cancer radiation and chemotherapy regimes; 5.Design of food taste product combinations; and many more, often very different, fields of bio-related topics. The next few slides show some examples.

©CMBI 2006 Other sciences that cluster: Determination of a structure from massive averaging of very low resolution data. This is often iterative.

©CMBI 2006 Microarray data

©CMBI 2006 Brain imaging One way of measuring is via hemoglobin. Many patients or test persons must be averaged; clustering determines who can be averaged. Distance measure? Cluster with picture deformation. [Images: Donders Institute, Waisman, and MPG websites.]

©CMBI 2006 Molecular dynamics [Animation: Bert de Groot.] These are actually many superposed motions acting at the same time. Essential dynamics, an eigenvector analysis in distance space, separates those motions.

©CMBI 2006 Essential dynamics: motion along eigenvector 1. [Animation: Daan van Aalten.]

©CMBI 2006 Summary: domain knowledge is needed These examples make clear that domain knowledge is needed to cluster biological data. Everybody can cluster this. But this? [Figures: an obvious clustering and a difficult one.]

©CMBI 2006 Even informatics now knows: Michele Ceccarelli and Antonio Maratea (2006). Improving Fuzzy Clustering of Biological Data by Side Information*. International Journal of Approximate Reasoning. Abstract: Semi-supervised methods use a small amount of auxiliary information as a guide in the learning process in the presence of unlabeled data. When using a clustering algorithm, the auxiliary information has the form of Side Information, that is, a list of co-clustered points. Recent literature shows better performance of these methods with respect to totally unsupervised ones, even with a small amount of Side Information. This fact suggests that the use of semi-supervised methods may be useful especially in very difficult and noisy tasks where little a priori information is available, as is the case for data deriving from biological experiments. * "Side information" is informatics-speak for domain knowledge.

©CMBI 2006 Cluster techniques: K-means Simply speaking, k-means clustering is an algorithm to classify or group your objects into K groups based on their attributes/features, where K is a positive integer. The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid. Thus the purpose of k-means clustering is to classify the data. But how many clusters? The elbow rule-of-thumb. [Figure: the tree over A–E again.]
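A minimal k-means sketch with a crude elbow check, on synthetic 2-D data (the data and iteration counts are assumptions): the within-cluster sum of squares drops sharply until K reaches the true number of clusters, then flattens; the bend is the "elbow".

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: assign each point to the nearest centroid, recompute."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, inertia

# Two well-separated blobs, so the true K is 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

# Elbow rule-of-thumb: inertia drops steeply until K=2, then flattens.
for k in range(1, 6):
    _, inertia = kmeans(X, k)
    print(k, round(inertia, 1))
```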

©CMBI 2006 Cluster techniques: Neural nets For example: self-organizing maps. Every data-point must be represented by a vector of length N. Every square in this map becomes a random vector of length N. For every data-point the inner product is calculated with every map-vector. The best fitting map-vector is averaged with 1/q times the data vector. The neighbours in the map are averaged with 1/(p·q) times the data vector. This is iterated many times with normalisations in between. [Figure: grid with the winning cell weighted 1/q and its neighbours 1/(p·q).]
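A hedged sketch of a one-dimensional self-organizing map (real SOMs usually use a 2-D grid, and the slide matches by inner product where this sketch uses Euclidean distance, the more common variant). The 1/q and 1/(p·q) update weights follow the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))   # data points, vectors of length 3
grid = rng.normal(size=(10, 3))    # map vectors, one per grid cell

q, p = 10.0, 2.0                   # the slide's 1/q and 1/(p*q) weights
for _ in range(50):                # iterate many times
    for x in data:
        best = np.argmin(np.linalg.norm(grid - x, axis=1))  # best-fitting cell
        grid[best] += (x - grid[best]) / q                   # move the winner
        for nb in (best - 1, best + 1):                      # its neighbours
            if 0 <= nb < len(grid):
                grid[nb] += (x - grid[nb]) / (p * q)         # move them less
print(grid.round(2))
```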

©CMBI 2006 Support Vector Machines The idea is to split a dataset in two parts, for example binding ligands in green and non-binding ones in red. The parameters P and Q describe the ligands. The SVM finds the best separating line. Obviously this must be used in N-dimensional space with an (N-1)-dimensional hyperplane. I believe that a mathematical theorem exists that tells us that this still works if N is (much) bigger than the number of data points. [Figure: green and red points in the P–Q plane, separated by a line.]
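A hedged sketch with scikit-learn (my choice; the slide names no software): two descriptors P and Q per ligand, binding (1) versus non-binding (0), and a linear SVM that finds the maximally separating line.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative data: binding ligands cluster around (1, 1) in the
# P-Q plane, non-binding ligands around (-1, -1).
rng = np.random.default_rng(0)
PQ_binding = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(20, 2))
PQ_nonbinding = rng.normal(loc=[-1.0, -1.0], scale=0.3, size=(20, 2))
X = np.vstack([PQ_binding, PQ_nonbinding])
y = np.array([1] * 20 + [0] * 20)

# A linear SVM finds the best separating line (a hyperplane in general).
clf = SVC(kernel="linear").fit(X, y)
print("separating line: %.2f*P + %.2f*Q + %.2f = 0"
      % (clf.coef_[0][0], clf.coef_[0][1], clf.intercept_[0]))
print("prediction for (P, Q) = (0.8, 0.9):", clf.predict([[0.8, 0.9]]))
```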

©CMBI 2006 Summary There are many ways to describe the data. There are many clustering techniques. There are many visualisation methods. What to use depends each time on the biological question. Domain knowledge is a prerequisite for success.