Clustering and Classification – Introduction to Machine Learning BMI 730 Kun Huang Department of Biomedical Informatics Ohio State University.

How do we use microarrays?
- Profiling
- Clustering
  - Cluster to detect patient subgroups
  - Cluster to detect gene clusters and regulatory networks

Clustering and Classification
- Preprocessing
- Distance measures
- Popular algorithms (not necessarily the best ones)
- More sophisticated ones
- Evaluation
- Data mining

- Clustering or classification?
- Is training data available?
- What domain-specific knowledge can be applied?
- What preprocessing of data is needed?
  - Log / data scale and numerical stability
  - Filtering / denoising
  - Nonlinear kernel
  - Feature selection (do I need to use all the data?)
- Is the dimensionality of the data too high?

How do we process microarray data (clustering)?
- Feature selection: genes, transformations of expression levels.
- Genes discovered in the class comparison (t-test). Risk: missing genes.
- Iterative approach: select genes under different p-value cutoffs, then keep the gene set with good cross-validation performance.
- Principal components (pros and cons).
- Discriminant analysis (e.g., LDA).
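The class-comparison step can be sketched as ranking genes by a two-sample t-statistic. This is a minimal illustration, not the procedure used in the lecture: the gene names and expression values are made up, Welch's unequal-variance form of the t-statistic is an assumption, and a gene with zero variance in both classes would need separate handling.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_genes(expr, labels, top=2):
    """Return the `top` gene names with the largest |t| between class 0 and class 1.

    expr:   dict mapping gene name -> list of expression values (one per sample)
    labels: list of 0/1 class labels, in the same sample order
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Made-up data: geneA separates the two classes, geneB and geneC do not.
expr = {
    "geneA": [1.0, 1.2, 0.9, 5.1, 4.8, 5.3],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "geneC": [3.0, 1.0, 2.0, 2.5, 1.5, 2.0],
}
labels = [0, 0, 0, 1, 1, 1]
print(rank_genes(expr, labels, top=1))
```

If the selected genes are later fed to a classifier, the ranking should be repeated inside each cross-validation fold; otherwise the held-out samples have already influenced the feature set.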

- Dimensionality Reduction
  - Principal component analysis (PCA)
  - Singular value decomposition (SVD)
  - Karhunen-Loeve transform (KLT)
[Figure: SVD, with the basis for P labeled]

- Principal Component Analysis (PCA)
  - Other things to consider
    - Numerical balance / data normalization
    - Noisy directions
    - Continuous vs. discrete data
  - Principal components are orthogonal to each other; biological data are not
  - Principal components are linear combinations of the original data
  - Prior knowledge is important
  - PCA is not clustering!
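To make "principal components are linear combinations of the original data" concrete, here is a small sketch that finds the leading component by power iteration on the sample covariance matrix. Power iteration is a choice made for this illustration (the slides do not name an algorithm), the data points are made up, and the sketch assumes the starting vector is not orthogonal to the leading component.

```python
import math

def first_principal_component(rows, iters=200):
    """Leading principal component of `rows` (samples x features) by power iteration."""
    n, d = len(rows), len(rows[0])
    # Center each feature (column) at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(xi[a] * xi[b] for xi in x) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Variance captured along v (Rayleigh quotient v' C v).
    var = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, var

# Made-up samples lying exactly on the line y = 2x.
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, captured = first_principal_component(data)
print(direction, captured)
```

The returned direction is a weight vector over the original features, which is why normalization matters: a feature measured on a much larger scale will dominate the covariance matrix and therefore the component.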

- Dimensionality reduction: linear discriminant analysis (LDA)
[Figure: classes A and B and the projection direction w. From S. Wu's website]

Linear Discriminant Analysis
[Figure: classes A and B and the projection direction w. From S. Wu's website]

Visualization of Microarray Data
Multidimensional scaling (MDS)
- High-dimensional coordinates are unknown
- Distances between the points are known
- The distance may not be Euclidean, but the embedding maintains the distances in a Euclidean space
- Try different dimensions (from one to ???)
- At each dimension, perform the optimal embedding to minimize the embedding error
- Plot the embedding error (residue) vs. dimension
- Pick the knee point

Visualization of Microarray Data Multidimensional scaling (MDS)

Distance Measure (Metric?)
- What do you mean by "similar"?
- Euclidean
- Uncentered correlation
- Pearson correlation

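Distance Metric
- Euclidean: d_E(Lip1, Ap1s1) = 12883
[Table: expression profiles of the Lip1 and Ap1s1 probe sets]

Distance Metric
- Pearson correlation: d_P(Lip1, Ap1s1) = 0.904
[Table: expression profiles of the Lip1 and Ap1s1 probe sets]

Distance Metric
- Pearson correlation ranges from 1 to -1.
[Figure: examples with r = 1 and r = -1]

Distance Metric
- Uncentered correlation: d_u(Lip1, Ap1s1) = [value lost], about 33.4°
[Table: expression profiles of the Lip1 and Ap1s1 probe sets]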

Distance Metric
- Difference between Pearson correlation and uncentered correlation
  - Pearson correlation: a baseline expression level is possible
  - Uncentered correlation: all values are considered signal
[Table: expression profiles of the Lip1 and Ap1s1 probe sets]
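The three measures can be compared side by side with a short sketch. The two profiles below are made up (they are not the Lip1 and Ap1s1 values from the slides); the second has the same shape as the first but sits on a baseline of 100.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def uncentered_corr(x, y):
    """Cosine of the angle between x and y (no mean subtraction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson_corr(x, y):
    """Uncentered correlation of the mean-centered profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentered_corr([a - mx for a in x], [b - my for b in y])

x = [1.0, 2.0, 3.0, 4.0]
y = [101.0, 102.0, 103.0, 104.0]   # same shape as x, shifted by a baseline
print(euclidean(x, y))        # large: dominated by the baseline offset
print(pearson_corr(x, y))     # close to 1: the baseline is subtracted out
print(uncentered_corr(x, y))  # below 1: the baseline is treated as signal
```

Since the uncentered correlation is a cosine, it can also be reported as an angle via `math.degrees(math.acos(r))`.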

Distance Metric -Difference between Euclidean and correlation

Distance Metric
- What the measures above miss: a negative correlation may also mean "close" in a signaling pathway. Candidate distances: 1-|PCC|, 1-PCC^2.
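A brief sketch of how the sign-insensitive variants behave, using two made-up profiles that are perfectly anti-correlated. The function names are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def dist_signed(x, y):
    """1 - PCC: anti-correlated profiles end up far apart (distance near 2)."""
    return 1.0 - pearson(x, y)

def dist_abs(x, y):
    """1 - |PCC|: strongly anti-correlated profiles count as close."""
    return 1.0 - abs(pearson(x, y))

def dist_squared(x, y):
    """1 - PCC^2: same idea, with a softer penalty for weak correlations."""
    return 1.0 - pearson(x, y) ** 2

up = [1.0, 2.0, 3.0, 4.0]     # made-up gene that rises across samples
down = [8.0, 6.0, 4.0, 2.0]   # made-up gene that falls across the same samples
print(dist_signed(up, down), dist_abs(up, down), dist_squared(up, down))
```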

How do we process microarray data (clustering)? - Unsupervised Learning – Hierarchical Clustering

How do we process microarray data (clustering)?
- Unsupervised Learning: Hierarchical Clustering
  - Single linkage: the linking distance is the minimum distance between two clusters.

How do we process microarray data (clustering)?
- Unsupervised Learning: Hierarchical Clustering
  - Complete linkage: the linking distance is the maximum distance between two clusters.

How do we process microarray data (clustering)?
- Unsupervised Learning: Hierarchical Clustering
  - Average linkage/UPGMA: the linking distance is the average of all pairwise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).

How do we process microarray data (clustering)?
- Unsupervised Learning: Hierarchical Clustering
  - Single linkage: prone to chaining and sensitive to noise
  - Complete linkage: tends to produce compact clusters
  - Average linkage: sensitive to the distance metric
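The three linkage rules differ only in how the pairwise distances between two clusters are combined. The following naive sketch (quadratic search, one-dimensional made-up points) makes the chaining behavior of single linkage visible; it is meant for illustration, not as an efficient implementation.

```python
def agglomerate(points, k, linkage="single"):
    """Merge clusters bottom-up until k remain. Points are 1-D numbers here."""
    combine = {
        "single": min,                             # closest pair of members
        "complete": max,                           # farthest pair of members
        "average": lambda ds: sum(ds) / len(ds),   # mean over all pairs (UPGMA)
    }[linkage]
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ds = [abs(a - b) for a in clusters[i] for b in clusters[j]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]    # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

# Made-up points with slowly growing gaps, then one distant point.
points = [0.0, 1.0, 2.1, 3.3, 4.6, 9.0]
print(agglomerate(points, 3, "single"))    # one long chain plus two singletons
print(agglomerate(points, 3, "complete"))  # two compact groups plus one singleton
```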

-Unsupervised Learning – Hierarchical Clustering

Dendrograms
- Distance: the height of each horizontal line represents the distance between the two groups it merges.
- Order: the open-source R software uses the convention that the tighter clusters are on the left. Others have proposed ordering by expression values, loci on chromosomes, and other ranking criteria.

- Unsupervised Learning: K-means
  - Vector quantization
  - K-D trees
  - Need to try different K; sensitive to initialization

- Unsupervised Learning: K-means
[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20);  % 4 is K (number of clusters); 'corr' is the distance metric (correlation)

- Unsupervised Learning: K-means
  - The number of classes K needs to be specified
  - Not guaranteed to reach the global optimum: a run may stop at a local minimum or hit the iteration limit
  - Sensitive to initialization
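The sensitivity to initialization is usually handled with random restarts, keeping the run with the lowest within-cluster sum of squares. Below is a minimal sketch of Lloyd's algorithm with restarts. It uses squared Euclidean distance rather than a correlation distance, the data are made up, and the parameter names (`replicates`, `iters`, `seed`) are choices made for this illustration.

```python
import random

def sqdist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, replicates=20, iters=100, seed=0):
    """Lloyd's algorithm with random restarts; returns (labels, centers, sse)."""
    rng = random.Random(seed)
    best = None
    for _ in range(replicates):
        centers = rng.sample(points, k)   # start from k distinct data points
        for _ in range(iters):
            # Assignment step: each point goes to its nearest center.
            labels = [min(range(k), key=lambda c: sqdist(p, centers[c]))
                      for p in points]
            # Update step: each center moves to the mean of its members.
            new_centers = []
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:
                    new_centers.append(
                        tuple(sum(x) / len(members) for x in zip(*members)))
                else:
                    new_centers.append(centers[c])   # empty cluster: keep center
            if new_centers == centers:               # assignments are stable
                break
            centers = new_centers
        sse = sum(sqdist(p, centers[l]) for p, l in zip(points, labels))
        if best is None or sse < best[2]:
            best = (labels, centers, sse)            # keep the tightest run
    return best

# Two made-up, well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers, sse = kmeans(points, 2)
print(labels, sse)
```

Each individual run stops once the assignments no longer change, but different starting centers can lead to different final partitions, which is why only the best of the replicates is returned.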

-Unsupervised Learning - K-means

- Unsupervised Learning
  - Self-organizing maps (SOM)
    - Neural-network-based method
    - Originally used to visualize (embed) high-dimensional data
    - Also related to vector quantization
    - The idea is to map close data points to the same discrete level

- Issues
  - Lack of consistency or representative features (5.3 TP PTEN doesn't make sense)
  - Data structure is missing
  - Not robust to outliers and noise
D'Haeseleer 2005, Nat. Biotechnol. 23(12)

- Model-based clustering methods (Han)
Pan et al., Genome Biology, research0009

-Structure-based clustering methods

- Supervised Learning
  - Support vector machines (SVM) and kernels
    - Only a (binary) classifier, no data model

- Supervised Learning
  - Support vector machines (SVM) and kernels
    - Kernel: nonlinear mapping

- Supervised Learning
  - Naïve Bayesian classifier
    - Bayes rule
    - Maximum a posteriori (MAP)
[Equation: Bayes rule, with the prior probability and the conditional probability labeled]
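A MAP decision under the naïve independence assumption can be sketched as follows. Modeling each feature with a per-class Gaussian is an assumption of this sketch (the slides do not specify the conditional distributions), and the class names and measurements are made up.

```python
import math
from collections import defaultdict

def fit(samples, labels):
    """Estimate class priors and per-feature Gaussian parameters."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        prior = len(rows) / len(samples)
        stats = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            # Small floor keeps the variance strictly positive.
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mu, var))
        model[y] = (prior, stats)
    return model

def predict(model, x):
    """MAP decision: argmax over classes of log prior + sum of log likelihoods."""
    def log_posterior(y):
        prior, stats = model[y]
        ll = sum(-0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
                 for v, (mu, var) in zip(x, stats))
        return math.log(prior) + ll
    return max(model, key=log_posterior)

# Made-up two-gene expression measurements for two classes.
samples = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2),
           (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
model = fit(samples, labels)
print(predict(model, (1.1, 2.1)), predict(model, (5.1, 5.9)))
```

The evidence term P(x) is the same for every class, so it is dropped from the comparison; working with log probabilities avoids numerical underflow when many genes are used.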

- Accuracy vs. generality
- Overfitting
- Model selection
[Figure: prediction error vs. model complexity for the training sample and the testing sample; reproduced from Hastie et al.]

- Assessing the Validity of Clusters
  - Most clustering algorithms do not assume any structure or a priori relationship among the genes. However, the clusters found should more or less reflect known structures (e.g., pathways). (An interesting research problem is to develop new algorithms that can accommodate such relationships.)
  - If different patients are grouped into clusters, it implies that the disease has subtypes, which is a big claim and must be validated using other methods (e.g., pathology).
  - Relationship with external variables is important. E.g., clusters of cells from different tissue types may correspond to the relationships among the tissues.

- Assessing the Validity of Clusters
  - Where should we cut the dendrogram?
  - Which clustering results should we believe? Different (or even the same) clustering algorithms may find different clusterings.
  - Many tests are flawed, e.g., by circular reasoning: using genes that differ significantly between two classes as features for clustering, then using the clusters to detect signatures, which are again the significantly changed genes.

- Assessing the Validity of Clusters
  - Most clustering algorithms can find clusters even in random data.
  - The clusters found by clustering algorithms should exhibit greater intra-cluster similarity (homogeneity) and larger inter-cluster distance (separation).
  - How to be sure that the clustering is not from random data?
  - How to find a good partition among all possible partitions of the data?
  - How to assess the reproducibility of the partitioning?
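Homogeneity and separation can be turned into two numbers for a given partition. The sketch below uses the mean pairwise distance within clusters and between clusters; other definitions exist (for example, distance to cluster centroids), so this particular choice is an assumption, and the points are made up. It needs at least one within-cluster pair and one between-cluster pair.

```python
import math
from itertools import combinations

def homogeneity_separation(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        (within if lp == lq else between).append(math.dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)

# Made-up points forming two tight pairs that are far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(homogeneity_separation(points, [0, 0, 1, 1]))  # sensible partition
print(homogeneity_separation(points, [0, 1, 0, 1]))  # poor partition
```

A sensible partition has a small first value and a large second value; for the poor partition the ordering is reversed.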

- Assessing the Validity of Clusters
  - Global tests of clustering (meaningful cluster vs. random cluster)
  - Check the distributions of the nearest-neighbor (NN) distances and of the pairwise distances; uniform and multiple distributions are very different
[Figure panels: NN, Pairwise]

- Assessing the Validity of Clusters
  - Reproducibility of clustering
  - Global perturbation methods (McShane et al., Bioinformatics, 2002)
    - Use only the first three principal components (the observation is that they convey the clustering information well enough)
    - Add Gaussian noise and check whether the clustering relationship is still preserved
    - Indices R and D. R: the ratio of same-cluster data pairs that are preserved after the perturbation. D: the discrepancy between best-matched clusters.
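The R index described above can be sketched directly from its definition. To keep the example self-contained, the clustering step is a toy one-dimensional rule that cuts at the widest gap; that rule, the data, and the function names are assumptions for illustration and are not the procedure of McShane et al.

```python
import random
from itertools import combinations

def r_index(original, perturbed):
    """Fraction of pairs co-clustered in `original` that stay co-clustered."""
    same = [(i, j) for i, j in combinations(range(len(original)), 2)
            if original[i] == original[j]]
    kept = sum(1 for i, j in same if perturbed[i] == perturbed[j])
    return kept / len(same)

def split_at_largest_gap(values):
    """Toy 1-D two-cluster rule (assumption of this sketch): cut at the widest gap."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[t + 1]] - values[order[t]] for t in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(values)
    for t, i in enumerate(order):
        labels[i] = 0 if t <= cut else 1
    return labels

def perturbation_r(values, sigma, trials=50, seed=0):
    """Average R over `trials` rounds of added Gaussian noise with std `sigma`."""
    rng = random.Random(seed)
    base = split_at_largest_gap(values)
    scores = []
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, sigma) for v in values]
        scores.append(r_index(base, split_at_largest_gap(noisy)))
    return sum(scores) / trials

values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]   # made-up data with two clear groups
print(perturbation_r(values, sigma=0.01))  # small noise: pairs are preserved
print(perturbation_r(values, sigma=3.0))   # large noise: typically lower
```

The index compares pairs rather than label values, so it is unaffected when the perturbed run numbers its clusters differently.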

How do we process microarray data (clustering)?
- Cross-validation: assessment of the classifier. The key is to strike a balance between accurate classification of the training data and predictive power on new data.
- Training vs. testing (10%)
- Leave-one-out, bootstrapping: for small sample sizes, report the fraction of left-out samples that are predicted correctly.
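Leave-one-out assessment can be sketched in a few lines. The nearest-centroid classifier used here is only a stand-in chosen for brevity (the slides do not prescribe a classifier), and both data sets are made up.

```python
import math

def nearest_centroid_predict(train_x, train_y, x):
    """Assign x to the class whose training mean is closest."""
    best_label, best_d = None, float("inf")
    for label in set(train_y):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = tuple(sum(col) / len(rows) for col in zip(*rows))
        d = math.dist(centroid, x)
        if d < best_d:
            best_label, best_d = label, d
    return best_label

def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

# Made-up data: the first set has separable classes, the second has labels
# that do not follow the geometry at all.
clean_x = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
clean_y = [0, 0, 0, 1, 1, 1]
mixed_x = [(0.0,), (1.0,), (10.0,), (11.0,)]
mixed_y = [0, 1, 0, 1]
print(leave_one_out_accuracy(clean_x, clean_y))
print(leave_one_out_accuracy(mixed_x, mixed_y))
```

Any gene selection or other data-dependent preprocessing belongs inside the loop, applied to the training portion only, so that the held-out sample never influences the model it is scored against.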

Validation
- cDNA or Affymetrix chips measure mRNA levels, which may not reflect final protein concentrations
- Various splice variants exist; the expressed protein may not be active
- Post-translational modification
- Quantitative real-time PCR (RT-PCR) is widely used for this purpose
- Other high-level consideration: correlation does not mean causation

- Data mining is searching for knowledge in data
  - Knowledge mining from databases
  - Knowledge extraction
  - Data/pattern analysis
  - Data dredging
  - Knowledge Discovery in Databases (KDD)

- The process of discovery: interactive + iterative → scalable approaches

Popular Data Mining Techniques
- Clustering: the most dominant technique in use for gene expression analysis in particular and bioinformatics in general.
  - Partition data into groups of similarity
- Classification:
  - Supervised version of clustering → technique to model class membership → can subsequently classify unseen data.
- Frequent Pattern Analysis
  - A method for identifying frequently recurring patterns (structural and transactional).
- Temporal/Sequence Analysis
  - Model temporal data → wavelets, FFT, etc.
- Statistical Methods
  - Regression, discriminant analysis

Summary
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- Other metrics include: density, information entropy, statistical variance, radius/diameter
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Recommended Literature
1. Bioinformatics: The Machine Learning Approach, by P. Baldi & S. Brunak, 2nd edition, The MIT Press.
2. Data Mining: Concepts and Techniques, by J. Han & M. Kamber, Morgan Kaufmann Publishers.
3. Pattern Classification, by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons.
4. The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer-Verlag, 2001.