4.0 - Data Mining Sébastien Lemieux Elitra Canada Ltd.

Lecture 4.01 Overview
What is it?
Dimensionality
Normalization
Dimensionality reduction
Clustering:
–Metrics;
–k-means;
–Hierarchical clustering.
Deriving significance

Lecture 4.02 What is data mining?
Discovers hidden facts in databases.
Uses a combination of machine learning, statistical analysis, modeling techniques and database technology.
Establishes relationships and identifies patterns.
"A technique geared for the user who typically does not know exactly what he's searching for."

Lecture 4.03 Normalization
Why normalize?
Most commonly used approach for spotted array data: loess.
–Local regression is performed on a "sliding window" over the data (next slide).
The Durbin-Rocke normalization:
–"A variance-stabilizing transformation for gene-expression microarray data" (2002), Bioinformatics, 18:S105-S110.

Lecture 4.04 Loess normalization
[Figure illustrating loess normalization; the source attribution was not captured in the transcript.]
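A minimal Python sketch of the idea (not from the original slides): loess normalization of two-channel log-ratios via local regression of M on A, assuming statsmodels is available. The simulated intensities and the smoothing fraction are illustrative.

```python
# Minimal sketch of loess (MA-plot) normalization for two-channel array data.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def loess_normalize(red, green, frac=0.3):
    """Return loess-normalized log-ratios M for two intensity channels."""
    M = np.log2(red) - np.log2(green)           # log-ratio
    A = 0.5 * (np.log2(red) + np.log2(green))   # mean log-intensity
    # Local regression of M on A approximates the intensity-dependent bias;
    # lowess returns the fitted values at the supplied A positions.
    fit = lowess(M, A, frac=frac, return_sorted=False)
    return M - fit                               # bias-corrected log-ratios

# Illustrative example with simulated intensities and a slight dye bias:
rng = np.random.default_rng(0)
green = rng.lognormal(mean=8, sigma=1, size=1000)
red = green * 2 ** rng.normal(0.2, 0.3, size=1000)
M_norm = loess_normalize(red, green)
```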

Lecture 4.05 Dimensionality
Data mining algorithms work on a set of labeled or unlabeled data points.
–Labeled: supervised learning (neural networks, HMMs).
–Unlabeled: unsupervised learning (k-means, SOM).
Each data point is a vector in a continuous or discrete space.
–The dimensionality of a dataset is the length of those vectors.
The algorithm tries to identify parameters that model the distribution of the data points.
–The dimensionality of the parameter space generally grows with the dimensionality of the dataset.
–Uniqueness of the optimal parameters requires a number of independent data points equal to or greater than the number of parameters.

Lecture 4.06 Dimensionality (cont.)
The choice of how to convert your data to vectors has a great impact on your chance of success:
–Too many dimensions: the model requires too many parameters, and thus more data points than you may have. The algorithm may over-train (overfit).
–Too few dimensions: you may lose information that is essential to identify a feature (pattern, class, etc.).
The case of microarray data: data mining can be done on
–samples,
–genes,
–significantly over/under-expressed genes...
Curse of dimensionality... (Bellman, 1961)

Lecture 4.07 Dimensionality reduction
Different methods:
–Keeping just the relevant dimensions: guesswork. Keeping just the relevant genes!
–Principal component analysis (PCA): a rotation of the dataset that maximizes the dispersion across the first axes.
–Singular value decomposition (SVD): Press et al., p. 59 (a short sketch follows this slide).
–Neural networks: map the input vector onto itself; the hidden units correspond to the reduced vector.
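As referenced above, a minimal numpy sketch of SVD-based reduction (not the lecture's own code); the variable names and the choice of k are illustrative.

```python
# Minimal sketch of dimensionality reduction via singular value decomposition.
import numpy as np

def svd_reduce(X, k):
    """Project the rows of X onto the top-k right singular vectors."""
    Xc = X - X.mean(axis=0)                        # center each dimension
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # n x k reduced representation

X = np.random.rand(100, 20)                        # 100 data points, 20 dimensions
X_reduced = svd_reduce(X, k=3)
print(X_reduced.shape)                             # (100, 3)
```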

Lecture 4.08 Principal Component Analysis
Subtract the mean (centers the data).
Compute the covariance matrix, Σ.
Compute the eigenvectors of Σ, sort them according to their eigenvalues and keep the first M vectors.
Project the data points onto those vectors.
Also called the Karhunen-Loève transformation.
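A minimal numpy sketch of the recipe above (center, covariance, eigenvectors, projection); the toy data and the value of M are illustrative.

```python
# Minimal sketch of PCA via eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, M):
    """Project the rows of X onto the first M principal components."""
    Xc = X - X.mean(axis=0)                   # subtract the mean
    cov = np.cov(Xc, rowvar=False)            # covariance matrix Sigma
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    components = eigvecs[:, order[:M]]        # keep the first M eigenvectors
    return Xc @ components                    # project the data points

X = np.random.rand(200, 10)
X_pca = pca(X, M=2)                           # 200 x 2 projected data
```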

Lecture 4.09 Neural networks for dimensionality reduction
[Figure from Neural Networks for Pattern Recognition (Bishop, 1995), p. 314: an auto-associative network whose hidden bottleneck units correspond to the reduced vector.]

Lecture 4.10 Clustering
Unsupervised learning algorithm:
–No information is assumed to be known about the different samples.
Identifies classes within the data provided:
–Genes that seem to react the same way in different conditions;
–Different conditions that seem to induce the same response.
Applications:
–Parameters of the discovered classes can be used to classify new data points.
–The parameters themselves may be viewed as measurements of a feature of the data.
–Exploratory step.

Lecture 4.11 Metrics
A metric quantifies the difference between two elements of a data set.
–Similarity metrics;
–Dissimilarity or distance metrics.
Examples of metrics:
–Manhattan: D(x, y) = Σ_i |x_i - y_i|
–Euclidean: D(x, y) = sqrt( Σ_i (x_i - y_i)² )
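A minimal numpy sketch of the two distance metrics above; the example vectors are illustrative.

```python
# Minimal sketch of the Manhattan and Euclidean distance metrics.
import numpy as np

def manhattan(x, y):
    return np.sum(np.abs(x - y))            # D(x, y) = sum_i |x_i - y_i|

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))    # D(x, y) = sqrt(sum_i (x_i - y_i)^2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
print(manhattan(x, y), euclidean(x, y))     # 3.0, ~2.236
```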

Lecture 4.12 k-means
Input: a set of vectors { x_i }, a metric D, and the number of classes to identify, n.
1. Each vector receives a random label in [1, n].
2. Compute the average of each class: μ_j.
3. For each vector x_i, update its label to the closest class center according to D(x_i, μ_j).
4. If at least one vector changed its label, repeat 2-4.
Output: each vector is labeled and each class is defined by its center.
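A minimal numpy sketch of the procedure above, using the Euclidean distance; empty classes are not handled and the simulated data are illustrative.

```python
# Minimal sketch of k-means: random labels, recompute centers, reassign until stable.
import numpy as np

def k_means(X, n, rng=np.random.default_rng(0)):
    labels = rng.integers(0, n, size=len(X))          # 1. random label per vector
    while True:
        # 2. class averages mu_j (empty classes are not handled in this sketch)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(n)])
        # 3. reassign each vector to its closest class center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # 4. stop when no label changes
            return labels, centers
        labels = new_labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = k_means(X, n=2)
```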

Lecture 4.13 k-means (cont.)
The solution should not be sensitive to the random initialization.
–The algorithm can be repeated with different initializations to provide a distribution of solutions (beware of "cluster swapping"!).
Performs a linear segmentation of the data space.
–Similar to a Voronoï diagram;
–Clusters should be linearly separable.

Lecture 4.14 Hierarchical clustering
Input: a set of vectors { x_i }, a metric D, and a group metric F.
1. Create a set of groups G = { g_i } such that each g_i = ( x_i ).
2. Remove the two closest groups g_i and g_j from G (according to F) and insert the new group ( F(g_i, g_j), g_i, g_j ).
3. If |G| > 1, repeat 2-3.
Output: a tree structure.
F(a, b) typically returns max( D(x_i, x_j) ) where x_i is in a and x_j is in b.

Lecture 4.15 Hierarchical clustering (cont.)
The resulting tree may not be unique.
The formation of groups is very sensitive to noise in the data.
The separation of two data points into two clusters does not mean they are necessarily distant!
There is a "hidden" metric used to compute the distance between two groups:
–minimum;
–maximum;
–average.
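A minimal sketch using scipy's agglomerative clustering (one possible implementation, not the lecture's own code); the 'complete', 'single' and 'average' linkage methods correspond to the maximum, minimum and average group metrics above. The data and the number of clusters are illustrative.

```python
# Minimal sketch of hierarchical (agglomerative) clustering with scipy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(20, 3), np.random.randn(20, 3) + 4])
Z = linkage(X, method='complete', metric='euclidean')  # the tree structure
labels = fcluster(Z, t=2, criterion='maxclust')        # cut the tree into 2 clusters
```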

Lecture 4.16 Self-organizing maps (SOM)
[Figure from Timo Honkela's "Description of Kohonen's Self-Organizing Map".]

Lecture 4.17 Self-organizing maps (SOM) (cont.)
Input: a set of vectors { x_i } and an array of random model vectors m_i(0).
1. Randomly initialize each model vector.
2. For each x_i:
a) Identify the model, m_c(t), that is closest to x_i according to the metric (the "winner").
b) Move the winner and the models in its grid neighborhood toward x_i: m_j(t+1) = m_j(t) + α(t) h_cj(t) [x_i - m_j(t)].
c) Decrease the learning rate α(t) and the neighborhood radius over time.
3. Repeat 2 until convergence of E.
T. Kohonen, Self-Organizing Maps, Springer, 1995.
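A minimal numpy sketch of a one-dimensional SOM along these lines (not Kohonen's reference implementation); the grid size, learning-rate schedule and neighborhood width are illustrative choices.

```python
# Minimal sketch of a 1-D self-organizing map with a Gaussian neighborhood.
import numpy as np

def som_1d(X, n_units=10, n_epochs=20, rng=np.random.default_rng(0)):
    d = X.shape[1]
    models = rng.random((n_units, d))                       # random model vectors m_i(0)
    grid = np.arange(n_units)                               # positions on the 1-D grid
    for t in range(n_epochs):
        alpha = 0.5 * (1 - t / n_epochs)                    # decreasing learning rate
        sigma = max(1.0, n_units / 2 * (1 - t / n_epochs))  # shrinking neighborhood
        for x in rng.permutation(X):
            winner = np.argmin(np.linalg.norm(models - x, axis=1))  # closest model
            h = np.exp(-((grid - winner) ** 2) / (2 * sigma ** 2))  # neighborhood function
            models += alpha * h[:, None] * (x - models)     # pull models toward x
    return models

X = np.random.rand(200, 3)
models = som_1d(X)
```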

Lecture 4.18 Deriving significance
Traditionally, a 2-fold difference in spot intensities between two treatments was assumed to be "significant".
What is significant?
–Did not happen at random;
–Is not an artifact.
All methods involve modeling a random process and using this model to quantify the likelihood that an observed value was generated by that random process.
–Corresponds to the classic hypothesis test in statistics.

Lecture 4.19 Deriving significance (cont.)
Modeling the random process is often done by assuming normality:
–Microarray intensities are not normally distributed.
–Ratios are never normally distributed.
–Log-ratios are close enough to a normal distribution.
Outliers are not produced by the random process and will affect the parameterization of the selected distribution.
–They should be removed or accounted for in some way.
–Hard cutoffs may work, but they introduce biases...
Characterizing this random process requires a significant amount of data!
–Can be done on a gene or sample basis.
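A minimal scipy sketch of the hypothesis-test view: fit a normal distribution to log-ratios and ask how unlikely an observed value is under it. The simulated log-ratios and the observed value are illustrative.

```python
# Minimal sketch: significance of a log-ratio under a fitted normal model.
import numpy as np
from scipy import stats

log_ratios = np.random.normal(0.0, 0.4, size=800)        # the "random process" for one gene
mu, sigma = log_ratios.mean(), log_ratios.std(ddof=1)    # parameterize the null model

observed = 1.0                                            # a new log-ratio (2-fold change)
p_value = 2 * stats.norm.sf(abs(observed - mu) / sigma)   # two-sided tail probability
print(p_value)
```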

Lecture 4.20 Deriving significance (cont.)
An example:
–Using the log-ratios for one specific gene across 800 experiments.
–An EM algorithm (a soft variant of k-means) is used to smoothly remove the outliers.
–The resulting normal distribution is an approximation of the expected variation for that gene.
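A minimal sketch in the same spirit, using scikit-learn's GaussianMixture (an EM implementation) rather than the exact variant from the lecture: the tighter component approximates the expected variation while a broad component absorbs outliers. The component count and simulated data are illustrative.

```python
# Minimal sketch: EM-based separation of typical variation from outliers.
import numpy as np
from sklearn.mixture import GaussianMixture

log_ratios = np.concatenate([np.random.normal(0, 0.3, 760),   # typical variation
                             np.random.normal(0, 2.0, 40)])   # outliers
gmm = GaussianMixture(n_components=2, random_state=0).fit(log_ratios.reshape(-1, 1))
main = np.argmin(gmm.covariances_.ravel())        # tighter component = expected variation
mu = gmm.means_.ravel()[main]
sigma = np.sqrt(gmm.covariances_.ravel()[main])
```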

Lecture 4.21 Other approaches...
EM algorithm:
–Used to approximate an unknown distribution;
–Can be used to return a probabilistic classifier.
Neural networks.
HMM:
–Could be used for time-series applications.
Support Vector Machines:
–Powerful supervised learning technique;
–Known to work well in high-dimensional spaces;
–Theory is "scary"! And thus SVMs are often used as a black box.
Statistics!
–Avoid a dogmatic approach...
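A minimal scikit-learn sketch of using an SVM "as a black box"; the synthetic data, labels and kernel settings are illustrative.

```python
# Minimal sketch of a support vector machine used as a black-box classifier.
import numpy as np
from sklearn.svm import SVC

X = np.vstack([np.random.randn(50, 10), np.random.randn(50, 10) + 1.5])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel='rbf', C=1.0).fit(X, y)     # supervised training
print(clf.score(X, y))                        # training accuracy
```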

Lecture 4.22 References
C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
R. Schalkoff, Pattern Recognition: Statistical, Structural and Neural Approaches, John Wiley & Sons, 1992.
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
W. H. Press et al., Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, 1992.