Methods for Micro-Array Analysis Data Mining & Machine Learning Approaches.

What is Micro-Array Analysis?

Analysis of Micro-Array Data
Challenges posed
- Micro-array data typically have a large number of variables relative to the number of observations.
- Hidden knowledge in these data has to be discovered.
- E.g. gene expression data from 72 leukemia patients (samples) with 7,070 genes (variables).
- The study of the variability of gene expression patterns.
Problems
- How can micro-array data be analyzed so that the following requirements are met simultaneously?
  - Efficiency
  - Accuracy
  - Automation

Typical Micro-Array data set
Suppose that the identical micro-array experiment is repeated p times (e.g. colon cancer cells from p patients compared with p wild types). Then we obtain a data set (m_ij; i = 1,…,G; j = 1,…,p), in which m_ij is the expression ratio of gene i in the j-th experiment.
Such experiments usually generate large data sets with expression values for thousands of genes (2,000 to 20,000) but no more than a few dozen samples. For example:

Dataset | Number of genes | Samples | Reference
Leukemia (ALL versus AML) | … | …:25 bone marrow samples | Golub et al. (1999)
Lung cancer (malignant pleural mesothelioma (MPM) versus adenocarcinoma of the lung (ADCA)) | … | …:150 tissue samples | Gordon et al. (2002)
Prostate cancer (tumor versus normal classification) | … | …:59 prostate tumor samples and normal samples | Singh et al. (2002)

[Matrix: expression ratios m_ij, with genes 1,…,G in rows and experiments 1,…,p in columns (sample entry: 1.34)]

Main objectives in micro-array data analyses
1. To find the genes that are differently expressed (DF) in the two samples (e.g. the given colon cancer sample versus the wild type; cells that underwent a given treatment versus no treatment). Although biologists can discover DF genes even with p = 1, it has been realized lately that making independent replications is good practice. Questions that could be asked:
   - Which genes' expression is modified by the condition? (It has been reported that many diseases, especially tumors, are not caused by a single gene mutation but are the result of a series of gene changes.)
   - Has the treatment changed the expression level of specific (target) genes or gene sequences to noticeably different levels? If so, is the change important (i.e. is the patient's condition improved due to this change in expression levels)?
2. To find genes that behave similarly in different conditions (i.e. clustering the row vectors) and to find subgroups of samples (or patients' tissues) that are similar to each other (i.e. clustering the column vectors). This supports:
   - novel discovery of genes in related biological pathways or with related functions;
   - clinically important subgroups of patients.
3. Classification. For example, Golub et al. (1999): 2 types of leukemia, based on the gene expression profile of each sample.
4. Validation of the models; assessment of the robustness and predictive power of the classifiers (models).

1. Finding differently expressed genes. Parametric methods: t-test
Standard t-test
- H0: no difference between the treated and the control samples.
- H1: the treatment has an influence.
Knowing the probability distribution of the T variable under H0 (Student's t distribution with p-1 degrees of freedom), the actual T is computed and compared to this distribution. The smaller the p-value, the less likely it is that a difference this extreme arose by chance.
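A minimal sketch of the T computation for a single gene, assuming the p replicate measurements are log expression ratios and that H0 corresponds to a mean log ratio of zero. The function name and the sample values are illustrative and do not come from the slides.

```python
import math

def t_statistic(log_ratios):
    """One-sample t statistic for H0: mean log ratio == 0 (p - 1 degrees of freedom)."""
    p = len(log_ratios)
    if p < 2:
        raise ValueError("need at least two replicates")
    mean = sum(log_ratios) / p
    # Unbiased sample variance (divides by p - 1).
    var = sum((x - mean) ** 2 for x in log_ratios) / (p - 1)
    if var == 0:
        raise ValueError("zero variance: t statistic undefined")
    # T = mean / standard error of the mean.
    return mean / math.sqrt(var / p)

# Hypothetical log2 ratios for one gene across p = 5 replicate experiments.
gene = [1.0, 1.2, 0.8, 1.1, 0.9]
print(f"T = {t_statistic(gene):.2f} with {len(gene) - 1} degrees of freedom")
```

The resulting T would then be compared with the Student distribution with p-1 degrees of freedom to obtain a p-value.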

1. Finding differently expressed genes. t-test
- Advantages: simple and implemented in all commercial microarray analysis packages.
- Disadvantages: distributional assumptions (with so few samples, we cannot assume that the sample mean is normally distributed) and the problem of multiple testing: what is the "false discovery rate"?
- Alternatives: empirical Bayes and parametric Bayes.
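The multiple-testing problem can be illustrated with the Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate. The slides do not name a specific procedure, so this choice, the function name, and the p-values below are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of the hypotheses rejected at the given false discovery rate."""
    m = len(p_values)
    # Gene indices ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff])

# Hypothetical per-gene p-values from t-tests.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_vals))
```

With thousands of genes tested at once, thresholding each raw p-value at 0.05 would be expected to flag many genes by chance alone, which is why a correction of this kind is usually applied.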

1. Finding differently expressed genes. Fold approach
- The average expression level of each gene is examined.
- If it has changed by a certain number of folds, the gene is declared changed (on or off).
- Disadvantage: does not reveal the desired correlation between the gene and its function, and does not find related genes.
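A small sketch of the fold approach, assuming strictly positive intensities and a two-fold threshold. The threshold, the gene names, and the intensity values are hypothetical.

```python
def fold_change_calls(control, treated, threshold=2.0):
    """Label each gene 'up', 'down', or 'unchanged' from its mean expression levels."""
    calls = {}
    for gene in control:
        mean_c = sum(control[gene]) / len(control[gene])
        mean_t = sum(treated[gene]) / len(treated[gene])
        ratio = mean_t / mean_c  # assumes strictly positive intensities
        if ratio >= threshold:
            calls[gene] = "up"
        elif ratio <= 1.0 / threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls

# Hypothetical replicate intensities per gene.
control = {"geneA": [100, 120], "geneB": [200, 220], "geneC": [150, 150]}
treated = {"geneA": [250, 270], "geneB": [90, 100], "geneC": [160, 170]}
print(fold_change_calls(control, treated))
```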

Data Mining
[Diagram labels: Observational Data, Background Knowledge, Pre-Processing and Representation, Intermediate Representation, Knowledge Discovery, Feedback]

Gene expression grouping and classification: overview of existing approaches
Micro-array analysis employs machine learning algorithms and techniques to mine useful data.
Unsupervised data analysis
- Principal Component Analysis (PCA)
- Hierarchical clustering
- Non-hierarchical clustering
  - K-means
  - Self-organizing maps (a type of neural network)
Supervised data analysis
- Decision trees: C5.0 implementation
- Artificial neural networks: back-propagation algorithm
Two complementary techniques
- Cross-validation
- Multi-model approaches (boosting, bagging, stacking)

Principal Component Analysis (PCA)
This is a technique for finding major combinations of data (i.e. genes that are regularly up- and down-regulated together).
Objectives
- Summarize graphically a large rectangular table of numbers, R; simplify its comprehension; find pertinent features.
- Reduce the dimensionality of the data set (e.g. co-regulated genes).
- Summarize graphically:
  - The correlations between the variables.
  - New meaningful underlying variables (dimensions) that summarize the initial variables. Maximize the inter-class variance; minimize the intra-class variance.
  - The proximities and the principal oppositions between the individuals.
Simple example
- Imagine a micro-array data set consisting of only 2 experiments (2 samples).
- Graphically represent the data.

Principal Component Analysis (PCA)
[Figure: principal component analysis of a two-dimensional data cloud.] The line shown is the direction of the first principal component, which gives an optimal (in the mean-square sense) linear reduction from 2 dimensions to 1.

Principal Component Analysis (PCA)
Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
Illustration for the case of 2 samples, x and y: the variance of sample x, the variance of sample y, and the covariance between x and y are collected into a 2×2 covariance matrix. This matrix is square and symmetric, so it admits a characteristic polynomial, is diagonalizable, and has a basis of orthogonal eigenvectors.
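For reference, a sketch of the standard definitions for the two-sample case, assuming the sums run over the G genes and the unbiased 1/(G-1) normalization (the slides may have used 1/G instead). The symbol C for the covariance matrix is introduced here for illustration.

```latex
\operatorname{var}(x) = \frac{1}{G-1}\sum_{i=1}^{G}\left(x_i-\bar{x}\right)^2,
\qquad
\operatorname{var}(y) = \frac{1}{G-1}\sum_{i=1}^{G}\left(y_i-\bar{y}\right)^2,
\qquad
\operatorname{cov}(x,y) = \frac{1}{G-1}\sum_{i=1}^{G}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)

C =
\begin{pmatrix}
\operatorname{var}(x) & \operatorname{cov}(x,y) \\
\operatorname{cov}(x,y) & \operatorname{var}(y)
\end{pmatrix}
```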

Principal Component Analysis (PCA)
Then there exists a matrix U, built from the eigenvectors, that diagonalizes the covariance matrix.
1. The 2 eigenvectors are orthogonal. They represent a new system of independent coordinates. The quantities u11 and u12 are the coordinates of the first new axis, expressed as a vector. The same holds for u21 and u22.
2. Each coefficient indicates the weight of a particular experiment within this component (how much this experiment contributes to generating this pattern).
3. The change of coordinates amounts to a translation and a rotation of the coordinate system.

Principal Component Analysis (PCA)
- The first principal component accounts for as much of the variability in the data as possible.
- Each succeeding component accounts for as much of the remaining variability as possible.
- Imagine a cloud of data in many dimensions → benefits!
The projection of a point A (x, y) on an axis u (u1, u2) is obtained by taking the scalar product of the coordinates of the point with the vector coordinates of the axis: projection = x*u1 + y*u2. Here, the points are the genes.
It is interesting to plot the eigenvalues, which express how the variability of the data is distributed in the new coordinate system: the relative sizes of the major and minor axes of the ellipse.
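The two-experiment case can be sketched in plain Python, using the closed-form eigen-decomposition of a symmetric 2×2 matrix. This assumes the 1/(n-1) covariance normalization and centers the points before projecting (the translation step); the function name and the data values are hypothetical.

```python
import math

def pca_2d(points):
    """PCA of 2-D points; returns (eigenvalues, eigenvectors, projections on PC1)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Translation: move the origin to the centre of the data cloud.
    centered = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Rotation: angle of the first principal axis.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    u1 = (math.cos(theta), math.sin(theta))
    u2 = (-math.sin(theta), math.cos(theta))
    # Scalar product of each centred point with the first axis.
    proj = [x * u1[0] + y * u1[1] for x, y in centered]
    return (lam1, lam2), (u1, u2), proj

# Hypothetical expression ratios of five genes in two experiments.
genes = [(2.0, 1.9), (0.5, 0.7), (1.2, 1.0), (3.1, 3.3), (0.9, 1.1)]
(lam1, lam2), (u1, u2), proj = pca_2d(genes)
print(f"share of variance on PC1: {lam1 / (lam1 + lam2):.3f}")
```

The ratio of the two eigenvalues plays the role of the relative sizes of the major and minor axes of the ellipse.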

Principal Component Analysis (PCA)
Application to a sporulation time series: observations of differential expression for thousands of genes across multiple conditions.
- Usually, the first component has all positive coefficients, indicating a weighted average of all experiments.
- The second principal component has negative values at early time points and positive values at later time points, indicating a measure of change in expression.
[Matrix: expression ratios m_ij, with genes 1,…,G in rows and time points t1, t2, … in columns (sample entry: 1.34)]

Machine Learning for Micro-Array Analysis: clustering
Cluster analysis:
- Identification of new subgroups or classes of some biological entity (e.g. tumors)

Hierarchical Clustering
Hierarchical clustering methods differ in:
- the distance measure selected;
- the manner in which distances are computed between the growing clusters and the remaining members of the data set:
  - Single linkage. Disadvantage: loose clusters.
  - Complete linkage. Disadvantage: clusters that are too compact and of very similar size.
  - Average linkage, or unweighted pair-group method average (UPGMA): the two groups with the lowest average distance are joined to form a new cluster.
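The three linkage rules can be compared with a small agglomerative sketch. It assumes Euclidean distance between expression vectors and stops once a requested number of clusters remains; the function names and the data are hypothetical.

```python
import math
from itertools import product

def cluster_distance(a, b, linkage="average"):
    """Distance between clusters a and b under the chosen linkage rule."""
    d = [math.dist(p, q) for p, q in product(a, b)]
    if linkage == "single":    # closest pair of members
        return min(d)
    if linkage == "complete":  # farthest pair of members
        return max(d)
    if linkage == "average":   # mean over all pairs (UPGMA)
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, linkage="average"):
    """Repeatedly join the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cluster_distance(
            clusters[ij[0]], clusters[ij[1]], linkage))
        # j > i, so removing cluster j leaves index i valid.
        clusters[i] += clusters.pop(j)
    return clusters

# Hypothetical two-experiment expression vectors.
data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerate(data, 3, linkage="average"))
```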

Hierarchical Clustering
- Euclidean and Manhattan distances: sensitive to absolute expression levels. They reveal genes that have similar expression levels. A and B have approximately the same expression levels.
- Correlation coefficient with centering: sensitive to expression profiles. It reveals genes that have similar expression profiles. D and E: enhanced; A and C: repressed.
- Absolute correlation coefficient: A, C, D, E may be involved in the same biological pathway.
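A short sketch contrasting these measures on three hypothetical profiles: g2 has the same shape as g1 at a higher level, and g3 is the mirror image of g1. The gene names and values are illustrative and are not the A to E genes of the original figure; the correlation assumes non-constant profiles.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Centered correlation coefficient (assumes neither profile is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [11.0, 12.0, 13.0, 14.0]  # same profile, different level
g3 = [4.0, 3.0, 2.0, 1.0]      # opposite profile

print(euclidean(g1, g2), manhattan(g1, g2))  # large: the levels differ
print(pearson(g1, g2))                       # close to +1: same profile
print(pearson(g1, g3))                       # close to -1: opposite profile
print(abs(pearson(g1, g3)))                  # close to 1: possibly related
```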

K-means Clustering
1. Randomly assign the data to the clusters. Suppose there are m genes per cluster.
2. Calculate an average expression vector for each cluster i. This corresponds to the centroid of the cluster.
3. For each cluster, calculate the mean intra-cluster distance between each point and the centroid.
4. Move data from one cluster to another, with the aim of minimizing the overall intra-cluster distance measure.
Advantages: easy to implement.
Disadvantages: computationally intensive; the outcome is determined by such factors as the distance metric chosen.
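The steps above can be sketched as follows, assuming Euclidean distance and the usual nearest-centroid reassignment rule. Reseeding an empty cluster with a random point is an added assumption, and the function name and data are hypothetical.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: random initial assignment of each point to a cluster.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2: the centroid is the average expression vector of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:  # empty cluster: reseed with a random point (assumption)
                members = [rng.choice(points)]
            centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        # Steps 3-4: move every point to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # no point moved: converged
            break
        labels = new_labels
    return labels, centroids

# Hypothetical two-experiment expression vectors forming two groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```

Different seeds can lead to different final clusters, which is one reason the outcome depends on choices such as initialization and distance metric.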

Non-parametric models
Models that rely heavily on the empirical analysis of large data sets rather than on prior domain knowledge.
Non-parametric approaches: decision trees, neural networks, genetic algorithms, and nearest neighbor methods.
Fundamental assumption: consistently observed relationships or patterns in large data sets will recur in future observations.
Advantages:
- Do not require a thorough understanding of the underlying system or problem.
- Can be used to build arbitrarily complex models that are highly non-linear and not restricted by human comprehension.

Decision Tree
Strengths:
- Clearly indicates which attributes are most important for prediction or classification.
Weaknesses:
- Limited ability to handle estimation or regression tasks where the goal is to predict the value of a continuous variable.
- Error-prone when the number of training examples per class is small.

Neural Networks
Strengths
- Ability to handle a range of problem tasks, including classification (discrete outputs) and estimation or regression tasks (continuous outputs).
- Provision of an indication (through sensitivity analysis) of which attributes are most important for prediction or classification.
Weaknesses
- The risk of premature convergence to an inferior solution (this is normally addressed by performing a sensible cross-validation procedure).

Multi-Model Approaches
Problem with the regular models: instability of the prediction method
- Instability: sensitivity of the final model to small changes in the training set.
- Unstable machine learning methods: decision trees.
- Stable methods: k-nearest neighbor, neural models.
Now, let us see an approach to address the instability problem.

Machine Learning for Micro-Array Analysis
Cross-validation
- To test the robustness of the classifier.
Algorithm choice depends on
- Attributes
- Ratio of the training data [TP, TN; if TP is small, over-fitting occurs]
Combined approaches
- With a limited amount of training data, an individual classifier may not represent the true hypothesis.
- A combined classifier may produce a good approximation to the true hypothesis.
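A minimal sketch of k-fold cross-validation, assuming a 1-nearest-neighbor classifier as the model under test. The slides do not prescribe a classifier here, so that choice, the helper names, and the toy samples (labeled ALL and AML for illustration only) are assumptions.

```python
import math
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def fit_1nn(X_train, y_train):
    """Return a predictor that copies the label of the nearest training sample."""
    def predict(x):
        nearest = min(range(len(X_train)), key=lambda i: math.dist(x, X_train[i]))
        return y_train[nearest]
    return predict

def cross_val_accuracy(X, y, fit, k=5):
    """Fraction of samples classified correctly when each is held out once."""
    correct = 0
    for train, test in k_fold_splits(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(X[i]) == y[i] for i in test)
    return correct / len(X)

# Hypothetical samples described by two expression values each.
X = [(0.1 * i, 0.0) for i in range(5)] + [(5.0 + 0.1 * i, 5.0) for i in range(5)]
y = ["ALL"] * 5 + ["AML"] * 5
print(cross_val_accuracy(X, y, fit_1nn, k=5))
```

Because every sample is held out exactly once, the reported accuracy is measured only on data the classifier did not see during training.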

Multi-Model Approaches
- What do they do? They create and combine multiple classifiers.
- How do they differ from each other? They differ in how the classifiers are trained and in how their outputs are combined.
- How do they improve accuracy? By focusing the learning process on the examples in the data that are harder to model than others.
- Common methods for constructing multi-model systems: boosting, bagging, and stacking.

Boosting

Boosting Algorithm
Step 1: Form the learning set and the validation set (uniform sampling without replacement).
Step 2: Sample N different training set replicas adaptively (with non-uniform sampling probabilities and with replacement).
Step 3: Build each classifier, f'_i(x), based on its training set.
Step 4: Establish each classifier's performance by testing it against the learning set.
Step 5: Calculate a weight for each classifier based on its performance.
Step 6: Combine the models by means of a weighted voting scheme, in which each individual prediction model carries a different weight.
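As a rough illustration of Steps 2 to 6, here is an AdaBoost-style sketch. It is a stand-in rather than the exact procedure on the slide: it reweights the training samples instead of resampling them, omits the separate validation set, and uses one-dimensional threshold classifiers (decision stumps) as the base models. All names and the toy data are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Best threshold classifier on 1-D inputs under sample weights w (labels +1/-1)."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            pred = [sign if x >= thr else -sign for x in X]
            err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def boost(X, y, rounds=3):
    """Return a list of (weight, threshold, sign) classifiers."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    model = []
    for _ in range(rounds):
        err, thr, sign = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        # Classifier weight: larger for more accurate classifiers.
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, sign))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * t * (sign if x >= thr else -sign))
             for wi, x, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    """Weighted vote of all classifiers."""
    score = sum(a * (s if x >= thr else -s) for a, thr, s in model)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold separates.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = boost(X, y, rounds=3)
print([predict(model, x) for x in X])
```

Each round concentrates on the samples the previous classifiers got wrong, which is the sense in which boosting focuses on examples that are harder to model.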