Applications of Data Mining in Microarray Data Analysis Yen-Jen Oyang Dept. of Computer Science and Information Engineering.

Observations and Challenges in the Information Age A huge volume of information has been, and is being, digitized and stored in computers. Due to the volume of digitized information, effective exploitation of the information is beyond the capability of human beings without the aid of intelligent computer software.

An Example of Data Mining Given the data set shown on the next slide, can we figure out a set of rules that predicts the classes of the objects?

Data Set

Data      Class    Data      Class    Data      Class
(15,33)   O        (18,28)   ×        (16,31)   O
(9,23)    ×        (15,35)   O        (9,32)    ×
(8,15)    ×        (17,34)   O        (11,38)   ×
(11,31)   O        (18,39)   ×        (13,34)   O
(13,37)   ×        (14,32)   O        (19,36)   ×
(18,32)   O        (25,18)   ×        (10,34)   ×
(16,38)   ×        (23,33)   ×        (15,30)   O
(12,33)   O        (21,28)   ×        (13,22)   ×

Distribution of the Data Set [scatter plot of the “O” and “×” samples]

Rule Based on Observation

Rule Generated by an RBF (Radial Basis Function) Network Based Learning Algorithm Let … and … . If …, then prediction = “O”; otherwise prediction = “×”.

Class “O”: (15,33), (11,31), (18,32), (12,33), (15,35), (17,34), (14,32), (16,31), (13,34), (15,30)
Class “×”: (9,23), (8,15), (13,37), (16,38), (18,28), (18,39), (25,18), (23,33), (21,28), (9,32), (11,38), (19,36), (10,34), (13,22)

Identifying Boundary of Different Classes of Objects

Boundary Identified

Data Mining / Knowledge Discovery The main theme of data mining is to discover unknown, implicit knowledge in large data sets. There are three main categories of data mining algorithms: classification; clustering; association-rule mining / correlation analysis.

Data Classification In a data classification problem, each object is described by a set of attribute values and each object belongs to one of the predefined classes. The goal is to derive a set of rules that predicts which class a new object should belong to, based on a given set of training samples. Data classification is also called supervised learning.

Instance-Based Learning In instance-based learning, we take the k nearest training samples of a new instance (v₁, v₂, …, vₘ) and assign the new instance to the class with the most instances among those k nearest training samples. Classifiers that adopt instance-based learning are commonly called KNN (k-nearest-neighbor) classifiers.

Example of the KNN Classifier If a 1NN classifier is employed, the query point is predicted as “×”. If a 3NN classifier is employed, the query point is predicted as “O”.
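
The 1NN-versus-3NN behavior above can be sketched in a few lines of Python. The training samples are taken from the example data set; the query point (17.5, 29.5) is a hypothetical location near the class boundary, not necessarily the exact point used on the slide.

```python
import math
from collections import Counter

# Training samples from the example data set: ((x, y), class).
TRAINING = [
    ((15, 33), "O"), ((16, 31), "O"), ((15, 35), "O"), ((17, 34), "O"),
    ((11, 31), "O"), ((13, 34), "O"), ((14, 32), "O"), ((18, 32), "O"),
    ((15, 30), "O"), ((12, 33), "O"),
    ((18, 28), "X"), ((9, 23), "X"), ((9, 32), "X"), ((8, 15), "X"),
    ((11, 38), "X"), ((18, 39), "X"), ((13, 37), "X"), ((19, 36), "X"),
    ((25, 18), "X"), ((10, 34), "X"), ((16, 38), "X"), ((23, 33), "X"),
    ((21, 28), "X"), ((13, 22), "X"),
]

def knn_predict(samples, query, k):
    """Assign the query to the majority class among its k nearest samples."""
    nearest = sorted(samples, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

query = (17.5, 29.5)  # hypothetical query point near the class boundary
print(knn_predict(TRAINING, query, k=1))  # "X": the single nearest sample is (18, 28)
print(knn_predict(TRAINING, query, k=3))  # "O": two of the three nearest samples are "O"
```

With this query point the 1NN and 3NN predictions disagree, which is exactly the situation the slide illustrates.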

Applications of Data Classification in Bioinformatics In microarray data analysis, data classification is employed to predict the class of a new sample based on existing samples whose classes are known.

For example, the Leukemia data set contains 72 samples and 7129 genes: 25 Acute Myeloid Leukemia (AML) samples; 38 B-cell Acute Lymphoblastic Leukemia (ALL) samples; 9 T-cell Acute Lymphoblastic Leukemia (ALL) samples.

Model of Microarray Data Sets A microarray data set can be modeled as an m × n matrix whose rows are the samples (Sample 1, Sample 2, …, Sample m) and whose columns are the genes (Gene 1, Gene 2, …, Gene n).

Alternative Data Classification Algorithms Decision trees (C4.5 and C5.0); instance-based learning (KNN); the naïve Bayesian classifier; the support vector machine (SVM); novel approaches, including the RBF network based classifier that we have recently proposed.

Accuracy of Different Classification Algorithms Classification algorithms compared: RBF, SVM, 1NN, and 3NN. Data sets (training samples, test samples): Satimage (4335, 2000), Letter (15000, 5000), Shuttle (43500, 14500), plus the average over the three data sets.

Comparison of Execution Time (in seconds) Execution times of RBF without data reduction, RBF with data reduction, and SVM, measured for cross validation, classifier construction, and testing on the Satimage, Letter, and Shuttle data sets.

More Insights For each of the Satimage, Letter, and Shuttle data sets: the number of training samples in the original data set; the number of training samples after data reduction is applied; the percentage of training samples remaining (40.92%, 51.96%, and 1.44%, respectively); the classification accuracy after data reduction is applied; and the number of support vectors identified by LIBSVM.

Data Clustering Data clustering concerns how to group a set of objects based on their similarity of attributes and/or their proximity in the vector space. Data clustering is also called unsupervised learning.

The Agglomerative Hierarchical Clustering Algorithms The agglomerative hierarchical clustering algorithms operate by maintaining a sorted list of inter-cluster distances. Initially, each data instance forms a cluster. The clustering algorithm repetitively merges the two clusters with the minimum inter-cluster distance.

Upon merging two clusters, the clustering algorithm computes the distances between the newly-formed cluster and the remaining clusters and maintains the sorted list of inter-cluster distances accordingly. There are a number of ways to define the inter-cluster distance: minimum distance (single-link); maximum distance (complete-link); average distance; mean distance.
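
The merge loop described above can be sketched as follows. This is a minimal illustration with names of our own choosing; for simplicity it recomputes inter-cluster distances on every round instead of maintaining the sorted list described above.

```python
import math

def single_link(c1, c2):
    """Minimum pairwise distance between two clusters (single-link)."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """Maximum pairwise distance between two clusters (complete-link)."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, target_k, linkage):
    # Initially, each data instance forms a cluster of its own.
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the pair of clusters with the minimum inter-cluster distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        # Merge cluster j into cluster i.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(agglomerate(points, 2, single_link))  # two tight groups of three points each
```

Passing complete_link instead of single_link changes only the inter-cluster distance definition, which is why the two variants can produce different results on the same data.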

An Example of the Agglomerative Hierarchical Clustering Algorithm For the following data set, we will get different clustering results with the single-link and complete-link algorithms.

Result of the Single-Link Algorithm
Result of the Complete-Link Algorithm

Remarks The single-link and complete-link definitions are the two most commonly used alternatives. The single-link suffers from the so-called chaining effect. On the other hand, the complete-link also fails in some cases.

Example of the Chaining Effect Single-link (10 clusters) Complete-link (2 clusters)

Effect of Bias towards Spherical Clusters Single-link (2 clusters)Complete-link (2 clusters)

K-Means: A Partitional Data Clustering Algorithm The k-means algorithm is probably the most commonly used partitional clustering algorithm. It begins by selecting k data instances as the initial means, or centers, of the k clusters.

The k-means algorithm then iterates the following loop until the convergence criterion is met:

repeat {
    assign every data instance to the closest cluster, based on the distance between the data instance and the center of that cluster;
    compute the new centers of the k clusters;
} until (the convergence criterion is met);

A commonly used convergence criterion is that the centers of the k clusters remain unchanged between two successive iterations.
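
The loop above can be sketched as a short Python function. This is a bare-bones illustration of our own, assuming Euclidean distance and "centers stop moving" as the convergence criterion; an empty cluster simply keeps its previous center.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Begin by selecting k data instances as the initial cluster centers.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each instance goes to the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Converged once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: one cluster per tight group
```

Because the initial centers are sampled from the data, a different seed can converge to a different, possibly only locally optimal, partition.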

Illustration of the K-Means Algorithm (I): the initial centers

Illustration of the K-Means Algorithm (II): the new centers after the 1st iteration

Illustration of the K-Means Algorithm (III): the new centers after the 2nd iteration

A Case in which the K-Means Algorithm Fails The k-means algorithm may converge to a locally optimal state, as the following example demonstrates: Initial Selection

Remarks As the examples demonstrate, no clustering algorithm is uniformly superior to the others with respect to clustering quality.

Applications of Data Clustering in Microarray Data Analysis Data clustering has been employed in microarray data analysis for identifying the genes with similar expressions; identifying the subtypes of samples.

Feature Selection in Microarray Data Analysis In microarray data analysis, it is highly desirable to identify the genes that are correlated with the classes of samples. For example, the Leukemia data set contains 7129 genes, and we want to identify the genes that distinguish the different disease types.

Furthermore, inclusion of features that are not correlated with the classification decision may result in lower classification accuracy or poor clustering quality. For example, in the data set shown on the following page, including the feature corresponding to the y-axis causes a 3NN classifier to mispredict the marked test instance.

It is apparent that the “O”s and “×”s are separated by the line x = 10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the test instance correctly.
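
The effect described above can be reproduced with a small hypothetical data set of our own (not the one on the slide): the x coordinate separates the classes at x = 10, while the y coordinate is pure noise.

```python
import math
from collections import Counter

def knn_predict(samples, query, k):
    """Assign the query to the majority class among its k nearest samples."""
    nearest = sorted(samples, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical training data: the classes are separated by x = 10;
# the y coordinate carries no class information.
train = [
    ((8, 0), "O"), ((9.2, 10), "O"), ((7, 20), "O"),
    ((12, 1), "X"), ((11.2, 2), "X"), ((13, 3), "X"),
]
query = (9.5, 1)  # lies on the "O" side of x = 10

both_features = knn_predict(train, query, k=3)  # "X": the noisy y axis misleads 3NN
x_only = knn_predict([((p[0],), c) for p, c in train], (query[0],), k=3)  # "O": correct
print(both_features, x_only)
```

Dropping the uninformative feature flips the 3NN prediction from the wrong class to the right one, which is the point the slide makes.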

Univariate Analysis in Feature Selection In univariate analysis, the importance of each feature is determined by how the objects of different classes are distributed along that particular axis. Let x₁, x₂, …, xₘ and y₁, y₂, …, yₙ denote the feature values of the class-1 and class-2 objects, respectively. Assume that the feature values of both classes of objects follow a normal distribution.

Then t = (x̄ − ȳ) / (s · √(1/m + 1/n)) follows a t-distribution with (m + n − 2) degrees of freedom, where s² = [ Σᵢ (xᵢ − x̄)² + Σⱼ (yⱼ − ȳ)² ] / (m + n − 2). If the absolute value of the t statistic of a feature falls below a threshold, then the feature is deleted.
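
The pooled two-sample t statistic above can be computed as follows; the function name and the feature values are ours, for illustration only.

```python
import math

def pooled_t_statistic(xs, ys):
    """Two-sample t statistic with pooled variance; m + n - 2 degrees of freedom."""
    m, n = len(xs), len(ys)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / n
    # Pooled standard deviation over both classes.
    ss = sum((v - mean_x) ** 2 for v in xs) + sum((v - mean_y) ** 2 for v in ys)
    s = math.sqrt(ss / (m + n - 2))
    return (mean_x - mean_y) / (s * math.sqrt(1 / m + 1 / n))

# Feature values of class-1 and class-2 objects for one gene (made-up numbers):
t = pooled_t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t, 3))  # -3.674: a large |t| means the feature separates the classes well
```

In feature selection, each gene would be scored this way, and genes whose |t| falls below the chosen threshold would be discarded.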

Multivariate Analysis The univariate analysis is not able to identify crucial features in the following example.

Therefore, multivariate analysis has been developed. However, most of the multivariate analysis algorithms that have been proposed suffer from high time complexity and may not be applicable to real-world problems.

Summary Data clustering and data classification have been widely used in microarray data analysis. Feature selection is the most challenging issue as of today.