Functional Genomics and Microarray Analysis (2)

Data Clustering Lecture Overview Introduction: What is Data Clustering. Key Terms & Concepts: Dimensionality; Centroids & Distance; Distance & Similarity measures; Data Structures Used; Hierarchical & non-hierarchical. Hierarchical Clustering: Algorithm; Single/complete/average linkage; Dendrograms. K-means Clustering. Other Related Concepts: Self Organising Maps (SOM); Dimensionality Reduction: PCA & MDS.

Introduction Analysis of Gene Expression Matrices In a gene expression matrix, rows represent genes and columns represent measurements from different experimental conditions, measured on individual arrays. The values at each position in the matrix characterise the expression level (absolute or relative) of a particular gene under a particular experimental condition. (Figure: a gene expression matrix with genes as rows, samples as columns, and gene expression levels as the cell values.)

Introduction Identifying Similar Patterns The goal of microarray data analysis is to find relationships and patterns in the data to gain insight into the underlying biology. Clustering algorithms can be applied to the resulting data to find groups of similar genes or groups of similar samples, e.g. groups of genes with similar expression profiles (co-expressed genes), i.e. similar rows in the gene expression matrix, or groups of samples (disease cell lines/tissues/toxicants) with similar effects on gene expression, i.e. similar columns in the gene expression matrix.

Introduction What is Data Clustering Clustering of data is a method by which large sets of data are grouped into clusters (groups) of smaller sets of similar data. Example: there are a total of 10 balls, which are of three different colours, and we are interested in clustering the balls into three different groups. An intuitive solution is that balls of the same colour are clustered (grouped together) by colour. Identifying similarity by colour was easy; however, we want to extend this to numerical values to be able to deal with gene expression matrices, and also to cases where there are more features (not just colour).

Introduction Clustering Algorithms A clustering algorithm attempts to find natural groups of components (or data) based on some notion of similarity over the features describing them. The clustering algorithm also finds the centroid of each group of data points. To determine cluster membership, many algorithms evaluate the distance between a point and the cluster centroids. The output from a clustering algorithm is basically a statistical description of the cluster centroids together with the number of components in each cluster.

Key Terms and Concepts Dimensionality of the gene expression matrix Clustering algorithms work by calculating distances (or, alternatively, similarities) in higher-dimensional spaces, i.e. when the elements are described by many features (e.g. colour, size, smoothness, etc. for the balls example). A gene expression matrix of N genes x M samples can be viewed as: N genes, each represented in an M-dimensional space; or M samples, each represented in an N-dimensional space. We will show graphical examples mainly in 2-D spaces, i.e. when N = 2 or M = 2.

Key Terms and Concepts Centroid and Distance (Figure: scatter plots of gene A against gene B expression values, with the centroid marked '+'.) In the first example (2 genes, 25 samples) the expression values of the 2 genes are plotted for the 25 samples and the centroid is shown. In the second example (2 genes, 2 samples) the distance between the expression values of the 2 genes is shown.

Key Terms and Concepts Centroid and Distance Cluster centroid: the centroid of a cluster is a point whose parameter values are the means of the parameter values of all the points in the cluster. Distance: generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ...).
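
To make the centroid and Euclidean distance definitions concrete, here is a minimal Python sketch; the two-gene values are invented purely for illustration.

```python
import math

def centroid(points):
    """Mean of each coordinate over all points in a cluster."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def euclidean(p, q):
    """Euclidean distance between two points p and q."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# two genes measured over two samples (hypothetical values)
gene_a = [2.0, 3.5]
gene_b = [4.0, 1.5]
print(centroid([gene_a, gene_b]))   # [3.0, 2.5]
print(euclidean(gene_a, gene_b))    # ~2.83
```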

Key Terms and Concepts Distance/Similarity Measures For two points (x1, y1) and (x2, y2): Euclidean (L2) distance; Manhattan (L1) distance; Lm: (|x1-x2|^m + |y1-y2|^m)^(1/m); L-infinity: max(|x1-x2|, |y1-y2|); Inner product: x1*x2 + y1*y2; Correlation coefficient; Spearman rank correlation coefficient. For simplicity we will concentrate on Euclidean and Manhattan distances in this course.
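
A small Python sketch of the measures listed above (Spearman rank correlation is omitted for brevity); the function names are our own, not from the slides.

```python
import math

def manhattan(p, q):                       # L1 distance
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, m):                    # Lm distance; m = 2 gives Euclidean
    return sum(abs(a - b) ** m for a, b in zip(p, q)) ** (1 / m)

def chebyshev(p, q):                       # L-infinity distance
    return max(abs(a - b) for a, b in zip(p, q))

def inner_product(p, q):
    return sum(a * b for a, b in zip(p, q))

def correlation(p, q):                     # Pearson correlation coefficient
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return cov / (sp * sq)
```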

Key Terms and Concepts Distance Measures: Minkowski Metric

Key Terms Commonly Used Minkowski Metrics
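
The Minkowski family referred to in these headings takes the standard form below; the special cases m = 1, m = 2 and m approaching infinity are the ones most commonly used.

```latex
d_m(\mathbf{x},\mathbf{y}) = \Big(\sum_{i=1}^{n} |x_i - y_i|^m\Big)^{1/m}
% m = 1: Manhattan (city-block) distance
% m = 2: Euclidean distance
% m -> infinity: d_\infty(\mathbf{x},\mathbf{y}) = \max_i |x_i - y_i|  (Chebyshev distance)
```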

Key Terms and Concepts Distance/Similarity Matrices Clustering is based on distances; this leads to a new and useful data structure, the similarity/dissimilarity matrix. Starting from a gene expression matrix of N genes x M samples, it represents the distance between either the N genes (N x N) or the M samples (M x M). Only half the matrix is needed, since it is symmetric.
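
A minimal sketch (assuming NumPy is available) of building the N x N gene-to-gene distance matrix from an expression matrix; the toy values are made up.

```python
import numpy as np

expr = np.array([[2.0, 3.5, 1.0],      # gene 1 across 3 samples (toy values)
                 [4.0, 1.5, 0.5],      # gene 2
                 [2.1, 3.4, 1.2]])     # gene 3

n_genes = expr.shape[0]
dist = np.zeros((n_genes, n_genes))
for i in range(n_genes):
    for j in range(i + 1, n_genes):            # only half is needed: the matrix is symmetric
        d = np.linalg.norm(expr[i] - expr[j])  # Euclidean distance between the two gene rows
        dist[i, j] = dist[j, i] = d

print(np.round(dist, 2))
```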

Key Terms Hierarchical vs. Non-hierarchical Hierarchical clustering is the most commonly used method for identifying groups of closely related genes or tissues. Hierarchical clustering successively links genes or samples with similar profiles to form a tree structure, much like a phylogenetic tree. K-means clustering is a non-hierarchical (flat) clustering method that requires the analyst to supply the number of clusters in advance and then allocates genes or samples to the clusters appropriately.

Hierarchical Clustering Algorithm Given a set of N items to be clustered, and an N x N distance (or similarity) matrix, the basic process of hierarchical clustering is as follows: 1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. 2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer. 3. Compute the distances (similarities) between the new cluster and each of the old clusters. 4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
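
A naive from-scratch sketch of the procedure above for single or complete linkage; the helper names and toy points are our own, not from the slides.

```python
import math

def point_dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerative(points, linkage="single"):
    """Return the sequence of merges (cluster_a, cluster_b, distance)."""
    clusters = {i: [i] for i in range(len(points))}   # step 1: one item per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in clusters:                            # step 2: find the closest pair
            for b in clusters:
                if a >= b:
                    continue
                pair_dists = [point_dist(points[i], points[j])
                              for i in clusters[a] for j in clusters[b]]
                d = min(pair_dists) if linkage == "single" else max(pair_dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a].extend(clusters[b])               # step 3: merge and update
        del clusters[b]
        merges.append((a, b, d))
    return merges                                     # step 4: loop until one cluster remains

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)]))
```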

Hierarchical Cluster Analysis (1) Scan the matrix for the minimum (2) Join the two genes into one node (3) Update the matrix. Now we come to the actual clustering. As a first step we look for the maximum in our correlation matrix (equivalently, the minimum in a distance matrix). The two genes between which this value was found are joined, and the observations of the two joined genes are averaged into a single profile. The correlation matrix then has to be brought up to date. With the newly created matrix the search starts again from the beginning. This cycle is repeated until all genes have been joined into one tree.

Hierarchical Clustering Distance Between Two Clusters Whereas it is straightforward to calculate the distance between two points, we have several options when calculating the distance between clusters: the single-link method / nearest neighbour (minimum distance over cross-cluster pairs); the complete-link method / furthest neighbour (maximum distance); the distance between their centroids; or the average of all cross-cluster pairs.

Key Terms Linkage Methods for hierarchical clustering Single-link clustering (also called the connectedness or minimum method) : we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster. Complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster. Average-link clustering we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
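
The three linkage rules can be written directly as functions of the pairwise point distances; a short sketch, again with invented helper names.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):     # shortest distance between any cross-cluster pair
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):   # longest distance between any cross-cluster pair
    return max(dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2):    # average over all cross-cluster pairs
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))
```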

Single-Link Method Euclidean Distance (Figure: distance matrix and merge steps (1)-(3); the clusters grow as a,b then a,b,c then a,b,c,d.)

Complete-Link Method Euclidean Distance (Figure: distance matrix and merge steps (1)-(3); a,b and c,d are formed first, then merged into a,b,c,d.)

Key Terms and Concepts Dendrograms and Linkage The resulting tree structure is usually referred to as a dendrogram. In a dendrogram, the length of each tree branch represents the distance between the clusters it joins. Different dendrograms may arise when different linkage methods are used. (Figure: single-link vs. complete-link dendrograms of the same data, with branch heights around distances 2, 4 and 6.)
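
In practice the tree and dendrogram are usually produced with a library; a minimal sketch assuming SciPy and Matplotlib are installed, with a made-up toy expression matrix.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

expr = np.random.rand(10, 4)                 # 10 genes x 4 samples (toy data)

for method in ("single", "complete"):        # different linkage, different dendrogram
    Z = linkage(expr, method=method, metric="euclidean")
    plt.figure()
    dendrogram(Z, labels=[f"gene{i}" for i in range(10)])
    plt.title(f"{method}-link dendrogram")
plt.show()
```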

Two Way Hierarchical Clustering Note that we can do two-way clustering by performing clustering on both the rows and the columns. It is common to visualise the data as a heatmap, as shown. Don't confuse the heatmap with the colours of a microarray image. They are different! Why?

K-Means Clustering Basic ideas: use cluster centroids (means) to represent clusters, and assign data elements to the closest cluster (centroid). Goal: minimise the squared error (intra-cluster dissimilarity).

K-means Clustering Algorithm 1) Select an initial partition of k clusters 2) Assign each object to the cluster with the closest centroid 3) Compute the new centroid of each cluster (the mean of its member points) 4) Repeat steps 2 and 3 until no object changes cluster
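
A minimal from-scratch sketch of the four steps above (assuming NumPy; the initialisation and stopping rule shown are one of several reasonable choices).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initial centroids
    for _ in range(n_iter):
        # step 2: assign each object to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its members
        # (empty clusters are not handled in this toy sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # step 4: stop when nothing changes
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```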

The K-Means Clustering Method Example

Summary Clustering algorithms are used to find similarity relationships between genes, diseases, tissues or samples. Different similarity metrics can be used (mainly Euclidean and Manhattan). Hierarchical clustering: similarity matrix, algorithm, linkage methods. K-means clustering algorithm.

Data Classification Lecture Overview Introduction: Diagnostic and Prognostic Tools; Data Classification; Classification vs. Clustering. Examples of Simple Classification Algorithms: Centroid-based; k-NN. Decision Trees: Basic Concept; Algorithm; Entropy and Information Gain; Extracting rules from trees. Bayesian Classifiers. Evaluating Classifiers.

Introduction Predictive Modelling Diagnostic tools: one of the most exciting areas of microarray research is the use of microarrays to find groups of genes that can be used diagnostically to determine the disease an individual is suffering from. Tissue classification tools: a simple example is, given measurements from one tissue type, to ascertain whether the tissue has markers of cancer or not, and if so which type of cancer. Prognostic tools: another exciting area is, given measurements from an individual's sample, to predict the success of a course of a particular therapy. In both cases we can train a classification algorithm on previously collected data so as to obtain a predictive modelling tool. The aim of the algorithm is to find a small set of features and their values (e.g. a set of genes and their expression values) that can be used in future predictions (or classifications) on unseen samples.

Classification: Obtaining a labelled training data set Goal: identify a subset of genes that distinguish between treatments, tissues, etc. Method: collect several samples grouped by type (e.g. Diseased vs. Healthy) or by treatment outcome (e.g. Success vs. Failure); use genes as "features"; build a classifier to distinguish treatments.

ID   G1     G2     G3    G4     Cancer
1    11.12   1.34  1.97  11.00  No
2    12.34   2.01  1.22  11.10  No
3    13.11   1.34  1.34   2.00  Yes
4    13.34  11.11  1.38   2.23  Yes
5    14.11  13.10  1.06   2.44  Yes
6    11.34  14.21  1.07   1.23  No
7    21.01  12.32  1.97   1.34  Yes
8    66.11  33.30  1.97   1.34  Yes
9    33.11  44.10  1.96  11.23  Yes

To predict categorical class labels, construct a model based on the training set, and then use the model to classify new unseen data.
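
A small sketch of this workflow, assuming scikit-learn is installed; it trains a decision tree on the table above and classifies a hypothetical unseen sample.

```python
from sklearn.tree import DecisionTreeClassifier

# expression values for G1..G4 (rows = samples, taken from the table above)
X = [
    [11.12,  1.34, 1.97, 11.00],
    [12.34,  2.01, 1.22, 11.10],
    [13.11,  1.34, 1.34,  2.00],
    [13.34, 11.11, 1.38,  2.23],
    [14.11, 13.10, 1.06,  2.44],
    [11.34, 14.21, 1.07,  1.23],
    [21.01, 12.32, 1.97,  1.34],
    [66.11, 33.30, 1.97,  1.34],
    [33.11, 44.10, 1.96, 11.23],
]
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]

model = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(model.predict([[12.0, 1.5, 1.3, 2.1]]))   # classify a hypothetical unseen sample
```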

Classification: Generating a predictive model The output of a classifier is a predictive model that can be used to classify unseen samples based on the values of their gene expressions. The model shown below is a special type of classification model, known as a decision tree. (Figure: a decision tree testing G1, G3 and G4 with thresholds such as <=22/>22, <=12/>12 and <=52/>52, leading to Yes/No leaves.)

Classification Overview Task: determine which of a fixed set of classes an example belongs to. Inductive Learning System: Input: a training set of examples annotated with class values. Output: induced hypotheses (model / concept description / classifiers). Learning: induce classifiers from training data. (Figure: training data feeding an inductive learning system, which outputs classifiers / derived hypotheses.)

Classification Overview Using a Classifier for Prediction Using the hypothesis for prediction: classifying any example described in the same manner as the data used in training the system (i.e. the same set of features). (Figure: data to be classified passes through the classifier, which outputs a decision on class assignment.)

Classification Examples in all walks of life (Figure: an example decision tree on weather data: Outlook {Sunny, Overcast, Rain}, Humidity {High, Normal}, Wind {true, false}, with Yes/No leaves.) The values of the features in the table can be categorical or numerical; however, we only deal with categorical variables in this course. The class value has to be categorical.

Classification vs. Clustering Classification: known number of classes; based on a training set; used to classify future observations. Clustering: unknown number of classes; no prior knowledge; used to understand (explore) data. Classification is a form of supervised learning; clustering is a form of unsupervised learning. As a third method, let us say something about pattern recognition. This method deals with decision-making processes: we first want to understand these processes and then automate them with the help of computers. Pattern recognition can be divided into two classes, supervised and unsupervised. In supervised pattern recognition one assumes a known number of classes; in unsupervised pattern recognition the number of classes is unknown. Supervised pattern recognition is based on a so-called training set, a series of observations for which the class assignment is already known. On the basis of this known assignment, the actual observations with unknown class membership are assigned to the classes. In unsupervised pattern recognition no a priori knowledge is assumed. Supervised pattern recognition is used to classify future observations into predefined classes. Cluster analysis, as just presented, counts as a form of unsupervised pattern recognition, so we will not go into unsupervised pattern recognition further; in what follows, supervised pattern recognition is presented.

Typical Classification Algorithms Centroid Classifiers kNN: k Nearest Neighbours Bayesian Classification: Naïve Bayes and Bayesian Networks Decision trees Neural Networks Linear Discriminant Analysis Support Vector Machines …..

Types of Classifiers Linear vs. non-linear A linear discriminant in 2-D is a straight line; in N-D it is a hyperplane. In the two-gene example, a linear classifier assigns class 'o' when a*G1 + b*G2 > t, whereas a non-linear classifier uses a curved decision boundary. Linear classifiers are easier to develop, e.g. the Linear Discriminant Analysis (LDA) method, which tries to find a good regression line by minimising the squared errors on the training data. Linear classifiers, however, may produce models that are not perfect on the training data. Non-linear classifiers tend to be more accurate but may over-fit the data; by over-fitting the data, they may actually perform worse on unseen data.
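
The linear decision rule a*G1 + b*G2 > t can be written as a tiny sketch; the weights and threshold below are arbitrary illustrative values.

```python
def linear_classifier(g1, g2, a=0.8, b=1.2, t=10.0):
    """Assign class 'o' if the weighted sum exceeds the threshold t, else '*'."""
    return "o" if a * g1 + b * g2 > t else "*"

print(linear_classifier(12.0, 3.0))   # 'o'
print(linear_classifier(2.0, 1.0))    # '*'
```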

Types of Classifiers K-Nearest Neighbour Classifiers K-NN works by assigning a data point to the class of its k closest neighbours (e.g. based on Euclidean or Manhattan distance). K-NN returns the most common class label among the k training examples nearest to the query point x. We usually set k > 1 to avoid outliers. Variations: we can also use a radius threshold rather than k, and we can set a weight for each neighbour that takes into account how far it is from the query point. Model training: none. Classification: given a data point, locate the k nearest points and assign the majority class of the k points.
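
A from-scratch sketch of k-NN classification as described above: no training phase, just a majority vote among the k nearest labelled points (toy data invented for illustration).

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(query, train_points, train_labels, k=3):
    # sort training points by distance to the query and take the k closest
    neighbours = sorted(zip(train_points, train_labels),
                        key=lambda pl: euclidean(query, pl[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]       # majority class among the k neighbours

points = [(1, 1), (1, 2), (6, 6), (7, 6), (6, 7)]
labels = ["No", "No", "Yes", "Yes", "Yes"]
print(knn_classify((2, 1), points, labels, k=3))   # 'No'
```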

Types of Classifiers Decision Trees A decision tree is a flow-chart-like tree structure: an internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions. Decision tree generation: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes. Use of a decision tree: an unknown sample is classified by testing its attribute values against the decision tree. (Figure: the weather-data decision tree from the earlier example.)

Types of Classifiers Decision Tree Construction General idea: using the training data, choose the best feature to be used for the logical test at the root of the tree; partition the training data into sub-groups based on the values of the logical test; recursively apply the same procedure (select attribute and split) and terminate when all the data elements in a branch are of the same class. The key to success is how to choose the best feature at each step. The basic approach to selecting an attribute is to examine each attribute and evaluate its likelihood of improving the overall decision performance of the tree. The most widely used node-splitting evaluation functions work by reducing the degree of randomness, or 'impurity', in the current node.

Decision Tree Construction Algorithm Basic algorithm (a greedy algorithm) Tree is constructed in a top-down recursive manner At start, all the training examples are at the root Attributes are categorical (if continuous-valued, they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning All samples for a given node belong to the same class There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf There are no samples left
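
A compact ID3-style sketch of this greedy procedure for categorical attributes, using information gain to pick each split; all helper names and the toy rows are our own.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, labels, attr):
    """Gain(A) = entropy before the split minus expected entropy after splitting on attr."""
    total = len(labels)
    expected = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        expected += len(subset) / total * entropy(subset)
    return entropy(labels) - expected

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                       # all samples in one class: make a leaf
        return labels[0]
    if not attrs:                                   # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):    # split and recurse on each branch
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return tree

rows = [{"G1": "<=30", "G3": "no"}, {"G1": "<=30", "G3": "yes"}, {"G1": "31...40", "G3": "no"}]
labels = ["no", "yes", "yes"]
print(build_tree(rows, labels, ["G1", "G3"]))
```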

Decision Tree Example In the simple example shown, the expression values, which are usually numbers, have been made into discrete values. There are more complex methods that can deal with numeric features, but they are beyond the scope of this course. In the example, three discrete ranges are used for gene 1, two ranges (high/low) for genes 2 and 4, and expressed (yes/no) for gene 3.

Decision Trees Using Information Gain Select the attribute with the highest information gain. Assume there are two classes, P and N. Let the set of examples S contain p elements of class P and n elements of class N. The amount of information (entropy) is: I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)).

Information Gain in Decision Tree Construction Assume that using attribute A a set S will be partitioned into sets {S1, S2, ..., Sv}. If Si contains pi examples of P and ni examples of N, the expected information (total entropy) in all subtrees Si generated by the partition via A is E(A) = sum over i of ((pi + ni)/(p + n)) * I(pi, ni). The encoding information that would be gained by branching on A is Gain(A) = I(p, n) - E(A).
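
A quick numerical check of these formulas in Python; the 9/5 class split matches the value quoted on the next slide, while the three-way partition is hypothetical.

```python
import math

def info(p, n):
    """I(p, n): entropy of a node with p examples of class P and n of class N."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

print(f"{info(9, 5):.3f}")                       # prints 0.940, as on the next slide

subsets = [(2, 3), (4, 0), (3, 2)]               # a hypothetical partition of the 14 samples
E = sum((p + n) / 14 * info(p, n) for p, n in subsets)
print(f"{info(9, 5) - E:.3f}")                   # information gain for this partition
```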

Attribute Selection by Information Gain Computation Class P: diseased = "yes". Class N: diseased = "no". I(p, n) = I(9, 5) = 0.940. Compute the expected information E(G1) for attribute G1; hence Gain(G1) = I(9, 5) - E(G1). Similarly compute the gain for the other attributes and select the attribute with the highest gain.

Extracting Classification Rules from Trees Decision Trees can be simplified by representing the knowledge in the form of IF-THEN rules that are easier for humans to understand One rule is created for each path from the root to a leaf Each attribute-value pair along a path forms a conjunction The leaf node holds the class prediction Example IF G1 = “<=30” AND G3 = “no” THEN diseased = “no” IF G1 = “<=30” AND G3 = “yes” THEN diseased = “yes” IF G1 = “31…40” THEN diseased = “yes” IF G1 = “>40” AND G4 = “high” THEN diseased = “yes” IF G1 = “>40” AND G4 = “low” THEN diseased = “no”
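
With a library-built tree (for example, the scikit-learn model from the earlier sketch), the same path-to-rule idea can be printed automatically; a small sketch with toy, hypothetical encodings of G1, G3 and G4.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# toy encoding: G1 as a number, G3 (expressed yes=1/no=0), G4 (high=1/low=0)
X = [[25, 0, 0], [25, 1, 0], [35, 0, 0], [45, 0, 1], [45, 0, 0]]
y = ["no", "yes", "yes", "yes", "no"]

model = DecisionTreeClassifier().fit(X, y)
print(export_text(model, feature_names=["G1", "G3", "G4"]))   # one IF-THEN path per leaf
```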

Further Notes We have mainly used examples with two classes; however, most classification algorithms can work with many class values so long as they are discrete. We have also mainly concentrated on examples that work on discrete feature values. Note that in many cases the data may be of very high dimensionality; this may cause problems for the algorithms, and dimensionality reduction methods might be needed.

Summary Classification algorithms can be used to develop diagnostic and prognostic tools based on collected data by generating predictive models that can label unseen data into existing classes. Simple classification methods: LDA, centroid-based classifiers and k-NN. Decision trees: decision tree induction works by choosing the best logical test for each tree node one at a time, then recursively splitting the data and applying the same procedure; entropy and information gain are the key concepts applied. Not all classifiers achieve 100% accuracy; confusion matrices can be used to evaluate their accuracy.