
1 CZ5225: Modeling and Simulation in Biology. Lecture 7: Microarray Class Classification by Machine Learning Methods. Prof. Chen Yu Zong. Tel: 6874-6877. Email: phacyz@nus.edu.sg. Web: http://bidd.nus.edu.sg. Room 07-24, Level 8, S16, National University of Singapore.

2 Machine Learning Method. Inductive learning is example-based learning: a set of descriptors encodes each example, and the training data consist of positive examples and negative examples.

3 Machine Learning Method. Each example's descriptors form a feature vector. Feature vectors: A=(1, 1, 1), B=(0, 1, 1), C=(1, 1, 1), D=(0, 1, 1), E=(0, 0, 0), F=(1, 0, 1). (Figure: the descriptor table, with examples labeled positive or negative.)

4 Machine Learning Method. The feature vectors live in an input space: A=(1, 1, 1), B=(0, 1, 1), C=(1, 1, 1), D=(0, 1, 1), E=(0, 0, 0), F=(1, 0, 1). (Figure: the points plotted on the X, Y, Z axes of the input space.)

5 Machine Learning Method. A sample is a vector A = (a_1, a_2, a_3, …, a_N). The machine-learning task thus becomes finding a border line that optimally separates the known positive samples from the known negative samples in the training set.

6 Classifying Cancer Patients vs. Healthy Patients from Microarray. Patient_X = (gene_1, gene_2, gene_3, …, gene_N). N (the number of dimensions) is normally much larger than 2, so we cannot visualize the data. (Figure: cancerous vs. healthy samples.)

7 Classifying Cancer Patients vs. Healthy Patients from Microarray. For simplicity, pretend that we are only looking at the expression levels of 2 genes. (Figure: scatter plot of cancerous vs. healthy samples; both axes run from -5, down-regulated, to 5, up-regulated, for the Gene_1 and Gene_2 expression levels.)

8 Classifying Cancer Patients vs. Healthy Patients from Microarray. Question: how can we build a classifier for this data? (Figure: the same Gene_1 vs. Gene_2 scatter plot.)

9 Classifying Cancer Patients vs. Healthy Patients from Microarray. Simple classification rule: IF gene_1 < 0 AND gene_2 < 0 THEN person = healthy; IF gene_1 > 0 AND gene_2 > 0 THEN person = cancerous. (Figure: the rule overlaid on the Gene_1 vs. Gene_2 scatter plot.)
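A minimal sketch of this rule in Python; the thresholds come from the slide, while the function name, the "undetermined" fallback, and the sample values are illustrative assumptions:

```python
def classify_patient(gene_1: float, gene_2: float) -> str:
    """Toy two-gene threshold rule from the slide."""
    if gene_1 < 0 and gene_2 < 0:
        return "healthy"
    if gene_1 > 0 and gene_2 > 0:
        return "cancerous"
    return "undetermined"  # the slide's rule does not cover mixed signs

print(classify_patient(-1.2, -0.8))  # -> healthy
print(classify_patient(2.1, 0.4))    # -> cancerous
```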

10 Classifying Cancer Patients vs. Healthy Patients from Microarray. Extended classification rule: IF gene_1 < 0 AND gene_2 < 0 AND … AND gene_5000 < Y THEN person = healthy; IF gene_1 > 0 AND gene_2 > 0 AND … AND gene_5000 > W THEN person = cancerous. If we move from our simple example with 2 genes to a realistic case with, say, 5000 genes, then: 1. What will these rules look like? 2. How will we find them? It gets complicated and unwieldy.

11 Classifying Cancer Patients vs. Healthy Patients from Microarray. Reformulate the previous rule as a line. Simple rule: if a data point lies to the 'left' of the line, then 'healthy'; if it lies to the 'right' of the line, then 'cancerous'. It is easier to generalize this line to 5000 genes than a list of rules, and it is also easier to handle mathematically. (Figure: a separating line on the Gene_1 vs. Gene_2 scatter plot.)

12 Extension to More Than 2 Genes (Dimensions). Line in 2D: x_1 C_1 + x_2 C_2 = T. If we had 3 genes and needed to build a 'line' in 3-dimensional space, we would be seeking a plane. Plane in 3D: x_1 C_1 + x_2 C_2 + x_3 C_3 = T. In more than 3 dimensions, the 'plane' is called a hyperplane: a hyperplane is simply the generalization of a plane to dimensions higher than 3. Hyperplane in N dimensions: x_1 C_1 + x_2 C_2 + x_3 C_3 + … + x_N C_N = T.
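The hyperplane test is just a dot product compared against a threshold. A minimal sketch in Python; the coefficients C, the threshold T = 0, and the class assigned to each side are illustrative assumptions:

```python
import numpy as np

def hyperplane_side(x: np.ndarray, c: np.ndarray, t: float) -> str:
    """Evaluate x_1*C_1 + ... + x_N*C_N against the threshold T."""
    score = float(np.dot(x, c))
    return "cancerous" if score > t else "healthy"

# Illustrative 3-gene case: plane 1.0*x1 + 0.5*x2 - 0.2*x3 = 0
c = np.array([1.0, 0.5, -0.2])
print(hyperplane_side(np.array([2.0, 1.0, 0.5]), c, t=0.0))  # -> cancerous
```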

13 Classification Methods (1)

14 Classification Methods (1)

15 Classification Methods (2)

16 Classification Methods (2)

17 Classification Methods (2)

18 Classification Methods (2)

19 Classification Methods (3)

20 Classification Methods (3): K Nearest Neighbor Method
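A minimal k-nearest-neighbor sketch using scikit-learn; the toy two-gene expression matrix, the labels, and k = 3 are illustrative assumptions, not values from the lecture:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy expression matrix: rows are patients, columns are two genes
X = np.array([[1.2, 0.8], [0.9, 1.1], [-1.0, -0.7], [-1.3, -0.9]])
y = np.array(["cancerous", "cancerous", "healthy", "healthy"])

knn = KNeighborsClassifier(n_neighbors=3)  # majority vote of the 3 nearest samples
knn.fit(X, y)
print(knn.predict([[1.0, 0.9]]))  # -> ['cancerous']
```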

21 Classification Methods (4)

22 Classification Methods (4)

23 Classification Methods (4)

24 Classification Methods (5): SVM. What is an SVM? A support vector machine is a machine-learning method: it learns from examples (statistical learning) and classifies objects into one of two classes. Advantages of SVM: it tolerates diversity among class members (no 'racial discrimination'); it has a low over-fitting risk; and it is easier to find "optimal" parameters for better class-differentiation performance.
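A minimal linear-SVM sketch with scikit-learn, reusing the toy two-gene data from above; all data values and C = 1.0 are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.2, 0.8], [0.9, 1.1], [-1.0, -0.7], [-1.3, -0.9]])
y = np.array([1, 1, -1, -1])  # +1 = cancerous, -1 = healthy

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.support_vectors_)       # the training points that define the border
print(clf.predict([[1.0, 0.9]]))  # -> [1]
```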

25 Classification Methods (5): SVM Method. Project the data to a higher-dimensional space, where a new border can separate the classes. (Figure: protein family members vs. nonmembers before and after the projection, with the old border and the new border.)

26 Classification Methods (5): SVM Method. (Figure: the new border between protein family members and nonmembers, with the support vectors marked.)

27 What Is a Good Decision Boundary? Consider a two-class, linearly separable classification problem. There are many possible decision boundaries: the perceptron algorithm can be used to find such a boundary, and different algorithms have been proposed. Are all decision boundaries equally good? (Figure: Class 1 and Class 2 points with several candidate boundaries.)

28 Examples of Bad Decision Boundaries. (Figure: two Class 1 vs. Class 2 plots in which the boundary passes very close to training points of one class.)

29 Large-Margin Decision Boundary. The decision boundary should be as far away from the data of both classes as possible: we should maximize the margin m. The distance between the origin and the line w^T x = k is k/||w||. (Figure: Class 1 and Class 2 separated by a boundary with margin m.)
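To make the margin explicit, a short derivation in the standard SVM convention (the supporting hyperplanes w^T x = b ± 1 are an assumption; the slide only states the distance formula):

$$
d\big(0,\; w^{\top}x = k\big) = \frac{|k|}{\lVert w \rVert},
\qquad
m = \frac{(b+1) - (b-1)}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}.
$$

Maximizing m is therefore equivalent to minimizing ||w||, which is the complexity term minimized in the SVM objective (slide 36).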

30 SVM Method. (Figure: protein family members vs. nonmembers, showing the new border and the support vectors.)

31 SVM Method. The border line is nonlinear. (Figure: a nonlinear border between the two classes.)

32 SVM Method. Non-linear transformation: use of a kernel function.
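A sketch of the kernel idea with scikit-learn's RBF kernel on XOR-like data that no straight line can separate; the data and gamma = 1.0 are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: no linear border separates the two classes in input space
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="rbf", gamma=1.0)  # implicit non-linear map via the RBF kernel
clf.fit(X, y)
print(clf.predict(X))  # -> [ 1  1 -1 -1 ]
```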

33 SVM Method. Non-linear transformation. (Figure: mapping from the input space to the feature space.)

34 Mathematical Algorithm of SVM

35 Mathematical Algorithm of SVM

36 Mathematical Algorithm of SVM: the tradeoff between empirical error and complexity.
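The formula on this slide did not survive extraction; the standard soft-margin SVM objective that the "empirical error vs. complexity" caption describes is (a reconstruction from the standard formulation, not the slide verbatim):

$$
\min_{w,\,b,\,\xi}\;\; \underbrace{\tfrac{1}{2}\lVert w \rVert^{2}}_{\text{complexity}} \;+\; \underbrace{C \sum_{i=1}^{n} \xi_i}_{\text{empirical error}}
\quad \text{subject to} \quad y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0 .
$$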

37 Mathematical Algorithm of SVM: nonlinear decision boundaries. Map the data to a higher-dimensional feature space and construct a linear classifier in that space, which can be written as shown below.
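The classifier formula on this slide was an image; the standard kernel form of the SVM decision function (again a reconstruction, not the slide verbatim) is:

$$
f(x) = \operatorname{sign}\!\Big( \sum_{i=1}^{n} \alpha_i\, y_i\, K(x_i, x) + b \Big),
\qquad
K(x_i, x) = \phi(x_i)^{\top}\phi(x),
$$

where $\phi$ is the map into the feature space and the $\alpha_i$ come from the dual optimization.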

38 Mathematical Algorithm of SVM

39 SVM Performance Measure

40 SVM Performance Measure

41 SVM Performance Measure

42 SVM Performance Measure. Sensitivity P+ = TP/(TP+FN): accuracy for positive samples. Specificity P- = TN/(TN+FP): accuracy for negative samples. Overall prediction accuracy: Q = (TP+TN)/(TP+TN+FP+FN). Matthews correlation coefficient: MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
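The same measures in code; the function name and the confusion counts are illustrative:

```python
import math

def performance(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Two-class performance measures from the confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # accuracy for positive samples
    specificity = tn / (tn + fp)                # accuracy for negative samples
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall prediction accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"P+": sensitivity, "P-": specificity, "Q": accuracy, "MCC": mcc}

print(performance(tp=40, tn=45, fp=5, fn=10))
```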

43 Why SVM Works. The feature space is often very high dimensional; why don't we suffer from the curse of dimensionality? A classifier in a high-dimensional space has many parameters and is hard to estimate. Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of the classifier. Typically a classifier with many parameters is very flexible, but there are also exceptions. For example, let x_i = 10^-i, where i ranges from 1 to n: the one-parameter classifier f(x) = sign(sin(αx)) can classify all x_i correctly for every possible combination of class labels on the x_i. This 1-parameter classifier is very flexible.
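To make that example concrete, a sketch that shatters the points x_i = 10^-i with a single parameter; the construction of α follows Burges' SVM tutorial and is an addition, not part of the lecture:

```python
import itertools
import math

def shatter_alpha(labels):
    """Burges' construction: an alpha for which sign(sin(alpha*x)) matches the labels."""
    return math.pi * (1 + sum((1 - y) * 10**i / 2
                              for i, y in enumerate(labels, start=1)))

xs = [10.0**-i for i in range(1, 5)]  # the points x_i = 10^{-i}
for labels in itertools.product([-1, 1], repeat=4):
    a = shatter_alpha(labels)
    preds = tuple(int(math.copysign(1, math.sin(a * x))) for x in xs)
    assert preds == labels  # the single parameter alpha realizes every labeling
print("sign(sin(alpha*x)) shatters all", 2**4, "labelings")
```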

44 Why SVM Works. Vapnik argues that the flexibility of a classifier should not be characterized by its number of parameters but by its capacity; this is formalized by the "VC-dimension" of the classifier. Consider a linear classifier in two-dimensional space: if we have three training data points (in general position), then no matter how those points are labeled, we can classify them perfectly.

45 VC-Dimension. However, if we have four points, we can find a labeling such that the linear classifier fails to be perfect. So 3 is the critical number: the VC-dimension of a linear classifier in a 2D space is 3, because with 3 points in the training set perfect classification is always possible irrespective of the labeling, whereas with 4 points perfect classification can be impossible.
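The 2D case generalizes; as a standard result not stated on the slide, for linear classifiers (hyperplanes with a bias term) in n dimensions:

$$
\mathrm{VCdim}\big(\{\, x \mapsto \operatorname{sign}(w^{\top}x + b) \,:\, w \in \mathbb{R}^{n},\, b \in \mathbb{R} \,\}\big) = n + 1 .
$$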

46 VC-Dimension. The VC-dimension of the nearest-neighbor classifier is infinite, because no matter how many points you have, you get perfect classification on the training data. The higher the VC-dimension, the more flexible the classifier. VC-dimension, however, is a theoretical concept; in practice the VC-dimension of most classifiers is difficult to compute exactly. Qualitatively, if we think a classifier is flexible, it probably has a high VC-dimension.

