1 Multi-Class Cancer Classification Noam Lerner
2 Our world – very generally Genes. Gene samples. Our goal: classifying the samples. Example: we want to be able to determine whether a given sample belongs to a certain type of cancer.
3 Our problem Say we have p genes and N samples. Normally p < N, so it's easy to classify samples. But what if N < p?
4 The algorithm scheme Gene screening. Dimension reduction. Classification. We’ll present 3 variations of this algorithm scheme.
5 Before gene screening - Classes Normally, a class of genes is a set of genes that behave similarly under certain conditions. Example: one can divide genes into a class of genes that indicate a certain type of cancer, and another class of genes that do not. Taking it one step further:
6 Multi-classes Dividing a group of genes into two or more classes is called a "multi-class". What is it good for? Distinguishing between types of cancer. Example: Leukemia: AML, B-ALL, T-ALL.
7 Gene Screening Generally, gene screening is a method that is used to disregard unimportant genes. Example: gene predictors.
8 The Gene Screening process Suppose we have G classes that represent G types of cancer (we know which genes belong in each class). For each gene we compare every two classes pair-wise and check whether the absolute mean difference |x̄_r − x̄_s| is greater than a certain critical score (x̄_r is the mean of the r-th set of the multi-class).
9 What is the critical score? c = t · sqrt(MSE · (1/n_r + 1/n_s)), where: MSE – the (pooled within-class) mean squared error; n_r – the size of the r-th multi-class set; t – arises from Student's t-distribution.
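A minimal sketch of this screening step, assuming a NumPy expression matrix X of shape (N, p) and integer class labels y. The function name screen_genes and the min_pairs threshold (how many pairwise comparisons a gene must pass, echoing the bracketed counts on the results slide) are our own, not the article's:

```python
# Sketch of pairwise mean-difference gene screening; names are ours.
import numpy as np
from scipy import stats
from itertools import combinations

def screen_genes(X, y, alpha=0.05, min_pairs=1):
    classes = np.unique(y)
    N, p = X.shape
    G = len(classes)
    means = np.array([X[y == r].mean(axis=0) for r in classes])  # (G, p)
    sizes = np.array([(y == r).sum() for r in classes])          # (G,)
    # Pooled within-class mean squared error per gene, df = N - G.
    sse = sum(((X[y == r] - means[i]) ** 2).sum(axis=0)
              for i, r in enumerate(classes))
    mse = sse / (N - G)
    tcrit = stats.t.ppf(1 - alpha / 2, df=N - G)
    passes = np.zeros(p, dtype=int)
    for i, j in combinations(range(G), 2):
        c = tcrit * np.sqrt(mse * (1 / sizes[i] + 1 / sizes[j]))
        passes += (np.abs(means[i] - means[j]) > c).astype(int)
    return np.where(passes >= min_pairs)[0]  # indices of retained genes
```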
10 Student's t-distribution The t-distribution is used for inference about the mean of a normally distributed population when the sample size is small. Fact: the t-distribution depends on the sample size, but not on the mean or the variance of the population. This lack of dependence is what makes the t-distribution important in both theory and practice. Anecdote: William S. Gosset published a paper on this subject under the pseudonym "Student", and that is how the distribution got its name.
11 The student t test The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two groups. It is assumed that the two groups have the same variance.
12 The student t test (cont.) Consider three situations in which the difference between the two group means is the same, but the variability of the scores is medium, high, and low, respectively:
13 The student t test (cont.) The first thing to notice about the three situations is that the difference between the means is the same in all three. We would want to conclude that the two groups are similar in the high-variability case, and distinct in the low-variability case. Conclusion: when we look at the differences between scores for two groups, we have to judge the difference between their means relative to the spread, or variability, of their scores. The Student t-test does just this.
14 The student t test (cont.) We say that two classes passed the Student t-test if the t statistic is greater than a certain critical value. Risk level: usually α = 0.05. Degrees of freedom: n_1 + n_2 − 2; the critical value is looked up in a t-table.
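For illustration, a two-sample pooled-variance t-test as described here, using SciPy; the data are synthetic stand-ins for one gene's expression in two classes:

```python
# Two-sample t-test with equal variances (matches the pooled test above).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=12)   # class 1 samples for one gene
b = rng.normal(1.0, 1.0, size=15)   # class 2 samples for the same gene

t, pval = stats.ttest_ind(a, b, equal_var=True)  # df = 12 + 15 - 2 = 25
alpha = 0.05
print(f"t = {t:.2f}, p = {pval:.3f}, significant: {pval < alpha}")
```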
15 Dimension Reduction It appears that we need more than gene screening. Reminder: we have p genes and N samples, with N < p. Most classification methods (the next phase of the algorithm) assume that p < N. The solution: dimension reduction – reducing the gene-space dimension from p to K, where K << N.
16 Dimension Reduction (cont.) This is done by constructing K gene components and then classifying the cancers based on the constructed K gene components. Multivariate Partial Least Squares (MPLS) is a dimension reduction method. Example:
17 Example Reducing dimension from 35 to 3 (5 classes).
18 Example (cont.) This is the NCI60 data set, which contains 5 different types of cancer.
19 MPLS Suppose we have G classes, and suppose y indicates the cancer classes 1,…,G. We define an indicator row for every sample i: y_i = (y_i1,…,y_iG), where y_ig = 1 if sample i belongs to class g and 0 otherwise; stacking these rows gives the response matrix Y. Fix a K (our desired reduced dimension).
20 MPLS (cont.) Suppose X is the gene expression values matrix, and suppose t_1,…,t_K are linear combinations of the columns of X. Then MPLS finds (easily) two unit vectors w and c such that the covariance cov(Xw, Yc) is maximized. Then MPLS extracts t_1,…,t_K, and we are done.
21 Why maximize the covariance? If cov(x, y) > 0 then y increases as x increases. If cov(x, y) < 0 then y decreases as x increases. By maximizing the covariance, we get that Yc increases as Xw increases. That way we get a good estimation of Yc by Xw, and we have found our MPLS components t_1,…,t_K.
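A sketch of the MPLS step, under the assumption that scikit-learn's PLSRegression, fit against the indicator matrix Y from slide 19, is an acceptable stand-in for the article's MPLS; the helper name mpls_components is ours:

```python
# MPLS-style dimension reduction via PLS on the class-indicator matrix.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def mpls_components(X, y, K=3):
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)  # (N, G) indicators
    pls = PLSRegression(n_components=K, scale=True)
    pls.fit(X, Y)
    T = pls.transform(X)    # (N, K) gene components t_1, ..., t_K
    return T, pls
```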
22 Classification After we have reduced the dimension of the gene space, we need to actually classify the sample(s). It’s important to pick a classification method that will work properly after dimension reduction. We’ll present two different methods: PD and QDA.
23 PD (Polychotomous Discrimination) Recall the indicator y that indicates the cancer classes 1,…,G. Let x be the vector of a sample's gene expression values. Then the distribution of y depends on x (we think of y as a random variable). We also suppose that the G class probabilities sum to 1.
24 PD (cont.) We define p_r(x) = P(y = r | x). After a few mathematical transitions we get that p_r(x) = exp(x'β_r) / (1 + Σ_{g=1}^{G−1} exp(x'β_g)). This is the probability that a sample with gene expression profile x is of cancer class r.
25 PD (cont.) By looking at the previous formula through a certain mathematical model, we can estimate a parameter vector that holds all the data, by maximizing its likelihood. The likelihood can be maximized only if there are more samples (N) than parameters (p), and by using dimension reduction we get just that.
26 PD (cont.) So, instead of looking at the gene expression profile x, we'll look at the corresponding gene component profile t. Now let's look at the new probabilities p_r(t), which rely on the new profile t. Finally, we'll say that t (and therefore x) belongs to the r-th cancer class if p_r(t) is the largest of the G probabilities. A more detailed explanation of PD is given in the presentation's appendix.
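PD is, in essence, multinomial (softmax) logistic regression, so as a hedged stand-in for the article's fitting procedure, scikit-learn's LogisticRegression applied to the K gene components approximates this step; pd_classifier is our name:

```python
# Multinomial logistic regression as a stand-in for PD on the components.
from sklearn.linear_model import LogisticRegression

def pd_classifier(T_train, y_train):
    # Softmax over the G classes = polychotomous regression.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(T_train, y_train)
    return clf

# clf.predict_proba(T_new) returns the p_r(t); argmax over r gives the class.
```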
27 QDA (Quadratic Discriminant Analysis) Recall the indicator y that indicates the cancer classes 1,…,G. Consider the following multivariate normal model (for each cancer class): x | y = r ~ N(μ_r, Σ_r), with a separate mean vector and covariance matrix per class.
28 QDA (cont.) Suppose ĉ(x) is the classification rule; then a sample is assigned to the r-th cancer class by ĉ(x) = argmax_r π_r · f_r(x), where f_r is class r's pdf function and π_r its prior probability.
29 QDA (cont.) Again, instead of looking at x we'll look at the corresponding gene component profile t, and get the desired classification.
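A minimal QDA sketch on the gene components, using scikit-learn on synthetic stand-in data shaped like the leukemia example (N = 72, K = 3):

```python
# QDA on the (N, K) gene components; the data here are synthetic stand-ins.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(1)
T = rng.normal(size=(72, 3))       # stand-in for the (N, K) gene components
y = rng.integers(0, 3, size=72)    # stand-in for 3 cancer classes

qda = QuadraticDiscriminantAnalysis()
qda.fit(T, y)                      # estimates mu_r, Sigma_r and priors pi_r
print(qda.predict(T[:5]))          # each prediction is argmax_r pi_r * f_r(t)
```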
30 Review - the big picture Gene screening – allows us to get rid of genes that won't tell us anything. Dimension reduction – allows us to reduce the gene space and work on the data. Classification – allows us to decide to which class of the multi-class a sample belongs.
31 Just before the algorithm We want a way to assess whether we generated a correct classification. To do that, we use LOOCV.
32 LOOCV LOOCV stands for Leave-One-Out Cross-Validation. In this process, we remove one data point from our data, run our algorithm, and try to estimate the removed data point using our results, as if we didn't know the original data point. Then we assess the error. This step is repeated for every data point, and finally we accumulate the errors in some form into a final error estimate.
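A generic LOOCV loop, assuming make_model is any factory returning a scikit-learn-style estimator; the helper name loocv_error is ours:

```python
# Leave-one-out cross-validation: accumulate misclassifications over N folds.
import numpy as np
from sklearn.model_selection import LeaveOneOut

def loocv_error(X, y, make_model):
    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        errors += int(model.predict(X[test_idx])[0] != y[test_idx][0])
    return errors / len(y)   # accumulated error rate
```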
33 The 1st algorithm variation
1. Gene screening: select a set S of m genes, giving an expression matrix X of size N x m.
2. Dimension reduction: use MPLS to reduce X to T, where T is of size N x K.
3. Classification: for i = 1 to N:
   a. leave out sample (row) i of T;
   b. fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
34 The 2nd algorithm variation
1. Gene screening: select a set S of m genes, giving an expression matrix X of size N x m.
2. For i = 1 to N:
   a. leave out sample (row) i of the expression matrix X, creating X_-i;
   b. dimension reduction: use MPLS to reduce X_-i to T_-i, where T_-i is of size N-1 x K;
   c. classification: fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
35 Class question Q: What is the difference between the 1st and 2nd variations? A1: In the 1st variation, steps 1 and 2 are fixed with respect to LOOCV. Therefore, the effect of gene screening and dimension reduction on the classification cannot be assessed. A2: In the 2nd variation, we can assess the effect of the dimension reduction.
36 More on the 1st variation Results show that the 1st variation does not yield good results (the classification error rates were more optimistic than the expected error rates). Taking it to the next level:
37 The 3rd algorithm variation
For i = 1 to N:
1. Leave out sample (row) i of the original expression matrix X_0.
2. Gene screening: select a set S_-i of m genes, giving an expression matrix X_-i of size N-1 x m.
3. Dimension reduction: use MPLS to reduce X_-i to T_-i, where T_-i is of size N-1 x K.
4. Classification: fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
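Putting the pieces together, a sketch of this 3rd variation that reuses the screen_genes and mpls_components helpers from the earlier sketches and QDA as the classifier; all the naming here is our illustration, not the article's code:

```python
# 3rd variation: screening and reduction are redone inside every LOOCV fold,
# so the left-out sample never influences them.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def variation3_loocv(X0, y, K=3):
    N = len(y)
    errors = 0
    for i in range(N):
        keep = np.arange(N) != i                    # leave out sample i
        S = screen_genes(X0[keep], y[keep])         # gene screening
        T, pls = mpls_components(X0[keep][:, S], y[keep], K=K)
        clf = QuadraticDiscriminantAnalysis().fit(T, y[keep])
        t_i = pls.transform(X0[i:i + 1, S])         # project sample i
        errors += int(clf.predict(t_i)[0] != y[i])
    return errors / N                               # LOOCV error rate
```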
38 Class question Q: What is the difference between the 2nd and 3rd variations? A: The gene screening stage is fixed with respect to LOOCV in the 2nd variation, and isn't in the 3rd variation. That allows us to assess the error in the gene screening stage in the 3rd variation.
39 About the 3 variations The 3rd variation is the only one that allows us to check the correctness of our model. Why? Because this is the only variation where we use LOOCV to delete a sample from our input matrix and then try to estimate it. In the other two variations, we estimate a sample after having used it in our process.
40 Results Acute Leukemia Data Number of samples: N = 72. Number of genes: p = 3490. The multi-class: AML: 25 samples. B-ALL: 38 samples. T-ALL: 9 samples. New reduced dimension: K = 3.
41 Results (cont.) Notations: numbers in brackets are the number of times we demanded that the pairwise absolute mean difference pass the critical score; numbers not in brackets are the number of genes that passed. In A2, the three numbers are the min–mean–max number of genes selected (the gene screening process selects a different set each time). Data: the error rate. Best result: QDA.
42 Article Criticism The article does present a model that seems appropriate for solving the problem. However, results show that there is a certain error rate (about 1/20). The article was not clear on several subjects. Nonetheless, it was interesting to read.
43 Questions? Thank you for listening.
44 References
The article: Multi-class cancer classification via partial least squares with gene expression profiles, by Danh V. Nguyen and David M. Rocke.
Student's t-distribution: http://en.wikipedia.org/wiki/T_distribution
Student t-test: http://www.socialresearchmethods.net/kb/stat_t.htm
LOOCV: http://www-2.cs.cmu.edu/~schneide/tut5/node42.html
45 Appendix - Polychotomous Discrimination – explicit explanation. Why do we work with the ratios p_r(x)/p_G(x)? To avoid calculating p_G(x) directly. Explanation: remember that the G probabilities sum to 1. So p_G(x) = 1 − Σ_{r=1}^{G−1} p_r(x), so we don't have to calculate p_G(x) on its own.
46 PD We assume we can write log(p_r(x)/p_G(x)) = g_r(x) for r = 1,…,G−1. Remembering that the probabilities sum to 1, we can get to p_r(x) = exp(g_r(x)) / (1 + Σ_{g=1}^{G−1} exp(g_g(x))). This is our polychotomous regression model. Next, we assign beta to that formula (replacing g_r(x) with the linear form x'β_r).
47 PD Next, we define β = (β_1',…,β_{G−1}')'; this single vector holds our whole model. Now we want to estimate β using MLE – Maximum Likelihood Estimation. We'll describe how to do that.
48 PD Defining a notation: p_ir = p_r(x_i). Now, rewriting the formula from two slides back: p_ir = exp(x_i'β_r) / (1 + Σ_{g=1}^{G−1} exp(x_i'β_g)). So, by taking log, we get: log p_ir = x_i'β_r − log(1 + Σ_{g=1}^{G−1} exp(x_i'β_g)). Next, define a row of indicators for sample i: z_i = (z_i1,…,z_iG), where z_ir ∈ {0, 1} and Σ_r z_ir = 1, and where z_ir states whether sample i belongs to cancer class r.
49 PD Now, define a matrix Z whose i-th row is z_i. Notice that every row of Z sums to 1, meaning that in every row of Z the sample was classified to exactly one cancer class. Using these notations, we conclude that the likelihood for N independent samples is: L(β) = Π_{i=1}^N Π_{r=1}^G p_ir^{z_ir}.
50 PD Taking log, we get the log-likelihood (which is easier to compute): l(β) = Σ_{i=1}^N Σ_{r=1}^G z_ir · log p_ir.
51 PD Next, remembering that Σ_r z_ir = 1 and substituting the expression for log p_ir, we get that l(β) = Σ_{i=1}^N [ Σ_{r=1}^{G−1} z_ir · x_i'β_r − log(1 + Σ_{g=1}^{G−1} exp(x_i'β_g)) ]. Now, this expression can be maximized to achieve the MLE using the Newton-Raphson method. One of the cases in which the MLE exists is when the classes overlap – no vector separates the index sets identifying the samples of the different classes (Albert & Anderson, 1984).
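An illustrative Newton-Raphson iteration for this log-likelihood (not the article's code), assuming a design matrix X with an intercept column and the indicator matrix Z from slide 49, with class G as the zero-parameter baseline:

```python
# Newton-Raphson for the multinomial-logistic MLE; beta_G = 0 is the baseline.
import numpy as np

def newton_raphson_mle(X, Z, iters=25):
    N, q = X.shape
    G = Z.shape[1]
    B = np.zeros((q, G - 1))                   # beta_1, ..., beta_{G-1}
    for _ in range(iters):
        eta = np.column_stack([X @ B, np.zeros(N)])
        P = np.exp(eta - eta.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)      # p_ir
        grad = (X.T @ (Z - P)[:, :G - 1]).ravel(order="F")
        H = np.zeros((q * (G - 1), q * (G - 1)))
        for i in range(N):
            p = P[i, :G - 1]
            W = np.diag(p) - np.outer(p, p)    # per-sample weight block
            H -= np.kron(W, np.outer(X[i], X[i]))
        # Newton step: B <- B - H^{-1} * gradient (H is the Hessian of l).
        B -= np.linalg.solve(H, grad).reshape(q, G - 1, order="F")
    return B
```

If the classes are perfectly separable, the MLE does not exist and this iteration diverges, which is exactly the existence condition cited above.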
52 Appendix References
Article appendices: http://dnguyen.ucdavis.edu/.html/SUP_cla2/SupplementalAppendix.pdf
Newton-Raphson method: http://en.wikipedia.org/wiki/Newton-Raphson_method
On the Existence of Maximum Likelihood Estimates in Logistic Regression Models (A. Albert and J. A. Anderson, 1984): http://www.qiji.cn/eprint/abs/2376.html
53 Abstract This presentation deals with multi-class cancer classification – the process of classifying samples into multiple types of cancer. The article describes a 3-phase algorithm scheme to demonstrate the classification. The 3 phases are gene selection, dimension reduction, and classification. We present one example of a gene selection method, one example of a dimension reduction method (MPLS), and two classification methods (PD and QDA), which we then compare. The presentation also covers concepts like class, multi-class, the t-test, and LOOCV.