Computational Biology Lecture #10: Analyzing Gene Expression Data

1 Computational Biology Lecture #10: Analyzing Gene Expression Data
Bud Mishra, Professor of Computer Science and Mathematics, 11/26/2001

2 The Computational Tasks
Clustering Genes: Which genes are regulated together?
Classifying Genes: Which functional class does a particular gene fall into?
Classifying Gene Expressions: What can be learned about a cell from the set of all mRNA expressed in it?
Classifying Diseases: Does a patient have ALL or AML (classes of leukemia)?
Inferring Regulatory Networks: What is the “circuitry” of the cell?
11/20/2018 ©Bud Mishra, 2001

3 Support Vector Machine Classification of Microarray Expression Data
Brown, Grundy, Lin, Cristianini, Sugnet, Ares & Haussler ’99
Analysis of S. cerevisiae data from Pat Brown’s Lab (Stanford)
Instead of clustering genes to see what groupings emerge, devise models to match genes to predefined classes.

79 measurements for each of 2467 genes
Data collected at various times during:
  Diauxic shift (shutting down genes for metabolizing sugar, activating genes for metabolizing ethanol)
  Mitotic cell division cycle
  Sporulation
  Temperature shock
  Reducing shock

5 Genome-wide Cluster Analysis
Each measurement $G_i$ represents $\log(\mathrm{red}_i / \mathrm{green}_i)$, where red is the test expression level and green is the reference expression level for gene $G$ in the $i$th experiment.
The expression profile of a gene is the vector of measurements across all experiments: $\langle G_1, \ldots, G_m \rangle$.
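As a small illustration of the log-ratio definition above, the sketch below builds one expression profile from paired intensities. The function names and the intensity values are hypothetical, and the natural logarithm is an assumption, since the slide only says "log".

```python
import math

def log_ratio(red, green):
    """Return log(red / green) for one measurement (natural log assumed)."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log(red / green)

def expression_profile(red_levels, green_levels):
    """Build <G_1, ..., G_m> from paired test/reference levels in m experiments."""
    if len(red_levels) != len(green_levels):
        raise ValueError("need one red and one green level per experiment")
    return [log_ratio(r, g) for r, g in zip(red_levels, green_levels)]

# Hypothetical intensities for one gene in three experiments.
profile = expression_profile([200.0, 100.0, 50.0], [100.0, 100.0, 100.0])
```

A positive entry means the gene is expressed more strongly in the test sample than in the reference, a negative entry means it is expressed less, and zero means no change.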

6 The Data
n genes measured in m experiments:
$$\begin{pmatrix} G_{1,1} & \cdots & G_{1,n} \\ G_{2,1} & \cdots & G_{2,n} \\ \vdots & \ddots & \vdots \\ G_{m,1} & \cdots & G_{m,n} \end{pmatrix}$$
Each column is the vector for a gene, labeled with its class (Class$_1$, …, Class$_n$).

7 The Classes
From the MIPS yeast genome database (MYGD):
  Tricarboxylic acid pathway (Krebs cycle)
  Respiration chain complexes
  Cytoplasmic ribosomal proteins
  Proteasome
  Histones
  Helix-turn-helix (control)
Classes come from biochemical/genetic studies of genes.

8 Gene Classification
Learning Task
  Given: Expression profiles of genes and their class labels
  Do: Learn models distinguishing genes of each class from genes in other classes
Classification Task
  Given: Expression profile of a gene whose class is unknown
  Do: Predict the class to which this gene belongs

9 The Approach
Brown et al. apply a variety of algorithms to this task:
  Support vector machines (SVMs) [Vapnik ’95]
  Decision trees
  Parzen windows
  Fisher linear discriminant

10 Support Vector Machines
Consider the genes in our example as m points in an n-dimensional space (m genes, n experiments).
[Figure: genes plotted as points; axes: Experiment 1, Experiment 2]

11 Support Vector Machines
Learning in SVMs involves finding a hyperplane (decision surface) that separates the examples of one class from another.
[Figure: separating hyperplane; axes: Experiment 1, Experiment 2]

12 Support Vector Machines
For the $i$th example, let $x_i$ be the vector of expression measurements, and let $y_i$ be $+1$ if the example is in the class of interest and $-1$ otherwise.
The hyperplane is given by $w \cdot x + b = 0$, where $b$ is a constant and $w$ is a vector of weights.

13 Support Vector Machines
The function used to classify examples is then $y_P = \mathrm{sgn}(w \cdot x + b)$, where $y_P$ is the predicted value of $y$.
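The classification rule above can be sketched in a few lines. The weights, bias, and test points are hypothetical, and mapping a score of exactly zero to +1 is an assumption, since the slide does not say how ties are broken.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sgn(z):
    """Sign function; a score of exactly 0 is mapped to +1 (an assumption)."""
    return 1 if z >= 0 else -1

def predict(w, b, x):
    """Predicted label sgn(w . x + b) for one expression vector x."""
    return sgn(dot(w, x) + b)

# Hypothetical hyperplane in a two-experiment space.
w = [1.0, -1.0]
b = 0.5
label_a = predict(w, b, [2.0, 0.0])   # score  2.5
label_b = predict(w, b, [0.0, 2.0])   # score -1.5
```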

14 Support Vector Machines
There may be many such hyperplanes. Which one should we choose?
[Figure: several candidate separating hyperplanes; axes: Experiment 1, Experiment 2]

15 Maximizing the Margin
Key SVM idea: pick the hyperplane that maximizes the margin, i.e. the distance to the hyperplane from the closest point.
Motivation: obtain the tightest possible bounds on the error rate of the classifier.
[Figure: maximum-margin hyperplane; axes: Experiment 1, Experiment 2]

16 SVM: Finding the Hyperplane
Can be formulated as an optimization task:
Minimize $\sum_{i=1}^{n} w_i^2$
Subject to $\forall i:\ y_i [w \cdot x_i + b] \ge 1$

17 Learning Algorithm for Separable Problems
Vapnik & Lerner ’63; Vapnik & Chervonenkis ’64
Class of hyperplanes: $w \cdot x + b = 0$, $w \in \mathbb{R}^n$, $b \in \mathbb{R}$
Decision function: $f(x) = \mathrm{sgn}(w \cdot x + b)$
Construct $f$ from empirical data (“Generalized Portrait”).
Among all hyperplanes separating the data, there exists a unique one yielding the maximum margin of separation between the classes:
$\max_{w,b} \min \{ \|x - x_i\|_2 : x \in \mathbb{R}^n,\ w \cdot x + b = 0,\ i = 1, \ldots, m \}$

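[Figure: the hyperplane $\{x \mid w \cdot x + b = +1\}$ through the point $x_1$, with the weight vector $w$ normal to it and a point $x_2$ on the opposite edge of the margin]
$w \cdot (x_1 - x_2) = 2 \;\Rightarrow\; (x_1 - x_2) \cdot \hat{w} = 2 / \|w\|_2$, where $\hat{w} = w / \|w\|_2$ is the unit vector along $w$.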

19 Construction of Optimal Hyperplane
Margin $= (x_1 - x_2) \cdot \hat{w} = 2 / \|w\|_2$, where $\hat{w} = w / \|w\|_2$ is the unit vector along $w$.
Optimization problem: maximize the margin ($= 2 / \|w\|_2$) with a hyperplane separating the classes:
Minimize $\sum_{i=1}^{n} w_i^2$
Subject to $y_i (w \cdot x_i + b) \ge 1$ for all $i \in \{1, \ldots, m\}$.
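To make the link between the margin and the norm of w concrete, here is a minimal sketch on a hypothetical two-point data set. It checks the separation constraints for two candidate weight vectors and compares their margins. All names and numbers are illustrative only.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def margin(w):
    """Margin 2 / ||w|| of the hyperplane w . x + b = 0."""
    return 2.0 / math.sqrt(dot(w, w))

def satisfies_constraints(w, b, xs, ys):
    """True if y_i (w . x_i + b) >= 1 holds for every training example."""
    return all(y * (dot(w, x) + b) >= 1.0 for x, y in zip(xs, ys))

# Hypothetical data: one positive and one negative example.
xs = [[1.0, 1.0], [-1.0, -1.0]]
ys = [1, -1]

w_small, w_large, b = [0.5, 0.5], [1.0, 1.0], 0.0
both_feasible = (satisfies_constraints(w_small, b, xs, ys)
                 and satisfies_constraints(w_large, b, xs, ys))
```

Both candidates separate the data, but the one with the smaller norm has the wider margin, which is why minimizing the sum of squared weights maximizes the margin.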

20 Optimization Problem
Using Lagrange multipliers $\alpha_i \ge 0$, $i \in \{1, \ldots, m\}$:
$L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{m} \alpha_i [\, y_i (x_i \cdot w + b) - 1 \,]$
Minimize the Lagrangian $L$ with respect to the primal variables $w$ and $b$.
Maximize the Lagrangian $L$ with respect to the dual variables $\alpha_i$.
The solution is a saddle point of $L$.

21 Intuition
If a constraint is violated, then $y_i (w \cdot x_i + b) - 1 < 0$:
  $L$ can be increased by increasing the corresponding $\alpha_i$.
  $w$ and $b$ have to change such that $L$ decreases.
  To prevent $-\alpha_i [\, y_i (w \cdot x_i + b) - 1 \,]$ from becoming arbitrarily large, the change in $w$ and $b$ will ensure that the constraint is eventually satisfied, assuming that the problem is separable.
Karush-Kuhn-Tucker complementarity condition: for all constraints that are not satisfied precisely as equalities, i.e. $y_i (w \cdot x_i + b) - 1 > 0$, the corresponding $\alpha_i = 0$.

22 Duality
At the saddle point, the derivatives with respect to the primal variables must vanish:
$\frac{\partial}{\partial b} L(w, b, \alpha) = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$
$\frac{\partial}{\partial w} L(w, b, \alpha) = 0 \;\Rightarrow\; w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0$, i.e. $w = \sum_{i=1}^{m} \alpha_i y_i x_i$
By the Karush-Kuhn-Tucker complementarity: $\alpha_i [\, y_i (x_i \cdot w + b) - 1 \,] = 0$ for all $i \in \{1, \ldots, m\}$.
Those patterns with $\alpha_i \ne 0$ are the “Support Vectors”.

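23 Lagrangian
$L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{m} \alpha_i [\, y_i (x_i \cdot w + b) - 1 \,]$
Substituting $w = \sum_{i=1}^{m} \alpha_i y_i x_i$ gives $\frac{1}{2} w^T w = \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ and $\sum_{i=1}^{m} \alpha_i y_i (x_i \cdot w) = \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$; the term in $b$ vanishes because $\sum_{i=1}^{m} \alpha_i y_i = 0$.
Maximize the dual: $W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$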

24 Wolfe Dual Optimization Problem
Maximize $W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
Subject to $\alpha_i \ge 0$, $i = 1, \ldots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$
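The dual objective is easy to evaluate directly. The sketch below does so on a hypothetical two-point problem, where the equality constraint forces both multipliers to be equal, so W reduces to 2a - 4a^2 and its maximum at a = 1/4 can be checked by hand. The function names and data are illustrative. This only evaluates W and is not a solver.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, xs, ys):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    m = len(alpha)
    quad = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(m) for j in range(m))
    return sum(alpha) - 0.5 * quad

def is_feasible(alpha, ys, tol=1e-12):
    """Check alpha_i >= 0 and sum_i alpha_i y_i = 0."""
    balanced = abs(sum(a * y for a, y in zip(alpha, ys))) <= tol
    return all(a >= 0 for a in alpha) and balanced

def weights_from_alpha(alpha, xs, ys):
    """Recover w = sum_i alpha_i y_i x_i from the dual variables."""
    return [sum(alpha[i] * ys[i] * xs[i][k] for i in range(len(xs)))
            for k in range(len(xs[0]))]

# Hypothetical data: one positive and one negative example.
xs = [[1.0, 1.0], [-1.0, -1.0]]
ys = [1, -1]
alpha_star = [0.25, 0.25]
w_star = weights_from_alpha(alpha_star, xs, ys)
```

Here both multipliers are nonzero, so both points are support vectors, and the recovered weight vector is (0.5, 0.5).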

25 Decision Function
The hyperplane decision function can be written as
$f(x) = \mathrm{sgn}\left[ \sum_{i=1}^{m} y_i \alpha_i (x \cdot x_i) + b \right]$
where $b$ is the solution to $\alpha_i [\, y_i (x_i \cdot w + b) - 1 \,] = 0$.

26 Dealing with Data Not Separable by a Hyperplane
Map the data into some other dot product space (called the “feature space”) $\mathcal{F}$ via a nonlinear map $\Phi : \mathbb{R}^N \to \mathcal{F}$.
Kernel function: $k(x, y) = \Phi(x) \cdot \Phi(y)$
Examples:
  Sigmoid kernel: $k(x, y) = \tanh(\kappa (x \cdot y) + \Theta)$, where $\kappa$ = gain and $\Theta$ = threshold
  Radial basis kernel: $k(x, y) = \exp\{ -\|x - y\|^2 / (2\sigma^2) \}$
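The two example kernels above translate directly into code. This is a minimal sketch with hypothetical function and parameter names (`kappa`, `theta`, `sigma` stand for the gain, threshold, and width), not a reference implementation.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid_kernel(x, y, kappa, theta):
    """k(x, y) = tanh(kappa * (x . y) + theta)."""
    return math.tanh(kappa * dot(x, y) + theta)

def rbf_kernel(x, y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical expression vectors.
u = [0.0, 0.0]
v = [1.0, 1.0]
k_same = rbf_kernel(v, v, sigma=1.0)   # identical points
k_far = rbf_kernel(u, v, sigma=1.0)    # squared distance 2
```

The radial basis kernel equals 1 for identical points and decays toward 0 as the points move apart, with `sigma` controlling how quickly.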

27 Dealing with Data Not Separable by a Hyperplane
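Find the maximum margin hyperplane in the feature space:
$y_P = \mathrm{sgn}[\, w \cdot \Phi(x) + b \,] = \mathrm{sgn}\left[ \sum_{i=1}^{m} y_i \alpha_i (\Phi(x) \cdot \Phi(x_i)) + b \right] = \mathrm{sgn}\left[ \sum_{i=1}^{m} y_i \alpha_i k(x, x_i) + b \right]$
Optimization problem:
Maximize: $W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
Subject to: $\alpha_i \ge 0$, $i = 1, \ldots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$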
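The kernel form of the decision function never needs the map into feature space explicitly, only kernel evaluations against the training points. A minimal sketch follows, assuming the multipliers and bias have already been found by some solver. The values used here are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    """Plain dot product, i.e. the linear kernel."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_predict(x, train_xs, train_ys, alphas, b, kernel):
    """sgn(sum_i y_i alpha_i k(x, x_i) + b); a zero score maps to +1 (assumption)."""
    score = sum(a * y * kernel(x, xi)
                for a, y, xi in zip(alphas, train_ys, train_xs)) + b
    return 1 if score >= 0 else -1

# Hypothetical training points, multipliers, and bias.
train_xs = [[1.0, 1.0], [-1.0, -1.0]]
train_ys = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

linear_label = kernel_predict([2.0, 3.0], train_xs, train_ys, alphas, b, dot)
rbf_label = kernel_predict([-2.0, -3.0], train_xs, train_ys, alphas, b, rbf_kernel)
```

Swapping the `kernel` argument changes the shape of the decision surface without touching the rest of the code, which is the practical appeal of the kernel formulation.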

28 Dealing with the Noise in Data
One can relax the requirement that the hyperplane strictly separates the data. A soft margin allows some misclassified training examples:
Introduce $m$ slack variables $\xi_i \ge 0$, $\forall i \in \{1, \ldots, m\}$
Minimize the objective function: $\tau(w, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i$ $(C > 0)$
Dual optimization problem:
Maximize: $W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
Subject to: $0 \le \alpha_i \le C$, $i = 1, \ldots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$
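The soft-margin objective can be evaluated for a fixed hyperplane as sketched below. The slide does not spell out the relaxed constraints, so this assumes the standard form y_i (w . x_i + b) >= 1 - xi_i, under which the smallest admissible slack is max(0, 1 - y_i (w . x_i + b)). The data, including one deliberately mislabeled point, are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, xs, ys):
    """Smallest slack per example, assuming y_i (w . x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]

def soft_margin_objective(w, b, xs, ys, C):
    """tau(w, xi) = 1/2 ||w||^2 + C * sum_i xi_i."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, xs, ys))

# Hypothetical data: the third point lies on the positive side but is labeled -1.
xs = [[1.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]
ys = [1, -1, -1]
w, b = [0.5, 0.5], 0.0

xi = slacks(w, b, xs, ys)
tau = soft_margin_objective(w, b, xs, ys, C=1.0)
```

Only the noisy point receives a nonzero slack. A larger `C` penalizes such violations more heavily, trading margin width for fewer training errors.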

29 SVM & Neural Networks
SVM
  Represents a linear or nonlinear separating surface
  Weights determined by an optimization method (optimizing margins)
Neural Network
  Represents a linear or nonlinear separating surface
  Weights determined by an optimization method (optimizing the sum of squared error, or a related objective function)

30 Experiments
3-fold cross validation
Create a separate model for each class
SVM with various kernel functions:
  Dot product raised to power $d = 1, 2, 3$: $k(x, y) = (x \cdot y)^d$
  Gaussian
Various other classification methods:
  Decision trees
  Parzen windows
  Fisher linear discriminant
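As a sketch of how a 3-fold split over the 2467 genes could be set up: the slide does not say how the folds were formed, so the shuffling, the seed, and the function name here are assumptions for illustration.

```python
import random

def k_fold_indices(n_items, k=3, seed=0):
    """Return k (train, test) index pairs; each item is in exactly one test fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)           # reproducible shuffle (assumption)
    folds = [indices[i::k] for i in range(k)]      # k nearly equal-sized folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# One split per fold over the 2467 genes in the data set.
splits = k_fold_indices(2467, k=3)
fold_sizes = [len(test) for _, test in splits]
```

Each model is trained on two folds and evaluated on the held-out third, so every gene is predicted exactly once by a model that never saw it during training.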

31 SVM Results
Class FP FN TP TN
Krebs cycle 8 9 2442
Respiration 6 24 2428
Ribosome 4 117 2337
Proteasome 3 7 28 2429
Histone 2 2456
Helix-turn-helix 1 16 2450
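To show how the four counts relate to the usual summary measures, the sketch below uses the Proteasome row, which lists a value for each of FP, FN, TP, and TN. The function name and the choice of measures are illustrative and are not taken from the slides.

```python
def summarize(fp, fn, tp, tn):
    """Derive total, accuracy, precision, and recall from confusion counts."""
    total = fp + fn + tp + tn
    return {
        "total": total,
        "accuracy": (tp + tn) / total,
        # Precision and recall are undefined when their denominators are zero.
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }

# Proteasome row: FP = 3, FN = 7, TP = 28, TN = 2429.
proteasome = summarize(fp=3, fn=7, tp=28, tn=2429)
```

The four counts sum to 2467, the number of genes in the data set. Because each class contains only a handful of genes, accuracy alone is dominated by the true negatives, which is why the false positive and false negative counts are reported separately.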

32 SVM Results
SVM had the highest accuracy for all classes (except the control).
Many of the false positives could be easily explained in terms of the underlying biology. E.g., YAL003W was repeatedly assigned to the ribosome class:
  It is not a ribosomal protein,
  but it is known to be required for proper functioning of the ribosome.

