Predicting protein function from heterogeneous data Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology
We can frame functional annotation as a classification task. Example question: is gene X a penicillin amidase? Many possible types of labels: biological process, molecular function, subcellular localization. Many possible inputs: gene or protein sequence, expression profile, protein-protein interactions, genetic associations. A classifier takes these inputs and outputs a yes/no answer.
Outline Bayesian networks Support vector machines Network diffusion / message passing
Annotation transfer The figure links a protein of known function to a protein of unknown function. Rule: If two proteins are linked with high confidence, and one protein's function is unknown, then transfer the annotation.
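A minimal sketch of this rule in Python. The edge list, the annotation dictionary, and the 0.9 confidence threshold are hypothetical choices for illustration, not part of the original method:

    def transfer_annotations(edges, annotations, min_confidence=0.9):
        # edges: list of (protein_a, protein_b, confidence) tuples
        # annotations: dict mapping protein -> function label (known proteins only)
        new_annotations = dict(annotations)
        for a, b, conf in edges:
            if conf < min_confidence:
                continue
            # Transfer the label across a high-confidence link if exactly one end is annotated.
            if a in new_annotations and b not in new_annotations:
                new_annotations[b] = new_annotations[a]
            elif b in new_annotations and a not in new_annotations:
                new_annotations[a] = new_annotations[b]
        return new_annotations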
Bayesian networks (Troyanskaya PNAS 2003)
An example Bayesian network: the classic burglary/alarm network, with nodes Burglary, Earthquake, Alarm, John calls, and Mary calls. P(B) = 0.001, P(E) = 0.002. P(A|B,E) = 0.95, P(A|B,¬E) = 0.94, P(A|¬B,E) = 0.29, P(A|¬B,¬E) = 0.001. P(J|A) = 0.90, P(J|¬A) = 0.05. P(M|A) = 0.70, P(M|¬A) = 0.01.
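As a worked example using these tables, the probability that John and Mary both call, the alarm sounds, and there is neither a burglary nor an earthquake factors along the edges of the network:

P(J, M, A, ¬B, ¬E) = P(J|A) · P(M|A) · P(A|¬B,¬E) · P(¬B) · P(¬E) = 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 6.3 × 10⁻⁴.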
Create one network per gene pair. For genes A and B, data types 1, 2, and 3 feed into a node representing the probability that genes A and B are functionally linked.
Bayesian Network
Conditional probability tables A pair of yeast proteins that have a physical association will have a positive affinity precipitation result 75% of the time and a negative result in the remaining 25%. Two proteins that do not physically interact in vivo will have a positive affinity precipitation result in 5% of the experiments, and a negative one in 95%.
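To see how such a table gets used, here is a back-of-the-envelope Bayes calculation; the 1% prior probability of a physical interaction is an assumption for illustration, not a number from the model:

P(interact | positive result) = P(pos | interact) · P(interact) / [P(pos | interact) · P(interact) + P(pos | no interact) · P(no interact)] = (0.75 × 0.01) / (0.75 × 0.01 + 0.05 × 0.99) ≈ 0.13.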
Inputs Protein-protein interaction data from GRID. Transcription factor binding sites data from SGD. Stress-response microarray data set.
ROC analysis Using Gene Ontology biological process annotation as the gold standard.
Pros and cons Bayesian network framework is rigorous. Exploits expert knowledge. Does not (yet) learn from data. Treats each gene pair independently.
The SVM is a hyperplane classifier Locate a plane that separates positive from negative examples. Focus on the examples closest to the boundary.
Four key concepts Separating hyperplane Maximum margin hyperplane Soft margin Kernel function (input space → feature space)
Input space Example: expression of gene1 and gene2 for two patients, patient1 = (-1.7, 2.1) and patient2 = (0.3, 0.5), plotted as points in the (gene1, gene2) plane.
Each subject may be thought of as a point in an m-dimensional space.
Separating hyperplane Construct a hyperplane separating ALL (acute lymphoblastic leukemia) from AML (acute myeloid leukemia) subjects.
Choosing a hyperplane For a given set of data, many possible separating hyperplanes exist.
Maximum margin hyperplane Choose the separating hyperplane that is farthest from any training example.
Support vectors The location of the hyperplane is specified via a weight associated with each training example. Examples near the hyperplane receive non-zero weights and are called support vectors.
Soft margin When no separating hyperplane exists, the SVM uses a soft margin hyperplane with minimal cost. A parameter C specifies the relative cost of a misclassification versus the size of the margin.
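For reference, the standard soft-margin formulation makes this trade-off explicit: minimize ½||w||² + C(ξ1 + … + ξn) subject to yi(w • xi + b) ≥ 1 − ξi and ξi ≥ 0 for every training example i, where the slack variable ξi measures how badly example i violates the margin and C sets the cost of those violations relative to the margin width.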
Incorrectly measured or labeled data can mean that the separating hyperplane does not generalize well, or that no separating hyperplane exists at all.
Soft margin
The kernel function “The introduction of SVMs was very good for the most part, but I got confused when you began to talk about kernels.” “I found the discussion of kernel functions to be slightly tough to follow.” “I understood most of the lecture. The part that was more challenging was the kernel functions.” “Still a little unclear on how the kernel is used in the SVM.”
Why kernels?
Separating previously unseparable data
Input space to feature space SVMs first map the data from the input space to a higher-dimensional feature space.
Kernel function as dot product
Consider two training examples A = (a1, a2) and B = (b1, b2). Define a mapping from input space to feature space: Φ(X) = (x1x1, x1x2, x2x1, x2x2). Let K(X,Y) = (X • Y)². Write Φ(A) • Φ(B) in terms of K.
Φ(A) • Φ(B) = (a1a1, a1a2, a2a1, a2a2) • (b1b1, b1b2, b2b1, b2b2)
= a1a1b1b1 + a1a2b1b2 + a2a1b2b1 + a2a2b2b2
= a1b1a1b1 + a1b1a2b2 + a2b2a1b1 + a2b2a2b2
= (a1b1 + a2b2)(a1b1 + a2b2)
= [(a1, a2) • (b1, b2)]²
= (A • B)² = K(A, B)
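A quick numeric check of this identity (a Python/NumPy sketch; the two vectors reuse the toy expression values from the input-space slide):

    import numpy as np

    def phi(x):
        # Explicit map to the 4-dimensional feature space (x1x1, x1x2, x2x1, x2x2).
        return np.array([x[0]*x[0], x[0]*x[1], x[1]*x[0], x[1]*x[1]])

    def K(x, y):
        # Kernel computed directly in the 2-dimensional input space.
        return np.dot(x, y) ** 2

    A = np.array([-1.7, 2.1])
    B = np.array([0.3, 0.5])
    print(np.dot(phi(A), phi(B)))  # explicit feature-space dot product
    print(K(A, B))                 # same value, without ever forming phi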
Separating in 2D with a 4D kernel
“Kernelizing” Euclidean distance
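In symbols: squared Euclidean distance in the feature space can be computed from kernel values alone, since ||Φ(A) − Φ(B)||² = Φ(A) • Φ(A) − 2 Φ(A) • Φ(B) + Φ(B) • Φ(B) = K(A, A) − 2 K(A, B) + K(B, B).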
Kernel function The kernel function plays the role of the dot product operation in the feature space. The mapping from input to feature space is implicit. Using a kernel function avoids representing the feature space vectors explicitly. Any continuous, positive semi-definite function can act as a kernel function. For why positive semidefiniteness is required, see the proof of Mercer's theorem in Cristianini and Shawe-Taylor, An Introduction to Support Vector Machines, 2000, pp. 33-35.
Overfitting with a Gaussian kernel
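For reference, the Gaussian (RBF) kernel is K(X, Y) = exp(−||X − Y||² / (2σ²)); with a very small width σ, each training example gets its own narrow bump, so the decision surface can wrap around every point and overfit.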
The SVM learning problem Input: training vectors x1, …, xn and labels y1, …, yn. Output: a bias b plus one weight wi per training example. The weights specify the location of the separating hyperplane. The optimization problem is convex and quadratic, and it can be solved using standard packages such as MATLAB.
SVM prediction architecture A query x is compared to each training example x1, x2, x3, …, xn via the kernel function k; the resulting values are weighted by w1, w2, w3, …, wn and combined.
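In symbols, this architecture computes f(x) = w1 k(x1, x) + w2 k(x2, x) + … + wn k(xn, x) + b, and the predicted class is the sign of f(x). Only the support vectors contribute, because every other weight is zero.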
Learning gene classes Training set: 2465 genes × 79 experiments (Eisen et al.), with class labels from MYGD, given to the learner to build a model. Test set: 3500 genes × 79 experiments (Eisen et al.), given to the predictor, which assigns a class to each gene.
Class prediction (FP, FN, TP, TN per class) TCA: 4, 9, 8, 2446. Respiration chain complexes: 6, 22, 2431. Ribosome: 7, 3, 118, 2339. Proteasome: 27, 2429. Histone: 2, 2456. Helix-turn-helix: 16, 2451.
SVM outperforms other methods
Predictions of gene function Fleischer et al. “Systematic identification and functional screens of uncharacterized proteins associated with eukaryotic ribosomal complexes” Genes Dev, 2006.
Overview 218 human tumor samples spanning 14 common tumor types; 90 normal samples; 16,063 “genes” measured per sample. Overall SVM classification accuracy: 78%. Random classification accuracy: 1/14 ≈ 7%.
Summary: Support vector machine learning The SVM learning algorithm finds a linear decision boundary. The hyperplane maximizes the margin, i.e., the distance to the nearest training example. The optimization is convex; the solution is sparse. A soft margin allows for noise in the training set. A complex decision surface can be learned by using a non-linear kernel function.
Cost/Benefits of SVMs SVMs perform well in high-dimensional data sets with few examples. Convex optimization implies that you get the same answer every time. Kernel functions allow encoding of prior knowledge. Kernel functions handle arbitrary data types. The hyperplane does not provide a good explanation, especially with a non-linear kernel function.
Vector representation Each matrix entry is an mRNA expression measurement. Each column is an experiment. Each row corresponds to a gene.
Similarity measurement Normalized scalar product: similar vectors receive high values; dissimilar vectors receive low values.
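A sketch of this similarity in Python with NumPy (expression profiles as rows of a matrix; the data here are random placeholders):

    import numpy as np

    def normalized_scalar_product(x, y):
        # Scalar product divided by the vector lengths (cosine-style similarity).
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    rng = np.random.default_rng(0)
    expression = rng.normal(size=(5, 79))  # 5 genes x 79 experiments
    print(normalized_scalar_product(expression[0], expression[1]))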
Kernel matrix
Sequence kernels >ICYA_MANSE GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH >LACB_BOVIN MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE KFDKALKALPMHIRLSFNPTQLEEQCHI We cannot compute a scalar product on a pair of variable-length, discrete strings.
Pairwise comparison kernel
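One common way to build such a kernel (a sketch, not necessarily the exact construction shown in the figure): represent each protein by its vector of alignment scores against a fixed collection of sequences, then take scalar products of those vectors. Here sw_score is a hypothetical stand-in for a Smith-Waterman or BLAST comparison routine:

    import numpy as np

    def pairwise_comparison_kernel(sequences, reference_sequences, sw_score):
        # Each sequence becomes a vector of comparison scores against the references.
        features = np.array([[sw_score(s, r) for r in reference_sequences]
                             for s in sequences])
        # The kernel matrix is then just the scalar products of those vectors.
        return features @ features.T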
Protein-protein interactions Pairwise interactions can be represented as a graph or as a binary matrix, with one row and one column per protein and a 1 wherever two proteins interact.
Linear interaction kernel The simplest kernel counts the number of interaction partners shared by each pair of proteins, i.e., the scalar product of the corresponding rows of the interaction matrix.
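In matrix terms (a NumPy sketch; the small interaction matrix is an arbitrary example), the linear interaction kernel is the product of the interaction matrix with its transpose, so entry (i, j) counts the partners shared by proteins i and j:

    import numpy as np

    interactions = np.array([[1, 0, 0, 1, 0, 1, 0],
                             [1, 1, 0, 1, 0, 1, 0],
                             [0, 0, 1, 0, 1, 0, 0]])  # rows: proteins, columns: potential partners
    K = interactions @ interactions.T
    print(K[0, 1])  # number of interaction partners shared by proteins 0 and 1 (here, 3)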
Diffusion kernel A general method for establishing similarities between nodes of a graph. Based upon a random walk. Efficiently accounts for all paths connecting two nodes, weighted by path lengths.
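A minimal sketch of a diffusion kernel on a graph, following the usual matrix-exponential construction; the adjacency matrix and the diffusion parameter beta below are arbitrary choices for illustration:

    import numpy as np
    from scipy.linalg import expm

    def diffusion_kernel(adjacency, beta=1.0):
        # H = A - D is the negative graph Laplacian; K = exp(beta * H) sums over
        # all paths between each pair of nodes, down-weighting long paths.
        degree = np.diag(adjacency.sum(axis=1))
        H = adjacency - degree
        return expm(beta * H)

    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 1],
                  [0, 1, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
    K = diffusion_kernel(A, beta=0.5)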
Hydrophobicity profile The figure contrasts a membrane protein with a non-membrane protein. Transmembrane regions are typically hydrophobic, and vice versa. The hydrophobicity profile of a membrane protein is evolutionarily conserved.
Hydrophobicity kernel Generate a hydropathy profile from the amino acid sequence using the Kyte-Doolittle index. Prefilter the profiles. Compare two profiles by computing the fast Fourier transform (FFT) and then applying a Gaussian kernel function. This kernel detects periodicities in the hydrophobicity profile, which are known to be useful in identifying membrane proteins.
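A rough Python sketch of this pipeline, with the prefiltering step omitted; kd_index stands in for the Kyte-Doolittle values (not listed here), and the width sigma and the use of magnitude spectra are illustrative choices:

    import numpy as np

    def hydrophobicity_kernel(seq_a, seq_b, kd_index, sigma=10.0, n_freq=64):
        # Map each amino acid sequence to its hydropathy profile.
        profile_a = np.array([kd_index[aa] for aa in seq_a])
        profile_b = np.array([kd_index[aa] for aa in seq_b])
        # Compare the periodic content of the two profiles via the FFT magnitude spectrum.
        spec_a = np.abs(np.fft.rfft(profile_a, n=n_freq))
        spec_b = np.abs(np.fft.rfft(profile_b, n=n_freq))
        # Gaussian kernel on the spectra.
        return np.exp(-np.sum((spec_a - spec_b) ** 2) / (2 * sigma ** 2))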
Combining kernels Concatenating two data sets A and B and computing the kernel on the concatenation is identical to adding the individual kernels: K(A:B) = K(A) + K(B).
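For linear kernels this identity is easy to verify numerically (a sketch; the matrices are random placeholders):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(6, 10))   # data set A: 6 examples x 10 features
    B = rng.normal(size=(6, 4))    # data set B: same 6 examples, 4 other features
    K_A, K_B = A @ A.T, B @ B.T
    K_concat = np.hstack([A, B]) @ np.hstack([A, B]).T
    print(np.allclose(K_concat, K_A + K_B))  # True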
Semidefinite programming Define a convex cost function to assess the quality of a kernel matrix. Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.
Semidefinite programming Learn K, according to a convex quality measure, from the convex cone of positive semidefinite matrices or from a convex subset of it. Here the convex subset is the linear span of the constructed kernels: integrate the constructed kernels by learning a linear mix, coupled with a large-margin classifier (SVM) that maximizes the margin.
Learn a linear mix Integrate the constructed kernels by learning a linear combination K = μ1K1 + … + μpKp; the kernel weights and the large-margin classifier (SVM) are chosen together to maximize the margin.
Markov Random Field General Bayesian method, applied by Deng et al. to yeast functional classification. Used five different types of data. For their model, the input data must be binary. Reported improved accuracy compared to using any single data type.
Yeast functional classes (category: size) Metabolism: 1048. Energy: 242. Cell cycle & DNA processing: 600. Transcription: 753. Protein synthesis: 335. Protein fate: 578. Cellular transport: 479. Cell rescue, defense: 264. Interaction with environment: 193. Cell fate: 411. Cellular organization: 192. Transport facilitation: 306. Other classes: 81.
Six types of data Presence of Pfam domains. Genetic interactions from CYGD. Physical interactions from CYGD. Protein-protein interaction by TAP. mRNA expression profiles. (Smith-Waterman scores).
Results The figure compares MRF, SDP/SVM (binary), and SDP/SVM (enriched).
Pros and cons Learns relevance of data sets with respect to the problem at hand. Accounts for redundancy among data sets, as well as noise and relevance. Discriminative approach yields good performance. Kernel-by-kernel weighting is simplistic. In most cases, unweighted kernel combination works fine. Does not provide a good explanation.
Network diffusion GeneMANIA
A rose by any other name … Network diffusion Random walk with restart Personalized PageRank Diffusion kernel Gaussian random field GeneMANIA
Top performing methods
GeneMANIA Normalize each network: divide each element by the square root of the product of its row sum and column sum. Learn a weight for each network via ridge regression; essentially, learn how informative the network is with respect to the task at hand. Sum the weighted networks. Assign labels to the nodes, using the mean label (n+ − n−)/n for unlabeled genes. Perform label propagation in the combined network. Mostafavi et al. Genome Biology 9:S4, 2008.
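A compact sketch of these steps in NumPy. The propagation below is the generic iterative label-propagation update rather than the exact GeneMANIA solver, and the network weights are assumed to have been learned already:

    import numpy as np

    def normalize(W):
        # Divide each entry by the square root of the product of its row sum and column sum.
        r = np.maximum(W.sum(axis=1), 1e-12)
        c = np.maximum(W.sum(axis=0), 1e-12)
        return W / np.sqrt(np.outer(r, c))

    def propagate(W, y, alpha=0.9, n_iter=100):
        # y holds +1/-1 for labeled genes and the mean label for unlabeled genes.
        f = y.copy()
        for _ in range(n_iter):
            f = alpha * W @ f + (1 - alpha) * y
        return f

    def genemania_scores(networks, weights, y):
        combined = sum(w * normalize(W) for w, W in zip(weights, networks))
        return propagate(combined, y)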
Random walk with restart The walk restarts from the positive examples.
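A sketch of the walk itself in NumPy; the restart probability and the column normalization are standard choices for illustration, not values taken from the slides, and every node is assumed to have at least one edge:

    import numpy as np

    def random_walk_with_restart(A, restart_vector, r=0.3, n_iter=100):
        A = np.asarray(A, dtype=float)
        # Column-normalize the adjacency matrix so each column sums to 1.
        P = A / A.sum(axis=0, keepdims=True)
        p0 = restart_vector / restart_vector.sum()
        p = p0.copy()
        for _ in range(n_iter):
            # With probability (1 - r) follow an edge; with probability r jump back
            # to one of the positive examples.
            p = (1 - r) * P @ p + r * p0
        return p  # steady-state visit frequencies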
Final node scores Size indicates frequency of visit
Label propagation is random walk with restart, except: you restart less often from nodes with many neighbours (i.e., the restart probability of a node is inversely related to its degree), and nodes with many neighbours have their final node scores scaled up.
Label propagation vs SVM Performance averaged across 992 yeast Gene Ontology Biological Process categories.