Predicting protein function from heterogeneous data Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology
We can frame functional annotation as a classification task. Example question: is gene X a penicillin amidase? Many possible types of labels: biological process, molecular function, subcellular localization. Many possible inputs: gene or protein sequence, expression profile, protein-protein interactions, genetic associations. A classifier takes these inputs and outputs a yes/no answer.
Outline Bayesian networks Support vector machines Network diffusion / message passing
Annotation transfer The figure links a protein of known function to a protein of unknown function. Rule: If two proteins are linked with high confidence, and one protein's function is unknown, then transfer the annotation.
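A minimal sketch of this rule in Python. The edge list, the annotation dictionary, and the 0.9 confidence threshold are hypothetical choices for illustration, not part of the original method:

    def transfer_annotations(edges, annotations, min_confidence=0.9):
        # edges: list of (protein_a, protein_b, confidence) tuples
        # annotations: dict mapping protein -> function label (known proteins only)
        new_annotations = dict(annotations)
        for a, b, conf in edges:
            if conf < min_confidence:
                continue
            # Transfer the label across a high-confidence link if exactly one end is annotated.
            if a in new_annotations and b not in new_annotations:
                new_annotations[b] = new_annotations[a]
            elif b in new_annotations and a not in new_annotations:
                new_annotations[a] = new_annotations[b]
        return new_annotations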
Bayesian networks (Troyanskaya PNAS 2003)
An example Bayesian network: the classic burglary/alarm network, with nodes Burglary, Earthquake, Alarm, John calls, and Mary calls. P(B) = 0.001, P(E) = 0.002. P(A|B,E) = 0.95, P(A|B,¬E) = 0.94, P(A|¬B,E) = 0.29, P(A|¬B,¬E) = 0.001. P(J|A) = 0.90, P(J|¬A) = 0.05. P(M|A) = 0.70, P(M|¬A) = 0.01.
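As a worked example using these tables, the probability that John and Mary both call, the alarm sounds, and there is neither a burglary nor an earthquake factors along the edges of the network:

P(J, M, A, ¬B, ¬E) = P(J|A) · P(M|A) · P(A|¬B,¬E) · P(¬B) · P(¬E) = 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 6.3 × 10⁻⁴.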
Create one network per gene pair. For genes A and B, data types 1, 2, and 3 feed into a node representing the probability that genes A and B are functionally linked.
Bayesian Network
Conditional probability tables A pair of yeast proteins that have a physical association will have a positive affinity precipitation result 75% of the time and a negative result in the remaining 25%. Two proteins that do not physically interact in vivo will have a positive affinity precipitation result in 5% of the experiments, and a negative one in 95%.
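To see how such a table gets used, here is a back-of-the-envelope Bayes calculation; the 1% prior probability of a physical interaction is an assumption for illustration, not a number from the model:

P(interact | positive result) = P(pos | interact) · P(interact) / [P(pos | interact) · P(interact) + P(pos | no interact) · P(no interact)] = (0.75 × 0.01) / (0.75 × 0.01 + 0.05 × 0.99) ≈ 0.13.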
Inputs Protein-protein interaction data from GRID. Transcription factor binding sites data from SGD. Stress-response microarray data set.
ROC analysis Using Gene Ontology biological process annotation as the gold standard.
Pros and cons Bayesian network framework is rigorous. Exploits expert knowledge. Does not (yet) learn from data. Treats each gene pair independently.
The SVM is a hyperplane classifier Locate a plane that separates positive from negative examples. Focus on the examples closest to the boundary.
Four key concepts Separating hyperplane Maximum margin hyperplane Soft margin Kernel function (input space → feature space)
Input space Example: expression of gene1 and gene2 for two patients, patient1 = (-1.7, 2.1) and patient2 = (0.3, 0.5), plotted as points in the (gene1, gene2) plane.
Each subject may be thought of as a point in an m-dimensional space.
Separating hyperplane Construct a hyperplane separating ALL (acute lymphoblastic leukemia) from AML (acute myeloid leukemia) subjects.
Choosing a hyperplane For a given set of data, many possible separating hyperplanes exist.
Maximum margin hyperplane Choose the separating hyperplane that is farthest from any training example.
Support vectors The location of the hyperplane is specified via a weight associated with each training example. Examples near the hyperplane receive non-zero weights and are called support vectors.
Soft margin When no separating hyperplane exists, the SVM uses a soft margin hyperplane with minimal cost. A parameter C specifies the relative cost of a misclassification versus the size of the margin.
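For reference, the standard soft-margin formulation makes this trade-off explicit: minimize ½||w||² + C(ξ1 + … + ξn) subject to yi(w • xi + b) ≥ 1 − ξi and ξi ≥ 0 for every training example i, where the slack variable ξi measures how badly example i violates the margin and C sets the cost of those violations relative to the margin width.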
Incorrectly measured or labeled data can mean that the separating hyperplane does not generalize well, or that no separating hyperplane exists at all.
Soft margin
The kernel function “The introduction of SVMs was very good for the most part, but I got confused when you began to talk about kernels.” “I found the discussion of kernel functions to be slightly tough to follow.” “I understood most of the lecture. The part that was more challenging was the kernel functions.” “Still a little unclear on how the kernel is used in the SVM.”
Why kernels?
Separating previously unseparable data
Input space to feature space SVMs first map the data from the input space to a higher-dimensional feature space.
Kernel function as dot product
Consider two training examples A = (a1, a2) and B = (b1, b2). Define a mapping from input space to feature space: Φ(X) = (x1x1, x1x2, x2x1, x2x2). Let K(X,Y) = (X • Y)². Write Φ(A) • Φ(B) in terms of K.
Φ(A) • Φ(B) = (a1a1, a1a2, a2a1, a2a2) • (b1b1, b1b2, b2b1, b2b2)
= a1a1b1b1 + a1a2b1b2 + a2a1b2b1 + a2a2b2b2
= a1b1a1b1 + a1b1a2b2 + a2b2a1b1 + a2b2a2b2
= (a1b1 + a2b2)(a1b1 + a2b2)
= [(a1, a2) • (b1, b2)]²
= (A • B)² = K(A, B)
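A quick numeric check of this identity (a Python/NumPy sketch; the two vectors reuse the toy expression values from the input-space slide):

    import numpy as np

    def phi(x):
        # Explicit map to the 4-dimensional feature space (x1x1, x1x2, x2x1, x2x2).
        return np.array([x[0]*x[0], x[0]*x[1], x[1]*x[0], x[1]*x[1]])

    def K(x, y):
        # Kernel computed directly in the 2-dimensional input space.
        return np.dot(x, y) ** 2

    A = np.array([-1.7, 2.1])
    B = np.array([0.3, 0.5])
    print(np.dot(phi(A), phi(B)))  # explicit feature-space dot product
    print(K(A, B))                 # same value, without ever forming phi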
Separating in 2D with a 4D kernel
“Kernelizing” Euclidean distance
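In symbols: squared Euclidean distance in the feature space can be computed from kernel values alone, since ||Φ(A) − Φ(B)||² = Φ(A) • Φ(A) − 2 Φ(A) • Φ(B) + Φ(B) • Φ(B) = K(A, A) − 2 K(A, B) + K(B, B).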
Kernel function The kernel function plays the role of the dot product operation in the feature space. The mapping from input to feature space is implicit. Using a kernel function avoids representing the feature space vectors explicitly. Any continuous, positive semi-definite function can act as a kernel function. For why positive semidefiniteness is required, see the proof of Mercer's theorem in Cristianini and Shawe-Taylor, An Introduction to Support Vector Machines, 2000, pp. 33-35.
Overfitting with a Gaussian kernel
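For reference, the Gaussian (RBF) kernel is K(X, Y) = exp(−||X − Y||² / (2σ²)); with a very small width σ, each training example gets its own narrow bump, so the decision surface can wrap around every point and overfit.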
The SVM learning problem Input: training vectors x1, …, xn and labels y1, …, yn. Output: a bias b plus one weight wi per training example. The weights specify the location of the separating hyperplane. The optimization problem is convex and quadratic, and it can be solved using standard packages such as MATLAB.
SVM prediction architecture A query x is compared to each training example x1, x2, x3, …, xn via the kernel function k; the resulting values are weighted by w1, w2, w3, …, wn and combined.
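In symbols, this architecture computes f(x) = w1 k(x1, x) + w2 k(x2, x) + … + wn k(xn, x) + b, and the predicted class is the sign of f(x). Only the support vectors contribute, because every other weight is zero.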
Learning gene classes Training set: 2465 genes × 79 experiments (Eisen et al.), with class labels from MYGD, given to the learner to build a model. Test set: 3500 genes × 79 experiments (Eisen et al.), given to the predictor, which assigns a class to each gene.
Class prediction (FP, FN, TP, TN per class) TCA: 4, 9, 8, 2446. Respiration chain complexes: 6, 22, 2431. Ribosome: 7, 3, 118, 2339. Proteasome: 27, 2429. Histone: 2, 2456. Helix-turn-helix: 16, 2451.
SVM outperforms other methods
Predictions of gene function Fleischer et al. “Systematic identification and functional screens of uncharacterized proteins associated with eukaryotic ribosomal complexes” Genes Dev, 2006.
Overview 218 human tumor samples spanning 14 common tumor types; 90 normal samples; 16,063 “genes” measured per sample. Overall SVM classification accuracy: 78%. Random classification accuracy: 1/14 ≈ 7%.
Summary: Support vector machine learning The SVM learning algorithm finds a linear decision boundary. The hyperplane maximizes the margin, i.e., the distance to the nearest training example. The optimization is convex; the solution is sparse. A soft margin allows for noise in the training set. A complex decision surface can be learned by using a non-linear kernel function.
Cost/Benefits of SVMs SVMs perform well in high-dimensional data sets with few examples. Convex optimization implies that you get the same answer every time. Kernel functions allow encoding of prior knowledge. Kernel functions handle arbitrary data types. The hyperplane does not provide a good explanation, especially with a non-linear kernel function.
Vector representation Each matrix entry is an mRNA expression measurement. Each column is an experiment. Each row corresponds to a gene.
Similarity measurement Normalized scalar product: similar vectors receive high values; dissimilar vectors receive low values.
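A sketch of this similarity in Python with NumPy (expression profiles as rows of a matrix; the data here are random placeholders):

    import numpy as np

    def normalized_scalar_product(x, y):
        # Scalar product divided by the vector lengths (cosine-style similarity).
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    rng = np.random.default_rng(0)
    expression = rng.normal(size=(5, 79))  # 5 genes x 79 experiments
    print(normalized_scalar_product(expression[0], expression[1]))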
Kernel matrix
Sequence kernels >ICYA_MANSE GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH >LACB_BOVIN MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE KFDKALKALPMHIRLSFNPTQLEEQCHI We cannot compute a scalar product on a pair of variable-length, discrete strings.
Pairwise comparison kernel
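One common way to build such a kernel (a sketch, not necessarily the exact construction shown in the figure): represent each protein by its vector of alignment scores against a fixed collection of sequences, then take scalar products of those vectors. Here sw_score is a hypothetical stand-in for a Smith-Waterman or BLAST comparison routine:

    import numpy as np

    def pairwise_comparison_kernel(sequences, reference_sequences, sw_score):
        # Each sequence becomes a vector of comparison scores against the references.
        features = np.array([[sw_score(s, r) for r in reference_sequences]
                             for s in sequences])
        # The kernel matrix is then just the scalar products of those vectors.
        return features @ features.T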
Protein-protein interactions Pairwise interactions can be represented as a graph or as a binary matrix, with one row and one column per protein and a 1 wherever two proteins interact.
Linear interaction kernel The simplest kernel counts the number of interaction partners shared by each pair of proteins, i.e., the scalar product of the corresponding rows of the interaction matrix.
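In matrix terms (a NumPy sketch; the small interaction matrix is an arbitrary example), the linear interaction kernel is the product of the interaction matrix with its transpose, so entry (i, j) counts the partners shared by proteins i and j:

    import numpy as np

    interactions = np.array([[1, 0, 0, 1, 0, 1, 0],
                             [1, 1, 0, 1, 0, 1, 0],
                             [0, 0, 1, 0, 1, 0, 0]])  # rows: proteins, columns: potential partners
    K = interactions @ interactions.T
    print(K[0, 1])  # number of interaction partners shared by proteins 0 and 1 (here, 3)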
Diffusion kernel A general method for establishing similarities between nodes of a graph. Based upon a random walk. Efficiently accounts for all paths connecting two nodes, weighted by path lengths.
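A minimal sketch of a diffusion kernel on a graph, following the usual matrix-exponential construction; the adjacency matrix and the diffusion parameter beta below are arbitrary choices for illustration:

    import numpy as np
    from scipy.linalg import expm

    def diffusion_kernel(adjacency, beta=1.0):
        # H = A - D is the negative graph Laplacian; K = exp(beta * H) sums over
        # all paths between each pair of nodes, down-weighting long paths.
        degree = np.diag(adjacency.sum(axis=1))
        H = adjacency - degree
        return expm(beta * H)

    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 1],
                  [0, 1, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
    K = diffusion_kernel(A, beta=0.5)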
Hydrophobicity profile The figure contrasts a membrane protein with a non-membrane protein. Transmembrane regions are typically hydrophobic, and vice versa. The hydrophobicity profile of a membrane protein is evolutionarily conserved.
Hydrophobicity kernel Generate a hydropathy profile from the amino acid sequence using the Kyte-Doolittle index. Prefilter the profiles. Compare two profiles by computing the fast Fourier transform (FFT) and then applying a Gaussian kernel function. This kernel detects periodicities in the hydrophobicity profile, which are known to be useful in identifying membrane proteins.
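A rough Python sketch of this pipeline, with the prefiltering step omitted; kd_index stands in for the Kyte-Doolittle values (not listed here), and the width sigma and the use of magnitude spectra are illustrative choices:

    import numpy as np

    def hydrophobicity_kernel(seq_a, seq_b, kd_index, sigma=10.0, n_freq=64):
        # Map each amino acid sequence to its hydropathy profile.
        profile_a = np.array([kd_index[aa] for aa in seq_a])
        profile_b = np.array([kd_index[aa] for aa in seq_b])
        # Compare the periodic content of the two profiles via the FFT magnitude spectrum.
        spec_a = np.abs(np.fft.rfft(profile_a, n=n_freq))
        spec_b = np.abs(np.fft.rfft(profile_b, n=n_freq))
        # Gaussian kernel on the spectra.
        return np.exp(-np.sum((spec_a - spec_b) ** 2) / (2 * sigma ** 2))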
Combining kernels Concatenating two data sets A and B and computing the kernel on the concatenation is identical to adding the individual kernels: K(A:B) = K(A) + K(B).
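For linear kernels this identity is easy to verify numerically (a sketch; the matrices are random placeholders):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(6, 10))   # data set A: 6 examples x 10 features
    B = rng.normal(size=(6, 4))    # data set B: same 6 examples, 4 other features
    K_A, K_B = A @ A.T, B @ B.T
    K_concat = np.hstack([A, B]) @ np.hstack([A, B]).T
    print(np.allclose(K_concat, K_A + K_B))  # True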
Semidefinite programming Define a convex cost function to assess the quality of a kernel matrix. Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.
Semidefinite programming Learn K, according to a convex quality measure, from the convex cone of positive semidefinite matrices or from a convex subset of it. Here the convex subset is the linear span of the constructed kernels: integrate the constructed kernels by learning a linear mix, coupled with a large-margin classifier (SVM) that maximizes the margin.
Learn a linear mix Integrate the constructed kernels by learning a linear combination K = μ1K1 + … + μpKp; the kernel weights and the large-margin classifier (SVM) are chosen together to maximize the margin.
Markov Random Field General Bayesian method, applied by Deng et al. to yeast functional classification. Used five different types of data. For their model, the input data must be binary. Reported improved accuracy compared to using any single data type.
Yeast functional classes (category: size) Metabolism: 1048. Energy: 242. Cell cycle & DNA processing: 600. Transcription: 753. Protein synthesis: 335. Protein fate: 578. Cellular transport: 479. Cell rescue, defense: 264. Interaction with environment: 193. Cell fate: 411. Cellular organization: 192. Transport facilitation: 306. Other classes: 81.
Six types of data Presence of Pfam domains. Genetic interactions from CYGD. Physical interactions from CYGD. Protein-protein interaction by TAP. mRNA expression profiles. (Smith-Waterman scores).
Results The figure compares MRF, SDP/SVM (binary), and SDP/SVM (enriched).
Pros and cons Learns relevance of data sets with respect to the problem at hand. Accounts for redundancy among data sets, as well as noise and relevance. Discriminative approach yields good performance. Kernel-by-kernel weighting is simplistic. In most cases, unweighted kernel combination works fine. Does not provide a good explanation.
Network diffusion GeneMANIA
A rose by any other name … Network diffusion Random walk with restart Personalized PageRank Diffusion kernel Gaussian random field GeneMANIA
Top performing methods
GeneMANIA Normalize each network: divide each element by the square root of the product of its row sum and column sum. Learn a weight for each network via ridge regression; essentially, learn how informative the network is with respect to the task at hand. Sum the weighted networks. Assign labels to the nodes, using the mean label (n+ − n−)/n for unlabeled genes. Perform label propagation in the combined network. Mostafavi et al. Genome Biology 9:S4, 2008.
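A compact sketch of these steps in NumPy. The propagation below is the generic iterative label-propagation update rather than the exact GeneMANIA solver, and the network weights are assumed to have been learned already:

    import numpy as np

    def normalize(W):
        # Divide each entry by the square root of the product of its row sum and column sum.
        r = np.maximum(W.sum(axis=1), 1e-12)
        c = np.maximum(W.sum(axis=0), 1e-12)
        return W / np.sqrt(np.outer(r, c))

    def propagate(W, y, alpha=0.9, n_iter=100):
        # y holds +1/-1 for labeled genes and the mean label for unlabeled genes.
        f = y.copy()
        for _ in range(n_iter):
            f = alpha * W @ f + (1 - alpha) * y
        return f

    def genemania_scores(networks, weights, y):
        combined = sum(w * normalize(W) for w, W in zip(weights, networks))
        return propagate(combined, y)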
Random walk with restart The walk restarts from the positive examples.
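A sketch of the walk itself in NumPy; the restart probability and the column normalization are standard choices for illustration, not values taken from the slides, and every node is assumed to have at least one edge:

    import numpy as np

    def random_walk_with_restart(A, restart_vector, r=0.3, n_iter=100):
        A = np.asarray(A, dtype=float)
        # Column-normalize the adjacency matrix so each column sums to 1.
        P = A / A.sum(axis=0, keepdims=True)
        p0 = restart_vector / restart_vector.sum()
        p = p0.copy()
        for _ in range(n_iter):
            # With probability (1 - r) follow an edge; with probability r jump back
            # to one of the positive examples.
            p = (1 - r) * P @ p + r * p0
        return p  # steady-state visit frequencies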
Final node scores Size indicates frequency of visit
Label propagation is random walk with restart, except: you restart less often from nodes with many neighbours (i.e., the restart probability of a node is inversely related to its degree), and nodes with many neighbours have their final node scores scaled up.
Label propagation vs SVM Performance averaged across 992 yeast Gene Ontology Biological Process categories.