A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University.

Slides:

Advertisements

Similar presentations

Applications of one-class classification

Advertisements

(SubLoc) Support vector machine approach for protein subcelluar localization prediction (SubLoc) Kim Hye Jin Intelligent Multimedia Lab

SVM - Support Vector Machines A new classification method for both linear and nonlinear data It uses a nonlinear mapping to transform the original training.

An Introduction of Support Vector Machine

Carolina Galleguillos, Brian McFee, Serge Belongie, Gert Lanckriet Computer Science and Engineering Department Electrical and Computer Engineering Department.

N.U.S. - January 13, 2006 Gert Lanckriet U.C. San Diego Classification problems with heterogeneous information sources.

Consistent probabilistic outputs for protein function prediction William Stafford Noble Department of Genome Sciences Department of Computer Science and.

Robust Multi-Kernel Classification of Uncertain and Imbalanced Data

Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Research supported in part by grants from the National.

Support Vector Machine

Support Vector Machines and Kernel Methods

Mismatch string kernels for discriminative protein classification By Leslie. et.al Presented by Yan Wang.

Heuristic alignment algorithms and cost matrices

Machine Learning for Protein Classification Ashutosh Saxena CS 374 – Algorithms in Biology Thursday, Nov 16, 2006.

Predictive Automatic Relevance Determination by Expectation Propagation Yuan (Alan) Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani.

Author: Jason Weston et., al PANS Presented by Tie Wang Protein Ranking: From Local to global structure in protein similarity network.

Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu.

Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)

Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.

Remote Homology detection: A motif based approach CS 6890: Bioinformatics - Dr. Yan CS 6890: Bioinformatics - Dr. Yan Swati Adhau Swati Adhau 04/14/06.

Analysis of tandem mass spectra - II Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology.

Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.

Outline Separating Hyperplanes – Separable Case

Whole Genome Expression Analysis

Using Bayesian Networks to Analyze Expression Data N. Friedman, M. Linial, I. Nachman, D. Hebrew University.

Genetic Regulatory Network Inference Russell Schwartz Department of Biological Sciences Carnegie Mellon University.

The dynamic nature of the proteome

PROTEIN STRUCTURE NAME: ANUSHA. INTRODUCTION Frederick Sanger was awarded his first Nobel Prize for determining the amino acid sequence of insulin, the.

Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:

Laxman Yetukuri T : Modeling of Proteomics Data

Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)

Discovering the Correlation Between Evolutionary Genomics and Protein-Protein Interaction Rezaul Kabir and Brett Thompson

Jun-Won Suh Intelligent Electronic Systems Human and Systems Engineering Department of Electrical and Computer Engineering Speaker Verification System.

Construction of Substitution Matrices

CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.

INF380 - Proteomics-101 INF380 – Proteomics Chapter 10 – Spectral Comparison Spectral comparison means that an experimental spectrum is compared to theoretical.

A Study of Residue Correlation within Protein Sequences and its Application to Sequence Classification Christopher Hemmerich Advisor: Dr. Sun Kim.

1 Kernel based data fusion Discussion of a Paper by G. Lanckriet.

Biological Signal Detection for Protein Function Prediction Investigators: Yang Dai Prime Grant Support: NSF Problem Statement and Motivation Technical.

Protein Identification via Database searching Attila Kertész-Farkas Protein Structure and Bioinformatics Group, ICGEB, Trieste.

Gene Expression and Networks. 2 Microarray Analysis Supervised Methods -Analysis of variance -Discriminate analysis -Support Vector Machine (SVM) Unsupervised.

Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.

Predicting protein function from heterogeneous data

1 CISC 841 Bioinformatics (Fall 2007) Kernel engineering and applications of SVMs.

Class 23, 2001 CBCl/AI MIT Bioinformatics Applications and Feature Selection for SVMs S. Mukherjee.

Support Vector Machines and Gene Function Prediction Brown et al PNAS. CS 466 Saurabh Sinha.

Support Vector Machines. Notation Assume a binary classification problem. –Instances are represented by vector x   n. –Training examples: x = (x 1,

Application of latent semantic analysis to protein remote homology detection Wu Dongyin 4/13/2015.

GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function Sara Mostafavi, Debajyoti Ray, David Warde-Farley,

Sequence Alignment.

Robust Optimization and Applications in Machine Learning.

CS Statistical Machine learning Lecture 12 Yuan (Alan) Qi Purdue CS Oct

NTU & MSRA Ming-Feng Tsai

Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae Vu, T. T.,

Combining Evolutionary Information Extracted From Frequency Profiles With Sequence-based Kernels For Protein Remote Homology Detection Name: ZhuFangzhi.

Final Report (30% final score) Bin Liu, PhD, Associate Professor.

Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)

Ubiquitination Sites Prediction Dah Mee Ko Advisor: Dr.Predrag Radivojac School of Informatics Indiana University May 22, 2009.

Gist 2.3 John H. Phan MIBLab Summer Workshop June 28th, 2006.

Mismatch String Kernals for SVM Protein Classification Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble Presented by Pradeep Anand.

Discriminative n-gram language modeling Brian Roark, Murat Saraclar, Michael Collins Presented by Patty Liu.

Spectral Algorithms for Learning HMMs and Tree HMMs for Epigenetics Data Kevin C. Chen Rutgers University joint work with Jimin Song (Rutgers/Palentir),

Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)

Combining HMMs with SVMs

1 Department of Engineering, 2 Department of Mathematics,

1 Department of Engineering, 2 Department of Mathematics,

1 Department of Engineering, 2 Department of Mathematics,

Jeremy Morris & Eric Fosler-Lussier 04/19/2007

Anastasia Baryshnikova Cell Systems

Presentation transcript:

A statistical framework for genomic data fusion William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington

Outline Recognizing correctly identified peptides The support vector machine algorithm Experimental results Yeast protein classification SVM learning from heterogeneous data Results

Recognizing correctly identified peptides

Database search Protein sample Tandem mass spectrometer Search algorithm Observed spectra Predicted peptides Sequence database

The learning task We are given paired observed and theoretical spectra. Question: Is the pairing correct? Observed Theoretical

Properties of the observed spectrum 1.Total peptide mass. Too small yields little information; too large (>25 amino acids) yields uneven fragmentation. 2.Charge (+1, +2 or +3). Provides some evidence about amino acid composition. 3.Total ion current. Proportional to the amount of peptide present. 4.Peak count. Small indicates poor fragmentation; large indicates noise.

Observed vs. theoretical spectra 5.Mass difference. 6.Percent of ions matched. Number of matched ions / total number of ions. 7.Percent of peaks matched. Number of matched peaks / total number of peaks. 8.Percent of peptide fragment ion current matched. Total intensity of matched peaks / total intensity of all peaks. 9.Preliminary SEQUEST score (S p ). 10.Preliminary score rank. 11.SEQUEST cross-correlation (XCorr).

Top-ranked vs. second-ranked peptides 12.Change in cross-correlation. Compute the difference in XCorr for the top-ranked and second-ranked peptide. 13.Percent sequence identity. Usually anti- correlated with change in cross- correlation.

Positive examples Negative examples

The support vector machine algorithm

SVMs in computational biology Splice site recognition Protein sequence similarity detection Protein functional classification Regulatory module search Protein-protein interaction prediction Gene functional classification from microarray data Cancer classification from microarray data

Support vector machine

Locate a plane that separates positive from negative examples. Focus on the examples closest to the boundary.

Kernel matrix

Kernel functions Let X be a finite input space. A kernel is a function K, such that for all x, z  X, K(x, z) =  (x) ·  (y), where  is a mapping from X to an (inner product) feature space F. Let K(x,z) be a symmetric function on X. Then K(x,z) is a kernel function if and only if the matrix is positive semi-definite.

Peptide ID kernel function Let p(x,y) be the function that computes a 13-element vector of parameters for a pair of spectra, x and y. The kernel function K operates on pairs of observed and theoretical spectra:

Experimental results

Experimental design Data consists of one 13-element vector per predicted peptide. Each feature is normalized to sum to 1.0 across all examples. The SVM is tested using leave-one-out cross- validation. The SVM uses a second-degree polynomial, normalized kernel with a 2-norm asymmetric soft margin.

Three data sets Set 1: Ion trap mass spectrometer. Sequest search on the full non-redundant database. Set 2: Ion trap mass spectrometer. Sequest search on human NRDB. Set 3: Quadrupole time-of-flight mass spectrometer. Sequence search on human NRDB.

Data set sizes QTOF HNRDB Ion-trap HNRDB Ion-trap NRDB TotalNegativePositive

(18,966) (126,931) (57,732) (108,936) (27,936)

Conversion to probabilities Hold out a subset of the training examples. Use the hold-out set to fit a sigmoid. This is equivalent to assuming that the SVM output is proportional to the log- odds of a positive example. y = label f = discriminant

Yeast protein classification

Membrane proteins Membrane proteins anchor in a cellular membrane (plasma, ER, golgi, mitochondrial). Communicate across membrane. Pass through the membrane several times.

mRNA expression data protein-protein interaction data sequence data Heterogeneous data

Vector representation Each matrix entry is an mRNA expression measurement. Each column is an experiment. Each row corresponds to a gene.

>ICYA_MANSE GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH >LACB_BOVIN MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE KFDKALKALPMHIRLSFNPTQLEEQCHI Sequence kernels We cannot compute a scalar product on a pair of variable-length, discrete strings.

Pairwise comparison kernel

Pairwise kernel variants Smith-Waterman all- vs-all BLAST all-vs-all Smith-Waterman w.r.t. SCOP database E-values from Pfam database

Protein-protein interactions Pairwise interactions can be represented as a graph or a matrix protein

Linear interaction kernel The simplest kernel counts the number of interactions between each pair

Diffusion kernel A general method for establishing similarities between nodes of a graph. Based upon a random walk. Efficiently accounts for all paths connecting two nodes, weighted by path lengths.

Hydrophobicity profile Transmembrane regions are typically hydrophobic, and vice versa. The hydrophobicity profile of a membrane protein is evolutionarily conserved. Membrane protein Non-membrane protein

Hydrophobicity kernel Generate hydropathy profile from amino acid sequence using Kyte-Doolittle index. Prefilter the profiles. Compare two profiles by –Computing fast Fourier transform (FFT), and –Applying Gaussian kernel function. This kernel detects periodicities in the hydrophobicity profile.

SVM learning from heterogeneous data

Combining kernels Identical A B K(A)K(B) K(A)+K(B) AB A:B K(A:B)

Semidefinite programming Define a convex cost function to assess the quality of a kernel matrix. Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.

Learn K from the convex cone of positive-semidefinite matrices or a convex subset of it : According to a convex quality measure: Integrate constructed kernels Learn a linear mix Large margin classifier (SVM) Maximize the margin SDP Semidefinite programming

Integrate constructed kernels Learn a linear mix Large margin classifier (SVM) Maximize the margin

Experimental results

Seven yeast kernels KernelDataSimilarity measure K SW protein sequenceSmith-Waterman KBKB protein sequenceBLAST K HMM protein sequencePfam HMM K FFT hydropathy profileFFT K LI protein interactions linear kernel KDKD protein interactions diffusion kernel KEKE gene expressionradial basis kernel

Membrane proteins

Comparison of performance Simple rules from hydrophobicity profile TMHMM

Cytoplasmic ribosomal proteins

False negative predictions

False negative expression profiles

Markov Random Field General Bayesian method, applied by Deng et al. to yeast functional classification. Used five different types of data. For their model, the input data must be binary. Reported improved accuracy compared to using any single data type.

Yeast functional classes CategorySize Metabolism1048 Energy242 Cell cycle & DNA processing600 Transcription753 Protein synthesis335 Protein fate578 Cellular transport479 Cell rescue, defense264 Interaction w/ evironment193 Cell fate411 Cellular organization192 Transport facilitation306 Other classes81

Six types of data Presence of Pfam domains. Genetic interactions from CYGD. Physical interactions from CYGD. Protein-protein interaction by TAP. mRNA expression profiles. (Smith-Waterman scores).

Results MRF SDP/SVM (binary) SDP/SVM (enriched)

Many yeast kernels protein sequence phylogenetic profiles separate gene expression kernels time series expression kernel promoter regions using seven aligned species protein localization ChIP protein-protein interactions yeast knockout growth data more...

Future work New kernel functions that incorporate domain knowledge. Better understanding of the semantics of kernel weights. Further investigation of yeast biology. Improved scalability of the algorithm. Prediction of protein-protein interactions.

Acknowledgments Dave Anderson, University of Oregon Wei Wu, Genome Sciences, UW & FHCRC Michael Jordan, Statistics & EECS, UC Berkeley Laurent El Ghaoui, EECS, UC Berkeley Gert Lanckriet, EECS, UC Berkeley Nello Cristianini, Statistics, UC Davis

Fisher criterion score High score Low score

Feature ranking delta C n % match total ion current C n % match peaks S p mass0.704 charge rank S p peak count sequence similarity % ion match total ion current delta mass0.024

Pairwise feature ranking % match TIC-delta Cn % match peaks-delta Cn % match TIC-Cn delta Cn-Cn delta Cn-charge % match peaks-Cn3.377 delta Cn-mass % match TIC-% match peaks2.823 % ion match-delta Cn Sp-delta Cn % match TIC-Sp % match peaks-Sp % match TIC-mass % ion match-mass Cn-charge Sp-mass % match TIC-charge Cn-mass Sp-Cn % ion match-Cn Sp-charge % match peaks-mass1.668 % match peaks-charge % match TIC-% ion match1.473