Yiming Yang1,2, Abhay Harpale1 and Subramanian Ganaphathy1

Presentation transcript:

Protein Identification from Tandem Mass Spectra with Probabilistic Language Modeling
Yiming Yang1,2, Abhay Harpale1 and Subramanian Ganaphathy1
1: Language Technologies Institute
2: Machine Learning Department
School of Computer Science, Carnegie Mellon University

Outline
- Motivation & Background
- Two probabilistic approaches
- Experiments
@Yiming Yang, ECML 2009, Sept 8

Motivation
- Proteins are important bio-markers for diseases, drug toxicity, therapeutic outcomes, etc.
- Statistical approaches have been developed for protein identification in computational proteomics.
- Interdisciplinary research comparing current solutions with successful methods in IR (information retrieval) for similar problems has been rare.
- We address this research gap by:
  - Analyzing a major limitation of popular approaches in protein ID
  - Proposing a new solution (Language Modeling for IR)

The Protein ID Problem
- Tandem mass (MS-MS) spectra are produced using some chemical process on an input sample (e.g., blood).
- A sample typically consists of multiple proteins.
- The process segments each protein into many (hundreds of) pieces, called peptides.
- Peptides are further decomposed into ionized segments.
- The MS-MS spectrum of a peptide is a series of spikes.
- Each spike is the mass/charge (m/z) ratio of an ionized segment in the peptide.

The Protein ID Problem (cont'd)
- Protein identification requires a mapping from empirical (MS-MS) spectra to protein sequences in a DB.
- There are many protein sequence databases; SwissProt, for example, contains 280,000+ sequences.
- Each protein is defined as a sequence of amino-acid letters.
- Peptides in each protein are specified using cleaving rules.
- Each peptide has an amino-acid sequence and a corresponding theoretical ("expected") spectrum.

[Diagram relating empirical spectra of peptides in a sample to theoretical spectra of peptides in a DB via mapping and matching; listed techniques: Fourier transformation, probabilistic models, heuristic rules]

[Diagram: analogy with document retrieval across two languages. Empirical spectra correspond to words in L1 and theoretical spectra to words in L2; mapping and matching yield matched words (in L2), and doc retrieval yields matched documents (in L2)]

Outline
- Motivation & Background
- Two probabilistic approaches
- Experiments

A Popular Approach in Protein ID (ProteinProphet by Nesvizhskii et al., 2003)
- Given the predicted peptides based on MS-MS spectra, the probability for each candidate protein is estimated by a formula that is not preserved in this transcript.
- It estimates the probability for a Boolean OR logic.
- It typically produces many false positives.
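
The slide's formula is missing from the transcript, but a probabilistic Boolean OR over peptide-level evidence is commonly written as a noisy-OR: the protein is present if at least one of its peptide assignments is correct. The sketch below assumes that form and assumes independent peptide probabilities; the function name `prob_or` is hypothetical, and the published ProteinProphet method may include adjustments that are not modeled here.

```python
from math import prod


def prob_or(peptide_probs):
    """Noisy-OR: probability that at least one peptide assignment is correct.

    Assumes the peptide probabilities are independent (an assumption of
    this sketch, not a statement about ProteinProphet's exact model).
    """
    return 1.0 - prod(1.0 - p for p in peptide_probs)


# Four weak peptide matches already give a fairly high protein score,
# which illustrates why an OR-style combination can yield false positives.
weak_evidence = [0.3, 0.3, 0.3, 0.3]
print(round(prob_or(weak_evidence), 4))
```

Under these assumptions, every extra peptide can only raise the score, so long proteins with many low-confidence matches are favored.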

A Popular Approach in IR
- Language Models (Ponte 1998; Lafferty & Zhai, 2001; …)
- Query (q) is represented using a bag of words.
- Document (d) is represented using a bag of words.
- The KL-divergence of the two word distributions (θq and θd) is
  $$\mathrm{KL}(\theta_q \,\|\, \theta_d) = \sum_w P(w \mid \theta_q)\log P(w \mid \theta_q) - \sum_w P(w \mid \theta_q)\log P(w \mid \theta_d)$$
- The first term (the negative entropy of θq) does not depend on d, so it does not affect doc ranking.
- The second term is the cross entropy H(θq || θd), a "soft" measure for the Boolean AND logic.
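
A small sketch of why only the cross entropy matters for ranking: KL-divergence equals cross entropy minus the query entropy, and the query entropy is the same for every document. The toy distributions and function names are hypothetical, and the code assumes the document models are already smoothed so that every query word has nonzero probability.

```python
from math import log


def entropy(p):
    """Entropy of a word distribution given as {word: probability}."""
    return -sum(pw * log(pw) for pw in p.values() if pw > 0)


def cross_entropy(p, q):
    """Cross entropy H(p || q); q must be nonzero wherever p is nonzero."""
    return -sum(pw * log(q[w]) for w, pw in p.items() if pw > 0)


def kl_divergence(p, q):
    """KL(p || q) = H(p || q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)


theta_q = {"a": 0.5, "b": 0.5}
docs = {
    "d1": {"a": 0.45, "b": 0.45, "c": 0.10},  # covers both query words
    "d2": {"a": 0.89, "b": 0.01, "c": 0.10},  # covers only one well
}

rank_by_kl = sorted(docs, key=lambda d: kl_divergence(theta_q, docs[d]))
rank_by_ce = sorted(docs, key=lambda d: cross_entropy(theta_q, docs[d]))
print(rank_by_kl, rank_by_ce)
```

The document that gives very low probability to one query word is penalized heavily by the log term, which is the sense in which the score behaves like a soft AND.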

LM for Protein ID
- Query language model for predicted peptides
- Document language model for each protein sequence
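
The slide's formulas are not in the transcript, so the following is only an illustrative sketch of the idea: peptides play the role of words, the predicted peptides form the query model, and each protein's peptide set forms a document model. The choices below are assumptions of this sketch, not the authors' definitions: query weights are normalized peptide scores, document models use Jelinek-Mercer smoothing with weight `lam`, and all names (`query_model`, `document_model`, `rank_proteins`) are hypothetical.

```python
from collections import Counter
from math import log


def query_model(peptide_scores):
    """Normalize peptide scores into a distribution over peptides."""
    total = sum(peptide_scores.values())
    return {pep: s / total for pep, s in peptide_scores.items()}


def document_model(protein_peptides, collection, lam):
    """Smoothed P(peptide | protein), mixed with the collection model."""
    counts = Counter(protein_peptides)
    n = len(protein_peptides)
    return lambda pep: (1 - lam) * counts[pep] / n + lam * collection.get(pep, 0.0)


def rank_proteins(peptide_scores, proteins, lam=0.5):
    """Rank proteins by negative cross entropy with the query model."""
    all_peps = [p for peps in proteins.values() for p in peps]
    collection = {p: c / len(all_peps) for p, c in Counter(all_peps).items()}
    theta_q = query_model(peptide_scores)
    scores = {}
    for name, peps in proteins.items():
        theta_d = document_model(peps, collection, lam)
        # Query peptides absent from the database are skipped (assumption).
        scores[name] = sum(
            w * log(theta_d(pep)) for pep, w in theta_q.items() if pep in collection
        )
    return sorted(scores, key=scores.get, reverse=True)


proteins = {"P1": ["AAK", "CDR", "EFK"], "P2": ["AAK", "GHR"], "P3": ["LMK", "NPR"]}
predicted = {"AAK": 0.9, "CDR": 0.8, "EFK": 0.7}
print(rank_proteins(predicted, proteins))
```

In this toy example the protein that contains all three predicted peptides ranks above the one that contains only a single shared peptide.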

Outline
- Motivation & Background
- Two probabilistic approaches
- Experiments

Data Sets
- PPK (Purvine et al., 2003)
  - 2995 empirical spectra from a mixture of 35 proteins
  - 4535 protein sequences (325,812 unique peptides)
- Mark12
  - 9380 empirical spectra from a mixture of 12 proteins
  - 50,012 protein sequences (5,149,302 unique peptides) randomly sampled from the SwissProt database
- Sigma49
  - 12,498 empirical spectra from a mixture of 49 proteins
  - 50,049 protein sequences (2,571,642 unique peptides) randomly sampled from the SwissProt database

Systems
- Prob-AND: our proposed method
- Prob-OR: Nesvizhskii's method, our own implementation
- Conventional Vector Space Model (TFIDF-cosine): supported by the Lemur search engine (Callan, 2002)
- X!Tandem: a popular software package (available online) for protein/peptide ID
- All the systems except X!Tandem used SEQUEST to predict a set of peptides (as the "query").
- Each system produces a ranked list of proteins per query.

Metrics
- Mean Average Precision (MAP): standard metric in IR for evaluating ranked lists
- Evaluate each ranked list from the top to each position where a true positive document is retrieved:
  - Recall = TP/(TP + FN)
  - Precision = TP/(TP + FP)
  - TP = # of true positives, TN = # of true negatives, FP = # of false positives, FN = # of false negatives
- Average the precision scores at the recall levels 0%, 10%, 20%, …, 100% ("11-pt AVGP").
- Compute the mean of AVGP across all queries.
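
The metric described above can be sketched as follows. The slide does not say how precision is assigned to a recall level that falls between two retrieved true positives, so this sketch assumes the usual interpolation (the highest precision at any recall at or above that level); the function names are hypothetical.

```python
def eleven_point_avgp(ranked, relevant):
    """11-point interpolated average precision for one ranked list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    points = []  # (recall, precision) at each rank where a true positive occurs
    tp = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            tp += 1
            points.append((tp / len(relevant), tp / rank))
    levels = [i / 10 for i in range(11)]  # 0%, 10%, ..., 100%
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0) for level in levels
    ]
    return sum(interpolated) / len(levels)


def mean_avgp(runs):
    """Mean of 11-pt AVGP over (ranked_list, relevant_set) pairs, one per query."""
    return sum(eleven_point_avgp(ranked, rel) for ranked, rel in runs) / len(runs)


runs = [
    (["P1", "P2", "X"], {"P1", "P2"}),       # both true proteins ranked first
    (["P1", "X", "P2", "Y"], {"P1", "P2"}),  # one false positive in between
]
print(mean_avgp(runs))
```

A true protein that is never retrieved contributes zero precision at the recall levels it would have reached, so the score rewards high-recall rankings.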

Main Results

Statistical Significance Tests on Proportions

Summary
- The first interdisciplinary investigation/evaluation of state-of-the-art IR methods (LM and VSM) in protein identification.
- Prob-AND (LM) is a better choice of criterion than prob-OR for combining peptide-level evidence, improving precision significantly in the high-recall regions.
- Understanding the nature of proteomic data/problems is hard for researchers with different backgrounds (IR or ML), but the outcome is and will be rewarding.

Future Research
- Finding the "best" protein mixture (Arnold et al., PSB 2007)
- Instead of predicting each protein independently
- Reduces to solving the minimum set cover problem (NP-hard)
- Revised as finding the most likely protein mixture (Li et al., 2008)
- Greedy approximation strategies
- Using Gibbs sampling (local maxima, efficiency issues)
- Better results than ProteinProphet (prob-OR) on Sigma49
- Comparative evaluation (with LM, VSM, etc.) would be informative
- Scalability for high-recall predictions from very large protein databases?
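
To make the set-cover view concrete, here is the textbook greedy approximation: repeatedly pick the protein that explains the most still-unexplained peptides. This is a generic sketch, not the algorithm of Arnold et al. or Li et al.; the function name and toy data are hypothetical.

```python
def greedy_protein_cover(observed_peptides, protein_peptides):
    """Greedy set cover: choose proteins until no more peptides can be explained.

    protein_peptides maps a protein name to the set of peptides it contains.
    Returns the chosen proteins and any peptides left unexplained.
    """
    uncovered = set(observed_peptides)
    chosen = []
    while uncovered:
        # Ties are broken by dictionary order (first best protein wins).
        best = max(
            protein_peptides,
            key=lambda name: len(uncovered & protein_peptides[name]),
        )
        gained = uncovered & protein_peptides[best]
        if not gained:
            break  # remaining peptides match no protein in the database
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


database = {"P1": {"a", "b", "c"}, "P2": {"c", "d"}, "P3": {"d"}, "P4": {"a"}}
observed = {"a", "b", "c", "d", "z"}
print(greedy_protein_cover(observed, database))
```

Unlike scoring each protein independently, this selects a small mixture that jointly explains the observed peptides, at the cost of a greedy (not guaranteed optimal) choice.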

Thanks!