Proteomic Parsimony through Bipartite Graph Analysis Improves Accuracy and Transparency 2013/05/28 Ahn, Soohan.

Introduction Protein assembly transforms a list of identified peptides into a list of identified proteins. Analysis steps: duplicate spectrum recognition, peptide charge state discernment, peptide identification, protein assembly, identification error rate assessment, sample comparison.

Introduction As one peptide sequence can be mapped to multiple proteins in a database, naïve protein assembly can substantially overstate the number of proteins found in samples.

Introduction Tools. DTASelect: groups together proteins with identical sets of identified peptides and uses a similarity score to describe the relationship between proteins with overlapping peptide identifications. DBParser: classifies and reports proteins in six hierarchical categories, using parsimony analysis. Statistical approach by Nesvizhskii et al.: computes probabilities that proteins are present in a sample on the basis of estimated peptide identification probabilities.

Introduction Parsimony algorithm. Substantially reduces the number of proteins reported. Issues: no evaluation of the quality of the generated lists of protein identifications; correct protein identifications may be erroneously filtered out; algorithmically complex, because the algorithm must describe the many-to-many relationships between identified peptides and the proteins that potentially explain their appearance.

Introduction IDPicker. Estimates false discovery rates (FDR) from a reversed-sequence database search, instead of relying on peptide identification score thresholds, to control the quality of the peptide identifications. Uses efficient graph algorithms on the peptide-protein relationships to identify protein clusters with shared peptides and to derive the minimal list of proteins.

IDPicker A pipeline of tools designed to assemble confident, parsimonious protein identifications from raw spectral identifications. Its three modules together perform these steps: read the unfiltered peptide identifications from an SQT file; apply an initial pass of filtering (typically an FDR of 25%); group these identifications into appropriate sets; filter peptides to the final FDR (typically 5%); apply parsimony analysis to the discovered proteins and produce reports.

Deriving Error Estimates for Peptide Identification Identification score thresholds are chosen to correspond to a user-specified FDR. FDR = (2R) / (F + R). F: the number of peptide identifications derived from the forward-sequence database. R: the number of peptide identifications derived from the reverse-sequence database.
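
The FDR formula above can be sketched in a few lines. This is only an illustration, not IDPicker's code: the function names, the `(score, is_reverse)` tuple layout, and the assumption that higher scores are better and all scores are distinct are hypothetical choices made for the example.

```python
def estimate_fdr(forward: int, reverse: int) -> float:
    """FDR = 2R / (F + R), with F forward hits and R reverse hits."""
    total = forward + reverse
    if total == 0:
        return 0.0
    return 2 * reverse / total


def score_threshold_for_fdr(hits, max_fdr):
    """Return the lowest score cutoff whose retained hits stay within max_fdr.

    hits: iterable of (score, is_reverse) pairs. Assumes higher is better
    and that scores are distinct. Returns None if no cutoff qualifies.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    forward = reverse = 0
    best_cutoff = None
    for score, is_reverse in ranked:
        if is_reverse:
            reverse += 1
        else:
            forward += 1
        # Keep extending the cutoff while the running FDR is acceptable.
        if estimate_fdr(forward, reverse) <= max_fdr:
            best_cutoff = score
    return best_cutoff
```

In a two-pass setup like the one described for IDPicker, such a function would be called once with a loose FDR (for example 0.25) and again with the final FDR (for example 0.05).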

Deriving Error Estimates for Peptide Identification All peptides that pass this initial filter are given equal standing. Identifications below the threshold are removed entirely from consideration.

Bipartite Graph Analysis of the Peptide Identification Data Collect all proteins that could account for the identified peptides. The protein-peptide mapping is many-to-many and complex; it can be modeled as a bipartite graph.

Bipartite Graph Analysis of the Peptide Identification Data An undirected graph. Vertices can be partitioned into two sets such that no edge connects vertices in the same set.

Bipartite Graph Analysis of the Peptide Identification Data The algorithm has four steps: Initialize, Collapse, Separate, Reduce.

Bipartite Graph Analysis of the Peptide Identification Data Initialize: represent the peptide identification data as a bipartite graph with two sets of vertices, proteins and peptides.

Bipartite Graph Analysis of the Peptide Identification Data Collapse Some protein vertices are connected to exactly the same set of peptide vertices.

Bipartite Graph Analysis of the Peptide Identification Data Collapse: define meta-proteins and meta-peptides. Meta-protein: a group of proteins that are indiscernible based on the available evidence. Meta-peptide: a group of peptides that are indiscernible based on the available evidence.

Bipartite Graph Analysis of the Peptide Identification Data Collapse: after this step, the bipartite graph has two sets of vertices, meta-protein vertices and meta-peptide vertices.
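
A minimal sketch of the collapse step follows. It assumes the graph is stored as a hypothetical `protein_to_peptides` dictionary, and it assumes that "indiscernible" means "connected to exactly the same set of vertices on the other side" for both proteins and peptides; the slides state this explicitly only for proteins.

```python
from collections import defaultdict


def collapse(protein_to_peptides):
    """Group indiscernible proteins and peptides.

    protein_to_peptides: dict mapping a protein name to a set of peptides.
    Returns (meta_proteins, meta_peptides) as sorted lists of tuples.
    """
    # Proteins that share exactly the same peptide set form one meta-protein.
    by_peptide_set = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        by_peptide_set[frozenset(peptides)].append(protein)

    # Invert the mapping: peptide -> set of proteins containing it.
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)

    # Peptides that map to exactly the same protein set form one meta-peptide.
    by_protein_set = defaultdict(list)
    for peptide, proteins in peptide_to_proteins.items():
        by_protein_set[frozenset(proteins)].append(peptide)

    meta_proteins = sorted(tuple(sorted(g)) for g in by_peptide_set.values())
    meta_peptides = sorted(tuple(sorted(g)) for g in by_protein_set.values())
    return meta_proteins, meta_peptides
```

Using `frozenset` keys makes the grouping a single pass over the data, since two proteins land in the same bucket exactly when their peptide sets are equal.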

Bipartite Graph Analysis of the Peptide Identification Data Separate Two proteins are independent with regard to protein assembly if they share no peptides directly or indirectly through other proteins.

Bipartite Graph Analysis of the Peptide Identification Data Separate: decompose the complex bipartite graph into independent subgraphs of proteins with shared peptides. This is done with a depth-first search. Each connected component represents a meta-protein cluster.
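
The separate step amounts to finding connected components. The sketch below uses an iterative depth-first search over a hypothetical `protein_to_peptides` dictionary; the function name and data layout are assumptions for illustration only.

```python
def protein_clusters(protein_to_peptides):
    """Return clusters of proteins linked directly or indirectly by peptides.

    protein_to_peptides: dict mapping a protein name to a set of peptides.
    """
    # Invert the mapping so peptides lead back to their proteins.
    peptide_to_proteins = {}
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins.setdefault(peptide, set()).add(protein)

    seen = set()
    clusters = []
    for start in protein_to_peptides:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]  # explicit stack: depth-first search without recursion
        cluster = []
        while stack:
            protein = stack.pop()
            cluster.append(protein)
            for peptide in protein_to_peptides[protein]:
                for neighbor in peptide_to_proteins[peptide]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
        clusters.append(sorted(cluster))
    return sorted(clusters)
```

Because clusters share no peptides, each one can then be reduced independently of the others.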

Bipartite Graph Analysis of the Peptide Identification Data Reduce: generate a minimal list of meta-proteins for each meta-protein cluster, using a greedy set cover algorithm.

Bipartite Graph Analysis of the Peptide Identification Data The set cover problem and the greedy set cover algorithm. The set cover decision problem is NP-complete, so a greedy heuristic is used instead of an exact search.
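
The greedy heuristic can be sketched as follows: repeatedly pick the protein that explains the most still-unexplained peptides. The names are hypothetical, and the alphabetical tie-breaking is an assumption added only to make the example deterministic.

```python
def greedy_set_cover(protein_to_peptides):
    """Pick proteins greedily until every peptide is explained.

    protein_to_peptides: dict mapping a protein name to a set of peptides.
    Returns the chosen proteins in the order they were selected.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # max() returns the first maximal element, so sorting the names
        # first breaks ties alphabetically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen
```

The result covers every peptide but is not guaranteed to be the smallest possible cover; that is the trade-off of using a heuristic on an NP-complete problem.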

Bipartite Graph Analysis of the Peptide Identification Data Reduce Parsimonious protein list generated.

Results & Discussion Data sets. Sigma49: a human protein mixture. Yeast-Extract: a yeast whole cell extract. Serum-MARS: a human serum proteome. Databases. Swiss-Prot (SP): for human. IPI Human (IPI): for human. Saccharomyces Genome Database (SGD): for yeast. Compact species-specific subsets of Swiss-Prot: SPH for human, SPY for yeast.

Results & Discussion Protein List Reduction White bar: each protein counted separately, whether or not it can be distinguished from others on the basis of observed peptides. Gray bar: the result of grouping indiscernible proteins into meta-proteins. Black bar: meta-proteins that remain after parsimony analysis.

Results & Discussion Protein counts were reduced most in the Swiss-Prot searches. By grouping indiscernible proteins: Sigma49, 39% reduced; Serum-MARS, 24% reduced; Yeast-Extract, 14% reduced. By parsimony analysis, in addition: Sigma49, 51% reduced; Serum-MARS, 44% reduced; Yeast-Extract, 3% reduced.

Results & Discussion Grouping indiscernible proteins and parsimony analysis can improve protein reporting.

Results & Discussion Removing the redundancy in the protein list. In the Sigma49 runs, the average initial numbers of proteins were (SP, IPI, SPH) = (414, 161, 59). After the two reductions, the counts of meta-proteins were (SP, IPI, SPH) = (51, 49, 48). The known proteins in the original Sigma49 sample are 37, 32, and 40.

Results & Discussion These list reduction strategies cause the resulting protein lists to converge to numbers far closer to the true number of proteins in the sample.

Results & Discussion Improved Accuracy of Protein Identification Reducing the size of protein lists is useful only if incorrect protein identifiers are the ones being removed.

Results & Discussion Terms True positive (TP): a meta-protein is counted as a TP if it includes one of the 49 proteins listed as part of the sample. False positive (FP): any meta-protein that is not a true positive. Precision: nTP / (nTP + nFP). Recall: nTP / nP, where nP is the number of all proteins in the sample, that is, 49 in this analysis. F1-measure: 2pr / (p + r), where p is precision and r is recall.
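
The three measures can be computed directly from the counts. The sketch below follows the definitions on the slide; the function name is hypothetical, and the counts used in any call are made up for illustration rather than taken from the study.

```python
def protein_metrics(n_tp: int, n_fp: int, n_p: int = 49):
    """Return (precision, recall, F1) for a protein list.

    n_tp: true-positive meta-proteins; n_fp: false-positive meta-proteins;
    n_p: number of proteins known to be in the sample (49 for Sigma49).
    """
    reported = n_tp + n_fp
    precision = n_tp / reported if reported else 0.0
    recall = n_tp / n_p
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 40 true and 10 false meta-proteins.
p, r, f1 = protein_metrics(40, 10)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so it rewards a list only when it is both short on false identifications and complete.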

Results & Discussion Terms PEP1: Retains all meta-proteins. PEP2: Retains only meta-proteins matching to at least two different peptide sequences. PEP1-PARS : Applies the parsimony analysis on PEP1. PEP2-PARS : Applies the parsimony analysis on PEP2.

Results & Discussion Sigma49 Data set. The bipartite graph approach is highly effective at removing false protein identifications while retaining true identifications.

Results & Discussion SPH search: less effective. Parsimony analysis is most useful in removing redundant homologous proteins. It will be most powerful in processing data sets generated by searching multispecies databases, such as Swiss-Prot and IPI.

Results & Discussion Yeast-Extract data set. The F1-measure cannot be computed because the protein content is not defined. True positive (TP): a meta-protein is counted as a TP if it includes at least one protein with the “_YEAST” identifier. False positive (FP): any meta-protein that is not a true positive.

Results & Discussion Yeast-Extract data set. PEP2 filtering lost a considerable number of true identifications. The removed yeast proteins could actually be false identifications, as we obviously overestimated the number of true positives.

Results & Discussion IPI search on Serum-MARS data set. A total of 194523 tandem mass spectra were searched, and 350648 identifications resulted from the database search. IDPicker filtered these identifications down to 37246 to achieve a 5% FDR. The software found that 2605 different peptide sequences were represented. These 2605 peptides could be explained by as many as 472 proteins (including reversed sequences). These could be reduced to 339 distinguishable meta-proteins and subsequently to 189 meta-proteins after parsimony analysis.

Results & Discussion IPI search on Serum-MARS data set. IDPicker produces a tabular list of the proteins.

Results & Discussion IPI search on Serum-MARS data set. Association tables revealing which meta-proteins map to which meta-peptides.

Results & Discussion IPI search on Serum-MARS data set. A graphic illustrating the relationship among the five proteins and seven meta-peptides.

Results & Discussion IPI search on Serum-MARS data set. Association tables revealing which meta-proteins map to which meta-peptides. (After the parsimonious analysis)

Results & Discussion Grouping functionally related proteins. Proteins are clustered by their shared peptides. For the top five clusters, the number of proteins in each cluster is reported both with and without parsimony applied, in the form X -> Y. X: the number of proteins before parsimony analysis. Y: the number of proteins after parsimony analysis.

Conclusion The bipartite graph is a useful model for representing peptide identification data in LC-MS/MS proteomics. It provides efficiency, accuracy, and transparency in deriving a minimal protein list from peptide identifications. The bipartite graph analysis was highly efficient in removing false protein identifications while retaining true identifications. It groups functionally related proteins together through clustering proteins with shared sequences and, thus, helps users to examine results more efficiently.

IDPicker In version 2.0: multiple score combination; new partitioning strategy.

IDPicker In version 2.0: multiple score combination. Improves peptide identification by combining multiple scores reported by database search engines. Users can specify which scoring metrics from their search results are to be included. S = w1·s1 + … + wn·sn. The weights are either user-defined (static) or determined automatically using a Monte Carlo simulation method (dynamic).
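
The static form of the combined score is a plain weighted sum. The sketch below shows only that case; the function name, the dictionary layout, and the score names in the usage line are placeholders, and the Monte Carlo weight search is not reproduced here.

```python
def combined_score(scores, weights):
    """Compute S = w1*s1 + ... + wn*sn for one peptide identification.

    scores: dict mapping a score name to its value for this identification.
    weights: dict mapping the score names to include to their weights.
    Scores without a weight are ignored.
    """
    return sum(weight * scores[name] for name, weight in weights.items())


# Placeholder score names and weights, chosen only for illustration.
example = combined_score({"score_a": 2.0, "score_b": 0.5},
                         {"score_a": 1.0, "score_b": 4.0})
print(example)
```

Keeping the weights in a dictionary mirrors the idea that users choose which scoring metrics take part in the combination.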

Results & Discussion Combining multiple scores from a search engine.

IDPicker In version 2.0: new partitioning strategy. NTT and Z state partitioning: peptides with different numbers of tryptic termini (NTT) or different charge states (Z) are likely to produce scores in different ranges. Peptides are divided into 9 separate classes based on NTT (0, 1, 2) and Z state (1+, 2+, 3+), and a distinct score threshold is derived for each class.

Results & Discussion The effect of peptide partitioning was determined for three different search strategies: fully tryptic, semitryptic, and unconstrained. Four peptide partition styles were tested for each database search strategy: (A) no partitioning, (B) Z state (1+, 2+, or 3+) only, (C) NTT (0, 1, or 2) only, (D) both Z state and NTT.

Results & Discussion NTT and Z state partitioning in version 2.0. Data sets: tandem mass spectra from a whole cell lysate data set (“DLD1 LTQ”) and a human serum data set (“Serum Orbi”).
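
A rough sketch of the partitioning idea: identifications are bucketed by (NTT, charge) so that each class can be assessed against its own reverse-hit rate. The field names `ntt`, `charge`, and `is_reverse` are hypothetical, and the FDR uses the 2R / (F + R) definition.

```python
from collections import defaultdict


def partition_by_class(identifications):
    """Group identifications into classes keyed by (NTT, charge).

    identifications: iterable of dicts with 'ntt', 'charge', 'is_reverse'.
    """
    classes = defaultdict(list)
    for ident in identifications:
        classes[(ident["ntt"], ident["charge"])].append(ident)
    return dict(classes)


def class_fdr(members):
    """FDR = 2R / (F + R) within one class; F + R is the class size."""
    if not members:
        return 0.0
    reverse = sum(1 for m in members if m["is_reverse"])
    return 2 * reverse / len(members)
```

With NTT in {0, 1, 2} and charge in {1, 2, 3} this yields at most nine classes, each of which can then be given its own score threshold.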

IDPicker In version 2.0: a novel filter to remove spurious protein identifications from multispecies database searches. It adds a new protein to the minimal list of protein identifications only if the protein contributes a specified number of distinct peptide identifications that are not already explained by other proteins.
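
One way to picture this filter is as a greedy selection that stops adding proteins once the best remaining candidate explains too few new peptides. The sketch below is an assumption about how such a rule could be expressed, not IDPicker's actual implementation; the parameter name `min_additional` and the protein names in the usage note are hypothetical.

```python
def parsimony_with_min_additional(protein_to_peptides, min_additional=2):
    """Greedy selection that requires each added protein to explain at
    least min_additional peptides not explained by earlier choices.

    protein_to_peptides: dict mapping a protein name to a set of peptides.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Sorting first makes ties resolve alphabetically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & uncovered),
        )
        gain = len(protein_to_peptides[best] & uncovered)
        if gain < min_additional:
            break  # no remaining protein adds enough new evidence
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen
```

For example, if a hypothetical `MOUSE_A` shares two of its three peptides with an already selected `HUMAN_A`, it contributes only one additional peptide and is left out when `min_additional` is 2.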

Results & Discussion In version 2.0: a novel filter to remove spurious protein identifications from multispecies database searches. Data sets: DLD1 LTQ, Serum Orbi. Database: Swiss-Prot. Search engine: MyriMatch.