Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

Systems Biology
[Diagram, built up over several slides: Structured High-Throughput Experiments, Knowledge Databases, and Mathematical Models connect molecular biology ↕ phenotype and molecular biology ↕ biology. Labels: Localization, Function, Process, Interactions, Pathway, Mutation; Proteomics, Sequencing, Microarrays, Metabolomics; Functional Annotation Enrichment.]

Functional Annotation Enrichment
Draw 10 of 50!
In any draw, we expect: ~5 "evens", ~2 "≤ 10", etc.
Each ball is equally likely
Balls are independent
p-value is surprise!
For transcriptomics:
  Genes ↔ Balls
  Genome ↔ Tumbler
  Diff. Expr. ↔ Draw
  Annotation ↔ "evens", …
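The tumbler analogy corresponds to a hypergeometric model. The following is a minimal sketch of that model in Python, assuming a tumbler of 50 balls numbered 1 to 50 (so 25 "evens" and 10 balls "≤ 10"); the function names are illustrative and not taken from the talk.

```python
from fractions import Fraction
from math import comb


def expected_count(population, annotated, draws):
    """Expected number of annotated balls in a draw without replacement."""
    return Fraction(draws * annotated, population)


def hypergeom_upper_tail(population, annotated, draws, observed):
    """P(X >= observed) annotated balls when `draws` balls are drawn
    without replacement from `population` balls, `annotated` of which
    carry the annotation. Every ball is assumed equally likely."""
    most = min(annotated, draws)
    ways = sum(
        comb(annotated, x) * comb(population - annotated, draws - x)
        for x in range(observed, most + 1)
    )
    return Fraction(ways, comb(population, draws))


# Draw 10 of 50 (assumed: balls numbered 1..50).
print(expected_count(50, 25, 10))  # expected "evens"
print(expected_count(50, 10, 10))  # expected "<= 10"

# The "surprise" of seeing 9 or more evens in a draw of 10.
print(float(hypergeom_upper_tail(50, 25, 10, 9)))
```

The exact-fraction arithmetic avoids rounding in the tail sum; for genome-sized populations a statistics library would be the more practical choice.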

Why not in proteomics?
Double counting and false positives…
  …due to traditional protein inference
Proteomics cannot see all proteins…
  …proteins are not equally likely to be drawn
Good relative abundance is hard…
  …extra chemistries, workflows, and software
  …missing values are particularly problematic

In proteomics…
Double counting and false positives…
  Use generalized protein parsimony
Proteomics cannot see all proteins…
  Use identified proteins as background
Good relative abundance is hard…
  Model differential spectral counts directly

Ignore some PSMs
FDR filtering leaves some false PSMs
Enforce strict protein inference criteria
Leave some PSMs uncovered
[Figure: Proteins and PSMs, labeled 10% on one slide and 90% on the next]

Match uncovered PSMs to FDR
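The idea of strict inference criteria that deliberately leave some PSMs uncovered can be illustrated with a small greedy cover. This is a sketch only: the slides do not say how the parsimony problem is solved, so the greedy heuristic, the "at least k new peptides" criterion, the rule for picking k, and the toy data are all assumptions made for illustration.

```python
def greedy_parsimony(protein_peptides, peptide_psms, min_new_peptides):
    """Greedily select proteins, each of which must explain at least
    `min_new_peptides` peptides not yet explained by earlier picks.
    Returns the selected proteins and the fraction of PSMs left uncovered."""
    uncovered = set(peptide_psms)
    selected = []
    while True:
        best, best_gain = None, 0
        for protein in sorted(protein_peptides):
            if protein in selected:
                continue
            new = protein_peptides[protein] & uncovered
            gain = sum(peptide_psms[p] for p in new)
            if len(new) >= min_new_peptides and gain > best_gain:
                best, best_gain = protein, gain
        if best is None:
            break  # no remaining protein meets the criterion
        selected.append(best)
        uncovered -= protein_peptides[best]
    total = sum(peptide_psms.values())
    left = sum(peptide_psms[p] for p in uncovered)
    return selected, left / total


def choose_k(protein_peptides, peptide_psms, psm_fdr, candidates):
    """Pick the strictness whose uncovered-PSM fraction is closest to the FDR
    (a hypothetical reading of 'match uncovered PSMs to FDR')."""
    def gap(k):
        _, frac = greedy_parsimony(protein_peptides, peptide_psms, k)
        return abs(frac - psm_fdr)
    return min(candidates, key=gap)


# Hypothetical toy data: protein -> peptides, peptide -> PSM count.
proteins = {"A": {"p1", "p2", "p3"}, "B": {"p3", "p4"}, "C": {"p5"}}
psms = {"p1": 10, "p2": 8, "p3": 6, "p4": 4, "p5": 1}

print(greedy_parsimony(proteins, psms, 1))
print(greedy_parsimony(proteins, psms, 2))
print(choose_k(proteins, psms, 0.10, range(1, 5)))
```

With the stricter criterion, proteins supported only by a single leftover peptide are dropped and their PSMs stay unexplained, which is the intended behavior when some PSMs are expected to be false.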

Plasma membrane enrichment
Pellicle enrichment of plasma membrane
Choksawangkarn et al. JPR 2013 (Fenselau Lab)
Six replicate LC-MS/MS analyses each
  Cell-lysate (44,861 MS/MS)
  Fe3O4-Al2O3 pellicle (21,871 MS/MS)
unique proteins to match 10% FDR:
  Lysate: 18,976 PSMs; Pellicle: 13,723 PSMs
89 proteins with significantly (< ) increased counts

Plasma membrane enrichment
Na/K+ ATPase subunit alpha-1 (P05023):
  Lysate: 1; Pellicle: 90; p-value: 5.2 x
Transferrin receptor protein 1 (P02786):
  Lysate: 17; Pellicle: 63; p-value: 2.0 x
DAVID Bioinformatics analysis (89/625):
  Plasma membrane (GO: ): 29 (5.2 x )
  Transmembrane (SwissProtKW): 24 (1.3 x )
Transmembrane (SwissProtKW):
  Lysate: 524; Pellicle: 1335; p-value: 2.6 x
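A differential spectral count for a single protein can be scored with the same hypergeometric model. The sketch below assumes that the background is the total PSM count of each sample (18,976 lysate and 13,723 pellicle, as listed on the previous slide) and that the test is one-sided for increased pellicle counts; the talk may condition or correct differently, so the result is not expected to reproduce the slide's p-values.

```python
from fractions import Fraction
from math import comb


def spectral_count_pvalue(count_a, count_b, total_a, total_b):
    """One-sided p-value that a protein has at least `count_b` of its
    PSMs in sample B, if its PSMs were spread over both samples in
    proportion to the sample totals (hypergeometric model)."""
    n = count_a + count_b          # PSMs for this protein
    population = total_a + total_b  # all PSMs in both samples
    ways = sum(
        comb(total_b, x) * comb(total_a, n - x)
        for x in range(count_b, min(n, total_b) + 1)
    )
    return float(Fraction(ways, comb(population, n)))


# Counts from the slide: Na/K+ ATPase subunit alpha-1, lysate 1, pellicle 90.
# Using the per-sample PSM totals as background is an assumption.
print(spectral_count_pvalue(1, 90, 18976, 13723))
```

The same function applies to an annotation term by pooling the spectral counts of all proteins carrying that term, as in the Transmembrane example (524 versus 1335).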

A protein's PSMs rise and fall together!

A protein's PSMs rise and fall together?

Anomalies indicate proteoforms
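One simple way to look for peptides that do not rise and fall with the rest of their protein is to test each peptide's counts against the protein's own totals. This is a hedged sketch, not the method from the talk: the per-peptide two-sided hypergeometric score and the example counts are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb


def tail_probs(k, marked, draws, population):
    """Return (P(X <= k), P(X >= k)) for a hypergeometric X."""
    denom = comb(population, draws)
    low = max(0, draws - (population - marked))
    high = min(draws, marked)

    def ways(x):
        return comb(marked, x) * comb(population - marked, draws - x)

    lower = sum(ways(x) for x in range(low, k + 1))
    upper = sum(ways(x) for x in range(k, high + 1))
    return Fraction(lower, denom), Fraction(upper, denom)


def peptide_anomalies(peptide_counts):
    """peptide_counts maps peptide -> (count_a, count_b) for ONE protein.
    Each peptide is scored against the protein's total counts; a small
    score means the peptide's split between samples is unusual."""
    total_a = sum(a for a, _ in peptide_counts.values())
    total_b = sum(b for _, b in peptide_counts.values())
    population = total_a + total_b
    scores = {}
    for peptide, (a, b) in peptide_counts.items():
        lower, upper = tail_probs(a, total_a, a + b, population)
        scores[peptide] = float(min(1, 2 * min(lower, upper)))
    return scores


# Hypothetical counts: pepC is seen in only one of the two samples.
counts = {"pepA": (20, 20), "pepB": (18, 22), "pepC": (0, 30)}
for peptide, score in peptide_anomalies(counts).items():
    print(peptide, score)
```

A strongly anomalous peptide also shifts the protein totals it is compared against, so scores for the remaining peptides should be read with that in mind.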

Nascent polypeptide-associated complex subunit alpha x 10^-8

Pyruvate kinase isozymes M1/M2 2.5 x 10^-5

Summary
Functional annotation enrichment for proteomics too:
  Careful counting (generalized parsimony)
  Differential abundance by spectral counts
Use (multivariate-)hypergeometric model for
  Differential abundance by spectral counts
  Proteoform detection

HER2/Neu Mouse Model of Breast Cancer
Paulovich, et al. JPR, 2007
Study of normal and tumor mammary tissue by LC-MS/MS
1.4 million MS/MS spectra
Peptide-spectrum assignments
  Normal samples (N_n): 161,286 (49.7%)
  Tumor samples (N_t): 163,068 (50.3%)
4270 proteins identified in total
2-unique generalized protein parsimony

Distribution of p-values (Yeast)