Gene Set Enrichment and Splicing Detection using Spectral Counting Nathan Edwards Department of Biochemistry and Mol. & Cell. Biology Georgetown University.


Gene Set Enrichment and Splicing Detection using Spectral Counting Nathan Edwards Department of Biochemistry and Mol. & Cell. Biology Georgetown University Medical Center

Outline: Systems Biology; Gene Sets & Functional Enrichment; Balls in Urns; Proteomics; MS/MS and Peptide ID; Quantitation and Spectrum Counting; Differential Protein Abundance; Detecting Splicing and Isoforms.

Systems Biology: Mathematical Models; Knowledge Databases; High-Throughput Experiments.

Systems Biology: Mathematical Models; Knowledge Databases; High-Throughput Experiments (Sequencing, Microarrays, Proteomics, Metabolomics): molecular biology ↕ phenotype.

Systems Biology: Mathematical Models; Knowledge Databases (UniProt, OMIM, KEGG): molecular biology ↕ biology; High-Throughput Experiments.

Systems Biology: Mathematical Models (Software, Statistics, Algorithms): phenotype ↕ biology; Knowledge Databases; High-Throughput Experiments.

Systems Biology: Mathematical Models (Software, Statistics, Algorithms): phenotype ↕ biology; Knowledge Databases (UniProt, OMIM, KEGG): molecular biology ↕ biology; High-Throughput Experiments (Sequencing, Microarrays, Proteomics, Metabolomics): molecular biology ↕ phenotype.

Gene Expression Analysis: Differential expression via structured experiments, transcript measurements, and statistics. But now what?

Gene Expression Analysis: Hengel et al., J Immunol. Structured experiment: CD4+/L-selectin- T-cells vs CD4+/L-selectin+ T-cells. Affymetrix Human Genome U95A Array. Processing & statistics: MAS 4.0, t-tests, FDR filtering, … 164 probe identifiers for upregulated genes.

Gene Expression Analysis _AT 38816_AT 679_AT 37105_AT 34623_AT 36378_AT 35648_AT 33979_AT 34529_AT 1372_AT 38646_S_AT 35896_AT 34249_AT 40317_AT 32413_AT 33530_AT 32469_AT 34720_AT 36317_AT 31987_AT 33027_AT 35439_AT 36421_AT 966_AT 967_G_AT 31525_S_AT 38236_AT 34618_AT 34546_AT 31512_AT 40959_AT 38604_AT 33922_AT 40790_AT 35595_AT 33963_AT 33685_AT 35566_F_AT 33684_AT 36436_AT 37166_AT 34453_AT 1645_AT 39469_S_AT 38229_AT 38945_AT 37711_AT 39908_AT 1355_G_AT 38948_AT 1786_AT 39198_S_AT 606_AT 35091_AT 35090_G_AT 37954_AT 822_S_AT 36766_AT 37953_S_AT 38128_AT 40350_AT 37097_AT 33516_AT 38691_S_AT 34702_F_AT 31715_AT 1331_S_AT 34577_AT 33027_AT 38508_S_AT 32680_AT 39187_AT 31506_S_AT 31793_AT 40294_AT 40553_AT 1983_AT 32250_AT 37968_AT 33293_AT 40271_AT 32418_AT 33077_AT 38201_AT 2090_I_AT 34012_AT 34703_F_AT 38482_AT 40058_S_AT 34902_AT 34636_AT 41113_AT 35996_AT 40735_AT 34539_AT 41280_R_AT 37061_AT 34233_I_AT 41703_R_AT 37898_R_AT 35373_AT 37408_AT 35213_AT 31576_AT 39094_AT 32010_AT 919_AT 1855_AT 1391_S_AT 34436_AT 33371_S

Gene Expression Analysis:
_g_at    neural cell adhesion molecule
_s_at    tumor necrosis factor receptor superfamily, member
_g_at    neurotrophic tyrosine kinase, receptor, type
_at      tumor necrosis factor, alpha-induced protein
_s_at    cytochrome P450, family 4, subfamily A, polypeptide
_s_at    chemokine (C-C motif) ligand
_g_at    nitric oxide synthase 2, inducible
1575_at  ATP-binding cassette, sub-family B (MDR/TAP), member
_at      KiSS-1 metastasis-suppressor
1786_at  c-mer proto-oncogene tyrosine kinase
1855_at  fibroblast growth factor 3 (murine mammary tumor virus integration site (v-int-2) oncogene homolog)
1890_at  growth differentiation factor 15
……

Gene Set Enrichment: Candidate genes are “special” with respect to the experiment structure (phenotype). Are they special with respect to general biological knowledge? Are the candidate genes related? Can we filter out the noise? Can we expose associated genes? What genes' changes are linked to the experimental structure / phenotype?

Gene Sets: Genes may be related in many ways: same pathway, similar function, cellular location, cytoband, identified in a previous study, etc. Define gene sets for relatedness: GO Biological Process; GO Molecular Function; GO Cellular Component; KEGG Pathway, Biocarta Pathway. Biological knowledge databases.

Gene Set Enrichment

Drawing Balls from Urns: An urn of balls, 900 red and 100 blue.

Drawing Balls from Urns: Balls drawn at random: how many red? How many blue?

Drawing Balls from Urns: How surprising is 5, 10, 15, 20, … blue?

Drawing Balls from Urns: How surprising is 30, 50, 70, … blue?

Drawing Balls from Urns: 6 of 155 upregulated genes have the "oxygen binding" GO annotation! The urn is all human genes ( = 25); blue is oxygen binding.

How surprised should we be? This is a classic problem in probability theory: how well do the observed counts match the expected counts? Various mostly equivalent statistical tests are applied: Fisher exact test, hypergeometric, chi-squared (χ²). The p-value measures "surprise".
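The urn question can be made concrete with a hypergeometric tail probability. The sketch below uses the urn from the slides (900 red, 100 blue); the draw size of 100 and the function name `hypergeom_tail` are assumptions chosen only for illustration, since the transcript does not preserve the number of balls drawn.

```python
from math import comb


def hypergeom_tail(total, blue, drawn, observed):
    """P(at least `observed` blue) when `drawn` balls are taken without replacement."""
    upper = min(blue, drawn)
    # Count the draws that contain k blue balls, for every k >= observed.
    favorable = sum(
        comb(blue, k) * comb(total - blue, drawn - k)
        for k in range(observed, upper + 1)
    )
    return favorable / comb(total, drawn)


# Urn from the slides: 900 red + 100 blue = 1000 balls.
# ASSUMPTION: 100 balls are drawn (the draw size is not in the transcript).
for k in (5, 10, 15, 20, 30):
    print(f"P(at least {k} blue) = {hypergeom_tail(1000, 100, 100, k):.3g}")
```

With about 10 blue balls expected in such a draw, small counts are unsurprising, and the tail probability shrinks rapidly as the observed blue count grows; that shrinking number is the "surprise" a p-value reports.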

Proteomics: Proteins are the machines that drive much of biology; genes are merely the recipe. Proteomics is the direct characterization of proteins en masse. What proteins are present? How much of each protein is present? Which proteins change in abundance?

Sample Preparation for Tandem Mass Spectrometry: Enzymatic Digest and Fractionation

Single Stage MS

Tandem Mass Spectrometry (MS/MS)

Peptide Fragmentation

LC-MS/MS: A powerful combination of liquid chromatography (LC) and tandem mass spectrometry (MS/MS). Automatically collect 100k MS/MS spectra in an afternoon. Tens of thousands of peptide/spectrum assignments; thousands of proteins identified.

Spectral Counting: Abundant proteins are more likely to be identified: selection (by the instrument) for fragmentation is based on intensity, and more abundant ions are more likely to fragment in an informative manner. A protein's peptide identification count (spectra) can be used as a crude abundance measurement. Easy, cheap, (relative) protein quantitation.
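As a minimal sketch of the counting step, the code below tallies peptide-spectrum assignments per protein. The tuple layout, the spectrum identifiers, and the placeholder peptide strings are all hypothetical; the only idea taken from the slide is that a protein's spectral count is the number of spectra assigned to its peptides.

```python
from collections import Counter


def spectral_counts(psms):
    """Return the number of assigned spectra per protein.

    `psms` is an iterable of (spectrum_id, peptide, protein) tuples,
    assumed to contain only assignments that were already accepted.
    """
    return Counter(protein for _spectrum_id, _peptide, protein in psms)


# Hypothetical assignments; identifiers and peptide strings are placeholders.
psms = [
    ("scan-0001", "PEPTIDE_A", "Lcp1"),
    ("scan-0002", "PEPTIDE_A", "Lcp1"),
    ("scan-0003", "PEPTIDE_B", "Lcp1"),
    ("scan-0004", "PEPTIDE_C", "Spp1"),
]

counts = spectral_counts(psms)
print(counts.most_common())
```

Repeated identifications of the same peptide are counted separately on purpose: it is the number of spectra, not the number of distinct peptides, that serves as the crude abundance measure.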

Differential Spectral Counts: Spectral counts are too crude for classical (microarray) statistics (fold change, t-tests, …). However, we expect "similar" spectral counts when the protein abundance is unchanged. Recast as drawing balls from urns.

HER2/Neu Mouse Model of Breast Cancer: Paulovich, et al. JPR, 2007. Study of normal and tumor mammary tissue by LC-MS/MS. 1.4 million MS/MS spectra. Peptide-spectrum assignments: normal samples (N_n): 161,286 (49.7%); tumor samples (N_t): 163,068 (50.3%). 4270 proteins identified in total.

Drawing Balls from Urns: All Normal Spectra vs All Tumor Spectra. Plastin-2 (Lcp1): E-123; Osteopontin (Spp1): E-62; Hypoxia up-regulated protein 1 (Hyou1): E-40.
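One way to read the urn recasting: all assigned spectra are balls, colored by whether they come from normal or tumor samples, and the spectra assigned to one protein are the draw. The sketch below computes a one-sided hypergeometric p-value under that reading. The totals come from the slides; the per-protein counts in the example call are hypothetical, and the function name and the choice of a one-sided test are assumptions rather than the method used in the study.

```python
from math import comb

N_NORMAL = 161_286  # peptide-spectrum assignments in normal samples (from the slides)
N_TUMOR = 163_068   # peptide-spectrum assignments in tumor samples (from the slides)


def tumor_enrichment_pvalue(normal_count, tumor_count,
                            n_normal=N_NORMAL, n_tumor=N_TUMOR):
    """P(at least `tumor_count` tumor spectra) among a protein's spectra,
    if its spectra were drawn at random from the pool of all spectra."""
    drawn = normal_count + tumor_count
    total = n_normal + n_tumor
    favorable = sum(
        comb(n_tumor, k) * comb(n_normal, drawn - k)
        for k in range(tumor_count, drawn + 1)
    )
    return favorable / comb(total, drawn)


# HYPOTHETICAL per-protein counts, for illustration only.
print(tumor_enrichment_pvalue(normal_count=2, tumor_count=40))
print(tumor_enrichment_pvalue(normal_count=20, tumor_count=22))
```

Because the two pools are almost the same size, a protein with unchanged abundance should split its spectra roughly evenly, so a lopsided split produces a very small p-value while a near-even split does not.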

Functional Enrichment: 374 proteins with "significantly" increased abundance in tumor tissue. Use the 4270 identified proteins as background! DAVID gene set enrichment: protein translation; RNA binding, splicing.

Differential Spectral Counting: Assumptions of the formal tests (Fisher exact, χ²) are violated, so p-values can be misleading (too small). Use label permutation tests to compute empirical p-values. SLOW! Collapse spectral counts to protein sets (GO terms) directly: potential to observe more subtle spectral count differences.
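A label permutation test can be sketched as follows: shuffle the normal/tumor labels across samples many times and ask how often the shuffled data look at least as extreme as the real data. The per-sample counts, the test statistic (difference in spectral-count rate), and all function names here are assumptions for illustration; the slides do not specify them.

```python
import random


def rate_difference(labels, counts, totals):
    """Tumor rate minus normal rate, where rate = protein spectra / all spectra."""
    tumor_counts = sum(c for lab, c in zip(labels, counts) if lab == "tumor")
    tumor_total = sum(t for lab, t in zip(labels, totals) if lab == "tumor")
    normal_counts = sum(counts) - tumor_counts
    normal_total = sum(totals) - tumor_total
    return tumor_counts / tumor_total - normal_counts / normal_total


def permutation_pvalue(labels, counts, totals, n_perm=2000, seed=0):
    """Empirical one-sided p-value from random relabelings of the samples."""
    rng = random.Random(seed)
    observed = rate_difference(labels, counts, totals)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rate_difference(shuffled, counts, totals) >= observed:
            hits += 1
    # Add one to both terms so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# HYPOTHETICAL data: one protein's spectral count in each of six samples.
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
counts = [1, 2, 1, 9, 8, 10]
totals = [1000] * 6
print(permutation_pvalue(labels, counts, totals))
```

The cost is visible in the loop: every protein (or protein set) needs thousands of relabelings, which is why the slide calls the approach slow, and with few samples the smallest attainable p-value is limited by the number of distinct relabelings.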

Unannotated Splice Isoform

Halobacterium sp. NRC-1 ORF: GdhA1. K-score E-value vs 10% FDR. Many peptides are inconsistent with the annotated translation start site of NP_279651.

What if there is no "smoking gun" peptide…

PKM2 in Peptide Atlas (figure: experiments vs peptides)

What if there is no "smoking gun" peptide…?

Nascent polypeptide-associated complex subunit alpha: The long form is "muscle-specific"; exon 3 is missing from the short form. Peptide identifications provide evidence for the long form only: 9 peptides are specific to the long form and 6 peptides are found in both isoforms. Urn with balls of 15 different colors. p-value of observed spectral counts: 7.3E-8.

Nascent polypeptide-associated complex subunit alpha

Pyruvate kinase isozymes M1/M2: Exon "substitution" changes the sequence in the middle of the protein. Peptide identifications provide evidence for both isoforms: 3 peptides are specific to isoform 1 and 5 peptides are specific to isoform 2. Urn with balls of 63 colors for isoform 1. p-value of observed spectral counts: 2.46E-05.

Pyruvate kinase isozymes M1/M2

Summary: Systems biology requires experiments, databases, and models, as well as informaticians and disease experts. Functional enrichment: quickly navigate knowledge databases using experiment-derived genes. Classical probability experiment: Balls & Urns. How surprised should you be? A domain expert is still required to pick out the gems.

Summary: Proteomics: high-throughput protein comparison; the proteome "sample" is identified; crude spectral count quantitation. Differential protein abundance: use Balls & Urns to find significant changes, then apply functional enrichment tools. Splicing detection: perturbed peptide spectral counts provide evidence for splicing; evaluate using Balls & Urns.