Proteogenomic Novelty in 105 TCGA Breast Tumors


Proteogenomic Novelty in 105 TCGA Breast Tumors
Karl Clauser, CPTAC Breast Cancer Analysis Group
Broad Institute of MIT and Harvard; Fred Hutchinson Cancer Research Center; Washington University; New York University
CPTAC Data Jamboree, April 16, 2014, National Institutes of Health, Bethesda, Maryland

Tumor-specific protein databases for MS/MS spectra searches
Kelly Ruggles, David Fenyo, NYU

QUILTS: Treatment of different variant types
In variants db
  Variants: 1-frame translation
In alternates db
  Unannotated alternative splicing: 1-frame translation
In frameshifts db
  Frameshifts: 1-frame translation
In other db
  Partially novel splicing: novel downstream, 1-frame translation; novel upstream, 6-frame translation
  Completely novel expression: 6-frame translation
  Fusion genes: 6-frame translation
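The difference between 1-frame and 6-frame translation can be illustrated with a minimal sketch. This is not QUILTS code; the function names are hypothetical, and the only assumption is the standard genetic code. One-frame translation applies when the reading frame is known from an annotated transcript; six-frame translation covers both strands and all three offsets when the frame is unknown.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq, frame=0):
    """1-frame translation of a DNA string starting at offset `frame` (0, 1 or 2)."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Unknown codons (for example, containing N) become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame(seq):
    """Translate all three offsets on the forward strand, then on the reverse complement."""
    seq = seq.upper()
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    return [translate(s, f) for s in (seq, reverse_complement) for f in range(3)]


if __name__ == "__main__":
    print(translate("ATGGCC"))
    print(six_frame("ATGGCC"))
```

Stop codons are emitted as `*`; a database builder would typically split the translation at those positions before writing FASTA entries.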

Proteogenomic mapping: Genetic alterations can be observed on protein level (105 tumors) | work in progress
Low confidence thresholds applied to genome calls
  Variants: QUAL > 2 (phred-scaled quality score for the assertion made in ALT)
  Alternative splices: > 1 RNA-seq read
High confidence thresholds applied to proteome calls
  < 0.1% FDR
The document at http://www.1000genomes.org/node/101 defines the quality value as: "QUAL phred-scaled quality score for the assertion made in ALT. i.e. give -10log_10 prob(call in ALT is wrong). If ALT is "." (no variant) then this is -10log_10 p(variant), and if ALT is not "." this is -10log_10 p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. (Numeric)"
0.2-2.7% of frameshifts, alternative splices & single AA variants are observable by proteomics
  mRNA may not be translated, or may be at low abundance
  Proteome coverage is incomplete
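To make the QUAL threshold concrete, the phred definition quoted on this slide, QUAL = -10 log_10 p(call is wrong), can be inverted to get the error probability a given cutoff implies. The helper names below are hypothetical; the formula itself comes from the quoted definition.

```python
import math


def phred_to_error_prob(qual):
    """Invert QUAL = -10 * log10(p): probability that the call is wrong."""
    return 10 ** (-qual / 10)


def error_prob_to_phred(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for qual in (2, 10, 20, 30):
        print(f"QUAL {qual:>2}: p(wrong) = {phred_to_error_prob(qual):.3f}")
```

A QUAL of 2 corresponds to an error probability of about 0.63, which is why the slide calls this a low confidence threshold.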

Global proteome and phosphoproteome discovery workflow for TCGA breast tumors
1 mg total protein per tumor
Internal reference: equal representation of basal, Her2 and Luminal A/B subtypes

Serial Search Strategy with Personalized Databases
25,776,160 Spectra (105 patients) (36 iTRAQ experiments) (25 LC-MS/MS runs / experiment)
Concatenated FASTA files, 105 patients; altered proteins; redundant entries removed

Database components
  RefSeq-Hs-7/2013: 31,852
  Variants: 133,241
  Alternate Spliceforms: 67,853
  Frameshifts: 19,944
  Concatenated: 252,890

Example entries
  >Canonical Protein
  SIGNALINGPATHWAYREGULATOR
  >Canonical – Variant Patient 1
  SIGNALINGPATHWAHREGULATOR
  >Canonical – Variant Patient 2
  SIKNALINGPATHWAYREGULATOR
  >Canonical – Alternate splice Patient 1
  SIGNALINGREGULATOR
  >Canonical – Alternate splice Patient 2
  SIGNALINGPATHREGULATOR
  >Canonical – Truncation Patient 1
  SIGNALINGPATFRAMESHIF
  >Canonical – Novel Exon Insert Patient 2
  SIGNALINGPATHWAYINSERTREGULATOR
  >Canonical – Partial Exon Deletion Patient 3
  SIGNALINGPATHWAYULATOR

Search results
  11,328,955 Matched Spectra (44% of total) (1% FDR)
  14,447,205 Leftover Spectra
  3247 Variants Matched
  197 Splice Junctions Matched
  22 Truncation Overlaps Matched
  11 Insertion Overlaps Matched
  49 Deletion Junctions Matched

Low confidence thresholds for Genome calls
  Variants: >2 QUAL score (phred-scaled)
  Alternative splices, frameshifts: >1 read
High confidence for Proteome IDs
  <0.1% FDR peptide spectrum match
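The database construction and the counts on this slide can be sketched as follows. This is an illustrative sketch only, not the pipeline that was used: `merge_fasta_entries` and `to_fasta` are hypothetical helpers, and the rule "two entries are redundant when their sequences are identical" is an assumption. The numbers in the second half are copied from the slide and checked for internal consistency.

```python
def merge_fasta_entries(per_patient):
    """Concatenate per-patient entries, keeping one record per distinct sequence.

    per_patient maps a patient id to a list of (header, sequence) pairs.
    Assumption: entries are redundant when their sequences are identical.
    """
    merged = {}
    for patient, entries in per_patient.items():
        for header, sequence in entries:
            merged.setdefault(sequence, []).append(f"{header} [{patient}]")
    return merged


def to_fasta(merged):
    """Write one FASTA record per distinct sequence, joining the source headers."""
    return "\n".join(
        f">{' | '.join(headers)}\n{sequence}" for sequence, headers in merged.items()
    )


# Counts copied from the slide.
COMPONENTS = {
    "RefSeq-Hs-7/2013": 31_852,
    "Variants": 133_241,
    "Alternate spliceforms": 67_853,
    "Frameshifts": 19_944,
}
TOTAL_SPECTRA = 25_776_160
MATCHED_SPECTRA = 11_328_955
LEFTOVER_SPECTRA = 14_447_205

if __name__ == "__main__":
    example = {
        "Patient 1": [("Canonical - Alternate splice", "SIGNALINGREGULATOR")],
        "Patient 2": [
            ("Canonical - Alternate splice", "SIGNALINGREGULATOR"),
            ("Canonical - Alternate splice", "SIGNALINGPATHREGULATOR"),
        ],
    }
    print(to_fasta(merge_fasta_entries(example)))
    print("Concatenated:", sum(COMPONENTS.values()))
    print("Matched fraction:", round(100 * MATCHED_SPECTRA / TOTAL_SPECTRA), "%")
```

The four components sum to the concatenated total of 252,890, and matched plus leftover spectra add up to the 25,776,160 spectra searched.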

Frequency of Single AA Variants, Alternative Splices, Frameshifts Across Patients (Genome & Transcriptome Data)
Somatic variants are less frequent than germline variants
Some germline variants are very common
Rare germline variants present in RefSeq
Some alternative splice forms and frameshifts are very common; these should be in RefSeq

How many RNA-seq reads to yield a proteomics observation of an alternate splice or frameshift?
1 experiment: 3 individual patients + 1 Common control (40 patients)
[Figure panels: 197 Alternative splices; 82 Frameshifts. Axis: Max # Reads. Annotations: 17 observed in >1 Expmt; 19 observed in >1 Expmt]

Frameshift Truncation: ras-Related protein Rab-15
Observed only in Proteomics Exp 3
Max RNA-Seq Reads: 1
Present in only 1 Common control member
E159

Frameshift Truncation: Cysteine-rich protein 1
Observed in 9 Proteomics Experiments
Max RNA-Seq Reads: 1
Present in only 1 Common control member
E159

Frameshift Truncation: Cullin-2 isoform a
Observed in 3 Proteomics Experiments
Max RNA-Seq Reads: 1
Present in only 1 Common control member
E159

Many missing observations even when transcript present in many common control members
1 experiment: 3 individual patients + 1 Common control (40 patients)
[Figure panels: Alternative splices; Frameshifts]

Majority of Alternative Splice Junctions and Frameshifts observed in >1 Proteomics Experiment
1 experiment: 3 individual patients + 1 Common control (40 patients)
Alternative splices: 150/197 observed in >1 experiment
Frameshifts: 44/82 observed in >1 experiment

Next steps:
Examine "other" category
  Fusion genes (junction-spanning)
  Novel exon splicing (2 sides)
  Completely novel gene
Use updated somatic variants from QUILTS
Define genomic data thresholds suitable for proteomic observations
  RNA-seq: Min read count
  Variant calling: phred-scaled QUAL score
Sort out Germline/Somatic variant call mix status across patients

Summary of Proteome Re-processing
105 TCGA patients, 36 iTRAQ experiments

Changes in Re-processing of TCGA data
Extraction
  Centroiding: use Xcalibur instead of SM. iTRAQ ratios are little changed; intensities are lower by ~5x (will more closely match the NIST central analysis pipeline).
  Precursor MH+ range expanded from 750-4000 to 750-6000.
Searches
  Replace the database with the RefSeq version used as reference for the personalized database generation. Database content/size is very similar; protein identifiers change from gi numbers to RefSeq numbers.
  Allowed modifications will be expanded, which increases the # of identified spectra by ~10%.
    From: Full iTRAQ, M-ox, N-deam, q-pyro
    To: iTRAQ-Full-Lys-only, M-ox, N-deam, q-pyro, c-pyro, Ac-nTermProt
Autovalidation
  Proteome initial processing: peptide FDR per experiment 1.1-1.4%, but overall peptide FDR across all 36 experiments ~5.5%.
  Phosphoproteome initial processing: peptide FDR per experiment 1.6-2.1%, but overall peptide FDR across all 36 experiments ~7.2%.
  Changes will seek to bring the overall peptide FDRs down to ~1%:
    require multiple observations (protein, P-site) across experiments
    raise score thresholds
Quantitation
  Will use PIP (precursor ion purity) filtering to exclude spectra from quantitation but not from identification.
  A PIP > 50% requirement excludes ~7.8% of spectra.
  Filtering reduces standard deviations on protein & phosphosite level iTRAQ ratios.
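One way to see why the overall peptide FDR can exceed the per-experiment FDR is a toy model. It rests on an assumption the slide does not state: correct peptides recur across experiments, while false identifications are mostly distinct from one experiment to the next. All numbers in the example are hypothetical, chosen only to illustrate the effect, and `union_fdr` is not part of any real pipeline.

```python
def union_fdr(n_experiments, ids_per_experiment, per_experiment_fdr, distinct_true_ids):
    """Peptide-level FDR of the union of identifications across experiments.

    Toy-model assumptions (not from the slide):
      - every false identification is unique to its experiment, so false ids add up
      - correct identifications overlap, leaving `distinct_true_ids` after merging
    """
    false_ids = n_experiments * ids_per_experiment * per_experiment_fdr
    return false_ids / (false_ids + distinct_true_ids)


if __name__ == "__main__":
    # Hypothetical inputs, not measured values.
    overall = union_fdr(
        n_experiments=36,
        ids_per_experiment=10_000,
        per_experiment_fdr=0.0125,
        distinct_true_ids=80_000,
    )
    print(f"Overall peptide FDR in the toy model: {overall:.1%}")
```

Under this model, requiring a peptide to be observed in more than one experiment removes most false identifications, since those rarely repeat, which matches the first of the two proposed changes.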

Y Chromosome Frameshift - CD99 antigen
Partial exon deletion splice, plus frameshift truncation
Observed in 36 Proteomics Experiments
Max RNA-Seq Reads: 12
Transcript present in 18/40 Common Control Members
E159

Acknowledgments
Broad Institute/FHCRC: Steve Carr, Karl Clauser, Michael Gillette, Jana Qiao, Philipp Mertins, DR Mani, Eric Kuhn, Sue Abbatiello, Amanda Paulovich, Pei Wang, Sean Wang, Ping Yan
Washington U./MD Anderson/NYU: Sherri Davies, Matthew Ellis, David Fenyo, Kelly Ruggles, Reid Townsend, Li Ding
NCI Staff: Emily Boja, Mehdi Mesri, Rob Rivers, Chris Kinsinger, Henry Rodriguez
Funding: National Cancer Institute

Single AA Variants may be Somatic in Some Patients, Germline in Others (Nov 2013)
Genomic
  Highly interesting, should correlate with prognosis and/or subtype.
  May correlate with prognosis?
  Might as well be canonical isoforms?
  Detectable, but too rare to indicate biology.
Proteomic
  G&S mix genomic variants have the highest observation rate by proteomics.
  Genomic variants present in only a single patient are observable by proteomics.

Not all Germline & Somatic mix Single AA Variants are "Essentially" Germline
81 Patients, Nov 2013
[Figure panels: Genomic; Proteomic]
Is G&S mix status primarily an artifact of variant calling accuracy/sensitivity?
Is there some cancer biology involved for high S/G ratio variants?
Are patients with the germline form more cancer prone?
Does the somatic form correlate with prognosis or development of drug resistance?

Wide Range of Somatic Single AA Variants/Patient
Low confidence thresholds applied to calls
  Variants: >2 QUAL score (phred-scaled)
  Alternative splices: >1 read