Kelly Ruggles, Ph.D. Proteomics Informatics March 31, 2015

Slides:



Advertisements
Similar presentations
RNA-Seq as a Discovery Tool
Advertisements

Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis Yan Guo.
Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Genome-wide Association Study Focus on association between SNPs and traits Tendency – Larger and larger sample size – Use of more narrowly defined phenotypes(blood.
Data integration across omics landscapes Bing Zhang, Ph.D. Department of Biomedical Informatics Vanderbilt University School of Medicine
Peter Tsai, Bioinformatics Institute.  University of California, Santa Cruz (UCSC)  A rapid and reliable display of any requested portion of genomes.
Kelly Ruggles, Ph.D. Proteomics Informatics Week 9
1 Computational Molecular Biology MPI for Molecular Genetics DNA sequence analysis Gene prediction Gene prediction methods Gene indices Mapping cDNA on.
Next-generation sequencing and PBRC. Next Generation Sequencer Applications DeNovo Sequencing Resequencing, Comparative Genomics Global SNP Analysis Gene.
CSE182-L12 Gene Finding.
Bacterial Physiology (Micr430)
Annotating genomes using proteomics data Andy Jones Department of Preclinical Veterinary Science.
Proteomics Informatics (BMSC-GA 4437) Course Director David Fenyö Contact information
A combination of the words Proteomics and Genomics. Proteogenomics commonly refer to studies that use proteomic information, often derived from mass spectrometry,
Proteomics Informatics (BMSC-GA 4437) Course Director David Fenyö Contact information
Presented by Karen Xu. Introduction Cancer is commonly referred to as the “disease of the genes” Cancer may be favored by genetic predisposition, but.
Paola CASTAGNOLI Maria FOTI Microarrays. Applicazioni nella genomica funzionale e nel genotyping DIPARTIMENTO DI BIOTECNOLOGIE E BIOSCIENZE.
NGS Analysis Using Galaxy
Whole Exome Sequencing for Variant Discovery and Prioritisation
MES Genome Informatics I - Lecture VIII. Interpreting variants Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute,
UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.
Experimental validation. Integration of transcriptome and genome sequencing uncovers functional variation in human populations Tuuli Lappalainen et al.
20.1 Structural Genomics Determines the DNA Sequences of Entire Genomes The ultimate goal of genomic research: determining the ordered nucleotide sequences.
Karl Clauser Proteomics and Biomarker Discovery Breast Cancer Proteomics and the use of TCGA Mutational Data - Broad Institute update/issues Karl Clauser.
Next Generation DNA Sequencing
Welcome to DNA Subway Classroom-friendly Bioinformatics.
Laxman Yetukuri T : Modeling of Proteomics Data
CS 461b/661b: Bioinformatics Tools and Applications Software Algorithm Mathematical Models Biology Experiments and Data.
Genome Annotation Rosana O. Babu.
Proteomic Characterization of Alternative Splicing and Coding Polymorphism Nathan Edwards Center for Bioinformatics and Computational Biology University.
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Genes and How They Work Chapter The Nature of Genes information flows in one direction: DNA (gene)RNAprotein TranscriptionTranslation.
Gene, Proteins, and Genetic Code. Protein Synthesis in a Cell.
Epidemiology 217 Molecular and Genetic Epidemiology Bioinformatics & Proteomics John Witte.
SMARTAR: small RNA transcriptome analyzer Geuvadis RNA analysis meeting April 16 th 2012 Esther Lizano and Marc Friedländer Xavier Estivill lab Programme.
Bioinformatics and Computational Biology
Proteogenomic Novelty in 105 TCGA Breast Tumors
Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,
1 From Mendel to Genomics Historically –Identify or create mutations, follow inheritance –Determine linkage, create maps Now: Genomics –Not just a gene,
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
Proteomics Informatics (BMSC-GA 4437) Instructor David Fenyö Contact information
Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,
Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.
Finding genes in the genome
Accessing and visualizing genomics data
Annotation of eukaryotic genomes
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Welcome to the combined BLAST and Genome Browser Tutorial.
INTERPRETING GENETIC MUTATIONAL DATA FOR CLINICAL ONCOLOGY Ben Ho Park, M.D., Ph.D. Associate Professor of Oncology Johns Hopkins University May 2014.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
Using public resources to understand associations Dr Luke Jostins Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015.
Proteomics Informatics (BMSC-GA 4437) Course Directors David Fenyö Kelly Ruggles Beatrix Ueberheide Contact information
Different microarray applications Rita Holdhus Introduction to microarrays September 2010 microarray.no Aim of lecture: To get some basic knowledge about.
Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.
Considerations for multi-omics data integration Michael Tress CNIO,
1 Finding disease genes: A challenge for Medicine, Mathematics and Computer Science Andrew Collins, Professor of Genetic Epidemiology and Bioinformatics.
Canadian Bioinformatics Workshops
From Reads to Results Exome-seq analysis at CCBR
Virginia G. Piper Center for Personalized Diagnostics Chromosome 10 Joshua LaBaer, M.D., Ph.D. Konstantinos Petritis, Ph.D. Virginia G. Piper Center for.
The regulation of Caspase 8 chIP-seq motifs mRNA expression DNA methylation.
Gene Hunting: Design and statistics
Sequencing Data Analysis
Genome organization and Bioinformatics
Proteomics Informatics David Fenyő
Diverse abnormalities manifest in RNA
Fast, Quantitative and Variant Enabled Mapping of Peptides to Genomes
Genome Annotation and the Human Genome
Integrative omic approaches for the study of host–pathogen interactions Integrative omic approaches for the study of host–pathogen interactions (A) Proteomic.
Proteomics Informatics David Fenyő
Sequencing Data Analysis
Presentation transcript:

Kelly Ruggles, Ph.D. Proteomics Informatics March 31, 2015 Proteogenomics Kelly Ruggles, Ph.D. Proteomics Informatics March 31, 2015

Proteogenomics: Intersection of proteomics and genomics As the cost of high-throughput genome sequencing goes down whole genome, exome and RNA sequencing can be easily attained for most proteomics experiments In combination with mass spectrometry-based proteomics, sequencing can be used for: Genome annotation Studying the effect of genomic variation in proteome Biomarker identification

Proteogenomics: Intersection of proteomics and genomics First published on in 2004 “Proteogenomic mapping as a complementary method to perform genome annotation” (Jaffe JD, Berg HC and Church GM) using genomic sequencing to better annotate Mycoplasma pneumoniae Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011

Proteogenomic workflow High throughput shotgun MS/MS Requires no knowledge of peptides present, uses mass difference to determine next AA in peptide chain.

Requirements for Proteogenomic Analysis DNA or RNA sequencing data High resolution MS/MS Informatics tools for proteogenomic database construction and protein searching DNA and/or RNA Sequencing Sample Informatics Tools Informatics Tools MS/MS Compare, score, test significance Personalized Protein DB Identified peptides and proteins

Proteogenomics In the past, computational algorithms were commonly used to predict and annotate genes. Many limitations With mass spectrometry we can Confirm existing gene models Correct gene models Identify novel genes and splice isoforms

Proteogenomics Improving genome annotation Sequencing driven database construction Proteomic mapping to genomic coordinates

Proteogenomics Improving genome annotation Sequencing driven database construction Proteomic mapping to genomic coordinates

Genome Annotation Process of identifying and assigning function to genes Historically, identification of protein coding regions was completed using Comparative sequence similarity analysis ab initio gene prediction algorithms RNA transcript analysis Limitations associated with these methods in determining Gene start and stop sites Translation reading frames Short genes, overlapping genes Alternative splice boundaries Translated vs. transcribed genes Therefore, MS-based proteomics can be used to supplement sequence analysis for genome annotation

Protein Sequence Databases Identification of peptides from MS relies heavily on the quality of the protein sequence database (DB) DBs with missing peptide sequences will fail to identify the corresponding peptides DBs that are too large will have low sensitivity Ideal DB is complete and small, containing all proteins in the sample and no irrelevant sequences

Genome Sequence-based database for genome annotation Commonly used method is to search MS against 6 frame translation, removing bias based on established annotation m/z intensity MS/MS 6 frame translation of genome sequence Reference protein DB Compare, score, test significance Compare, score, test significance annotated + novel peptides annotated peptides

Creating 6-frame translation database ATGAAAAGCCTCAGCCTACAGAAACTCTTTTAATATGCATCAGTCAGAATTTAAAAAAAAAATC Positive Strand M K S L S L Q K L F * Y A S V R I * K K N * K A S A Y R N S F N M H Q S E F K K K I E K P Q P T E T L L I C I S Q N L K K K S Negative Strand H F A E A * L F E K L I C * D S N L F F I S F G * G V S V R K I H M L * F K F F F D F L R L R C F S K * Y A D T L I * F F F G Software: Peppy: creates the database + searches MS, Risk BA, et. al (2013) PIUS (Peptide Identification by Unbiased Searching): Costa et al, 2013 MS-Dictionary: Kim et al, 2009

Genome Annotation Example 1: A. gambiae Peptides mapping to annotated 3’ UTR Peptides mapping to novel exon within an existing gene Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011

Genome Annotation Example 1: A. gambiae Peptides mapping to unannotated gene related species Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011

Genome Annotation Example 2: Correcting Miss-annotations A. Establishes new transcriptional start location B. Confirm ORF C. Establishes intron-exon boundaries D. Determines new reading frame for exons E. Predicts novel coding region F. Finds the end of a gene G. Uses a related species to build on genomic annotation

RNA Sequence-based database for alternatively splicing identification m/z intensity MS/MS RNA-Seq junction DB Compare, score, test significance Identification of novel splice isoforms

Annotation of organisms which lack genome sequencing m/z intensity MS/MS Reference DB of related species De novo MS/MS sequencing Compare, score, test significance Identification of potential protein coding regions

Proteogenomics: Genome Annotation Summary Confirms existing gene models Corrects existing gene models Intron-exon boundaries Reading frames Novel splice isoforms Novel exons Identifies novel genes Fusion protein identification Identify genomic polymorphisms

Proteogenomics Improving genome annotation Sequencing driven database construction Proteomic mapping to genomic coordinates

Proteogenomic workflow Before the advent of proteogenomics, variant protein analysis was laborious, often requiring de novo sequencing**, which is very time-consuming, and therefore only a very limited number of peptides can be sequenced. ** DNA/RNA sequencing

Single nucleotide variant database for variant protein identification m/z intensity MS/MS Reference protein DB Variant DB + Compare, score, test significance Variants predicted from genome sequencing TCGAGAGCTG TCGATAGCTG Identification of variant proteins Exon 1

Creating variant sequence DB VCF File Format # Meta-information lines Columns: Chromosome Position ID (ex: dbSNP) Reference base Alternative allele Quality score Filter (PASS=passed filters) Info (ex: SOMATIC, VALIDATED..)

Creating variant sequence DB EXON 1 EXON2 … … …GTATTGCAAAAATAAGATAGAATAAGAATAATTACGACAAGATTC… Add in variants within exon boundaries …CTATTGCAAAAATACGATAGCATAAGAATAGTTACGACAAGATTC… In silico translation …LLQKYDSIRIVTTRF… Variant DB

Splice junction database for novel exon, alternative splicing identification m/z intensity MS/MS RNA-Seq junction DB Reference protein DB + Compare, score, test significance Intron/Exon boundaries from RNA sequencing Alt. Splicing Novel Expression Identification of novel splice proteins Exon 1 Exon 2 Exon 3 Exon 1 Exon X Exon 2

Creating splice junction DB BED File Format Columns: Chromosome Chromosome Start Chromosome End Name Score Strand (+or-) 7-9. Display info 10. # blocks (exons) 11. Size of blocks 12. Start of blocks

Creating splice junction DB Bed file with new gene mapping Map to known intron/exon boundaries Junction bed file Unannotated alternative splicing One novel intron/exon boundary Two novel intron/exon boundaries

Fusion protein identification m/z intensity MS/MS Fusion Gene DB Reference protein DB + Compare, score, test significance Gene X Exon 1 Gene X Exon 2 Gene Y Exon 1 Gene Y Exon 2 Identification of variant proteins Chr 1 Chr 2 Gene X Exon 1 Gene Y Exon 2

Fusion Genes Find consensus sequence 6 frame translation FASTA .…AGAACTGGAAGAATTGG*AATGGTAGATAACGCAGATCATCT..… 6 frame translation FASTA Fusion Location

Informatics tools for customized DB creation QUILTS: perl/python based tool to generate DB from genomic and RNA sequencing data (Fenyo lab) customProDB: R package to generate DB from RNA-Seq data (Zhang B, et al.) Splice-graph database creation (Bafna V. et al.)

Proteogenomics and Human Disease: Genomic Heterogeneity Whole genome sequencing has uncovered millions of germline variants between individuals Genomic, proteome studies typically use a reference database to model the general population, masking patient specific variation Nature October 28, 2010

Proteogenomics and Human Disease: Cancer Proteomics Cancer is characterized by altered expression of tumor drivers and suppressors Results from gene mutations causing changes in protein expression, activity Can influence diagnosis, prognosis and treatment Cancer proteomics Are genomic variants evident at the protein level? What is their effect on protein function? Can we classify tumors based on protein markers?

Tumor Specific Proteomic Variation Nature April 15, 2010 Stephens, et al. Complex landscape of somatic rearrangement in human breast cancer genomes. Nature 2009

Personalized Database for Protein Identification Germline Variants Somatic Variants MQYAPNTQVEIIPQGR SSAEVIAQSR ASSSIIINESEPTTNIQIR QRAQEAIIQISQAISIMETVK SSPVEFECINDK SPAPGMAIGSGR… SVATGSSEAAGGASGGGAR GQVAGTMKIEIAQYR DSGSYGQSGGEQQR EETSDFAEPTTCITNNQHS EPRDPR FIKGWFCFIISAR…. m/z intensity MS/MS Protein DB Compare, score, test significance Identified peptides and proteins

Personalized Database for Protein Identification RNA-Seq Genome Sequencing m/z intensity MS/MS Tumor Specific Protein DB Compare, score, test significance Identified peptides and proteins + tumor specific + patient specific peptides

Tumor Specific Protein Databases Non-Tumor Sample Genome sequencing Identify germline variants Genome sequencing RNA-Seq Tumor Sample Identify alternative splicing, somatic variants and novel expression TCGAGAGCTG TCGATAGCTG Exon 1 Exon 2 Exon 3 Variants Alt. Splicing Novel Expression Exon X Fusion Genes Gene X Gene Y Tumor Specific Protein DB Reference Human Database (Ensembl)

Proteogenomics and Biomarker Discovery Tumor-specific peptides identified by MS can be used as sensitive drug targets or diagnostic tools Fusion proteins Protein isoforms Variants Effects of genomic rearrangements on protein expression can elucidate cancer biology

Proteogenomics Improving genome annotation Sequencing driven database construction Proteomic mapping to genomic coordinates

Proteogenomic mapping Map back observed peptides to their genomic location. Requires tools to convert proteomic location to genomic coordinates Use to determine: Exon location of peptides Proteotypic Novel coding region Visualize in genome browsers (UCSF genome browser, Integrative Genomics Viewer (IGV)) Quantitative comparison based on genomic location

Informatics tools for proteogenomic mapping PGx: python-based tool, maps peptides back to genomic coordinates using user defined reference database (Fenyo lab) The Proteogenomic Mapping Tool: Java-based search of peptides against 6-reading frame sequence database (Sanders WS, et al).

PGX: Proteogenomic mapping tool Sample specific protein database Peptides Log Fold Change in Expression (10,000 bp bins) Copy Number Variation Methylation Status Manor Askenazi David Fenyo Exon Expression (RNA-Seq) Number of Genes/Bin Peptides Peptides mapped onto genomic coordinates

Variant Peptide Mapping Peptides with single amino acid changes corresponding to germline and somatic variants Exon Skipping Unannotated Exons SVATGSSETAGGASGGGAR ACG->GCG ENSEMBL Gene SVATGSSEAAGGASGGGAR Tumor Peptide Reference Peptide

Novel Peptide Mapping Peptides corresponding to RNA-Seq expression in non-coding regions ENSEMBL Gene Tumor Peptide Tumor RNA-Seq

Proteogenomic integration Variants Proteomic Quantitation RNA-Seq Data Predicted gene expression Proteomic Mapping Maps genomic, transcriptomic and proteomic data to same coordinate system including quantitative information

Summary The integration of proteomics and genomics can improve our understanding of not only genomic annotation, but also of the functional protein products integral in biological processes. Proteogenomics is currently being used extensively in cancer discovery Genetic rearrangement differs between tumors Requires personalized database Can provide cancer specific proteins for biomarker development Proteogenomics will likely continue to grow, particularly in the identification of genomic abnormalities in disease

Questions?