Considerations for multi-omics data integration. Michael Tress, CNIO


GENCODE genome annotation. Predictions: Ensembl/GENCODE automatic pipelines, HBM (Human Body Map) and GENCODE RNA-seq data, and individual large-scale studies. Coding potential is determined from similarity to known proteins, conservation and the presence of Pfam functional domains. Some transcripts are annotated as coding or non-coding based on the balance of probabilities; good proteomics evidence could help here. A few years ago the human reference genome was missing a number of coding genes, in part due to gaps in the reference build used for Ensembl and RefSeq. Now the set of coding genes is probably almost complete.

We collected peptides from a number of large-scale proteomics resources: NIST, Muñoz, PeptideAtlas, Nagaraj, Ezkurdia, Geiger, Wilhelm and Kim. We wanted to make sure that we had reliably identified peptides.

The older the ancestral gene, the higher the chance of detecting peptides. Genes that appeared since primates are practically not detected! Gene family ages are based on Ensembl Compara (Ezkurdia, Juan et al, Hum Mol Gen, 2014).

Genes with no protein features at all (structure, function, etc.) were not detected. Y-axis: % of genes in each bin detected in proteomics experiments.
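The y-axis quantity described here (the % of genes in each bin that were detected) can be computed as in the following minimal Python sketch. The function name, gene identifiers and bin labels are illustrative assumptions, not taken from the original analysis.

```python
from collections import defaultdict
from typing import Dict, Set


def percent_detected_by_bin(gene_bins: Dict[str, str], detected: Set[str]) -> Dict[str, float]:
    """Return, for each bin, the percentage of its genes found in `detected`.

    gene_bins maps a gene identifier to its bin label (for example a gene
    family age or a count of protein features); detected holds the genes
    with peptide evidence.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gene, bin_label in gene_bins.items():
        totals[bin_label] += 1
        if gene in detected:
            hits[bin_label] += 1
    return {label: 100.0 * hits[label] / totals[label] for label in totals}


# Toy input with made-up gene identifiers, for illustration only.
toy_bins = {"g1": "Bilateria", "g2": "Bilateria", "g3": "Primates", "g4": "Primates"}
toy_detected = {"g1", "g2"}
toy_rates = percent_detected_by_bin(toy_bins, toy_detected)
```

The same function works for any binning, whether by gene family age or by number of protein features, because it only needs a gene-to-bin mapping.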

We found evidence for just 282 splice events - many were of ancient origin.

Paralogues: Ancestor
ACSL1, ACSL6: Jawed vertebrates
ACTN1, ACTN2, ACTN4: One AS in fruitfly, one in vertebrates
ATP2B1, ATP2B2, ATP2B3, ATP2B4: Bilateria
DNM1, DNM2: Vertebrates
GNAL, GNAS: Jawed vertebrates
ITGA3, ITGA6: Vertebrates
PDLIM3, LDB3: Chordates
TPM1, TPM2, TPM3, TPM4: Vertebrates

All 60 homologous exons were conserved in jawed vertebrates, e.g. fugu and zebrafish, which implies that they evolved at least 460 million years ago. By comparison, mouse and human conserve fewer than 20% of AS exons. Abascal et al, PLoS Comp Biol, 2015

Most detected alternative isoforms would not break Pfam domains. ISE = isoforms detected with peptide evidence. GENCODE20 is the background of the whole genome; AI genes are all isoforms annotated for the 246 genes with detected alternative isoforms.

Multi-omics considerations. What does that mean for proteogenomics analyses? Most (but not all!) detected novel coding genes/isoforms are likely to have little evolutionary history and few protein features. We find that standard proteomics experiments are less likely to detect peptides for these regions. If many novel regions are identified in a study, quality control is needed, because many will have been identified by less reliable peptides (semi-tryptic peptides, low-scoring PSMs, poor spectra).

XXX ORFs: no protein features. A recent paper identified many peptides for these new ORFs. Results: more than 200 previously uncharacterized coding regions. These candidates are short and have no protein features. Problem: peptides were cleaved by trypsin in the experiment, yet more than 80% of the peptides are semi-tryptic or non-tryptic. Caveat: that is not to say that these novel regions do not code for proteins, just that they are not found in standard proteomics experiments.
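One way to run this kind of check is to count the tryptic termini of each peptide. The sketch below is a minimal Python illustration, assuming the simplified rule that trypsin cleaves after K or R (the proline exception is ignored). The function names and the example sequence are hypothetical and are not taken from the study discussed on the slide.

```python
from typing import Iterable, Tuple


def classify_cleavage(protein: str, peptide: str) -> str:
    """Label a peptide 'tryptic', 'semi-tryptic' or 'non-tryptic'.

    Simplified assumption: trypsin cleaves C-terminal to K or R.
    Only the first occurrence of the peptide in the protein is considered.
    """
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    # N-terminus is consistent with trypsin if the peptide starts the
    # protein or the preceding residue is K or R.
    n_ok = start == 0 or protein[start - 1] in "KR"
    # C-terminus is consistent if the peptide ends the protein or its
    # last residue is K or R.
    c_ok = end == len(protein) or peptide[-1] in "KR"
    return {2: "tryptic", 1: "semi-tryptic", 0: "non-tryptic"}[n_ok + c_ok]


def fraction_not_fully_tryptic(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (protein, peptide) pairs that are semi- or non-tryptic."""
    labels = [classify_cleavage(protein, peptide) for protein, peptide in pairs]
    return sum(label != "tryptic" for label in labels) / len(labels)


# Made-up sequence for illustration only.
toy_protein = "MKAAGRLLSTKPEEW"
toy_labels = [classify_cleavage(toy_protein, p) for p in ("AAGR", "AGR", "LST")]
```

A high value from `fraction_not_fully_tryptic` in a trypsin-digested sample is the warning sign described above: it does not prove the identifications are wrong, but it suggests they deserve closer inspection.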

Proteogenomics strategy. Novel peptides identified using proteogenomics should be held to a higher standard of evidence than known peptides (spectra!). It is important to use a multi-stage data analysis strategy. If you search with a combined database and few modifications, you will find that many pseudogenes appear to express peptides. Initial searches should first be carried out against known coding genes (with a range of possible modifications) and possibly known single amino acid variants (SAVs). Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods
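The multi-stage idea can be sketched as follows in Python. This is an illustration under stated assumptions, not the procedure from the cited review: the two search functions are hypothetical stand-ins for a real search engine, spectra are represented only by identifiers, and the score scale and cutoffs are invented.

```python
from typing import Callable, Dict, Sequence, Tuple

Match = Tuple[str, float]  # (peptide sequence, match score); higher is better
SearchFn = Callable[[Sequence[str]], Dict[str, Match]]


def multi_stage_search(
    spectra: Sequence[str],
    search_known: SearchFn,
    search_novel: SearchFn,
    known_cutoff: float,
    novel_cutoff: float,
) -> Tuple[Dict[str, Match], Dict[str, Match]]:
    """Search known coding genes first, then novel candidates on the leftovers."""
    # Stage 1: known coding genes (ideally with a range of modifications).
    known_hits = {
        sid: match
        for sid, match in search_known(spectra).items()
        if match[1] >= known_cutoff
    }
    # Stage 2: only spectra left unexplained reach the novel database,
    # and novel matches must pass a stricter cutoff.
    unexplained = [sid for sid in spectra if sid not in known_hits]
    novel_hits = {
        sid: match
        for sid, match in search_novel(unexplained).items()
        if match[1] >= novel_cutoff
    }
    return known_hits, novel_hits


# Toy stand-ins with invented scores; "AAAAK" and "GGGGR" are placeholders.
def toy_search_known(spectrum_ids: Sequence[str]) -> Dict[str, Match]:
    table = {"s1": ("EITALAPSTMK", 0.99)}
    return {sid: table[sid] for sid in spectrum_ids if sid in table}


def toy_search_novel(spectrum_ids: Sequence[str]) -> Dict[str, Match]:
    table = {"s1": ("EITALAPSIMK", 0.98), "s2": ("AAAAK", 0.80), "s3": ("GGGGR", 0.97)}
    return {sid: table[sid] for sid in spectrum_ids if sid in table}


toy_known, toy_novel = multi_stage_search(
    ["s1", "s2", "s3"], toy_search_known, toy_search_novel, 0.90, 0.95
)
```

In the toy run, spectrum `s1` is explained by a known peptide in the first stage, so it never reaches the novel database, even though it would have scored well there.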

Pseudogene detection - PeptideAtlas. A spectrum was matched (incorrectly) to the peptide EITALAPSIMK from the putative POTEPK gene. The match is nearly perfect. The same spectrum matched (probably correctly) to the actin peptide EITALAPSTMK with a lysine dimethylation. This peptide is identified 63,000 times in PeptideAtlas.

Dominant isoforms

We found evidence of AS in just over 1% of human genes, so 98% of protein-coding genes have evidence for just a single isoform. Can we predict this isoform?

Five methods for selecting a reference isoform
LONGEST: Standard reference isoform in all databases/large-scale experiments.
CCDS: Unique CCDS. CCDS variants are a consensus between RefSeq and Ensembl/GENCODE.
RNASEQ: 5-fold dominant transcripts from HBM data (Gonzalez-Porta et al, Gen. Biol).
APPRIS: Principal isoforms based on structure, function and conservation (Rodriguez et al, NAR, 2012).
HCI: Highest connected isoforms, trained on RNA-seq data in Li et al, JPR, 2015.

Five means of selecting reference isoforms. We calculated the % agreement between the main proteomics isoform we found and the five reference methods: the longest sequence, APPRIS principal isoforms, unique CCDS variants, the dominant RNAseq transcripts and the Highest Connected Isoforms. Agreement values from the figure: 98.6%, 97.8%, 77.2%, 77.7%, 78%.
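The % agreement calculation can be illustrated with a short Python sketch. It assumes each method is represented as a mapping from gene to its selected isoform, and that agreement is measured only over genes for which both the proteomics data and the reference method make a call. The gene and isoform names are made up.

```python
from typing import Dict


def percent_agreement(main_isoform: Dict[str, str], reference: Dict[str, str]) -> float:
    """Percentage of shared genes where the reference picks the main isoform."""
    shared = [gene for gene in main_isoform if gene in reference]
    if not shared:
        raise ValueError("no genes in common")
    agreeing = sum(main_isoform[gene] == reference[gene] for gene in shared)
    return 100.0 * agreeing / len(shared)


# Invented identifiers, for illustration only.
toy_main = {"GENE_A": "A-201", "GENE_B": "B-202", "GENE_C": "C-201", "GENE_D": "D-201"}
toy_longest = {"GENE_A": "A-201", "GENE_B": "B-201", "GENE_C": "C-201", "GENE_D": "D-203"}
toy_appris = {"GENE_A": "A-201", "GENE_B": "B-202", "GENE_C": "C-201"}

toy_vs_longest = percent_agreement(toy_main, toy_longest)
toy_vs_appris = percent_agreement(toy_main, toy_appris)
```

Restricting the comparison to shared genes matters because methods such as unique CCDS do not select an isoform for every gene.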

For those 3,000+ genes with a main experimental isoform, an APPRIS principal isoform and a unique CCDS variant, all three isoforms agreed for over 99% of the genes. The clear agreement between three orthogonal sources (and the large number of tissues sampled) suggests that the main proteomics isoform is the dominant protein isoform in the cell. Indeed, alternative isoforms (non-APPRIS principal isoforms) "are significantly enriched in amino acid-changing variants, particularly those that have a strong impact on protein function" (Liu et al, Molecular BioSystems, 2015). Ezkurdia et al, J. Proteome Res, 2015