Considerations for multi-omics data integration Michael Tress CNIO,
GENCODE genome annotation
Predictions: Ensembl/GENCODE automatic pipelines, HBM and GENCODE RNA-seq data, individual large-scale studies. Coding potential is determined from similarity to known proteins, conservation and the presence of Pfam functional domains. Some transcripts are annotated as coding or non-coding based on the balance of probabilities; good proteomics evidence could help here. A few years ago the human reference genome was missing a number of coding genes, in part due to gaps in the reference build used for Ensembl and RefSeq. Now the set of coding genes is probably almost complete.
We collected peptides from a number of large-scale proteomics resources: NIST, Muñoz, PeptideAtlas, Nagaraj, Ezkurdia, Geiger, Wilhelm and Kim. We wanted to make sure that we had reliably identified peptides.
The older the ancestral gene, the higher the chance of detecting peptides. Genes that have appeared since the primate lineage arose are practically undetected! Gene family ages are based on Ensembl Compara. Ezkurdia, Juan et al, Hum Mol Gen, 2014
Genes with no protein features at all (structure, function, etc.) were not detected.
[Figure: y-axis shows the % of genes in each bin detected in proteomics experiments]
We found evidence for just 282 splice events; many were of ancient origin.
Paralogues: Ancestor
ACSL1, ACSL6: Jawed vertebrates
ACTN1, ACTN2, ACTN4: One AS in fruitfly, one in vertebrates
ATP2B1, ATP2B2, ATP2B3, ATP2B4: Bilateria
DNM1, DNM2: Vertebrates
GNAL, GNAS: Jawed vertebrates
ITGA3, ITGA6: Vertebrates
PDLIM3, LDB3: Chordates
TPM1, TPM2, TPM3, TPM4: Vertebrates
All 60 homologous exons were conserved in jawed vertebrates, e.g. fugu and zebrafish, which implies that they evolved at least 460 million years ago. As a comparison, fewer than 20% of AS exons are conserved between mouse and human. Abascal et al, PLoS Comp Biol, 2015
Most detected alternative isoforms would not break Pfam domains. ISE = isoforms detected with peptide evidence; GENCODE20 is the whole-genome background; AI genes are all isoforms annotated for the 246 genes with detected alternative isoforms.
Multi-omics considerations
What does that mean for proteogenomics analyses? Most (but not all!) detected novel coding genes/isoforms are likely to have little evolutionary history and few protein features. We find that standard proteomics experiments are less likely to detect peptides for these regions. If many novel regions are identified in a study, quality control is needed, because many will have been identified from less reliable peptides (semi-tryptic peptides, low-scoring PSMs, poor spectra).
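The quality-control step for novel regions can be sketched as a simple flagging pass over PSMs. This is a minimal illustration, not the procedure used in the talk: the `PSM` record, its field names, the "higher score is better" convention and the `min_score` threshold are all assumptions made for the example, and spectrum quality still needs manual inspection.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """Hypothetical PSM record; field names are assumptions for this sketch."""
    peptide: str
    score: float          # search-engine score, assumed higher = better
    tryptic_termini: int  # 2 = fully tryptic, 1 = semi-tryptic, 0 = non-tryptic


def reliability_flags(psm, min_score):
    """Return the reasons (if any) why a PSM counts as less reliable evidence."""
    flags = []
    if psm.tryptic_termini < 2:
        flags.append("not fully tryptic")
    if psm.score < min_score:
        flags.append("low score")
    return flags


def summarise(psms, min_score):
    """Return the flagged PSMs and the fraction of all PSMs that were flagged."""
    flagged = [p for p in psms if reliability_flags(p, min_score)]
    fraction = len(flagged) / len(psms) if psms else 0.0
    return flagged, fraction


# Toy data, invented purely for illustration.
example_psms = [
    PSM("AAAK", 50.0, 2),
    PSM("GGGR", 10.0, 2),
    PSM("CCK", 60.0, 1),
]
flagged_psms, flagged_fraction = summarise(example_psms, min_score=30.0)
```

If a large fraction of the PSMs supporting a novel region end up flagged, the region deserves a closer look before it is reported.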
XXX ORFs – no protein features
Results: More than 200 previously uncharacterized coding regions. A recent paper identified many peptides for these new ORFs. These candidates are short and have no protein features. Problem: Peptides were cleaved by trypsin in the experiment, yet more than 80% of the peptides are semi-tryptic or non-tryptic. Caveat: this is not to say that these novel regions do not code for proteins, just that they are not found in standard proteomics experiments.
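A check of this kind can be sketched as follows. The sketch assumes the simplest trypsin rule (cleavage after K or R, ignoring the proline exception and missed cleavages) and uses an invented protein sequence; the function names are hypothetical.

```python
def tryptic_termini(protein, peptide):
    """Count how many peptide ends (0, 1 or 2) are consistent with trypsin.

    Simplified rule (an assumption of this sketch): trypsin cleaves after
    K or R. Only the first occurrence of the peptide in the protein is used.
    """
    start = protein.find(peptide)
    if start == -1:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    # N-terminal end: the protein start, or the preceding residue is K/R.
    n_ok = start == 0 or protein[start - 1] in "KR"
    # C-terminal end: the protein end, or the peptide itself ends in K/R.
    c_ok = end == len(protein) or peptide[-1] in "KR"
    return int(n_ok) + int(c_ok)


def classify(protein, peptide):
    labels = {2: "tryptic", 1: "semi-tryptic", 0: "non-tryptic"}
    return labels[tryptic_termini(protein, peptide)]


def fraction_not_fully_tryptic(pairs):
    """Fraction of (protein, peptide) pairs that are semi- or non-tryptic."""
    labels = [classify(protein, peptide) for protein, peptide in pairs]
    return sum(label != "tryptic" for label in labels) / len(labels)


# Invented sequence, for illustration only.
toy_protein = "MKAAGTRLLSEKPPW"
toy_pairs = [(toy_protein, p) for p in ("AAGTR", "AGTR", "AGT", "PPW")]
toy_fraction = fraction_not_fully_tryptic(toy_pairs)
```

In a trypsin-digested sample most confident identifications are expected to be fully tryptic, so a high value of this fraction among peptides supporting novel ORFs is a warning sign.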
Proteogenomics strategy
Novel peptides identified using proteogenomics should be held to a higher standard of evidence than known peptides (spectra!). It is important to use a multi-stage data analysis strategy. If you search with a combined database and few modifications, you will find that many pseudogenes appear to express peptides. Initial searches should first be carried out against known coding genes (with a range of possible modifications) and possibly known single amino acid variants (SAVs).
Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods
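The multi-stage idea can be sketched as a loop in which only spectra left unexplained by one stage are passed to the next. This is a schematic under stated assumptions: `search` stands in for a real search engine, the stage names are invented, and the toy "databases" are simply lookup tables of acceptable matches.

```python
def multi_stage_search(spectra, stages, search):
    """Assign each spectrum at the earliest stage that explains it.

    stages: ordered list of (stage_name, database) pairs.
    search: callable(spectrum, database) -> match or None. This is a
            placeholder for a real search engine (an assumption here).
    Returns the assignments and the spectra that no stage could explain.
    """
    assignments = {}
    remaining = list(spectra)
    for stage_name, database in stages:
        unmatched = []
        for spectrum in remaining:
            match = search(spectrum, database)
            if match is None:
                unmatched.append(spectrum)
            else:
                assignments[spectrum] = (stage_name, match)
        remaining = unmatched
    return assignments, remaining


def toy_search(spectrum, database):
    """Toy stand-in: the database maps spectrum ids to accepted peptides."""
    return database.get(spectrum)


# Invented spectrum ids and peptides, for illustration only.
toy_stages = [
    ("known genes + modifications", {"s1": "KNOWNPEPTIDEK"}),
    ("known SAVs", {"s2": "VARIANTPEPTIDER"}),
    ("novel regions", {"s1": "PSEUDOGENEPEPTIDEK", "s3": "NOVELPEPTIDER"}),
]
toy_assignments, toy_unassigned = multi_stage_search(
    ["s1", "s2", "s3", "s4"], toy_stages, toy_search
)
```

In the toy data, spectrum `s1` could be matched to a pseudogene peptide, but because it is already explained by a known peptide in the first stage it never reaches the novel-region search.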
Pseudogene detection - PeptideAtlas
Spectrum matched (incorrectly) to peptide EITALAPSIMK from the putative POTEPK gene. The match is nearly perfect. The same spectrum matched (probably correctly) to actin peptide EITALAPSTMK with a lysine dimethylation. This peptide is identified 63,000 times in PeptideAtlas.
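The two competing peptides differ at a single residue, which is easy to check programmatically. The sketch below locates the substitution and compares monoisotopic masses. The residue and modification masses are standard values rounded to five decimals; the helper names are hypothetical, and the interpretation of the residual mass is an observation made here, not something stated on the slide.

```python
# Standard monoisotopic residue masses (Da), rounded to five decimals.
MONO = {
    "A": 71.03711, "E": 129.04259, "I": 113.08406, "K": 128.09496,
    "L": 113.08406, "M": 131.04049, "P": 97.05276, "S": 87.03203,
    "T": 101.04768,
}
WATER = 18.01056
DIMETHYL = 28.03130  # mass added by dimethylation (two methyl groups)


def peptide_mass(seq):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO[residue] for residue in seq) + WATER


def substitutions(a, b):
    """Positions where two equal-length peptides differ, or None if lengths differ."""
    if len(a) != len(b):
        return None
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


pseudogene_peptide = "EITALAPSIMK"  # putative POTEPK match
actin_peptide = "EITALAPSTMK"       # actin match

diffs = substitutions(pseudogene_peptide, actin_peptide)
# Mass difference between the two unmodified peptides (I versus T).
delta = peptide_mass(pseudogene_peptide) - peptide_mass(actin_peptide)
# What remains once the dimethylation on the actin peptide is accounted for.
residual = DIMETHYL - delta
```

The unmodified peptides differ by about 12.036 Da, so the dimethylated actin peptide is about 15.995 Da heavier than the unmodified pseudogene peptide. That residual equals the mass of one oxygen atom, so an oxidised residue on the pseudogene peptide would make the two candidates isobaric; this is one possible explanation for the near-perfect match, though the slide does not say which modifications were allowed in the search.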
Dominant isoforms
We found evidence of AS in just over 1% of human genes, so 98% of protein-coding genes have evidence for just a single isoform. Can we predict this isoform?
Five methods for selecting a reference isoform
LONGEST: Standard reference isoform in all databases/large-scale experiments
CCDS: Unique CCDS. CCDS variants are a consensus between RefSeq and Ensembl/GENCODE
RNASEQ: 5-fold dominant transcripts from HBM data (Gonzalez-Porta et al, Gen. Biol)
APPRIS: Principal isoforms based on structure, function and conservation (Rodriguez et al, NAR, 2012)
HCI: Highest connected isoforms, trained on RNA-seq data in Li et al, JPR, 2015
Five means of selecting reference isoforms
We calculated the % agreement between the main proteomics isoform we found and the five reference methods: the longest sequence, APPRIS principal isoforms, unique CCDS variants, the dominant RNAseq transcripts and the Highest Connected Isoforms.
[Figure: % agreement per method; value labels 98.6%, 97.8%, 77.2%, 77.7%, 78%]
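The agreement calculation can be sketched as a comparison over the genes for which both a main proteomics isoform and a reference isoform are available. The gene and isoform identifiers below are invented, and restricting the comparison to shared genes is an assumption of this sketch rather than a documented detail of the analysis.

```python
def percent_agreement(main_isoform, reference):
    """Percentage of shared genes where the reference picks the main isoform.

    Both arguments map gene id -> isoform id. Genes missing from either
    mapping are skipped (an assumption of this sketch).
    Returns None when there are no shared genes.
    """
    shared = [gene for gene in main_isoform if gene in reference]
    if not shared:
        return None
    agree = sum(main_isoform[gene] == reference[gene] for gene in shared)
    return 100.0 * agree / len(shared)


# Invented identifiers, for illustration only.
toy_main = {"geneA": "iso1", "geneB": "iso2", "geneC": "iso1", "geneD": "iso3"}
toy_reference = {"geneA": "iso1", "geneB": "iso4", "geneC": "iso1", "geneE": "iso1"}
toy_agreement = percent_agreement(toy_main, toy_reference)
```

Running the same function once per reference method (longest, APPRIS, CCDS, RNAseq, HCI) would give one percentage per method, which is the kind of comparison summarised in the figure.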
For those 3,000+ genes with a main experimental isoform, an APPRIS principal isoform and a unique CCDS variant, all three isoforms agreed for over 99% of the genes. The clear agreement between three orthogonal sources (and the large number of tissues sampled) suggests that the main proteomics isoform is the dominant protein isoform in the cell. Indeed, alternative isoforms (non-APPRIS principal isoforms) “are significantly enriched in amino acid-changing variants, particularly those that have a strong impact on protein function” (Liu et al, Molecular BioSystems, 2015). Ezkurdia et al, J. Proteome Res, 2015