Published by Pamela Butler. Modified over 8 years ago.
Lydie Lane, HUPO meeting 2013, Yokohama
Integration of proteomics data in
neXtProt as an integration resource for HPP
[Diagram labels: ProteomeXchange; raw storage: EBI PRIDE (MS/MS), ISB PASSEL (SRM); mass spec output files; study metadata; identifications; PeptideAtlas reprocessing; MS data; Ab/IHC data]
neXtProt key principles
A ‘one stop shop for human proteins’ integrating as many resources as possible.
Each integrated data set is accompanied by:
- a quality flag
  – Gold: estimated error rate <1%
  – Silver: estimated error rate 1-5%
  – (Bronze: noisy or low quality data, not imported)
- a metadata file
Quality assessment and metadata documentation are a dynamic process involving data providers when possible.
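The quality flags above can be sketched as a small classifier. This is only an illustration: the function names are hypothetical, and treating every error rate above 5% as Bronze is an assumption, since the slide defines Bronze qualitatively ("noisy or low quality data") rather than by a threshold.

```python
GOLD_MAX = 0.01    # Gold: estimated error rate < 1%
SILVER_MAX = 0.05  # Silver: estimated error rate 1-5%


def quality_flag(error_rate: float) -> str:
    """Return the quality flag for an estimated error rate given as a fraction."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a fraction between 0 and 1")
    if error_rate < GOLD_MAX:
        return "GOLD"
    if error_rate <= SILVER_MAX:
        return "SILVER"
    # Assumption: anything above 5% is treated as Bronze.
    return "BRONZE"


def should_import(error_rate: float) -> bool:
    """Bronze data is not imported; Gold and Silver are."""
    return quality_flag(error_rate) != "BRONZE"


if __name__ == "__main__":
    for rate in (0.005, 0.03, 0.20):
        print(rate, quality_flag(rate), should_import(rate))
```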
neXtProt contents (Sept. 2013 release)
– 20’128 entries representing 39’325 protein sequences (isoforms created by splicing or alternative initiation), in sync with UniProtKB/Swiss-Prot
– All Swiss-Prot human annotations, PLUS:
– Affymetrix and Illumina chip set identifiers
– Chromosomal location and exon mapping from Ensembl
– Subcellular localization data from different sources (soon from HPA)
– 854’400 cSNPs from COSMIC, dbSNP and Ensembl
– Tissue expression data at mRNA level (microarray/EST) from Bgee (meta-analysis of ArrayExpress/UniGene data)
– Tissue expression data at protein level (IHC) from Human Protein Atlas (HPA)
– Protein/PTM identifications from PeptideAtlas and other large-scale proteomic studies
Proteomics data
- All peptides from PeptideAtlas Human builds (Aug 2013) were integrated as GOLD.
- 20 additional studies were curated. For each of them, GOLD and SILVER thresholds for identification were defined based on FDR, Mascot ion scores and PTM localisation scores. Peptides matching the quality criteria are aligned to neXtProt entries and displayed. Peptides that match more than one entry are labeled «found in other entries».
- A total of 420’330 identified peptides were mapped to 15’509 proteins.
- 45’430 PTM sites (N-glycosylation, phosphorylation, S-nitrosylation, ubiquitination and sumoylation) were integrated, corresponding to 30’083 new PTM annotations compared to UniProtKB/Swiss-Prot.
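The peptide-to-entry alignment and the «found in other entries» label can be illustrated with a minimal sketch. It assumes plain exact-substring matching against each entry sequence, which is a simplification and not necessarily how neXtProt aligns peptides. The accessions and sequences are made-up toy data, and the function names are hypothetical.

```python
def map_peptides(peptides, entries):
    """Map each peptide to the sorted list of entry accessions whose sequence contains it.

    Assumption: exact substring match, with no isoform handling or I/L equivalence.
    """
    return {
        pep: sorted(acc for acc, seq in entries.items() if pep in seq)
        for pep in peptides
    }


def annotate(hits):
    """Produce (accession, peptide, found_in_other_entries) rows for display."""
    rows = []
    for pep, accessions in hits.items():
        shared = len(accessions) > 1  # peptide matches more than one entry
        for acc in accessions:
            rows.append((acc, pep, shared))
    return rows


if __name__ == "__main__":
    # Toy data, not real neXtProt entries.
    entries = {
        "NX_TOY1": "MKTAYIAKQRQISFVK",
        "NX_TOY2": "MSSHEGGKQRQISFVKLLD",
    }
    hits = map_peptides(["TAYIAK", "QRQISFVK", "WWWW"], entries)
    for row in annotate(hits):
        print(row)
```

A peptide with no match simply produces no rows, mirroring the idea that only peptides aligned to an entry are displayed.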
cSNP display on the sequence view
Expression view displays data obtained at mRNA and protein levels
PTM display on the proteomics view
Peptide display on the proteomics view
neXtProt computes a “Protein Existence” status based on all the integrated information
- Entries whose protein(s) existence is based on evidence at protein level*: 15,646 (77.8%)
- Entries whose protein(s) existence is based on evidence at transcript level: 3,570 (17.7%)
- Entries whose protein(s) existence is based on homology: 187 (0.9%)
- Entries whose protein(s) existence is based on a prediction (gene model): 87 (0.4%)
- Entries whose protein(s) existence is uncertain: 638 (3.2%)
* Clear experimental evidence for the existence of the protein. The criteria include partial or complete Edman sequencing, clear identification by mass spectrometry, X-ray or NMR structure, good quality protein-protein interaction or detection of the protein by antibodies.
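As a quick consistency check, the counts above can be summed and turned back into percentages. The sketch below uses only the figures on the slide; the dictionary keys are shortened labels chosen for illustration. The counts add up to the 20,128 entries of the release. The first share computes to roughly 77.7%, so the slide's percentages may have been adjusted slightly so that they total 100%.

```python
COUNTS = {
    "evidence at protein level": 15646,
    "evidence at transcript level": 3570,
    "homology": 187,
    "prediction (gene model)": 87,
    "uncertain": 638,
}


def shares(counts):
    """Return the total number of entries and each category's share in percent."""
    total = sum(counts.values())
    return total, {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    total, pct = shares(COUNTS)
    print("total entries:", total)
    for name, value in pct.items():
        print(f"{name}: {value}%")
```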
In quest for the “missing proteins”...
Data export and programmatic access
- Download by FTP at ftp.nextprot.org to obtain neXtProt in XML or PEFF* format.
- A first version of our REST API is available at http://www.nextprot.org/rest/
* PEFF = “Proteomics-enriched FASTA format”, allowing tools to easily and consistently access data essential to the success of HPP, namely sequence variations and PTMs.
Example PEFF entry:
> nxp:Q9HCU4 \NcbiTaxId=9606 \Pname=Cadherin EGF LAG seven-pass G-type receptor 2 \Gname=CELSR2 \Processed=(1|31|SIGNAL)(32|2923|CHAIN) \ModRes=(1591|MOD:00035) (1810|MOD:00035) \Variant=(1066|1066|Q)(1639|1639|H)(1992|1992|R)(2387|2387|A)
MRSPATGVPL PTPPPPLLLL LLLLLPPPLL GDQVGPCRSL GSRGRGSSGA CAPMGWLCPS SASNLWLYTS RCRDAGTELT GHLVPHHDGL RVWCPESEAH IPLPPAPEGC PWSCRLLGIG GHLSPQGKLT LPEEHPCLKA PRLRCQSCKL AQAPGLRAGE RSPEESLGGR RKRNVNTAPQ FQPPSYQATV PENQPAGTPV ASLRAIDPDE GEAGRLEYTM DALFDSRSNQ FFSLDPVTGA VTTAEELDRE TKSTHVFRVT AQDHGMPRRS ALATLTILVT DTNDHDPVFE QQEYKESLRE NLEVGYEVLT VRATDGDAPP NANILYRLLE GSGGSPSEVF EIDPRSGVIR TRGPVDREEV ESYQLTVEAS DQGRDPGPRS TTAAVFLSVE DDNDNAPQFS EKRYVVQVRE DVTPGAPVLR VTASDRDKGS NAVVHYSIMS GNARGQFYLD AQTGALDVVS PLDYETTKEY TLRVRAQDGG RPPLSNVSGL VTVQVLDIND NAPIFVSTPF QATVLESVPL GYLVLHVQAI DADAGDNARL EYRLAGVGHD FPFTINNGTG WISVAAELDR EEVDFYSFGV EARDHGTPAL TASASVSVTV LDVNDNNPTF TQPEYTVRLN EDAAVGTSVV TVSAVDRDAH SVITYQITSG NTRNRFSITS QSGGGLVSLA LPLDYKLERQ YVLAVTASDG TRQDTAQIVV NVTDANTHRP VFQSSHYTVN VNEDRPAGTT VVLISATDED TGENARITYF MEDSIPQFRI DADTGAVTTQ AELDYEDQVS YTLAITARDN GIPQKSDTTY LEILVNDVND NAPQFLRDSY QGSVYEDVPP FTSVLQISAT DRDSGLNGRV FYTFQGGDDG DGDFIVESTS GIVRTLRRLD RENVAQYVLR AYAVDKGMPP ARTPMEVTVT VLDVNDNPPV FEQDEFDVFV EENSPIGLAV ARVTATDPDE GTNAQIMYQI VEGNIPEVFQ LDIFSGELTA LVDLDYEDRP EYVLVIQATS APLVSRATVH VRLLDRNDNP PVLGNFEILF NNYVTNRSSS FPGGAIGRVP AHDPDISDSL TYSFERGNEL SLVLLNASTG ELKLSRALDN NRPLEAIMSV LVSDGVHSVT AQCALRVTII TDEMLTHSIT LRLEDMSPER FLSPLLGLFI QAVAATLATP PDHVVVFNVQ RDTDAPGGHI LNVSLSVGQP PGPGGGPPFL PSEDLQERLY LNRSLLTAIS AQRVLPFDDN ICLREPCENY MRCVSVLRFD SSAPFIASSS
VLFRPIHPVG GLRCRCPPGF TGDYCETEVD LCYSRPCGPH GRCRSREGGY TCLCRDGYTG EHCEVSARSG RCTPGVCKNG GTCVNLLVGG FKCDCPSGDF EKPYCQVTTR SFPAHSFITF RGLRQRFHFT LALSFATKER DGLLLYNGRF NEKHDFVALE VIQEQVQLTF SAGESTTTVS PFVPGGVSDG QWHTVQLKYY NKPLLGQTGL PQGPSEQKVA VVTVDGCDTG VALRFGSVLG NYSCAAQGTQ GGSKKSLDLT GPLLLGGVPD LPESFPVRMR QFVGCMRNLQ VDSRHIDMAD FIANNGTVPG CPAKKNVCDS NTCHNGGTCV NQWDAFSCEC PLGFGGKSCA QEMANPQHFL GSSLVAWHGL SLPISQPWYL SLMFRTRQAD GVLLQAITRG RSTITLQLRE GHVMLSVEGT GLQASSLRLE PGRANDGDWH HAQLALGASG GPGHAILSFD YGQQRAEGNL GPRLHGLHLS NITVGGIPGP AGGVARGFRG CLQGVRVSDT PEGVNSLDPS HGESINVEQG CSLPDPCDSN PCPANSYCSN DWDSYSCSCD PGYYGDNCTN VCDLNPCEHQ SVCTRKPSAP HGYTCECPPN YLGPYCETRI DQPCPRGWWG HPTCGPCNCD VSKGFDPDCN KTSGECHCKE NHYRPPGSPT CLLCDCYPTG SLSRVCDPED GQCPCKPGVI GRQCDRCDNP FAEVTTNGCE VNYDSCPRAI EAGIWWPRTR FGLPAAAPCP KGSFGTAVRH CDEHRGWLPP NLFNCTSITF SELKGFAERL QRNESGLDSG RSQQLALLLR NATQHTAGYF GSDVKVAYQL ATRLLAHEST QRGFGLSATQ DVHFTENLLR VGSALLDTAN KRHWELIQQT EGGTAWLLQH YEAYASALAQ NMRHTYLSPF TIVTPNIVIS VVRLDKGNFA GAKLPRYEAL RGEQPPDLET TVILPESVFR ETPPVVRPAG PGEAQEPEEL ARRQRRHPEL SQGEAVASVI IYRTLAGLLP HNYDPDKRSL RVPKRPIINT PVVSISVHDD EELLPRALDK PVTVQFRLLE TEERTKPICV FWNHSILVSG TGGWSARGCE VVFRNESHVS CQCNHMTSFA VLMDVSRREN GEILPLKTLT YVALGVTLAA LLLTFFFLTL LRILRSNQHG IRRNLTAALG LAQLVFLLGI NQADLPFACT VIAILLHFLY LCTFSWALLE ALHLYRALTE VRDVNTGPMR FYYMLGWGVP AFITGLAVGL DPEGYGNPDF CWLSIYDTLI WSFAGPVAFA VSMSVFLYIL AARASCAAQR QGFEKKGPVS GLQPSFAVLL LLSATWLLAL LSVNSDTLLF HYLFATCNCI QGPFIFLSYV VLSKEVRKAL KLACSRKPSP DPALTTKSTL TSSYNCPSPY ADGRLYQPYG DSAGSLHSTS RSGKSQPSYI PFLLREESAL NPGQGPPGLG DPGSLFLEGQ DQQHDPDTDS DSDLSLEDDQ SGSYASTHSS DSEEEEEEEE EEAAFPGEQG WDSLLGPGAE RLPLHSTPKD GGPGPGKAPW PGDFGTTAKE SSGNGAPEER LRENGDALSR EGSLGPLPGS SAQPHKGILK KKCLPTISEK SSLLRLPLEQ CTGSSRGSSA SEGSRGGPPP RPPPRQSLQE QLNGVMPIAM SIKAGTVDED SSGSEFLFFN FLH
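To show how a tool might read the annotations carried by such an entry, here is a minimal sketch of a header parser. It assumes the header layout shown in the slide's example (backslash-prefixed `Key=Value` fields, with parenthesised, pipe-separated tuples for positional data). It is not a conformant parser for any published PEFF specification, and the function name and returned structure are hypothetical.

```python
import re

# Header copied from the slide's example entry (raw strings keep the backslashes).
HEADER = (
    r"> nxp:Q9HCU4 \NcbiTaxId=9606 "
    r"\Pname=Cadherin EGF LAG seven-pass G-type receptor 2 "
    r"\Gname=CELSR2 \Processed=(1|31|SIGNAL)(32|2923|CHAIN) "
    r"\ModRes=(1591|MOD:00035) (1810|MOD:00035) "
    r"\Variant=(1066|1066|Q)(1639|1639|H)(1992|1992|R)(2387|2387|A)"
)


def parse_peff_header(line):
    """Split a PEFF-style header into an identifier and a dict of fields.

    Fields whose value contains "(a|b|c)" groups become lists of tuples of
    strings; all other fields are kept as plain strings.
    """
    if not line.startswith(">"):
        raise ValueError("header line must start with '>'")
    # Fields are separated by whitespace followed by a backslash.
    parts = re.split(r"\s+\\", line[1:].strip())
    record = {"id": parts[0], "fields": {}}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        groups = re.findall(r"\(([^)]*)\)", value)
        if groups:
            record["fields"][key] = [tuple(g.split("|")) for g in groups]
        else:
            record["fields"][key] = value.strip()
    return record


if __name__ == "__main__":
    rec = parse_peff_header(HEADER)
    print(rec["id"])
    for start, end, residue in rec["fields"]["Variant"]:
        print(f"variant at {start}-{end}: {residue}")
```

Positions are left as strings here; a real consumer would likely convert them to integers and validate them against the sequence length.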
Relevant files for the HPP community
- Full chromosome reports: ftp://ftp.nextprot.org/pub/current_release/chr_reports/
- Simplified chromosome reports: ftp://ftp.nextprot.org/pub/current_release/custom/hpp/
- Mappings to ENSG, ENST and ENSP: ftp://ftp.nextprot.org/pub/current_release/mapping/
- Controlled vocabularies, including tissue and cell line ontologies: ftp://ftp.nextprot.org/pub/current_release/controlled_vocabularies/
neXtProt team @ SIB and UniGe
neXtProt content:
– Coordinator: Pascale Gaudet
– Biocurators: Guislaine Argoud-Puy, Aurore Britan, Jonas Cicenas, Isabelle Cusin, Paula Duek, Nevila Nouspikel (+ Ying Zhang from T. Yamamoto’s group)
– QA: Monique Zahn
neXtProt software developers:
– Coordinator: Pierre-André Michel
– Olivier Evalet, Alain Gateau, Anne Gleizes, Mario Pereira, Daniel Teixeira
Directed by:
– Amos Bairoch and Lydie Lane