Nathan Edwards, Center for Bioinformatics and Computational Biology

1 Proteomic Characterization of Alternative Splicing and Coding Polymorphism
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

2 Synopsis
MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.
Key concepts:
  Spectrum acquisition is unbiased
  Direct observation of amino-acid sequence
  Sensitive to minor sequence variation
  Observed peptides represent folded proteins

3 Synopsis
MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.
Applications:
  Cancer biomarkers
  Genome annotation

4 Mass Spectrometry for Proteomics
Measure mass of many (bio)molecules simultaneously
  High bandwidth
Mass is an intrinsic property of all (bio)molecules
  No prior knowledge required

5 Mass Spectrometer
Sample → Ionizer → Mass Analyzer → Detector
Ionizer: MALDI, Electro-Spray Ionization (ESI)
Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap
Detector: Electron Multiplier (EM)

6 High Bandwidth
[Figure: mass spectrum, % Intensity versus m/z]

7 Mass is fundamental!

8 Mass Spectrometry for Proteomics
Measure mass of many molecules simultaneously
  ...but not too many, abundance bias
Mass is an intrinsic property of all (bio)molecules
  ...but need a reference to compare to

9 Mass Spectrometry for Proteomics
Mass spectrometry has been around since the turn of the 20th century...
  ...why is MS-based proteomics so new?
Ionization methods
  MALDI, Electrospray
Protein chemistry & automation
  Chromatography, Gels, Computers
Protein sequence databases
  A reference for comparison

10 Sample Preparation for Peptide Identification
Enzymatic Digest and Fractionation
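
The digest step can be mimicked in software. The sketch below is only an illustration: the slide does not name the enzyme, so it assumes trypsin with the common rule "cleave after K or R unless the next residue is P". The function name, the `missed_cleavages` parameter, and the toy sequence are hypothetical.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """In-silico digest under an ASSUMED trypsin rule:
    cleave after K or R, except when the next residue is P."""
    # Cleavage sites are the positions just past each qualifying K/R.
    sites = [i + 1 for i, aa in enumerate(protein[:-1])
             if aa in "KR" and protein[i + 1] != "P"]
    bounds = [0] + sites + [len(protein)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Each peptide may span up to `missed_cleavages` uncut sites.
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptides.append(protein[bounds[i]:bounds[j]])
    return peptides


# Toy sequence invented for illustration; not a real protein.
toy_protein = "ACDEKFGHRPILKMN"
print(tryptic_digest(toy_protein))
print(tryptic_digest(toy_protein, missed_cleavages=1))
```

The R followed by P in the toy sequence is deliberately left uncut, which is why the middle peptide contains an internal R.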

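11 Single Stage MS
[Figure: MS spectrum, axis m/z]

12 Tandem Mass Spectrometry (MS/MS)
Precursor selection
[Figure: MS spectrum, axis m/z, with one precursor selected]

13 Tandem Mass Spectrometry (MS/MS)
Precursor selection + collision induced dissociation (CID)
[Figure: MS spectrum and resulting MS/MS spectrum, both on an m/z axis]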

14 Peptide Identification
For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well
Peptide sequences from protein sequence databases
  Swiss-Prot, IPI, NCBI’s nr, ...
Automated, high-throughput peptide identification in complex mixtures
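
The three steps above can be sketched in a few lines. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b and y ions, monoisotopic residue masses, a fixed m/z tolerance, and a simple count of matched fragments as the score. All function names and parameters are hypothetical.

```python
# Monoisotopic amino-acid residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values (assumed ion types)."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions + y_ions


def match_count(peptide, peaks, tol=0.5):
    """Step 2: number of predicted fragments with an observed peak within tol."""
    return sum(any(abs(mz - p) <= tol for p in peaks)
               for mz in fragment_mz(peptide))


def identify(candidates, peaks, min_matches=5, tol=0.5):
    """Step 3: keep candidates that match well, best first."""
    scored = [(match_count(p, peaks, tol), p) for p in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)


# Simulated spectrum: the predicted fragments of one peptide, no noise.
peaks = fragment_mz("DFLAGGVAAAISK")
print(identify(["DFLAGGVAAAISK", "AVMDDFAAFVEK"], peaks))
```

Because the candidate list comes from a sequence database, a peptide that is absent from the database can never be returned, whatever the spectrum contains.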

15 Why don’t we see more novel peptides?
Tandem mass spectrometry doesn’t discriminate against novel peptides...
  ...but protein sequence databases do!
Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

16 What goes missing?
Known coding SNPs
Novel coding mutations
Alternative splicing isoforms
Alternative translation start-sites
Microexons
Alternative translation frames

17 Why should we care?
Alternative splicing is the norm!
  Only 20-25K human genes
  Each gene makes many proteins
Proteins have clinical implications
  Biomarker discovery
Evidence for SNPs and alternative splicing stops with transcription
  Genomic assays, ESTs, mRNA sequence.
  Little hard evidence for translation start site

18 Novel Splice Isoform
Human Jurkat leukemia cell-line
  Lipid-raft extraction protocol, targeting T cells
  von Haller, et al. MCP 2003.
LIME1 gene:
  LCK interacting transmembrane adaptor 1
LCK gene:
  Leukocyte-specific protein tyrosine kinase
  Proto-oncogene
  Chromosomal aberration involving LCK in leukemias.
Multiple significant peptide identifications

19 Novel Splice Isoform

20 Novel Splice Isoform

21 Novel Mutation
HUPO Plasma Proteome Project
  Pooled samples from 10 male & 10 female healthy Chinese subjects
  Plasma/EDTA sample protocol
  Li, et al. Proteomics (Lab 29)
TTR gene
  Transthyretin (pre-albumin)
  Defects in TTR are a cause of amyloidosis.
  Familial amyloidotic polyneuropathy
    late-onset, dominant inheritance

22 Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy

23 Novel Mutation

24 Translation Start-Site
Human erythroleukemia K562 cell-line
  Depth of coverage study
  Resing et al. Anal. Chem
THOC2 gene:
  Part of the heteromultimeric THO/TREX complex.
Initially believed to be a “novel” ORF
  RefSeq mRNA in Jun 2007, no RefSeq protein
  TrEMBL entry Feb 2005, no SwissProt entry
  Genbank mRNA in May 2002 (complete CDS)
  Plenty of EST support
  ~ 100,000 bases upstream of other isoforms

25 Translation Start-Site

26 Translation Start-Site

27 Translation Start-Site

28 Translation Start-Site

29 Expressed Sequence Tags (ESTs)
Cheap, fast, coding
Single sequencing reads of mRNA
Sequence from 5’ or 3’ end
No “assembly”

30 Searching ESTs
Proposed long ago:
  Yates, Eng, and McCormack; Anal Chem, ’95.
Now:
  Protein sequences are sufficient for protein identification
  Computationally expensive/infeasible
  Difficult to interpret
Make EST searching feasible for routine searching to discover novel peptides.

31 Searching Expressed Sequence Tags (ESTs)
Pros
  No introns!
  Primary splicing evidence for annotation pipelines
  Evidence for dbSNP
  Often derived from clinical cancer samples
Cons
  No frame
  Large (8Gb)
  “Untrusted” by annotation pipelines
  Highly redundant
  Nucleotide error rate ~ 1%

32 Compressed EST Peptide Sequence Database
For all ESTs mapped to a UniGene gene:
  Six-frame translation
  Eliminate ORFs < 30 amino-acids
  Eliminate amino-acid 30-mers observed once
  Compress to C2 FASTA database
    Complete, Correct for amino-acid 30-mers
Gene-centric peptide sequence database:
  Size: < 3% of naïve enumeration, FASTA entries
  Running time: ~ 1% of naïve enumeration search
  E-values: ~ 2% of naïve enumeration search results
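
The first three steps of this pipeline can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: an "ORF" is taken to be any stop-free stretch of a translated frame, the standard genetic code is used, unknown codons become `X`, and the function names are hypothetical.

```python
from collections import Counter

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {a + b + c: AMINO[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_orfs(est, min_len=30):
    """Yield stop-free amino-acid stretches of at least min_len residues
    from the three forward and three reverse-complement frames."""
    est = est.upper()
    for strand in (est, est.translate(COMPLEMENT)[::-1]):
        for frame in range(3):
            codons = (strand[i:i + 3]
                      for i in range(frame, len(strand) - 2, 3))
            protein = "".join(CODON.get(c, "X") for c in codons)
            for orf in protein.split("*"):
                if len(orf) >= min_len:
                    yield orf


def repeated_kmers(orfs, k=30):
    """Return the amino-acid k-mers observed at least twice across all ORFs."""
    counts = Counter(orf[i:i + k]
                     for orf in orfs
                     for i in range(len(orf) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= 2}


# Tiny toy EST with a lowered length cutoff so that something survives.
print(list(six_frame_orfs("ATGGCCAAATAA", min_len=3)))
```

Requiring each 30-mer to be observed at least twice is what makes the roughly 1% nucleotide error rate tolerable: a sequencing error is unlikely to be reproduced identically in a second EST.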

34 SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI

35 Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI

36 Sequence Databases & CSBH-graphs
Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI

37 Sequence Databases & CSBH-graphs
All k-mers represented by an edge have the same count
[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]

38 cSBH-graphs
Quickly determine those that occur twice
[Figure: CSBH-graph with edge counts 2, 2, 1, 2]

39 Correct, Complete (C2) Enumeration
Set of paths that use each edge at least once
ACDEFGEFGI, DEFACG
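
The complete-and-correct property of the example can be checked directly on k-mer sets. The sketch below assumes k = 4 (the slides do not state k for this toy example) and reads "complete" as "every original k-mer is still present" and "correct" as "no k-mer appears that was absent from the originals". Function names are hypothetical, and the code only verifies an enumeration; it does not construct one.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(seqs, k):
    """How often each k-mer occurs across a set of sequences."""
    return Counter(km for s in seqs for km in kmers(s, k))


def is_complete_and_correct(original, compressed, k):
    """Return (complete, correct) for a proposed enumeration."""
    orig = set(kmer_counts(original, k))
    comp = set(kmer_counts(compressed, k))
    complete = orig <= comp   # nothing lost
    correct = comp <= orig    # nothing invented
    return complete, correct


original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
compressed = ["ACDEFGEFGI", "DEFACG"]
K = 4  # assumed for illustration

print(is_complete_and_correct(original, compressed, K))
print(sum(map(len, original)), "->", sum(map(len, compressed)), "residues")
```

With k = 4 the two enumerated paths carry exactly the same ten 4-mers as the three original sequences while using 16 residues instead of 23. With k = 5 the same paths would no longer be complete, since DEFGI and CDEFA are lost.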

40 Compressed EST Database
Gene centric compressed EST peptide sequence database
  20,774 sequence entries
  ~8Gb vs 223 Mb
  ~35 fold compression
22 hours becomes 15 minutes
E-values improve by similar factor!
Makes routine EST searching feasible
  Search ESTs instead of IPI?

41 Significant False Positives
E-values are not enough!
  Random guessers are easy to beat.
Post-translational modifications vs. amino-acid substitution
  methylation (on I/L, Q, R, C, H, K, S, T, N): +14
  D → E, G → A, V → I/L, N → Q, S → T: +14
Peptide extension
  z=+2 → z=+3
  Nonsense AA masses sum to precursor
Need to ensure:
  fragment ions define novel sequence
  sequence evidence is strong
  other plausible explanations can be eliminated
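
The +14 ambiguity is easy to reproduce from residue masses. The sketch below uses monoisotopic masses and takes the methylation shift as one CH2 group (14.01565 Da); the function name and the tolerance are hypothetical choices for illustration.

```python
# Monoisotopic amino-acid residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
METHYL = 14.01565  # CH2: 12.00000 + 2 * 1.007825


def substitutions_mimicking(delta, tol=0.001):
    """Residue pairs (old, new) whose mass difference equals delta within tol."""
    return sorted((a, b) for a in MONO for b in MONO
                  if a != b and abs(MONO[b] - MONO[a] - delta) <= tol)


for old, new in substitutions_mimicking(METHYL):
    print(f"{old} -> {new}: {MONO[new] - MONO[old]:+.5f}")
```

With this table the search returns D → E, G → A, N → Q, S → T, V → I and V → L, the same substitutions listed on the slide. Precursor mass alone therefore cannot separate them from a methylated peptide; only fragment ions that localize the shift can.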

42 Significant False Positives
Peptide               E-value       Evidence
DFLAGGLAAAISK         2.2 x 10^-8   2 ESTs
DFLAGGIAAAISK         x 10^-8       IPI (2), RefSeq, mRNA, ~1400 ESTs
DFLAGGVAAAISK         3.7 x 10^-8   IPI, RefSeq, mRNA, ~700 ESTs
DFLAGGVAAAISKMAVVPI   3.5 x 10^-5   Genscan exon
AISFAKDFLAGGIAAAISK   3.3 x 10^-4

43 Significant False Positives

44 Back to the lab...
Current LC/MS/MS workflows identify a few peptides per protein
  ...not sufficient for protein isoforms
Need to raise the sequence coverage to (say) 80%
  ...protein separation prior to LC/MS/MS analysis
Potential for database of splice sites of (functional) proteins!

45 Spectral Matching for Peptide Identification
Detection vs. identification
  Increased sensitivity & specificity
  No novel peptides
NIST GC/MS Spectral Library
  Identifies small molecules, 100,000’s of (consensus) spectra
  Bundled/Sold with many instruments
  “Dot-product” spectral comparison
Current project: Peptide MS/MS
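
A "dot-product" comparison can be sketched as a cosine similarity between binned spectra. This is a generic illustration, not the scoring used by NIST MS Search (which may apply additional weighting): the bin width, the function names, and the toy peak lists are assumptions.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def dot_product_score(query, library, bin_width=1.0):
    """Normalized dot product (cosine) of two binned spectra, between 0 and 1."""
    q = bin_spectrum(query, bin_width)
    lib = bin_spectrum(library, bin_width)
    dot = sum(v * lib.get(b, 0.0) for b, v in q.items())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in lib.values())))
    return dot / norm if norm else 0.0


# Toy spectra invented for illustration.
library_spectrum = [(147.1, 30.0), (234.1, 100.0), (347.2, 55.0)]
query_spectrum = [(147.1, 25.0), (234.2, 90.0), (347.2, 60.0), (500.3, 5.0)]
print(round(dot_product_score(query_spectrum, library_spectrum), 3))
```

Because the score depends on peak intensities as well as peak positions, it can only be computed against spectra that have already been observed, which is why this approach detects known peptides but finds no novel ones.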

46 NIST MS Search: Peptides

47 Peptide DLATVYVDVLK

48 Protein Families

49 Protein Families

50 Peptide DLATVYVDVLK

51 Hidden Markov Models for Spectral Matching
Capture statistical variation and consensus in peak intensity
Capture semantics of peaks
Extrapolate model to other peptides
Good specificity with superior sensitivity for peptide detection
  Assign 1000’s of additional spectra (p-value < 10^-5)

52 Hidden Markov Model
States: Delete, Insert, Ion
(m/z,int) pair emitted by ion & insert states

53 Spectral Matching of Peptide Variants
DFLAGGIAAAISK
DFLAGGVAAAISK

54 Spectral Matching of Peptide Variants
AVMDDFAAFVEK
AVM*DDFAAFVEK

55 HMM model extrapolation

56 Conclusions
Proteomics can inform genome annotation
  Eukaryotic and prokaryotic
  Functional vs silencing variants
Peptides identify more than just proteins
  Untapped source of disease biomarkers
Compressed peptide sequence databases make routine EST searching feasible
Novel spectral matching technique using HMMs looks very promising

57 Acknowledgements
Catherine Fenselau, Steve Swatkoski
  UMCP Biochemistry
Chau-Wen Tseng, Xue Wu
  UMCP Computer Science
Cheng Lee
  Calibrant Biosystems
PeptideAtlas, HUPO PPP, X!Tandem
Funding: NIH/NCI, USDA/ARS

