Proteomics: A Challenge for Technology and Information Science
CBCB Seminar, November 21, 2005
Tim Griffin, Dept. of Biochemistry, Molecular Biology and Biophysics
What is proteomics?
“Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and, ultimately, their function.”
(Stan Fields, Science, 2001)
Genomics vs. Proteomics
Similarities: large datasets; tools needed for annotation and interpretation of results
Differences:
Genomics: generally mature technologies and data processing methods; questions asked usually involve quantitative changes in RNA transcripts (microarrays)
Proteomics: still evolving; complexity of protein biochemical properties (expression changes, modifications, interactions, activities); many questions to ask and data to interpret; methods changing; different approaches (mass spec, arrays, etc.)
Genomics, Proteomics, and Systems Biology
[Figure: technologies and measurements classed as mature, prototype, or emerging]
Genomics: genomic DNA, mRNA; sequencing, arrays
Proteomics: protein cataloguing, protein products, quantitative profiling, protein phosphorylation, protein dynamics, protein modifications, subcellular location, catalytic activity, protein interaction maps, 3D structure (descriptive and functional protein properties)
Computational biology: identify system components; measure and define properties of the system and the interactions between components
“Shotgun” identification of proteins in mixtures by LC-MS/MS
Liquid chromatography coupled to tandem mass spectrometry (MS/MS)
Peptides: µLC separation ( um)
Ionization: MALDI or electrospray
Isolation, fragmentation, mass analysis of peptide fragments (m/z)
Tandem mass spectrum (thousands in a matter of hours)
Peptide sequence determination from MS/MS spectra
H2N-N-S-G-D-I-V-N-L-G-S-I-A-G-R-COOH
Collision-induced dissociation (CID) creates two prominent ion series:
y-series: y1 to y14
b-series: b1 to b14
H2N-NSGDIVNLGSIAGR-COOH
[Figure: MS/MS spectrum, relative abundance vs. m/z, with peaks labeled by fragment sequence: R, GR, AGR, IAGR, SIAGR, GSIAGR, LGSIAGR, NLGSIAGR, VNLGSIAGR, IVNLGSIAGR, DIVNLGSIAGR, GDIVNLGSIAGR]
The peptide sequence identifies the protein YMR134W, a yeast protein involved in iron metabolism
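The two ion series can be reproduced with a few lines of arithmetic. The sketch below assumes singly charged fragments and standard monoisotopic residue masses; the function names `b_ions` and `y_ions` are illustrative and not taken from the talk. A b ion is an N-terminal prefix plus a proton, and a y ion is a C-terminal suffix plus water and a proton.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "R": 156.10111,
}
WATER = 18.01056   # H2O, added to every y ion
PROTON = 1.00728   # charge carrier for a singly charged ion


def b_ions(peptide):
    """Singly charged b ions: N-terminal prefixes b1 .. bn."""
    ions, total = [], 0.0
    for i, residue in enumerate(peptide, start=1):
        total += RESIDUE_MASS[residue]
        ions.append((f"b{i}", peptide[:i], total + PROTON))
    return ions


def y_ions(peptide):
    """Singly charged y ions: C-terminal suffixes y1 .. yn."""
    ions, total = [], 0.0
    for i in range(1, len(peptide) + 1):
        total += RESIDUE_MASS[peptide[-i]]
        ions.append((f"y{i}", peptide[-i:], total + WATER + PROTON))
    return ions


if __name__ == "__main__":
    # Lists the suffix fragments (R, GR, AGR, ...) with their m/z values.
    for name, fragment, mz in y_ions("NSGDIVNLGSIAGR"):
        print(f"{name:>4} {fragment:<14} {mz:9.3f}")
```

The y-series fragments are exactly the suffixes R, GR, AGR, and so on, which is why the labeled peaks in the spectrum read out the sequence from the C-terminus. Note that L and I have identical masses, so this arithmetic alone cannot tell them apart.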
High-throughput protein identification by LC-MS/MS and automated sequence database searching
Raw MS/MS spectrum
Protein sequence and/or DNA sequence database search
Peptide sequence match
Protein identification
Direct identification of proteins from complex mixtures
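As a rough illustration of the search step, the sketch below digests each database protein in silico with trypsin rules, predicts y ions for every peptide, and picks the peptide whose predicted ions best match the observed peaks. This is a toy shared-peak count, not the scoring used by Sequest, Mascot, or any other real engine. The database sequences and the identifiers `toy_protein_1` and `toy_protein_2` are invented for the example and are not real protein entries.

```python
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728

# Hypothetical sequences, invented for illustration only.
TOY_DATABASE = {
    "toy_protein_1": "MKNSGDIVNLGSIAGRTLEK",
    "toy_protein_2": "MSTAVLEKGDSIVNRAGGLK",
}


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        following = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and following != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def y_ion_mzs(peptide):
    """m/z of singly charged y ions, y1 first."""
    total, mzs = 0.0, []
    for residue in reversed(peptide):
        total += RESIDUE_MASS[residue]
        mzs.append(total + WATER + PROTON)
    return mzs


def score(peptide, observed_mzs, tol=0.5):
    """Count predicted y ions that have an observed peak within tol."""
    return sum(
        any(abs(observed - predicted) <= tol for observed in observed_mzs)
        for predicted in y_ion_mzs(peptide)
    )


def search(observed_mzs, database, tol=0.5):
    """Return (score, protein_id, peptide) for the best-scoring peptide."""
    best = (0, None, None)
    for protein_id, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            current = score(peptide, observed_mzs, tol)
            if current > best[0]:
                best = (current, protein_id, peptide)
    return best
```

A production search would also restrict candidates by precursor mass and would score b ions, charge states, and peak intensities; those are left out here to keep the matching idea visible.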
Dealing with the data
1. Data acquisition: experimental information, metadata capture
2. Peak analysis: sequence database searching, quantitative analysis
3. Knowledge annotation and interpretation: database mining; assignment of function, pathway, localization, etc.; output for database archiving, publication
Integrated workflow?
1. Data acquisition: capturing experimental information
Proteomics Experimental Data Repository (PEDRo): proposed schema
Similar to genomic needs, but the experimental information is somewhat different
2. Peak Analysis: protein identification
Tools: ProFound, Mascot, PepSea, MS-Fit, MOWSE, Peptident, Multident, Sequest, PepFrag, MS-Tag
Computational algorithms for searching MS/MS spectra against protein sequence databases, mRNA sequences, DNA sequences
Need CPU horsepower (parallel computing)
2. Peak Analysis: data formats
[Diagram: Format 1, Format 2, Format 3, each with its own output: Output 1, Output 2, Output 3]
Lack of flexibility
Slow to evolve
Lack of incorporation of competing products, methods??
2. Peak Analysis: need general, flexible, in-house solutions
[Diagram: Format 1, Format 2, Format 3 feeding a common set of tools]
General tools for analysis of multiple data formats
Reverse engineering of data formats
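One way to picture a general tool for multiple data formats is a small reader registry: each format gets its own parser, and every parser returns the same in-memory structure so downstream analysis code never sees the file format. The sketch below is an assumption about how such a tool could be organized, not a description of any existing package. Two open text formats stand in for the proprietary instrument formats the slide has in mind: simplified MGF (Mascot Generic Format) and simplified Sequest-style .dta. The names `Spectrum`, `READERS`, and `load_spectra` are hypothetical.

```python
from dataclasses import dataclass

PROTON = 1.00728


@dataclass
class Spectrum:
    precursor_mz: float
    charge: int
    peaks: list  # list of (m/z, intensity) tuples


def parse_mgf(text):
    """Simplified MGF: BEGIN IONS / PEPMASS= / CHARGE= / peaks / END IONS."""
    spectra, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            current = Spectrum(0.0, 0, [])
        elif line == "END IONS":
            spectra.append(current)
            current = None
        elif current is None or not line:
            continue
        elif line.startswith("PEPMASS="):
            current.precursor_mz = float(line.split("=", 1)[1].split()[0])
        elif line.startswith("CHARGE="):
            current.charge = int(line.split("=", 1)[1].rstrip("+"))
        elif "=" not in line:  # peak line; other KEY=value headers are skipped
            mz, intensity = line.split()[:2]
            current.peaks.append((float(mz), float(intensity)))
    return spectra


def parse_dta(text):
    """Simplified .dta: first line is MH+ and charge, then one peak per line."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    mh_plus, charge = float(rows[0][0]), int(rows[0][1])
    precursor_mz = (mh_plus + (charge - 1) * PROTON) / charge
    peaks = [(float(row[0]), float(row[1])) for row in rows[1:]]
    return [Spectrum(precursor_mz, charge, peaks)]


# Supporting a new format means adding one parser and one entry here.
READERS = {"mgf": parse_mgf, "dta": parse_dta}


def load_spectra(text, fmt):
    """Dispatch to the parser registered for fmt."""
    if fmt not in READERS:
        raise ValueError(f"no reader registered for format {fmt!r}")
    return READERS[fmt](text)
```

The design choice is that format knowledge lives only in the parsers, so a reverse-engineered vendor format can be plugged in without touching the analysis code.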
2. Peak Analysis: reverse engineering data formats
2. Peak analysis: quality control of protein matches
Unfiltered – matches (lots of noise and junk)
Filtering
Filtered – thousands of “true” matches
Statistical analysis of database results (tools are available)
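The slide does not say which statistical method is used for filtering. As one common possibility, the sketch below uses target-decoy filtering: the search is run against real (target) and reversed or shuffled (decoy) sequences, and the score cutoff is chosen so that the estimated false discovery rate, decoys divided by targets above the cutoff, stays under a limit. This technique is swapped in for illustration; the function names and the example scores are made up.

```python
def score_threshold_at_fdr(matches, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR (decoys / targets) <= max_fdr.

    matches: iterable of (score, is_decoy) pairs, higher score = better.
    Returns None if no cutoff satisfies the limit.
    """
    targets = decoys = 0
    threshold = None
    for score, is_decoy in sorted(matches, key=lambda m: m[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            threshold = score
    return threshold


def filter_matches(matches, max_fdr=0.01):
    """Keep target matches that score at or above the FDR-derived cutoff."""
    matches = list(matches)
    threshold = score_threshold_at_fdr(matches, max_fdr)
    if threshold is None:
        return []
    return [m for m in matches if not m[1] and m[0] >= threshold]


if __name__ == "__main__":
    # Made-up scores: (search score, is_decoy)
    example = [(5.0, False), (4.5, False), (4.0, False),
               (3.5, True), (3.0, False), (2.5, True)]
    print(filter_matches(example, max_fdr=0.25))
```

Tied scores and more refined estimators (for example q-values) are ignored here to keep the idea compact.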
2. Peak Analysis: Quantitative analysis
Flexibility is key: need tools to handle different quantitative methods
External chemical labeling
Metabolic labeling (SILAC)
Enzymatic incorporation (16O/18O)
2. Peak Analysis: Quantitative analysis
[Figure: paired peaks for the same peptide from Sample 1 and Sample 2]
Relative intensity = relative protein abundance
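The relation "relative intensity = relative protein abundance" can be written as a ratio of the paired peak intensities. The sketch below assumes each peptide yields one intensity per sample and summarizes a protein by the median of its peptide ratios; the median is an assumption chosen for robustness to outliers, not something stated in the talk, and the intensities used in the test are made up.

```python
from statistics import median


def peptide_ratio(intensity_sample1, intensity_sample2):
    """Abundance of a peptide in sample 1 relative to sample 2."""
    if intensity_sample2 <= 0:
        raise ValueError("sample 2 intensity must be positive")
    return intensity_sample1 / intensity_sample2


def protein_ratio(peptide_pairs):
    """Median of the peptide ratios for one protein.

    peptide_pairs: list of (intensity_sample1, intensity_sample2) tuples,
    one per peptide assigned to the protein.
    """
    return median(peptide_ratio(a, b) for a, b in peptide_pairs)
```

For example, peptide pairs with ratios 2.0, 3.0, and 0.9 give a protein ratio of 2.0, so the single low peptide does not drag the estimate down.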
Evolving methodologies: iTRAQ
Sample: digest to peptides, iTRAQ label, multidimensional separation, MS/MS
[Figure: MS/MS spectrum, intensity vs. m/z]
Diagnostic ions used for quantitative analysis
Peptide fragments used for sequence identification
Sample multiplexing: simultaneous comparison of multiple states, replicates
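Quantitation from the diagnostic ions amounts to reading the intensity at each reporter m/z in the MS/MS spectrum and comparing channels. The sketch below assumes the four-channel iTRAQ reagent, whose reporter ions appear near m/z 114.11, 115.11, 116.11, and 117.11; the channel count, the tolerance, and the function names are assumptions for illustration, and the peak list in the test is made up.

```python
# Approximate reporter ion m/z values for a four-channel iTRAQ reagent
# (assumed here; the number of channels is not given in the text above).
REPORTER_MZ = {"114": 114.11, "115": 115.11, "116": 116.11, "117": 117.11}


def reporter_intensities(peaks, tol=0.05):
    """Sum the intensity found within tol of each reporter m/z.

    peaks: list of (m/z, intensity) tuples from one MS/MS spectrum.
    """
    totals = {label: 0.0 for label in REPORTER_MZ}
    for mz, intensity in peaks:
        for label, target in REPORTER_MZ.items():
            if abs(mz - target) <= tol:
                totals[label] += intensity
    return totals


def relative_to(totals, reference="114"):
    """Express every channel relative to the reference channel."""
    ref = totals[reference]
    if ref == 0:
        raise ValueError(f"reference channel {reference} has no signal")
    return {label: value / ref for label, value in totals.items()}
```

Peaks outside the reporter windows, such as the sequence-bearing fragment ions, are ignored by this step and remain available for identification.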
Need for “changeable” tools
[Figure: spectra (intensity axis) labeled “old” and “new”]
Automated analysis tools?
3. Knowledge annotation: making sense of lists of data
3. Knowledge annotation: mining proteomic/genomic databases
3. Knowledge annotation: needs
Annotation: accession numbers and protein names
Functional assignments (functional degeneracy?)
Pathway assignments
Subcellular localization
Disease implications
Comparison of different proteomic datasets (e.g. expression profiles compared to modification state profiles, other protein properties)
Automated and streamlined??
Publication and deposit in databases
Visualization of complex phenomena, interpretation of biological relevance
Modeling, integration with genomics data: computational and systems biology
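Two of the needs listed above, attaching annotation to accession numbers and comparing datasets, reduce to table lookups and set operations once identifiers are consistent. The sketch below is a minimal illustration under that assumption. The annotation table is a stand-in for a real database: only the YMR134W entry reflects the talk, and the identifiers `HYPOTHETICAL_1` and `HYPOTHETICAL_2` are placeholders.

```python
# Stand-in annotation table; a real workflow would query a protein database.
ANNOTATIONS = {
    "YMR134W": {"organism": "yeast", "function": "iron metabolism"},
}


def annotate(accessions, table):
    """Map each accession to its annotation, or None if it is unknown."""
    return {accession: table.get(accession) for accession in accessions}


def compare_datasets(first, second):
    """Split two lists of accessions into shared and dataset-specific sets."""
    a, b = set(first), set(second)
    return {
        "shared": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


if __name__ == "__main__":
    expression = ["YMR134W", "HYPOTHETICAL_1"]    # e.g. expression profile
    modification = ["YMR134W", "HYPOTHETICAL_2"]  # e.g. modification profile
    print(annotate(expression, ANNOTATIONS))
    print(compare_datasets(expression, modification))
```

Unannotated accessions come back as `None` rather than raising an error, which makes the gaps in annotation coverage easy to list.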