Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.

Outline: Systems biology. Functional enrichment analysis. Careful counting: generalized protein parsimony. (Multivariate-)hypergeometric models: differential proteins by spectral counts; functional enrichment of differential proteins; differential functional categories by spectral counts; indirect detection of alternative splicing.

Systems Biology: Mathematical Models; Knowledge Databases; High-Throughput Experiments.

Systems Biology: Mathematical Models (Software, Statistics, Algorithms; phenotype ↕ biology). Knowledge Databases (UniProt, OMIM, KEGG; molecular biology ↕ biology). High-Throughput Experiments (Sequencing, Microarrays, Proteomics, Metabolomics; molecular biology ↕ phenotype).

Gene Expression Analysis: Hengel et al. J Immunol. 2003. Structured experiment: CD4+/L-selectin- T-cells vs CD4+/L-selectin+ T-cells. Affymetrix Human Genome U95A Array. Processing & Statistics: MAS 4.0, t-tests, FDR filtering, … 164 probe identifiers for upregulated genes.

Gene Set Enrichment: Candidate genes are “special” with respect to the experiment structure (phenotype). Are they special with respect to general biological knowledge? Are the candidate genes related? Can we filter out the noise? Can we expose associated genes? Which genes' changes are linked to the experimental structure / phenotype?

Gene Sets: Genes may be related in many ways: same pathway, similar function, cellular location, cytoband, identified in previous study, etc. Define gene sets for relatedness: GO Biological Process, GO Molecular Function, GO Cellular Component, KEGG Pathway, Biocarta Pathway. Biological knowledge databases.

Functional Enrichment Analysis

Drawing Balls from Urns: For 100 balls drawn at random, how surprising is 5, 10, 15, 20, … black?

Drawing Balls from Urns: 6 of 155 upregulated genes have "oxygen binding" GO annotation! All human genes ( = 25), black is oxygen binding.

How surprised should we be? Classic problem in probability theory: how well do the observed counts match the expected counts? Various (mostly equivalent) statistical tests are applied: Fisher exact test, hypergeometric, chi-squared (χ²). The p-value measures "surprise".
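The urn question above can be made concrete with a hypergeometric tail probability. The following is a minimal sketch, not the presenter's code: the function name `hypergeom_tail` and the urn sizes in the usage lines (20,000 genes, 40 of them annotated) are hypothetical values chosen for illustration, since the transcript does not give the real background counts.

```python
from math import comb


def hypergeom_tail(k, population, black, draws):
    """P(X >= k) black balls when `draws` balls are taken without
    replacement from an urn of `population` balls, `black` of them black."""
    total = comb(population, draws)
    upper = min(black, draws)
    # Sum the exact hypergeometric probabilities for k, k+1, ..., upper.
    hits = sum(
        comb(black, i) * comb(population - black, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / total


# Hypothetical urn (assumed numbers): 20,000 genes, 40 carry the annotation;
# 155 genes are drawn and 6 of them carry the annotation.
p = hypergeom_tail(6, 20000, 40, 155)
print(f"p = {p:.3g}")
```

A small p-value here means that drawing 6 or more annotated genes by chance would be surprising under random sampling from the assumed background.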

Why not in proteomics? Shared peptides lead to double counting. Human genome is not an appropriate background. Relative abundance measurements are noisy and incomplete.

Why not in proteomics? Shared peptides lead to double counting: improve protein inference. Human genome is not an appropriate background: must account for observability bias. Relative abundance measurements are noisy and incomplete: use spectral counts to detect differential abundance.

Traditional Protein Parsimony: Select the smallest set of proteins that explain all identified peptides. Sensible principle; implies: eliminate equivalent/subset proteins. Equivalent proteins are problematic: which one to choose? Unique-protein peptides force the inclusion of proteins into the solution. True for most tools, even probability-based ones. Bad consequences for FDR-filtered ids.
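The "eliminate equivalent/subset proteins" step can be sketched as follows. This is an illustrative sketch only: the function name `collapse_proteins` and the toy protein and peptide identifiers are hypothetical, and the code covers just the elimination step, not the full smallest-set selection.

```python
def collapse_proteins(protein_peptides):
    """Group proteins with identical peptide sets, then drop every group
    whose peptide set is strictly contained in another group's set."""
    groups = {}
    for protein, peptides in protein_peptides.items():
        groups.setdefault(frozenset(peptides), []).append(protein)
    kept = {}
    for peptides, members in groups.items():
        # `<` on frozensets tests for a proper subset.
        if not any(peptides < other for other in groups):
            kept[tuple(sorted(members))] = set(peptides)
    return kept


# Toy input (hypothetical identifiers): A and B are equivalent, C is contained in A.
toy = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2", "p3"},
    "C": {"p1", "p2"},
    "D": {"p3", "p4"},
}
reduced = collapse_proteins(toy)
print(reduced)
```

Equivalent proteins are reported together as one group, which leaves the "which one to choose?" question open rather than answering it arbitrarily.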

Peptide-Spectrum Matches: Sigma49: 32,691 LTQ MS/MS spectra of 49 human protein standards; IPI Human. Yeast: 162,420 LTQ MS/MS spectra from a yeast cell lysate; SGD. X!Tandem E-value (no refinement), 1% FDR. Spectra used in: Zhang, B.; Chambers, M. C.; Tabb, D. L. 2007.

Many proteins are easy: Eliminate equivalent / dominated proteins: Sigma49: 277 → 60 proteins; Yeast: 1226 → 1085 proteins. Many components have a single protein: Sigma49: 52 (3 multi-protein); Yeast: 994 (43 multi-protein). Single peptides force protein inclusion: Sigma49: 16 single-peptide proteins; Yeast: 476 single-peptide proteins.

Must eliminate redundancy: Contained proteins should not be selected. (Figure label: 37 distinct peptides.)

Must eliminate redundancy: Contained proteins should not be selected, even if they have some probability mass. The number of sibling peptides matters less if they are shared. (Figure labels: 1.0, 0.8, 0.7, 0.0, 1.0; Single AA Difference.)

Must ignore some PSMs: A single additional peptide should not force a protein into the solution. (Figure labels: 1.0, 0.0, 1.0; Single AA Difference.)

Example from Yeast: "Inosine monophosphate dehydrogenase" 4-gene family. Contained proteins should not be selected. Single peptide evidence for YML056C. (Figure labels: 1.0, 0.6, 0.0, 1.0.)

Must ignore some PSMs: Improving peptide identification sensitivity makes things worse! False PSMs don't cluster. (Figure labels: 10%, 2x; Proteins, PSMs.)

Must ignore some PSMs: Improving peptide identification sensitivity makes things worse! False PSMs don't cluster. (Figure labels: Select Proteins to Explain True PSM%; PSMs; 90%.)

Must ignore some PSMs: How do we choose? Maximize # peptides? Minimize FDR (naïve model)? Maximize # PSMs?

Generalized Protein Parsimony: Weight peptides by number of PSMs. Constrain unique peptides per protein. Maximize explained peptides (PSMs). Match PSM filtering FDR to % uncovered PSMs. Readily solved by branch-and-bound. Permits complex protein/peptide constraints. Reduces to traditional protein parsimony.
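One reading of this formulation can be sketched on a toy instance. Everything here is an assumption made for illustration: the function name `generalized_parsimony`, the parameter `min_unique`, the toy data, and the interpretation that a selected protein must explain at least `min_unique` peptides that no other selected protein explains. The slide mentions branch-and-bound; this sketch swaps in exhaustive enumeration, which is only practical for a handful of proteins.

```python
from itertools import combinations


def generalized_parsimony(protein_peptides, psm_counts, min_unique=2):
    """Pick the protein subset that explains the most PSMs (ties broken by
    fewer proteins), subject to each selected protein having at least
    `min_unique` peptides not explained by any other selected protein."""
    proteins = sorted(protein_peptides)
    total_psms = sum(psm_counts.values())
    best_key, best_subset = (0, 0), ()

    def unique_count(protein, subset):
        others = set().union(
            *(protein_peptides[q] for q in subset if q != protein)
        )
        return len(protein_peptides[protein] - others)

    for size in range(1, len(proteins) + 1):
        for subset in combinations(proteins, size):
            if any(unique_count(p, subset) < min_unique for p in subset):
                continue  # violates the unique-peptide constraint
            covered = set().union(*(protein_peptides[p] for p in subset))
            key = (sum(psm_counts[pep] for pep in covered), -size)
            if key > best_key:
                best_key, best_subset = key, subset
    uncovered_fraction = 1 - best_key[0] / total_psms
    return best_subset, uncovered_fraction


# Toy instance (hypothetical): peptide -> PSM count, protein -> peptides.
psms = {"p1": 10, "p2": 5, "p3": 8, "p4": 1, "p5": 1}
prots = {"A": {"p1", "p2", "p3"}, "B": {"p3", "p4"}, "C": {"p5"}}
chosen, uncovered = generalized_parsimony(prots, psms, min_unique=2)
print(chosen, uncovered)
```

In this toy case only protein A is kept and 2 of 25 PSMs are left unexplained; that uncovered fraction is the quantity one would compare with the PSM filtering FDR. With `min_unique=1` the same sketch keeps all three proteins and covers every PSM.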

Match FDR to uncovered PSMs: Traditional Parsimony at 1% FDR: 1085 (609 2+-Unique) Proteins.

Spectral Counting: Abundant proteins are more likely to be identified: selection (by the instrument) for fragmentation is based on intensity; more abundant ions are more likely to fragment in an informative manner. A protein's peptide identification count (spectra) can be used as a crude abundance measurement. Easy, cheap, (relative) protein quantitation.

Differential Spectral Counts: Spectral counts are too crude for classical (microarray) statistics (fold change, t-tests, …). However, we expect "similar" spectral counts when the protein abundance is unchanged. Recast as drawing balls from urns.

HER2/Neu Mouse Model of Breast Cancer: Paulovich, et al. JPR, 2007. Study of normal and tumor mammary tissue by LC-MS/MS. 1.4 million MS/MS spectra. Peptide-spectrum assignments: normal samples (N_n): 161,286 (49.7%); tumor samples (N_t): 163,068 (50.3%). 4270 proteins identified in total. 2-unique generalized protein parsimony.

Drawing Balls from Urns: urns labeled All Normal Spectra and All Tumor Spectra. Plastin-2 (Lcp1): 827, 102; p = 2.437E-123. Osteopontin (Spp1): 334, 19; p = 2.444E-62. Hypoxia up-regulated protein 1 (Hyou1): 200, 7; p = 1.437E-40.
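A per-protein comparison of this kind can be sketched with a 2×2 chi-squared test, one of the "mostly equivalent" tests named earlier. This is a hedged sketch, not the presenter's procedure: the function name `chi2_2x2_pvalue` is hypothetical, the Plastin-2 counts are read as 827 and 102, and assigning 827 to tumor and 102 to normal is an assumption, since the flattened slide does not state which count belongs to which sample. The result need not reproduce the slide's p-value, which may come from a different test variant.

```python
from math import erfc, sqrt


def chi2_2x2_pvalue(count_a, count_b, total_a, total_b):
    """Chi-squared test (1 degree of freedom, no continuity correction)
    comparing one protein's spectral counts in two samples against the
    total spectra assigned in each sample."""
    rest_a, rest_b = total_a - count_a, total_b - count_b
    n = total_a + total_b
    chi2 = (
        n * (count_a * rest_b - count_b * rest_a) ** 2
        / ((count_a + count_b) * (rest_a + rest_b) * total_a * total_b)
    )
    # For 1 degree of freedom, P(chi-squared > x) = erfc(sqrt(x / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


# Totals from the HER2/Neu slide; the tumor/normal assignment of 827 and 102
# is an assumption made for this example.
chi2, p = chi2_2x2_pvalue(827, 102, 163068, 161286)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
```

Because spectral counts are not independent observations, a p-value computed this way is best read as a ranking score rather than a calibrated probability.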

Functional Enrichment: 374 proteins with "significantly" increased abundance in tumor tissue. Use 4270 proteins as background! DAVID gene set enrichment: protein translation; RNA binding, splicing.

Differential Spectral Counting: Spectral counts are not independent! p-values are misleading (too small). Non-independence from: repeated observations of the same peptide; peptides observed in both samples. Count distinct peptides instead and adjust for shared peptides. Get "correct" Fisher exact-test p-values. Loss of statistical power.

Differential Peptide Counts: (Distinct) peptides are independent. Enriched membrane preparation vs cell lysate: 17341:18536 (spec), 2896:2948 (pep). DNA-dependent protein kinase catalytic subunit: 87:19 (spec) 10^-13, 22:5 (pep) 10^-5. Counts are really 19:3:2 with 3 shared peptides. Translational activator GCN1: 104:29 (spec) 10^-12, 29:7 (pep) 10^-5. Counts are really 22:7:0 with 7 shared peptides. K+/Na+-transporting ATPase subunit alpha: 47:2 (spec) 10^-13, 6:1 (pep) not significant! Counts are really 5:1:0 with 1 shared peptide.
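The three-way counts quoted above (only in the first sample : shared : only in the second sample) follow from simple set arithmetic on distinct peptides. A minimal sketch, with a hypothetical function name `peptide_overlap` and made-up peptide labels arranged to mirror the 22:5 example:

```python
def peptide_overlap(peptides_a, peptides_b):
    """Return (only in A, shared, only in B) counts of distinct peptides.
    Inputs may contain repeats (one entry per spectrum); repeats are ignored."""
    a, b = set(peptides_a), set(peptides_b)
    return len(a - b), len(a & b), len(b - a)


# Hypothetical labels: 22 distinct peptides in sample A, 5 in sample B,
# 3 of them seen in both.
sample_a = [f"pep{i}" for i in range(1, 23)]
sample_b = ["pep20", "pep21", "pep22", "pep23", "pep24"]
print(peptide_overlap(sample_a, sample_b))
```

The printed triple is the 19:3:2 breakdown: 19 + 3 = 22 distinct peptides in the first sample and 3 + 2 = 5 in the second.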

Differential Spectral Counting: Assumptions of the formal tests (Fisher exact, χ²) are violated, so p-values can be misleading (too small). Monte Carlo peptide sampling? Label permutation tests? SLOW! Collapse spectral counts to protein sets (GO terms) directly: potential to observe more subtle spectral count differences; account for non-uniformity of protein sampling.
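Collapsing spectral counts to protein sets amounts to summing per-protein counts over each set's members. The sketch below assumes a hypothetical function name `counts_by_term`, made-up protein identifiers, counts, and term labels; it shows only the aggregation, not the statistical test applied afterwards.

```python
from collections import defaultdict


def counts_by_term(protein_counts, protein_terms):
    """Sum (sample A, sample B) spectral counts over all proteins annotated
    with each term. A protein with several terms contributes to each."""
    totals = defaultdict(lambda: [0, 0])
    for protein, (n_a, n_b) in protein_counts.items():
        for term in protein_terms.get(protein, ()):
            totals[term][0] += n_a
            totals[term][1] += n_b
    return {term: tuple(pair) for term, pair in totals.items()}


# Hypothetical inputs: per-protein spectral counts and term annotations.
counts = {"P1": (10, 30), "P2": (5, 7), "P3": (2, 0)}
terms = {"P1": ["translation"], "P2": ["translation", "RNA binding"]}
per_term = counts_by_term(counts, terms)
print(per_term)
```

Proteins without annotation (P3 here) drop out of every term total, which is one source of the non-uniform sampling the slide mentions.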

Unannotated Splice Isoform

What if there is no "smoking gun" peptide…

PKM2 in Peptide Atlas (figure; axis labels: experiments, peptides)

What if there is no "smoking gun" peptide… ?

Nascent polypeptide-associated complex subunit alpha: Long form is "muscle-specific". Exon 3 is missing from short form. Peptide identifications provide evidence for long form only: 9 peptides are specific to long form; 6 peptides are found in both isoforms. Urn with balls of 15 different colors. p-value of observed spectral counts: 7.3E-8.
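An urn with balls of many colors corresponds to the multivariate hypergeometric distribution. The sketch below gives only its textbook probability mass function; the function name `multivariate_hypergeom_pmf` and the toy urn are hypothetical, and the transcript does not say how the urn contents were defined or how the reported p-value was accumulated from such probabilities.

```python
from math import comb, prod


def multivariate_hypergeom_pmf(drawn, urn):
    """Probability of drawing exactly drawn[i] balls of color i, without
    replacement, from an urn holding urn[i] balls of color i."""
    n_drawn, n_total = sum(drawn), sum(urn)
    ways = prod(comb(in_urn, taken) for in_urn, taken in zip(urn, drawn))
    return ways / comb(n_total, n_drawn)


# Toy urn (assumed): three colors with 2, 1 and 1 balls; two balls are drawn.
outcomes = [(2, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
probs = {o: multivariate_hypergeom_pmf(o, (2, 1, 1)) for o in outcomes}
print(probs)
```

A p-value would then sum the probabilities of all outcomes judged at least as extreme as the observed one; which outcomes count as extreme is a modelling choice not specified here.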

Nascent polypeptide-associated complex subunit alpha

Pyruvate kinase isozymes M1/M2: Exon "substitution" changes sequence in the middle of the protein. Peptide identifications provide evidence for both isoforms: 3 peptides are specific to isoform 1; 5 peptides are specific to isoform 2. Urn with balls of 63 colors for isoform 1. p-value of observed spectral counts: 2.46E-05.

Pyruvate kinase isozymes M1/M2

Summary: Systems biology requires: Experiments, Databases, Models; Informaticians and Disease Experts. Functional enrichment for proteomics needs: careful counting (generalized parsimony); differential abundance by counts (open). Balls in Urns statistical model for: differential protein and protein-set abundance; splicing detection.
