Algorithms and Computation: Bottom-Up Data Analysis Workflows
Nathan Edwards, Georgetown University Medical Center
Outline
- Changing landscape
  - Experimental landscape: spectra, sensitivity, resolution, samples
  - Computational landscape: data-size, cloud, resources, reliability
- Data-size and false positive identifications: controlling for false proteins/genes
- Improving peptide identification sensitivity: machine learning, multiple search engines
- Filtered PSMs as a primary data-type
Changing Experimental Landscape
- Instruments are faster… more spectra, better precursor sampling
- Sensitivity improvements… more fractionation (automation), deeper precursor sampling, ion optics
- Resolution continues to get better… accurate precursors (and fragments) make a big difference
- More analytical samples per study… fractionation, chromatography, automation improvements
Clinical Proteomic Tumor Analysis Consortium (NCI)
- Comprehensive study of genomically characterized (TCGA) cancer biospecimens by mass-spectrometry-based proteomics workflows
- ~100 clinical tumor samples per study: colorectal, breast, ovarian cancer
- CPTAC Data Portal provides raw & mzML spectra; TSV and mzIdentML PSMs; protein reports; experimental metadata
CPTAC Data Portal
…from Edwards et al., Journal of Proteome Research, 2015
CPTAC/TCGA Colorectal Cancer (Proteome)
- Vanderbilt PCC (Liebler): 95 TCGA samples, 15 fractions/sample
- Label-free spectral count / precursor XIC quantitation
- Orbitrap Velos; high-accuracy precursors
- 1425 spectra files: ~600 GB raw, ~129 GB mzML.gz
- Spectra: ~18M total, ~13M MS/MS
- ~4.6M PSMs at 1% MSGF+ q-value
Changing Computational Landscape
- A single computer operating on a single spectral data-file is no longer feasible; MS/MS search is the computational bottleneck
- Private computing clusters are quickly obsolete: need $$ to upgrade every 3-4 years, plus personnel costs for cluster administration and management
- Cloud computing gets faster and cheaper over time… but requires rethinking the computing model
PepArML Meta-Search Engine
- Simple, unified peptide identification search parameterization and execution: Mascot, MSGF+, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch
- Cluster, grid, and cloud scheduler: reliable batch spectra conversion and upload; automated distribution of spectra and sequence; job-failure tolerant with result-file validation
- Machine-learning-based result combining: model-free, heterogeneous features; adapts to the characteristics of each dataset
PepArML Meta-Search Engine
- Heterogeneous compute resources: Edwards Lab scheduler & 48+ CPUs, Georgetown & Maryland HPC, Amazon Web Services
- Secure communication; single, simple search request
- Under the hood, the user interacts with a scheduler, uploading spectra and specifying a meta-search. The compute clients, local or remote, contact the scheduler to get spectra and jobs to compute.
Run all of the search engines!
Search Engine Running Time
Which (combination of) search engine(s) should I use?
Fault-Tolerant Computing
- Spot instances can be preempted by those willing to pay more
- Spot prices are cheaper (7¢/hour vs 46¢/hour on-demand)
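Running on preemptible spot instances only works if every work unit can survive being killed mid-job. A minimal sketch of the retry-until-validated pattern the slide implies — the function names and the use of an exception to model preemption are illustrative assumptions, not PepArML's actual scheduler API:

```python
def run_with_retries(job, max_attempts=5):
    """Job-failure-tolerant execution: each work unit is re-run until it
    returns a validated result, so a spot preemption costs only the lost
    attempt, not the whole analysis.  `job` is any callable that may
    raise (here, RuntimeError stands in for a preemption) or return
    None (standing in for a result file that fails validation)."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = job()
            if result is not None:   # result-file validation stand-in
                return result, attempt
        except RuntimeError:         # preempted: try again on a new node
            continue
    raise RuntimeError("job failed after %d attempts" % max_attempts)
```

The key design point is that work units must be idempotent: re-running a search on the same spectra batch must be safe, which is why batches are distributed with their own sequence databases and validated on return.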
Identifications per $$
- How long will a specific job take? How much memory / data-transfer is needed?
- What is a good decomposition size? What cloud instance to use?
- Wall-clock time can be significantly reduced… but management overhead costs too, and the cost of total compute may even increase
- Failed analyses cost too!
Data-Scale and False Positives
- Big datasets have more false positive proteins and genes!
- CPTAC Colorectal Cancer (CDAP): 4.6M MSGF+ 1% FDR PSMs + 2 peptides/gene
- ~10,000 genes identified… but ~40% gene FDR
Simple Decoy Protein Model
- Decoy peptides hit decoy proteins uniformly; each decoy peptide represents an independent trial
- Binomial distribution, with number of trials = number of decoy peptides and hit probability = 1 / (protein database size)
- Big datasets have more decoy peptides!
Example
- Large: 10,000 proteins, 100,000 peptides
- Small: 1,000 proteins, 10,000 peptides
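The binomial model above can be sketched in a few lines. This assumes decoy peptides land on decoy proteins uniformly and independently, as the model states; the 1% decoy-peptide rates used in the example calls are an illustrative assumption (roughly what a 1% PSM FDR would leave), not figures from the talk:

```python
def expected_decoy_proteins(n_proteins, n_decoy_peptides):
    """Expected number of decoy proteins hit by at least one decoy
    peptide under the simple binomial model: each decoy peptide is an
    independent trial hitting a given protein with prob. 1/n_proteins."""
    p_missed = (1.0 - 1.0 / n_proteins) ** n_decoy_peptides
    return n_proteins * (1.0 - p_missed)

# Hypothetical 1% of peptides surviving the PSM filter as decoys:
large = expected_decoy_proteins(10_000, 1_000)   # ≈ 952 decoy proteins
small = expected_decoy_proteins(1_000, 100)      # ≈ 95 decoy proteins
```

Even at the same decoy-peptide *rate*, the large dataset accumulates an order of magnitude more decoy proteins in absolute terms, which is why protein- and gene-level FDR degrades as datasets grow.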
Data-Size and False Positives
- CPTAC Colorectal Cancer: 1% FDR PSMs, but ~25% peptide FDR
- ~25,000 decoy peptides hit ~20,000 genes
- Control of gene FDR requires even more stringent filtering of PSMs
- If we require strong evidence in all 95 samples: no decoy genes, but fewer than 1,000 genes identified
- Bad scenario: PDHA1 and PDHA2 in CPTAC Breast Cancer — shared and unique peptides; PDHA2 is testis-specific!
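Gene-level FDR can be estimated by carrying decoys all the way through the gene-inference step and counting decoy genes that survive the same filter as targets. A sketch of that bookkeeping, assuming a simple 2-peptides-per-gene rule — this is illustrative, not the CDAP pipeline's exact logic:

```python
from collections import defaultdict

def gene_level_fdr(psms, min_peptides=2):
    """Estimate gene-level FDR by decoy counting.

    `psms` is an iterable of (gene, peptide, is_decoy) tuples from an
    already PSM-level-filtered search.  A gene is 'identified' when it
    has at least `min_peptides` distinct peptides; gene FDR is then
    estimated as (decoy genes passing) / (target genes passing)."""
    peptides = defaultdict(set)
    is_decoy = {}
    for gene, peptide, decoy in psms:
        peptides[gene].add(peptide)
        is_decoy[gene] = decoy
    passing = [g for g, peps in peptides.items() if len(peps) >= min_peptides]
    n_decoy = sum(1 for g in passing if is_decoy[g])
    n_target = len(passing) - n_decoy
    return n_decoy / max(n_target, 1), n_target, n_decoy
```

The slide's point falls out of this arithmetic: 1% of 4.6M PSMs is tens of thousands of decoy PSMs, and once those concentrate onto ~20,000 genes, even a 2-peptide rule leaves a large decoy-gene count relative to the ~10,000 target genes.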
Improved Sensitivity
- Machine-learning models: use additional metrics for good identifications
- Combining multiple search engines: agreement indicates good identifications
- Both approaches are successful at boosting identifications, particularly when adaptable to each dataset
- Watch for the use of decoys in training the model
- Both have scaling issues and lack transparency… and may add noise to comparisons
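The intuition behind combining search engines can be shown with simple agreement voting — a deliberately crude stand-in for PepArML's machine-learning combiner, included only to make concrete why cross-engine agreement signals a good identification:

```python
from collections import Counter

def consensus_psms(engine_results, min_agree=2):
    """Model-free combining by engine agreement: keep (spectrum, peptide)
    pairs reported by at least `min_agree` engines.

    `engine_results` maps an engine name to its list of (spectrum,
    peptide) identifications.  Random (incorrect) matches rarely agree
    across engines with different scoring models, so agreement enriches
    for correct identifications."""
    votes = Counter()
    for engine, psms in engine_results.items():
        for pair in set(psms):        # one vote per engine per pair
            votes[pair] += 1
    return {pair for pair, n in votes.items() if n >= min_agree}
```

A real combiner (as in PepArML) replaces the hard vote with a classifier over heterogeneous per-engine features, retrained per dataset — which is also where the slide's warnings apply: decoys used in training can no longer give an unbiased FDR estimate, and the learned model is harder to inspect.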
PepArML Performance
[Figure: Standard Protein Mix Database — 18 standard proteins, Mix 1; LCQ, QSTAR, and LTQ-FT instruments]
Search Engine Info. Gain
Precursor & Digest Info. Gain
Filtered PSMs as Primary Data
- For large enough spectral datasets, we might choose best-effort peptide identification
- Filtered PSMs become primary data; spectral counts become more quantitative
- Need a linear-time spectra → PSM algorithm; we work less hard to identify all spectra?
- Output as genome alignments, BAM files?
- How should PSMs be represented to maximize their utility? What about decoy peptide identifications?
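Treating filtered PSMs as the primary data-type makes downstream quantitation a simple aggregation. A minimal sketch of deriving spectral counts from a filtered PSM table — the `protein` and `qvalue` field names are hypothetical stand-ins, not the actual CDAP TSV column names:

```python
from collections import Counter

def spectral_counts(psm_rows, qvalue_cutoff=0.01):
    """Turn a filtered PSM table into per-protein spectral counts.

    `psm_rows` is an iterable of dicts, one per PSM, with (assumed)
    'protein' and 'qvalue' keys.  Each PSM passing the q-value cutoff
    adds one count to its protein; the resulting counts serve as a
    label-free quantitative measure."""
    counts = Counter()
    for row in psm_rows:
        if float(row["qvalue"]) <= qvalue_cutoff:
            counts[row["protein"]] += 1
    return counts
```

This single pass over PSMs is linear in the number of spectra, which is the property the slide asks of the whole identification pipeline once filtered PSMs, rather than raw spectra, are the unit of exchange.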
[Figure: Nascent polypeptide-associated complex subunit alpha]
[Figure: Pyruvate kinase isozymes M1/M2 — 2.5 × 10⁻⁵]
Questions?