NMRbox Data-as-a-Service: Overview. Synergy between BMRB and CONNJUR: BMRB handles data archival and retrieval (among other things); CONNJUR's goal is software integration. They share the task of data management and interchange. (Diagram labels: data archival and retrieval; software integration; data interchange; Projects; Analysis-as-a-service.)
Objectives: 1. CONNJUR: capture metadata to save the state of an NMR study. 2. CONNJUR as a deposition engine for BMRB. 3. Machine-to-machine (M2M) communication services between NMRbox and BMRB. The four aims of TRD2 and how they (a) are related and (b) unify the missions of CONNJUR and BMRB.
Approach: CONNJUR. Workflow Builder: graphical software integration platform for spectral reconstruction. Spectrum Translator: command-line tool for translating time- and frequency-domain data; an integral component of Workflow Builder. Sparky "R" Extension: annotation for reproducibility. NMR-STAR Parser: translation tool. CONNJUR Database: MySQL database managing datasets used by Workflow Builder.
Approach: BMRB. Application Program Interface (API): allows software access to the BMRB database, for both data retrieval and deposition. Data Format Translators: CONNJUR, NMR-STAR, XML, JSON, NEX. Data Analysis & Visualization: DEVise visualization tool, libraries in the R language, validation tools. Deposition Engine: CONNJUR integration; automatic gathering and deposition of data and important metadata, including workflow specs.
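The format-translator idea on this slide can be illustrated with a minimal Python sketch. It assumes a heavily simplified NMR-STAR fragment: only free `_Category.Tag value` pairs are read, and loops, multi-line values, and comments are ignored. The function name `star_tags_to_dict` and the sample text are hypothetical and are not part of any BMRB or CONNJUR tool.

```python
import json


def star_tags_to_dict(text):
    """Collect free `_Category.Tag value` pairs into {category: {tag: value}}.

    Simplification (assumption): loop_ blocks, semicolon-delimited
    multi-line values, and comments are skipped, not parsed.
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # save_ markers, blank lines, loop data rows
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a bare tag name, e.g. inside a loop_ header
        name, value = parts[0], parts[1].strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        category, _, tag = name[1:].partition(".")
        result.setdefault(category, {})[tag] = value
    return result


# Toy saveframe for illustration only.
SAMPLE = """
save_sample_conditions_1
   _Sample_condition_list.Sf_category   sample_conditions
   _Sample_condition_list.ID            1
   _Sample_condition_list.Details       'toy example'
save_
"""

parsed = star_tags_to_dict(SAMPLE)
as_json = json.dumps(parsed, indent=2)
```

A production translator would need a full NMR-STAR grammar; this sketch only shows the shape of a tag-to-JSON mapping.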
Workflow Builder
Approach: NMRbox M2M data exchange. Diagram components: NMRbox user; CONNJUR data harvester; CONNJUR database; auto-query generator; API (query, response); BMRB servers. Data stages: time-domain and other files; spectral processing; peak lists; auto assignments; restraints; structure models. Tools: NMR spectrometer, NMRPipe, Sparky, ABACUS, TALOS+, CNS.
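As a rough illustration of what an auto-query generator might do, the sketch below turns a harvested metadata record into a query URL. Everything specific here is an assumption made for the example: the endpoint `BASE_URL` is a placeholder, the field names `sequence` and `software` are hypothetical, and none of this reflects the actual BMRB API.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption); not a real BMRB service address.
BASE_URL = "https://example.org/m2m/query"


def build_query(harvested, fields=("sequence", "software")):
    """Build a query URL from the harvested fields that have a value."""
    params = {key: harvested[key] for key in fields if harvested.get(key)}
    # Sort so the same metadata always yields the same URL.
    return f"{BASE_URL}?{urlencode(sorted(params.items()))}"


# Hypothetical harvested record; the empty field is left out of the query.
record = {"sequence": "MKT", "software": "NMRPipe", "ph": None}
url = build_query(record)
```

Sorting the parameters keeps queries deterministic, which makes repeated machine-to-machine requests easier to cache and compare.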
Content Harvesting for Deposition. Diagram components: NMRbox user; CONNJUR workflow manager; CONNJUR data harvester; deposition constructor; API; BMRB; wwPDB; DRCC. Data stages: time-domain and other files; spectral processing; peak lists; auto assignments; restraints; structure models. Tools: NMR spectrometer, NMRPipe, Sparky, ABACUS, TALOS+, CNS.
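One way to picture the deposition constructor is as a step that groups harvested files by pipeline stage. The sketch below assumes that each tool on the slide pairs with the data stage in the same position in the diagram; that pairing, the function name `build_manifest`, and the file paths are illustrative assumptions, not the CONNJUR implementation.

```python
from collections import defaultdict

# Assumed tool-to-stage pairing, inferred from the order of the slide labels.
STAGE_BY_TOOL = {
    "NMR spectrometer": "Time-domain and other files",
    "NMRPipe": "Spectral processing",
    "Sparky": "Peak lists",
    "ABACUS": "Auto assignments",
    "TALOS+": "Restraints",
    "CNS": "Structure models",
}


def build_manifest(harvested):
    """Group (tool, path) pairs by stage; unknown tools go to 'Unclassified'."""
    manifest = defaultdict(list)
    for tool, path in harvested:
        stage = STAGE_BY_TOOL.get(tool, "Unclassified")
        manifest[stage].append(path)
    return dict(manifest)


# Hypothetical harvested files.
files = [
    ("NMRPipe", "hsqc.ft2"),
    ("Sparky", "hsqc.list"),
    ("Sparky", "noesy.list"),
    ("OtherTool", "notes.txt"),
]
manifest = build_manifest(files)
```

Keeping unrecognized tools in an explicit "Unclassified" bucket means nothing harvested is silently dropped before deposition.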
NMRbox/CONNJUR Deposition Service. Diagram labels: NMR-STAR; CONNJUR; raw data; spectral data; derived data; data annotation; structure and related data; dynamics; chemistry; interactions; metabolomics results.
Approach: NMRbox Data Mining – BMRB Archive Content. Biological NMR & supplemental data. Metadata: chemical structure, natural source, sample, experimental detail. Imported data: coordinates, restraints, phi-psi angles. Validation results: LACS, AVS, PANAV, SPARTA+, CING, MolProbity. Derived data: back-calculated chemical shifts, BLAST alignments. Data interpretation: citations. External data links: PDB, UniProt, KEGG, PubChem.
Approach: NMRbox BMRB Data Mining. Exploring the BMRB archive for new knowledge. Expose the BMRB relational database and additional value-added data for query and analysis from within the NMRbox platform. Develop information search and analysis tools that encompass the breadth of the BMRB archive. Brief general examples: (1) prediction and analysis of intrinsically disordered protein conformational space from NMR spectral parameters and derived data; (2) search for links between NMR parameters, low-population biopolymer conformers, and biopolymer interactions with other biopolymers and ligands; (3) extraction of RNA chemical shifts and statistics to improve automated chemical shift assignment methods and structure analysis; (4) integration of molecular dynamics simulations with NMR experimental results to understand biopolymer conformational sampling.
Data mining and visualization on BMRB – R libraries: CA-CB chemical shift distribution in BMRB per residue
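The per-residue statistics behind such a plot can be sketched in a few lines. The slide refers to R libraries; the example below uses Python instead and does not reproduce them. The function name `shift_stats` is hypothetical, and the shift values are placeholder numbers chosen for illustration, not BMRB statistics.

```python
from collections import defaultdict
from statistics import mean, stdev


def shift_stats(records):
    """Return {(residue, atom): (mean_ppm, stdev_ppm)} for chemical shifts.

    The standard deviation is reported as 0.0 when only one value exists.
    """
    grouped = defaultdict(list)
    for residue, atom, ppm in records:
        grouped[(residue, atom)].append(ppm)
    return {
        key: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for key, values in grouped.items()
    }


# Placeholder records (residue, atom, shift in ppm); not real archive data.
toy_records = [
    ("ALA", "CA", 52.0),
    ("ALA", "CA", 54.0),
    ("ALA", "CB", 19.0),
]
stats = shift_stats(toy_records)
```

The resulting dictionary could then be passed to any plotting tool to draw the CA and CB distributions residue by residue.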
Data mining and visualization on BMRB – R libraries Comparing HSQC spectra for homologous entries
Data mining and visualization on BMRB – DEVise Comparing HSQC spectra for homologous entries
Impacts (CONNJUR). 1. Additional metadata is critical to fostering reproducibility. It also serves the dual purpose of allowing us to populate new instances of NMRbox. 2. CONNJUR eases the burden on the NMR community of submitting data to the BMRB. Because CONNJUR can track larger amounts of intricate data than a spectroscopist is likely to be willing to provide, BMRB depositions will be more complete.
Impacts (BMRB). 1. BMRB content relevant to NMRbox users, and possibly unknown to them, will be exposed and presented without requiring user training or knowledge of the BMRB archive architecture or content. 2. New, possibly unexpected correlations between NMRbox user data and the full BMRB archive (experimental, derived and/or predicted, validation, and other kinds of data) will be revealed. 3. Workflow and preservation metadata will be archived for reproducibility.
Thank you! Any questions?
Data mining and visualization on BMRB – R libraries TOCSY EXAMPLE
Personnel. Sites: UConn Health; Wisconsin. Components: Admin, Infra, Train, Dissem, CS, DBPs, TRD1. People: Hoch, Maciejewski, Schuyler, Gryk, Ulrich, Eghbalnia, Gilman, Gorbatyuk, Moraru, Livny, Maziuk, TBN, TBN1, TBN2, TBN3, TBN4, TBN5.
Metadata Examples for M2M and Data Mining Applications. Application categories: Mining; Selection; Best practices; Descriptive. Metadata items: biopolymer sequence; natural source, including location; intermediate data (restraints, chemical shifts, peak lists); value-added data (secondary structure elements, physical properties, etc.); sample conditions (pH, temperature, pressure, ionic strength); validation report content; user process annotations; software application parameter files; pulse programs; spectrometer field strength; sample contents (buffers, salts, stabilizing agents, others); author names; keywords; user text annotations.
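To show how sample-condition metadata might support selection, here is a minimal sketch of a record type and a filter. The class name `SampleConditions`, its field names and units, the function `select_by_ph`, and the entry identifiers are all hypothetical; the values are placeholders, not archive data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleConditions:
    """Sample conditions named on the slide; any field may be missing."""
    entry_id: str
    ph: Optional[float] = None
    temperature_k: Optional[float] = None
    pressure_atm: Optional[float] = None
    ionic_strength_m: Optional[float] = None


def select_by_ph(entries, low, high):
    """Keep entries whose pH is recorded and lies within [low, high]."""
    return [e for e in entries if e.ph is not None and low <= e.ph <= high]


# Placeholder entries for illustration.
entries = [
    SampleConditions("entry_a", ph=6.5, temperature_k=298.0),
    SampleConditions("entry_b", ph=4.0),
    SampleConditions("entry_c"),  # pH not recorded
]
near_neutral = select_by_ph(entries, 6.0, 7.5)
```

Treating a missing pH as "do not select" is one possible design choice; a mining tool could instead report such entries separately.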
Personnel
Name | Effort | Role
Gryk | 2.4 | Co-leader of TRD2; extend CONNJUR data model
Ulrich | 0.84 |
Livny | 0.24 | Collaborator, systems design
TBN1 | 9.6 | Application architect; CONNJUR software components; Query Engine design
Maziuk | 1.2 | Systems administration
TBN3 | 8.4 | Researcher/programmer; BMRB software components
TBN5 | 6 | Programmer
CONNJUR Schema Expansion (Aim 2.1). Current CONNJUR strengths: spectrometers, pulse programs, parameters, output data, processing software. Current NMR-STAR strengths: citation, molecular system, sample, conditions, spectral data, derived data. Current NEF strengths: structure software, input restraints data parameters. Target: fully extended CONNJUR schema.
NMR Computational Pipeline. Four broad phases of computation: 1. Spectrometer acquisition; 2. Spectral reconstruction; 3. Spectral analysis; 4. Biophysical characterization. The first takes place on the spectrometer; we don't touch that. The second is handled by CWB (CONNJUR Workflow Builder). The third and fourth are the realm of peak lists, resonances, and spin systems: semi-automated peak picking, assignment, NOE assignment, and structure determination. (Figure annotations: L10, A5, < 5 Å.)