Introduction to NMRbox Project and NMRbox Virtual Machine
Mark Maciejewski, UConn Health
Thanks to the organizers for giving me a chance to speak today. The title is “Towards reproducible computation for NMR with NMRbox”. The talk focuses primarily on a new virtual machine we are building, pre-configured with software used in all aspects of NMR data processing and analysis.
“Think inside the box”
Outline
Lecture: Motivation for the project; NMRbox platform; Benefits to users and developers; Usage
Hands on (at the beginning of the tutorial session) - Adam: Account management; Connect to NMRbox VM; Set the display resolution; Software inventory; File transfers
The NMRbox project will deliver…
NMRbox VM – A virtual machine pre-configured with a range of software used in biological NMR.
Will provide access to significant computational resources through individual VMs and computational clusters.
Will provide advanced training material, which in many cases will be integrated into the VM.
The platform will have BMRB integration, annotation of workflows, and interoperability between different NMR software packages.
In addition, Bayesian tools will be incorporated into some existing NMR software packages, and an API will be developed for developers to incorporate Bayesian inference into their packages.
Motivation: Abundance of software
Hundreds of packages cited in BioMagResBank depositions, J. Biomol. NMR, and other journals.
The figure shows a weighted “word cloud” based on software frequency in the BMRB, which lists ~120 packages. As part of the NMRbox project we have identified over 200 packages by looking through J. Biomol. NMR, the BMRB, and simple web searches.
Motivation: Fragmentation
Operating systems; Programming languages; Libraries (BLAS)
Another motivation is the fragmentation of platforms: several different operating systems, programming languages, and libraries. This places an enormous burden on software developers and on end users attempting to install software, especially those who are not computational experts.
Motivation: Persistency
Platforms become obsolete; Developers graduate; Grants end; Software time bombs
Many NMR software packages lack persistency for a variety of reasons. Platforms become obsolete. Graduate students or other developers move on and leave the lab. Grants end. This leaves many programs “as-is” and makes them difficult to keep running on newer platforms.
Software time bombs: to push users to keep their software up to date and avoid issues as platforms and operating systems evolve, developers sometimes put time bombs in their software. While this helps to a degree for actively developed packages, it means old versions of the software do not persist, and it can lead to problems if the developer ends support.
Motivation: Meta-software packages
SHIFTX2; Sparky; Rosetta; MODELLER; NMRPipe; Python scripts
Another motivation for NMRbox is the growing number of meta-packages, such as a new software package called Compass from Chad Rienstra’s lab. This program attempts to predict the structure of a protein from solid-state spectra. Compass itself is developed as Python scripts, but the workflow relies on NMRPipe, Sparky, Rosetta, SHIFTX2, and MODELLER. Not only does the end user need to install Compass, they also need to install the dependent programs and configure Compass to match how those packages are installed. Compass will undoubtedly rely on certain versions of these ancillary programs, which can lead to issues for end users. Combined, these issues make it very difficult for non-experts to use NMR software and add to an overly high activation barrier for a researcher to dive into NMR.
Experimental protein structure verification by scoring with a single, unassigned NMR spectrum. Courtney, Rienstra, et al., Structure, 2015.
Motivation: Computational reproducibility
A computational study is reproducible when it provides the “complete software environment needed to reproduce the figures” - D. Donoho, Stanford
Obstacles: Missing primary empirical data; Missing meta-data; Missing software (scripts, programs); Non-persistence of software; Manual interventions
Read from slide through obstacles
Challenges
Abundance of software (discovery); Fragmentation of OSes, programming languages, libraries; Persistence of resources; Complexity of design and installation; Reproducibility of results
Question: How do we address these challenges?
Answer: NMRbox VM
Deliverables – primary tools
Platform: NMRbox VM, a virtual machine pre-configured with a wide range of software used in biological NMR; Significant computational resources
Data: BMRB integration & richer depositions; Metadata management and workflow annotation
Analytics: Bayesian tools to enhance data analysis and interpretation; API for developers to incorporate Bayesian inference
Deliverables – community services
Training and Dissemination: Workshops, tutorials, and guides; User and developer support
Driving Biological Projects (DBPs): Test beds for NMRbox technology development; What limits your progress?
Collaboration and Service (C&S): Apply technologies to challenging biomedical research problems
NMRbox VM. What’s included?
Acquisition: Agnostic – Install all software available
Access: Persistent – Archive all versions
Content – Software packages: 100+ packages installed (see https://nmrbox.org)
Spectral reconstruction; Spectral visualization; Automated assignment; Structure determination; Molecular visualization; Validation; Chemical shift prediction; Dynamics; Residual dipolar coupling; Meta packages; General purpose; Instrument manufacturers
Read the slide
Note on Agnostic – We are trying to have VMs with a wide range of software. We will work hard to enhance the workflows of the most used software, but at the same time allow everyone access to their “favorite” software. There have also been some efforts lately by developers to release their software as a VM for easier installation, such as NMRPipe. The issue then is that you would have a separate VM for each software package. We hope to have everything under a single umbrella.
NMRbox VM. What’s included?
Content – Productivity Tools
OS: Xubuntu 16.04; over a dozen editors; scientific Python packages; R and R tools; office tools; drawing tools; Octave; shells; browsers; Dropbox
Read the slide
NMRbox VM. What’s included?
Release 3 features added
GPUs to support 3D drawing: PyMOL, VMD, Chimera, and others
GPUs to support CUDA processing: NAMD, others coming soon
Commercial software: dataChord spectrum Analyst, dataChord spectrum Miner, MestReNova
Matlab compiled binaries: ALATIS, GUARDD, TITAN
Virtual on-screen keyboard
See Release notes at https://nmrbox.org/files/release-notes-version-3-0.pdf
Read the slide
Virtual Machine
Terminology: VM = A software-based emulation of a guest computer backed by the physical resources of a host computer, managed by a hypervisor.
Access: Local installation (standalone or downloadable); Connect to server (PaaS = Platform-as-a-Service)
Advantages: Over-subscribe the host computer; Snapshot the VM and restore to any point; Run multiple OSes on a single computer; “Spin-up” VMs in minutes; Dynamically load balance VMs across multiple hosts; No performance penalties on modern computers
Read the slide
Standalone NMRbox VM
[Diagram labels: host computer; hypervisor; NMRbox (guest); shared folder; OS / NMR software; user accounts]
Just to get a feel for how an end user would interact with a downloadable VM, here is a short animation. The user would download a hypervisor software package such as VirtualBox, then download NMRbox, start the hypervisor, and import the NMRbox VM. Essentially the user would have a fully functional OS pre-configured with a wide variety of software used in NMR data processing and analysis. The user would then need to get their data into the VM.
Data is a bit trickier with a local VM. Your data can reside in a virtual disk (however, this is a single flat file to the host OS and can be dangerous). Shared folders work great, but configuring the hypervisor to access them can be tricky at times. USB drives or file servers are the best option but require additional hardware. These issues are resolved with a PaaS version of the VM.
PaaS NMRbox VM
[Diagram labels: Remote Users; Authentication Server; VM host server with NMRbox VM - 1 and NMRbox VM - 2 (each with CPU, RAM, NIC); High Performance Storage; Cloud Storage; user home folders; user data; backups; NMR Software; OS Files]
In the PaaS version each user will have their own NMRbox VM spun up on our servers. They will access the VM with a full GUI via RealVNC, or via ssh for advanced users. A key point is that user storage and authentication are separated from the VMs, allowing seamless migration as new versions of the NMRbox VM are released and allowing users to go back to older versions if needed.
PaaS deployed with enterprise-class resources
100 Gb/s network; 12 VM servers; 480 cores; 3.8 TB memory; Redundant internal network; Network attached storage; 100’s of TBs available to NMRbox; Ultra reliable cloud storage in excess of PB; 38 NVIDIA GPUs dramatically increasing graphic performance & CUDA processing
The NMRbox VM is being deployed at UConn Health with enterprise-level hardware. The research network has a 100 Gb/s connection to our ISP. That feeds into a 40 Gb/s network fabric connecting all the switches in the datacenter. VM hosts and compute clusters are connected via 10 Gb/s connections, with a separate dedicated 10 Gb/s connection to storage. The VM hosts will run the NMRbox VMs for individuals and developers. Users’ home folders and the files needed to run the VMs are on fast storage with performance similar to a local SSD. We also have access to cloud storage for backups and extra space for user data. The university has a 3 PB geo-dispersed storage system that continues to grow and offers unmatched reliability. It is currently configured for 15 nines of reliability.
Users will connect via ssh or RealVNC. RealVNC offers several benefits: full GUI; free and runs on all devices; everything is encrypted; built-in file transfer for those not comfortable with scp; maps your local printer; runs in daemon mode, so users just connect and do nothing else.
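To put the hardware figures above in perspective, here is a small back-of-the-envelope sketch. It assumes, hypothetically, that the 480 cores and 3.8 TB of memory are totals spread evenly over the 12 VM servers, that 1 TB is counted as 1000 GB, and that "15 nines" means a reliability of 1 - 10^-15. None of these assumptions is stated in the talk, and the function names are illustrative only.

```python
from fractions import Fraction

# Figures quoted on the slide.
VM_SERVERS = 12
TOTAL_CORES = 480
TOTAL_MEMORY_TB = Fraction(38, 10)  # 3.8 TB, kept exact


def per_host_average(total, hosts=VM_SERVERS):
    """Average share per host, assuming (hypothetically) identical hosts."""
    return Fraction(total) / hosts


def loss_probability(nines):
    """Failure probability implied by a reliability written as `nines` nines."""
    return Fraction(1, 10 ** nines)


cores_per_host = per_host_average(TOTAL_CORES)
# Decimal units assumed: 1 TB = 1000 GB.
memory_per_host_gb = per_host_average(TOTAL_MEMORY_TB * 1000)

print(cores_per_host)                       # 40
print(round(float(memory_per_host_gb), 1))  # 316.7
print(float(loss_probability(15)))          # 1e-15
```

Exact fractions are used so that a value as small as 10^-15 is not blurred by floating-point rounding until it is printed.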
VM Requirements for Users
Standalone VM: 64-bit hardware (Windows, OSX, Linux,…); any modern laptop or desktop
Hypervisors: Oracle VirtualBox; VMware Workstation; VMware Fusion; VMware Player
Server based PaaS VM: ssh or VNC (Windows, OSX, Linux, tablet, phone, 32-bit hardware, …); Network connection
Benefits
Users: “Zero-configuration”; Access; Training; Computational resources; Discovery; Persistence; Reproducibility; Cost
Developers: Single platform; Discovery; Usage metrics; Persistence; Community; Developer tools; Computational resources
Instructors: Access to NMRbox VMs for courses and workshops
Practical aspects
Large VM model: NMRbox VMs configured with many cores, high memory, and GPUs. Multiple users per VM; each user has two VMs (username.nmrbox.org and username2.nmrbox.org). CPU and memory utilization restricted to 50% of full VM. GPUs restrict VM management.
Updates: Additional software will be added to “live” NMRbox VMs. Version numbers updated. All states archived. Software versions updated on major releases. Older major VM releases continue to run with reduced resources at version.nmrbox.org.
Backups: User data backed up daily.
Practical aspects
Large memory VM: A large memory VM can be “spun-up” for users if needed.
Home folder and archive folder: Each user has two home folders: /home/nmrbox/username and /nmr/archive/username
Google Group: We have started a Google Group at https://groups.google.com. Search for NMRbox to join.
Support: Email support@nmrbox.org
Downloadable version: Downloadable version in final testing.
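The per-user naming conventions from the practical-aspects slides (two folders per user, and two VM names per user under the large VM model) can be sketched as follows. This is only an illustration that treats "username" in the slide text as a placeholder to be substituted. The helper names `user_folders` and `user_vm_hosts` and the example account `jdoe` are hypothetical and not part of NMRbox.

```python
from pathlib import PurePosixPath


def user_folders(username):
    """Return the two per-user folders named on the slide."""
    return {
        "home": PurePosixPath("/home/nmrbox") / username,
        "archive": PurePosixPath("/nmr/archive") / username,
    }


def user_vm_hosts(username):
    """Return the two VM host names, assuming 'username' is a literal placeholder."""
    return [f"{username}.nmrbox.org", f"{username}2.nmrbox.org"]


# Example with a made-up account name.
folders = user_folders("jdoe")
hosts = user_vm_hosts("jdoe")
print(folders["home"])
print(folders["archive"])
print(hosts)
```

`PurePosixPath` is used so the sketch builds Linux-style paths even when it is run on a Windows or macOS client.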
Practical aspects
Host workshops with NMRbox VMs: The NMRbox team will “spin-up” custom VMs to support other workshops.
File permissions and access: Home and archive folders are not accessible to others by default. We will set up lab groups if desired. /public folder for quick sharing.
Contact us: Suggestions for packages to include; Suggestions about the package; Issues with the NMRbox platform
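One way a user might confirm that a folder is closed to other accounts is to inspect its POSIX permission bits. The sketch below assumes that access is governed by ordinary Unix mode bits (the talk does not say whether ACLs or another mechanism are used) and that "private" means no group or other permissions at all. `mode_is_private` and `path_is_private` are hypothetical helper names.

```python
import os
import stat


def mode_is_private(mode):
    """True when neither the group nor other users have any permission bit set."""
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def path_is_private(path):
    """Apply the same check to a real file or folder on disk."""
    return mode_is_private(os.stat(path).st_mode)


# Checks on raw mode values, independent of any real folder.
print(mode_is_private(0o700))  # True: owner only
print(mode_is_private(0o750))  # False: group can read and enter
print(mode_is_private(0o755))  # False: everyone can read and enter
```

If a lab group were set up, the group bits would be expected to be non-zero, so this strict check would report such a folder as not private.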
NMRbox Usage 500+ Users
NMRbox Usage
package total_runs total_users
rnmrtk 41846863 69
nmrpipe 7104050 186
shiftx2-v110-linux-20160912 482070 3
amber16 215403 24
openbabel-2.4.1 105769 6
hmsIST 101733 44
nmr-scripts 66566 141
cns_solve_1.3 62756 67
mddnmr 51360 62
cns_solve_1.21 28272
nustool 19380 65
xplor-nih-2.43 12322 35
rosetta 7076 28
nmrfam-sparky 5285 97
namd_gpu 4556
namd_cpu 3121 9
shifts-5.1 2119 37
connjurst 2027 56
ensemble 1698
ccpnmr 1197
xplor-nih-2.45 1113 7
molmol 968 34
modelfree 873 16
NMRViewJ 621 57
aria2.3 614
vmd 611 61
NMRFxProcessor 486
Redcat 334 4
chimera 301 21
relax 291 27
pymol-1.8.2.1 262 54
redcraft 251 13
pymol-1.8.6.0 228 26
flexible-meccano 211 12
fmcgui2.5_linux 189 16
TENSORV2_PC9 167 24
cyana-3.97 166
glove 142 11
camera 111
cara 83
INCHI-1 78 14
nmr_wash-1.0.0-linux 68 15
cpmg_fitd9 66 21
pales 63 7
ponderosa 60
nestanmr 57
TREND-1.0 52 8
tinker 48 6
ALATIS 43 17
GISSMO 41
rnmr 37
fastmodelfree 33 5
MestReNova 29 9
ssp 4
BMRB-CS-Rosetta-Submission
topspin 25
nessy
adapt_nmr_enhancer
azara-2.8
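The usage figures can be read in more than one way. As a minimal sketch, the snippet below takes four rows copied from the table (package, total_runs, total_users), skips any row that does not have all three columns, and ranks the packages by number of users while also reporting runs per user. `UsageRow`, `parse_rows`, and `runs_per_user` are hypothetical names chosen for this illustration.

```python
from typing import NamedTuple


class UsageRow(NamedTuple):
    package: str
    total_runs: int
    total_users: int


# Four rows copied from the usage table above.
SAMPLE = """\
rnmrtk 41846863 69
nmrpipe 7104050 186
nmr-scripts 66566 141
nmrfam-sparky 5285 97
"""


def parse_rows(text):
    """Parse whitespace-separated rows, skipping rows with a missing column."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        name, runs, users = fields
        rows.append(UsageRow(name, int(runs), int(users)))
    return rows


def runs_per_user(row):
    """Average number of runs per user for one package."""
    return row.total_runs / row.total_users


rows = parse_rows(SAMPLE)
by_users = sorted(rows, key=lambda r: r.total_users, reverse=True)
for row in by_users:
    print(f"{row.package:15s} {row.total_users:4d} users "
          f"{runs_per_user(row):12.1f} runs/user")
```

Ranking by users rather than runs changes the picture: rnmrtk has by far the most runs, but nmrpipe, nmr-scripts, and nmrfam-sparky each have more users.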
Cite NMRbox
Very Important!! If you utilize NMRbox in your research please cite and acknowledge us. Details at https://nmrbox.org
NMRbox: A Resource for Biomolecular NMR Computation. Maciejewski, M.W., Schuyler, A.D., Gryk, M.R., Moraru, I.I., Romero, P.R., Ulrich, E.L., Eghbalnia, H.R., Livny, M., Delaglio, F., and Hoch, J.C., Biophys J., 112: 1529-1534, 2017. [PMID: 28445744, DOI: 10.1016/j.bpj.2017.03.011]
"This study made use of NMRbox: National Center for Biomolecular NMR Data Processing and Analysis, a Biomedical Technology Research Resource (BTRR), which is supported by NIH grant P41GM111135 (NIGMS)."