Challenges and Solutions
Will Schroeder, Co-Founder, President
VAC Big Data Consortium Meeting, July 31, 2012
Thanks
Big Data Architecture Platform Collaboration
Kitware, Inc.: Open Source Scientific Computing Software; Software Services
Kitware CMake CDash ParaView
Other Kitware Big Data Projects
–HPC - Simulation: Turbulent Flow; Kitware ParaView; 160,000 Computing Cores; Argonne Intrepid
–BioMedical: Electron Scanning Microscopy Connectome, resolution towards 100,000² x 10,000; Whole Slide Imaging / Digital Pathology, resolution at 100,000² x hundreds (image: nimh.nih.gov)
–Point Clouds: LIDAR, acquisition rates > 200,000 pts/sec; Kitware VTK / PCL / VES (image: 3deling.com)
–Text & Documents: Web, > 8 billion indexed pages; Kitware / VTK / Titan
Columbus Large Image Format (CLIF) 2007
–8k x 8k tiled image (64 MP)
–Six cameras with 4k x 2.6k images
–8-bit grayscale raw format
–Frame rate ~1.6 Hz
–15-30 cm GSD
–Duration ~2.8 hrs (16117 frames) in 2007; ~1 hr in 2006
–Metadata
–Camera configuration
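The CLIF figures above imply a data volume that a few lines of arithmetic make concrete. The sketch below is a back-of-envelope estimate, not part of the original slides; it assumes that "64 MP" means 64e6 pixels per tiled frame and that 8-bit raw grayscale is stored as one uncompressed byte per pixel. The function name `clif_estimates` is hypothetical.

```python
# Back-of-envelope data volume for the 2007 CLIF collection.
# Assumptions (not stated on the slide): "64 MP" means 64e6 pixels per
# tiled frame, and 8-bit raw grayscale means 1 uncompressed byte per pixel.
PIXELS_PER_FRAME = 64e6
BYTES_PER_PIXEL = 1
FRAME_RATE_HZ = 1.6
FRAMES_2007 = 16117


def clif_estimates(pixels=PIXELS_PER_FRAME, bytes_per_pixel=BYTES_PER_PIXEL,
                   rate_hz=FRAME_RATE_HZ, frames=FRAMES_2007):
    """Return rough duration, sustained data rate, and total size."""
    bytes_per_frame = pixels * bytes_per_pixel
    return {
        "duration_hours": frames / rate_hz / 3600,
        "rate_mb_per_s": bytes_per_frame * rate_hz / 1e6,
        "total_tb": bytes_per_frame * frames / 1e12,
    }


if __name__ == "__main__":
    for name, value in clif_estimates().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the duration comes out close to the 2.8 hours quoted on the slide, with a sustained rate of roughly 100 MB/s and about 1 TB of raw imagery for the 2007 collection.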
SCALABLE ARCHITECTURES
–Data-Centric Computing
–Client-Server
–Co-Processing
–Mobile to Supercomputer
The Traditional Visualization Workflow is Breaking Down
[Figure: Solver, Full Mesh, Disk Storage, Visualization. Image from Rob Ross, Argonne National Laboratory.]
Small Example
Simulation
–40 million finite elements
–File size: 3.2 GB per time step
–1000 time steps; 100 time steps written to disk
Visualization
–ParaView on a quad-core Mac Pro with 12 GB memory
–IO: 240 secs
–Contour: 25 secs
–Slice: 7 secs
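The numbers in this example can be turned into a rough IO budget. The sketch below is illustrative only; it assumes the IO, contour, and slice timings each apply to a single time step read from disk, and the function name `io_budget` is hypothetical.

```python
# Timings and sizes copied from the "Small Example" slide.
# Assumption: the IO, contour, and slice timings each apply to one
# time step loaded from disk.
STEP_SIZE_GB = 3.2
STEPS_COMPUTED = 1000
STEPS_WRITTEN = 100
IO_S, CONTOUR_S, SLICE_S = 240.0, 25.0, 7.0


def io_budget():
    """Summarize how much of the per-step cost is IO and what is lost."""
    per_step_total = IO_S + CONTOUR_S + SLICE_S
    return {
        # Share of per-step wall time spent reading data.
        "io_fraction": IO_S / per_step_total,
        # Effective read throughput implied by the timings.
        "read_mb_per_s": STEP_SIZE_GB * 1000 / IO_S,
        # Disk footprint of the steps actually written.
        "stored_gb": STEP_SIZE_GB * STEPS_WRITTEN,
        # Footprint if every computed step had been written.
        "full_output_gb": STEP_SIZE_GB * STEPS_COMPUTED,
        # Time just to read back every stored step once.
        "io_hours_for_stored_steps": IO_S * STEPS_WRITTEN / 3600,
        # Share of computed steps that never reach post-processing.
        "steps_discarded_fraction": 1 - STEPS_WRITTEN / STEPS_COMPUTED,
    }


if __name__ == "__main__":
    for name, value in io_budget().items():
        print(f"{name}: {value:.2f}")
```

With these inputs, reading dominates: IO accounts for close to 90% of the per-step time, and nine of every ten computed time steps are never available for analysis.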
Issues
–IO vs. analysis time
–Reduced time accuracy in post-processing
–Data movement
ORNL Jaguar: 2.33 petaflops, 224,526 compute cores
Data-Centric Computing
ParaViewWeb
Co-Processing
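As a rough illustration of the co-processing idea, the toy sketch below runs the analysis inside the simulation loop, so that every time step yields a small extract and the full field never has to be written to disk. It is not ParaView's co-processing API; `solver_step`, `extract_summary`, and `run_with_coprocessing` are hypothetical names, and the "solver" is a stand-in that simply evaluates a sine wave over a 1D list of cells.

```python
import math


def solver_step(n_cells, t):
    """Hypothetical stand-in for a solver: one scalar value per cell."""
    return [math.sin(0.01 * i + 0.1 * t) for i in range(n_cells)]


def extract_summary(field, iso_value):
    """In-memory analysis that returns a small extract, not the full field."""
    # Count sign changes of (value - iso_value) between neighboring cells,
    # a 1D analogue of locating where a contour would pass.
    crossings = sum(
        1
        for a, b in zip(field, field[1:])
        if (a - iso_value) * (b - iso_value) < 0
    )
    return {"min": min(field), "max": max(field), "iso_crossings": crossings}


def run_with_coprocessing(n_cells, n_steps, iso_value=0.0):
    """Analyze every time step while the data is still in memory."""
    extracts = []
    for t in range(n_steps):
        field = solver_step(n_cells, t)  # full data exists only in memory
        extracts.append(extract_summary(field, iso_value))
    return extracts


if __name__ == "__main__":
    results = run_with_coprocessing(n_cells=1000, n_steps=20)
    print(len(results), "extracts; first:", results[0])
```

The design point is that each extract is a handful of numbers per step rather than the whole field, so analysis can keep full temporal resolution without the IO cost of storing every step.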
Mobile to Supercomputer ParaView Kiwi / VES
PLATFORM
–Toolkits & Modularization
–Integration
–Software Licenses
Toolkits & Modularization
Integration
[Figure: Module 1, Module 2, Module 3, Module 2 (Python), Integration Glue]
Software Licenses
Early: Reciprocal Licenses
–Require release of software combined with open-source software
–Generally discourage commercial collaboration
–E.g., GPL
Now: Permissive Licenses
–Few strings attached
–Suitable for commercial collaboration
–E.g., BSD, Apache, MIT
COLLABORATION
–Multi-view, Multi-control
–Test-Driven Development / Software processes
Multi-View, Multi-Control Collaboration ParaViewWeb
[Figure: Software Repository; Build, Test & Package; Community Review; Developers & Users]
Summary
–Scalable architectures
–Agile, open platforms
–Robust, test-driven collaboration
[Figure: Scientists, Publisher, Journals, Evolution, Papers, Peer-Review]
If it’s not reproducible, it’s not Science
Nullius in Verba: “take nobody's word for it” (Royal Society, founded 1660)
Failure of Reproducibility
Nature (March 2012)
–Glenn Begley, former head of cancer research at pharma giant Amgen
–Lee M. Ellis, cancer researcher at the University of Texas
Found that 47 of 53 papers (about 89%) published in science journals describing "landmark" breakthroughs in preclinical cancer research could not be reproduced.