Opening Big Data: in small and large chunks
Tim Smith, CERN/IT
3rd International LSDMA Symposium
Agenda
- Big Data
- Zenodo
- Open Data
- Analysis Preservation
- Open Access
Big Data!
- 150 million sensors generating data 40 million times per second: petabytes/sec
- Trigger: select 100,000 events per second: terabytes/sec
- HLT / Filter: select 100 events per second: gigabytes/sec
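The cascade above can be sanity-checked with back-of-envelope arithmetic. A minimal sketch of the reduction factors implied by the quoted rates (the rates come from the slide; everything else is just division):

```python
# Reduction factors implied by the trigger cascade quoted above.
COLLISION_RATE = 40_000_000  # sensor readouts per second (40 MHz)
TRIGGER_RATE = 100_000       # events/s surviving the hardware trigger
HLT_RATE = 100               # events/s surviving the HLT / filter

trigger_reduction = COLLISION_RATE / TRIGGER_RATE  # factor rejected by the trigger
hlt_reduction = TRIGGER_RATE / HLT_RATE            # factor rejected by the HLT
total_reduction = COLLISION_RATE / HLT_RATE        # overall selectivity
```

Only about 1 in 400,000 collisions survives to storage, which is what turns petabytes/sec of sensor output into a recordable gigabytes/sec stream.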
Layered Cloud Virtualized Services
Open Access
Open Access also covers commercialization, in the sense of inclusion in books that are sold, even if we might consider such uses educational exceptions…
Open Source Repository Platform
- Mature digital library platform, originated at CERN in 2002
- OAIS-inspired preservation practices
- Co-developed by an international collaboration
Worlds Apart?
Big Data: active data management, distributed access, transfer protocols, placement, caches, tiered storage, Grid / Cloud, communities
Open Access: sharing, discoverability, re-use, preservation, open licence, persistent IDs, OAI-PMH, public OAIS, citation
Open Data as a Service
- REST API
- OAI-PMH API
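As a sketch of what "Open Data as a Service" looks like to a client, the snippet below builds a search URL for Zenodo's public REST endpoint and pulls titles out of a response. The endpoint and the `q`/`size` parameters follow Zenodo's public API; the exact `hits.hits[].metadata.title` response shape is an assumption based on its Invenio-style JSON:

```python
import urllib.parse

ZENODO_API = "https://zenodo.org/api/records"  # public Zenodo record-search endpoint

def build_search_url(query, size=10):
    """Build a record-search URL; 'q' and 'size' parameters per the public API."""
    params = urllib.parse.urlencode({"q": query, "size": size})
    return f"{ZENODO_API}?{params}"

def record_titles(response_json):
    """Extract titles from a search response.

    The 'hits.hits[].metadata.title' layout is assumed from Zenodo's
    Invenio-style JSON and may differ in detail.
    """
    hits = response_json.get("hits", {}).get("hits", [])
    return [hit["metadata"]["title"] for hit in hits]
```

A client would fetch `build_search_url("open data")` over HTTPS and feed the parsed JSON to `record_titles`.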
Big Data … in small pieces
- Big facilities × (a small number): dedicated Big Data stores
- Long tail of science × (a large number): http://zenodo.org
(figure axis: data size)
Easy, in essence…
Challenging, in practice
- Media verification
- Technology tracking
- Bit rot
- Media migration
Low Barriers
Communities
- Accept/reject uploads
- Export
- Direct community upload
Research Repository
Beware the False Summit
(figure: Data Publication as a false summit on the climb toward Science)
Digital Dark Ages
The scientific method:
- Propose hypotheses to explain phenomena
- Test the hypotheses' predictions through repeatable experiment
- Share observations and conclusions for independent scrutiny, reproduction and verification
Publication: preparation (standardisation), issuing
Accessible Normalisation
Actionable
Detector signals (raw data objects) → software (10M LoC) → tracks, clusters, vertices → software (10M LoC) → physics analysis objects (leptons, photons, jets)
Zenodo – GitHub bridge .zenodo.json
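The bridge archives each GitHub release on Zenodo and mints a DOI for it; a `.zenodo.json` file in the repository root supplies the deposit metadata. A minimal sketch, with hypothetical values, using field names from Zenodo's deposit metadata schema:

```json
{
  "title": "My analysis code",
  "description": "Code released alongside the paper.",
  "upload_type": "software",
  "license": "MIT",
  "creators": [
    {"name": "Doe, Jane", "affiliation": "CERN", "orcid": "0000-0002-1825-0097"}
  ],
  "keywords": ["open data", "reproducibility"]
}
```

With this file in place, tagging a release on GitHub is enough to produce a citable, DOI-stamped snapshot of the code.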
Code ↔ Data ↔ Paper
Interpretable
Data reduction / analysis chain: Raw → Reconstructed → Reduced
- Calibrate, filter, transform (using calibration data and conditions data): formatters, filter/selection algorithms
- Select, combine: statistical models
- Raw data objects become physics analysis objects; complex data structures become simple vectors
- Reduced data: anonymised, standardised, annotated, published
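The calibrate → filter → transform chain can be sketched as a pipeline of small functions. Everything here is hypothetical (the calibration offset, the selection cut, and the summary vector are made-up stand-ins for the real algorithms):

```python
# Hypothetical sketch of the reduction chain: calibrate -> select -> transform.

def calibrate(raw, offset=0.5):
    """Apply a (made-up) calibration constant to raw detector values."""
    return [x - offset for x in raw]

def select(values, threshold=1.0):
    """Filter step: keep only values passing a selection cut."""
    return [x for x in values if x > threshold]

def to_vector(values):
    """Transform step: reduce a complex record to a simple summary vector."""
    return (len(values), sum(values))

def reduce_event(raw):
    """Run one event through the full chain: raw -> reduced vector."""
    return to_vector(select(calibrate(raw)))
```

The point of the shape, not the arithmetic: each stage is a pure function, so the chain can be rerun, audited, and published alongside the data it produced.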
Closed to be Open
- The publication moment is too late! Capture before the knowledge evaporates, while the data is live
- Open ≠ Simple: logins & roles, fine-grained ACLs, a sophisticated security model, life-cycles and workflows
Repeatable
- Capture the entire workflow: data, code, statistical models, documentation
- Capture the environment: virtual machines
Reproducible
(figure: successive software and data versions, v1–v3, across 2010–2012)
Data inflation!
Verification / Repetition / Reproduction
- Good software development practice: a code test suite (unit & regression)
- Publish data and analysis code together, with the workflow and environment captured
- Automated test of the result: rerun → confirmed
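The "rerun → confirmed" step amounts to a regression test on the physics result itself. A minimal sketch (the analysis function, the published value, and the tolerance are all hypothetical placeholders):

```python
# Sketch of an automated "rerun confirmed" check against a published result.
PUBLISHED_RESULT = 125.09  # hypothetical published value, made-up units

def run_analysis(data):
    """Stand-in for the full analysis chain; here just a mean."""
    return sum(data) / len(data)

def rerun_confirmed(data, published=PUBLISHED_RESULT, tolerance=1e-6):
    """Re-execute the analysis and compare against the published result."""
    return abs(run_analysis(data) - published) <= tolerance
```

Wired into a CI job alongside the unit and regression suite, this turns "the result still reproduces" into a test that fails loudly when code, data, or environment drift.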
OpenData Portal
Conclusions
- Information is a valuable asset that is multiplied when it is shared
- Digital libraries have a valuable part to play, as do Big Data stores
- Open Data is on the path to Open Science: discoverable, accessible, intelligible, assessable, useable
Tim.Smith@cern.ch
http://www.cern.ch
http://zenodo.org
@zenodo_org