Opening Big Data; in small and large chunks


1 Opening Big Data; in small and large chunks
Tim Smith, CERN/IT, at the 3rd International LSDMA Symposium

2 Agenda
Big Data, Zenodo, Open Data, Analysis Preservation, Open Access

3 Big Data!
150 million sensors generating data 40 million times per second → petabytes / sec!
Trigger: select 100,000 events per second → terabytes / sec!
HLT / Filter: select 100 events per second → gigabytes / sec!

4 Layered Cloud Virtualized Services

5 Open Access
Open Access also allows commercialization, in the sense of inclusion in books that are sold, even if we might consider those educational exceptions…

6 Open Source Repository Platform
Mature digital library platform, originated in 2002 at CERN. OAIS-inspired preservation practices. Co-developed by an international collaboration.

7 Worlds Apart?
Big Data: active data management, distributed access, transfer protocols, placement, caches, tiered storage, Grid / Cloud, communities.
Open Access: sharing, discoverability, re-use, preservation, open licence, persistent IDs, OAI-PMH, public OAIS, citation.

8 Open Data as a Service
REST API and OAI-PMH API.
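As a sketch of the "Open Data as a Service" idea, the snippet below builds a search query against Zenodo's public records endpoint and pulls titles out of the JSON envelope it returns. The `/api/records` path and the `hits` structure follow Zenodo's published REST API; the canned response here is illustrative so the example runs offline.

```python
from urllib.parse import urlencode

# Zenodo's public search endpoint (per its REST API documentation).
BASE = "https://zenodo.org/api/records"

def build_query_url(query, size=5):
    """Build a search URL for Zenodo's records endpoint."""
    return BASE + "?" + urlencode({"q": query, "size": size})

def extract_titles(response_json):
    """Pull record titles out of a Zenodo-style search response."""
    return [hit["metadata"]["title"]
            for hit in response_json.get("hits", {}).get("hits", [])]

# Canned response in the shape Zenodo returns, so this sketch needs no network.
sample = {"hits": {"total": 1, "hits": [
    {"id": 1234, "metadata": {"title": "LHC open dataset"}}]}}

print(build_query_url("open data"))
print(extract_titles(sample))
```

The same records are also harvestable in bulk via OAI-PMH, which suits mirroring whole collections rather than interactive search.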

9 Big Data … in small pieces
By data size: a small number of big facilities with dedicated Big Data stores, and the long tail of science, a large number of small datasets.

10 Easy, in essence…

11 Challenging, in practice
Media verification, technology tracking, bit rot, media migration.

12 Low Barriers

13 Communities
Accept/reject uploads, export, direct community upload.

14 Research Repository

15 Beware the False Summit
Data Publication vs. Science Data Publication: reaching data publication alone is only a false summit.

16 Digital Dark Ages
Scientific method:
Propose hypotheses to explain phenomena.
Test hypothesis predictions through repeatable experiments.
Share observations and conclusions for independent scrutiny, reproduction and verification.
Publication: preparation (standardisation) and issuing.

17 Accessible Normalisation

18 Actionable
Raw data objects (detector signals, tracks, clusters, vertices) → software (10M LoC) → physics analysis objects (leptons, photons, jets).

19 Zenodo – GitHub bridge .zenodo.json
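The Zenodo–GitHub bridge archives a repository release and mints a DOI for it; a `.zenodo.json` file at the repository root supplies the deposit metadata. A minimal example is sketched below, the field names follow Zenodo's documented metadata schema, but the values are of course illustrative:

```json
{
  "title": "My analysis code",
  "description": "Analysis code released alongside the paper.",
  "upload_type": "software",
  "license": "MIT",
  "creators": [
    {"name": "Doe, Jane", "affiliation": "CERN"}
  ],
  "keywords": ["open data", "reproducibility"]
}
```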

20 Code ↔ Data ↔ Paper

21 Interpretable Data Reduction / Analysis
Raw → Reconstructed: calibrate, filter, transform, using calibration data, conditions data and formatters; raw data objects become physics analysis objects.
Reconstructed → Reduced: select and combine, using filter/selection algorithms and statistical models; complex data structures become simple vectors.
Reduced: anonymised, standardised, annotated, published.
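The reduction chain above can be sketched as a few composable stages. This is a hypothetical toy, the dict-based "event" format and stage names are illustrative, not the experiments' actual data model:

```python
# Toy raw → reconstructed → reduced chain (illustrative only).

def calibrate(event, calibration):
    """Raw → reconstructed: apply per-channel calibration constants."""
    return {ch: val * calibration.get(ch, 1.0) for ch, val in event.items()}

def select(event, threshold):
    """Reconstructed → reduced: keep only signals above threshold."""
    return {ch: val for ch, val in event.items() if val > threshold}

def reduce_to_vector(event):
    """Complex structure → simple vector, ready to annotate and publish."""
    return sorted(event.values(), reverse=True)

raw = {"ch1": 2.0, "ch2": 0.1, "ch3": 5.0}
reconstructed = calibrate(raw, {"ch1": 1.1, "ch3": 0.9})
reduced = reduce_to_vector(select(reconstructed, threshold=1.0))
print(reduced)
```

The point of the design is that each published tier (reconstructed, reduced) is the output of an explicit, re-runnable transformation, so the chain itself can be preserved alongside the data.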

22 Closed to be Open
The publication moment is too late! Capture before the knowledge evaporates, while the data is live.
Open ≠ Simple: logins & roles, fine-grained ACLs, a sophisticated security model, life-cycles and workflows.

23 Repeatable
Capture the entire workflow, with data, code, statistical models and documentation, plus the environment (virtual machines).

24 Reproducible
Successive data and software versions (v1, v2, v3) accumulate year on year (2010, 2011, 2012): data inflation!

25 Verification / Repetition / Reproduction
Good software development practice: a code test suite (unit & regression).
Publish data and analysis code together, with the workflow and environment captured, so an automated rerun of the analysis confirms the published result.
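The "automated test of the result" idea above can be sketched as a regression test shipped with the analysis: rerun the preserved code on the preserved data and check the headline number against the published value. Everything here, the analysis function, the archived events and the published result, is hypothetical:

```python
# Hypothetical regression test packaged with a published analysis.

def run_analysis(events):
    """Stand-in analysis: mean of the positively selected measurements."""
    selected = [e for e in events if e > 0]
    return sum(selected) / len(selected)

PUBLISHED_RESULT = 2.5                          # value quoted in the (hypothetical) paper
ARCHIVED_EVENTS = [1.0, 2.0, 3.0, 4.0, -5.0]    # data preserved with the code

def test_result_reproduces():
    result = run_analysis(ARCHIVED_EVENTS)
    assert abs(result - PUBLISHED_RESULT) < 1e-6, "rerun does not confirm!"

test_result_reproduces()
print("published result reproduced")
```

Run under continuous integration, such a test turns "reproducible" from a claim into something checked automatically on every rerun.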

26 OpenData Portal

27 Conclusions
Information is a valuable asset that is multiplied when it is shared. Digital libraries have a valuable part to play, as do Big Data stores. Open Data is on the path to Open Science: discoverable, accessible, intelligible, assessable, useable.

28 @zenodo_org

