1
Opening Big Data; in small and large chunks
Tim Smith, CERN/IT, at the 3rd International LSDMA Symposium
2
Agenda
Big Data
Zenodo
Open Data
Analysis Preservation
Open Access
3
Big Data!
150 million sensors
Generating data 40 million times per second: petabytes/sec!
Trigger: select 100,000 per second: terabytes/sec!
HLT / Filter: select 100 per second: gigabytes/sec!
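The reduction implied by the rates on this slide can be worked out directly. The following is a minimal sketch that uses only the three event rates quoted above; the stage labels and function name are illustrative and not part of the presentation.

```python
# Event rates quoted on the slide (events per second).
rates = [
    ("collisions", 40_000_000),  # data generated 40 million times per second
    ("trigger", 100_000),        # trigger selects 100,000 per second
    ("hlt_filter", 100),         # HLT / filter selects 100 per second
]

def reduction_factors(stages):
    """Return (stage name, factor) pairs: how much each stage cuts the rate it receives."""
    factors = []
    for (_, rate_in), (name, rate_out) in zip(stages, stages[1:]):
        factors.append((name, rate_in / rate_out))
    return factors

factors = reduction_factors(rates)
overall = rates[0][1] / rates[-1][1]  # first rate divided by last rate

for name, factor in factors:
    print(f"{name}: keeps 1 in {factor:,.0f}")
print(f"overall: keeps 1 in {overall:,.0f}")
```

On these numbers the trigger keeps 1 event in 400, the HLT / filter keeps 1 in 1,000 of those, and the chain as a whole keeps 1 in 400,000.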
4
Layered Cloud Virtualized Services
5
Open Access
Also commercialization, in the sense of inclusion in books that are sold, even if we might consider them educational exceptions…
6
Open Source Repository Platform
Mature digital library platform, originated in 2002 at CERN
OAIS-inspired preservation practices
Co-developed by an international collaboration
7
Worlds Apart?
Big Data: Active Data Mgmt, Distributed access, Transfer protocols, Placement, Caches, Tiered storage, Grid / Cloud, Communities
Open Access: Sharing, Discoverability, Re-use, Preservation, Open Licence, Persistent IDs, OAI-PMH, Public, OAIS, Citation
8
Open Data as a Service
REST API
OAI-PMH API
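As an illustration of the OAI-PMH side, a harvesting request is an HTTP GET with a `verb` and, for `ListRecords`, a `metadataPrefix`. The sketch below only builds the request URL. The base URL is an assumption (Zenodo's public OAI-PMH endpoint, which the slide does not name), and the helper function and the set name in the usage line are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the repository's OAI-PMH base URL. Zenodo's public endpoint is
# used for illustration; substitute the endpoint of the repository in use.
BASE_URL = "https://zenodo.org/oai2d"

def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL following the OAI-PMH protocol."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional: restrict harvesting to one set
    return f"{base_url}?{urlencode(params)}"

url = oai_list_records_url(BASE_URL)
url_with_set = oai_list_records_url(BASE_URL, set_spec="example-set")  # hypothetical set name
print(url)
print(url_with_set)
```

Fetching the resulting URL with any HTTP client should return an XML document of records in the requested metadata format, here unqualified Dublin Core (`oai_dc`).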
9
Big Data … in small pieces
Big facilities x (a small number): Dedicated Big Data Stores
Long tail of science x (a large number)
[Figure axis: Data Size]
10
Easy, in essence…
11
Challenging, in practice
Media Verification
Technology tracking
Bit Rot
Media Migration
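One common way to detect bit rot is a fixity check: record a checksum when the data is written, then recompute and compare it later. The sketch below is a generic illustration using SHA-256, not a description of the verification procedure actually used at CERN; the sample data and function names are hypothetical.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, recorded: str) -> bool:
    """True if the data still matches the checksum recorded at ingest time."""
    return checksum(data) == recorded

# Hypothetical stored object and its checksum recorded at ingest.
original = b"detector signals, run 1"
recorded = checksum(original)

# Simulate bit rot by flipping a single bit in the first byte.
damaged = bytearray(original)
damaged[0] ^= 0x01
damaged = bytes(damaged)

intact_ok = verify(original, recorded)
damaged_ok = verify(damaged, recorded)
print(intact_ok, damaged_ok)
```

A single flipped bit is enough to change the digest, so periodic re-verification flags damaged copies before they are carried over in a media migration.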
12
Low Barriers
13
Communities
Accept/reject uploads
Export
Direct community upload
14
Research Repository
15
Beware the False Summit
Science Data Publication Data Publication
16
Digital Dark Ages
Scientific method
Propose hypotheses to explain phenomena
Test the hypotheses' predictions through repeatable experiments
Share observations and conclusions for independent scrutiny, reproduction and verification
Publication: preparation (standardisation), issuing
17
Accessible Normalisation
18
Actionable
Software: 10M LoC
Raw Data Objects: Detector Signals
Tracks, Clusters, Vertices
Software: 10M LoC
Physics Analysis Objects: Leptons, Photons, Jets
19
Zenodo – GitHub bridge
.zenodo.json
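A `.zenodo.json` file placed in the repository root supplies the metadata for the archived release. The sketch below generates one such file's contents. It assumes the keys follow Zenodo's deposit metadata fields (`title`, `upload_type`, `creators`, and so on), and every value is a hypothetical placeholder rather than anything from the presentation.

```python
import json

# Assumption: keys follow Zenodo's deposit metadata fields; every value below
# is a hypothetical placeholder.
metadata = {
    "title": "Example analysis code",
    "upload_type": "software",
    "description": "Code used to produce the results of an example analysis.",
    "creators": [
        {"name": "Doe, Jane", "affiliation": "Example Institute"},
    ],
    "keywords": ["open data", "analysis preservation"],
}

# The resulting text is what would be saved as .zenodo.json.
zenodo_json = json.dumps(metadata, indent=2)
print(zenodo_json)
```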
20
Code ↔ Data ↔ Paper
21
Interpretable
Data Reduction / Analysis: Raw, Reconstructed, Reduced
Calibration data, Conditions data
Calibrate, Filter, Transform
Formatters, Filter/Selection algorithms
Reconstructed
Select, Combine
Statistical Models
Raw data objects into physics analysis objects
Complex data structures into simple vectors
Reduced
Anonymised, Standardised, Annotated
Published
22
Closed to be Open
Publication moment is too late!
Capture before the knowledge evaporates
While data is live
Open ≠ Simple
Logins & Roles
Fine-grained ACLs
Sophisticated security model
Life-cycles and workflows
23
Repeatable
Capture entire workflow
With data, code, statistical models, documentation
Environment, Virtual Machines
24
Reproducible
[Figure: data versions v1 to v3 across 2010, 2011, 2012]
Data Inflation!
25
Verification / Repetition / Reproduction
Good software development practice:
Code test suite
Unit & regression
Publish data and analysis code together
Workflow and environment captured
Automated test of the result: rerun confirmed
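The last point, an automated test that a rerun confirms the result, can be sketched as a regression test that re-executes the analysis on the published data and compares the outcome with the published value. Everything here is a toy stand-in: the analysis function, the data, and the reference result are hypothetical.

```python
import math

def run_analysis(events):
    """Toy stand-in for an analysis workflow: mean of the positive values."""
    selected = [e for e in events if e > 0]  # selection step
    return sum(selected) / len(selected)     # summary statistic

# Hypothetical published inputs and the result reported alongside them.
PUBLISHED_DATA = [1.0, 2.0, -5.0, 3.0]
PUBLISHED_RESULT = 2.0

def test_rerun_reproduces_published_result():
    """Regression test: rerunning the code on the published data gives the published result."""
    rerun = run_analysis(PUBLISHED_DATA)
    assert math.isclose(rerun, PUBLISHED_RESULT, rel_tol=1e-9)

test_rerun_reproduces_published_result()
print("rerun confirmed")
```

Comparing with a tolerance rather than exact equality is a deliberate choice, since floating-point results can differ slightly between environments even when the workflow is captured faithfully.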
26
OpenData Portal
27
Conclusions
Information is a valuable asset that is multiplied when it is shared
Digital Libraries have a valuable part to play
As do Big Data stores
Open Data, on the path to Open Science
Discoverable, Accessible, Intelligible, Assessable, Useable
28
@zenodo_org