Opening Big Data; in small and large chunks

Presentation transcript:

Opening Big Data; in small and large chunks. Tim Smith, CERN/IT, at the 3rd International LSDMA Symposium.

Agenda: Big Data; Zenodo; Open Data; Analysis Preservation; Open Access.

Big Data! 150 million sensors generating data 40 million times per second → petabytes/sec! Trigger: select 100,000 per second → terabytes/sec! HLT / Filter: select 100 per second → gigabytes/sec!
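The selection rates on this slide imply a reduction factor at each stage. A minimal sketch of that arithmetic follows, using only the three rates quoted above; the constant and function names are illustrative. It compares event rates only, since the slide gives no event sizes, so it says nothing about byte volumes.

```python
# Rates quoted on the slide (readouts or selections per second).
COLLISION_RATE_HZ = 40_000_000  # data generated 40 million times per second
TRIGGER_RATE_HZ = 100_000       # selected by the trigger
HLT_RATE_HZ = 100               # selected by the HLT / filter


def reduction_factor(rate_in_hz: int, rate_out_hz: int) -> float:
    """Return how many input events there are for each event that is kept."""
    return rate_in_hz / rate_out_hz


trigger_factor = reduction_factor(COLLISION_RATE_HZ, TRIGGER_RATE_HZ)
hlt_factor = reduction_factor(TRIGGER_RATE_HZ, HLT_RATE_HZ)
overall_factor = reduction_factor(COLLISION_RATE_HZ, HLT_RATE_HZ)

print(f"trigger keeps 1 in {trigger_factor:,.0f}")
print(f"HLT / filter keeps 1 in {hlt_factor:,.0f}")
print(f"overall: 1 in {overall_factor:,.0f}")
```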

Layered Cloud Virtualized Services

Open Access. Also commercialization, in the sense of inclusion in books that are sold – even if we might consider them educational exceptions…

Open Source Repository Platform. Mature digital library platform, originated in 2002 at CERN; OAIS-inspired preservation practices; co-developed by an international collaboration.

Worlds Apart? Big Data / Open Access. Big Data: active data management; distributed access; transfer protocols; placement; caches; tiered storage; Grid / Cloud; communities. Open Access: sharing; discoverability; re-use; preservation; open licence; persistent IDs; OAI-PMH; public; OAIS; citation.

Open Data as a Service: REST API; OAI-PMH API.
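As a rough illustration of what harvesting over OAI-PMH can look like, the sketch below builds a `ListRecords` request URL and extracts Dublin Core titles from a response. The verb, the `metadataPrefix` and `set` parameters, and the XML namespaces come from the OAI-PMH 2.0 protocol. The base URL `https://zenodo.org/oai2d`, the helper names and the sample response are assumptions made for illustration, and no network request is made.

```python
from typing import List, Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_list_records_url(base_url: str,
                           metadata_prefix: str = "oai_dc",
                           set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_titles(xml_text: str) -> List[str]:
    """Return every Dublin Core title found in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iterfind(".//dc:title", NAMESPACES)]


# Hand-written sample response, not real repository output.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Assumed endpoint; check the repository's documentation before relying on it.
url = build_list_records_url("https://zenodo.org/oai2d")
titles = extract_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```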

Big Data … in small pieces. Big facilities × (a small number): dedicated Big Data stores. Long tail of science × (a large number): http://zenodo.org. [Figure axis: data size]

Easy, in essence…

Challenging, in practice: media verification; technology tracking; bit rot; media migration.
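One common way to detect bit rot is a fixity check: record a checksum when a file is archived and recompute it later. The sketch below shows the idea with SHA-256 from the Python standard library. The function names and the temporary file are illustrative, and nothing here describes how any particular archive actually implements verification.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, expected_sha256: str) -> bool:
    """Return True if the file still matches the checksum recorded at ingest."""
    return sha256_of(path) == expected_sha256


with tempfile.TemporaryDirectory() as tmp:
    archived = Path(tmp) / "archived.dat"
    archived.write_bytes(b"detector data")
    recorded = sha256_of(archived)              # checksum stored at ingest time
    intact_before = verify_fixity(archived, recorded)

    archived.write_bytes(b"detector dat\x00")  # simulate a corrupted byte
    intact_after = verify_fixity(archived, recorded)

print(intact_before, intact_after)
```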

Low Barriers

Communities: accept/reject uploads; export; direct community upload.

Research Repository

Beware the False Summit Science Data Publication Data Publication

Digital Dark Ages. Scientific method: propose hypotheses to explain phenomena; test the hypotheses' predictions through repeatable experiments; share observations and conclusions for independent scrutiny, reproduction and verification. Publication: preparation (standardisation), issuing.

Accessible Normalisation

Actionable. Software (10M LoC): raw data objects (detector signals, tracks, clusters, vertices). Software (10M LoC): physics analysis objects (leptons, photons, jets).

Zenodo – GitHub bridge: .zenodo.json
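The `.zenodo.json` file named on this slide sits in a GitHub repository and supplies the metadata for the archived release. The sketch below generates such a file's contents. The field names (`title`, `description`, `upload_type`, `creators`, `keywords`) are assumed to follow Zenodo's deposit metadata and should be checked against the current Zenodo documentation, and all values are hypothetical.

```python
import json

# Hypothetical metadata; field names are assumed to match Zenodo's deposit
# metadata schema and should be verified before use.
zenodo_metadata = {
    "title": "Hypothetical analysis code",
    "description": "Example release archived through the Zenodo GitHub bridge.",
    "upload_type": "software",
    "creators": [
        {"name": "Doe, Jane", "affiliation": "Example Institute"},
    ],
    "keywords": ["open data", "analysis preservation"],
}

# Serialise to the text that would be saved as .zenodo.json in the repository root.
zenodo_json = json.dumps(zenodo_metadata, indent=2)
print(zenodo_json)
```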

Code ↔ Data ↔ Paper

Interpretable: Data Reduction / Analysis. Raw → Reconstructed → Reduced. Calibration data; conditions data. Calibrate, filter, transform: formatters, filter/selection algorithms. Reconstructed: select, combine; statistical models. Raw data objects into physics analysis objects; complex data structures into simple vectors. Reduced: anonymised, standardised, annotated, published.

Closed to be Open. Publication moment is too late! Capture before the knowledge evaporates, while data is live. Open ≠ simple: logins and roles; fine-grained ACLs; sophisticated security model; life-cycles and workflows.

Repeatable. Capture the entire workflow, with data, code, statistical models and documentation; environment, virtual machines.

Reproducible. [Figure: data versions v1, v2, v3 accumulating over 2010, 2011, 2012] Data inflation!

Verification / Repetition / Reproduction. Good software development practice: code test suite (unit and regression). Publish data and analysis code together; workflow and environment captured; automated test of the result; rerun confirmed.
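An automated test that confirms a rerun could, in its simplest form, recompute the result from the published inputs and compare it with the published value within a tolerance. The sketch below assumes a toy analysis (a mean), invented numbers and hypothetical function names. A real analysis would replace `rerun_analysis` with the captured workflow.

```python
import math
from typing import Sequence

# Hypothetical published inputs and result, standing in for real analysis data.
PUBLISHED_INPUT = [1.0, 2.0, 4.0, 7.0]
PUBLISHED_RESULT = 3.5


def rerun_analysis(values: Sequence[float]) -> float:
    """Toy stand-in for the preserved analysis code: the mean of the inputs."""
    return sum(values) / len(values)


def rerun_confirms(published: float, rerun: float, rel_tol: float = 1e-9) -> bool:
    """Return True if the rerun reproduces the published value within tolerance."""
    return math.isclose(published, rerun, rel_tol=rel_tol)


rerun_result = rerun_analysis(PUBLISHED_INPUT)
confirmed = rerun_confirms(PUBLISHED_RESULT, rerun_result)
print(f"rerun result: {rerun_result}, confirmed: {confirmed}")
```

Comparing within a tolerance rather than for exact equality allows for small floating-point differences between environments.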

OpenData Portal

Conclusions. Information is a valuable asset that is multiplied when it is shared. Digital libraries have a valuable part to play, as do Big Data stores. → Open Data, on the path to Open Science: discoverable, accessible, intelligible, assessable, useable.

Tim.Smith@cern.ch http://www.cern.ch http://zenodo.org @zenodo_org