EventStore Managing Event Versioning and Data Partitioning using Legacy Data Formats Chris Jones Valentin Kuznetsov Dan Riley Greg Sharp CLEO Collaboration.


EventStore: Managing Event Versioning and Data Partitioning using Legacy Data Formats
Chris Jones, Valentin Kuznetsov, Dan Riley, Greg Sharp
CLEO Collaboration, Cornell University

Goals
- Fast and scalable
  - e.g. run through events with data in memory: >10,000 events/s
- Event data stays in its original file format
  - e.g. CLEO’s object format, Root, etc.
- Can manage data and MC
- Data will be versioned
  - can always get back the version of the data you used before
- Can choose runs based on ‘run conditions’
  - e.g. run energy, status of the RICH subdetector, etc.
- Handles overlapping skims in the same job
- Easy to add data to an event or supersede data already in it
  - e.g. can add post-reconstruction info such as π0s and reconstructed D’s
- No dependence on proprietary software
  - e.g. drop our use of Objectivity

EventStore ‘Sizes’
- Personal
  - For individual physicists (e.g. laptops)
  - Holds personal skims
  - No separate processes (e.g. databases) needed to run
- Group
  - For large offsite collaborators or on-site groups
  - Holds a large subset of our data
  - All data on disk
  - Requires running a 3rd party database
- Collaboration
  - Cornell site
  - Holds all of our data, with replication for improved performance
  - Interacts with HSM
  - Requires running a 3rd party database
- The three sizes share everything except the choice of file meta-data DB

Data Organization
- Data is organized into ‘grades’
  - raw: data directly from the detector
  - physics: reconstruction output approved for analysis
- Skims are defined within a grade
  - physics: all, qcd, tau, …
  - skims are just indices, independent of data clustering
  - different skims can reference the same event
  - extremely easy to add additional skims

Adding Data
- Easy to add new objects to events in an existing grade
  - e.g. could add π0s to the physics grade after the post-reconstruction calibration has been done
- Can avoid common run-time calculations
  - e.g. shower energy, π0 finding, …
  - saves CPU time
  - guarantees consistency when reprocessing

File Meta Data
- Meta data about all files is stored in a relational DB
- The system is independent of the choice of DB
  - Presently using SQLite for ‘personal’ and MySQL for ‘group’ and ‘collaboration’
- Meta data stored
  - Mapping from logical file ID (a 64-bit number) to file path
  - Which files belong to which grade, skim and run
  - Versioning information
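As a rough illustration of the meta data listed above, the following sketch builds a tiny SQLite catalog of the kind a ‘personal’ EventStore might hold. The table names, column names, file paths and the date stamp are assumptions made for this example; the slides do not show the actual schema.

```python
import sqlite3

def make_catalog():
    """Create an in-memory catalog. All table and column names are hypothetical."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE file_path (
            file_id INTEGER PRIMARY KEY,  -- logical file ID (SQLite integers are 64-bit)
            path    TEXT NOT NULL
        );
        CREATE TABLE file_key (
            grade   TEXT    NOT NULL,     -- e.g. 'raw' or 'physics'
            skim    TEXT    NOT NULL,     -- e.g. 'all', 'qcd', 'tau'
            run     INTEGER NOT NULL,
            version INTEGER NOT NULL,     -- date stamp in yyyymmdd notation
            file_id INTEGER NOT NULL REFERENCES file_path(file_id)
        );
    """)
    return db

def files_for(db, grade, skim, run, version):
    """Return the paths of all files registered for one grade/skim/run/version."""
    rows = db.execute(
        "SELECT p.path FROM file_key AS k "
        "JOIN file_path AS p ON p.file_id = k.file_id "
        "WHERE k.grade = ? AND k.skim = ? AND k.run = ? AND k.version = ? "
        "ORDER BY p.path",
        (grade, skim, run, version),
    )
    return [row[0] for row in rows]

# Example with made-up IDs, paths and date stamp.
catalog = make_catalog()
catalog.executemany("INSERT INTO file_path VALUES (?, ?)",
                    [(1, "/data/run200.idx"), (2, "/data/run200.loc")])
catalog.executemany("INSERT INTO file_key VALUES (?, ?, ?, ?, ?)",
                    [("physics", "qcd", 200, 20040312, 1),
                     ("physics", "qcd", 200, 20040312, 2)])
qcd_files = files_for(catalog, "physics", "qcd", 200, 20040312)
```

Because the queries are plain SQL, the same sketch would carry over to MySQL by swapping the connection, which is the kind of DB independence the slide describes.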

Indexing Data
- Three types of files are used when reading data
  - Index
    - translates (run, event, MC ID) to a record index in the location file
    - index file has a fixed record length: fast random access
  - Location
    - knows where in a set of data files each data unit can be found
    - gives us random access
    - versioning is implemented using different location files
    - specialized for each data storage format
    - location file has a fixed record length
  - Data
    - any way to store ‘data’ should work
    - implemented for two file formats with variable record sizes
- Why not use a relational database for event indexing?
  - Indexing is read only
  - Info is accessed every event, so it must be very fast
  - Index is traversed on the client, which scales well to many clients reading the same data
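The benefit of a fixed record length can be sketched as follows: record i always starts at byte i × record size, so a sorted index file can be binary-searched on the client with a handful of seeks. The record layout, file name and function names below are assumptions for illustration, not the actual EventStore format.

```python
import os
import struct
import tempfile

# Hypothetical record layout: run, event, MC ID (unsigned 32-bit each) and the
# record index in the location file (unsigned 64-bit). Little-endian, no padding.
RECORD = struct.Struct("<IIIQ")

def write_index(path, entries):
    """Write (run, event, mc_id, location_index) tuples, sorted by key."""
    with open(path, "wb") as f:
        for entry in sorted(entries):
            f.write(RECORD.pack(*entry))

def find_location(path, run, event, mc_id=0):
    """Binary-search the index file; return the location index or None."""
    key = (run, event, mc_id)
    count = os.path.getsize(path) // RECORD.size
    with open(path, "rb") as f:
        lo, hi = 0, count
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)  # fixed length: record i starts at i * size
            record = RECORD.unpack(f.read(RECORD.size))
            if record[:3] < key:
                lo = mid + 1
            elif record[:3] > key:
                hi = mid
            else:
                return record[3]
    return None

# Example with made-up run and event numbers.
with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "qcd.idx")
    write_index(index_path, [(201, 7, 0, 2), (200, 5, 0, 0), (200, 9, 0, 1)])
    hit = find_location(index_path, 200, 9)   # event present in the skim
    miss = find_location(index_path, 200, 6)  # event not in the skim
```

No server is involved in the lookup, which is consistent with the slide's point that traversing the index on the client scales to many readers.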

Reading Data
[Figure: index files (QCD, Raw, Tau) hold run, event, MC ID and index; the location file holds offsets for track, shower, Ks and raw; the offsets point into the data files.]

Performance: Sequential Access
- Compare reading the same data files sequentially as a chain of files versus using the EventStore
- EventStore is a constant 15% slower

Performance: Event List Access
- Compare using an event list to access the same data files as a chain of files versus using the EventStore
- The more events that are skipped, the better the EventStore scales

Versioning
- When starting a new analysis, you usually want the most recent reconstruction
- When adding new data to an existing analysis, you want to go back over the same data
- Specify the version by giving a date
  - notation: yyyymmdd
  - e.g. eventstore in [yyyymmdd]
- You do not have to specify the exact date of a version change; EventStore will find the closest version just before that date
- In an analysis, physicists use the date they first processed the data
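The "closest version just before that date" rule can be sketched with a sorted list of date stamps and a binary search. The function name and the date stamps are hypothetical, and this sketch assumes that a stamp equal to the requested date also counts as a match.

```python
from bisect import bisect_right

def resolve_version(date_stamps, requested):
    """Return the latest date stamp that is not after `requested` (yyyymmdd ints)."""
    stamps = sorted(date_stamps)
    i = bisect_right(stamps, requested)  # number of stamps <= requested
    if i == 0:
        raise LookupError(f"no version on or before {requested}")
    return stamps[i - 1]

# Hypothetical date stamps for one grade.
physics_versions = [20040312, 20040801]
early_analysis = resolve_version(physics_versions, 20040615)
late_analysis = resolve_version(physics_versions, 20041001)
```

A physicist who keeps using the date of first processing will keep resolving to the same stamp, even after newer versions are appended to the list.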

Versioning: Evolution
- If data is reprocessed, a new date stamp must be used to distinguish the data
  - e.g. if CLEO reprocesses data31, a new date must be created for the physics grade
    - physics [date] has the first processing of data31
    - physics [date] has the more recent processing of data31
- When new datasets are added, CLEO officers can append them to any date stamp
  - e.g. newly recon data35 can be added to physics [date] and [date]
- When new data types are added, they can be placed in the corresponding date stamp
  - e.g. π0s derived from [date] may be stored there
- If a data type is replaced, a new date stamp is needed

Versioning: Evolution
[Figure: recon blocks for data31 (two processings), data32 and data33, with π0 blocks added, laid out by run number against version time.]

Versioning: Information
- Each ‘chunk’ of data added will have its own specific versioning information
  - e.g. Recon Feb13_04_P2 for data32: the reconstructed data used software release Feb13_04_P2, with the last change to any item that affects recon no later than 2004/03/12
- The version date stamp is a ‘logical’ version made up of individual specific versions, each of which describes a non-overlapping run range
- CLEO officers decide which specific versions should be used together to form a logical version
- Want to make a tool that, given a date stamp and a run range, tells you how to generate your MC
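One way to picture a logical version is as a list of specific versions, each covering its own run range. The sketch below checks that the ranges do not overlap and looks up the specific version for a run. The class, the function names, the run ranges and the data31 release name are hypothetical; only the name "Recon Feb13_04_P2" for data32 comes from the slide.

```python
from typing import NamedTuple, Optional

class SpecificVersion(NamedTuple):
    first_run: int
    last_run: int
    name: str

def check_no_overlap(chunks):
    """Return the chunks sorted by run; raise ValueError if two ranges overlap."""
    ordered = sorted(chunks)
    for earlier, later in zip(ordered, ordered[1:]):
        if later.first_run <= earlier.last_run:
            raise ValueError(f"{earlier.name} and {later.name} overlap")
    return ordered

def specific_version_for(chunks, run) -> Optional[str]:
    """Return the name of the specific version covering `run`, if any."""
    for chunk in chunks:
        if chunk.first_run <= run <= chunk.last_run:
            return chunk.name
    return None

# A hypothetical logical version; run ranges are made up.
logical_version = check_no_overlap([
    SpecificVersion(200, 299, "Recon Feb13_04_P2"),  # data32
    SpecificVersion(100, 199, "Recon (hypothetical release)"),  # data31
])
release_for_run_250 = specific_version_for(logical_version, 250)
```

A tool of the kind the last bullet asks for could walk a run range through such a table to report which software release applies to each run.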

Run Selection
- Support multiple ways of specifying runs
  - run range
    - runs [run numbers]
  - datasets
    - datasets [dataset names]
  - energy
    - energy 1.89: runs whose energy ‘clusters’ around this beam energy
    - energy psi(3770)
    - energy psi(3770)-off
  - detector state
    - detector mu unused: mu is not used in this analysis, so runs where mu is bad can be used
    - detector rich used
- Data is obtained by querying a Web Service
  - Run meta-data is centralized
  - Can also be accessed via a web browser
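The energy and detector-state selections can be sketched as a filter over run meta data. In the real system this information comes from a web service; here it is a small in-memory list, and the class, the field names, the tolerance and all run values are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    number: int
    energy: float                 # beam energy in GeV
    bad: frozenset = frozenset()  # subdetectors flagged bad for this run

def select_runs(runs, energy, tolerance=0.01, unused=()):
    """Select runs near `energy` whose bad subdetectors are all unused.

    `tolerance` is a hypothetical stand-in for the energy 'cluster' width.
    """
    ignorable = frozenset(unused)
    return [
        run.number
        for run in runs
        if abs(run.energy - energy) <= tolerance and run.bad <= ignorable
    ]

# Made-up run conditions.
runs = [
    Run(1, 1.886),
    Run(2, 1.887, frozenset({"mu"})),
    Run(3, 1.889, frozenset({"rich"})),
    Run(4, 2.080),
]
strict = select_runs(runs, energy=1.89)
mu_unused = select_runs(runs, energy=1.89, unused=("mu",))
```

Declaring mu unused recovers run 2, which mirrors the slide's point that an analysis not relying on a subdetector can keep runs where that subdetector was bad.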

Conclusion
- We have deployed the EventStore
- Features
  - Adaptable to any legacy data format and relational database
  - Provides random access to events
  - Allows incremental addition of data
  - Enables independence of event indexing and data clustering
  - Reuses data files for different versions
- Users’ reactions have been very favorable