The ATLAS TAGs Database - Experiences and further developments Elisabeth Vinek, CERN & University of Vienna on behalf of the TAGs developers group.

Outline
 I. Overview
   What are the TAGs?
   Physics content
   TAG formats
 II. Components of the TAG system: files, databases and services
   The TAG database(s)
   Data distribution
   Cataloging of data and services
   ELSSI: the TAG browser
 III. Experiences...
   ...from managing a very large database
   ...from maintaining a distributed architecture
 IV. Further developments
   On-demand selection of TAG services

I. ATLAS TAGs: event-level metadata
 TAGs are event-by-event metadata records containing:
   key quantities that identify and describe the event, intended to be useful for event selection, and
   sufficient navigational information to allow access to the event data at all prior processing stages: RAW, ESD, and AOD (and possibly more, e.g. for Monte Carlo data)
 TAG is not an acronym!
 Each event carries more than 200 variables to make event selection easy and efficient
 The variables were decided by a 2006 task force and are maintained by the PAT (Physics Analysis Tools) group
 TAG content has been quite stable since then
 The data representation evolves only to make the TAGs easier to use (e.g. the trigger decision at three levels)

I. TAG Content
 ~200 variables per event:
   Event identification (run number, event number, lumi block, timestamp)
   Trigger decisions at all three levels (bit-encoded)
   Numbers of electrons, muons, photons, taus, jets
   pT, eta, phi of the highest-pT objects
   Global quantities (e.g. missing ET)
   Detector status and quality words
   For each Physics & Performance group, a 32-bit word reserved to flag the events interesting for its analysis
   Sufficient information to point back to the previous processing stages (i.e. RAW/ESD/AOD)
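To make the record structure concrete, below is a minimal sketch of a single TAG record as a Python dataclass. The field names (run_number, lead_electron_pt, etc.) and types are illustrative only; the real ~200 attributes and their encodings are defined and maintained by the PAT group.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TagRecord:
    """Illustrative event-level TAG record; all field names are hypothetical."""
    # Event identification
    run_number: int
    event_number: int
    lumi_block: int
    timestamp: int
    # Trigger decisions, bit-encoded per level (L1 / L2 / EF)
    l1_trigger_bits: List[int] = field(default_factory=list)
    l2_trigger_bits: List[int] = field(default_factory=list)
    ef_trigger_bits: List[int] = field(default_factory=list)
    # Object counts and leading-object kinematics
    n_electrons: int = 0
    n_muons: int = 0
    n_photons: int = 0
    n_jets: int = 0
    lead_electron_pt: float = 0.0   # MeV
    lead_muon_pt: float = 0.0       # MeV
    # Global quantities and quality words
    missing_et: float = 0.0
    detector_status_word: int = 0
    # One 32-bit analysis-flag word per Physics & Performance group
    physics_words: List[int] = field(default_factory=lambda: [0] * 32)
    # Navigation: POOL references back to the RAW/ESD/AOD files
    raw_ref: str = ""
    esd_ref: str = ""
    aod_ref: str = ""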

I. TAG Formats
 TAGs are produced in central ATLAS reconstruction:
   RAW data
   ESD (Event Summary Data), ~500 kB/event
   AOD (Analysis Object Data), ~100 kB/event (Egamma)
   TAG, ~1 kB/event planned – in reality ~3-4 kB/event
 TAG formats:
   File-based TAGs are built from AOD content and written into files when AOD files are merged.
     TAGs are handled like other file-based data.
     POOL ROOT format: can be browsed by ROOT, but is actually a POOL standard like the data files (see the sketch below).
     Organized into datasets and distributed to the appropriate Tiers of ATLAS.
   Relational TAGs are uploaded to Oracle databases.
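Since file-based TAGs are POOL ROOT files that can be browsed with ROOT, a quick look at one with PyROOT might resemble the sketch below. The file name, the tree name "POOLCollectionTree" and the attribute used in the cut are assumptions; the actual dataset defines the exact names.

# Minimal sketch: browsing a file-based TAG with PyROOT.
import ROOT

f = ROOT.TFile.Open("myDataset.TAG.pool.root")   # hypothetical file name
tree = f.Get("POOLCollectionTree")               # assumed name of the TAG tree

# List the TAG attributes stored as branches (~200 of them)
for branch in tree.GetListOfBranches():
    print(branch.GetName())

# Count events passing a simple cut (attribute name is illustrative)
n_selected = tree.GetEntries("NLooseElectron >= 2")
print("events with >= 2 loose electrons:", n_selected)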

II. Components of the TAG system
 [Architecture diagram: reconstruction produces TAG ROOT files, which are uploaded to the TAG database sites (site 1 ... site n); a TAG data & services catalog is used to look up data and services; queries, skim and extract operations navigate from TAG ROOT files to the ESD/AOD ROOT files; conditions metadata complements the TAGs.]

II. The TAG databases, TAG uploads
 Relational databases (Oracle 10g/11g) at several sites
 Data organized by RAW streams within a project
 The physical "unit" is a (POOL) collection
 The upload of the TAGs to all sites is done by the Tier-0 Management System as the last step in the reconstruction chain
 A "posttagupload" job runs on each database after the upload to manage the data efficiently: index creation, partitioning, monitoring, etc. (one such step is sketched below)
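A minimal sketch of what one post-upload maintenance step could look like, using cx_Oracle. The connection details, table and column names and the specific indexes are hypothetical; the real posttagupload logic is schema- and site-specific.

import cx_Oracle

def post_tag_upload(dsn: str, user: str, password: str, collection: str) -> None:
    """Index the freshly uploaded collection and refresh optimizer statistics."""
    with cx_Oracle.connect(user, password, dsn) as conn:
        cur = conn.cursor()
        # Bitmap index on a low-cardinality quality word (illustrative choice)
        cur.execute(
            f"CREATE BITMAP INDEX {collection}_dq_idx "
            f"ON {collection} (detector_status_word)"
        )
        # B-tree index on run/event for point lookups (illustrative choice)
        cur.execute(
            f"CREATE INDEX {collection}_run_evt_idx "
            f"ON {collection} (run_number, event_number)"
        )
        # Gather statistics so the optimizer handles the new data efficiently
        cur.callproc("DBMS_STATS.GATHER_TABLE_STATS",
                     [user.upper(), collection.upper()])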

II. Data Distribution
 Relational TAGs are not distributed to all Tier-1s or Tier-2s as the file-based TAGs are; sites (Tier-1s as well as Tier-2s) host them on a voluntary basis.
   This requires providing an Oracle database service on a terabyte scale.
 Current TAG sites: CERN, DESY, PIC, BNL, TRIUMF
 Current data distribution and volumes:

              CERN      DESY      PIC       BNL       TRIUMF
   data       … TB      1.14 TB   2.24 TB   250 GB    670 GB
   data09     320 GB    300 GB    -         360 GB    350 GB
   mc         … TB      -         -         -
   mc         … GB      -         -         -

II. Cataloging data and services
 A "replica catalog" is needed to keep track of the relational data distribution.
   Implemented as a database schema in a DWH-like star design
   Updated automatically by the Tier-0 management system via an API
   Used:
     by the TAG browser to show available data,
     by all TAG services to establish the connection to the right site hosting the required data (see the lookup sketch below),
     to mark data for deletion,
     to gather performance statistics on the upload process
 A "services catalog" is needed to keep track of all services, their deployments, status and metrics.
   Implemented as a database schema with 2 layers:
     Stats gathering
     Aggregation – "offline" computation of metrics per service
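As an illustration of how a TAG service might consult the replica catalog, here is a sketch of a site lookup. The table and column names (tag_replica, tag_site, tag_collection, ...) are hypothetical stand-ins for the real star schema.

import cx_Oracle

def sites_hosting(collection_name: str, catalog_dsn: str,
                  user: str, password: str) -> list:
    """Return (site_name, connection_string) pairs for sites that host the collection."""
    query = """
        SELECT s.site_name, s.connection_string
          FROM tag_replica r
          JOIN tag_site s       ON s.site_id = r.site_id
          JOIN tag_collection c ON c.coll_id = r.coll_id
         WHERE c.coll_name = :name
           AND r.status = 'AVAILABLE'
    """
    with cx_Oracle.connect(user, password, catalog_dsn) as conn:
        cur = conn.cursor()
        cur.execute(query, name=collection_name)
        return cur.fetchall()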

II. TAG catalogs
 [Diagram of the TAG data and services catalog schemas]

II. ELSSI: Event Level Selection Service Interface
 Interface for querying the TAG databases
 Selection based on runs, streams, trigger decisions and physics attributes (an example of such a selection is sketched below)
 Access with certificate (VO atlas)
 [Screenshots: query results as a table, histograms and AOD GUIDs; extract and skimming options]
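The kind of selection that ELSSI builds behind its GUI could look roughly like the sketch below: a cut on run range, a trigger bit and a couple of physics attributes, returning the AOD references needed for a subsequent skim or extract. All table and column names are illustrative, not the actual TAG schema.

# conn is assumed to be an open cx_Oracle (or other DB-API) connection
SELECTION_SQL = """
    SELECT event_number, run_number, aod_ref
      FROM tag_data10_physics_egamma                 -- hypothetical TAG table
     WHERE run_number BETWEEN :run_min AND :run_max
       AND BITAND(ef_trigger_word, :trig_mask) > 0   -- required EF trigger bit set
       AND n_loose_electron >= 2
       AND missing_et > 20000                        -- MeV
"""

def select_events(conn, run_min, run_max, trig_mask):
    """Run the selection and return the matching (event, run, AOD reference) rows."""
    cur = conn.cursor()
    cur.execute(SELECTION_SQL, run_min=run_min, run_max=run_max,
                trig_mask=trig_mask)
    return cur.fetchall()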

III. Experiences from managing very large databases
 Much effort has been put into managing the huge amount of data efficiently at the database level (some of the DDL involved is sketched below):
   Schema and tablespace strategy to enable easy data deletion
   Indexing:
     Indexing of all attributes since the beginning
     B-tree and bitmap indexes, depending on the variables – space issues!
   Horizontal table partitioning to group runs into physical units and to allow read/write optimization
   Compression: saves a lot of space! However, limitations in compression led to (ongoing) tests on vertical partitioning
   Data deletion kept in sync with file deletion from the sites
   Periodic cross-checks needed at various levels to ensure data consistency and synchrony with the file-based TAGs
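For illustration, the sketch below shows the flavour of DDL behind run-based horizontal partitioning combined with table compression. The table, column and partition names are invented, the run boundaries are arbitrary, and the column list is heavily abridged.

# Illustrative Oracle DDL, kept as a string; it could be executed via cx_Oracle
# exactly as in the earlier sketches.
PARTITIONED_TABLE_DDL = """
CREATE TABLE tag_data10_physics_egamma (
    run_number     NUMBER(10),
    event_number   NUMBER(12),
    lumi_block     NUMBER(6),
    missing_et     BINARY_FLOAT
    -- ... remaining ~200 TAG attributes ...
)
COMPRESS
PARTITION BY RANGE (run_number) (
    PARTITION p_run_152000 VALUES LESS THAN (153000),
    PARTITION p_run_153000 VALUES LESS THAN (154000),
    PARTITION p_max        VALUES LESS THAN (MAXVALUE)
)
"""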

III. Experiences from managing a distributed architecture
 Coordination effort to keep the software versions at the different sites in sync
   Automation at the DB level, but service deployments are managed locally.
   Efforts to also deploy web services at sites other than CERN.
 Data distribution:
   For now, all data except MC is complete at CERN -> CERN can be used as a reliable data source. In general, a reprocessing pass should always be complete at a site, but this can change.
   Challenges will come when the data is more scattered and a full reprocessing pass is not available at a single site -> a query as requested by the user cannot be answered by connecting to only one site -> query distribution? (see the sketch below)
 Load balancing between databases and services – until now the user chooses the site to query, but this will change -> automated site picking includes automated load balancing
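A possible shape for the query-distribution idea mentioned above: fan the same selection out to every site that hosts part of the data and merge the results. The run_query helper and the site descriptors are hypothetical; they would come from the replica catalog and the per-site database services.

from concurrent.futures import ThreadPoolExecutor

def distributed_select(sites, sql, binds, run_query):
    """sites: connection descriptors from the replica catalog;
    run_query(site, sql, binds) -> list of rows from that site."""
    results = []
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = [pool.submit(run_query, site, sql, binds) for site in sites]
        for future in futures:
            results.extend(future.result())
    # De-duplicate, assuming the first two columns are (event_number, run_number),
    # in case the same runs are hosted at more than one site
    return list({(row[0], row[1]): row for row in results}.values())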

IV. On-demand selection of (TAG) services
 In an automated load-balancing environment, several decisions have to be taken when a user makes a request:
   Which DB site to connect to?
   Which browser to connect to?
   Which web services to use?
  ...in order to satisfy the user's request within a defined time (quality-of-service baseline) while ensuring that the whole distributed system is able to satisfy as many requests as possible (load-balanced system, avoiding bottlenecks as much as possible).
 Information needed to take these decisions:
   Service deployments, status and metrics (-> services catalog)
   Data distribution (-> data catalog)
   User input
   Typical usage patterns and distribution of requests
 Service selection takes place on demand, at request time (a toy scoring example is sketched below).
 Investigation of approaches and algorithms is underway.
 The model will have to be able to adapt easily to:
   New services
   Evolving objective functions
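As a toy illustration of such an on-demand decision, the sketch below scores candidate sites using information that would come from the two catalogs. The metric names and weights are purely illustrative; defining the real objective function is exactly the subject of the ongoing investigation.

def pick_site(candidates, required_collections):
    """candidates: per-site catalog information, e.g.
    {'site': 'CERN', 'hosts': {'data10_egamma'}, 'up': True,
     'load': 0.4, 'latency_ms': 30}
    required_collections: set of collection names the request needs."""
    def score(c):
        if not c["up"] or not required_collections <= c["hosts"]:
            return float("-inf")            # cannot serve the request at all
        # Lower load and lower latency score higher (weights are arbitrary)
        return -(0.7 * c["load"] + 0.3 * c["latency_ms"] / 100.0)

    best = max(candidates, key=score)
    if score(best) == float("-inf"):
        raise RuntimeError("no site can serve this request")
    return best["site"]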

Conclusions
 TAG content and operational processes are stable, but more sites might join to host TAG data and/or services -> evolving infrastructure.
 As in the file-based world, central catalogs are in place that summarize the data and services distribution, as well as service metrics.
 Experiences since data taking:
   TAGs are uploaded to the databases without much delay
   Separate process for the upload of reprocessed data
   Importance of efficient space management (including data deletion)
   The strategies adopted by the ATLAS DBAs to manage the large volume of TAG data have proved to be efficient.
   New use cases arise (event selection)
 Effort is now on automating request distribution and load balancing.
 Hypernews for questions about TAGs: hn-atlas-physicsMetadata