A first look at CitusDB & in-database physics analysis
M. Limper 19/06/2014

Introduction
Physics analysis is currently file-based, and scanning through large datasets can be cumbersome:
– The idea: send jobs to the computing grid
– In practice: bored of waiting for grid jobs to finish, scientists filter their datasets, throwing away data until what remains fits on the physicist’s laptop
What if we could provide access to large datasets via a database?

Introduction
In-database physics analysis: SQL goes in, results come out!
[Plot: dimuon invariant-mass spectrum showing the J/ψ and ψ(3686) peaks]
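The query behind such a plot could look roughly like the sketch below, assuming a per-muon table named muon with RunNumber, EventNumber, a per-muon index mu_i, four-momentum components mu_E, mu_px, mu_py, mu_pz (in MeV) and mu_charge; these table and column names are illustrative, not the actual schema used in this work:
-- Hypothetical dimuon invariant-mass spectrum: combine oppositely charged
-- muon pairs from the same event and count pairs in 10 MeV mass bins.
SELECT width_bucket(
         sqrt( power(m1.mu_E  + m2.mu_E , 2)
             - power(m1.mu_px + m2.mu_px, 2)
             - power(m1.mu_py + m2.mu_py, 2)
             - power(m1.mu_pz + m2.mu_pz, 2) ),
         0, 5000, 500)  AS mass_bin,          -- 10 MeV bins up to 5 GeV
       count(*)         AS n_pairs
FROM muon m1
JOIN muon m2
  ON  m1.RunNumber   = m2.RunNumber
  AND m1.EventNumber = m2.EventNumber
  AND m1.mu_i < m2.mu_i                       -- count each pair only once
WHERE m1.mu_charge * m2.mu_charge < 0         -- opposite charge
GROUP BY mass_bin
ORDER BY mass_bin;
Binning inside the database keeps the result small: only 500 (bin, count) rows come back instead of millions of muon pairs.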

My data
Test sample of 127 ntuple files of collision data recorded by the ATLAS experiment => a subset representing 3 fat ‘LHC runs’, ~0.5% of the total dataset
7.1 million events in total
6022 “branches” per event
– 2053 “scalar”-type branches
– 3527 “vector”-type branches
– 379 “vector-of-vector”-type branches
– 63 “vector-of-vector-of-vector”-type branches
~200 GB of data

Ntuple branch examples
2053 scalar-type variables
Missing Energy: one value per branch per event
Float_t MET_RefFinal_em_etx;
Float_t MET_RefFinal_em_ety;
Float_t MET_RefFinal_em_phi;
Float_t MET_RefFinal_em_et;
Float_t MET_RefFinal_em_sumet;
Float_t MET_RefFinal_etx;
Float_t MET_RefFinal_ety;
Float_t MET_RefFinal_phi;
Event Filter: one value per branch per event
Bool_t EF_2b55_loose_j145_j55_a4tchad;
Bool_t EF_2e12Tvh_loose1;
Bool_t EF_2e5_tight1_Jpsi;
Bool_t EF_2e7T_loose1_mu6;
Bool_t EF_2e7T_medium1_mu6;
Bool_t EF_2g15vh_medium_g10_medium;
Bool_t EF_2g20vh_medium;
Lots of variables, but a relatively small fraction of the total dataset

Ntuple branch examples
3527 vector-type variables
One value per electron per event:
vector<float> *el_E; vector<float> *el_Et; vector<float> *el_pt; vector<float> *el_m; vector<float> *el_eta; vector<float> *el_phi; vector<float> *el_px; vector<float> *el_py; vector<float> *el_pz; vector<float> *el_charge; vector<int> *el_author;
One value per muon per event:
vector<int> *mu_allauthor; vector<int> *mu_author; vector<float> *mu_beta; vector<float> *mu_isMuonLikelihood; vector<float> *mu_matchchi2; vector<int> *mu_matchndof; vector<float> *mu_etcone20; vector<float> *mu_etcone30; vector<float> *mu_etcone40; vector<int> *mu_nucone20; vector<int> *mu_nucone30; vector<int> *mu_nucone40;
One value per photon per event:
vector<float> *ph_CaloPointing_eta; vector<float> *ph_CaloPointing_sigma_eta; vector<float> *ph_CaloPointing_zvertex; vector<float> *ph_CaloPointing_sigma_zvertex; vector<float> *ph_HPV_eta; vector<float> *ph_HPV_sigma_eta; vector<float> *ph_HPV_zvertex; vector<float> *ph_HPV_sigma_zvertex; vector<int> *ph_NN_passes; vector<float> *ph_NN_discriminant;
Representing the bulk of the data (many particles per event!)
Analysis relies heavily on filtering events by selecting particles with certain properties

Ntuple branch examples
379 vector-of-vector type variables
One value per ‘SpaceTime’ measurement on each muon per event:
vector<vector<int> > *mu_SpaceTime_detID; vector<vector<float> > *mu_SpaceTime_t; vector<vector<float> > *mu_SpaceTime_tError; vector<vector<float> > *mu_SpaceTime_weight;
One value per vertex per photon per event:
vector<vector<float> > *ph_vx_px; vector<vector<float> > *ph_vx_py; vector<vector<float> > *ph_vx_pz; vector<vector<float> > *ph_vx_E; vector<vector<float> > *ph_vx_m; vector<vector<int> > *ph_vx_nTracks;
Used for certain reconstruction performance studies
To be stored in a CLOB or a separate table…
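As a sketch of what “separate table” could mean for the SpaceTime branches: one row per measurement, linked back to its muon by run number, event number and a per-muon index (the table and index-column names below are made up for illustration):
-- Hypothetical side table for the vector-of-vector SpaceTime branches.
CREATE TABLE mu_spacetime (
  RunNumber     INTEGER NOT NULL,
  EventNumber   INTEGER NOT NULL,
  mu_i          INTEGER NOT NULL,   -- index of the muon within the event
  measurement_i INTEGER NOT NULL,   -- index of the measurement on that muon
  detID         INTEGER,
  t             FLOAT,
  tError        FLOAT,
  weight        FLOAT
);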

Ntuple branch examples
63 vector-of-vector-of-vector type variables
One value per track per vertex per photon per event:
vector<vector<vector<int> > > *ph_vx_convTrk_nSiHits; vector<vector<vector<float> > > *ph_vx_convTrk_TRTHighTHitsRatio; vector<vector<vector<float> > > *ph_vx_convTrk_TRTHighTOutliersRatio; vector<vector<vector<float> > > *ph_vx_convTrk_eProbabilityComb;
Not using any of these in my queries; typically used for final corrections or certain in-depth studies of reconstruction performance
To be stored in a CLOB or a separate table…

Converting ntuples to tables
Self-made program to convert ntuples into database tables
One physics object is represented by one table
Each table still has hundreds of columns!

SQL analysis
SQL analysis involves predicate filtering to select good objects and JOINs to put information from different tables together.
CitusDB + the column-store extension looks interesting:
– Object selection involves only a few out of many columns => would benefit from column storage
– When the preselection passes many objects, JOINs can potentially become huge => would benefit from sharding, with shard distribution based on EventNumber to reduce the JOIN size
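For illustration, a sketch of such a filter-plus-join query, assuming a per-electron table named electron that carries RunNumber, EventNumber, el_pt and el_eta (illustrative names, pT in MeV), joined to the event-level table shown on a later slide:
-- Hypothetical selection + join: central electrons with pT > 25 GeV
-- in events that passed the Egamma stream decision.
SELECT el.RunNumber, el.EventNumber, el.el_pt, el.el_eta
FROM electron el
JOIN eventdata203779_c ev
  ON  el.RunNumber   = ev.RunNumber
  AND el.EventNumber = ev.EventNumber
WHERE el.el_pt > 25000                 -- MeV
  AND abs(el.el_eta) < 2.47
  AND ev."streamDecision_Egamma";
Only a handful of the hundreds of columns are touched, which is exactly the access pattern column storage rewards, and the equi-join on RunNumber/EventNumber is what the shard distribution is meant to keep local.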

Storing ntuple data into CitusDB
Re-wrote my program to store data in CitusDB:
– Read data from all branches with a specific prefix
– Write the data as comma-delimited values to a temporary csv-file
– After the csv-file passes 5000 lines of data, store the data into CitusDB
The program triggers psql via a command-line argument to execute a psql macro
The psql macro uses the \STAGE command to load the data
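A rough sketch of that last step (the macro file name, database name and csv path are placeholders, not the ones actually used): the converter would invoke something like psql -d physics -f load_eventdata.sql, where load_eventdata.sql contains the staging command:
\STAGE eventdata203779_c FROM '/data_citusdb/csv/eventdata_batch_0001.csv' (FORMAT CSV)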

Create table statement
CREATE FOREIGN TABLE eventdata203779_c (
  RunNumber INTEGER NOT NULL, EventNumber INTEGER NOT NULL, lbn INTEGER NOT NULL,
  "bunch_configID" INT, "timestamp" INT, "timestamp_ns" INT, "bcid" INT,
  "detmask0" INT, "detmask1" INT, "actualIntPerXing" FLOAT, "averageIntPerXing" FLOAT,
  "pixelFlags" INT, "sctFlags" INT, "trtFlags" INT, "larFlags" INT, "tileFlags" INT,
  "fwdFlags" INT, "coreFlags" INT, "pixelError" INT, "sctError" INT, "trtError" INT,
  "larError" INT, "tileError" INT, "fwdError" INT, "coreError" INT,
  "streamDecision_Egamma" BOOLEAN, "streamDecision_Muons" BOOLEAN,
  "streamDecision_JetTauEtmiss" BOOLEAN, "isSimulation" BOOLEAN,
  "isCalibration" BOOLEAN, "isTestBeam" BOOLEAN,
  "el_n" INT, "v0_n" INT, "ph_n" INT, "mu_n" INT, "tau_n" INT, "trk_n" INT,
  "jet_n" INT, "vxp_n" INT, "top_hfor_type" INT,
  "Muon_Total_Staco_STVF_etx" FLOAT, "Muon_Total_Staco_STVF_ety" FLOAT,
  "Muon_Total_Staco_STVF_phi" FLOAT, "Muon_Total_Staco_STVF_et" FLOAT,
  "Muon_Total_Staco_STVF_sumet" FLOAT, "Muon_Total_Staco_STVF_top_etx" FLOAT,
  "Muon_Total_Staco_STVF_top_ety" FLOAT, "Muon_Total_Staco_STVF_top_phi" FLOAT,
  "Muon_Total_Staco_STVF_top_et" FLOAT, "Muon_Total_Staco_STVF_top_sumet" FLOAT,
  "mb_n" INT, "collcand_passCaloTime" BOOLEAN, "collcand_passMBTSTime" BOOLEAN,
  "collcand_passTrigger" BOOLEAN, "collcand_pass" BOOLEAN
) DISTRIBUTE BY APPEND (EventNumber)
SERVER cstore_server
OPTIONS (filename '', compression 'pglz');
Create foreign tables stored using the column-store extension
Distribute shards by EventNumber:
– Keep data from the same event together
– Facilitate joins between different tables
One table per RunNumber: distribute shards by (RunNumber, EventNumber) not possible

\STAGE statement
I found it useful to specify the columns, as different ntuples can contain different branches: if a column is not present in the csv, it should insert NULL. So I tried:
\STAGE eventdata203779_c (RunNumber,EventNumber,lbn,"bunch_configID","timestamp","timestamp_ns","bcid","detmask0","detmask1","actualIntPerXing","averageIntPerXing","pixelFlags","sctFlags","trtFlags","larFlags","tileFlags","fwdFlags","coreFlags","pixelError","sctError","trtError","larError","tileError","fwdError","coreError","streamDecision_Egamma","streamDecision_Muons","streamDecision_JetTauEtmiss","isSimulation","isCalibration","isTestBeam","el_n","v0_n","ph_n","mu_n","tau_n","trk_n","jet_n","vxp_n","top_hfor_type","Muon_Total_Staco_STVF_etx","Muon_Total_Staco_STVF_ety","Muon_Total_Staco_STVF_phi","Muon_Total_Staco_STVF_et","Muon_Total_Staco_STVF_sumet","Muon_Total_Staco_STVF_top_etx","Muon_Total_Staco_STVF_top_ety","Muon_Total_Staco_STVF_top_phi","Muon_Total_Staco_STVF_top_et","Muon_Total_Staco_STVF_top_sumet","mb_n","collcand_passCaloTime","collcand_passMBTSTime","collcand_passTrigger","collcand_pass") FROM '/data1/citus_db/csv/NTUP_TOPEL NTUP_TOPEL _ root.1.eventdata.csv' (FORMAT CSV)
Gives: \copy: ERROR: copy column list is not supported
I can’t define columns when inserting into foreign tables using \STAGE. Too bad. Similarly, I’d like to have an option to add columns (is this possible? Didn’t look at it yet).
Instead I’ll use the simple \STAGE command:
\STAGE eventdata203779_c FROM '/data_citusdb/csv/NTUP_TOPEL NTUP_TOPEL _ root.1.eventdata.csv' (FORMAT CSV)
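A possible workaround, sketched but not tested here: \copy does accept a column list for a regular (non-foreign) table, so the partial csv could be loaded into a plain staging table first (missing columns become NULL) and re-exported with the full column layout before staging. The staging-table name, file paths and the three-column list are placeholders:
-- Create an empty regular table with the same columns as the foreign table.
CREATE TEMP TABLE eventdata_staging AS SELECT * FROM eventdata203779_c WHERE false;
-- Load only the columns the csv actually contains; the rest stay NULL.
\copy eventdata_staging (RunNumber, EventNumber, lbn) FROM '/tmp/partial.csv' (FORMAT CSV)
-- Write out a csv with the full column layout and \STAGE that instead.
\copy (SELECT * FROM eventdata_staging) TO '/tmp/full_width.csv' (FORMAT CSV)
\STAGE eventdata203779_c FROM '/tmp/full_width.csv' (FORMAT CSV)
All four commands would have to run in the same psql session, since the staging table is temporary.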

Primary Key issues
I’d like to set a primary key on (RunNumber, EventNumber, ObjectNumber), because:
– Ntuple files occasionally store the same event twice
– Due to the way the experiment records data from ‘streams’, there is some overlap between different streams
– Sorting out doubles is yet another hassle for physicists to deal with; a database with a primary key ensures unique events are stored… but:
– Currently, a \STAGE insert of an entire .csv-file fails when it finds one double among the lines
Work-around = no primary key constraint for now…
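Until then, doubles can at least be located after loading with a duplicate check along these lines (a sketch; if the distributed planner objects to the GROUP BY/HAVING, the same query can be run shard-by-shard on the workers):
-- Hypothetical duplicate check on the event-level table: list events loaded more than once.
SELECT RunNumber, EventNumber, count(*) AS n_copies
FROM eventdata203779_c
GROUP BY RunNumber, EventNumber
HAVING count(*) > 1;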

Other issues
While testing, I frequently decide to recreate some tables, but how do I drop a FOREIGN TABLE including its shards?
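One candidate answer, assuming the master_apply_delete_command() function documented for append-distributed tables in later Citus releases already exists in this CitusDB version (an assumption, not verified here): delete the shards first, then drop the foreign table.
-- Assumption: master_apply_delete_command() removes all shards of an
-- append-distributed table when given an unconditional DELETE.
SELECT master_apply_delete_command('DELETE FROM eventdata203779_c');
DROP FOREIGN TABLE eventdata203779_c;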

Storing ntuple data into CitusDB
Example of the insert program churning through the data…

Query test 15:34 Do I still have time to test something before the call??

To-do
Insert all my data: I need to find a good way to use my 18 disks per node (mystery errors were coming from my RAID setup)
– Maybe I just mount each disk separately and run one worker per disk? (see the sketch below)
Get some queries going!
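If the one-worker-per-disk route is taken, the master’s worker membership file might simply list the same host once per worker instance, each on its own port. This assumes CitusDB of this era reads its worker list from pg_worker_list.conf with one ‘hostname port’ entry per line; the file name, hostnames and ports below are all assumptions, with one entry per disk-backed PostgreSQL instance whose data directory sits on its own disk:
worker-node-1 9700
worker-node-1 9701
worker-node-1 9702
worker-node-2 9700
worker-node-2 9701
worker-node-2 9702
The pattern would continue up to 18 port entries per node, one per disk.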