A first look at CitusDB & in- database physics analysis M. Limper 19/06/2014.

A first look at CitusDB & in- database physics analysis M. Limper 19/06/2014

Introduction Physics Analysis is currently file-based Scanning through large datasets can be cumbersome – The idea: send jobs to the computing grid – In practice: bored waiting for grid-job to finish, scientists filter datasets, throwing away data until it fits on the physicist’ laptop What if we could provide access to large datasets via a database?

Introduction In-database physics analysis: SQL goes in, results come out! J/ψ Ψ(3686)

My data Test sample of 127 ntuple-files of collision-data recorded by ATLAS experiment => subset re-presenting 3 fat ‘LHC runs’, ~0.5% of total dataset 7.1 million events total 6022 “branches” per event – 2053 “scalar”-type branches – 3527 “vector”-type branches – 379 “vector-of-vector”-type branches – 63 “vector-of-vector-of-vector”-type branches ~200 GB of data

Ntuple branch examples 2053 scalar-type variables Missing Energy: one value per branch event Float_t MET_RefFinal_em_etx; Float_t MET_RefFinal_em_ety; Float_t MET_RefFinal_em_phi; Float_t MET_RefFinal_em_et; Float_t MET_RefFinal_em_sumet; Float_t MET_RefFinal_etx; Float_t MET_RefFinal_ety; Float_t MET_RefFinal_phi; Event Filter: one value per branch per event Bool_t EF_2b55_loose_j145_j55_a4tchad; Bool_t EF_2e12Tvh_loose1; Bool_t EF_2e5_tight1_Jpsi; Bool_t EF_2e7T_loose1_mu6; Bool_t EF_2e7T_medium1_mu6; Bool_t EF_2g15vh_medium_g10_medium; Bool_t EF_2g20vh_medium; Lots of variables but relatively small fraction of the total dataset

Ntuple branch examples One value per electron per event vector *el_E; vector *el_Et; vector *el_pt; vector *el_m; vector *el_eta; vector *el_phi; vector *el_px; vector *el_py; vector *el_pz; vector *el_charge; vector *el_author; 3527 vector-type variables One value per muon per event vector *mu_allauthor; vector *mu_author; vector *mu_beta; vector *mu_isMuonLikelihood; vector *mu_matchchi2; vector *mu_matchndof; vector *mu_etcone20; vector *mu_etcone30; vector *mu_etcone40; vector *mu_nucone20; vector *mu_nucone30; vector *mu_nucone40; One value per photon per event vector *ph_CaloPointing_eta; vector *ph_CaloPointing_sigma_eta; vector *ph_CaloPointing_zvertex; vector *ph_CaloPointing_sigma_zvertex; vector *ph_HPV_eta; vector *ph_HPV_sigma_eta; vector *ph_HPV_zvertex; vector *ph_HPV_sigma_zvertex; vector *ph_NN_passes; vector *ph_NN_discriminant; Representing the bulk of the data (many particles per event!) Analysis relies heavily on filtering events by selection particles with certain properties

Ntuple branch examples 379 vector-of-vector type variables One value per ‘SpaceTime’-measurement on each muon per event: vector > *mu_SpaceTime_detID; vector > *mu_SpaceTime_t; vector > *mu_SpaceTime_tError; vector > *mu_SpaceTime_weight; One value per vertex per photon per event: vector > *ph_vx_px; vector > *ph_vx_py; vector > *ph_vx_pz; vector > *ph_vx_E; vector > *ph_vx_m; vector > *ph_vx_nTracks; Used for certain reconstruction performance studies To be stored in CLOB or separate table…

Ntuple branch examples 63 vector-of-vector-vector type variables One value per track per vertex per photon per event: vector > > *ph_vx_convTrk_nSiHits; vector > > *ph_vx_convTrk_TRTHighTHitsRatio; vector > > *ph_vx_convTrk_TRTHighTOutliersRatio; vector > > *ph_vx_convTrk_eProbabilityComb; Not using any of these in my queries, typically used for final corrections or certain in-depth studies of reconstruction performance To be stored in CLOB or separate table…

Converting ntuples to tables Self-made program to convert ntuples into database tables One physics-object is represented by one table Each table still has hundreds of columns!

SQL analysis SQL analysis involves predicate filtering to select good objects and JOINs to put information from different tables together: CitusDB+column-store extension looks interesting: Object selection involves only a few out of many columns => would benefit from column storage When preselection passes many objects, JOINs can potentially become huge => would benefit from sharding, with shard-distribution based on EventNumber to reduce JOIN-size

Storing ntuple-data into to CitusDB Re-wrote my program to store data in CitusDB – Read data from all branches with specific prefix – Write data as comma-delimited values in temporary value – After csv-file passes 5000 lines of data, store data into CitusDB Program trigger command-line argument for psql to execute psql- macro Psql-macro uses \STAGE command to load data

Create table statement CREATE FOREIGN TABLE eventdata203779_c (RunNumber INTEGER NOT NULL,EventNumber INTEGER NOT NULL, lbn INTEGER NOT NULL,"bunch_configID" INT,"timestamp" INT,"timestamp_ns" INT,"bcid" INT,"detmask0" INT,"detmask1" INT,"actualIntPerXing" FLOAT,"averageIntPerXing" FLOAT,"pixelFlags" INT,"sctFlags" INT,"trtFlags" INT,"larFlags" INT,"tileFlags" INT,"fwdFlags" INT,"coreFlags" INT,"pixelError" INT,"sctError" INT,"trtError" INT,"larError" INT,"tileError" INT,"fwdError" INT,"coreError" INT,"streamDecision_Egamma" BOOLEAN,"streamDecision_Muons" BOOLEAN,"streamDecision_JetTauEtmiss" BOOLEAN,"isSimulation" BOOLEAN,"isCalibration" BOOLEAN,"isTestBeam" BOOLEAN,"el_n" INT,"v0_n" INT,"ph_n" INT,"mu_n" INT,"tau_n" INT,"trk_n" INT,"jet_n" INT,"vxp_n" INT,"top_hfor_type" INT,"Muon_Total_Staco_STVF_etx" FLOAT,"Muon_Total_Staco_STVF_ety" FLOAT,"Muon_Total_Staco_STVF_phi" FLOAT,"Muon_Total_Staco_STVF_et" FLOAT,"Muon_Total_Staco_STVF_sumet" FLOAT,"Muon_Total_Staco_STVF_top_etx" FLOAT,"Muon_Total_Staco_STVF_top_ety" FLOAT,"Muon_Total_Staco_STVF_top_phi" FLOAT,"Muon_Total_Staco_STVF_top_et" FLOAT,"Muon_Total_Staco_STVF_top_sumet" FLOAT,"mb_n" INT,"collcand_passCaloTime" BOOLEAN,"collcand_passMBTSTime" BOOLEAN,"collcand_passTrigger" BOOLEAN,"collcand_pass" BOOLEAN) DISTRIBUTE BY APPEND (EventNumber) SERVER cstore_server OPTIONS(filename '', compression 'pglz'); Create foreign tables stored using column-store extension Distribute shards by EventNumber Keep data from the same event together Facilitate joins between different tables One table per RunNumber: distribute shards by (RunNumber,EventNumber) not possible

\STAGE statement \STAGE eventdata203779_c FROM '/data_citusdb/csv/NTUP_TOPEL.00872780.NTUP_TOPEL.00872780._000001.root.1.eventdata.csv' (FORMAT CSV) Gives: \copy: ERROR: copy column list is not supported I can’t define columns when inserting into foreign tables using \STAGE Too bad, I found it useful to specify the columns as different ntuple can contain different branches: if column is not specified in csv, it should insert null Similarly I’d like to have an option to add columns (is this possible? Didn’t look at it yet) Instead I’ll use simple \STAGE command: \STAGE eventdata203779_c (RunNumber,EventNumber,lbn,"bunch_configID","timestamp","timestamp_ns","bcid","detmask0","detmask1","act ualIntPerXing","averageIntPerXing","pixelFlags","sctFlags","trtFlags","larFlags","tileFlags","fwdFlags","coreFlags","pi xelError","sctError","trtError","larError","tileError","fwdError","coreError","streamDecision_Egamma","streamDecis ion_Muons","streamDecision_JetTauEtmiss","isSimulation","isCalibration","isTestBeam","el_n","v0_n","ph_n","mu _n","tau_n","trk_n","jet_n","vxp_n","top_hfor_type","Muon_Total_Staco_STVF_etx","Muon_Total_Staco_STVF_ety ","Muon_Total_Staco_STVF_phi","Muon_Total_Staco_STVF_et","Muon_Total_Staco_STVF_sumet","Muon_Total_St aco_STVF_top_etx","Muon_Total_Staco_STVF_top_ety","Muon_Total_Staco_STVF_top_phi","Muon_Total_Staco_S TVF_top_et","Muon_Total_Staco_STVF_top_sumet","mb_n","collcand_passCaloTime","collcand_passMBTSTime"," collcand_passTrigger","collcand_pass") FROM '/data1/citus_db/csv/NTUP_TOPEL.00872780.NTUP_TOPEL.00872780._000001.root.1.eventdata.csv' (FORMAT CSV)

Primary Key issues I’d like to set primary key set on (RunNumber,EventNumber,ObjectNumber). Ntuple-files occassionaly store the same event twice Due to the way experiments records data from ‘streams’, some overlap from different streams Sorting out doubles is yet another hassle for physicists to deal with, using a database with primary-key ensures unique event are store… but: Currently using \STAGE insert of all data in the entire.csv-file fails when it find 1 double among the 50000 lines Work-around= no primary key constraint for now…

Other issues While testing, I’m frequently deciding to recreate some tables, but how do I drop FOREIGN table including the shards?

Storing ntuple-data into to CitusDB Example of insert-program churning through the data…

Query test 15:34 Do I still have time to test something before the call??

To-do Insert all my data: I need to find a good way to use my 18 disks per node (mystery errors were coming from my raid setup) Maybe I just mount each disk separately and run one worker per disk? Get some queries going!

A first look at CitusDB & in- database physics analysis M. Limper 19/06/2014.

Similar presentations

Presentation on theme: "A first look at CitusDB & in- database physics analysis M. Limper 19/06/2014."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

A first look at CitusDB & in- database physics analysis M. Limper 19/06/2014.

Similar presentations

Presentation on theme: "A first look at CitusDB & in- database physics analysis M. Limper 19/06/2014."— Presentation transcript:

Similar presentations

About project

Feedback