Optimizing CMS Data Formats for Analysis
Peerut Boonchokchuay
August 11th, 2015


Outline
– Introduction to MiniAOD
– Objectives for Project
– Method Developed
– Example Results
– Conclusion and Next Steps

CMS MiniAOD Data Format
First introduced in 2014, MiniAOD is a new CMS data format for analysis in LHC Run 2
– Data stored in a MiniAOD file comprise high-level objects such as tracks, jets, and electrons
– MiniAOD is a subset of AOD, which is in turn a subset of RECO, the output of Tier-0 event reconstruction
MiniAOD is 10x smaller than AOD
– One advantage is that the MiniAOD can be recreated quickly to improve the original physics object definitions using the latest algorithms
Figure labels: RAW, RECO, AOD, MiniAOD

Objectives
As with any analysis format, users benefit from faster processing times and smaller data sizes (to fit more data on your laptop!)
To investigate ways to improve MiniAOD performance in terms of:
1. Time to read events from file
2. Size of output file
while keeping in check:
1. Job memory usage (RSS of the CMSSW application)
2. Time to write events to file

CMS MiniAOD data flow
Step 1: Create MiniAOD (AOD.root -> Input Source -> MiniAOD object creation -> Output Source -> MiniAOD.root)
Step 2: Analyze MiniAOD (MiniAOD.root -> Input Source -> Analysis)
MiniAOD modules and parameters are based on ROOT

Method
Study MiniAOD Creation Process
– Execution time
– Write time
– Compression time
– Output file size
– Memory usage
Study MiniAOD Reading Process
– Execution time
– Read time
– Decompression time
Adjust Parameter Settings
– Number of events
– Basket size
– Compression algorithm
– Compression level
– …
Measure Performance
– Tools: igProf, RSS monitoring
– In terms of: CPU time, memory usage
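The adjust-and-measure loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_job` is a hypothetical stand-in for a MiniAOD creation or reading job, the parameter names are invented for the example, and the standard-library `resource` module (Unix only) stands in for the igProf and RSS monitoring used in the project.

```python
import itertools
import resource
import time


def measure(task, *args):
    """Run task(*args); return (result, CPU seconds used, peak RSS so far)."""
    cpu_start = time.process_time()
    result = task(*args)
    cpu_used = time.process_time() - cpu_start
    # ru_maxrss is the peak resident set size of this process so far
    # (kilobytes on Linux, bytes on macOS).
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, cpu_used, peak_rss


def toy_job(n_events, buffer_size):
    """Hypothetical stand-in for a real job: cut a payload into buffers."""
    payload = bytes(n_events * 100)  # assume 100 bytes per "event"
    buffers = [payload[i:i + buffer_size]
               for i in range(0, len(payload), buffer_size)]
    return len(buffers)


def sweep(event_counts, buffer_sizes):
    """Measure the toy job for every combination of the two settings."""
    rows = []
    for n_events, buffer_size in itertools.product(event_counts, buffer_sizes):
        n_buffers, cpu_used, peak_rss = measure(toy_job, n_events, buffer_size)
        rows.append({
            "n_events": n_events,
            "buffer_size": buffer_size,
            "n_buffers": n_buffers,
            "cpu_s": cpu_used,
            "peak_rss": peak_rss,
        })
    return rows


if __name__ == "__main__":
    for row in sweep([1000, 10000], [1024, 16384]):
        print(row)
```

Because `ru_maxrss` reports a process-wide peak, a real study would more likely run each configuration in a fresh process so that one setting does not mask the memory use of the next.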

Example of igProf Results (figure)


Example Results: MiniAOD Reading Process
By adjusting the basket size from the default value to the recommended value, we can reduce MiniAOD file size by 8% and read-back time by 9%
Plot markers: Default value, Recommended value
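One plausible reason buffer size affects file size is that each buffer is compressed on its own, so redundancy spanning several small buffers cannot be exploited. The toy sketch below illustrates that effect with Python's `zlib`; it assumes independent per-buffer compression, uses synthetic data built to repeat every 4 KiB, and does not use ROOT or real MiniAOD content. The buffer sizes are arbitrary illustration values, not the default or recommended settings from the slides.

```python
import random
import zlib


def compressed_size(data, buffer_size, level=6):
    """Total size when data is cut into buffers compressed one by one."""
    total = 0
    for start in range(0, len(data), buffer_size):
        total += len(zlib.compress(data[start:start + buffer_size], level))
    return total


def make_toy_data():
    """128 KiB of synthetic data: one random 4 KiB block repeated 32 times."""
    rng = random.Random(0)  # fixed seed so the data is reproducible
    block = bytes(rng.getrandbits(8) for _ in range(4096))
    return block * 32


if __name__ == "__main__":
    data = make_toy_data()
    for buffer_size in (1024, 4096, 16384, 65536):
        size = compressed_size(data, buffer_size)
        print(f"buffer {buffer_size:6d} B -> compressed total {size:7d} B")
```

With buffers smaller than the repeat distance, each buffer looks like random bytes to the compressor; larger buffers let it see the repetition. Real event data is far less regular, so the gain seen there (8% in the slides) is much more modest than in this toy case.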

Example Results: MiniAOD Creation Process
Increasing the basket size to the recommended value does not significantly increase the total RSS of the application used to create the CMS MiniAOD
Plot markers: Default value, Recommended value

Example Results: Performance vs File Compression Algorithm
Of the two algorithms, LZMA and ZLIB, LZMA is more complex and therefore takes longer to process, but yields a smaller output size.
Write time for LZMA rises drastically as the compression level increases, whereas ZLIB takes almost the same amount of time at every level.
Compression level has a negligible effect on read time for both algorithms; however, LZMA read time can be up to 40% longer than that of ZLIB.
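The same kind of comparison can be tried on a small scale with Python's standard `zlib` and `lzma` modules. This is a rough sketch, not the CMSSW measurement: the payload is invented text records, the helper names are hypothetical, and the timings depend on the machine, so no particular numbers should be expected.

```python
import lzma
import time
import zlib


def benchmark(name, compress, decompress, data):
    """Time one compress/decompress round trip and record the packed size."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    if restored != data:
        raise ValueError(f"{name}: round trip changed the data")
    return {"name": name, "size": len(packed),
            "write_s": t1 - t0, "read_s": t2 - t1}


def make_toy_payload(n_records=20000):
    """Invented text records standing in for event data."""
    return b"".join(
        b"event=%d pt=%d eta=%d\n" % (i, (i * 37) % 500, (i * 11) % 50)
        for i in range(n_records)
    )


def compare(data, levels=(1, 6, 9)):
    """Benchmark zlib and lzma at each compression level."""
    rows = []
    for level in levels:
        rows.append(benchmark(
            f"zlib-{level}",
            lambda d, lv=level: zlib.compress(d, lv),
            zlib.decompress, data))
        rows.append(benchmark(
            f"lzma-{level}",
            lambda d, lv=level: lzma.compress(d, preset=lv),
            lzma.decompress, data))
    return rows


if __name__ == "__main__":
    for row in compare(make_toy_payload()):
        print(f"{row['name']:8s} size={row['size']:7d} B "
              f"write={row['write_s']:.4f} s read={row['read_s']:.4f} s")
```

Running it prints one row per algorithm and level, which makes it easy to compare size against write and read time in the way the slide does.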

Conclusion
We investigated ways to improve CMS MiniAOD performance.
We found that optimizing some parameter settings could yield a significant performance gain.
– Basket size tuning could improve read-back time as well as MiniAOD output file size, with a relatively small increase in job memory
– Other parameters we investigated, including the compression algorithm, were already set close to their optimal point
Next step: Investigate ways to improve performance by changing the MiniAOD object definitions

Thank You