ACAT
CMS Data Analysis: Current Status and Future Strategy
On behalf of the CMS Collaboration
Lassi A. Tuura, Northeastern University, Boston

June, 2002 Lassi A. Tuura, Northeastern University

Overview
* The Context: CMS Analysis Today
* Data Analysis Environment Architecture
  - Overview
  - COBRA
  - IGUANA
  - GRID/Production
* Tomorrow and Beyond
  - Leveraging current frameworks in the Grid-enriched analysis environment
  - Clarens client-server prototype
  - Other prototype activities

Context
* Challenges
  - Complexity
  - Geographic Dispersion
  - Direct Access To Data
  - Migration from Reconstruction to Trigger
* Environments
  - Real-Time Event Filter, Online Monitoring
  - Pre-emptive Simulation, Reconstruction, Analysis
  - Interactive Statistical Analysis

Current CMS Production
[Diagram: production chain. Labels: Pythia; HEPEVT Ntuples; CMSIM (GEANT3); Zebra files with HITS; ORCA/COBRA ooHit Formatter; OSCAR/COBRA (GEANT4); Objectivity Database; ORCA/COBRA Digitization (merge signal and pile-up); Objectivity Database; ORCA User Analysis; Ntuples or Root files; Objectivity Database; IGUANA Interactive Analysis]

Complexity of Production
Number of Regional Centers:    11
Number of Computing Centers:   21
Number of CPUs:                ~1000
Largest Local Center:          176 CPUs
Number of Production Passes for each Dataset (including analysis group processing done by production):  6-8
Number of Files:               ~11,000
Data Size (not including fz files from Simulation):  17 TB
File Transfer by GDMP and by perl scripts over scp/bbcp:  TB toward T1, 4 TB toward T2

Interactive Analysis
[Figure callouts:]
* Lizard Qt plotter
* ANAPHE histogram extended with pointers to CMS events
* Emacs used to edit a CMS C++ plugin to create and fill histograms
* OpenInventor-based display of selected event
* Python shell with Lizard & CMS modules
Most analysis is done using NTUPLEs in PAW, some in ROOT.

Behind the Scenes: Frameworks
[Diagram: framework overview. Labels: Consistent User Interface; Federation wizards; Detector/Event Display; Data Browser; Analysis job wizards; Generic analysis Tools; ORCA; FAMOS; OSCAR; COBRA; Objy tools; GRID; CMS tools; Distributed Data Store & Computing Infrastructure; Coherent basic tools and mechanisms]

Frameworks Dissected
[Diagram: layered framework structure. Labels: Physics modules (Grid-Uploadable); Specific Frameworks: Event Filter, Reconstruction Algorithms, Physics Analysis, Data Monitoring; Generic Application Framework; (Grid-aware) Data-Products; Calibration Objects; Configuration Objects; Event Objects; Basic Services; Adapters and Extensions; ODBMS; GEANT 3 / 4; CLHEP; PAW Replacement; C++ Standard Library + Extension Toolkits]

Framework Design Basis
* Several frameworks provide the environment together
  - Open: no central framework with all functionality
    + Frameworks are designed to be extensible
    + … and to collaborate with other software
  - Coherent: the user sees a "final" smooth interface
    + Achieved by integrating the frameworks together
    + … but the user does not do this work him/herself!
  - Design applied at both framework and object design level
* Successfully applied in many parts of CMS software
  - Applications, persistency, sub-frameworks, visualisation, …
  - No loss of usability, functionality or performance
  - Has made it easy to integrate directly with many existing tools
* This is nothing novel: it is part of the standard risk-mitigation strategy of any modern industrial solution

Frameworks: COBRA
[Diagram: the framework overview from "Behind the Scenes: Frameworks", repeated as a section divider for COBRA]

COBRA: Main Components
* Push- and pull-mode execution, and any mixture
  - Reconstruction-on-demand is a key concept in COBRA
  - Detector-centric reconstruction: push data from event
  - Reconstruction-unit-centric reconstruction: pull/create data as needed
* Event data and related structures
  - Basic support for commonly needed objects (hits, digis, containers, …)
* Application environments
  - Basic application frameworks, various semi-specialised applications
  - Lots of error-handling and recovery code (automatic recovery after crash, …)
* Meta data: a key component
  - Data chunking, system and user collections, data streams, file management, job concepts, configuration and setup records, redirected navigation after reprocessing, …

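Editor's sketch of pull-mode, reconstruction-on-demand execution. This is a toy illustration in Python, not COBRA code (COBRA is C++); the `Event` class, the producer functions and the threshold are all hypothetical. A data product is built only when something asks for it, and asking for one product may pull in the products it depends on.

```python
class Event:
    """Raw data plus a cache of derived data products (hypothetical model)."""

    def __init__(self, raw_hits, producers):
        self.raw_hits = raw_hits
        self._producers = producers  # product name -> function(event)
        self._cache = {}             # products reconstructed so far
        self.produced = []           # order in which products were built

    def get(self, name):
        # Pull mode: reconstruct on first request, then serve from the cache.
        if name not in self._cache:
            self._cache[name] = self._producers[name](self)
            self.produced.append(name)
        return self._cache[name]


def clusters(event):
    # Toy "clustering": keep hits above an arbitrary threshold.
    return [hit for hit in event.raw_hits if hit > 10]


def tracks(event):
    # Requesting clusters here triggers their reconstruction if needed.
    return len(event.get("clusters"))


PRODUCERS = {"clusters": clusters, "tracks": tracks}

event = Event(raw_hits=[3, 12, 25, 7], producers=PRODUCERS)
n_tracks = event.get("tracks")        # builds clusters first, then tracks
n_tracks_again = event.get("tracks")  # served from the cache, nothing rebuilt
```

In this model a push-mode driver would simply request every registered product for every event; mixing the two modes amounts to choosing which products are requested eagerly.
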
COBRA: Main Strengths
* Algorithms in plug-ins
  - "Publish-yourself" plug-ins: self-describing data producers
* Strong meta-data facilities
  - Reconstruction-on-demand matches the data product concept very well
    + The Grid virtual data products concept is really just an extension
  - Convenient mapping of data products to chunks: files, containers, …
  - Scatter / gather: decompose jobs, gather data
    + One logical job can be chopped into many physical processes; we still know it is logically the same job no matter which process it is running in
* Adapts automatically to many environments without special configuration: interactive, batch, farm, stand-alone, trigger, …
  - Through appropriate use of enabling techniques (transactions, locking, refs)
  - No data post-processing required
  - Well-matched to production tools (IMPALA)

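Editor's sketch of the scatter / gather idea. It is a minimal Python model, not the COBRA or IMPALA implementation; `SubJob`, `scatter`, `run`, `gather` and the job name "digi-001" are hypothetical. Each physical piece carries the identity of the logical job, so the pieces can be recognised and merged afterwards. Here the pieces run in one process for simplicity; in production they would be separate processes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubJob:
    logical_job: str  # identity shared by all pieces of one logical job
    index: int        # which physical piece this is
    events: range     # event numbers assigned to this piece


def scatter(logical_job, n_events, n_parts):
    """Split one logical job into at most n_parts contiguous pieces."""
    step = -(-n_events // n_parts)  # ceiling division
    return [
        SubJob(logical_job, i, range(start, min(start + step, n_events)))
        for i, start in enumerate(range(0, n_events, step))
    ]


def run(sub_job):
    """Stand-in for a physical process; it just records what it processed."""
    return {"logical_job": sub_job.logical_job, "processed": list(sub_job.events)}


def gather(results):
    """Merge the outputs, checking they all belong to the same logical job."""
    jobs = {result["logical_job"] for result in results}
    if len(jobs) != 1:
        raise ValueError("results come from different logical jobs")
    merged = sorted(e for result in results for e in result["processed"])
    return jobs.pop(), merged


parts = scatter("digi-001", n_events=10, n_parts=3)
job_name, processed = gather([run(part) for part in parts])
```
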
[Diagram labelled "Objectivity", shown on four consecutive slides]
Labels on every slide: Storage Manager; Schema Manager; Transaction Manager; C++ Binding; File I/O; Lock Server; Page Server; Catalog Manager; DDL Source Processing; Meta Data; Object Access; MSS, Grid & Farm Interface
Each of the following three slides shows the same labels plus one further group:
* Refs & Navigation; Queries; Cache Management
* Object Naming; Configurations (Data Sets); Collections; Run Resume & Crash Recovery
* File Size Control; Farm Management; System Management

Frameworks: IGUANA
[Diagram: the framework overview from "Behind the Scenes: Frameworks", repeated as a section divider for IGUANA]

User Interface and Visualisation
* IGUANA: a generic toolkit for user interfaces and visualisation
  - Builds on existing high-quality libraries (Qt, OpenInventor, Anaphe, …)
  - Used to implement specific visualisation applications in other projects
* Main technical focus: provide a platform that makes it easy to integrate GUIs as a coherent whole, to provide application services and to visualise any application object
  - Many categories / layers: GUI gadgets & support, application environment, data visualisers, data representation methods, control panels, …
  - Designed to integrate with and into other applications
  - Virtually everything is in plug-ins (can still be statically linked)
[Diagram: plug-in architecture. Labels: Component Database; Plug-In Cache; Plug-In; Object Factory; Attached; Unattached]

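Editor's sketch of a lazily loading plug-in factory. The slide only names the parts (component database, plug-in cache, object factory, attached / unattached), so how they interact is an assumption here, and this Python model is not IGUANA code. `PluginFactory` and `COMPONENT_DB` are hypothetical, and two standard-library modules stand in for real plug-ins. A plug-in stays unattached until the first object is requested from it.

```python
import importlib


class PluginFactory:
    """Creates objects by name, loading the providing module on first use."""

    def __init__(self, component_db):
        self._db = component_db  # component name -> (module name, class name)
        self._attached = {}      # module name -> loaded module

    def is_attached(self, name):
        module_name, _ = self._db[name]
        return module_name in self._attached

    def create(self, name, *args, **kwargs):
        module_name, class_name = self._db[name]
        if module_name not in self._attached:
            # First request: attach (load) the plug-in.
            self._attached[module_name] = importlib.import_module(module_name)
        cls = getattr(self._attached[module_name], class_name)
        return cls(*args, **kwargs)


# Hypothetical component database; stdlib classes play the role of plug-ins.
COMPONENT_DB = {
    "fraction": ("fractions", "Fraction"),
    "counter": ("collections", "Counter"),
}

factory = PluginFactory(COMPONENT_DB)
attached_before = factory.is_attached("fraction")
half = factory.create("fraction", 1, 2)
attached_after = factory.is_attached("fraction")
```

The client code names only the component it wants; which module provides it, and when that module is loaded, stays inside the factory.
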
Illustration: 3D Visualisation
[Diagram. Labels: QMainWindow; QMDIShell (two instances); Browser Site (three instances); 3D Browser; Twig Browser]

IGUANA GUI Integration
[Figure. Labels: Integration; Action; Visualise Results, Modify Objects, Further Interaction]

Tomorrow and Beyond
* Leverage the current frameworks on the grid
  - Many native COBRA concepts match well with the grid
    + (Virtual) data products ~ reconstruction-on-demand
    + Recording and matching configuration and setup information
    + Production interfaces: catalogs, redirection, MSS hooks
    + Scatter/gather job decomposition, production environment
  - COBRA-based applications can be encapsulated for distributed analysis
  - IGUANA already separates application objects, model and viewer
    + Many possibilities for introducing distributed links
  - IGUANA+COBRA provides a platform for a coherent, well-integrated interface no matter where the code runs or where the data comes from and goes to
    + Both have loads of knobs and hooks for integration
* Aiming at adapting the existing software where possible
  - Adapt and work within CMS software (COBRA, ORCA, …) and existing analysis tools (ROOT, Lizard, …); don't replace them

Prototypes: Clarens Web Portals
[Diagram. Labels: Client; RPC; http/https; Web Server; Clarens; Service]
* Grid-enabling the working environment for physicists' data analysis
* Communication with clients via the commodity XML-RPC protocol, giving implementation independence
* Server implemented in C++: access to the CMS OO analysis toolkit
* Server provides a remote API to Grid tools
  - The Virtual Data Toolkit: object collection access
  - Data movement between tier centres using GSI-FTP
  - CMS analysis software (ORCA/COBRA)
  - Security services provided by the Grid (GSI)
  - No Globus needed on client side, only a certificate

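Editor's sketch of an XML-RPC round trip, using Python's standard `xmlrpc.client` module. It is not the Clarens API: the method name `catalog.list_collections`, the catalog contents and the `serve` function are hypothetical, and the http/https transport and GSI security are left out. It only shows why a commodity protocol gives implementation independence: client and server exchange plain XML and need not share a language.

```python
import xmlrpc.client


def list_collections(dataset):
    # Hypothetical service method backed by a toy in-memory catalog.
    catalog = {"jets": ["jets/run1", "jets/run2"]}
    return catalog.get(dataset, [])


# Hypothetical method table of the server.
METHODS = {"catalog.list_collections": list_collections}


def serve(request_xml):
    """Server side: decode the call, dispatch it, encode the response."""
    params, method = xmlrpc.client.loads(request_xml)
    result = METHODS[method](*params)
    return xmlrpc.client.dumps((result,), methodresponse=True)


# Client side: encode a call as XML, "send" it, and decode the reply.
request_xml = xmlrpc.client.dumps(("jets",), methodname="catalog.list_collections")
response_xml = serve(request_xml)
(collections,), _ = xmlrpc.client.loads(response_xml)
```

A real client would normally use `xmlrpc.client.ServerProxy` pointed at the server URL instead of calling `serve` directly.
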
Prototypes: Clarens Web Portals…
[Diagram: analysis data flow across tiers. Labels: Tier 0/1/2; Tier 1/2; Tier 3/4/5; Production system and data repositories; ORCA analysis farm(s) (or distributed "farm" using grid queues); RDBMS based data warehouse(s); PIAF/Proof/.. type analysis farm(s); TAG and AOD extraction/conversion/transport services; Data extraction Web service(s); Query Web service(s); Local analysis tool: Lizard/ROOT/…; Tool plugin module; Web browser; Local disk; User; Production data flow; TAGs/AODs data flow; Physics Query flow]

Other Prototypes
* Tag database optimisation
  - Fast sample selection is crucial
  - Various models already tried
  - Experimenting with RDBMS
* MOP: distributed job submission system
  - Allows CMS production jobs to be submitted from a central location, run at remote locations, and have their results returned
    + Job Specification: IMPALA
    + Replication: GDMP
    + Globus GRAM
    + Job Scheduling: Condor-G and local systems

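Editor's sketch of tag-based sample selection in an RDBMS. It uses SQLite from the Python standard library as a stand-in; the talk does not say which database or schema was tried, so the `tags` table, its columns and the cut values are hypothetical. Each event is summarised by a few tag quantities, and a sample is selected with an indexed query on those quantities instead of a scan over the full event data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags ("
    " event_id INTEGER PRIMARY KEY,"
    " n_jets INTEGER,"
    " missing_et REAL)"
)
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [(1, 2, 15.0), (2, 4, 62.5), (3, 3, 48.0), (4, 5, 80.1)],
)
# An index on the cut variables helps keep selection fast as the table grows.
conn.execute("CREATE INDEX idx_tags_cuts ON tags (n_jets, missing_et)")


def select_events(connection, min_jets, min_missing_et):
    """Return the ids of events passing both cuts, in ascending order."""
    rows = connection.execute(
        "SELECT event_id FROM tags"
        " WHERE n_jets >= ? AND missing_et > ?"
        " ORDER BY event_id",
        (min_jets, min_missing_et),
    )
    return [row[0] for row in rows]


selected = select_events(conn, min_jets=3, min_missing_et=50.0)
```

The selected event ids would then be used to fetch the full event objects from the event store.
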