Slide 1: CMS Data Analysis: Present Status, Future Strategies
CHEP '03, March 24, 2003
Vincenzo Innocente, CERN/EP
http://cmsdoc.cern.ch/cmsoo/cmsoo.html
Slide 2: Abstract
CMS Data Analysis: Current Status and Future Strategy
We present the current status of the CMS data analysis architecture and describe work on future Grid-based distributed analysis prototypes. CMS has two main software frameworks related to data analysis: COBRA, the main framework, and IGUANA, the interactive visualisation framework. Software using these frameworks is used today in the world-wide production and analysis of CMS data. We describe their overall design and present examples of their current use, with emphasis on interactive analysis. CMS is currently developing remote analysis prototypes, including one based on Clarens, a Grid-enabled client-server tool. Use of the prototypes by CMS physicists will guide us in forming a Grid-enriched analysis strategy. The status of this work is presented, as is an outline of how we plan to leverage the power of our existing frameworks in the migration of CMS software to the Grid.
Slide 3: Analysis
- Analysis is not just using a tool to plot a histogram: it is the full chain, from accessing event data to producing the final plot for publication.
- Analysis is an iterative process:
  - Reduce data samples to more interesting subsets (selection)
  - Compute higher-level information
  - Calculate statistical entities
- Several steps:
  - Run an analysis job on the full dataset (a few times)
  - Use an interactive analysis tool to run many times on the reduced dataset and make plots
- We are still in the early stage of defining an Analysis Model. Today we work with raw data:
  - Reconstruction and analysis are mixed up (analysis and debugging are mixed up!)
  - Software development, production and analysis proceed in parallel
  - There is no clear concept of high-level persistent objects (DST)
  - Each physics group has its own analysis package and "standard ntuple"
- CMS is a laboratory for experimenting with analysis solutions.
(A toy sketch of the select/compute/plot chain follows below.)
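To make the iterative chain concrete, here is a minimal, self-contained sketch of the three steps named above: select a subset, compute a higher-level quantity, accumulate a statistical entity. All names, the event layout and the stand-in dataset are hypothetical illustrations, not CMS or COBRA APIs.

```python
# Toy sketch of the analysis chain: selection -> higher-level quantity -> histogram.
# Everything here (Histogram, the event dicts, the cuts) is illustrative only.

class Histogram:
    """Minimal 1-D histogram: fixed-width bins over [lo, hi)."""
    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi
        self.bins = [0] * nbins
    def fill(self, x):
        if self.lo <= x < self.hi:
            self.bins[int((x - self.lo) / (self.hi - self.lo) * self.nbins)] += 1

def select(event):
    # Step 1: reduce the sample to a more interesting subset (placeholder cuts).
    return event["n_jets"] >= 2 and event["met"] > 30.0

def dijet_quantity(event):
    # Step 2: compute a higher-level quantity (placeholder formula).
    return event["jet_pt"][0] + event["jet_pt"][1]

# Stand-in for a real event collection.
events = [{"n_jets": 2, "met": 42.0, "jet_pt": [120.0, 80.0]},
          {"n_jets": 1, "met": 10.0, "jet_pt": [35.0]}]

h = Histogram(50, 0.0, 500.0)
for ev in events:
    if select(ev):
        h.fill(dijet_quantity(ev))  # Step 3: accumulate the statistical entity
print(h.bins)
```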
Slide 4: Getting ready for April '07
CMS is engaged in an aggressive program of "data challenges" of increasing complexity. Each focuses on a given aspect; all encompass the whole data analysis process:
- Simulation, reconstruction, statistical analysis
- Organized production, end-user batch jobs, interactive work
The challenges:
- Past: Data Challenge '02, focused on High Level Trigger studies
- Present: Data Challenge '04, focused on "real-time" mission-critical tasks
- Future: Data Challenge '06, focused on distributed physics analysis
Slide 5: HLT Production 2002
- Focused on High Level Trigger studies:
  - 6 M events, 150 physics channels
  - 19 000 files, 500 event collections, 20 TB
    (no pile-up: 2.5 M; 2×10^33 pile-up: 4.4 M; 10^34 pile-up: 3.8 M; filtered: 2.9 M)
  - 100 000 jobs, 45 years of CPU (wall-clock)
  - 11 Regional Centers: more than 20 sites in the USA, Europe and Russia, with ~1000 CPUs
  - More than 10 TB traveled over the WAN
  - More than 100 physicists involved in the final analysis
- Tools: GEANT3, Objectivity, PAW, ROOT
- CMS Object Reconstruction & Analysis Framework (COBRA) and its applications (ORCA)
Successful validation of the CMS High Level Trigger algorithms: rejection factors, computing performance, reconstruction framework.
Slide 6: Data Challenge 2004
- DC04, to be completed in April 2004:
  - Reconstruct 50 million events
  - Cope with 25 Hz at 2×10^33 cm^-2 s^-1 for 1 month
  - These are meant to be events at the Tier-0 center, i.e. events passing the HLT
    - From the computing point of view the test is the same if these events are simple minimum bias
    - This is a great opportunity to reconstruct events that can be used for full analysis (Physics TDR)
  - Define and validate datasets for analysis:
    - Identify the reconstruction and analysis objects each group would like to have for the full analysis
    - Develop the selection algorithms necessary to obtain the required samples
  - Prepare for the "mission-critical" analysis; test the event model
    - Look at calibration and alignment
  - Physics and computing validation of the Geant4 detector simulation
Slide 7: How data analysis begins
The result of the reconstruction will be saved, along with the raw data, in an object database.
[Diagram: the online system (HLT farm of Filter Units, Server Units and Processing Units) feeds the offline system; express lines serve monitoring and calibration; reconstruction, reprocessing and analysis follow with latencies of minutes to hours.]
Slide 8: Data Challenge 2004 workflow
[Diagram: the DC04 processing chain. Event generation (PYTHIA) produces MC ntuples (MC info, tracks, etc.); detector simulation (OSCAR) produces detector hits, including minimum-bias samples; digitization (ORCA) produces digis (raw data) per bunch crossing; reconstruction with L1 and HLT (ORCA) produces DSTs, using calibration data; DST stripping (ORCA) feeds the analysis groups (b/τ, e/γ, JetMet); analysis is done with Iguana, ROOT or PAW.]
Slide 9: High-granularity reconstruction "DAG"
[Diagram: on the calorimeter side, CaloDataFrames feed CaloRecHits (using calibration "Calib-A"), which feed CaloClusters, which a JetReconstructor (with a cut r < r_cut) turns into Jets; on the tracker side, TkDigis feed TkRecHits (using alignment "Align-C"), which feed TkTracks. The inputs (CaloHits, TkHits, random numbers) come from the DAQ or from simulation.]
Calibrations and detailed detector and physics studies require access to only a few objects per event. These studies also need access to the "conditions" data associated with those objects. The access pattern to the very same object may differ greatly between use cases, so a flexible definition of "datasets" (associated with use cases) is required. (A toy sketch of on-demand reconstruction over such a DAG follows below.)
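The pattern behind the slide is reconstruction on demand over a dependency DAG: asking for a high-level product triggers only the chain of lower-level products it needs, once per event. Below is a toy Python illustration of that idea, in the spirit of the slide but not the actual COBRA C++ mechanism; the product names mirror the diagram, the numeric "calibration" and "clustering" steps are invented stand-ins.

```python
# Toy reconstruction-on-demand over a DAG: requesting "Jets" transparently
# builds CaloDataFrames -> CaloRecHits -> CaloClusters -> Jets, caching each
# product so it is computed at most once per event.

class EventSetup:
    def __init__(self):
        self.producers = {}  # product name -> (dependencies, builder function)
        self.cache = {}      # per-event cache of already-built products
    def register(self, name, deps, func):
        self.producers[name] = (deps, func)
    def get(self, name):
        if name not in self.cache:  # build only once, pulling in dependencies
            deps, func = self.producers[name]
            self.cache[name] = func(*[self.get(d) for d in deps])
        return self.cache[name]

es = EventSetup()
es.register("CaloDataFrames", [], lambda: [1.0, 2.0, 3.0])   # from DAQ/simulation
es.register("CaloRecHits", ["CaloDataFrames"],
            lambda frames: [f * 0.98 for f in frames])       # apply "Calib-A"
es.register("CaloClusters", ["CaloRecHits"],
            lambda hits: [h for h in hits if h > 1.0])       # toy clustering cut
es.register("Jets", ["CaloClusters"],
            lambda clusters: sorted(clusters, reverse=True))

print(es.get("Jets"))  # triggers the whole calorimeter chain on demand
```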
Slide 10: Analysis Environments
- Real-time event filtering and monitoring:
  - Data-driven pipeline
  - High reliability
- Pre-emptive simulation, reconstruction and event classification:
  - Massively parallel batch-sequential processing
  - Excellent error recovery and rollback mechanisms
  - Excellent scheduling and bookkeeping systems
- Interactive statistical analysis:
  - Rapid Application Development environment
  - Excellent visualization and browsing tools
  - Human-"readable" navigation
Slide 11: Three Computing Environments: Different Challenges
- Centralized quasi-online processing:
  - Keep up with the rate
  - Validate and distribute data efficiently
- Distributed organized processing:
  - Automation
- Interactive chaotic analysis:
  - Efficient access to data and "metadata"
  - Management of "private" data
  - Rapid Application Development
Slide 12: The Ultimate Challenge: A Coherent Analysis Environment
- Beyond the interactive analysis tool (the user's point of view):
  - Data analysis & presentation: n-tuples, histograms, fitting, plotting, ...
- A great range of other activities with fuzzy boundaries (the developer's point of view):
  - Batch processing
  - Interactive work, from "pointy-clicky" to Emacs-like power tools to scripting
  - Setting up configuration management tools, application frameworks and reconstruction packages
  - Data store operations: replicating entire data stores; copying runs, events and event parts between stores; and not just copying but also doing something more complicated: filtering, reconstruction, analysis, ... (a toy sketch follows below)
  - Browsing data stores down to the object detail level
  - 2D and 3D visualisation
  - Moving code across final analysis, reconstruction and triggers
Today this involves (too) many tools.
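The "not just copying" data store operation can be pictured as streaming events from one store to another while applying a filter (and, in principle, a reconstruction or analysis step) on the way. The sketch below uses invented store and event representations purely for illustration; it is not a COBRA or Objectivity API.

```python
# Toy "copy with processing" between two event stores: events are filtered
# by a predicate and optionally transformed before landing in the target.

def copy_events(src, dst, predicate, transform=lambda ev: ev):
    """Copy events satisfying `predicate` from src to dst, transformed."""
    copied = 0
    for ev in src:
        if predicate(ev):
            dst.append(transform(ev))
            copied += 1
    return copied

# Stand-in stores: plain lists of event dicts.
source_store = [{"run": 1, "et": 55.0}, {"run": 1, "et": 12.0}]
target_store = []
n = copy_events(source_store, target_store, lambda ev: ev["et"] > 30.0)
print(n, target_store)  # only the high-ET event is copied
```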
Slide 13: Architecture Overview
[Diagram: a layered architecture. At the top, a consistent user interface: federation wizards, detector/event display, data browser, analysis job wizards, generic analysis tools. In the middle, a coherent set of basic tools and mechanisms: ORCA, FAMOS, OSCAR, COBRA, CMS tools, LCG tools, plus software development and installation. At the bottom, the distributed data store and computing infrastructure (Grid).]
Slide 14: Simulation, Reconstruction & Analysis Software System
[Diagram: the software stack. Specific frameworks for reconstruction algorithms, data monitoring, event filtering and physics analysis sit on top of event, calibration and configuration objects, managed by a generic application framework. Physics-module adapters and extensions connect to basic services: object persistency, Geant3/4, CLHEP, analysis tools, the C++ standard library and an extension toolkit. Grid-aware data products and a Grid-enabled application framework (uploadable on the Grid, based on LCG) extend the stack to the Grid.]
Slide 15: Interactive analysis session
[Screenshot: a Qt plotter; a histogram extended with pointers to CMS events; Emacs used to edit a CMS C++ plugin that creates and fills histograms; an OpenInventor-based display of the selected event; a Python shell with external & CMS modules.]
Slide 16: Varied components and data flows, one portal
[Diagram: the production system and data repositories (Tier 0/1/2) feed, via the production data flow, ORCA analysis farms (or a distributed "farm" using grid queues), RDBMS-based data warehouses and PIAF/Proof-type analysis farms (Tier 1/2), served by TAG and AOD extraction/conversion/transport services. Data extraction and query web services connect these to the user side (Tier 3/4/5): a local analysis tool (Iguana/ROOT/...) with a tool plugin module, a web browser and local disk. User TAG/AOD data and physics queries flow through the portal.]
Slide 17: CLARENS: a Portal to the Grid
Grid-enabling the working environment for physicists' data analysis.
- Clarens consists of a server communicating with various clients via the commodity XML-RPC protocol (over http/https). This ensures implementation independence.
- The server will provide a remote API to Grid tools:
  - The Virtual Data Toolkit: object collection access
  - Data movement between Tier centres using GSI-FTP
  - CMS analysis software (ORCA/COBRA)
  - Security services provided by the Grid (GSI)
- No Globus is needed on the client side, only a certificate.
- The current prototype is running on the Caltech proto-Tier2.
Slide 18: Clarens Architecture
- A common protocol is spoken by all types of clients to all types of services:
  - Implement each service once, for all clients
  - Implement client access to services once per client type, using a common protocol already implemented for "all" languages (C++, Java, Fortran, etc. :-)
- The common protocol is XML-RPC, with SOAP close to working; CORBA is doable but would require a different server above Clarens (it uses IIOP, not HTTP).
- Clarens handles authentication using Grid certificates, connection management, data serialization and, optionally, encryption.
- The implementation uses a stable, well-known server infrastructure (Apache) that has been debugged and audited over a long period by many people.
- The Clarens layer itself is implemented in Python, but can be reimplemented in C++ should performance be inadequate.
- More information at http://clarens.sourceforge.net, along with a web-based demo.
(A minimal client-side sketch follows below.)
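Because the protocol is plain XML-RPC over http/https, a client in any language with an XML-RPC library can talk to the server. Here is a minimal sketch using Python's standard library ("xmlrpclib" in the Python of the time, "xmlrpc.client" today); the service URL and the method name are hypothetical placeholders, not the actual Clarens API.

```python
# Minimal XML-RPC client in the Clarens style. The URL and the remote method
# "catalog.list_datasets" are invented for illustration; real Clarens services
# expose methods for object-collection access, file movement, etc.
import xmlrpc.client  # 'xmlrpclib' in Python 2

server = xmlrpc.client.ServerProxy("https://clarens.example.org:8443/clarens")
try:
    datasets = server.catalog.list_datasets("jpsi")  # hypothetical remote call
    print(datasets)
except (xmlrpc.client.Fault, OSError) as err:
    # Fault: the server replied with an error; OSError: no server reachable.
    print("call failed:", err)
```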
Slide 19: Example of analysis on the grid
[Diagram: a local analysis environment (data cache, browser, presenter) talks to a remote web service (a Clarens web server) that acts as a gateway between users and a remote facility; behind it, a remote batch service handles resource allocation, control and monitoring, possibly via a resource broker.]
Slide 20: Summary
The success of analysis software will be measured by its ability to provide a simple, coherent and stable view to the physicists while retaining the flexibility required to achieve maximal computing efficiency.
CMS is responding to this challenge by developing an analysis software architecture based on a layered structure:
- A consistent interface to the physicist:
  - Customizable
  - Implemented in many flavours (Qt, Python, ROOT, web browser)
- A flexible application framework:
  - Mainly responsible for managing event data with high granularity
- A set of back-end services:
  - Specialized for different use cases and computing environments
Slide 21: Summary
- The "Spring 2002" production completed successfully:
  - Distributed organized production
  - Distributed "traditional" analysis
  - Validation of the High Level Trigger strategy
- Next target (DC04):
  - One month of mission-critical analysis
  - Test of the analysis and computing model