Vincenzo Innocente, CHEP Beijing 9/01, FrameAtWork. Software Frameworks for HEP Data Analysis. Vincenzo Innocente, CERN/EP
Data Analysis Micro-Process. Physics analysis is to a large degree an iterative process of reducing data samples to more interesting subsets and distilling the sample into information at a higher abstraction level, by summarising lower-level information and by calculating statistical entities from the samples. A large part of the work can be done on very high-level entities in an interactive analysis and presentation tool. Hence the focus on tools that work on simple summary information (DSTs, N-tuples, tag databases, ...), with additional tools for detector and event visualisation. Diagram: Experiment, Reduce, Distil, Interpret, Physics Paper.
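The reduce and distil steps named on this slide can be sketched in a few lines of Python. This is only a toy illustration: the event records, the field names (`n_electrons`, `mass`) and the selection are assumptions made up for the example, not part of any experiment's data model.

```python
from statistics import mean

# Hypothetical toy events; the field names are illustrative assumptions.
events = [
    {"id": 1, "n_electrons": 2, "mass": 91.0},
    {"id": 2, "n_electrons": 0, "mass": 12.5},
    {"id": 3, "n_electrons": 2, "mass": 90.4},
    {"id": 4, "n_electrons": 1, "mass": 45.1},
    {"id": 5, "n_electrons": 2, "mass": 92.2},
]


def reduce_sample(sample, predicate):
    """Reduce: keep only the events that satisfy a selection."""
    return [ev for ev in sample if predicate(ev)]


def distil(sample):
    """Distil: summarise the surviving events as a few statistical quantities."""
    masses = [ev["mass"] for ev in sample]
    return {"n_events": len(masses), "mean_mass": mean(masses)}


# One iteration of the cycle: select di-electron events, then summarise them.
selected = reduce_sample(events, lambda ev: ev["n_electrons"] == 2)
summary = distil(selected)
```

In an interactive session the selection would typically be refined and the two steps repeated until the summary is ready to be interpreted.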
HEP Experiment-Data Analysis (diagram). Labels: Detector Control; Online Monitoring; Environmental data store; Simulation store; Data Quality; Calibrations; Group Analysis; User Analysis on demand; Quasi-online Reconstruction; Event Filter; Object Formatter; Persistent Object Store Manager; Database Management System; Physics Paper. Flow labels: Request part of event; Store rec-Obj; Store rec-Obj and calibrations.
Mission. Get data from a HEP detector. Publish the result (mass and width of a particle decaying into e+e- pairs) before those living on the other side of the Ring/Continent/Ocean. The mission is still the same; new challenges require innovative software solutions.
Offline Architecture: New Requirements. Bigger experiments, higher rates, more data. A larger and dispersed user community performing non-trivial queries against a large event store. Make best use of new IT technologies. Increased demand for both flexibility and coherence: the ability to plug in new algorithms; the ability to run the same algorithms in multiple environments; guarantees of quality and reproducibility; high performance; user-friendliness.
Analysis Environments. Real-time event filtering and monitoring: data-driven pipeline; high reliability. Pre-emptive simulation, reconstruction and event classification: massively parallel batch-sequential process; excellent error recovery and rollback mechanisms; excellent scheduling and bookkeeping systems. Interactive statistical analysis: rapid application development environment; excellent visualization and browsing tools; human-“readable” navigation.
Migration. Today's Nobel prize becomes the trigger for tomorrow (and the background the day after). Boundaries between running environments are fuzzy: “physics analysis” algorithms should migrate up to the online system to make the trigger more selective; robust batch systems should be made available for physics analysis of large data samples; the results of offline calibrations should be fed back to the online system to make the trigger more efficient.
Coherent Analysis Environment (diagram). Labels: File; Distributed Data Store; Data Browser; Analysis job wizards; Simulation; Reconstruction; PersistencyServices; NetworkServices; BatchServices; Visualization; VisualizationTools; AnalysisTools; Software Development.
The Challenge. Beyond the interactive analysis tool (user point of view): data analysis and presentation with N-tuples, histograms, fitting, plotting, … A great range of other activities with fuzzy boundaries (developer point of view): batch; interactive, from “pointy-clicky” to Emacs-like power tool to scripting; setting up configuration management tools, application frameworks and reconstruction packages; data store operations (replicating entire data stores; copying runs, events and event parts between stores; not just copying but also doing something more complicated: filtering, reconstruction, analysis, …); browsing data stores down to object detail level; 2D and 3D visualisation; moving code across final analysis, reconstruction and triggers. Today this involves (too) many tools.
Collaborating Frameworks: The Enabling Technology
What a Framework is. A framework is a reusable “semi-complete” application that can be specialized to produce custom applications (R.Johnson in JOOP 1988). Frameworks provide a default behavior, which can be customized and extended by means of OO techniques such as inheritance or object composition. Frameworks are specific to a particular area: they may provide system-level support services, encapsulate expertise at some application level, or encapsulate expertise for a given problem domain.
Framework Dynamics. Diagram labels: Customized Extension (client plug-in); Client API; Framework API; Flow of control; Call backs. The framework: controls the flow of execution; defines object interaction (implementing design patterns); calls client (plug-in) functions; may offer a traditional “client API” for integration in more specialized frameworks. Clients specialize framework behavior by: inheriting from framework classes; overriding their methods; instantiating other framework classes; interacting directly with other, more general, frameworks.
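The dynamics described on this slide (the framework owns the flow of control and calls back into client plug-ins that inherit from its classes) can be sketched as follows. The names `Module`, `EventLoop` and `ElectronCounter` are hypothetical and chosen only for this illustration; they are not taken from any experiment framework.

```python
class Module:
    """Framework base class: clients specialise it by overriding the hooks."""

    def begin_job(self):
        pass  # default behavior: nothing to set up

    def process(self, event):
        raise NotImplementedError("plug-ins must override process()")

    def end_job(self):
        pass  # default behavior: nothing to finalise


class EventLoop:
    """The framework controls the flow of execution and calls the plug-ins."""

    def __init__(self):
        self._modules = []

    def register(self, module):
        self._modules.append(module)

    def run(self, events):
        for module in self._modules:
            module.begin_job()
        for event in events:
            for module in self._modules:
                module.process(event)  # call-back into client code
        for module in self._modules:
            module.end_job()


class ElectronCounter(Module):
    """Client plug-in: only the physics logic lives here."""

    def __init__(self):
        self.total = 0
        self.finished = False

    def process(self, event):
        self.total += event["n_electrons"]

    def end_job(self):
        self.finished = True


loop = EventLoop()
counter = ElectronCounter()
loop.register(counter)
loop.run([{"n_electrons": 2}, {"n_electrons": 1}])
```

The client never writes the event loop itself; it only registers an object and is called at the points the framework decides.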
What Frameworks are not. Toolkit libraries (C++ std, Posix, Nag-lib, CERNlib): a toolkit is passive, and control stays in the user code. Programs (PowerPoint, PAW): they have a well-defined behavior and are customized by “input parameters”. Design patterns: they carry abstract design and architecture knowledge but do not directly yield reusable code. Languages (XML, Java, Python): new languages come with such a large set of support and application libraries that they can be considered frameworks for rapid application development, integration and/or communication.
Framework-based Software (diagram). Labels: Posix, C++ std, OpenGL; System Libraries; Support Frameworks; Application Frameworks; Problem Domain Framework; Sub-Domain Framework; Thread, ODBMS, GUI, Network, XML.
Framework Architecture. Reuse of application frameworks: common look and feel; uniform data access. Common problem-domain framework: consistent behavior. Reuse of well-established mechanisms: reduced maintenance, faster deployment, easier migration.
Analysis & Reconstruction Framework (diagram). Labels: ODBMS; Geant3/4; CLHEP; Paw Replacement; C++ standard library; Extension toolkit; Reconstruction Algorithms; Data Monitoring; Event Filter; Physics Analysis; Calibration Objects; Event Objects; Configuration Objects; Generic Application Framework; Physics modules; Utility Toolkit; Specific Framework; adapters and extensions.
Why Frameworks. Physicists concentrate on the development of reconstruction and analysis algorithms as plug-in modules. The framework orchestrates instances of these modules, hides system-related complexities, and allows sharing of code for common or related tasks. Changes in the physics reconstruction and analysis logic affect only the plug-ins. Changes in system services, or migration to new IT technologies, affect only the framework.
Questions. What is the role of an experiment-specific framework? How does it integrate with more generic frameworks? How can the user have a coherent and consistent view of the analysis process? How can new tools (new frameworks) be integrated without disrupting the existing architecture?
Difficult Balance. “The most profoundly elegant framework will never be reused unless the cost of understanding it and then reusing its abstractions is lower than the programmer’s perceived cost of writing them from scratch” (G.Booch, 1994). Flexibility (many abstractions): wide range of applications; great potential for extension and migration; difficult to understand and to use. Rigidity (few abstractions, many concrete classes): easy to use; limited range of applications; difficult to migrate and extend.
Coherent, Monolithic Solution. The framework kernel is expanded to cover the whole problem domain. Users see “The Framework”. New tools have to be incorporated into the framework. Imported classes have to be modified to derive from framework base classes to keep coherency. Persistency is implemented by the framework. Example: MS
Incoherent Solution. The experiment kernel deals with just one problem: event processing. External tools are kept as they are, with communication through I/O converters. Persistency is just one (or more) of the external tools. Users see a different environment for each part of the problem domain.
Coherent, Non-invasive Solution. Users see a standard environment that also acts as integration glue. The experiment kernel is composed of a hierarchy of application frameworks reusable in various parts of the problem domain. External frameworks are integrated directly if they conform to the standard environment, or through wrappers if not. Persistency is encapsulated by one of the kernel application frameworks.
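Integration "through wrappers" can be illustrated with a small adapter. In this sketch both the external tool (`ExternalHistogram`, with its `accumulate` method) and the standard interface (`fill(value)`) are assumptions invented for the example; the point is only that the external class is left unmodified.

```python
class ExternalHistogram:
    """Stand-in for an external tool with its own, non-conforming interface."""

    def __init__(self):
        self.bins = {}

    def accumulate(self, bin_index):
        self.bins[bin_index] = self.bins.get(bin_index, 0) + 1


class HistogramWrapper:
    """Adapter exposing the (hypothetical) standard interface: fill(value)."""

    def __init__(self, external, bin_width):
        self._external = external  # the external object is used as-is
        self._bin_width = bin_width

    def fill(self, value):
        # Translate the standard call into the external tool's own call.
        self._external.accumulate(int(value // self._bin_width))


external = ExternalHistogram()
wrapped = HistogramWrapper(external, bin_width=10.0)
for value in (91.0, 90.4, 45.1):
    wrapped.fill(value)
```

Because the wrapper holds the external object by composition, the external class does not have to derive from any framework base class, in contrast with the monolithic approach.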
Python. Python is an interpreted, object-oriented language introduced at the beginning of the '90s. It spread quickly, particularly among scientific communities in search of a rapid application development tool able to integrate efficiently already existing, highly optimized scientific software. Python provides: scripting functionality comparable to Perl or Tcl; runtime dynamic loading; a standard OO library for system-level support; simple mechanisms for interfacing to C++ objects; a large body of open-source modules covering a wide spectrum of application domains, scientific ones in particular.
Python as a glue. Integration in Python is non-intrusive: only the class interface is exported to Python, so encapsulation is preserved; the original (C++) representation is respected, with no translation and no conversion; additional Python-specific extensions do not impact the original design and functionality. Binding with Python is at runtime: batch applications need not be Python-aware; interactive applications can be extended (actually constructed) and modified at runtime.
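Runtime binding can be illustrated with the standard-library `importlib` module, which resolves a module by name while the program is running. In this sketch a standard-library module stands in for an experiment plug-in, and the helper `load_plugin` and the numeric value are assumptions made for the example.

```python
import importlib


def load_plugin(module_name, attribute):
    """Resolve a callable by name at runtime, as a plug-in loader might."""
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# The math module stands in for a compiled extension exported to Python.
sqrt = load_plugin("math", "sqrt")

# Hypothetical squared invariant mass, used only to exercise the loaded function.
mass_squared = 8317.44
mass = sqrt(mass_squared)
```

The calling code names the plug-in only as a string, so which module is loaded can be decided by configuration or by the user in an interactive session.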
Examples (personal experience). Exporting the interface of an application framework such as Objectivity/DB took a few hours. The CERN/IT physics analysis environment (ANAPHE) provides a complete Python binding (Lizard) which does not affect the core C++ library. Seamless integration of the CMS framework kernel (COBRA) and the CERN/IT ANAPHE library through their (independent) Python interfaces. Direct application of other Python modules (regular expressions, string/list manipulation, numerics, etc.) on ANAPHE or COBRA objects.
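The last point, applying ordinary Python modules to objects exported from C++ libraries, might look like the following sketch. The `Histogram` class here is a pure-Python stand-in invented for the example; it does not reproduce the ANAPHE or COBRA interfaces.

```python
import re


class Histogram:
    """Stand-in for an object exported from a C++ library (hypothetical)."""

    def __init__(self, title):
        self.title = title


histograms = [
    Histogram("ecal_energy"),
    Histogram("tracker_pt"),
    Histogram("ecal_timing"),
]

# A standard regular expression selects objects by an exported attribute.
ecal_titles = [h.title for h in histograms if re.match(r"ecal_", h.title)]
```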
Lizard Qt plotter. ANAPHE histogram extended with pointers to CMS events. Emacs used to edit a CMS C++ plugin to create and fill histograms. OpenInventor-based display of the selected event. Python shell with Lizard & CMS modules.
Coherent Analysis Environment (diagram). Labels: File; Distributed Data Store; PersistencyServices; NetworkServices; BatchServices; Visualization; Simulation; Reconstruction; VisualizationTools; Data Browser; Analysis job wizards; AnalysisTools; Software Development.
HEP Data. Diagram labels: Event Collection; Collection Meta-Data; Event; Electrons; Tracker Alignment; Tracks; Ecal calibration; User Tag (N-tuple). Event-collection meta-data (luminosity, selection criteria, …). Environmental data: detector and accelerator status; calibrations, alignments; … Event data, user data. Navigation is essential for an effective physics analysis. Complexity requires coherent access mechanisms.
Framework for Persistency (DataBase). Persistency breaks encapsulation: to store and retrieve an object it is required to know its concrete type and its complete state. With end-user developed converters (streamer operators), reuse of classes that do not give clients access to their full state is impossible. The alternative is a stored schema obtained by source parsing or user description. Ideally persistency is just an extended virtual memory (in time and space). In reality there is much more to manage: concurrent access; tertiary storage; replication.
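The encapsulation problem can be made concrete with a user-written converter. In this sketch the `Track` class, its fields and the JSON record layout are all assumptions made for the example; note that the converter has to reach into the object's private state and has to know its concrete type.

```python
import json


class Track:
    """A class whose state is meant to be private to its clients."""

    def __init__(self, pt, charge):
        self._pt = pt
        self._charge = charge

    def momentum(self):
        return self._pt


class TrackConverter:
    """User-written converter: it must know the concrete type and full state."""

    @staticmethod
    def write(track):
        # Reaching into _pt and _charge is exactly the break of encapsulation.
        record = {"type": "Track", "pt": track._pt, "charge": track._charge}
        return json.dumps(record)

    @staticmethod
    def read(record):
        data = json.loads(record)
        if data["type"] != "Track":
            raise TypeError("record does not describe a Track")
        return Track(data["pt"], data["charge"])


stored = TrackConverter.write(Track(pt=35.5, charge=-1))
restored = TrackConverter.read(stored)
```

If `Track` offered no access to `_charge` at all, no such converter could be written by a client, which is the reuse limitation noted on the slide.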
DataBase Management System (diagram). Labels: DBMS Server; Distributed, Hierarchical, File Storage System; Application; (Distributed) DBMS Client; Application Representation; Persistent Data Representation; Database internal Representation; Database Storage (Server+Files); Tertiary Storage (Tapes); NETWORK.
Successful DBMS. Coherent data view at problem-domain level. Efficient data caching mechanism. Variety of data and process distribution models. Transparent and flexible interface to storage (disks and tapes). This cannot be achieved with a single product; it requires a set of flexible, collaborating frameworks.
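A data caching mechanism for "request part of event" style access can be sketched with the standard-library `functools.lru_cache`. The function `load_event_part`, its arguments and the returned string are hypothetical placeholders for a real fetch from disk or tape; this is an illustration of the idea, not a description of any DBMS product.

```python
from functools import lru_cache

# Records every real load, so the effect of the cache is observable.
loads = []


@lru_cache(maxsize=128)
def load_event_part(event_id, part):
    """Pretend to fetch one part of one event from slow storage."""
    loads.append((event_id, part))
    return f"{part} of event {event_id}"


first = load_event_part(42, "Tracks")
second = load_event_part(42, "Tracks")   # served from the cache
other = load_event_part(42, "Electrons")  # different part: a new load
```

The client code asks for the same data twice but the slow storage is touched only once per distinct request.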
Conclusions (Challenges). Today's HEP experiments: bigger, higher rate, more data, longer lasting; a larger and dispersed user community. IT: ubiquitous; develops fast; becomes obsolete even faster. Traditional HEP analysis software architectures: monolithic; incoherent.
Conclusions (Solutions). A hierarchy of non-intrusive, loosely connected frameworks: easier maintenance, evolution and migration. A standard framework acting as “glue”: easier integration; coherent user view. A powerful, flexible persistency mechanism: uniform, transparent data access.