Perspective on Future Data Analysis in HENP
Computing in High Energy Physics 2003 (CHEP03), La Jolla, 24 March
René Brun, CERN

2 Data Analysis ??
Data analysis has traditionally been associated with the final stages of data processing, i.e. physics analysis. In this talk, I will cover a more general aspect of data analysis (in the true sense). How can we interact with data at all stages of data processing (batch or interactive modes)? Can we imagine an experiment-independent way to achieve this?

3 Evolution
To understand the possible directions, we must understand some messages from the past, the solid recipes! One important message is “Make it simple”. Heavy experiment frameworks are often perceived as a serious obstacle and push users toward more basic but universal frameworks.

4 Once upon a time (seventies)
With the first electronic (as opposed to bubble chamber) experiments, data analysis was experiment-specific, an activity that came after the data taking. The only common software was the histogramming package (e.g. Hbook), the fitting package (e.g. Minuit), some plotting packages and independent routines in cernlib (linear algebra and small utilities). Data structures = Fortran common blocks.

5 Early Eighties
With the growing complexity of the experiments and of the corresponding software, we see the development of data structure management systems (hydra, zbook-->zebra, bos). These systems are able to write and read complex bank collections. Zebra had a self-describing bank format with built-in support for bank evolution. Most data were processed in batch, but many prototypes of interactive systems started to appear (htv, gep, then paw...).

6 PAW
Designed in 1985. Stable since 1993.
Row-Wise-Ntuples: OK for small data sets and interactive histogramming with cuts.
Column-Wise-Ntuples: a major step, illustrating the advantage of structured data sets.
PAW was a success not so much because of its technical merits, but because it was perceived as a widely available tool. Its stability over many years is an important element.

7 1993-->2000 (1)
Move from Fortran to OO. It took far more time than expected: new language(s); new programming techniques; basic infrastructure not available to compete with existing libraries and tools; conflicts between projects; ad-hoc software in experiments.

8 1993-->2000 (2)
False hopes with OODBMS (or too early?). OODBMS --> Objectivity. OO models designed for Objy; batch oriented; interactive use via conversion to PAW ntuples; a central database does not fit well with GRID concepts; licensing problems, and more.

9 Data Analysis Models

10 From the desktop to the GRID
[Diagram labels: desktop; local/remote storage; online/offline farms; GRID.]
New data analysis tools must be able to use remote CPUs, storage elements and networks in parallel, in a way that is transparent to a user at a desktop.

11 My laptop in 200X
Using a naïve extrapolation of Moore’s law for a state-of-the-art laptop:

Year   CPU/GHz   RAM/GB   disk/GB
2003   2.4       0.5      60
2005   5         1        150
2007   10        2        300
2009   20        4        600
2011   40        8        1000

Nice! But less than 1/1000 of what I need.
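
The extrapolation in this table can be approximated with a few lines of Python. This is only a sketch: it assumes a strict doubling every two years starting from the 2003 row, which is read off the table rather than stated on the slide, and the function name `extrapolate` is hypothetical. Strict doubling gives 4.8 GHz and 120 GB for 2005 where the slide shows the rounder 5 and 150, so the slide's numbers are not reproduced exactly.

```python
def extrapolate(base_year, base_values, years, doubling_period=2.0):
    """Scale each base value by 2 ** (elapsed years / doubling period)."""
    table = {}
    for year in years:
        factor = 2.0 ** ((year - base_year) / doubling_period)
        table[year] = {name: value * factor for name, value in base_values.items()}
    return table

# The 2003 row of the slide's table.
laptop_2003 = {"cpu_ghz": 2.4, "ram_gb": 0.5, "disk_gb": 60.0}

# Assumption: one doubling every two years (the slide only says "naive").
projection = extrapolate(2003, laptop_2003, [2003, 2005, 2007, 2009, 2011])

for year, row in projection.items():
    print(year, {name: round(value, 1) for name, value in row.items()})
```

Changing `doubling_period` shows how sensitive such a projection is to the assumed pace of hardware progress.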

12 Batch-mode Local analysis
Conventional model: the user has full control of the event loop. The program produces histograms, ntuples or trees. The selection is done in the user's private code. Histograms are then added (with a tool or in the interactive session); ntuples/trees are combined into a chain and analyzed interactively.
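
The conventional model can be illustrated with a toy example. This is a sketch in plain Python, not ROOT or PAW code: the event content, the cuts in `select`, and the names `make_events` and `run_job` are all hypothetical. Each job owns its event loop and writes a histogram and an ntuple-like list; afterwards the histograms are added and the lists are chained.

```python
import random

def make_events(n, seed):
    """Generate toy events; a stand-in for reading one experiment file."""
    rng = random.Random(seed)
    return [{"pt": rng.expovariate(0.2), "charge": rng.choice((-1, 1))}
            for _ in range(n)]

def select(event):
    """The user's private selection code (hypothetical cuts)."""
    return event["charge"] > 0 and event["pt"] > 2.0

def run_job(events, nbins=10, lo=0.0, hi=20.0):
    """One batch job: the user controls the event loop and fills the outputs."""
    hist = [0] * nbins      # fixed-binning 1-D histogram of pt
    ntuple = []             # selected values kept for later interactive work
    for event in events:
        if not select(event):
            continue
        ntuple.append(event["pt"])
        if lo <= event["pt"] < hi:
            index = int((event["pt"] - lo) / (hi - lo) * nbins)
            hist[min(index, nbins - 1)] += 1
    return hist, ntuple

# Two jobs over two "files"; their outputs are combined afterwards.
outputs = [run_job(make_events(500, seed)) for seed in (1, 2)]
summed_hist = [sum(bins) for bins in zip(*(hist for hist, _ in outputs))]
chain = [pt for _, ntuple in outputs for pt in ntuple]
print(summed_hist, len(chain))
```

The histogram sum and the chain are the two objects the user would then inspect interactively.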

13 Batch Analysis on the GRID
From a user's viewpoint, this is a simple extrapolation of the local batch analysis. In practice, it must involve all the GRID machinery: authentication, resource brokers, sandboxes. Viewing the current status (histograms) must be possible. Advantage: stateless, can process large data volumes. Advanced systems already exist (see talk by Andreas Wagner).

14 AliEnFS & Distributed Analysis
[Figure: a ROOT session (ROOT version 3.03/09, 3 December 2002, compiled for linux with thread support; CINT/ROOT C/C++ interpreter version 5.15.61, Oct 6 2002) in which the user types
root [0] newanalysis->Submit();
Diagram labels: Analysis Macro; Query for Input Data; MSS; CE; merged Trees + Histograms; VFS; LUFS; kernel space; AliEnFS; AliEn API; user space; castor://, soap://, root://, https://; /alien/ with alice/atlas, data/prod/mc, a/b; Linux File System.]

15 Interactive Local Analysis
On a public cluster or on the user’s laptop. Tools like PAW or its successors are used for visualization and for ntuple/tree analysis.

16 GRID: Interactive Analysis Case 1
Data transfer to the user’s laptop.
[Diagram labels: optional run/file catalog; optional GRID software; remote file server (e.g. rootd); trees.]
Analysis scripts are interpreted or compiled on the local machine.

17 GRID: Interactive Analysis Case 2
Remote data processing.
[Diagram labels: optional run/file catalog; optional GRID software; remote data analyzer (e.g. proofd); trees; commands, scripts; histograms.]
Analysis scripts are interpreted or compiled on the remote machine.

18 GRID: Interactive Analysis Case 3
Remote data processing.
[Diagram labels: run/file catalog; full GRID software; remote data analyzer (e.g. proofd); trees; commands, scripts; histograms, trees; slave.]
Analysis scripts are interpreted or compiled on the remote master(s).

19 Data Analysis Projects

20 Tools for data analysis
PAW: started in 1985, no major developments since 1994.
HippoDraw: started in 1991.
ROOT: started in 1995, continuous development.
JAS: started in 1995, continuous development.
Open Scientist: ?
LHC++/Anaphe: 1996-->2002.
PI: new project in the LHC Computing Grid, just starting now.

21 PAW
The reference for 18 years (since 1985). Used by most collaborations; ported to many platforms; small (3 to 15 MB). Many criticisms during the development phase; applauded now that it is stable. Maintained by Olivier Couet (ROOT team). Usage still growing. 0.1 FTE.

22 HippoDraw
Author: Paul Kunz. Showed the way in 1991/1992. Usage: Paul + “a 50 year-old CERN physicist”. Seems to be in a constant prototyping phase. It is good to have this type of prototype to illustrate possible new interactive techniques. 1 FTE ?

23 ROOT
In constant development since 1995. Used by many collaborations and outside HEP. More than 10000 distributions of binary tar files in February. 6+2+.. FTE.

24 JAS
Started in 1995 (Tony Johnson). Current version: 2. JAS3 presented at this CHEP. For the Java world. How to cooperate with C++ frameworks? 3 FTE ?

25 In AIDA you believe ?
The Abstract Interfaces for Data Analysis project was started by the now defunct LHC++ and continued by Anaphe (now stopped). It is supported by JAS and Open Scientist. Goal: define abstract interfaces to facilitate cooperation between developers and to facilitate the migration of users to new products. Versions 1, 2 and 3 (version 4 for PI ?).

26 In AIDA I don’t believe
Abstract interfaces are fundamental in modern systems: they make a system more modular and adaptable. But common abstract interfaces are not a good idea. They force a lowest common denominator. They require international agreements. Users will be confused (what is common and what is not?). You become the slave of a deal, which works against creativity. It is more important to agree on object interchange formats and database access. You can easily change a few hundred lines of code. You cannot copy terabytes of data.

27 The LCG PI project
Fresh from the oven. One of the projects recently launched by the Applications Area of the LCG project. Ideas: promote the use of AIDA (version 4); Python for scripting; interface to ROOT & CINT in gestation. See Vincenzo.

28 User & Developer views
Users' requests: very rarely requests for grandiose new features; zillions of tiny new features; zillions of tiny improvements; they want consolidation & stability.
Developers' view: they want to implement the sexy features; they target modularity (more complex installation?); maintenance & helpdesk: a problem or a chance?

29 Lessons from the past
It takes time to develop a general tool: more than 7 years for PAW, ROOT and JAS. User feedback is essential in the development phase. People like stable systems. Efficient access to data sets is a prerequisite. Online support 24 hours x 7 days x 12 months x N years is vital.

30 Develop/Debug/maintain
In an interactive system with N basic functions, the number of combinations may be unlimited (not NxN, but N!). 10% of the time goes into developing the first 90% of the code; 90% of the time goes into developing the remaining 10%.
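
To give a feeling for the difference between NxN and N!, the sketch below tabulates both. The interpretation is an assumption: NxN is taken as the number of ordered pairs of functions and N! as the number of orderings in which all N functions are used once each; the helper name `growth` is hypothetical.

```python
import math

def growth(n):
    """Return the pair count (n x n) and the ordering count (n!) for n functions."""
    return {"pairs": n * n, "orderings": math.factorial(n)}

for n in (3, 5, 10, 20):
    counts = growth(n)
    print(f"N={n:2d}  NxN={counts['pairs']:4d}  N!={counts['orderings']}")
```

Already for N = 10 the factorial exceeds the square by more than four orders of magnitude, which is why exhaustive testing of an interactive system is out of reach.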

31 Time to develop
[Figure; the only surviving label is “LCG”.]

32 Technical aspects

33 Desktop
Plug-in manager and dictionary; GUI; graphics (2-D, 3-D); event displays; histogramming & fitting; statistics tools; scripting; data/program organization.

34 Plug-in Manager
[Diagram labels: object dictionary; I/O manager; interpreter; plug-in manager; basic services, GUI, math...; user shared lib; experiment shared libs; general utility shared lib.]

35 The Object Dictionary
[Diagram labels: object dictionary (data dictionary, functions dictionary); compiled code; interpreted scripts; GUI; command line; I/O; inspectors; browsers.]
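
As an illustration of what a data dictionary and a functions dictionary make possible, the sketch below uses Python's own introspection as a stand-in. It is not the ROOT/CINT dictionary: the class `Track` and the helpers `data_dictionary`, `functions_dictionary` and `inspect_object` are hypothetical names. The point is that a generic inspector or browser can be written once, against the dictionary, without knowing any user class in advance.

```python
import inspect
from dataclasses import dataclass, fields

@dataclass
class Track:
    """A hypothetical user class."""
    px: float
    py: float
    charge: int

    def pt(self) -> float:
        return (self.px ** 2 + self.py ** 2) ** 0.5

def data_dictionary(cls):
    """Map each data member name to the name of its declared type."""
    return {f.name: getattr(f.type, "__name__", str(f.type)) for f in fields(cls)}

def functions_dictionary(cls):
    """List the public member functions of the class."""
    members = inspect.getmembers(cls, inspect.isfunction)
    return sorted(name for name, _ in members if not name.startswith("_"))

def inspect_object(obj):
    """A generic inspector: it needs only the data dictionary, not the class."""
    return {name: getattr(obj, name) for name in data_dictionary(type(obj))}

track = Track(3.0, 4.0, 1)
print(data_dictionary(Track))       # data members and their types
print(functions_dictionary(Track))  # callable members
print(inspect_object(track))        # values of one object
```

The same dictionary could drive I/O, a command line, or a GUI, which is what the diagram suggests.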

36 Scripting for data analysis
After the KUIP and Tk/Tcl era. A command line interface is required. Scripts are interpreted and/or byte-code interpreted, with automatic compilation and linking. Scripts call compiled or interpreted code; compiled code must be able to call interpreted code (GUI and configuration scripts). Big bonus if the compiled and interpreted languages are the same. Symbiosis between scripting and the object dictionary. Remote execution of scripts (in parallel).

37 Languages & scripting
[Diagram labels: C++ compiled code; Python/Perl scripts; GUI with signals/slots; interactive user; C++ interpreted scripts; batch user.]

38 Comparing scripts
Very interesting project from Subir Sarkar: cooperation between Java and a C++ framework, based on the object dictionary.

39 GUI(s)
Constant evolution:

1983   Vax/VMS SMS, VT100
1985   GKS, Tektronix
1989   MOTIF, Unix workstations
1997   Java/Swing, The Web
2001   Qt, Linux/laptops
+ Microsoft MFC, Win32 API

The signals/slots principle is very nice: it helps in designing large and modular GUI systems. Interpreters help GUI builders/editors.

40 2-D graphics
An area where constant improvements are required: better plotters, better fonts, better drivers (PostScript, SVG, XML, etc.). Publication quality is a must. This requirement alone explains why many proposed data analysis systems do not penetrate experiments.

41 3-D graphics
Data structures: objects scene. Scene renderers: OpenGL, Open Inventor. The most difficult part is detector geometry graphics. z-buffer algorithms are OK for fast, real-time, fancy graphics, but not OK for good debugging (the shape outline is important on top of z-buffer views). Vector PostScript (or PDF/SVG) must be available (not PostScript made from OpenGL triangles). See talks about GraXML and Persint.

42 Example with PERSINT/ATLAS

43 Event Displays
The most successful event displays so far have been 2-D projections (see Aleph, Atlas/Atlantis). A lot of work with 3-D graphics in many experiments (see talks about Iguana). Client-server model. Access to framework objects, browsers. One could have expected a bigger role for Java! Mismatch with experiment C++ frameworks? Possible directions: standardize object exchange (SOAP/XML/Root I/O); standardize low-level graphics exchange (HEPREP).

44 Histogramming
This should be a stable area. Open points: thread safety; binning on parallel systems; merging on batch/parallel systems.
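
Merging on batch or parallel systems reduces to adding bin contents, provided that all partial histograms share the same binning. The sketch below shows this with a hypothetical `Hist1D` class (not the Hbook or ROOT API). Each worker fills a private histogram, so no locking is needed during filling, and the merged result is compared with a single sequential pass.

```python
class Hist1D:
    """Minimal fixed-binning 1-D histogram (hypothetical, for illustration)."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi
        self.counts = [0] * nbins

    def fill(self, x):
        # Values outside [lo, hi) are ignored in this sketch.
        if self.lo <= x < self.hi:
            index = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
            self.counts[min(index, self.nbins - 1)] += 1

    def add(self, other):
        # Merging is only meaningful when the binning is identical.
        if (self.nbins, self.lo, self.hi) != (other.nbins, other.lo, other.hi):
            raise ValueError("histograms must share the same binning to be merged")
        self.counts = [a + b for a, b in zip(self.counts, other.counts)]

data = [(i * 0.37) % 12.0 for i in range(1000)]

# Reference: one sequential pass over all the data.
reference = Hist1D(6, 0.0, 10.0)
for x in data:
    reference.fill(x)

# Four "workers", each filling a private histogram from its share of the data.
partials = []
for worker in range(4):
    hist = Hist1D(6, 0.0, 10.0)
    for x in data[worker::4]:
        hist.fill(x)
    partials.append(hist)

# Merge step: add the partial histograms bin by bin.
merged = Hist1D(6, 0.0, 10.0)
for hist in partials:
    merged.add(hist)

print(merged.counts == reference.counts)
```

Agreeing on the binning before the workers start is what makes the merge a plain sum; automatic binning would need an extra pass or a rebinning step.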

45 Fitting
Minuit: the standard. Fumili: was nice and fast. An upgrade of Minuit with new algorithms, including Fumili, is in the pipeline. Several GUIs on top. A very powerful package developed by BaBar: see the talk on RooFit by D. Kirkby.

46 Statistics & Math
Many tools and algorithms exist: GSL ?; Gnu R-Math project; TerraFerma Initiative. Subject of discussions at many workshops: confidence limits workshops; ACAT (FermiLab and Moscow); Durham. They need to be federated in a coherent framework.

47 Lost with Complexity?
In large collaborations, users are often lost when confronted with the complexity of big simulation and reconstruction programs. What is the data organization? How are the algorithms organized? What is the hierarchy? The problem is amplified by the use of dynamically configurable systems, dynamic linking and polymorphism. Browsing data and algorithms is a must.

48 Folders / white boards
Folders help in understanding complex hierarchical structures. They are language independent and could be GRID-aware.

49 Why Folders ?
This diagram shows a system without folders. The objects have pointers to each other to access each other's data. Pointers are an efficient way to share data between classes. However, a direct pointer creates a direct coupling between classes. This design can become a very tangled web of dependencies in a system with a large number of classes.

50 Why Folders ?
A naming and search service provides an alternative. In the diagram below, a reference to the data is in the folder, and the consumers refer to the folder rather than to each other to access the data. This loosely couples the classes and greatly enhances I/O operations. In this way, folders separate the data from the algorithms and greatly improve the modularity of an application by minimizing the class dependencies.
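
The idea of a naming and search service can be sketched in a few lines. The `Folder` class below is hypothetical and much simpler than a real implementation; the paths and the two algorithms `clusterizer` and `tracker` are invented for the example. The producer and the consumer never reference each other: both know only the folder and a path.

```python
class Folder:
    """A hierarchical white board: objects are posted and looked up by path."""

    def __init__(self, name):
        self.name = name
        self.entries = {}   # sub-folders and posted objects, keyed by name

    def add(self, path, obj):
        """Post obj under path, creating intermediate folders as needed."""
        *folders, leaf = path.strip("/").split("/")
        node = self
        for part in folders:
            node = node.entries.setdefault(part, Folder(part))
        node.entries[leaf] = obj

    def find(self, path):
        """Return the object (or sub-folder) stored under path."""
        node = self
        for part in path.strip("/").split("/"):
            node = node.entries[part]
        return node

def clusterizer(board):
    """Producer: posts its output and knows nothing about the consumers."""
    board.add("event/raw/hits", [1.2, 3.4, 0.7])

def tracker(board):
    """Consumer: finds its input by name, not through a pointer to the producer."""
    hits = board.find("event/raw/hits")
    board.add("event/reco/n_hits", len(hits))

board = Folder("root")
clusterizer(board)
tracker(board)
print(board.find("event/reco/n_hits"))
```

Replacing `clusterizer` by another producer that posts to the same path requires no change in `tracker`, which is the loose coupling the slide describes.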

51 Tasks/Algorithms
In the same way that folders can be used to organize the data, one can use tasks to organize a hierarchy of algorithms. Tasks can be organized into a hierarchical tree of tasks and displayed in the browser. A Task is an abstraction with standard functions Begin, Execute and Finish. Each Task-derived class may contain other tasks that can be executed recursively, so that a complex program can be dynamically built and executed by invoking the services of the top-level task or of one of its subtasks. Tasks help in understanding the organization and the sequence of execution of large programs.
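
A minimal sketch of such a task hierarchy follows. The class and method names are hypothetical and only loosely modelled on the Begin/Execute/Finish description above; in particular, the order chosen here (a task's own work first, then its subtasks, then its finish step) is an assumption, not something the slide specifies.

```python
class Task:
    """A node in a tree of algorithms with begin/execute/finish hooks."""

    def __init__(self, name):
        self.name = name
        self.subtasks = []

    def add(self, task):
        self.subtasks.append(task)
        return task

    # Hooks: derived classes would override these with real work.
    def begin(self, trace):
        trace.append(f"begin {self.name}")

    def execute(self, trace):
        trace.append(f"execute {self.name}")

    def finish(self, trace):
        trace.append(f"finish {self.name}")

    def run(self, trace):
        """Run this task, then recursively run all of its subtasks."""
        self.begin(trace)
        self.execute(trace)
        for sub in self.subtasks:
            sub.run(trace)
        self.finish(trace)

    def tree(self, depth=0):
        """Indented listing of the hierarchy, as a browser might show it."""
        lines = ["  " * depth + self.name]
        for sub in self.subtasks:
            lines.extend(sub.tree(depth + 1))
        return lines

reco = Task("reconstruction")
tracking = reco.add(Task("tracking"))
tracking.add(Task("vertexing"))
reco.add(Task("calorimetry"))

trace = []
reco.run(trace)            # run the whole program from the top-level task
print("\n".join(reco.tree()))

sub_trace = []
tracking.run(sub_trace)    # or run only one branch of the hierarchy
```

Because every node exposes the same `run` entry point, the same code runs the full chain or any subtree, and the `tree` listing documents the execution order.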

52 Directions

53 Exchange/Compatibility
If we assume that several data analysis tools will be around (HEP-made or commercial), it is important to be able to exchange objects between these tools (drag & drop, network or files). SOAP and XML have emerged as standards for exchanging low volumes of objects. Several technical solutions are possible. The winning solutions will be the ones that can automate the process by exploiting all the information in the object dictionary.
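
The sketch below shows what automating the exchange from dictionary information can look like. It writes plain XML, not a SOAP envelope, and it uses Python dataclass introspection as a stand-in for an object dictionary; the class `Hit` and the functions `to_xml` and `from_xml` are hypothetical. No per-class conversion code is written: the field list drives both directions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields

@dataclass
class Hit:
    """A hypothetical user class to be exchanged between tools."""
    x: float
    y: float
    layer: int

def to_xml(obj):
    """Serialize any dataclass instance using only its field dictionary."""
    element = ET.Element(type(obj).__name__)
    for f in fields(obj):
        child = ET.SubElement(element, f.name)
        child.text = repr(getattr(obj, f.name))
    return ET.tostring(element, encoding="unicode")

def from_xml(text, cls):
    """Rebuild an instance; each field's declared type converts the text back."""
    element = ET.fromstring(text)
    values = {f.name: f.type(element.findtext(f.name)) for f in fields(cls)}
    return cls(**values)

hit = Hit(1.5, -2.25, 3)
document = to_xml(hit)
print(document)
print(from_xml(document, Hit) == hit)
```

This only handles flat classes with simple numeric members; nested objects, collections and pointers are where a real dictionary-driven system has to do most of its work.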

54 Follow Microsoft ?
SOAP and XML are among the key components of .NET (and also of the MS competition). MS is preparing a new OS (Longhorn ?) for 2005. This new OS will introduce a distributed object database. This may have a serious impact on the GRID software and on our tools.

55 Access Patterns
Understand data access patterns: to objects in one file; to subsets of objects in many collections; relations with run/file catalogs; persistent reference pointers.
Optimize the design of containers for: processing in batch; interactive parallel processing; cache management and proxies.

56 Query processor
Extend/develop powerful query systems that: minimize the amount of programming; optimize I/O (read only what is strictly necessary); are able to process data in parallel, hiding the complexity of parallelism from the end user; can be executed again and again, possibly learning from the previous passes; are robust against network failures, CTRL/C and programming errors; can be run in GUI mode, interpreted mode or compiled mode.
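
Two of these requirements, reading only the necessary data and hiding the parallelism, can be sketched as follows. Everything here is hypothetical: `ColumnStore` stands in for a column-wise data set, `query` for the query processor, and threads stand in for the parallel workers (a real system would distribute the chunks over remote nodes). The user supplies only the list of needed columns, a selection and an expression.

```python
from concurrent.futures import ThreadPoolExecutor

class ColumnStore:
    """Toy column-wise data set that records which columns were actually read."""

    def __init__(self, columns):
        self._columns = columns
        self.columns_read = set()

    def __len__(self):
        return len(next(iter(self._columns.values())))

    def read(self, name, start, stop):
        self.columns_read.add(name)
        return self._columns[name][start:stop]

def query(store, needed, selection, expression, chunk=250, workers=4):
    """Apply selection/expression per entry, chunk by chunk, in parallel."""

    def process(bounds):
        start, stop = bounds
        # Only the columns named in `needed` are read for this chunk.
        columns = {name: store.read(name, start, stop) for name in needed}
        results = []
        for i in range(stop - start):
            row = {name: columns[name][i] for name in needed}
            if selection(row):
                results.append(expression(row))
        return results

    bounds = [(s, min(s + chunk, len(store))) for s in range(0, len(store), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(process, bounds))   # map keeps the chunk order
    return [value for part in parts for value in part]

n = 1000
store = ColumnStore({
    "pt": [(i * 7) % 50 for i in range(n)],
    "eta": [((i * 13) % 60 - 30) / 10 for i in range(n)],
    "raw_blob": [None] * n,   # a large column that this query never touches
})

selected = query(store, ["pt", "eta"],
                 selection=lambda row: abs(row["eta"]) < 1.0,
                 expression=lambda row: row["pt"])
print(len(selected), sorted(store.columns_read))
```

The user code contains no loop over chunks and no thread handling; rerunning the same query with a different `chunk` or `workers` setting gives the same result.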

57 Event Collections
Develop/extend objects that are able to keep a summary of previous runs. Event collections, with their iterators, must be well matched to the query processor (event+run, UUID, tree entry serial number). Special objects: masks and bit-slice indices to speed up searches in large collections. The system must be able to run with and without the run/file catalog.
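
Masks and bit-slice indices can be illustrated with Python integers used as bit sets. This is a sketch under stated assumptions: bit i of a mask stands for entry i of the collection, the function names are hypothetical, and the bit-slice index is the textbook variant in which slice k records bit k of an integer attribute for every entry. An equality search then needs only bitwise operations on the slices, not a scan of the attribute values.

```python
def build_mask(values, predicate):
    """Return an integer whose bit i is set when predicate(values[i]) holds."""
    mask = 0
    for i, value in enumerate(values):
        if predicate(value):
            mask |= 1 << i
    return mask

def entries(mask):
    """Iterate over the entry numbers whose bit is set in the mask."""
    i = 0
    while mask:
        if mask & 1:
            yield i
        mask >>= 1
        i += 1

def build_bit_slices(values, nbits):
    """Slice k has bit i set when bit k of values[i] is set."""
    return [build_mask(values, lambda v, k=k: (v >> k) & 1) for k in range(nbits)]

def equals(slices, n_entries, target):
    """Mask of entries whose value equals target (target must fit in the slices)."""
    result = (1 << n_entries) - 1          # start with all entries selected
    for k, bit_slice in enumerate(slices):
        # Keep entries whose bit k agrees with bit k of the target.
        result &= bit_slice if (target >> k) & 1 else ~bit_slice
    return result

n_tracks = [3, 0, 5, 3, 7, 2, 3, 5]        # one integer attribute per event
triggered = [1, 1, 0, 0, 1, 1, 1, 0]       # one flag per event

slices = build_bit_slices(n_tracks, nbits=3)
three_tracks = equals(slices, len(n_tracks), 3)
trigger_mask = build_mask(triggered, bool)

print(list(entries(three_tracks)))                  # events with exactly 3 tracks
print(list(entries(three_tracks & trigger_mask)))   # ... that also fired the trigger
```

Combining selections is a single bitwise AND, and the resulting mask can be stored and reused as an event list for later passes.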

58 Exploiting meta information
The normal data analysis mode requires access to the user classes. However, experience shows that users also expect (as was the case for PAW) to be able to process their data sets without the classes/shared libraries that were used to generate them, while still supporting automatic schema evolution. The class meta information is saved in the data set. Simple queries involving only the data attributes of a class must be possible without the code. This requirement has consequences for the way the object dictionary is used.
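
The principle of a self-describing data set can be sketched as follows. The file layout is invented for this example (a JSON header followed by fixed-size binary records) and is not the ROOT format; `write_dataset` and `read_dataset` are hypothetical names. The reader never imports the class that produced the data: it learns the member names and types from the header, so the same query works on files written with different versions of the class.

```python
import io
import json
import struct

def write_dataset(stream, class_name, version, schema, rows):
    """Write a header describing the class, then one packed record per row.

    schema is a list of (member name, struct format code) pairs.
    """
    header = json.dumps({"class": class_name, "version": version,
                         "fields": schema}).encode()
    stream.write(struct.pack("<I", len(header)))
    stream.write(header)
    record = struct.Struct("<" + "".join(code for _, code in schema))
    for row in rows:
        stream.write(record.pack(*(row[name] for name, _ in schema)))

def read_dataset(stream):
    """Read the data using only the meta information stored in the file."""
    (size,) = struct.unpack("<I", stream.read(4))
    meta = json.loads(stream.read(size))
    names = [name for name, _ in meta["fields"]]
    record = struct.Struct("<" + "".join(code for _, code in meta["fields"]))
    rows = []
    while True:
        chunk = stream.read(record.size)
        if len(chunk) < record.size:
            break
        rows.append(dict(zip(names, record.unpack(chunk))))
    return meta, rows

# Version 1 of a hypothetical Track class has two members; version 2 adds chi2.
old_file, new_file = io.BytesIO(), io.BytesIO()
write_dataset(old_file, "Track", 1, [("pt", "d"), ("charge", "i")],
              [{"pt": 1.5, "charge": 1}, {"pt": 2.5, "charge": -1}])
write_dataset(new_file, "Track", 2, [("pt", "d"), ("charge", "i"), ("chi2", "d")],
              [{"pt": 4.0, "charge": 1, "chi2": 0.5}])

# The same simple query runs on both files, without the Track class.
positive_pt = []
for stream in (old_file, new_file):
    stream.seek(0)
    meta, rows = read_dataset(stream)
    positive_pt += [row["pt"] for row in rows if row["charge"] > 0]
print(positive_pt)
```

Queries that touch a member missing from an older version would need a default value or a conversion rule, which is where a full schema evolution system goes beyond this sketch.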

59 Dependencies & Simplicity
Minimize component dependencies to facilitate software distribution and portability. The winning tools will be the ones that: are easy to port to new systems (OS/compilers); depend only on other systems that are also easy to port; are used in real conditions, to guarantee feedback; are able to evolve very quickly to adapt to new situations and new requirements.

60 Integration with GRID soft
The data analysis software is an integral part of the GRID software. It drives the process, not the inverse. This implies close cooperation between the teams working on tools for data analysis and the teams working on the GRID plumbing (resource brokers, authentication, etc.) and on GRID high-level tools like Condor. The batch line and the interactive line must be developed in a complementary way.

61 Trends Summary
[Diagram labels: histogram/ntuple viewers; data presenters; efficient access to large and structured event collections; interaction with user & experiment classes; parallelism on the GRID; batch/interactive; access to catalogs; resource brokers; process migration; progress monitors; proxies/caches; virtual data sets.]
More and more GRID-oriented data analysis. More and more experiment-independent software.

62 Acknowledgements
For a long time, data analysis has been the last wheel of the car. Many thanks to the organizing committee for giving me the opportunity to present my views on the subject. Enjoy this conference!

