1
Distributed Data Analysis and Tools. P. Mato / CERN. CHEP, 21-27 March 2009, Prague
2
Distributed data analysis is a very wide subject, and I don't like catalogue-like talks.
The scope of the presentation is narrowed to the perspective of the physicists, discussing issues that affect them directly.
My presentation will be LHC-centric, which is very relevant for the current phase we are in now (sorry).
Thanks to all the people who have helped me to prepare this presentation.
3
The full data processing chain from reconstructed event data up to producing the final plots for publication.
Data analysis is an iterative process:
◦ Reduce data samples to more interesting subsets (selection)
◦ Compute higher-level information, redo some reconstruction, etc.
◦ Calculate statistical quantities
Algorithm development is essential in analysis:
◦ The ingenuity is materialized in code
4
The large amount of data to be analyzed and the computing requirements rule out non-distributed data analysis.
The scale of 'distribution' goes from a local cluster to a computer center or to the whole grid(s).
Distributed analysis complicates the life of the physicists:
◦ In addition to the analysis code, they have to worry about many other technical issues
5
Data is generated at the experiment, processed and distributed worldwide (T1, T2, T3).
The analysis will process, reduce, transform and select parts of the data iteratively until it fits in a single computer.
How is this realized?
6
All elements there and still valid:
◦ Less organized activity (chaotic)
◦ Input data defined by asking questions
◦ Data scattered all over the world
◦ Own algorithms
◦ Data provenance
◦ Software version management
◦ Resource estimation
◦ Interactivity
Advocating for a sophisticated WMS:
◦ Common to all VOs
◦ Plugins to VO-specific tools/services
[Diagram: a Workload Management System connecting the dataset query, user algorithms, user output and other services]
† Common use cases for a HEP Common Application Layer for Analysis, LCG-2003
7
“If there is no special middleware support [for analysis], the job may not benefit from being run in the grid environment, and analysis may even take a step backward from pre-grid days”
8
The implementation has evolved into a number of VO-specific "middleware" layers using a small set of basic services:
◦ E.g. DIRAC, PanDA, AliEn, glide-in
Development of "user-friendly" and "intelligent" interfaces to hide the complexity:
◦ E.g. CRAB, Ganga
Not optimal for small VOs that cannot afford to develop specific services/interfaces:
◦ Or individuals with special needs
[Diagram: layered architecture with the [VO-specific] front-end interface on top of the VO-specific WMS and DSC, built on the basic grid middleware services and the computing & storage resources]
9
Specialization of the VOs' frameworks and data models for data analysis to process ESD/AOD:
◦ CMS Physics Analysis Toolkit (PAT), ATLAS Analysis Framework, LHCb DaVinci/LoKi/Bender, ALICE Analysis Framework
◦ In some cases selecting a subset of framework libraries
◦ Collaboration-approved analysis algorithms and tools
Other [scripting] languages have a role here:
◦ Python is getting very popular in addition to CINT macros
◦ Ideal for prototyping new ideas (see the sketch below)
Users typically develop their own algorithm(s) based on these frameworks, but are also willing to replace parts of the official release.
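As an illustration of why Python is handy for prototyping, here is a minimal PyROOT sketch that loops over a tree and fills a histogram; the file name, tree name and branch name are invented for the example and are not taken from any experiment framework.

import ROOT

# Open a (hypothetical) ROOT file with reconstructed events
f = ROOT.TFile.Open("analysis_sample.root")
tree = f.Get("Events")

# Book a histogram and fill it from a (hypothetical) branch
h_pt = ROOT.TH1F("h_pt", "Transverse momentum;p_{T} [GeV];Events", 100, 0.0, 100.0)
for event in tree:
    h_pt.Fill(event.pt)

# Save the result for later inspection
out = ROOT.TFile("plots.root", "RECREATE")
h_pt.Write()
out.Close()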
10
[Screenshots: the Ganga, ALICE and CRAB analysis front-ends]
11
Both Ganga and ALICE provide an interactive shell to configure and automate analysis jobs (Python, CINT):
◦ In addition, Ganga provides a GUI
CRAB has a thin client; most of the work (automation, recovery, monitoring, etc.) is done in a server:
◦ In the other cases this functionality is delegated to the VO-specific WMS
Ganga offers a convenient overview of all user jobs (job repository), enabling automation.
Both CRAB and Ganga are able to pack local user libraries and environment automatically, making use of the configuration tool's knowledge:
◦ For ALICE the user provides .par files with the sources
12
1. Algorithm development and testing starts locally and small
◦ Single computer → small cluster
2. Grows to a large data and computation task
◦ Large cluster → the Grid
3. Final analysis is again more local and small
◦ Small cluster → single computer
Ideally the analysis activity should be a continuum in terms of tools, software frameworks, models, etc.:
◦ LHC experiments are starting to offer this to their physicists
◦ Ganga is a good example: from inside the same session you can run a large data job and do the final analysis with the results
13
The user specifies what data to run the analysis on using VO-specific dataset catalogs:
◦ The specification is based on a query
◦ The front-end interfaces provide functionality to facilitate the catalog queries (see the sketch below)
Each experiment has developed event-tag mechanisms for sparse input data selection.
Data is scattered over the world:
◦ The computing model and policies of the experiment dictate the placement of data
◦ Read-only data with several replicas
◦ Portions of the data copied to local clusters (CAF, T3, etc.) for local access
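A minimal sketch of what such a query might look like from a front-end session; the DatasetCatalog client and the dataset naming pattern are hypothetical placeholders, not the API of any specific experiment catalog.

# Hypothetical catalog client; real front ends (Ganga, CRAB, ...) wrap the
# experiment-specific dataset catalog behind a similar query interface.
from vo_tools import DatasetCatalog  # hypothetical module

catalog = DatasetCatalog()

# Wildcard query: all AOD datasets from a (made-up) data-taking period
datasets = catalog.query("data09_900GeV.*.AOD.*")

for ds in datasets:
    print(ds.name, ds.n_files, ds.replica_sites)  # where the replicas live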
14
Small output data files such as histogram files are returned to the client session (using the sandbox):
◦ Usually limited to a few MB
Large output files are typically put in Storage Elements (e.g. Castor), registered in the grid file catalogue (e.g. LFC), and can be used as input for other Grid jobs (iterative process); see the sketch below.
Tools such as CRAB and Ganga (ATLAS) provide strong links with the VO's Distributed Data Management/Transfer systems (e.g. DQ2, PhEDEx) to place the output where the user wants it.
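A Ganga-flavoured sketch of the two output routes, assuming a job object j as in the Ganga examples elsewhere in this talk; the outputsandbox attribute exists in Ganga, while the large-output part uses a hypothetical GridOutputDataset placeholder rather than the real experiment-specific classes.

# Small files (a few MB) come back to the client via the sandbox
j.outputsandbox = ["histograms.root", "cutflow.txt"]

# Large files should instead go to a Storage Element and be registered
# in the grid file catalogue; the class below is a hypothetical stand-in
# for the VO-specific output-dataset plugin.
j.outputdata = GridOutputDataset(
    files=["selected_events.root"],
    location="CERN-PROD_USERDISK",   # made-up site name for the example
)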
15
The goal is to make it easy for physicists: distributed analysis as simple as doing it locally:
◦ Which is already complicated enough!!
◦ Hiding the technical details is a must
In Ganga, changing the back-end from LSF to DIRAC requires changing one parameter (see the sketch below).
In ALICE, changing from PROOF to AliEn requires changing one name and providing an AliEn plugin configuration.
In CRAB, changing from local batch to gLite requires a single parameter change in the configuration file.
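A minimal Ganga-style sketch of that one-parameter change; the executable name is invented for the example, and the backend class names (LSF, Dirac) are assumed to correspond to the back-ends mentioned on this slide.

# Inside a Ganga session (Python prompt)
j = Job()
j.application = Executable(exe="run_my_analysis.sh")   # made-up user script

# Run on the local LSF batch system first...
j.backend = LSF()
j.submit()

# ...and for the large-scale run, only the backend changes
j2 = j.copy()
j2.backend = Dirac()
j2.submit()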
16
[Diagram by Andrei Gheata: the ALICE Analysis Manager (AM) running on PROOF. The client macro MyAnalysis.C calls AM->StartAnalysis("proof"); the manager wraps the user tasks (task1..taskN) in an AliAnalysisSelector/TSelector, the master distributes the input chain to the workers (SlaveBegin(), Process(), SlaveTerminate()), and the partial output lists (O1..On) are merged back into the client's output list in Terminate().]
17
A large variety of front-ends and back-ends. It is great, but it may add confusion and complicate user support.
18
Distributed analysis relies on the software installed on the remote nodes (e.g. local cluster, Grid):
◦ The experiments' officially released software is taken care of by the VOs
◦ Installation procedures for big VOs are well oiled
◦ A problem for small VOs / individuals
Physicists' add-ons and private analysis algorithms need to be sent along with the job (see the sketch below):
◦ Every user tool provides some level of support for this
◦ Exact matching of the OS version/compiler (platform) is required when sending binaries
The latter imposes strong constraints on the platform uniformity of the different facilities:
◦ Local interactive service → local facility → Grid
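A small sketch of what "sending the add-ons along with the job" amounts to under the hood: packing the user's private libraries and macros into a tarball that the front end ships in the input sandbox. The directory layout is invented for the example; the real tools (CRAB, Ganga, the ALICE .par files) automate this step.

import tarfile
from pathlib import Path

def pack_user_area(out_name="user_area.tar.gz"):
    """Collect private libraries and macros into one sandbox tarball."""
    with tarfile.open(out_name, "w:gz") as tar:
        for path in ["lib", "python", "macros"]:   # made-up layout
            if Path(path).exists():
                tar.add(path)
    return out_name

# The front end would then list the tarball in the job's input sandbox
sandbox_files = [pack_user_area()]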
19
CernVM is a Virtual Appliance that provides a complete, portable and easy-to-configure user environment for developing and running analysis locally and on the Grid, independently of the physical software and hardware platform.
It comes with a read-only file system (CVMFS) optimized for software distribution:
◦ Only a small fraction of the software is actually used (~10%)
◦ Very aggressive local caching, web proxy cache (squids)
◦ Operational in off-line mode
[Diagram: CernVM nodes with CVMFS fetching software over HTTP(S) proxies through the LAN/WAN]
20
The CernVM platform is starting to be used by physicists to develop/test/debug data analysis:
◦ With a laptop you carry the complete development environment and the Grid UI with you
◦ Managing all phases of analysis from the same 'window'
Ideally the same environment should be used to execute their jobs on the Grid:
◦ Validation with large datasets
◦ Decoupling application software from system software and hardware
Can the existing 'Grid' be adapted to CernVM?
21
Job splitting (parallelization) is essential to be able to analyze large data samples in a limited time:
◦ Very long jobs are more unreliable
Tools such as PROOF dynamically split the analysis job at the sub-file level (packets), offering [quasi] interactivity to the user.
All the other Grid submission tools provide parallelization by splitting the list of input files (see the sketch below):
◦ Sub-jobs are constrained by the input data location
The more difficult part is the result merging:
◦ Standard automation of the most common cases
◦ User intervention for the more complicated ones
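A minimal sketch of file-level splitting and histogram merging, the pattern the submission tools automate; the file names are invented, and the merge step uses ROOT's TFileMerger, assuming the sub-job outputs are plain ROOT files.

import ROOT

def split_by_files(input_files, files_per_job=10):
    """Chunk the input file list into one list per sub-job."""
    return [input_files[i:i + files_per_job]
            for i in range(0, len(input_files), files_per_job)]

# e.g. 95 (made-up) input files -> 10 sub-jobs
subjobs = split_by_files([f"events_{i:03d}.root" for i in range(95)])

# After the sub-jobs have run, merge their histogram outputs
merger = ROOT.TFileMerger()
merger.OutputFile("merged_histograms.root")
for i in range(len(subjobs)):
    merger.AddFile(f"subjob_{i}_histos.root")
merger.Merge()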
22
The majority of today's computing resources are based on multi-core architectures:
◦ Exploiting these multi-core architectures (MT, MP) can optimize the use of resources (memory, I/O)
◦ See V. Innocente's presentation
Submitting a single job per node that utilizes all available cores can be advantageous (see the sketch below):
◦ Efficient in resources, mainly increasing the fraction of shared memory
◦ Scales down the number of jobs that the WMS needs to handle
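A simple sketch of the "one job per node" idea using Python's multiprocessing: a single whole-node job forks one worker per core and farms the input files out to them. The process_file function is a placeholder for the real per-file analysis.

import multiprocessing as mp

def process_file(filename):
    """Placeholder for the real per-file analysis; returns a summary."""
    n_selected = 0
    # ... open the file, loop over events, fill histograms ...
    return filename, n_selected

if __name__ == "__main__":
    input_files = [f"events_{i:03d}.root" for i in range(40)]  # made-up names

    # One worker process per available core on the node
    with mp.Pool(processes=mp.cpu_count()) as pool:
        results = pool.map(process_file, input_files)

    for name, n_sel in results:
        print(name, n_sel)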
23
Grouping data analyses is a way to optimize when going over a large part of, or the full, dataset (see the sketch below):
◦ Requires the support of the framework (a model)
◦ …and some discipline
Examples:
◦ ALICE is using the AliAnalysisManager framework to optimize the CPU/IO ratio (85% savings reported)
◦ LHCb is grouping pre-selections in their stripping jobs
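A toy sketch of the "analysis train" idea behind such grouping: the data is read once and every registered task sees each event, instead of each analysis looping over the data on its own. The class names are invented; ALICE's AliAnalysisManager implements the real version in C++.

class AnalysisTrain:
    """Read the input once and dispatch each event to all attached tasks."""

    def __init__(self):
        self.tasks = []

    def add_task(self, task):
        self.tasks.append(task)

    def run(self, events):
        for event in events:          # single pass over the (expensive) input
            for task in self.tasks:
                task.process(event)   # each analysis sees the same event
        for task in self.tasks:
            task.finish()

class CountingTask:
    """Trivial example task: just counts events."""
    def __init__(self, name):
        self.name = name
        self.count = 0
    def process(self, event):
        self.count += 1
    def finish(self):
        print(self.name, "processed", self.count, "events")

train = AnalysisTrain()
train.add_task(CountingTask("task1"))
train.add_task(CountingTask("task2"))
train.run(range(1000))   # stand-in for the real event loop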
24
At the time of HEPCAL-II, resource estimation was an important issue:
◦ How much CPU time would this analysis take, what will be the output data size, etc.
In practice physicists can estimate resources pretty well, since test analyses are performed on small data samples before submitting large jobs (see the sketch below):
◦ Proper reporting of the 'cost' of each job in standardized units could facilitate this estimation
◦ In the old days of CERNVM a job summary with the CPU time in 'CERN units' was printed for each job
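The extrapolation itself is simple arithmetic; here is a small sketch with invented numbers: run the test analysis on a small sample, measure CPU time and output size per event, and scale up to the full dataset.

# Measured on a (made-up) test run over a small sample
test_events = 50_000
test_cpu_seconds = 1_200.0        # ~24 ms/event
test_output_mb = 150.0            # ~3 kB/event

# Target: the full (made-up) dataset
total_events = 200_000_000

cpu_per_event = test_cpu_seconds / test_events
mb_per_event = test_output_mb / test_events

est_cpu_hours = cpu_per_event * total_events / 3600.0
est_output_gb = mb_per_event * total_events / 1024.0

print(f"Estimated CPU time: {est_cpu_hours:.0f} h")    # ~1300 h
print(f"Estimated output size: {est_output_gb:.0f} GB")  # ~590 GB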
25
Job failures are very common (e.g. ~45% of the CMS analysis jobs do not terminate successfully):
◦ The reasons are very diverse (data access, stalled jobs, data upload, application failure, …)
Proper reporting of job failures is essential for diagnosing and handling them efficiently:
◦ Detailed monitoring, log files, etc.
Handling failures may imply providing corrections in configurations or code, re-submission, managing site blacklists, etc.:
◦ Automated correction actions can be handled by servers (e.g. CRAB)
◦ Scripting support available to users (e.g. Ganga):

[1]: jobs.select(status='failed').resubmit()
[2]: jobs.select(name='testjob').kill()
[3]: newjobs = jobs.select(status='new')
[4]: newjobs.select(name='urgent').submit()
26
Monitoring is essential for the users and also for the administrators.
Physicists may use the web-based interfaces to find out information about their jobs:
◦ Each WMS has developed very complete monitoring tools
◦ The details available are really impressive (e.g. Panda Monitor)
Often the connection with the submission tools is poor:
◦ Not well integrated
27
If the front-end submission tool understands the analysis application [framework], it can become extremely helpful to the users.
E.g. the Ganga application component can:
◦ Set up the correct environment, collect user shareable libraries, analyze configuration files and follow dependencies, determine inputs and outputs and register them automatically, etc.
The technical solution to achieve this is to implement 'plugins' for each type of application (see the sketch below).
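A stripped-down sketch of what such an application plugin interface might look like; the class and method names are hypothetical and only illustrate the idea of per-application hooks, they are not Ganga's actual plugin API.

class ApplicationPlugin:
    """Hypothetical per-application hooks a front end could call."""

    def prepare(self, user_area):
        """Collect user libraries/config before submission; return sandbox files."""
        raise NotImplementedError

    def configure(self, job):
        """Derive inputs, outputs and environment from the application config."""
        raise NotImplementedError

class GaudiLikePlugin(ApplicationPlugin):
    """Made-up example for a Gaudi-style framework application."""

    def prepare(self, user_area):
        # e.g. pack locally built libraries found in the user area
        return [f"{user_area}/user_libs.tar.gz"]

    def configure(self, job):
        # e.g. parse the job options to find inputs and add the packed
        # libraries to the (hypothetical) job's input sandbox
        job.inputsandbox += self.prepare(job.user_area)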
28
Fundamentally the way analysis is being done has not changed very much:
◦ The initial dream that the Grid would dramatically change the paradigm has not happened
◦ Parts of the analysis with large data jobs will be done in batch, and parts will be done more locally and interactively
Each collaboration has developed tools to cope with the large data and computational requirements and to simplify the life of physicists:
◦ It turned out that the model/architecture of these tools is very similar, but they are not held in common
◦ The number of users of these tools is increasing rapidly