Presentation is loading. Please wait.

Presentation is loading. Please wait.

David Adams ATLAS DIAL Distributed Interactive Analysis of Large datasets David Adams BNL June 23, 2003 GAE workshop Caltech.

Similar presentations


Presentation on theme: "David Adams ATLAS DIAL Distributed Interactive Analysis of Large datasets David Adams BNL June 23, 2003 GAE workshop Caltech."— Presentation transcript:

1 David Adams ATLAS DIAL Distributed Interactive Analysis of Large datasets David Adams BNL June 23, 2003 GAE workshop Caltech

2 David Adams ATLAS June 23, 2003 DIAL GAE workshop2 Contents Goals of DIAL What is DIAL? Design Application Task Result Scheduler Dataset properties Exchange format Status Development plans Interactivity issues

3 David Adams ATLAS June 23, 2003 DIAL GAE workshop3 Goals of DIAL 1. Demonstrate the feasibility of interactive analysis of large datasets How much data can we analyze interactively? 2. Set requirements for GRID services Datasets, schedulers, jobs, results, resource discovery, authentication, allocation,... 3. Provide ATLAS with a useful analysis tool For current and upcoming data challenges Real world deliverable Another experiment would show generality

4 David Adams ATLAS June 23, 2003 DIAL GAE workshop4 What is DIAL? Distributed Data and processing Interactive Iterative processing with prompt response – (seconds rather than hours) Analysis of Fill histograms, select events, … Large datasets Any event data (not just ntuples or tag)

5 David Adams ATLAS June 23, 2003 DIAL GAE workshop5 What is DIAL? (cont) DIAL provides a connection between Interactive analysis framework –Fitting, presentation graphics, … –E.g. ROOT and Data processing application –Natural to the data of interest –E.g. athena for ATLAS DIAL distributes processing Among sites, farms, nodes To provide user with desired response time Look to other projects to provide most infrastructure

6 David Adams ATLAS June 23, 2003 DIAL GAE workshop6

7 David Adams ATLAS June 23, 2003 DIAL GAE workshop7 Design DIAL has the following major components Dataset describing the data of interest Application defined by experiment/site Task is user extension to the application Result (generated by applying application and task to an input dataset) Scheduler directs the processing

8 David Adams ATLAS June 23, 2003 DIAL GAE workshop8 User Analysis Job 1 Job 2 ApplicationTask Dataset 1 Scheduler 1. Create or locate 2. select3. Create or select 4. select 8. run(app,tsk,ds1) 5. submit(app,tsk,ds) 8. run(app,tsk,ds2) 6. split Dataset Dataset 2 7. create e.g. ROOT e.g. athena Result 9. fill 10. gather Result 9. fill ResultCode

9 David Adams ATLAS June 23, 2003 DIAL GAE workshop9 Application User specifies application with Name and version –E.g. dial_cbnt 0.3 Scheduler Maps this specification to an executable E.g. ChildScheduler uses mapping files at each site or node

10 David Adams ATLAS June 23, 2003 DIAL GAE workshop10 Task Task allows users to extend application Application Defines the task syntax Provides means to build and install task –E.g. compile and dynamic load Examples: Empty histograms plus code to fill them Sequence of algorithms

11 David Adams ATLAS June 23, 2003 DIAL GAE workshop11 Result Result is filled during processing Examples –Histogram –Event list –New dataset –Combination of the above Returned to the user Should be small Logical file identifiers rather than the data in those files

12 David Adams ATLAS June 23, 2003 DIAL GAE workshop12 Schedulers A DIAL scheduler provides means to Submit a job Terminate a job Monitor a job –Status –Events processed –Partial results Verify availability of an application Install and verify the presence of a task for a given application

13 David Adams ATLAS June 23, 2003 DIAL GAE workshop13 Schedulers (cont) Schedulers form a hierarchy Corresponding to that of compute nodes –Grid, site, farm, node Each scheduler splits job into sub-jobs and distributes these over lower-level schedulers Lowest level ChildScheduler starts processes to carry out the sub-jobs Scheduler concatenates results for its sub-jobs User may enter the hierarchy at any level Client-server communication

14 David Adams ATLAS June 23, 2003 DIAL GAE workshop14

15 David Adams ATLAS June 23, 2003 DIAL GAE workshop15 Dataset properties Datasets have the following properties 0. Identity 1. Content 2. Location (of the data) 3. Mapping (content to location) 4. Provenance (of the dataset) 5. History (of the production) 6. Labels (describing dataset)

16 David Adams ATLAS June 23, 2003 DIAL GAE workshop16 Dataset properties (cont) All the above are “metadata” Dataset also has “data” Details: 0. Identity –Index or name to distinguish datasets 1. Content –Tells what kind of data is carried by the dataset –E.g. for HEP event data >List of event identifiers >Content for each event (raw, tracks, jets, …)

17 David Adams ATLAS June 23, 2003 DIAL GAE workshop17 Dataset properties (cont) 2. Location –Most interesting example is a collection of logical files –Could also be >physical files >DB >data store such as POOL 3. Mapping –For each piece of content, where (in the location) is the associated data –For collection of logical files, which file and where in the file

18 David Adams ATLAS June 23, 2003 DIAL GAE workshop18 Dataset properties (cont) 4. Provenance –Parent datasets –Transformation >Applied to parent dataset to obtain this dataset –Sufficient to construct dataset (virtual data) 5. History –Production history beyond provenance –Division into jobs, which compute nodes, time stamps, …

19 David Adams ATLAS June 23, 2003 DIAL GAE workshop19 Dataset properties (cont) 6. Labels –Additional data characterizing the dataset –For example >Motivation for dataset E.g. starting point for Higgs searches >Results of dataset evaluation E.g. approved for basis of publications >Other global properties E.g. integrated luminosity for an event dataset

20 David Adams ATLAS June 23, 2003 DIAL GAE workshop20 Dataset operations Operations that datasets should support Content selection Splitting Merging Content selection User can create a new dataset by selecting content from an existing dataset For an event dataset, there are two categories –Event selection >Apply an algorithm which flags which events to accept –Event content selection >e.g. keep tracks, drop jets

21 David Adams ATLAS June 23, 2003 DIAL GAE workshop21 Dataset operations (cont) Splitting Distributed processing requires that the input dataset be split into sub-datasets for processing Natural to split along file boundaries For event datasets, the split is along event boundaries Merging Combine multiple datasets to form a new dataset For event datasets, there are again two dimensions –Same event content and data for different events >E.g. data from two different run periods –Same events and different content >E.g. raw data and the corresponding reconstructed data >Or reconstructed data and relevant conditions data

22 David Adams ATLAS June 23, 2003 DIAL GAE workshop22 Dataset operations (cont) Event dataset operations

23 David Adams ATLAS June 23, 2003 DIAL GAE workshop23 Dataset implementation The dataset properties span many realms Different properties will be stored in different ways Properties used for selection naturally reside in relational DB tables –Content, provenance, metadata Properties required for end applications on worker nodes might expect properties to reside in files –Content, location mapping Some property data will likely be replicated –both RDB and files Identification of the different properties is a first step in implementing datasets

24 David Adams ATLAS June 23, 2003 DIAL GAE workshop24 Dataset implementation (cont) Interface for schedulers Central problem is how to split dataset Account for –Data location –Compute cycle locations –Matching these Interactive analysis (fast response) is a special challenge How can different dataset providers share a scheduler? Options: –Common dataset interface which provides the required information >Logical file mapping or >Composite structure (sub-datasets) –Each provider also provides a service to do the splitting

25 David Adams ATLAS June 23, 2003 DIAL GAE workshop25 Exchange format DIAL components are exchanged Between –User and scheduler –Scheduler and scheduler –Scheduler and application executable Components have an XML representation Exchange mechanism can be –C++ objects –SOAP –Files Mechanism defined by scheduler

26 David Adams ATLAS June 23, 2003 DIAL GAE workshop26 Status All DIAL components in place http://www.usatlas.bnl.gov/~dladams/dial Release 0.30 made last week –Includes demo to demonstrate look and feel But scheduler is very simple –Only local ChildScheduler is implemented –Grid, site, farm and client-server schedulers not yet implemented More details in CHEP paper at http://www.usatlas.bnl.gov/~dladams/dial/talks/dial_chep2003.pdf

27 David Adams ATLAS June 23, 2003 DIAL GAE workshop27 Status (cont) Dataset implemented as a separate system http://www.usatlas.bnl.gov/~dladams/dataset Implementations: –ATLAS combined ntuple hbook file >CbntDataset >in release 0.3 –ATLAS AthenaRoot file >Holds Monte Carlo generator information >In 0.2; not in 0.3 (easy to add) –Athena-Pool files >when they are available

28 David Adams ATLAS June 23, 2003 DIAL GAE workshop28 Status (cont) DIAL and dataset classes imported to ROOT ROOT can be used as interactive user interface –All DIAL and dataset classes and methods available at command prompt –DIAL and dataset libraries must be loaded Import done with ACLiC Need to add result for TH1 and any other classes of interest

29 David Adams ATLAS June 23, 2003 DIAL GAE workshop29 Status (cont) Applications Test program dialproc counts events Wrapper around PAW dial_cbnt –Processes CbntDataset –Task provides >HBOOK file with empty histograms >Fortran subroutine to fill histograms

30 David Adams ATLAS June 23, 2003 DIAL GAE workshop30 Status (cont) Interface for logical files was added recently Keep results small –Carry logical file instead of data in the file Includes abstract interface for a file catalog –Implemented: >Local directory >AFS –Planned: >Magda (under development) >RLS (add flavors as required)

31 David Adams ATLAS June 23, 2003 DIAL GAE workshop31 Development plans Farm Scheduler Distribute processing over a single farm –Condor, LSF, ssh, STAR, …? Provides useful tool for distributed processing Remote access to scheduler Job submission from anywhere Web service Add policy to scheduler interface Response time How to split dataset for distributed processing

32 David Adams ATLAS June 23, 2003 DIAL GAE workshop32 Development plans (cont) Grid schedulers Distribute data and processing over multiple sites Interact with dataset, file and replica catalogs Authentication, authorization, resource location and allocation, … ATLAS POOL dataset After ATLAS incorporates POOL ATLAS athena as application

33 David Adams ATLAS June 23, 2003 DIAL GAE workshop33 Interactivity issues Important aspect is latency Interactive system provides means for user to specify maximum acceptable response time All actions must take place within this time –Locate data and resources –Splitting and matchmaking –Job submission –Gathering of results Longer latency for first pass over a dataset –Record state for later passes –Still must be able to adjust to changing conditions

34 David Adams ATLAS June 23, 2003 DIAL GAE workshop34 Interactivity issues (cont) Interactive and batch must share resources Share implies more available resources for both Interactive use varies significantly –Time of day –Time to the next conference –Discovery of interesting events Interactive request must be able to preempt long- running batch jobs –But allocation determined by sites, experiments, …


Download ppt "David Adams ATLAS DIAL Distributed Interactive Analysis of Large datasets David Adams BNL June 23, 2003 GAE workshop Caltech."

Similar presentations


Ads by Google