David Adams ATLAS DIAL Distributed Interactive Analysis of Large datasets David Adams BNL November 17, 2003 SC2003 Phoenix.


1 David Adams ATLAS DIAL
Distributed Interactive Analysis of Large datasets
David Adams, BNL
November 17, 2003
SC2003, Phoenix

2 David Adams ATLAS DIAL SC2003 November 15-18, 2003
Contents
Goals of DIAL
What is DIAL?
Design
JDL
Datasets
Grid2003
More information

3 Goals of DIAL
1. Demonstrate the feasibility of interactive analysis of large datasets
How much data can we analyze interactively?
2. Set requirements for GRID services
In particular those specific to interactive analysis
  – Job definition: application, task, dataset
  – Gathering and relaying results
  – Real time monitoring (partial results)
  – Resource management: discovery, allocation, sharing
3. Provide ATLAS with a useful analysis tool
For current and upcoming data challenges
Like to add another experiment to show generality

4 What is DIAL?
DIAL provides a connection between
Interactive analysis framework
  – Fitting, presentation graphics, …
  – E.g. ROOT, JAS, …
and
Data processing application
  – Natural to the data of interest
  – E.g. athena for ATLAS
DIAL distributes processing
Among sites, farms, nodes
To provide user with desired response time
Look to other projects to provide most infrastructure

6 Design
DIAL has the following major components
Dataset describing the data of interest
Application defined by experiment/site
Task is user extension to the application
Job uses application and task to process a dataset
Result is the output of a job
Scheduler creates and manages jobs
Together these define a high-level JDL (job definition language)
Figure shows how these components interact →

7 [Figure: interaction of the DIAL components. Boxes: User Analysis (e.g. ROOT), Application (e.g. athena), Task, Code, Dataset, Dataset 1, Dataset 2, Scheduler, Job 1, Job 2, Result (three instances). Numbered arrows: 1. create or locate; 2. select; 3. create or select; 4. select; 5. submit(app,tsk,ds); 6. split; 7. create; 8. run(app,tsk,ds1) and run(app,tsk,ds2); 9. fill; 10. gather.]
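The submit, split, run, fill and gather steps named in the figure can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not DIAL's C++ interface: the class and function names (`Dataset`, `Result`, `Scheduler`, `toy_application`) are hypothetical, a dataset is reduced to a list of numbers, and the sub-jobs run sequentially rather than on separate nodes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Dataset:
    """Toy dataset: the data of interest is just a list of event values."""
    events: List[float]

    def split(self, n_parts: int) -> List["Dataset"]:
        """Split into at most n_parts sub-datasets (step 6 in the figure)."""
        n_parts = max(1, min(n_parts, len(self.events)))
        return [Dataset(self.events[i::n_parts]) for i in range(n_parts)]


@dataclass
class Result:
    """Toy result: a running sum and an event count."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "Result") -> None:
        self.total += other.total
        self.count += other.count


# The task is the user's extension; the application runs it over a dataset.
Task = Callable[[float], float]
Application = Callable[[Task, Dataset], Result]


def toy_application(task: Task, dataset: Dataset) -> Result:
    """Hypothetical application: apply the task to every event and fill a result."""
    result = Result()
    for event in dataset.events:
        result.total += task(event)   # step 9: fill
        result.count += 1
    return result


class Scheduler:
    """Creates one job per sub-dataset and gathers the partial results."""

    def __init__(self, n_workers: int) -> None:
        self.n_workers = n_workers

    def submit(self, app: Application, task: Task, dataset: Dataset) -> Result:
        sub_datasets = dataset.split(self.n_workers)        # steps 6-7
        partial = [app(task, ds) for ds in sub_datasets]    # step 8 (sequential here)
        merged = Result()
        for result in partial:                              # step 10: gather
            merged.merge(result)
        return merged
```

Calling `Scheduler(3).submit(toy_application, lambda e: e * e, Dataset([1.0, 2.0, 3.0, 4.0]))` returns a single merged `Result`, which is the only object the user-side analysis needs to see.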

8 Analysis job rates
At what rate is a site processing sub-jobs?
Assume 1000 CPUs at a “site”
Continuum of requests with the following extremes
Large scale data production with 3-30 hours/job
30-300 jobs/hour (about 1 job/minute)
Fine for batch and grid schedulers
Interactive analysis with 1-10 seconds/job
100-1000 jobs/sec (about 10000 jobs/minute)
Challenging for grid and batch schedulers
Handle with hierarchy of schedulers
  – Each scheduler handles a fraction of the rate
  – But each level adds latency
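The rates on this slide follow from dividing the number of CPUs by the time per job. The short Python sketch below reproduces that arithmetic and adds a toy model of the scheduler hierarchy. The function names, the fan-out of 10 and the latency per level are illustrative assumptions, not figures from the talk.

```python
def jobs_per_second(n_cpus: int, seconds_per_job: float) -> float:
    """Steady-state job rate when every CPU is kept busy."""
    return n_cpus / seconds_per_job


def per_scheduler_rate(total_rate: float, fanout: int, levels: int) -> float:
    """Rate seen by one scheduler at the bottom of a tree.

    Assumes each scheduler passes its load evenly to `fanout` children,
    repeated over `levels` levels (a hypothetical, perfectly balanced tree).
    """
    return total_rate / fanout ** levels


def added_latency(levels: int, latency_per_level: float) -> float:
    """Extra latency if every level adds the same delay (an assumption)."""
    return levels * latency_per_level


HOUR = 3600.0
N_CPUS = 1000

# Production extreme: 3 to 30 hours per job
production_fast = jobs_per_second(N_CPUS, 3 * HOUR) * HOUR    # jobs per hour
production_slow = jobs_per_second(N_CPUS, 30 * HOUR) * HOUR   # jobs per hour

# Interactive extreme: 1 to 10 seconds per job
interactive_fast = jobs_per_second(N_CPUS, 1.0)    # jobs per second
interactive_slow = jobs_per_second(N_CPUS, 10.0)   # jobs per second
```

With two levels of fan-out 10, `per_scheduler_rate(interactive_fast, 10, 2)` brings each bottom-level scheduler down to 10 jobs/sec, at the price of `added_latency(2, ...)` on every request.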

9 DIAL scheduler hierarchy

10 JDL
High level job definition language
Enable users to specify task without reference to executables, data files or sites
Scheduler decides where and how to process data
Analysis implies user is easily able to customize task
Common language
Enable different experiments and non-HEP activities to share schedulers
PPDG activity to define such a language
  – Led by Gabriele Carcassi (STAR)
  – Similar to DIAL (application, task, dataset, …)
  – XML based
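To make the idea of an XML-based, high-level job description concrete, here is a minimal Python sketch that builds one with the standard library. The element and attribute names (`job`, `application`, `task`, `dataset`, `name`, `id`) are invented for illustration; the slides do not give the actual PPDG or DIAL schema. Note that the description names no executable, file or site.

```python
import xml.etree.ElementTree as ET


def build_job_xml(application: str, task: str, dataset_id: str) -> str:
    """Return a hypothetical high-level job description as an XML string."""
    job = ET.Element("job")
    ET.SubElement(job, "application", name=application)
    ET.SubElement(job, "task", name=task)
    ET.SubElement(job, "dataset", id=dataset_id)
    return ET.tostring(job, encoding="unicode")


def read_job_xml(text: str) -> dict:
    """Parse the description back into the three pieces a scheduler needs."""
    job = ET.fromstring(text)
    return {
        "application": job.find("application").get("name"),
        "task": job.find("task").get("name"),
        "dataset": job.find("dataset").get("id"),
    }
```

A scheduler receiving such a document would still have to decide where and how to run it, which is the division of labor the slide describes.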

11 JDL (DIAL perspective)

12 Datasets
Want to provide a high-level data view
Unit of processing is called “dataset”
Many properties beyond data location
Location is not just a list of files (physical or logical)
  – Multiple logical file set representations
  – Representation might be tables in an RDB
  – Or object list in an ODB
  – Or …
Properties and categories follow

13 Dataset properties
0. Identity
Dataset must have a unique index and/or name
1. Content
Description of the type of data in the dataset
  – Event or non-event data
  – Simulation, reconstruction, …
  – ESD, AOD, …
  – Jets, tracks, electrons, …
2. Location
Where to find the data
  – Logical files, physical files, site, …
3. Mapping
Which content is at which location?

14 Dataset properties (cont)
4. Provenance
Prescription for creating the data
E.g. input dataset and transformation
5. History
Details of production beyond provenance
  – How production was split into jobs
  – Processing node and time for each job, …
6. Labels
Assigned metadata outside other categories, e.g.
  – Integrated luminosity
  – Result of quality checks
  – Flag indicating ok for use in published analyses

15 Dataset properties (cont)
7. Mutability
May dataset be modified?
Possible states: locked, unlocked, extensible, …
8. Compositeness
Dataset made up of other datasets. Two cases:
  – Construction: provenance is the list of sub-datasets
    > E.g. the summer dataset is defined to be the union of the June, July and August datasets.
  – Assignment: factorization into sub-datasets
    > Typically to reflect data placement
    > E.g. a representation of a global dataset might include sub-datasets in New York, Paris and Moscow.
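The nine properties (0 to 8) map naturally onto a record type. The Python sketch below is one possible layout, assuming simple string and list fields; the names `DatasetRecord`, `Mutability` and `union` are hypothetical and do not come from the DIAL code. The `union` helper illustrates the "construction" case of compositeness, where the provenance of the composite is the list of its sub-datasets.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Mutability(Enum):
    LOCKED = "locked"
    UNLOCKED = "unlocked"
    EXTENSIBLE = "extensible"


@dataclass
class DatasetRecord:
    dataset_id: str                                          # 0. identity
    content: List[str] = field(default_factory=list)         # 1. content
    location: List[str] = field(default_factory=list)        # 2. location
    mapping: Dict[str, str] = field(default_factory=dict)    # 3. content -> location
    provenance: Optional[str] = None                         # 4. provenance
    history: List[str] = field(default_factory=list)         # 5. history
    labels: Dict[str, object] = field(default_factory=dict)  # 6. labels
    mutability: Mutability = Mutability.LOCKED               # 7. mutability
    sub_datasets: List["DatasetRecord"] = field(default_factory=list)  # 8. compositeness


def union(dataset_id: str, parts: List[DatasetRecord]) -> DatasetRecord:
    """Construction case: the composite's provenance is the list of its parts."""
    return DatasetRecord(
        dataset_id=dataset_id,
        content=sorted({c for part in parts for c in part.content}),
        location=[loc for part in parts for loc in part.location],
        provenance="union(" + ", ".join(part.dataset_id for part in parts) + ")",
        sub_datasets=list(parts),
    )
```

For the slide's example, `union("summer", [june, july, august])` yields a record whose provenance reads `union(june, july, august)` and whose three sub-datasets remain individually addressable.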

16 Dataset categories
Categorize datasets according to the extent of their location information
Virtual dataset (VDS): no location
Nonvirtual dataset (NVDS)
Logical dataset (LDS)
  – Collection of logical files
Physical dataset (PDS)
  – Collection of physical files
Staged dataset
  – NVDS with mapping of sub-datasets to CPU or process

17 Dataset category associations (example)
Virtual: VDS 1
Logical: LDS 1-1 {LF1 LF2}, LDS 1-2 {LF3}
Physical: PDS 1-1-1 {PF1A PF2A}, PDS 1-1-2 {PF1B PF2B}, PDS 1-1-3 {PF1A PF2B}, PDS 1-2-1 {PF3A}, PDS 1-2-2 {PF3B}
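The example can be written down as plain Python data to show how one virtual dataset fans out into logical and then physical representations. The sketch assumes, from the naming alone, that PF1A and PF1B are replicas of LF1, and likewise for the other files; the slide does not state this explicitly. The helper `is_representation` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed replica relation, inferred from the file names on the slide.
REPLICAS: Dict[str, Set[str]] = {
    "LF1": {"PF1A", "PF1B"},
    "LF2": {"PF2A", "PF2B"},
    "LF3": {"PF3A", "PF3B"},
}

# VDS 1 has two logical representations.
LOGICAL: Dict[str, List[str]] = {
    "LDS 1-1": ["LF1", "LF2"],
    "LDS 1-2": ["LF3"],
}

# Each physical dataset points back to the logical dataset it realizes.
PHYSICAL: Dict[str, Tuple[str, List[str]]] = {
    "PDS 1-1-1": ("LDS 1-1", ["PF1A", "PF2A"]),
    "PDS 1-1-2": ("LDS 1-1", ["PF1B", "PF2B"]),
    "PDS 1-1-3": ("LDS 1-1", ["PF1A", "PF2B"]),
    "PDS 1-2-1": ("LDS 1-2", ["PF3A"]),
    "PDS 1-2-2": ("LDS 1-2", ["PF3B"]),
}


def is_representation(logical_files: List[str], physical_files: List[str]) -> bool:
    """True if each physical file is a replica of the logical file in the same position."""
    return len(logical_files) == len(physical_files) and all(
        pf in REPLICAS[lf] for lf, pf in zip(logical_files, physical_files)
    )


def physical_choices(lds_name: str) -> List[str]:
    """Names of the physical datasets registered for one logical dataset."""
    return sorted(name for name, (lds, _) in PHYSICAL.items() if lds == lds_name)
```

A scheduler choosing where to run would pick among `physical_choices("LDS 1-1")`, for instance preferring the representation whose files sit closest to the free CPUs.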

18 Present dataset implementation includes
Virtual dataset (VDS)
  – Portable representation of dataset without location
Logical dataset (LDS) and physical dataset (PDS)
  – Add location expressed in terms of logical files
Dataset database (DDB)
  – Repository of (immutable) datasets indexed by ID
Dataset selection catalog (DSC)
  – Enables users to select a VDS
Dataset replica catalog (DRC)
  – Enables “system” to locate an NVDS representation of a VDS
Dataset file catalog (DFC)
  – Maps single-file datasets to LFN
Dataset implementation: C++ classes w/ XML rep, MySQL tables, files indexed by name

19 Dataset implementation

20 Grid2003 datasets
Define dataset
Provenance, # events and list of LFNs
Assign dataset name
Create entry in DSC
Produce data with GCE/Pegasus/Chimera
Transfer files to BNL disk directory
Poll destination directory for new files
Register files in Magda
Create single-file dataset (LDS) for each registered file
Store each dataset in DDB (dataset database)
Register LFN-to-dataset association in DFC

21 Grid2003 datasets (cont)
At regular intervals
Merge the current set of single-file datasets to create the latest merged LDS
Use this LDS to create a VDS
Register the VDS-LDS association in the DRC
Update DSC entry with the new VDS
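The bookkeeping on the two Grid2003 dataset slides can be mimicked with four in-memory dictionaries standing in for the DDB, DSC, DRC and DFC. This is a hedged sketch: the real catalogs are described elsewhere in the talk as MySQL tables, and the `Catalogs` class, its method names and the dictionary layout of a dataset are all assumptions made for illustration.

```python
import itertools
from typing import Dict


class Catalogs:
    """In-memory stand-ins for the four dataset catalogs."""

    def __init__(self) -> None:
        self.ddb: Dict[int, dict] = {}   # dataset ID -> dataset (never modified once stored)
        self.dsc: Dict[str, int] = {}    # dataset name -> current VDS ID
        self.drc: Dict[int, int] = {}    # VDS ID -> ID of an LDS representation
        self.dfc: Dict[str, int] = {}    # LFN -> ID of its single-file dataset
        self._ids = itertools.count(1)

    def store(self, dataset: dict) -> int:
        """Store a dataset in the DDB and return its new ID."""
        dataset_id = next(self._ids)
        self.ddb[dataset_id] = dataset
        return dataset_id

    def register_file(self, lfn: str) -> int:
        """Create a single-file LDS for a newly registered file and record it in the DFC."""
        dataset_id = self.store({"kind": "LDS", "files": (lfn,)})
        self.dfc[lfn] = dataset_id
        return dataset_id

    def publish(self, name: str) -> int:
        """Periodic step: merge single-file datasets, create a VDS, update DRC and DSC."""
        files = tuple(
            f for lfn in sorted(self.dfc) for f in self.ddb[self.dfc[lfn]]["files"]
        )
        lds_id = self.store({"kind": "LDS", "files": files})
        vds_id = self.store({"kind": "VDS", "name": name})
        self.drc[vds_id] = lds_id
        self.dsc[name] = vds_id
        return vds_id
```

Because every `publish` call stores new datasets rather than editing old ones, an ID obtained earlier keeps pointing at the same snapshot while the name moves on to the latest merge.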

22 Grid2003 analysis
User selects dataset from DSC
Dataset by name will change as data comes in
Dataset by ID is snapshot at time of selection
User submits job
If by name, use DSC to get current dataset ID
Use ID to extract VDS from DDB
Submit dataset and task to DIAL scheduler
Scheduler
Uses DRC to find LDS corresponding to VDS
  – In principle, the “best” choice for the given task
Splits LDS into single-file sub-datasets
Processes each, gathers and merges results
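The lookup chain on this slide (name to ID via the DSC, ID to VDS via the DDB, VDS to LDS via the DRC, then split, process and merge) is sketched below in Python. The catalog contents, the IDs and the function names are made up for illustration, and the "task" is reduced to a function that returns a number per file.

```python
from typing import Callable, Dict, List, Union

# Hypothetical catalog contents.
DSC: Dict[str, int] = {"grid2003": 7}                 # name -> current VDS ID
DDB: Dict[int, dict] = {
    6: {"kind": "LDS", "files": ["lfn:a", "lfn:b", "lfn:c"]},
    7: {"kind": "VDS"},
}
DRC: Dict[int, int] = {7: 6}                          # VDS ID -> LDS ID


def resolve_id(selection: Union[str, int]) -> int:
    """A name is looked up in the DSC at submit time; an ID is already a snapshot."""
    return DSC[selection] if isinstance(selection, str) else selection


def analyze(selection: Union[str, int], task: Callable[[str], int]) -> int:
    vds_id = resolve_id(selection)
    vds = DDB[vds_id]                       # extract the VDS from the DDB
    assert vds["kind"] == "VDS"
    lds = DDB[DRC[vds_id]]                  # scheduler: find an LDS for the VDS
    sub_datasets: List[List[str]] = [[lfn] for lfn in lds["files"]]   # single-file split
    partial = [sum(task(lfn) for lfn in sub) for sub in sub_datasets] # process each
    return sum(partial)                     # gather and merge
```

Selecting by name, `analyze("grid2003", count_events)`, follows whatever VDS the DSC currently lists, whereas `analyze(7, count_events)` keeps using the snapshot with ID 7 even after the name has been repointed (`count_events` being any user-supplied per-file function).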

23 Discovery!
The first Grid2003 dataset was analyzed and a mass was calculated from the four leading electrons
48k events
Simulated mass was 130 GeV
Electron E_T > 5 GeV
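The slide does not show how the mass was computed. A common way to obtain a mass from four leading electrons is the relativistic invariant mass, m² = (ΣE)² − |Σp|², and the Python sketch below does that under stated assumptions: electrons are treated as massless (so E_T equals p_T), each one is given as a hypothetical (pt, eta, phi) tuple in GeV, and "leading" is taken to mean highest p_T. It is not the task code used in the Grid2003 analysis.

```python
import math
from typing import List, Optional, Tuple

Electron = Tuple[float, float, float]   # (pt [GeV], eta, phi), an assumed layout


def four_vector(pt: float, eta: float, phi: float) -> Tuple[float, float, float, float]:
    """(E, px, py, pz) of a massless particle."""
    return (
        pt * math.cosh(eta),
        pt * math.cos(phi),
        pt * math.sin(phi),
        pt * math.sinh(eta),
    )


def four_electron_mass(electrons: List[Electron], et_min: float = 5.0) -> Optional[float]:
    """Invariant mass of the four highest-pt electrons above et_min, or None if fewer than four pass."""
    selected = sorted(
        (e for e in electrons if e[0] > et_min), key=lambda e: e[0], reverse=True
    )[:4]
    if len(selected) < 4:
        return None
    energy = px = py = pz = 0.0
    for pt, eta, phi in selected:
        e_i, px_i, py_i, pz_i = four_vector(pt, eta, phi)
        energy += e_i
        px += px_i
        py += py_i
        pz += pz_i
    mass_squared = energy * energy - (px * px + py * py + pz * pz)
    return math.sqrt(max(mass_squared, 0.0))   # guard against tiny negative rounding
```

As a sanity check, four electrons of 32.5 GeV each at eta = 0, spaced a quarter turn apart in phi, have zero net momentum, so the function returns their summed energy of 130 GeV up to rounding.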

24 More information
DIAL: http://www.usatlas.bnl.gov/~dladams/dial
Datasets: http://www.usatlas.bnl.gov/~dladams/dataset
Grid2003 datasets: http://www.usatlas.bnl.gov/~dladams/dataset/grid3

