1 Meta-Computing at DØ
  Igor Terekhov, for the DØ Experiment
  Fermilab, Computing Division, PPDG
  ACAT 2002, Moscow, Russia, June 28, 2002
2 Overview
  Overview of the D0 Experiment
  Introduction to computing and the paradigm of distributed computing
  SAM – the advanced data handling system
  Global Job and Information Management (JIM) – the current Grid project
  Collaborative Grid work
3 The DØ Experiment
  P-pbar collider experiment, 2 TeV
  Detector (real) data
    1,000,000 channels (793k from the Silicon Microstrip Tracker), 5-15% read at a time
    Event size 250 KB (25% increase in RunIIb)
    Recorded event rate 25 Hz in RunIIa, 50 Hz (projected) in RunIIb
    On-line data rate 0.5 TB/day, total 1 TB/day
    Estimated 3-year totals (incl. processing and analysis): over 10^9 events, 1-2 PB
  Monte Carlo data
    6 remote processing centers
    Estimate ~300 TB in the next 2 years
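The on-line data rate quoted on this slide can be cross-checked from the event size and the recorded event rate. The sketch below is a back-of-the-envelope calculation only; the decimal units and the assumption of continuous data taking are not from the slide.

```python
# Back-of-the-envelope check of the recorded data volume per day.
# Assumptions (not from the slide): decimal units (1 KB = 1e3 bytes,
# 1 TB = 1e12 bytes) and continuous data taking for a full day.
SECONDS_PER_DAY = 24 * 3600


def daily_volume_tb(event_size_kb: float, rate_hz: float) -> float:
    """Return the data volume in TB/day for a given event size and rate."""
    bytes_per_second = event_size_kb * 1e3 * rate_hz
    return bytes_per_second * SECONDS_PER_DAY / 1e12


# RunIIa: 250 KB events recorded at 25 Hz.
run2a_tb = daily_volume_tb(250, 25)
# RunIIb (projected): event size up by 25%, recorded at 50 Hz.
run2b_tb = daily_volume_tb(250 * 1.25, 50)

print(f"RunIIa: {run2a_tb:.2f} TB/day")
print(f"RunIIb: {run2b_tb:.2f} TB/day")
```

Under these assumptions RunIIa comes out at about 0.54 TB/day, consistent with the quoted 0.5 TB/day on-line rate.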
4 The Collaboration
  600+ people
  78 institutions
  18 countries
  A large Virtual Organization whose members share resources to solve common problems
6 Analysis Assumptions
  Job type   Num of Jobs   % of DataSet   Duration   CPU/evt (500 MHz)
  Long       6             30%            12 weeks   5 sec
  Medium     50            10%            4 weeks    1 sec
  Short      150           1%             1 week     0.1 sec
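To get a feel for what these assumptions imply, one can estimate the number of 500 MHz CPUs each job class would keep busy. This is a rough sketch, not a calculation from the talk. It reads the table as Long = 6 jobs on 30% of the data set, Medium = 50 jobs on 10%, Short = 150 jobs on 1%, and it assumes a data set of 10^9 events (the 3-year total quoted for the experiment) with each class spread evenly over its duration.

```python
# Rough CPU estimate per analysis class. All of the following are
# assumptions for illustration: the data set is taken to be 1e9 events,
# every job in a class reads its full fraction of the data set, and the
# work is spread evenly over the stated duration.
SECONDS_PER_WEEK = 7 * 24 * 3600
TOTAL_EVENTS = 1e9

# class name -> (number of jobs, fraction of data set, weeks, CPU sec/event)
ANALYSIS_CLASSES = {
    "Long": (6, 0.30, 12, 5.0),
    "Medium": (50, 0.10, 4, 1.0),
    "Short": (150, 0.01, 1, 0.1),
}


def cpus_needed(jobs, fraction, weeks, sec_per_event, total_events=TOTAL_EVENTS):
    """Average number of CPUs kept busy by one analysis class."""
    cpu_seconds = jobs * fraction * total_events * sec_per_event
    return cpu_seconds / (weeks * SECONDS_PER_WEEK)


estimates = {name: cpus_needed(*row) for name, row in ANALYSIS_CLASSES.items()}
for name, cpus in estimates.items():
    print(f"{name}: about {cpus:.0f} CPUs")
```

With this reading, the Medium class dominates the CPU demand even though each Medium job is five times cheaper per event than a Long job.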
7 Data Storage
  The Enstore Mass Storage System, http://www-isd.fnal.gov/enstore/index.html
  All data, including derived datasets, is stored on tape in an Automated Tape Library (ATL), a robot
  Enstore is attached to the network and accessible via a cp-like command
  Other, remote MSSs may be used (the distributed ownership paradigm – grid computing)
8 Data Handling - SAM
  Responds to the above challenges in:
    Amounts of data
    Rate of access (processing)
    The degree to which the user base is distributed
  Major goals and requirements:
    Reliably store produced data (real and MC)
    Distribute the data globally to remote analysis centers
    Catalogue the data: contents, status, locations, processing history, user datasets, etc.
    Manage resources
9 SAM Highlights
  SAM is Sequential data Access via Meta-data, http://d0db.fnal.gov/sam
  Joint project between D0 and the Computing Division, started in 1997 to meet the Run II data handling needs
  Employs a centrally managed RDBMS (Oracle) for the meta-data catalog
  Processing takes place at stations
  Actual data is managed by a fully distributed set of collaborating servers (see architecture later)
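The idea of accessing data via meta-data can be illustrated with a toy catalog: a dataset is defined by a query over file meta-data, and file locations are looked up separately. The sketch below is purely illustrative. It uses an in-memory SQLite database in place of Oracle, and the table layout, column names, file names, and helper functions are hypothetical, not the SAM schema or API.

```python
import sqlite3

# Toy meta-data catalog. SQLite stands in for the central RDBMS; the
# schema and all names below are hypothetical, not SAM's.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (
        file_id   INTEGER PRIMARY KEY,
        name      TEXT UNIQUE,
        data_tier TEXT,
        run       INTEGER
    );
    CREATE TABLE locations (
        file_id  INTEGER REFERENCES files(file_id),
        location TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        (1, "raw_run1001_001", "raw", 1001),
        (2, "reco_run1001_001", "reconstructed", 1001),
        (3, "reco_run1002_001", "reconstructed", 1002),
        (4, "reco_run1005_001", "reconstructed", 1005),
    ],
)
conn.executemany(
    "INSERT INTO locations VALUES (?, ?)",
    [(2, "enstore:/tape/A"), (2, "station-remote:/cache"), (3, "enstore:/tape/B")],
)


def resolve_dataset(conn, data_tier, run_min, run_max):
    """Return the file names selected by a meta-data query."""
    rows = conn.execute(
        "SELECT name FROM files "
        "WHERE data_tier = ? AND run BETWEEN ? AND ? ORDER BY name",
        (data_tier, run_min, run_max),
    ).fetchall()
    return [row[0] for row in rows]


def locate(conn, name):
    """Return all known locations (replicas) of one file."""
    rows = conn.execute(
        "SELECT l.location FROM locations l "
        "JOIN files f ON f.file_id = l.file_id "
        "WHERE f.name = ? ORDER BY l.location",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


dataset = resolve_dataset(conn, "reconstructed", 1001, 1002)
replicas = locate(conn, "reco_run1001_001")
print(dataset)
print(replicas)
```

The user names a dataset by what it contains, and the catalog, not the user, knows where the files physically live.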
10 SAM Advanced Features
  Uniform interfaces for data access modes
  Online system, reconstruction farm, Monte-Carlo farm, and analysis server are all subclasses of the station
  Uniform capabilities for processing at FNAL and remote centers
  On-demand data caching and forwarding (intra-cluster and global)
  Resource management:
    Co-allocation of compute and data resources (interfaces with batch system abstraction)
    Fair share allocation and scheduling
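Fair share allocation can be sketched in a few lines: each group has a configured share, and the next resource goes to the group whose accumulated usage is lowest relative to that share. This is a generic illustration of the technique under simple assumptions, not SAM's actual scheduling algorithm; the group names and the function are hypothetical.

```python
# Generic fair-share selection (illustrative only, not SAM's algorithm).
# shares:  group -> configured share (relative weight, must be > 0)
# usage:   group -> resources consumed so far
# pending: group -> number of requests waiting
def next_group(shares, usage, pending):
    """Pick the waiting group with the lowest usage-to-share ratio."""
    candidates = [g for g, n in pending.items() if n > 0]
    if not candidates:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    return min(candidates, key=lambda g: (usage.get(g, 0.0) / shares[g], g))


# Hypothetical physics groups with a 2:1 share ratio.
shares = {"top": 2.0, "higgs": 1.0}
usage = {"top": 0.0, "higgs": 0.0}
pending = {"top": 10, "higgs": 10}

granted = []
for _ in range(6):
    group = next_group(shares, usage, pending)
    granted.append(group)
    usage[group] += 1.0   # each grant costs one unit of resource
    pending[group] -= 1

print(granted.count("top"), granted.count("higgs"))
```

Over six grants the allocation converges to the configured 2:1 ratio, which is the property a fair-share scheduler is meant to provide.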
11 Components of a SAM Station
  [Diagram] Components: Station & Cache Manager; File Storage Server; File Stager(s); Project Masters; Producers/Consumers; eworkers; File Storage Clients; Cache Disk; Temp Disk; MSS or Other Station (shown twice)
  Legend: data flow, control
12 SAM as a Distributed System
  [Diagram] Components: Database Server(s) (Central Database); Name Server; Global Resource Manager(s); Log Server; Station 1 Servers, Station 2 Servers, Station 3 Servers, ..., Station n Servers; Mass Storage System(s)
  Scope labels: Shared Globally, Local To Site, Shared Locally
  Arrows indicate control and data flow
13 SAM as a Distributed System
  [Diagram] Labels: Data Site; WAN; FNAL
  Shared locally, optional: optimizer, Logger
  Shared globally (standard): Database Server, optimizer, Logger
14 Data Flow
  [Diagram] Labels: Data Site; WAN
  Routing + Caching = Replication
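The equation on this slide can be made concrete with a small simulation: when a file is routed hop by hop toward a consumer and every station on the route caches what passes through it, replicas appear as a side effect of delivery. The sketch below is an illustration under simple assumptions (an LRU cache per station, a fixed route); the `Station` class and `deliver` function are hypothetical and do not reflect SAM's implementation.

```python
from collections import OrderedDict


class Station:
    """Hypothetical station with a fixed-size LRU disk cache."""

    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity
        self.cache = OrderedDict()  # file name -> None, oldest first

    def has(self, file_name: str) -> bool:
        if file_name in self.cache:
            self.cache.move_to_end(file_name)  # mark as recently used
            return True
        return False

    def store(self, file_name: str) -> None:
        self.cache[file_name] = None
        self.cache.move_to_end(file_name)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used


def deliver(file_name, route):
    """Deliver a file to the last station on the route.

    The route runs from the station next to mass storage to the consumer.
    Returns the number of transfers performed; the first hop from mass
    storage counts as one transfer.
    """
    start = -1  # -1 means the file must come from mass storage
    for i, station in enumerate(route):
        if station.has(file_name):
            start = i  # remember the replica closest to the consumer
    transfers = 0
    for station in route[start + 1:]:
        station.store(file_name)  # caching along the route creates replicas
        transfers += 1
    return transfers


fnal = Station("fnal", capacity=2)
remote = Station("remote", capacity=2)
neighbor = Station("neighbor", capacity=2)

first = deliver("file_a", [fnal, remote])     # mass storage -> fnal -> remote
second = deliver("file_a", [fnal, remote])    # already cached at remote
third = deliver("file_a", [fnal, neighbor])   # served from the fnal replica
print(first, second, third)
```

The first delivery costs two transfers, the repeat costs none, and a different consumer is served from the nearest replica rather than from tape.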
15 SAM as a Data Grid
  Provides high-level collective services of reliable data storage and replication
  Embraces multiple MSSs (Enstore, HPSS, etc.), local resource management systems (LSF, FBS, PBS, Condor), and several different file transfer protocols (bbftp, kerberos rcp, GridFTP, …)
  Optionally uses Grid technologies and tools:
    Condor as a batch system (in use)
    Globus FTP for data transfers (ready for deployment)
  From de facto to de jure…
16 [Diagram: SAM components mapped onto Grid architecture layers; labels listed in extraction order]
  Fabric: Tape Storage Elements; Compute Elements; Disk Storage Elements; LANs and WANs; Resource and Services Catalog; Replica Catalog; Meta-data Catalog
  Other layer labels: Request Formulator and Planner; Client Applications
  Authentication and Security: GSI; SAM-specific user, group, node, station registration; Bbftp ‘cookie’
  Connectivity and Resource: CORBA; UDP; File transfer protocols - ftp, bbftp, rcp, GridFTP; Mass Storage systems protocols, e.g. encp, hpss
  Collective Services: Catalog protocols; Significant Event Logger; Naming Service; Database Manager; Catalog Manager; SAM Resource Management; Batch Systems - LSF, FBS, PBS, Condor; Data Mover; Job Services; Storage Manager; Job Manager; Cache Manager; Request Manager
  SAM components: “Dataset Editor”; “File Storage Server”; “Project Master”; “Station Master”; “Stager”; “Optimiser”
  Client interfaces: Web; Python codes; Java codes; Command line; D0 Framework C++ codes
  Also shown: Code Repository
  Legend: a name in “quotes” is a SAM-given software component name; markers indicate components that will be replaced, added, or enhanced using PPDG and Grid tools
17 Dzero SAM Deployment Map
  [Map] Legend: Processing Center, Analysis site
18 SAM usage statistics for DZero
  497 registered SAM users in production
    360 of them have at some time run at least one SAM project
    132 of them have run more than 100 SAM projects
    323 of them have run a SAM project at some time in the past year
    195 of them have run a SAM project in the past 2 months
  48 registered stations, 340 registered nodes
  115 TB of data on tape
  63,235 cached files currently (over 1 million entries total)
  702,089 physical and virtual data files known to SAM
    535,048 physical files (90K raw, 300K MC related)
  71,246 “analysis” projects ever run
  See http://d0db.fnal.gov/sam_data_browsing/ for more info
19 SAM + JIM Grid
  So we can reliably replicate a TB of data, what’s next?
  It is handling of jobs, not data, that constitutes the top of the services pyramid
  Need services for job submission, brokering and reliable execution
  Need resource discovery and opportunistic computing (shared vs dedicated resources)
  Need monitoring of the global system and jobs
  Job and Information Management (JIM) emerged
20 JIM and SAM-Grid
  (NB: Please hear Gabriele Garzoglio’s talk)
  Project started in 2001 as part of the PPDG collaboration to handle D0’s expanded needs
  Recently included CDF
  These are real Grid problems, and we are incorporating (adopting) or developing Grid solutions
  http://www-d0.fnal.gov/computing/grid
  PPDG, GridPP, iVDGL, DataTAG and other Grid projects
21 SAMGrid Principal Components
  (NB: Please come to Gabriele’s talk)
  Job Definition and Management: The preliminary job management architecture is aggressively based on the Condor technology provided through our collaboration with the University of Wisconsin CS Group.
  Monitoring and Information Services: We assign a critical role to this part of the system and widen the boundaries of this component to include all services that provide, or receive, information relevant for job and data management.
  Data Handling: The existing SAM Data Handling system, when properly abstracted, plays a principal role in the overall architecture and has direct effects on the Job Management services.
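One way data handling can directly affect job management is in brokering: a broker can prefer sites that already cache most of a job's input files. The sketch below illustrates that idea in plain Python. It is loosely inspired by requirement-and-rank matchmaking but does not use Condor or any real ClassAd API; the site names, dictionary fields, and ranking rule are all hypothetical.

```python
# Data-aware brokering sketch (illustrative only; no Condor API is used).
# A job lists the CPUs it needs and its input files; a site advertises
# its free CPUs and the set of files currently in its cache.
def cached_fraction(job, site):
    """Fraction of the job's input files already cached at the site."""
    files = job["files"]
    if not files:
        return 0.0
    return len(files & site["cached"]) / len(files)


def rank_sites(job, sites):
    """Filter sites by the CPU requirement, then rank by data locality."""
    eligible = [s for s in sites if s["free_cpus"] >= job["cpus"]]
    return sorted(
        eligible,
        # Most cached data first, then most free CPUs, then name for ties.
        key=lambda s: (-cached_fraction(job, s), -s["free_cpus"], s["name"]),
    )


job = {"cpus": 4, "files": {"f1", "f2", "f3", "f4"}}
sites = [
    {"name": "site_a", "free_cpus": 2, "cached": {"f1", "f2", "f3", "f4"}},
    {"name": "site_b", "free_cpus": 16, "cached": {"f1"}},
    {"name": "site_c", "free_cpus": 8, "cached": {"f1", "f2", "f3"}},
]

ranking = [s["name"] for s in rank_sites(job, sites)]
print(ranking)
```

Here site_a holds all the data but lacks CPUs, so the job goes to site_c, which has enough CPUs and three of the four input files already cached.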
22 SAM-Grid Architecture
  [Diagram] Principal components: Job Handling; Monitoring and Information; Data Handling
  Labels in extraction order: Request Broker; Compute Element Resource; Site Gatekeeper; Logging and Bookkeeping; Job Scheduler; Info Processor And Converter; Replica Catalog; DH Resource Management; Data Delivery and Caching; Resource Info; JH Client; AAA; Batch System; Condor-G; Condor MMS; GRAM; GSI; SAM; Grid sensors; (All) Job Status Updates; MDS-2; Condor Class Ads; Grid RC
  Legend: Principal Component; Service; Implementation or Library; Information
23 SAMGrid: Collaboration of Collaborations
  HEP experiments are traditionally collaborative
  Computing solutions in the Grid era: new types of collaboration
    Sharing solutions within an experiment – UTA MCFarm software, etc.
    Collaboration between experiments – D0 and CDF joining forces is an important event for SAM and FNAL
    Collaboration among the Grid players: physicists, computer scientists (Condor and Globus teams), physics-oriented computer professionals (such as myself)
24 Conclusions
  The Dzero experiment is one of the largest currently running experiments and presents computing challenges
  The advanced data handling system, SAM, has matured. It is fully distributed, its model has proven sound, and we expect it to scale to meet RunII needs for both D0 and CDF
  Expanded needs are in the area of job and information management
  The recent challenges are typical of Grid computing, and D0 engages actively, in collaboration with computer scientists and other Grid participants
  More in Gabriele Garzoglio’s talk
25 The Milestone Dependencies
  [Diagram: milestone dependency graph; labels listed in extraction order]
  Job Def Doc; Execute unstructured MC and SAM analysis jobs with basic brokering; Tech. Rev. doc.; Execute unstructured SAM analysis jobs; UC doc; Arch. Doc; Execute User-routed MC Jobs; Prototype Grid with RB, JSS, GMA-based MIS; Study JDLs; Use Cases; Condor; GMA, MDS; GSI; SAM; GSI in SAM; Condor in SAM; Basic SAM Res Info Service; Toy Grid with JSS, basic Monitoring; MDS TestBed; Status Mon-ing of unstructured jobs; Basic System Mon-ing; CondorG TestBed; SAM Grid-ready; Reliable Execution of structured, locally distributed MC and SAM analysis jobs with basic brokering; Scheduling criteria for data-intensive jobs, JH-DH interaction design; Monitoring of structured jobs; DH Mon-ing; JH, MIS fully distributed; JDL
  Timeline markers: 6 Mo; 9-19 Mo; Now