1
HEP Data Grid for DØ and Its Regional Grid, DØ SAR
Jae Yu, Univ. of Texas at Arlington
3rd International Workshop on HEP Data Grid, KyungPook National University, Aug. 26–28, 2004
2
Outline
- The problem
- DØ Data Handling System and Computing Architecture
- Remote Computing Activities
- DØ Remote Analysis Model
- DØSAR and the regional grid
- Summary
3
The Problem: The DØ Experiment
- Has been taking data for the past 3 years and will continue throughout much of the decade: the problem is immediate
- Current data size is close to 1 PB and will exceed 5 PB by the end of the run
- Detectors are complicated: many people are needed to construct them and make them work
- The collaboration is large and scattered all over the world, so we must:
  - Allow software development at remote institutions
  - Provide optimized resource management, job scheduling, and monitoring tools
  - Provide efficient and transparent data delivery and sharing
- Use the opportunity of having a large data set to further grid computing technology
  - Improve computational capability for education
  - Improve quality of life
4
DØ and CDF at Fermilab Tevatron
- World's highest-energy proton-antiproton collider: E_cm = 1.96 TeV (= 6.3×10⁻⁷ J per proton; ~13 MJ delivered on 10⁻⁶ m²)
- Equivalent to the kinetic energy of a 20 t truck at a speed of 80 mi/hr
[Aerial view: the Tevatron ring near Chicago, with the CDF and DØ detectors at the proton-antiproton collision points.]
5
DØ Collaboration
- 650 collaborators, 78 institutions, 18 countries
- Roughly 1/3 the scale of the problem faced by an LHC experiment
6
Remote Computing #1: Monitoring of the DØ Experiment
- Detector monitoring data are sent in real time over the internet (e.g. 9 am at NIKHEF, Amsterdam corresponds to 2 am at Fermilab)
- DØ physicists worldwide use the internet and monitoring programs to examine collider data in real time and to evaluate detector performance and data quality; they use web tools to report this information back to their colleagues at Fermilab
- The online monitoring project has been developed by DØ physicists and is coordinated by Michele Sanders (Imperial) and Elliot Cheu (Arizona)
7
Data Handling System Architecture
[Architecture diagram: robotic storage and the central farm at Fermilab connected to central analysis systems (ClueD0), remote farms, and regional centers, exchanging raw data, RECO data, RECO MC, and user data.]
- SAM (Sequential Access via Metadata) catalogs and manages data access: the glue holding it all together, and an extremely successful project
8
The SAM Data Handling System
- Depends on / uses:
  - Central ORACLE database at Fermilab (central metadata repository)
  - ENSTORE mass storage system at Fermilab (central data repository)
- Provides access to:
  - Various file transfer protocols: bbftp, GridFTP, rcp, dccp, ... (see the sketch below)
  - Several mass storage systems: ENSTORE, HPSS, TSM
- Provides tools to define datasets:
  - Web-based GUI
  - Command line interface
- For DØ, 7 stations are deployed at FNAL and more than 20 offsite
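As a rough illustration of how a data handling layer can dispatch over the transfer protocols named above, here is a minimal Python sketch; the client tools invoked are real, but the wrapper itself is hypothetical and not SAM code.

```python
import subprocess

# Hypothetical sketch of a protocol-dispatch layer like the one SAM's file
# transfer sits on top of. The client tools named here are real, but this
# wrapper (fetch_file) is illustrative only.
COPY_TOOLS = {
    "gridftp": "globus-url-copy",  # GridFTP client
    "rcp":     "rcp",
    "dccp":    "dccp",             # dCache copy client
    # bbftp is also used by SAM, but its command line is batch-oriented
    # and does not follow the simple "tool src dst" pattern shown here.
}

def fetch_file(protocol, source, destination):
    """Copy one catalogued file to local disk using the requested protocol."""
    try:
        tool = COPY_TOOLS[protocol]
    except KeyError:
        raise ValueError(f"unsupported transfer protocol: {protocol}")
    subprocess.run([tool, source, destination], check=True)

# Example (hypothetical paths):
# fetch_file("dccp", "dcap://node.fnal.gov/pnfs/sam/file1.raw", "/scratch/file1.raw")
```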
9
Data Access History via SAM
- 9×10^10 (90 billion) events consumed by DØ in Run II using SAM
- Used for primary reconstruction and analysis at FNAL
- Used for remote simulation, reprocessing, and analysis
- 3×10^9 (3 billion) events served at a remote site
10
International Computing #2 & #3: Worldwide Simulation and Processing
[World map, Winter 2004: partial list of stations for remote data analysis, showing SAM transfers of reconstructed data and simulation files to and from the DØ experiment at Fermilab. Remote data reconstruction sites: NIKHEF, GridKA (Karlsruhe), CCIN2P3 (Lyon), WestGrid (Canada), and the UK. Remote simulation sites: Prague, Lancaster, Tata Institute, Texas, Sao Paulo, Michigan St., Imperial, Michigan, Indiana, Kansas, Oklahoma, Boston, Wuppertal, Munich, Arizona, and Louisiana.]
11
Offsite Data Re-processing
- Successfully reprocessed 550M events (200 pb⁻¹)
  - At Fermilab: 450M events
  - At GridKA, UK, IN2P3, NIKHEF, WestGrid: 100M events, with data transfers > 20 TB
- First steps toward a grid:
  - Remote submission
  - Transfer of data sets on the tens-of-TB scale
  - Non-grid-based re-processing
- Grid-based reprocessing planned for the end of CY04, with transfers exceeding 100 TB
12
DØ Analysis Systems
- CluEDØ: ~350-node desktop cluster at DØ, administered by DØ collaborators; batch system (FNBS) and local fileservers for analysis
- dØmino: a legacy system (~50 TB); older analyses, file server, single-event access; central routing station to offsite
- CAB (Central Analysis Backend) at the Feynman Computing Center: PC/Linux dØmino back-end, supplied and administered by the Computing Division; 400 dual 2 GHz nodes, each with 80 GB of disk (see the estimate below)
- Regional Analysis Centers (RAC)
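For a sense of CAB's scale, the quoted node count implies the following aggregate capacity (a derived estimate not stated on the slide, assuming both CPUs of every node are available for batch work):

```latex
\[
  400\ \text{nodes} \times 2 \times 2\,\mathrm{GHz} = 1600\,\mathrm{GHz}\ \text{of CPU},
  \qquad
  400 \times 80\,\mathrm{GB} = 32\,\mathrm{TB}\ \text{of local disk}.
\]
```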
13
DØ Remote Computing History
- SAM in place: pre-2001
- Formed the DØRACE and DØGrid teams: Sept. 2001
- DØ Remote Analysis Model proposed: Nov. 2001
14
DØ Remote Analysis Model (DØRAM)
[Hierarchy diagram: the Central Analysis Center (CAC) at Fermilab, the data and resource hub, fans out to Regional Analysis Centers (RACs), responsible for MC production, data processing, and data analysis; RACs serve Institutional Analysis Centers (IACs), which in turn serve Desktop Analysis Stations (DASs). Solid arrows mark the normal interaction/communication path; dashed arrows mark occasional interaction.]
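The tiered fan-out can be pictured as a simple tree. The sketch below is only an illustration of the CAC → RAC → IAC → DAS hierarchy; the site names under the CAC are hypothetical placeholders, not the actual DØ deployment.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Center:
    name: str
    tier: str                     # "CAC", "RAC", "IAC", or "DAS"
    children: List["Center"] = field(default_factory=list)

# Hypothetical site names below the CAC, for illustration only.
doram = Center("Fermilab", "CAC", [
    Center("RAC-1", "RAC", [
        Center("IAC-a", "IAC", [Center("desktop-1", "DAS")]),
        Center("IAC-b", "IAC"),
    ]),
    Center("RAC-2", "RAC"),
])

def normal_paths(center):
    """Yield (parent, child) name pairs along the normal interaction path:
    each tier communicates with the tier directly below it."""
    for child in center.children:
        yield center.name, child.name
        yield from normal_paths(child)

for parent, child in normal_paths(doram):
    print(f"{parent} -> {child}")
```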
15
DØ Remote Computing History
- SAM in place: pre-2001
- Formed the DØRACE and DØGrid teams: Sept. 2001
- DØ Remote Analysis Model proposed: Nov. 2001
- Proposal for RAC accepted and endorsed by DØ: June–Aug. 2002
- UTA awarded MRI for RAC: June 2002
- Prototype RAC established at Karlsruhe: Aug.–Nov. 2002
- Formation of DØ Southern Analysis Region: Apr. 2003
- DØ offsite re-processing: Nov. 2003 – Feb. 2004
- Activation of 1st US RAC at UTA: Nov. 2003
- Formation and activation of DØSAR Grid for MC: Apr. 2004
16
DØ Southern Analysis Region (DØSAR)
- One of the regional grids within the DØGrid
- A consortium coordinating activities to maximize computing, human, and analysis resources
- Formed around the RAC at UTA
- Eleven institutes and twelve clusters
- MC farm clusters: a mixture of dedicated and multi-purpose rack-mounted systems, plus a desktop Condor farm
17
DØSAR Consortium
- First Generation IACs: University of Texas at Arlington, Louisiana Tech University, Langston University, University of Oklahoma, Tata Institute (India)
- Second Generation IACs: Cinvestav (Mexico), Universidade Estadual Paulista (Brazil), University of Kansas, Kansas State University
- Each 1st-generation institution is paired with a 2nd-generation institution to help expedite implementation of DØSAR capabilities
- Third Generation IACs: Ole Miss (MS), Rice University (TX), University of Arizona (Tucson, AZ)
- Both 1st- and 2nd-generation institutions can then help the 3rd-generation institutions implement DØSAR capabilities
18
Centralized Deployment Models
Started with the lab-centric SAM infrastructure already in place, then transitioned to a hierarchically distributed model.
19
DØRAM Implementation
[Map: RAC at GridKa (Karlsruhe) serving Mainz, Wuppertal, Munich, Aachen, and Bonn; RAC at UTA serving Mexico/Brazil, OU/LU, UAZ, Rice, LTU, KU, KSU, and Ole Miss.]
- UTA hosts the first US DØRAC
- DØSAR formed around UTA
20
UTA – RAC (DPCC)
- Cluster 1: 84 P4 Xeon 2.4 GHz CPUs = 202 GHz; 5 TB of FBC + 3.2 TB internal IDE; GFS file system
- Cluster 2: 100 P4 Xeon 2.6 GHz CPUs = 260 GHz; 64 TB of IDE RAID + 4 TB internal; NFS file system
- Totals: 462 GHz CPU, 76.2 TB disk, 168 GB memory, 68 Gb/s network bandwidth
- HEP–CSE joint project: DØ + ATLAS, CSE research
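The quoted totals are consistent with the per-cluster figures:

```latex
\[
  84 \times 2.4\,\mathrm{GHz} \approx 202\,\mathrm{GHz}, \qquad
  100 \times 2.6\,\mathrm{GHz} = 260\,\mathrm{GHz}, \qquad
  202 + 260 = 462\,\mathrm{GHz};
\]
\[
  5 + 3.2 + 64 + 4 = 76.2\,\mathrm{TB}\ \text{total disk}.
\]
```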
21
The Tools at DØSAR
- Sequential Access via Metadata (SAM): existing, battle-tested data replication and cataloging system
- Batch systems:
  - Condor: three of the DØSAR farms consist of desktop machines under Condor (a minimal submission sketch follows below)
  - PBS: most of the dedicated DØSAR farms use this system
- Grid framework: JIM (Job and Information Management), PPDG
  - Provides the framework for grid operation: job submission, monitoring, matchmaking, and scheduling
  - Built upon Condor-G and Globus
  - Interfaced to two job managers:
    - runjob: more generic grid-enabled system; 1 US + 5 EU MC sites
    - McFarm: 5 US DØSAR sites
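As a concrete example of the batch layer that the grid framework builds on, here is a minimal sketch of submitting a batch of MC-style jobs to a local Condor pool. The submit-description keywords are standard Condor; the executable name and its arguments are hypothetical placeholders for an actual MC production wrapper.

```python
import subprocess
import textwrap

# Minimal Condor submit description; run_mcjob.sh and its arguments are
# hypothetical placeholders, not DØSAR production scripts.
submit_description = textwrap.dedent("""\
    universe   = vanilla
    executable = run_mcjob.sh
    arguments  = --events 250 --request-id 12345
    output     = mcjob.$(Cluster).$(Process).out
    error      = mcjob.$(Cluster).$(Process).err
    log        = mcjob.log
    queue 10
""")

with open("mcjob.sub", "w") as f:
    f.write(submit_description)

# condor_submit is the standard Condor submission command; this assumes the
# machine is a submit node of the local pool.
subprocess.run(["condor_submit", "mcjob.sub"], check=True)
```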
22
Tevatron Grid Framework (SAMGrid or JIM)
[Screenshot: SAMGrid/JIM monitoring web page, showing the UTA station.]
23
The Tools, cont'd
- Monte Carlo Farm (McFarm) management: cloned to other institutions, increasing the total number of offsite MC farms by 5
- Various monitoring software:
  - Ganglia resource monitoring, piped to MonALISA as a VO (a polling sketch follows below)
  - McFarmGraph: MC job status monitoring
  - McPerM: farm performance monitor
  - McQue: anticipated grid resource occupation monitor
- DØSAR Grid: requests are submitted on a local machine, transferred to a submission site, and executed at an execution site
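As an illustration of the resource-monitoring layer, the sketch below polls a Ganglia gmond daemon, which publishes its cluster state as XML on TCP port 8649, and prints one load metric per host. The farm head-node name is a hypothetical placeholder; this is not the DØSAR monitoring code itself.

```python
import socket
import xml.etree.ElementTree as ET

def poll_gmond(host, port=8649):
    """Fetch the full XML state dump that a Ganglia gmond daemon serves."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return ET.fromstring(b"".join(chunks))

if __name__ == "__main__":
    root = poll_gmond("farm-head.example.edu")   # hypothetical head node
    for host in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        print(host.get("NAME"), "load_one =", metrics.get("load_one"))
```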
24
DØSAR Computing & Human Resources
| Institution    | CPU (GHz) [future] | Storage (TB)       | People                      |
|----------------|--------------------|--------------------|-----------------------------|
| Cinvestav      | 13                 | 1.1                | 1F                          |
| Langston       | 22                 | 1.3                | 1F + 1GA                    |
| LTU            | 25 + [12]          | 3.0                | 1F + 1PD + 2GA              |
| KU             | 12                 | 2.5                | 1F + 1PD                    |
| KSU            | 40                 | 3.5                | 1F + 2GA                    |
| OU             | 19 + [270]         | (tape)             | 4F + 3PD + 2GA              |
| Sao Paulo      | 115 + [300]        | 4.5                | 2F + many                   |
| Tata Institute | 78                 | 1.6                | 1F + 1Sys                   |
| UTA            | 520                | 74                 | 2.5F + 1Sys + 1.5PD + 3GA   |
| Total          | 844 + [582]        | 93.3 + 120 (tape)  | 14.5F + 2Sys + 6.5PD + 10GA |
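As a consistency check, the current and planned CPU columns sum to the quoted totals:

```latex
\[
  13 + 22 + 25 + 12 + 40 + 19 + 115 + 78 + 520 = 844\,\mathrm{GHz},
  \qquad
  12 + 270 + 300 = 582\,\mathrm{GHz}\ \text{(planned additions)}.
\]
```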
25
Ganglia Grid Resource Monitoring
Operating since Apr. 2003
26
Job Status Monitoring: McFarmGraph
Operating since Sept. 2003
27
Farm Performance Monitor: McPerM
Operating since Sept. 2003. Designed, implemented, and improved by UTA students.
28
Queue Monitor: McQue
[Plot: anticipated CPU occupation and jobs in the distribution queue, showing number of jobs and % of total available CPUs versus time from present (hours).]
- Prototype in commissioning
29
DØSAR Strategy
- Maximally exploit existing software and utilities to enable as many sites as possible to contribute to the experiment (a sketch of this checklist follows below):
  - Set up all IACs with DØ software and the data analysis environment
  - Install the Condor (or PBS) batch control system on desktop farms or dedicated clusters
  - Install the McFarm local MC production control software
  - Produce MC events on IAC machines
  - Enable the various monitoring software
  - Install SAMGrid and interface it with McFarm
  - Submit jobs through SAMGrid and monitor them
  - Perform analysis at the individual's desk
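The strategy amounts to an ordered checklist per IAC. Here is a small sketch of tracking how far each site has progressed; the step names paraphrase the list above, and the per-site status is hypothetical example data.

```python
# Ordered DØSAR deployment steps, paraphrased from the strategy above.
SETUP_STEPS = [
    "dzero_software",    # DØ software and data analysis environment
    "batch_system",      # Condor or PBS batch control
    "mcfarm",            # McFarm local MC production control software
    "mc_production",     # producing MC events on IAC machines
    "monitoring",        # Ganglia, McFarmGraph, McPerM, ...
    "samgrid",           # SAMGrid installed and interfaced with McFarm
    "grid_submission",   # jobs submitted and monitored through SAMGrid
]

# Hypothetical example data, not the real status of any DØSAR site.
site_status = {
    "site-A": {"dzero_software", "batch_system", "mcfarm", "mc_production"},
    "site-B": set(SETUP_STEPS),
}

def next_step(completed):
    """Return the first step in the ordered checklist not yet completed."""
    for step in SETUP_STEPS:
        if step not in completed:
            return step
    return None

for site, done in site_status.items():
    pending = next_step(done)
    print(site, "->", "grid-ready" if pending is None else f"next: {pending}")
```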
30
DØSAR Computing & Human Resources
| Institution    | CPU (GHz) [future] | Storage (TB)       | People                      |
|----------------|--------------------|--------------------|-----------------------------|
| Cinvestav      | 13                 | 1.1                | 1F                          |
| Langston       | 22                 | 1.3                | 1F + 1GA                    |
| LTU            | 25 + [12]          | 3.0                | 1F + 1PD + 2GA              |
| KU             | 12                 | 2.5                | 1F + 1PD                    |
| KSU            | 40                 | 3.5                | 1F + 2GA                    |
| OU             | 19 + [270]         | (tape)             | 4F + 3PD + 2GA              |
| Sao Paulo      | 115 + [300]        | 4.5                | 2F + many                   |
| Tata Institute | 78                 | 1.6                | 1F + 1Sys                   |
| UTA            | 520                | 74                 | 2.5F + 1Sys + 1.5PD + 3GA   |
| Total          | 844 + [582]        | 93.3 + 120 (tape)  | 14.5F + 2Sys + 6.5PD + 10GA |
31
DØSAR Grid Status
- Total of seven clusters producing MC events
- At the third DØSAR workshop at Louisiana Tech University (April 2004), five grid-enabled clusters formed the DØSAR Grid for MC production
- Simulated data production on the grid is in progress
- Preparing to add 3 more MC sites and 2 more grid-enabled sites at the next workshop in Sept. 2004
- Exploring work with the JIM team at Fermilab on further software tasks
- A large amount of documentation and regional expertise in grid computing has accumulated in the consortium
32
How does the current DØSAR Grid work?
[Diagram: a client site submits a JDL job description to the DØ Grid; submission sites dispatch jobs to execution sites (dedicated and desktop clusters) within the regional grids, with data handled through SAM.]
33
DØSAR MC Delivery Statistics

| Institution              | Inception | NMC (TMB) ×10⁶ |
|--------------------------|-----------|----------------|
| LTU                      | 6/2003    | 0.6            |
| LU                       | 7/2003    | 1.3            |
| OU                       | 4/2003    | 1.0            |
| Tata, India              |           | 3.3            |
| Sao Paulo, Brazil        | 4/2004    |                |
| UTA-HEP                  | 1/2003    | 3.5            |
| UTA–RAC                  | 12/2003   | 9.0            |
| DØSAR Total (as of 8/25/04) |        | 18.7           |

(Source: Joel Snow, Langston University, DØ Grid/Remote Computing.)
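As a consistency check, the listed per-site contributions add up to the quoted total (sites with no figure shown are not counted):

```latex
\[
  0.6 + 1.3 + 1.0 + 3.3 + 3.5 + 9.0 = 18.7 \quad (\times 10^{6}\ \text{TMB events}).
\]
```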
34
Actual DØ Data Re-processing at UTA
Completed and delivered 200M events in July 2004
35
Network Bandwidth Occupation
[Bandwidth plot: sustained operation periods labeled ATLAS DC2 and DØ TMBFix.]
- OC12 upgrade expected at the end of 2004 or early 2005
36
Benefits of Regional Consortium
- Construct an end-to-end service environment at a smaller, manageable scale
- Train and accumulate local expertise
  - Easier access to help
  - Smaller group working coherently and closely together
  - Easier to share expertise
- Draw additional resources from a variety of funding sources
- Promote interdisciplinary collaboration
- Increase intellectual resources: enable remote participants to contribute more actively to the collaboration
37
Some Successes in Funding at DØSAR
- NSF MRI funds for the UTA RAC (2002): construction of the first U.S. university-based RAC
- EPSCoR + university funds for the LTU IAC (2003): increased IAC compute resources and human resources for further development
- Brazilian national funds for the Univ. of Sao Paulo (2003): construction of a prototype RAC for South America; further funding very likely
- EPSCoR funds for the OU & LU IACs (2004): compute resources for the IACs
38
Summary
- DØGrid is operating in MC production with the SAMGrid framework
  - Generic (runjob): 1 US + 5 EU sites
  - McFarm: 5 DØSAR sites
- A large amount of offsite documentation and expertise has accumulated
- Moving toward grid-based data re-processing and analysis
  - Massive data re-processing in late CY04
  - Data storage at RACs for the consortium
  - Higher level of complexity
- Improved infrastructure is necessary for end-to-end grid services, especially network bandwidth
  - NLR and other regional network (10 Gbps) improvement plans
  - Started working with AMPATH and the Oklahoma, Louisiana, and Brazilian consortia (tentatively named the BOLT Network) for the last mile…
- Start working with global grid efforts to allow work on interoperability