1
HEP Data Grid for DØ and Its Regional Grid, DØ SAR
Jae Yu, Univ. of Texas at Arlington
3rd International Workshop on HEP Data Grid, KyungPook National University, Aug. 26–28, 2004
2
Outline
- The problem
- DØ Data Handling System and Computing Architecture
- Remote Computing Activities
- DØ Remote Analysis Model
- DØSAR and the regional grid
- Summary
3
The Problem: The DØ Experiment
- Has been taking data for the past 3 years and will continue throughout much of the decade: the problem is immediate
- Current data size is close to 1 PB and will exceed 5 PB by the end of the run
- Detectors are complicated: many people are needed to construct them and make them work
- The collaboration is large and scattered all over the world, so we must:
  - Allow software development at remote institutions
  - Provide optimized resource management, job scheduling, and monitoring tools
  - Provide efficient and transparent data delivery and sharing
- Use the opportunity of having a large data set to further grid computing technology
  - Improve computational capability for education
  - Improve quality of life
4
DØ and CDF at Fermilab Tevatron
- World's highest-energy proton-antiproton collider: E_cm = 1.96 TeV (= 6.3×10⁻⁷ J per proton; ~13 MJ delivered on 10⁻⁶ m²)
- Equivalent to the kinetic energy of a 20 t truck at a speed of 80 mi/hr
[Aerial view: the Tevatron ring near Chicago, with the CDF and DØ detectors at the proton-antiproton collision points.]
5
DØ Collaboration
- 650 collaborators, 78 institutions, 18 countries
- Roughly 1/3 the scale of the problem faced by an LHC experiment
6
Remote Computing #1: Monitoring of the DØ Experiment
- Detector monitoring data are sent in real time over the internet (e.g. 9 am at NIKHEF, Amsterdam corresponds to 2 am at Fermilab)
- DØ physicists worldwide use the internet and monitoring programs to examine collider data in real time and to evaluate detector performance and data quality; they use web tools to report this information back to their colleagues at Fermilab
- The online monitoring project has been developed by DØ physicists and is coordinated by Michele Sanders (Imperial) and Elliot Cheu (Arizona)
7
Data Handling System Architecture
[Architecture diagram: robotic storage and the central farm at Fermilab connected to central analysis systems (ClueD0), remote farms, and regional centers, exchanging raw data, RECO data, RECO MC, and user data.]
- SAM (Sequential Access via Metadata) catalogs and manages data access: the glue holding it all together, and an extremely successful project
8
The SAM Data Handling System
- Depends on / uses:
  - Central ORACLE database at Fermilab (central metadata repository)
  - ENSTORE mass storage system at Fermilab (central data repository)
- Provides access to:
  - Various file transfer protocols: bbftp, GridFTP, rcp, dccp, ... (see the sketch below)
  - Several mass storage systems: ENSTORE, HPSS, TSM
- Provides tools to define datasets:
  - Web-based GUI
  - Command line interface
- For DØ, 7 stations are deployed at FNAL and more than 20 offsite
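As a rough illustration of how a data handling layer can dispatch over the transfer protocols named above, here is a minimal Python sketch; the client tools invoked are real, but the wrapper itself is hypothetical and not SAM code.

```python
import subprocess

# Hypothetical sketch of a protocol-dispatch layer like the one SAM's file
# transfer sits on top of. The client tools named here are real, but this
# wrapper (fetch_file) is illustrative only.
COPY_TOOLS = {
    "gridftp": "globus-url-copy",  # GridFTP client
    "rcp":     "rcp",
    "dccp":    "dccp",             # dCache copy client
    # bbftp is also used by SAM, but its command line is batch-oriented
    # and does not follow the simple "tool src dst" pattern shown here.
}

def fetch_file(protocol, source, destination):
    """Copy one catalogued file to local disk using the requested protocol."""
    try:
        tool = COPY_TOOLS[protocol]
    except KeyError:
        raise ValueError(f"unsupported transfer protocol: {protocol}")
    subprocess.run([tool, source, destination], check=True)

# Example (hypothetical paths):
# fetch_file("dccp", "dcap://node.fnal.gov/pnfs/sam/file1.raw", "/scratch/file1.raw")
```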
9
Data Access History via SAM
- 9×10^10 (90 billion) events consumed by DØ in Run II using SAM
- Used for primary reconstruction and analysis at FNAL
- Used for remote simulation, reprocessing, and analysis
- 3×10^9 (3 billion) events served at a remote site
10
International Computing #2 & #3: Worldwide Simulation and Processing
[World map, Winter 2004: partial list of stations for remote data analysis, showing SAM transfers of reconstructed data and simulation files to and from the DØ experiment at Fermilab. Remote data reconstruction sites: NIKHEF, GridKA (Karlsruhe), CCIN2P3 (Lyon), WestGrid (Canada), and the UK. Remote simulation sites: Prague, Lancaster, Tata Institute, Texas, Sao Paulo, Michigan St., Imperial, Michigan, Indiana, Kansas, Oklahoma, Boston, Wuppertal, Munich, Arizona, and Louisiana.]
11
Offsite Data Re-processing
- Successfully reprocessed 550M events (200 pb⁻¹)
  - At Fermilab: 450M events
  - At GridKA, UK, IN2P3, NIKHEF, WestGrid: 100M events, with data transfers > 20 TB
- First steps toward a grid:
  - Remote submission
  - Transfer of data sets on the tens-of-TB scale
  - Non-grid-based re-processing
- Grid-based reprocessing planned for the end of CY04, with transfers exceeding 100 TB
12
DØ Analysis Systems
- CluEDØ: ~350-node desktop cluster at DØ, administered by DØ collaborators; batch system (FNBS) and local fileservers for analysis
- dØmino: a legacy system (~50 TB); older analyses, file server, single-event access; central routing station to offsite
- CAB (Central Analysis Backend) at the Feynman Computing Center: PC/Linux dØmino back-end, supplied and administered by the Computing Division; 400 dual 2 GHz nodes, each with 80 GB of disk (see the estimate below)
- Regional Analysis Centers (RAC)
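For a sense of CAB's scale, the quoted node count implies the following aggregate capacity (a derived estimate not stated on the slide, assuming both CPUs of every node are available for batch work):

```latex
\[
  400\ \text{nodes} \times 2 \times 2\,\mathrm{GHz} = 1600\,\mathrm{GHz}\ \text{of CPU},
  \qquad
  400 \times 80\,\mathrm{GB} = 32\,\mathrm{TB}\ \text{of local disk}.
\]
```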
13
DØ Remote Computing History
- SAM in place: pre-2001
- Formed the DØRACE and DØGrid teams: Sept. 2001
- DØ Remote Analysis Model proposed: Nov. 2001
14
DØ Remote Analysis Model (DØRAM)
[Hierarchy diagram: the Central Analysis Center (CAC) at Fermilab, the data and resource hub, fans out to Regional Analysis Centers (RACs), responsible for MC production, data processing, and data analysis; RACs serve Institutional Analysis Centers (IACs), which in turn serve Desktop Analysis Stations (DASs). Solid arrows mark the normal interaction/communication path; dashed arrows mark occasional interaction.]
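The tiered fan-out can be pictured as a simple tree. The sketch below is only an illustration of the CAC → RAC → IAC → DAS hierarchy; the site names under the CAC are hypothetical placeholders, not the actual DØ deployment.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Center:
    name: str
    tier: str                     # "CAC", "RAC", "IAC", or "DAS"
    children: List["Center"] = field(default_factory=list)

# Hypothetical site names below the CAC, for illustration only.
doram = Center("Fermilab", "CAC", [
    Center("RAC-1", "RAC", [
        Center("IAC-a", "IAC", [Center("desktop-1", "DAS")]),
        Center("IAC-b", "IAC"),
    ]),
    Center("RAC-2", "RAC"),
])

def normal_paths(center):
    """Yield (parent, child) name pairs along the normal interaction path:
    each tier communicates with the tier directly below it."""
    for child in center.children:
        yield center.name, child.name
        yield from normal_paths(child)

for parent, child in normal_paths(doram):
    print(f"{parent} -> {child}")
```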
15
DØ Remote Computing History
- SAM in place: pre-2001
- Formed the DØRACE and DØGrid teams: Sept. 2001
- DØ Remote Analysis Model proposed: Nov. 2001
- Proposal for RAC accepted and endorsed by DØ: June–Aug. 2002
- UTA awarded MRI for RAC: June 2002
- Prototype RAC established at Karlsruhe: Aug.–Nov. 2002
- Formation of DØ Southern Analysis Region: Apr. 2003
- DØ offsite re-processing: Nov. 2003 – Feb. 2004
- Activation of 1st US RAC at UTA: Nov. 2003
- Formation and activation of DØSAR Grid for MC: Apr. 2004
16
DØ Southern Analysis Region (DØSAR)
- One of the regional grids within the DØGrid
- A consortium coordinating activities to maximize computing, human, and analysis resources
- Formed around the RAC at UTA
- Eleven institutes and twelve clusters
- MC farm clusters: a mixture of dedicated and multi-purpose rack-mounted systems, plus a desktop Condor farm
17
DØSAR Consortium
- First Generation IACs: University of Texas at Arlington, Louisiana Tech University, Langston University, University of Oklahoma, Tata Institute (India)
- Second Generation IACs: Cinvestav (Mexico), Universidade Estadual Paulista (Brazil), University of Kansas, Kansas State University
- Each 1st-generation institution is paired with a 2nd-generation institution to help expedite implementation of DØSAR capabilities
- Third Generation IACs: Ole Miss (MS), Rice University (TX), University of Arizona (Tucson, AZ)
- Both 1st- and 2nd-generation institutions can then help the 3rd-generation institutions implement DØSAR capabilities
18
Centralized Deployment Models
Started with the lab-centric SAM infrastructure already in place, then transitioned to a hierarchically distributed model.
19
DØRAM Implementation
[Map: RAC at GridKa (Karlsruhe) serving Mainz, Wuppertal, Munich, Aachen, and Bonn; RAC at UTA serving Mexico/Brazil, OU/LU, UAZ, Rice, LTU, KU, KSU, and Ole Miss.]
- UTA hosts the first US DØRAC
- DØSAR formed around UTA
20
UTA – RAC (DPCC)
- Cluster 1: 84 P4 Xeon 2.4 GHz CPUs = 202 GHz; 5 TB of FBC + 3.2 TB internal IDE; GFS file system
- Cluster 2: 100 P4 Xeon 2.6 GHz CPUs = 260 GHz; 64 TB of IDE RAID + 4 TB internal; NFS file system
- Totals: 462 GHz CPU, 76.2 TB disk, 168 GB memory, 68 Gb/s network bandwidth
- HEP–CSE joint project: DØ + ATLAS, CSE research
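The quoted totals are consistent with the per-cluster figures:

```latex
\[
  84 \times 2.4\,\mathrm{GHz} \approx 202\,\mathrm{GHz}, \qquad
  100 \times 2.6\,\mathrm{GHz} = 260\,\mathrm{GHz}, \qquad
  202 + 260 = 462\,\mathrm{GHz};
\]
\[
  5 + 3.2 + 64 + 4 = 76.2\,\mathrm{TB}\ \text{total disk}.
\]
```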
21
The Tools at DØSAR
- Sequential Access via Metadata (SAM): existing, battle-tested data replication and cataloging system
- Batch systems:
  - Condor: three of the DØSAR farms consist of desktop machines under Condor (a minimal submission sketch follows below)
  - PBS: most of the dedicated DØSAR farms use this system
- Grid framework: JIM (Job and Information Management), PPDG
  - Provides the framework for grid operation: job submission, monitoring, matchmaking, and scheduling
  - Built upon Condor-G and Globus
  - Interfaced to two job managers:
    - runjob: more generic grid-enabled system; 1 US + 5 EU MC sites
    - McFarm: 5 US DØSAR sites
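As a concrete example of the batch layer that the grid framework builds on, here is a minimal sketch of submitting a batch of MC-style jobs to a local Condor pool. The submit-description keywords are standard Condor; the executable name and its arguments are hypothetical placeholders for an actual MC production wrapper.

```python
import subprocess
import textwrap

# Minimal Condor submit description; run_mcjob.sh and its arguments are
# hypothetical placeholders, not DØSAR production scripts.
submit_description = textwrap.dedent("""\
    universe   = vanilla
    executable = run_mcjob.sh
    arguments  = --events 250 --request-id 12345
    output     = mcjob.$(Cluster).$(Process).out
    error      = mcjob.$(Cluster).$(Process).err
    log        = mcjob.log
    queue 10
""")

with open("mcjob.sub", "w") as f:
    f.write(submit_description)

# condor_submit is the standard Condor submission command; this assumes the
# machine is a submit node of the local pool.
subprocess.run(["condor_submit", "mcjob.sub"], check=True)
```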
22
Tevatron Grid Framework (SAMGrid or JIM)
[Screenshot: SAMGrid/JIM monitoring web page, showing the UTA station.]
23
The Tools, cont'd
- Monte Carlo Farm (McFarm) management: cloned to other institutions, increasing the total number of offsite MC farms by 5
- Various monitoring software:
  - Ganglia resource monitoring, piped to MonALISA as a VO (a polling sketch follows below)
  - McFarmGraph: MC job status monitoring
  - McPerM: farm performance monitor
  - McQue: anticipated grid resource occupation monitor
- DØSAR Grid: requests are submitted on a local machine, transferred to a submission site, and executed at an execution site
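As an illustration of the resource-monitoring layer, the sketch below polls a Ganglia gmond daemon, which publishes its cluster state as XML on TCP port 8649, and prints one load metric per host. The farm head-node name is a hypothetical placeholder; this is not the DØSAR monitoring code itself.

```python
import socket
import xml.etree.ElementTree as ET

def poll_gmond(host, port=8649):
    """Fetch the full XML state dump that a Ganglia gmond daemon serves."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return ET.fromstring(b"".join(chunks))

if __name__ == "__main__":
    root = poll_gmond("farm-head.example.edu")   # hypothetical head node
    for host in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        print(host.get("NAME"), "load_one =", metrics.get("load_one"))
```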
24
DØSAR Computing & Human Resources
| Institution    | CPU (GHz) [future] | Storage (TB)       | People                      |
|----------------|--------------------|--------------------|-----------------------------|
| Cinvestav      | 13                 | 1.1                | 1F                          |
| Langston       | 22                 | 1.3                | 1F + 1GA                    |
| LTU            | 25 + [12]          | 3.0                | 1F + 1PD + 2GA              |
| KU             | 12                 | 2.5                | 1F + 1PD                    |
| KSU            | 40                 | 3.5                | 1F + 2GA                    |
| OU             | 19 + [270]         | (tape)             | 4F + 3PD + 2GA              |
| Sao Paulo      | 115 + [300]        | 4.5                | 2F + many                   |
| Tata Institute | 78                 | 1.6                | 1F + 1Sys                   |
| UTA            | 520                | 74                 | 2.5F + 1Sys + 1.5PD + 3GA   |
| Total          | 844 + [582]        | 93.3 + 120 (tape)  | 14.5F + 2Sys + 6.5PD + 10GA |
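As a consistency check, the current and planned CPU columns sum to the quoted totals:

```latex
\[
  13 + 22 + 25 + 12 + 40 + 19 + 115 + 78 + 520 = 844\,\mathrm{GHz},
  \qquad
  12 + 270 + 300 = 582\,\mathrm{GHz}\ \text{(planned additions)}.
\]
```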
25
Ganglia Grid Resource Monitoring
Operating since Apr. 2003
26
Job Status Monitoring: McFarmGraph
Operating since Sept. 2003
27
Farm Performance Monitor: McPerM
Operating since Sept. 2003. Designed, implemented, and improved by UTA students.
28
Queue Monitor: McQue
[Plot: anticipated CPU occupation and jobs in the distribution queue, showing number of jobs and % of total available CPUs versus time from present (hours).]
- Prototype in commissioning
29
DØSAR Strategy
- Maximally exploit existing software and utilities to enable as many sites as possible to contribute to the experiment (a sketch of this checklist follows below):
  - Set up all IACs with DØ software and the data analysis environment
  - Install the Condor (or PBS) batch control system on desktop farms or dedicated clusters
  - Install the McFarm local MC production control software
  - Produce MC events on IAC machines
  - Enable the various monitoring software
  - Install SAMGrid and interface it with McFarm
  - Submit jobs through SAMGrid and monitor them
  - Perform analysis at the individual's desk
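The strategy amounts to an ordered checklist per IAC. Here is a small sketch of tracking how far each site has progressed; the step names paraphrase the list above, and the per-site status is hypothetical example data.

```python
# Ordered DØSAR deployment steps, paraphrased from the strategy above.
SETUP_STEPS = [
    "dzero_software",    # DØ software and data analysis environment
    "batch_system",      # Condor or PBS batch control
    "mcfarm",            # McFarm local MC production control software
    "mc_production",     # producing MC events on IAC machines
    "monitoring",        # Ganglia, McFarmGraph, McPerM, ...
    "samgrid",           # SAMGrid installed and interfaced with McFarm
    "grid_submission",   # jobs submitted and monitored through SAMGrid
]

# Hypothetical example data, not the real status of any DØSAR site.
site_status = {
    "site-A": {"dzero_software", "batch_system", "mcfarm", "mc_production"},
    "site-B": set(SETUP_STEPS),
}

def next_step(completed):
    """Return the first step in the ordered checklist not yet completed."""
    for step in SETUP_STEPS:
        if step not in completed:
            return step
    return None

for site, done in site_status.items():
    pending = next_step(done)
    print(site, "->", "grid-ready" if pending is None else f"next: {pending}")
```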
30
DØSAR Computing & Human Resources
| Institution    | CPU (GHz) [future] | Storage (TB)       | People                      |
|----------------|--------------------|--------------------|-----------------------------|
| Cinvestav      | 13                 | 1.1                | 1F                          |
| Langston       | 22                 | 1.3                | 1F + 1GA                    |
| LTU            | 25 + [12]          | 3.0                | 1F + 1PD + 2GA              |
| KU             | 12                 | 2.5                | 1F + 1PD                    |
| KSU            | 40                 | 3.5                | 1F + 2GA                    |
| OU             | 19 + [270]         | (tape)             | 4F + 3PD + 2GA              |
| Sao Paulo      | 115 + [300]        | 4.5                | 2F + many                   |
| Tata Institute | 78                 | 1.6                | 1F + 1Sys                   |
| UTA            | 520                | 74                 | 2.5F + 1Sys + 1.5PD + 3GA   |
| Total          | 844 + [582]        | 93.3 + 120 (tape)  | 14.5F + 2Sys + 6.5PD + 10GA |
31
DØSAR Grid Status
- Total of seven clusters producing MC events
- At the third DØSAR workshop at Louisiana Tech University (April 2004), five grid-enabled clusters formed the DØSAR Grid for MC production
- Simulated data production on the grid is in progress
- Preparing to add 3 more MC sites and 2 more grid-enabled sites at the next workshop in Sept. 2004
- Exploring work with the JIM team at Fermilab on further software tasks
- A large amount of documentation and regional expertise in grid computing has accumulated in the consortium
32
How does the current DØSAR Grid work?
[Diagram: a client site submits a JDL job description to the DØ Grid; submission sites dispatch jobs to execution sites (dedicated and desktop clusters) within the regional grids, with data handled through SAM.]
33
DØSAR MC Delivery Statistics

| Institution              | Inception | NMC (TMB) ×10⁶ |
|--------------------------|-----------|----------------|
| LTU                      | 6/2003    | 0.6            |
| LU                       | 7/2003    | 1.3            |
| OU                       | 4/2003    | 1.0            |
| Tata, India              |           | 3.3            |
| Sao Paulo, Brazil        | 4/2004    |                |
| UTA-HEP                  | 1/2003    | 3.5            |
| UTA–RAC                  | 12/2003   | 9.0            |
| DØSAR Total (as of 8/25/04) |        | 18.7           |

(Source: Joel Snow, Langston University, DØ Grid/Remote Computing.)
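As a consistency check, the listed per-site contributions add up to the quoted total (sites with no figure shown are not counted):

```latex
\[
  0.6 + 1.3 + 1.0 + 3.3 + 3.5 + 9.0 = 18.7 \quad (\times 10^{6}\ \text{TMB events}).
\]
```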
34
Actual DØ Data Re-processing at UTA
Completed and delivered 200M events in July 2004
35
Network Bandwidth Occupation
[Bandwidth plot: sustained operation periods labeled ATLAS DC2 and DØ TMBFix.]
- OC12 upgrade expected at the end of 2004 or early 2005
36
Benefits of Regional Consortium
- Construct an end-to-end service environment at a smaller, manageable scale
- Train and accumulate local expertise
  - Easier access to help
  - Smaller group working coherently and closely together
  - Easier to share expertise
- Draw additional resources from a variety of funding sources
- Promote interdisciplinary collaboration
- Increase intellectual resources: enable remote participants to contribute more actively to the collaboration
37
Some Successes in Funding at DØSAR
- NSF MRI funds for the UTA RAC (2002): construction of the first U.S. university-based RAC
- EPSCoR + university funds for the LTU IAC (2003): increased IAC compute resources and human resources for further development
- Brazilian national funds for the Univ. of Sao Paulo (2003): construction of a prototype RAC for South America; further funding very likely
- EPSCoR funds for the OU & LU IACs (2004): compute resources for the IACs
38
Summary
- DØGrid is operating in MC production with the SAMGrid framework
  - Generic (runjob): 1 US + 5 EU sites
  - McFarm: 5 DØSAR sites
- A large amount of offsite documentation and expertise has accumulated
- Moving toward grid-based data re-processing and analysis
  - Massive data re-processing in late CY04
  - Data storage at RACs for the consortium
  - Higher level of complexity
- Improved infrastructure is necessary for end-to-end grid services, especially network bandwidth
  - NLR and other regional network (10 Gbps) improvement plans
  - Started working with AMPATH and the Oklahoma, Louisiana, and Brazilian consortia (tentatively named the BOLT Network) for the last mile…
- Start working with global grid efforts to allow work on interoperability