1
Data reprocessing for DZero on the SAM-Grid
Gabriele Garzoglio for the SAM-Grid Team
Fermilab, Computing Division
Mar 15, 2005
2
Overview
- The DZero experiment at Fermilab
- Data reprocessing: motivation, the computing challenges, current deployment
- The SAM-Grid system: Condor-G and global job management, local job management
- Getting more resources: submitting to LCG
3
Fermilab and DZero
4
Data size for the D0 Experiment
Detector data:
- 1,000,000 channels
- Event size: 250 KB
- Event rate: ~50 Hz
- 0.75 PB of data since 1998
- Past year: 0.5 PB overall
- Expect 10 - 20 PB overall
This means:
- Move 10s of TB / day
- Process PBs / year
- 25% - 50% remote computing
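As a back-of-the-envelope check of these rates, the following minimal sketch uses only the event size and rate quoted above; the yearly figure assumes continuous data taking, which is an approximation.

```python
# Back-of-the-envelope detector data rates, using the numbers on the slide.
EVENT_SIZE_KB = 250        # ~250 KB per event
EVENT_RATE_HZ = 50         # ~50 Hz event rate

bytes_per_second = EVENT_SIZE_KB * 1e3 * EVENT_RATE_HZ    # ~12.5 MB/s off the detector
tb_per_day = bytes_per_second * 86_400 / 1e12              # ~1 TB/day of raw data
pb_per_year = tb_per_day * 365 / 1e3                        # ~0.4 PB/year if taking data continuously

print(f"{bytes_per_second/1e6:.1f} MB/s, {tb_per_day:.1f} TB/day, {pb_per_year:.2f} PB/year")
```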
5
Overview
- The DZero experiment at Fermilab
- Data reprocessing: motivation, the computing challenges, current deployment
- The SAM-Grid system: Condor-G and global job management, local job management
- Getting more resources: submitting to LCG
6
Motivation for the Reprocessing
- Processing: changing the data format from something close to the detector to something close to the physics
- As the understanding of the detector improves, the processing algorithms change
- Sometimes it is worth reprocessing all the data to obtain "better" analysis results
- Our understanding of the DZero calorimeter calibration is now based on reality rather than on design/plans: we want to reprocess
7
The computing task
- Events: 1 billion
- Input: 250 TB (250 kB/event)
- Output: 70 TB (70 kB/event)
- Time: 50 s/event, i.e. ~20,000 CPU-months
- Ideally 3,400 CPUs (1 GHz PIII) for 6 months (~2 days/file)
- Remote processing: 100%
- The input is a stack of CDs as high as the Eiffel tower
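The CPU count follows directly from these figures; a minimal sketch of the arithmetic, using only the numbers quoted on the slide:

```python
# How many 1 GHz PIII-equivalent CPUs are needed to reprocess within ~6 months?
EVENTS = 1_000_000_000          # 1 billion events
SECONDS_PER_EVENT = 50          # ~50 s/event on a 1 GHz PIII
CAMPAIGN_MONTHS = 6

total_cpu_seconds = EVENTS * SECONDS_PER_EVENT
cpu_months = total_cpu_seconds / (30 * 86_400)    # ~19,000 CPU-months, i.e. ~20,000
cpus_needed = cpu_months / CAMPAIGN_MONTHS        # ~3,200-3,400 CPUs for a 6-month campaign

print(f"{cpu_months:,.0f} CPU-months -> ~{cpus_needed:,.0f} CPUs for {CAMPAIGN_MONTHS} months")
```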
8
Data processing model
- Input datasets of n files (n ~ 100: the files produced in one day)
- Jobs 1..n run as n batch processes per site, across Sites 1..m
- Outputs 1..n are stored locally at the site
- The outputs are merged and sent to permanent storage (at any site)
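A minimal sketch of this split-and-merge model; the function and file names below are hypothetical, and the real SAM-Grid bookkeeping and data movement are far more involved:

```python
# Sketch: each input file of a ~100-file dataset becomes one batch job at some site;
# the per-file outputs are later merged and shipped to permanent storage.

def process_file(input_file: str) -> str:
    """Placeholder for one batch job: reconstruct one raw file, return the output file."""
    output_file = input_file.replace(".raw", ".reco")
    # ... run the reconstruction executable on input_file here ...
    return output_file

def reprocess_dataset(input_files: list) -> str:
    outputs = [process_file(f) for f in input_files]   # n jobs, one per file
    merged = "merged.reco"
    # ... merge the n per-file outputs into one file for permanent storage ...
    return merged

dataset = [f"run12345_{i:03d}.raw" for i in range(100)]  # ~100 files produced in one day
print(reprocess_dataset(dataset))
```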
9
Challenges: Overall scale
- A dozen computing clusters in the US and EU: a common meta-computing framework (SAM-Grid) with administrative independence
- Need to submit 1,700 batch jobs / day to meet the deadline, without counting failures (at ~2 CPU-days per file, ~3,400 CPUs complete ~1,700 jobs per day)
- Each site needs to be kept full at all times: locally, scale up to 1,000 batch nodes
- Time to completion of the unit of bookkeeping (~100 files): if it is too long (days), things are bound to fail
- Handle 250+ TB of data
10
Challenges: Error Handling / Recovery
- Design for random failures: unrecoverable application errors, network outages, file delivery failures, batch system crashes and hangs, worker-node crashes, filesystem corruption...
- Bookkeeping of succeeded jobs/files: needed to assure completion without duplicated events
- Bookkeeping of failed jobs/files: needed for recovery AND to trace problems, in order to fix bugs and assure efficiency
- Simple error recovery to foster smooth operations
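A minimal sketch of what per-file bookkeeping of this kind looks like; the table and field names are hypothetical, and SAM keeps this state in its own database rather than a local SQLite file:

```python
# Sketch: track the state of each input file so that completed files are never
# reprocessed (no duplicated events) and failed files can be found and resubmitted.
import sqlite3

db = sqlite3.connect("bookkeeping.db")
db.execute("""CREATE TABLE IF NOT EXISTS files (
                 name   TEXT PRIMARY KEY,
                 status TEXT NOT NULL DEFAULT 'pending',  -- pending | running | done | failed
                 error  TEXT)""")

def mark(name: str, status: str, error: str = "") -> None:
    """Record the latest known state of one input file."""
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)", (name, status, error))
    db.commit()

def files_to_recover() -> list:
    """Files that still need (re)processing: never started or previously failed."""
    rows = db.execute("SELECT name FROM files WHERE status IN ('pending', 'failed')")
    return [r[0] for r in rows]
```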
11
Available Resources

SITE       | #CPU (1 GHz-eq.) | STATUS
FNAL Farm  | 1000 CPUs        | used for data taking
Westgrid   | 600 CPUs         | ready
Lyon       | 400 CPUs         | ready
SAR (UTA)  | 230 CPUs         | ready
Wisconsin  | 30 CPUs          | ready
GridKa     | 500 CPUs         | ready
Prague     | 200 CPUs         | ready
CMS/OSG    | 100 CPUs         | under test
UK         | 750 CPUs         | 4 sites being deployed

Total: ~2800 CPUs (1 GHz PIII equivalent)
12
Overview
- The DZero experiment at Fermilab
- Data reprocessing: motivation, the computing challenges, current deployment
- The SAM-Grid system: Condor-G and global job management, local job management
- Getting more resources: submitting to LCG
13
The SAM-Grid
- SAM-Grid is an integrated job, data, and information management system
- Grid-level job management is based on Condor-G and Globus
- Data handling and bookkeeping are based on SAM (Sequential Access via Metadata): transparent data transport, processing history, and bookkeeping
- ...and a lot of work to achieve scalability at the execution cluster
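For illustration, this is roughly what a grid-level submission through Condor-G looks like. This is a hedged sketch: the gatekeeper host, executable, and file names are made up, and the actual SAM-Grid client wraps the submission in its own tools rather than a hand-written submit file.

```python
# Sketch: submit a job to a remote Globus gatekeeper via Condor-G (grid universe).
# The gatekeeper host, jobmanager, and file names below are hypothetical.
import subprocess, tempfile

submit_description = """\
universe      = grid
grid_resource = gt2 gatekeeper.example.org/jobmanager-pbs
executable    = d0_reprocess.sh
arguments     = run12345_001.raw
output        = job.out
error         = job.err
log           = job.log
queue
"""

with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
    f.write(submit_description)
    subfile = f.name

# Requires a working Condor-G installation and a valid grid proxy (grid-proxy-init).
subprocess.run(["condor_submit", subfile], check=True)
```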
14
SAM-Grid Diagram
[Diagram: flow of jobs, data, and meta-data between the grid client (user interface, submission, global job queue), the grid-level services (resource selector / match making, information collector and gatherer, monitoring user tools and site web service), the global data-handling services (SAM naming server, SAM log server, resource optimizer, SAM DB server, replica and metadata catalogs, bookkeeping service), and each execution site (grid gateway, grid/fabric interface, JIM advertise, local job handling, XML info/configuration database with the global-to-local job ID map, SAM station and stagers, cache, MSS, distributed filesystem, AAA, worker nodes).]
15
Job Management Diagram
[Diagram: the user interface hands the job to a submission service; the submission service consults the resource-selection / match-making service, which ranks execution sites #1..#n using the information collector and external ranking algorithms; the selected site's computing element, behind the grid/fabric interface and instrumented with grid sensors, runs the job alongside the site's other generic services.]
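A toy illustration of the match-making idea; the attributes, numbers, and ranking below are purely hypothetical, while the real system relies on the Condor match-making machinery with site advertisements and external ranking algorithms:

```python
# Sketch: pick an execution site by matching job requirements against advertised
# site attributes and ranking the candidates, in the spirit of Condor match-making.

sites = [  # what each site's grid sensors might advertise (made-up numbers)
    {"name": "Lyon",     "free_cpus": 120, "has_dataset": True},
    {"name": "Westgrid", "free_cpus": 300, "has_dataset": False},
    {"name": "GridKa",   "free_cpus": 80,  "has_dataset": True},
]

def select_site(min_cpus: int) -> dict:
    candidates = [s for s in sites if s["free_cpus"] >= min_cpus]   # requirements
    # Rank: prefer sites that already cache the input dataset, then by free CPUs.
    return max(candidates, key=lambda s: (s["has_dataset"], s["free_cpus"]))

print(select_site(min_cpus=50)["name"])   # -> "Lyon"
```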
16
Fabric Level Job Management
[Diagram: at the execution site, the grid/fabric interface coordinates the SAM station, a sandbox facility, an XML monitoring database, and a batch-system adapter in front of the batch system and its worker nodes.]
1. The job enters the site
2. A local sandbox is created for the job (user input, configuration, SAM client, GridFTP client, user credentials)
3. Local services are notified of the job
4. Batch-job submission details are requested
5. The job is submitted to the batch system
6. The job starts on a worker node; the push of monitoring information starts
7. The job fetches the sandbox, then gets its dependent products and input data
8. The framework passes control to the application; the Grid monitors the job status, and the user can request it
9. The job stores the output from the application; stdout, stderr, and logs are handed back to the Grid
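A condensed sketch of what a job wrapper following steps 7-9 might do on the worker node; all paths, scripts, and helper names are hypothetical, and the real grid/fabric interface is a set of scripts and services rather than a single function:

```python
# Sketch: the worker-node part of one fabric-level job, following the steps above.
import subprocess, urllib.request

def run_job(sandbox_url: str, app_cmd: list) -> int:
    # Step 7: fetch and unpack the sandbox (config, SAM/GridFTP clients, credentials).
    urllib.request.urlretrieve(sandbox_url, "sandbox.tar.gz")
    subprocess.run(["tar", "xzf", "sandbox.tar.gz"], check=True)
    # Step 7: get dependent products and input data (hypothetical helper script).
    subprocess.run(["./sam_get_products_and_input.sh"], check=True)
    # Step 8: pass control to the application, capturing stdout and stderr.
    with open("stdout.log", "w") as out, open("stderr.log", "w") as err:
        rc = subprocess.run(app_cmd, stdout=out, stderr=err).returncode
    # Step 9: store the output and hand stdout, stderr, and logs back to the Grid.
    subprocess.run(["./store_output_and_logs.sh", str(rc)], check=True)
    return rc
```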
17
How do we get more resources?
- We are working on forwarding jobs to the LCG Grid
- A "forwarding node" is the advertised gateway to LCG
- LCG becomes yet another batch system... well, not quite a batch system
- We need to get rid of the assumptions about the locality of the network
[Diagram: SAM-Grid -> forwarding node (with a VO service) -> LCG]
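A schematic of the "LCG as yet another batch system" idea; the class names are hypothetical, and the point is only that the forwarding node exposes LCG behind the same adapter interface that local batch systems use, while the assumption of a shared local network/filesystem no longer holds:

```python
# Sketch: local batch systems and LCG behind one batch-system-adapter interface.
import subprocess

class BatchAdapter:
    def submit(self, job_script: str) -> str: ...
    def status(self, job_id: str) -> str: ...

class PBSAdapter(BatchAdapter):
    """A conventional local batch system: submission and data are local to the cluster."""
    def submit(self, job_script: str) -> str:
        return subprocess.run(["qsub", job_script], capture_output=True,
                              text=True, check=True).stdout.strip()

class LCGAdapter(BatchAdapter):
    """Runs on the forwarding node: jobs go out to LCG, so the caller can no longer
    assume network locality between the gateway and the worker nodes."""
    def submit(self, job_script: str) -> str:
        # edg-job-submit was the LCG-2 era submission command, shown only as an example.
        return subprocess.run(["edg-job-submit", job_script + ".jdl"], capture_output=True,
                              text=True, check=True).stdout.strip()
```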
18
Conclusions
- DZero needs to reprocess 250 TB of data in the next 6-8 months
- It will produce 70 TB of output, processing data at a dozen computing centers on ~3,000 CPUs
- The SAM-Grid system will provide the meta-computing infrastructure to handle data, jobs, and information
19
More info at...
http://www-d0.fnal.gov/computing/reprocessing/
http://www-d0.fnal.gov/computing/grid/
http://samgrid.fnal.gov:8080/
http://d0db.fnal.gov/sam/