Grid Job, Information and Data Management for the Run II Experiments at FNAL
Igor Terekhov et al., FNAL/CD/CCF, D0, CDF, Condor team

Plan of Attack
- Brief history: D0 and CDF computing
- Grid Jobs and Information Management (JIM): architecture, job management, information management
- JIM project status and plans
- Globally distributed data handling in SAM and beyond
- Summary

History
- Run II: CDF and D0, the two largest currently running collider experiments
- Each experiment to accumulate ~1 PB of raw, reconstructed, and analyzed data
- Get the Higgs jointly
- Real data acquisition: 5 /wk, 25 MB/s, 1 TB/day, plus MC

Globally Distributed Computing
- D0: 78 institutions, 18 countries; CDF: 60 institutions, 12 countries
- Many institutions have computing (including storage) resources; dozens for each of D0 and CDF
- Some of these are actually shared, regionally or experiment-wide
- Sharing is good: a way for an institution to contribute to the collaboration while keeping its resources local
- The recent Grid trend (and its funding) encourages it

Goals of Globally Distributed Computing in Run II
- Distribute data to processing centers: SAM is a way, see later slide
- Benefit from the pool of distributed resources: maximize job turnaround, yet keep a single interface
- Facilitate and automate decision making on job/data placement: submit to the cyberspace, choose the best resource
- Provide an aggregate view of the system and its activities and keep track of what's happening
- Maintain security
- Finally, learn and prepare for LHC computing

SAM Highlights
- SAM is Sequential data Access via Meta-data; presented numerous times at previous CHEP conferences
- Core features: meta-data cataloguing, global data replication and routing, co-allocation of compute and data resources
- Global data distribution: MC import from remote sites, off-site analysis centers, off-site reconstruction (D0)

WAN Data Flow (diagram): data sites exchange data over the WAN; Routing + Caching = Replication

Now that the Data's Distributed: JIM
- JIM = Grid Jobs and Information Management
- Owes to the D0 Grid funding: PPDG (an FNAL team), UK GridPP (Rod Walker, ICL)
- Very young project, started in 2001
- Actively explore, adopt, enhance, and develop new Grid technologies
- Collaborate with the Condor team from the University of Wisconsin on job management
- JIM together with SAM is also called the SAMGrid

Job Management Strategies
- We distinguish grid-level (global) job scheduling (selection of a cluster to run on) from local scheduling (distribution of the job within the cluster)
- We distinguish structured jobs from unstructured ones: structured jobs have their details known to the Grid middleware, while unstructured jobs are mapped as a whole onto a cluster
- In the first phase, we want reasonably intelligent scheduling and reliable execution of unstructured, data-intensive jobs

Job Management Highlights
- We seek to provide automated resource selection (brokering) at the global level, with final scheduling done locally (environments like the CDF CAF, see Frank's talk)
- Focus on data-intensive jobs; execution time is composed of:
  - Time to retrieve any missing input data
  - Time to process the data
  - Time to store output data
- To leading order, we rank sites by the amount of data cached at the site (minimize missing input data)
- The scheduler is interfaced with the data handling system
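A minimal sketch of this leading-order ranking idea, assuming a hypothetical site description and a simple cost model; none of the names or numbers below come from the actual JIM code:

```python
# Sketch of leading-order site ranking for a data-intensive job.
# All names and numbers are illustrative assumptions, not JIM code.

def execution_time_estimate(job, site):
    """Estimated wall time = fetch missing input + process + store output."""
    missing_bytes = sum(size for f, size in job["input_files"].items()
                        if f not in site["cached_files"])
    t_fetch = missing_bytes / site["wan_bytes_per_sec"]
    t_process = job["cpu_seconds"] / site["relative_cpu_power"]
    t_store = job["output_bytes"] / site["wan_bytes_per_sec"]
    return t_fetch + t_process + t_store

def rank_sites(job, sites):
    """To leading order, prefer the site holding the most input data in cache,
    i.e. the one with the smallest estimated execution time."""
    return sorted(sites, key=lambda s: execution_time_estimate(job, s))

# Example usage with made-up numbers:
job = {"input_files": {"raw_001": 2e9, "raw_002": 2e9},
       "cpu_seconds": 3600, "output_bytes": 5e8}
sites = [
    {"name": "siteA", "cached_files": {"raw_001", "raw_002"},
     "wan_bytes_per_sec": 1e7, "relative_cpu_power": 1.0},
    {"name": "siteB", "cached_files": set(),
     "wan_bytes_per_sec": 1e7, "relative_cpu_power": 1.0},
]
print([s["name"] for s in rank_sites(job, sites)])  # siteA first: data already cached
```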

Job Management – Distinct JIM Features
- Decision making is based on both:
  - Information existing irrespective of jobs (resource description)
  - Functions of (job, resource) pairs
- Decision making is interfaced with the data handling middleware rather than with individual SEs or the RC alone; this allows data handling considerations to be incorporated
- Decision making is entirely in the Condor framework (no resource broker of our own): strong promotion of standards and interoperability

Architecture diagram: a job flows from the user interface and submission client into a job-management queuing system; a broker with a match-making service and information collector selects among execution sites #1…#n, each running grid sensors, a computing element, and storage elements, all tied to the data handling system.

Condor Framework and Enhancements We Drove
- Initial Condor-G: a personal Grid agent helping a user run a job on a cluster of his/her choice
- JIM: a true grid service for accepting and placing jobs from all users; added the MMS for Grid job brokering
- JIM: from a 2-tier to a 3-tier architecture
  - Decouple the queuing/spooling/scheduling machine from the user machine
  - Security delegation, proper std* spooling, etc.
- Will move into standard Condor

Condor Framework and Enhancements We Drove (continued)
- Classic Matchmaking Service (MMS): clusters advertise their availability, jobs are matched with clusters; the cluster (resource) description exists irrespective of jobs
- JIM: ranking expressions contain functions that are evaluated at run time, which helps rank a job by a function(job, resource)
  - Now: query participating sites for the data they have cached
  - Future: estimate when the data for the job can arrive, etc.
- This feature is now in standard Condor-G
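A ClassAd-flavored sketch of this matchmaking pattern, written in Python purely for illustration (it is not Condor code, and the attribute names are assumptions): resources advertise attributes irrespective of any job, while the job's rank is a function(job, resource) evaluated at match time.

```python
# Matchmaking sketch (illustration only, not Condor code).

def data_cached_gb(job, resource):
    # In JIM this kind of run-time function would query the data handling
    # system at match time; here we just read a pre-filled attribute.
    return resource.get("CachedDataGB", {}).get(job["dataset"], 0.0)

job = {
    "dataset": "zmumu-reco",
    "requirements": lambda r: r["Arch"] == "x86" and r["FreeSlots"] > 0,
    "rank": data_cached_gb,          # bigger is better
}

resource_ads = [
    {"Name": "cluster-a", "Arch": "x86", "FreeSlots": 40,
     "CachedDataGB": {"zmumu-reco": 120.0}},
    {"Name": "cluster-b", "Arch": "x86", "FreeSlots": 5,
     "CachedDataGB": {}},
]

def match(job, ads):
    """Keep resources satisfying the job's requirements, then pick the one
    with the highest rank value."""
    candidates = [ad for ad in ads if job["requirements"](ad)]
    return max(candidates, key=lambda ad: job["rank"](job, ad), default=None)

print(match(job, resource_ads)["Name"])   # -> cluster-a (most data cached)
```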

Monitoring Highlights
- Sites (resources) and jobs
- Distributed knowledge about jobs, etc.
- Incremental knowledge building
- GMA for current-state inquiries, logging for recent-history studies
- All web-based

Information Management – Implementation and Technology Choices
- XML for representation of site configuration and (almost) all other information
- XQuery and XSLT for information processing
- Xindice and other native XML databases for database semantics
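To make the idea concrete, here is a small Python stand-in for that XML-based approach; the element and attribute names are assumptions for illustration, not the actual SAMGrid schema, and standard-library XPath is used in place of XQuery/XSLT:

```python
# Sketch of processing an XML site configuration (illustrative schema only).
import xml.etree.ElementTree as ET

SITE_CONFIG = """
<site name="example-site">
  <cluster name="farm1">
    <computing-element batch="pbs" nodes="64"/>
    <storage-element protocol="gridftp" path="/pnfs/example"/>
  </cluster>
  <monitoring url="http://example-site.example.org/jim/"/>
</site>
"""

root = ET.fromstring(SITE_CONFIG)

# XPath-style query: list every storage element at the site.
for se in root.findall(".//storage-element"):
    print(se.get("protocol"), se.get("path"))

# Schema-less flexibility: the same tree can be queried for monitoring info
# without committing to a fixed schema up front.
print(root.find("monitoring").get("url"))
```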

Schema diagram: main site/cluster configuration schema, resource advertisement, monitoring schema, data handling, hosting environment, meta-schema.

JIM Monitoring diagram: web browsers query per-site web servers (sites 1…N), each backed by that site's information system.

JIM Project Status
- Delivered prototype for D0, Oct 10, 2002: remote job submission, brokering based on cached data, web-based monitoring
- SC2002 demo: 11 sites (D0, CDF), a big success
- April 2003: production deployment of V1 (Grid analysis in production a reality as of April 1)
- Post V1: OGSA, Web services, logging service

Grid Data Handling
We define GDH as a middleware service which:
- Brokers storage requests
- Maintains economical knowledge about the costs of access to different SEs
- Replicates data as needed (not only as driven by admins)
- Generalizes or replaces some of the services of the data management part of SAM
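A hypothetical interface sketch of such a service, under the assumptions stated in the comments; the class and method names are invented for illustration and are not the SAM or SAMGrid API:

```python
# Hypothetical Grid Data Handling (GDH) interface sketch (not the SAM API).
from abc import ABC, abstractmethod

class GridDataHandler(ABC):
    @abstractmethod
    def broker_storage_request(self, file_name: str, size_bytes: int) -> str:
        """Choose a storage element (SE) for the file and return its URL."""

    @abstractmethod
    def access_cost(self, file_name: str, se_url: str) -> float:
        """Economical knowledge: estimated cost of accessing the file at the SE."""

    @abstractmethod
    def replicate_if_needed(self, file_name: str, demand_hint: int) -> None:
        """Create extra replicas when demand warrants it, not only when an
        administrator asks for them."""
```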

Grid Data Handling, Initial Thoughts

The Necessary (Almost) Final Slide
- Run II experiments' computing is highly distributed; the Grid trend is very relevant
- The JIM (Jobs and Information Management) part of the SAMGrid addresses the needs for global and grid computing at Run II
- We use Condor and Globus middleware to schedule jobs globally (based on data) and provide web-based monitoring
- Demo available: see me or Gabriele
- SAM, the data handling system, is evolving towards the Grid, with modern storage element access enabled

P.S. – Related Talks
- F. Wuerthwein: CAF (Cluster Analysis Facility) – job management on a cluster and interface to JIM/Grid
- F. Ratnikov: monitoring on the CAF and interface to JIM/Grid
- S. Stonjek: SAMGrid deployment experiences
- L. Lueking, G. Garzoglio: SAM-related talks

Backup Slides

Information Management
- In JIM's view, this includes both resource description for job brokering and the infrastructure for monitoring (a core project area)
- GT MDS is not sufficient:
  - Need a (persistent) information representation that is independent of LDIF or other such format
  - Need maximum flexibility in information structure: no fixed schema
  - Need configuration tools, push operation, etc.