Distributed Data Access and Resource Management in the D0 SAM System


Distributed Data Access and Resource Management in the D0 SAM System
I. Terekhov, Fermi National Accelerator Laboratory, for the SAM project: L. Carpenter, L. Lueking, C. Moore, J. Trumbo, S. Veseli, M. Vranicar, S. White, V. White

Plan of Attack
- The domain: D0 overview and applications
- SAM as a Data Grid: metadata, file replication, initial resource management
- SAM and generic Grid technologies
- Comprehensive resource management

D0: A Virtual Organization
- High Energy Physics (HEP) collider experiment, multi-institutional
- Collaboration of 500+ scientists, 72+ institutions, 18+ countries
- Physicists generate and analyze data
- Coordinated resource sharing (networks, MSS, etc.) for solving a common problem: physics analysis

Applications and Data Intensity
- Real data taking from the detector
- Monte Carlo data simulation
- Reconstruction
- Analysis
- The gist of experimental HEP: extremely I/O intensive
- Recurrent processing of datasets makes caching highly beneficial

Data Handling as the Core of D0 Meta-Computing
- HEP applications are data-intensive
- The computational economy is extremely data-centric because costs are driven by data-handling (DH) resources
- SAM is primarily, and historically, a DH system: a working Data Grid prototype
- Job control is being added in the Grid context (the D0-PPDG project)

SAM as a Data Grid
- High-level services: replica selection, replication cost estimation, data replication, comprehensive resource management
- Core services: mass storage systems, metadata, resource management
- Generic Grid services (external to SAM)
Based on: A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke, "The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets," to appear in Journal of Network and Computer Applications.

Data Replica Management Levels
- A Processing Station is a (locally distributed, semi-autonomous) collection of hardware resources (disk, CPU, etc.) managed by a software component
- Local data replication: parallel processing within a single batch system, i.e. within a Station
- Global data replication: worldwide data exchange among Stations and MSSs

Local Data Replication
- Consider a cluster with a physically distributed disk cache
- The cache is logically partitioned by research group (controlled, coordinated sharing)
- Each group runs an independent cache replacement algorithm (FIFO, LRU, and many other flavors)
- The replica catalog is updated in the course of cache replacement
- The access history of each local replica is maintained persistently in the metadata
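The per-group replacement scheme above can be sketched as follows. This is an illustrative model only, not SAM's actual implementation: the class and attribute names are invented, LRU is just one of the replacement flavors the slide mentions, and the replica catalog is modeled as a plain dict.

```python
from collections import OrderedDict

class GroupCache:
    """One research group's partition of the station disk cache, with LRU
    replacement (hypothetical sketch; SAM supports several flavors)."""

    def __init__(self, quota_bytes, replica_catalog):
        self.quota = quota_bytes
        self.used = 0
        self.files = OrderedDict()        # file name -> size, in LRU order
        self.catalog = replica_catalog    # shared replica catalog (a dict here)

    def access(self, name, size):
        """Record an access, caching the file and evicting LRU entries if needed."""
        if name in self.files:
            self.files.move_to_end(name)  # refresh position in the LRU order
            return
        while self.used + size > self.quota and self.files:
            victim, vsize = self.files.popitem(last=False)  # evict LRU file
            self.used -= vsize
            self.catalog.pop(victim, None)  # catalog updated during replacement
        self.files[name] = size
        self.used += size
        self.catalog[name] = "station-disk"  # register the new local replica
```

Note that the catalog update happens inside the eviction loop, mirroring the slide's point that the replica catalog is maintained in the course of cache replacement rather than by a separate pass.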

Local Data Replication, cont'd
- While the resource managers strive to keep jobs close to their data (see below), the batch system does not always dispatch jobs where the data lies
- The Station then performs intra-cluster data replication on demand, fully transparently to the user
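The on-demand step can be sketched as below. The function name and the dict-of-sets replica map are hypothetical; the point is only the control flow: check the dispatch node first, otherwise copy from a peer within the cluster without involving the user job.

```python
def ensure_local(node, filename, cluster_replicas):
    """Intra-cluster replication on demand (illustrative sketch).

    cluster_replicas maps node name -> set of files cached on that node.
    Returns the node the data came from, so a caller could account for
    the transfer; the user job itself never sees this step.
    """
    if filename in cluster_replicas.get(node, set()):
        return node                          # data already local: nothing to do
    for peer, files in cluster_replicas.items():
        if filename in files:
            # stand-in for the actual peer-to-peer copy
            cluster_replicas.setdefault(node, set()).add(filename)
            return peer
    # no replica anywhere in the cluster: defer to global replication
    raise FileNotFoundError(filename)
```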

[Diagram: Forwarding + Caching = Global Replication. Data produced by a user flows over the WAN through Station caches and site replicas into the Mass Storage System.]

Goals of Resource Management
- Implement experiment policies on prioritization and fair sharing of resource usage, by user category (access mode, research group, etc.)
- Maximize throughput in terms of real work done (i.e., user jobs rather than internal system jobs such as data transfers)

RM Approaches
- Fair sharing (policies): allocation of resources and scheduling of jobs; the goal is to ensure that, in a busy environment, each abstract user gets a fixed share of resources, or gets a fixed share of work done
- Co-allocation and reservation (optimization)

Fair Sharing and Computational Economy
- Jobs, when executed, incur costs (through resource utilization) and realize benefits (through getting work done)
- Maintain a tuple (vector) of cumulative costs/benefits for each abstract user and compare it with the user's allocated fair share to raise or lower priority
- Incorporates all known resource types and benefit metrics; fully flexible
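The priority computation implied above can be sketched as follows. The ledger layout, the linear combination, and the sign convention are assumptions for illustration; the slide specifies only that cumulative per-user cost vectors are compared against fair-share allocations.

```python
def priority(user, ledger, shares):
    """Fair-share priority from a computational-economy ledger (sketch).

    ledger[user]  -> vector of cumulative costs, one entry per resource type
    shares[user]  -> the user's allocated fraction of each resource type
    Users below their fair share get a higher (more positive) priority.
    """
    # Total consumption per resource type across all users
    totals = [sum(v) for v in zip(*ledger.values())]
    over = 0.0
    for used, total, share in zip(ledger[user], totals, shares[user]):
        if total:
            over += (used / total) - share   # > 0 means above fair share
    return -over   # invert: being over one's share lowers priority
```

A scheduler would recompute this each time it picks the next job, so heavy recent consumers gradually yield to users who are under their allocation.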

The Hierarchy of Resource Managers
- Global RM: experiment policies, fair-share allocations, cost metrics (sites connected by a WAN)
- Site RM: Stations and MSSs connected by LANs
- Local RM (Station): batch queues and disks

[Diagram: Job Control — Station integration with the abstract batch system. A "sam submit" from the client reaches the Local RM (Station Master), which performs fair-share job scheduling and resource co-allocation and invokes a Job Manager (Project Master); the Job Manager submits to the batch system once the SAM condition is satisfied; the batch system dispatches a Process Manager (a SAM wrapper script), which invokes the user task, resubmits it as needed, and reports job end back via setJobCount/stop and jobEnd.]

SAM as a Data Grid (annotated)
- High-level services:
  - Replication cost estimation: cached data, file transfer queues, site RM "weather" conditions
  - Replica selection: preferred locations
  - Data replication: caching, forwarding, pinning
  - Comprehensive resource management: DH-batch system integration, fair-share allocation, MSS access control, network access control
- Core services:
  - Mass storage systems
  - Metadata: replica catalog, system configuration, cost/benefit metrics
  - Resource management
- External to SAM: batch system internal RM, MSS internal RM
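The cost-estimation inputs listed above can be combined into a simple replica-selection sketch. The metric names, the linear weighting, and the tape-staging penalty are illustrative assumptions, not SAM's actual cost model.

```python
def estimate_cost(replica, weights):
    """Fold the 'weather' inputs into one scalar cost (hypothetical model).

    replica is a dict of observed metrics for one candidate location.
    """
    cost = 0.0
    cost += weights["queue"] * replica["transfer_queue_len"]  # pending transfers
    cost += weights["net"] * replica["net_latency_ms"]        # network conditions
    if not replica["cached"]:
        cost += weights["mss"]                                # tape-staging penalty
    return cost

def select_replica(replicas, weights):
    """Replica selection: pick the candidate with the lowest estimated cost."""
    return min(replicas, key=lambda r: estimate_cost(r, weights))
```

The design point the layering makes is that selection consumes cost estimates but does not care how they are produced, so the estimator can be swapped without touching the selector.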

SAM Grid Work (D0-PPDG)
- Enhance the system by adding Grid services (Grid authentication, replica selection, etc.)
- Adapt the system to generic Grid services: replace proprietary tools and internal protocols with Grid standards
- Collaborate with computer scientists to develop new Grid technologies, using SAM as a testbed for testing and validating them

[Diagram: Initial PPDG work — Condor/D0 job scheduling, preliminary architecture. Job management (a Grid meta-scheduler built from Condor MMS, DAGMan, and Condor-G) communicates with data management (the D0 Data Grid: "sam submit", data and DH resources, SAM, the abstract batch system) through Condor/SAM-Grid and SAM/Condor-Grid adapters over standard Grid protocols; the scheduler queries the costs of candidate job placements before scheduling jobs.]

Conclusions
- D0 SAM is not only a production meta-computing system but also a functioning Data Grid prototype, with data replication and resource management at an advanced, mature stage
- Work continues to fully Grid-enable the system
- Some of our components and services will, we hope, be of interest to the Grid community