SAM Resource Management
Lee Lueking
CHEP 2001, September 3-8, 2001, Beijing, China

Intro to SAM
SAM is Sequential Access to data via Meta-data. The project was started in 1997 to handle D0's Run II data handling needs.
The current SAM team includes: Lauri Loebel-Carpenter, Lee Lueking*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White* (*project leaders).

Overview
- Goals of Resource Management
- Users, Groups and Access Modes
- Resources and Resource Management Strategies
- Implementation
  - System Configuration
  - Rules and Policies
  - Disk Cache Management
  - Fair Share Scheduling
  - Resource Co-allocation
- Plans and Conclusion

Goals of Resource Management
- Implement experiment policies on prioritization and fair sharing of resource usage, by user category (access mode, research group, etc.).
- Maximize throughput in terms of real work done (i.e., user jobs, not system-internal jobs such as data transfers).

Groups
Groups are sets of users whose datasets, processing styles and goals are largely shared. They are defined by:
- physics topics, such as Higgs, Top, W/Z, B, QCD, and New Phenomena;
- detector elements, such as calorimeter, silicon tracking, muon, and so on;
- particle identification, such as jets, electrons, muons, and taus.
Users must be registered, and each individual may be included in many groups.

Access Modes
Storage:
- Data acquisition storage
- Monte Carlo data storage
- General user data storage
Delivery:
- Frequently accessed data
- Cooperative access and processing
- Data file delivery on demand
- Random access event selection

Resources
- Tape mounts
- Tape volume access
- Tape drive usage
- Network throughput
- Disk cache
- Processing CPU
- Memory cache

Management Strategies
- Divide the problem into a 3-tier hierarchy: Local (station), Site, Global.
- Hardware configuration: Mass Storage System (ATL) access, network, disk assignments.
- Establish rules: group allocations, access mode priorities, data routing paths, type of processing, etc.
- Algorithms to combine the rules.

The Hierarchy of Resource Managers
(Diagram) Global RM: sites connected by WAN. Site RM: stations and MSS's connected by LANs. Station (Local RM): batch queues and disks. Experiment policies, fair share allocations, and cost metrics apply across the hierarchy.

Implementation

Overview of SAM
(Diagram) Shared globally: database server(s) connected to the central DB, a name server, site or global resource manager(s), and a log server. Local: servers for stations 1 through n. Shared locally: mass storage system(s). Arrows indicate control and data flow.

The SAM Station
Responsibilities:
- Cache management
- Project (job) management
- Movement of data files to/from MSS or other stations
A station consists of a set of inter-communicating servers:
- Station Master Server
- File Storage Server
- File Stager(s)
- Project (Job) Manager(s)

Components of a SAM Station
(Diagram) The Station & Cache Manager coordinates the File Storage Server, File Stager(s), and Project Managers. Producers/consumers and eworkers act as File Storage Clients. Data flows between the cache disk, temp disk, and the MSS or other stations, under the station's control.

Station Configuration
- Disks assigned to the cache
- Batch system used, batch queues available, and batch queue depth
- Processing capacity: CPU and physical memory
- Mass Storage Systems available
- Inter-station transfer mechanism: BBFTP, rcp
- Disk accessibility for a distributed cluster
- Network connection, bandwidth, and subnet for each machine
- Security issues, access to kerberos tickets, etc.
- Waits, timeouts and retries on failure conditions
A hypothetical sketch of such a configuration follows this list.
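To make the scope of this configuration concrete, here is a minimal sketch in Python; the key names and values are hypothetical illustrations, not SAM's actual configuration format.

    # Hypothetical station configuration sketch -- illustrative only,
    # not SAM's actual configuration format or key names.
    station_config = {
        "cache_disks": ["/sam/cache1", "/sam/cache2"],  # disks assigned to the cache
        "batch": {"system": "lsf", "queues": {"short": 50, "long": 20}},  # queue -> depth
        "capacity": {"cpus": 64, "memory_gb": 32},      # processing capacity
        "mss": ["enstore"],                             # mass storage systems available
        "transfer_mechanism": "bbftp",                  # inter-station transfer: bbftp or rcp
        "network": {"bandwidth_mbps": 155, "subnet": "131.225.0.0/16"},
        "retry": {"timeout_s": 300, "max_retries": 3},  # waits/timeouts/retries on failure
    }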

Rules and Policies
- Disk cache allocated to each group
- Disk cache refresh algorithm for each group: LRU, FIFO, etc.
- Minimum amount of data to deliver at a time from each tape for a project
- The order in which files are brought into the cache
- Through which station files are routed when retrieving from a particular Mass Storage System
- Which data access activities have the highest priority
- Which data storing activities have the highest priority
- To which MSS's files are stored, and to which tapes
- Sharing of a station's resources among groups
- Which users belong to which groups
- How many projects per group are allowed
- What processing activities are allowed on each station *
- To which stations data access and processing activities should be sent *
- How the resources of a local cluster of stations should be shared among groups *
(* currently done by administrators)

Station Management
Caches:
- Allocations are established for groups on each station.
- Resources are allocated by group: total size, lock (pin) size, and refresh algorithm (LRU, FIFO, ...).
- There is no rigid assignment to particular physical disks.
Projects:
- Number of concurrent projects for each group, on each station.
Administration is by authorized users only: station admins and group admins. A sketch of the per-group cache accounting appears below.
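As an illustration of the per-group cache bookkeeping described above, here is a minimal sketch, assuming a byte quota per group and LRU eviction of unpinned files; the class and method names are hypothetical, not SAM's station code.

    from collections import OrderedDict

    # Toy per-group cache with an LRU refresh policy -- illustrative only.
    class GroupCache:
        def __init__(self, quota_bytes):
            self.quota = quota_bytes
            self.used = 0
            self.files = OrderedDict()  # file name -> (size, locked)

        def touch(self, name):
            self.files.move_to_end(name)  # mark as most recently used

        def add(self, name, size, locked=False):
            while self.used + size > self.quota:
                self._evict_lru()  # make room within the group's quota
            self.files[name] = (size, locked)
            self.used += size

        def _evict_lru(self):
            for name, (size, locked) in self.files.items():
                if not locked:  # locked (pinned) files are never evicted
                    del self.files[name]
                    self.used -= size
                    return
            raise RuntimeError("cache full: all remaining files are locked")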

Station Administration: Dump (1)
% sam dump station --groups
*** BEGIN DUMP STATION
central-analysis, id=21 running at d0mino
5 days 22 hours 24 minutes 20 seconds, admins: lueking
Known batch systems: lsf
Default batch system: lsf
No Source location is preferred
There are 1 authorized transfer groups
Full delivery unit is enforced; external deliveries are unconstrained

Station Administration: Dump (2)
AUTHORIZED GROUPS:
group algo: admins: cope lueking melanson terekhov veseli white, swap policy: LRU, fair share: 0, quotas (cur/max): projects = 5/50, disk: KB/ KB, locks: 0B/ KB
group cal: admins: lueking terekhov veseli white, swap policy: LRU, fair share: 0, quotas (cur/max): projects = 1/10, disk: KB/78125MB, locks: 0B/78125MB
group demo: admins: lueking terekhov veseli white, swap policy: LRU, fair share: , quotas (cur/max): projects = 2/50, disk: KB/ KB, locks: 0B/0KB
group dzero: admins: lueking melanson terekhov veseli white, swap policy: LRU, fair share: , quotas (cur/max): projects = 10/100, disk: KB/ KB, locks: 0B/ KB
group emid: admins: lueking terekhov veseli white, swap policy: LRU, fair share: 0, quotas (cur/max): projects = 0/10, disk: KB/ KB, locks: 0B/ KB
group test: admins: lueking terekhov veseli white, swap policy: LRU, fair share: , quotas (cur/max): projects = 1/20, disk: KB/ KB, locks: 237179KB/ KB
group thumbnail: admins: lueking melanson schellma, swap policy: LRU, fair share: , quotas (cur/max): projects = 0/5, disk: KB/ KB, locks: 0B/0KB
*** END OF STATION DUMP ***

Adding Data to the System
- Metadata descriptions for detector data, Monte Carlo data, and processing details
- Mapping to storage locations (which we call auto-destinations)
- Station forwarding specification
A hypothetical metadata sketch follows.
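For illustration, a file declaration might carry metadata along these lines; every field name here is a hypothetical stand-in, not SAM's actual schema.

    # Hypothetical metadata record for a newly declared file -- the field
    # names are illustrative, not SAM's actual schema.
    file_metadata = {
        "file_name": "run12345_raw_001.dat",
        "data_tier": "raw",                     # detector vs. Monte Carlo vs. processed
        "application": {"name": "d0reco", "version": "p10.15"},  # processing details
        "auto_destination": "enstore:/pnfs/sam/raw",  # mapping to a storage location
        "forward_via": "central-analysis",      # station forwarding specification
    }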

Forwarding + Caching = Global Replication
(Diagram) Data flows over the WAN between stations, mass storage systems, and replica sites. Example: a user (producer) at NIKHEF (Amsterdam) forwards data via SARA over a 155 Mbps link to the D0robot mass storage system at Fermilab, leaving cached replicas along the route.

Routing + Caching = Global Replication
(Diagram) The same pattern in general form: data produced by a user is routed from station to station over the WAN and cached at each hop, replicating it from the producing site to replica sites and mass storage systems.

Resource Management Approaches
- Fair sharing (policies): allocation of resources and scheduling of jobs. The goal is to ensure that, in a busy environment, each abstract user gets a fixed share of resources or gets a fixed share of work done.
- Co-allocation and reservation (optimization).

Fair Share and Computational Economy
- Jobs, when executed, incur costs (through resource utilization) and realize benefits (through getting work done).
- Maintain a tuple (vector) of cumulative costs/benefits for each abstract user and compare it to that user's allocated fair share to set priority higher or lower.
- Incorporate all known resource types and benefit metrics; totally flexible. A toy sketch of this idea follows.
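A minimal sketch of the idea, assuming a weighted scalarization of the cost vector; the weights, names, and formula are illustrative, not SAM's actual economy.

    # Toy fair-share priority from a per-resource cost vector -- a sketch
    # of the idea on this slide, not SAM's actual algorithm.
    WEIGHTS = {"tape_mounts": 5.0, "network_mb": 0.01, "cpu_s": 0.001}

    def scalar_cost(cost_vector):
        # Collapse the per-resource cost vector with configurable weights.
        return sum(WEIGHTS[r] * v for r, v in cost_vector.items())

    def priority(user_costs, all_user_costs, fair_share):
        # Users below their allocated share get a positive boost;
        # users above it are penalized.
        total = sum(scalar_cost(c) for c in all_user_costs)
        used_fraction = scalar_cost(user_costs) / total if total else 0.0
        return fair_share - used_fraction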

Job Control: Station Integration with the Abstract Batch System
(Diagram) The client runs "sam submit", which goes to the Local RM (Station Master). When the SAM condition is satisfied, the Local RM submits to the batch system (resubmitting otherwise); the batch system dispatches the Process Manager (a SAM wrapper script), which invokes the user task and the Job Manager (Project Master), with jobEnd and setJobCount/stop signals as the job progresses. This integration provides (1) fair share job scheduling and (2) resource co-allocation. A toy sketch of the submit gating follows.
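For illustration, the gating of batch submission on a SAM condition might look like this; the function names are hypothetical, not SAM's job-control interface.

    import time

    # Toy sketch of holding a job until its SAM condition (e.g., the
    # required files are staged) is satisfied -- illustrative only.
    def sam_submit(job, condition_satisfied, batch_submit, poll_s=30):
        while not condition_satisfied(job):
            time.sleep(poll_s)  # wait and re-check, i.e., "resubmit" later
        return batch_submit(job)  # dispatch to the underlying batch system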

Future Plans
- Tape mounts were a critical resource in the past, but inter-station movement of data is expected to become the constraint as more stations are deployed with large disk caches.
- In addition to moving data to computing resources, the system will evolve to move processing to the data.
- A job control language will specify each task at a level that lets the system decide when and where it can optimally be processed.
- Incorporate standard grid components as availability and need dictate: GridFTP, GSI, Condor, DAGMan, etc.

Conclusion
- The SAM system used for D0 data management and access represents a large step toward a global data grid.
- Resources are managed at station, site and global levels.
- The system is governed by station configuration and rules/policies.
- Fair share resource allocation and scheduling control the amount of work done by each group, access mode, etc.
- Co-allocation coordinates data and processing to utilize the overall system most effectively.