David Colling GridPP Edinburgh 6th November 2001 SAM... an overview (Many thanks to Vicky White, Lee Lueking and Rod Walker)
SAM stands for Sequential Access to Data via Metadata, where "sequential" refers to the events stored within files.
The current SAM development team includes: Lauri Loebel-Carpenter, Lee Lueking*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White* (*project leaders).
Recently some work has also been done in the UK by Rod Walker.
History of SAM
The project started in 1997.
It was built for the DØ virtual organisation (~500 physicists, 72 institutions, 18 countries).
SAM's objectives are:
- to provide a worldwide system of shareable computing and storage resources, and so a solution to the common problem of extracting physics results from about a petabyte of data (c. 2003);
- to provide a large degree of transparency to the user, who makes requests for datasets, submits jobs and stores files (together with extensive metadata about the processing steps etc.).
Currently SAM's storage and delivery of data is far more advanced than its job submission.
SAM is an operational prototype of many of the concepts being developed for Grid computing.
Overview of SAM
[Diagram] Shared globally: Database Server(s) (Central Database), Name Server, Global Resource Manager(s), Log Server. Local: Station 1 Servers, Station 2 Servers, Station 3 Servers, ... Station n Servers. Shared locally: Mass Storage System(s). Arrows indicate control and data flow.
The Name Server allows all components to find each other by name.
The Database Server has numerous methods which process transactions and retrieve information from the central database.
The Resource Manager controls the efficient use of resources such as tape stores.
The Log Server gathers information from the entire system for monitoring and debugging.
All communication is via CORBA.
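The idea of components finding each other by name can be illustrated with a minimal sketch. This is not SAM's CORBA naming service: the `NameServer` class, the component names and the addresses are all hypothetical, chosen only to show registration followed by lookup.

```python
# Hypothetical sketch: class, names and addresses are illustrative, not SAM's.
class NameServer:
    """Maps a component name to the address where it can be reached."""

    def __init__(self):
        self._registry = {}

    def register(self, name, address):
        # A component announces itself under a well-known name.
        self._registry[name] = address

    def resolve(self, name):
        # Any other component needs only the name to find the address.
        try:
            return self._registry[name]
        except KeyError:
            raise LookupError(f"no component registered as {name!r}") from None


name_server = NameServer()
name_server.register("db-server", "host-a:9001")
name_server.register("log-server", "host-b:9002")
db_address = name_server.resolve("db-server")
```

The point of the indirection is that clients depend on names only, so a server can move to another host without its clients being reconfigured.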
The SAM station
A SAM station is deployed on local processing platforms.
A station's set of CPU and disk resources is not shared outside that station.
Stations can communicate directly with each other, and data cached in one station's cache can be replicated at other stations on demand.
Groups of stations at one physical site can share a locally available mass storage system (e.g. at Fermilab).
The SAM station
The station's responsibilities include:
- Storing and retrieving data files from mass storage and other stations.
- Managing data stored on cache disk.
- Launching Project Managers, which oversee the processing of data requests by consumers in well-defined projects.
All these functions are provided by the servers within a station. (See next slide.)
The SAM Station
[Diagram] Components: File Stager(s), Station & Cache Manager, File Storage Server, Project Managers, Producers/Consumers, eworkers, File Storage Clients, Cache Disk, Temp Disk, MSS or other station (shown twice). Arrows indicate data flow and control.
The SAM Station
The Station Manager oversees the removal of files cached on disk, and instructs the File Stager to add new files.
All processing projects are started through the Station Server, which starts Project Managers.
Files are added to the system through the File Storage Server (FSS), which uses the Stagers to initiate transfers to the available MSS or to another station.
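The interplay between removing cached files and asking a stager to add new ones can be sketched as follows. This is an illustration only: the slides do not say which removal policy SAM uses, so least-recently-used eviction is an assumption, and the `CacheManager` and `FileStager` classes, their methods and the file names are hypothetical.

```python
from collections import OrderedDict


class FileStager:
    """Hypothetical stand-in for a File Stager: records what it was asked to fetch."""

    def __init__(self):
        self.fetched = []

    def fetch(self, file_name):
        self.fetched.append(file_name)


class CacheManager:
    """Hypothetical cache manager using least-recently-used eviction (an assumption)."""

    def __init__(self, capacity_bytes, stager):
        self.capacity_bytes = capacity_bytes
        self.stager = stager
        self._files = OrderedDict()  # file name -> size, least recently used first

    def used_bytes(self):
        return sum(self._files.values())

    def request(self, file_name, size_bytes):
        """Make sure file_name is cached; return the names of any evicted files."""
        if file_name in self._files:
            self._files.move_to_end(file_name)  # cache hit: mark as recently used
            return []
        if size_bytes > self.capacity_bytes:
            raise ValueError("file is larger than the whole cache")
        evicted = []
        # Remove old files until the new one fits.
        while self.used_bytes() + size_bytes > self.capacity_bytes:
            old_name, _ = self._files.popitem(last=False)
            evicted.append(old_name)
        self.stager.fetch(file_name)  # instruct the stager to bring the file in
        self._files[file_name] = size_bytes
        return evicted


stager = FileStager()
cache = CacheManager(capacity_bytes=100, stager=stager)
cache.request("a", 40)
cache.request("b", 40)
cache.request("a", 40)              # hit: "a" becomes the most recently used
evicted = cache.request("c", 40)    # does not fit, so the oldest file is removed
```

In this run "b" is evicted rather than "a", because "a" was touched more recently; a real station would presumably also have to avoid removing files that a running project still needs.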
The SAM Station
A Station Job Manager provides services to execute a user application, script, or series of jobs, potentially as parallel processes, either interactively or through a local batch system.
LSF and FBS are currently supported; Condor and PBS adapters are under construction and are being tested.
The station Cache Manager and Job Manager are implemented as a single Station Master server.
Job submission, and the synchronization between job execution and data delivery, is currently part of SAM.
Jobs are put on hold in batch system queues until data files are available to the job.
At present, jobs submitted at one station may only be run using the batch system(s) available at that station.
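Holding jobs until their data files have arrived can be sketched as below. This is a simplified model, not SAM's implementation or any batch system's API: the `HeldJobQueue` class, its methods, and the job and file names are hypothetical.

```python
class HeldJobQueue:
    """Hypothetical model of jobs held in a queue until their input files arrive."""

    def __init__(self):
        self._held = {}      # job name -> set of files still missing
        self.released = []   # jobs released to run, in release order

    def submit(self, job_name, needed_files, available_files=()):
        missing = set(needed_files) - set(available_files)
        if missing:
            self._held[job_name] = missing   # hold until data is delivered
        else:
            self.released.append(job_name)   # all data present: run at once

    def file_delivered(self, file_name):
        # Called when data delivery completes for one file.
        for job_name in list(self._held):
            missing = self._held[job_name]
            missing.discard(file_name)
            if not missing:
                del self._held[job_name]
                self.released.append(job_name)

    def held_jobs(self):
        return sorted(self._held)


queue = HeldJobQueue()
queue.submit("job1", ["f1", "f2"])
queue.submit("job2", ["f2"])
queue.file_delivered("f2")
held_after_f2 = queue.held_jobs()    # job1 still waits for f1
queue.file_delivered("f1")
```

Here `job2` is released as soon as `f2` arrives, while `job1` stays on hold until `f1` has also been delivered.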
The User Interface
UIs are provided to add data, access data, set configuration parameters and monitor the system.
These take the form of a Unix command line, Web GUIs and a Python API.
There is also a C++ interface for accessing data through a standard DØ framework package.
Defining a dataset
Examining a predefined dataset
Querying Cached Files
The SAM station
[Screenshot] Real data files from FNAL; MC files from NIKHEF.
The SAM station
The SAM submit command
Starts a project and submits the job to the Condor batch system:
sam submit --defname=run129194_reco --cpu-per-event=2m --group=dzero --batch-system-flags="--universe=vanilla --output=condor.out --log=condor.log --error=condor.error --initialdir=/home/walker/TestSam/blife/BLifetime_x-run13264x_reco_p arguments='-rcp framework.rcp -input_file SAMInput: -output_file outputfile -out BLifetime_x.out -log BLifetime_x.log -time -mem'" --framework-exe=./BLifetime_x
The DØ SAM World
[Map] Labels: MSU, Columbia, UTA, 64, Lyon/IN2P3, 100, Prague, 32, Imperial College, Lancaster, 200, NIKHEF, 50, Fermilab. Networks: SuperJanet, SURFnet, ESnet, Abilene. The legend marks MC production centers.
Also a UCL-CDF-test station.
SAM Works now!
Transfers initiated between 9:30 and 12:30 (Thursday 25 Oct 2001)
| from station        | to station               | #files | tot_size (KB) |
| ccin2p3-analysis    | central-analysis         |     51 |               |
| central-analysis    | clued0                   |     43 |               |
| central-analysis    | enstore                  |    138 |               |
| central-analysis    | imperial-test            |     19 |               |
| datalogger-d0olb    | enstore                  |     54 |               |
| datalogger-d0olc    | enstore                  |     34 |               |
| enstore             | central-analysis         |     20 |               |
| enstore             | clued0                   |     20 |               |
| enstore             | linux-analysis-cluster-1 |     27 |               |
| hoeve               | central-analysis         |     67 |               |
| lancs               | central-analysis         |     21 |               |
| prague-test-station | central-analysis         |      2 |               |
| uta-hep             | central-analysis         |      5 |               |
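As a small worked example, the file counts in the transfer listing can be totalled per destination station. The station names and counts are copied from the listing; the variable names are only illustrative, and the tot_size column is left out because its values are not present here.

```python
from collections import Counter

# (from station, to station, number of files), copied from the transfer listing.
transfers = [
    ("ccin2p3-analysis", "central-analysis", 51),
    ("central-analysis", "clued0", 43),
    ("central-analysis", "enstore", 138),
    ("central-analysis", "imperial-test", 19),
    ("datalogger-d0olb", "enstore", 54),
    ("datalogger-d0olc", "enstore", 34),
    ("enstore", "central-analysis", 20),
    ("enstore", "clued0", 20),
    ("enstore", "linux-analysis-cluster-1", 27),
    ("hoeve", "central-analysis", 67),
    ("lancs", "central-analysis", 21),
    ("prague-test-station", "central-analysis", 2),
    ("uta-hep", "central-analysis", 5),
]

# Sum the number of files arriving at each destination station.
files_into = Counter()
for source, destination, n_files in transfers:
    files_into[destination] += n_files

total_files = sum(files_into.values())
```

This gives 501 files in the three-hour window, of which 226 went into enstore and 166 into central-analysis.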
The Fabric
Compute systems and storage systems in: US (Fermilab, UTA, Columbia, MSU), France (Lyon-IN2P3), UK (Lancaster and Imperial College), Netherlands (NIKHEF), Czech Republic (Prague).
Many other sites are expected to provide additional compute and storage resources when the experiment moves from commissioning to physics data taking.
Storage systems consist of disk storage elements at all locations and robotically controlled tape libraries at Fermilab, Lyon, NIKHEF and (almost) Lancaster.
All storage elements support the basic functions of storing or retrieving a file. Some support parallel transfer protocols, currently via bbftp.
The underlying storage management systems for tape storage elements are different at Fermilab, Lyon and NIKHEF.
Fermilab's tape storage management system, Enstore, provides the ability to assign priorities and file placement instructions to file requests, and provides reports about placement of data on tape, queue wait time, transfer time and other information that can be used for resource management.
Interim Conclusions
SAM is a sophisticated tool for data transfer, and a less sophisticated tool for job submission.
SAM works now, and has real users!
SAM is an operational prototype of many of the concepts being developed for Grid computing.
Interim Conclusions
However, significant parts of SAM will have to be enhanced (or replaced) before it can truly claim to be a data grid.
This work will happen as part of the Particle Physics Data Grid (PPDG) project.
The following slides are extracts from Vicky White's talk "SAM and PPDG" at CHEP 2001.
Current status will be in black, planned enhancements will be in bold red.
[Diagram] SAM components by layer
Client Applications: Web; Python codes, Java codes; Command line; D0 Framework C++ codes; Request Formulator and Planner.
Collective Services: Request Manager, Cache Manager, Job Manager, Storage Manager; Job Services; Data Mover; SAM Resource Management; Batch Systems - LSF, FBS, PBS, Condor; Significant Event Logger, Naming Service, Database Manager, Catalog Manager; Catalog protocols.
Connectivity and Resource: CORBA, UDP; File transfer protocols - ftp, bbftp, rcp, GridFTP; Mass Storage systems protocols e.g. encp, hpss.
Authentication and Security: GSI; SAM-specific user, group, node, station registration; Bbftp cookie.
Fabric: Tape Storage Elements, Disk Storage Elements, Compute Elements, LANs and WANs, Code Repository, Resource and Services Catalog, Replica Catalog, Meta-data Catalog.
SAM component names shown: Dataset Editor, File Storage Server, Project Master, Station Master, Stager, Optimiser.
Legend: a name in quotes is the SAM-given software component name; marked components will be replaced, added or enhanced using PPDG and Grid tools.
Enhancing SAM
The Job Manager is limited and can only submit to local resources.
The specification of user jobs, including their characteristics and input datasets, is a major component of the PPDG work.
The intention is to provide Grid job services components that replace the SAM job services components. This will support job submission (including composite and parallel jobs) to suitable SAM Station(s) and eventually any available Grid computing resource.
Enhancing SAM
Unix user names, physics groups, nodes, domains and stations are registered. Valid combinations of these must be provided to obtain services.
Station servers at one station provide service on behalf of their local users and are trusted by other Station servers and Database Servers.
Use of Globus core Security Infrastructure services is a planned PPDG enhancement of the system.
Service registration and discovery is implemented using a CORBA naming service, with a namespace per station name.
APIs to services in SAM are all defined using the CORBA Interface Definition Language and have multiple language bindings (C++, Python, Java) and, in many cases, a shell interface.
Use of GridFTP and other standard protocols to access storage elements is a planned PPDG modification to the system.
Integration with grid monitoring tools and approaches is a PPDG area of research.
Registration of resources and services using a standardized Grid registration or enquiry protocol is a PPDG enhancement to the system.
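The rule that only registered combinations of user, group, node and station obtain service can be sketched as a simple membership check. This is an illustration, not SAM's registration schema: the `Registry` class and its methods are hypothetical, domains are omitted for brevity, and the node name `node01` is invented (the user, group and station names are borrowed from elsewhere in these slides purely as sample values).

```python
class Registry:
    """Hypothetical registry of valid (user, group, node, station) combinations."""

    def __init__(self):
        self._valid = set()

    def register(self, user, group, node, station):
        self._valid.add((user, group, node, station))

    def is_authorized(self, user, group, node, station):
        # Service is granted only if this exact combination was registered;
        # knowing each value individually is not enough.
        return (user, group, node, station) in self._valid


registry = Registry()
registry.register("walker", "dzero", "node01", "imperial-test")

ok = registry.is_authorized("walker", "dzero", "node01", "imperial-test")
wrong_station = registry.is_authorized("walker", "dzero", "node01", "lancs")
```

A check of this kind rests on trust between servers rather than on cryptographic credentials, which is the gap the planned Globus security work is meant to address.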
Enhancing SAM
Database Managers provide access to the Replica Catalog, Metadata Catalog, SAM Resource and Configuration Catalog and Transformation Catalog.
All catalogs are currently tables in a central Oracle database, a detail that is hidden from their clients.
Replication of some catalogs in two or more locations worldwide is a planned enhancement to the system.
Database Managers will need to be enhanced to adapt SAM-specific APIs and catalog protocols onto Grid catalog APIs using PPDG-supported Grid protocols, so that information may be published and retrieved in the wider Physics Data Grid that spans several virtual organizations.
A central Logging server receives significant events. This will be refined to receive only summary-level information, with more detailed monitoring information held at each site.
Work in the context of PPDG will examine how to use a Grid Monitoring Architecture and tools.
Enhancing SAM
Resource manager services are provided by an Optimization service.
File transfer actions are prioritized and authorized prior to being executed.
The current primitive functionality of re-ordering and grouping file requests, primarily to optimize access to tapes, will need to be greatly extended, redesigned and re-implemented to deal better with co-location of data with computing elements, and with fair-share and policy-driven use of all computing, storage and network resources.
This is a major component of the SAM/PPDG work, to be carried out in collaboration with the Condor team.
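Re-ordering and grouping file requests to optimize tape access can be sketched as follows. The slides do not describe the algorithm the Optimization service uses, so this is only one plausible reading: requests are grouped by tape so each tape is mounted once, and read in order of position within a tape. The function name, the request tuple layout and the tape labels are all hypothetical.

```python
from collections import defaultdict


def order_requests(requests):
    """Group requests by tape and sort by position within each tape.

    requests: list of (file_name, tape_label, position_on_tape) tuples.
    Returns a list of (tape_label, file_name) in the order to be served.
    """
    by_tape = defaultdict(list)
    for file_name, tape, position in requests:
        by_tape[tape].append((position, file_name))

    ordered = []
    # Tapes are served in the order in which they were first requested,
    # so each tape needs to be mounted only once.
    for tape, entries in by_tape.items():
        # Within a tape, read files in increasing position to avoid seeking back.
        for position, file_name in sorted(entries):
            ordered.append((tape, file_name))
    return ordered


requests = [
    ("f1", "T2", 30),
    ("f2", "T1", 10),
    ("f3", "T2", 5),
    ("f4", "T1", 2),
]
plan = order_requests(requests)
```

Served in arrival order these four requests would need four tape mounts; the grouped plan needs two. The sketch ignores priorities, fair shares and the location of compute resources, which are the extensions the slide says are needed.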
Enhancing SAM
Other enhancements are also needed for scalability, e.g. SAM relies on a single Oracle database, which is a single point of failure and needs replication or caching.
Etc.
Conclusions
SAM already does a lot and planned enhancements will give it far greater functionality.