1
The SAM-Grid Project
Gabriele Garzoglio
ODS, Computing Division, Fermilab
PPDG, DOE SciDAC
ACAT 2002, Moscow, Russia
June 26, 2002
2
Outline
The SAM-Grid Project
The SAM & JIM Architecture
–SAM: the Data Handling System
–JIM: the Job Management Infrastructure
–JIM: the Information and Monitoring System
The Current Grid Infrastructure
Milestones of the Deliverables
Conclusions
3
The scope of the project
Enable fully distributed computing for DZero and CDF by enhancing the experiments' distributed data handling system (SAM), incorporating standard Grid tools and protocols, and developing new solutions for Grid computing, in a secure and accountable environment.
The SAM ‘grid-ification’ is funded by PPDG and GridPP: we are working with both computer scientists, such as the Condor Team, and physicists, such as those at UTA and Imperial College.
We also collaborate with other groups working on Grid technologies (EDG and DataTAG among them).
There is close cooperation between Fermilab CD Departments and the Project (e.g. ISD for the SAM/DCache integration).
We promote interoperability and code reuse (via modularization and standardization).
CDF and DZero are running now! Short-term deliverables are due at the end of the Summer; long-term deliverables in 2 years.
4
Why a Job and Data Handling infrastructure?
Increases productivity in obtaining physics results
A high level of transparency to the user: maximizes the time the physicist spends doing physics
Enables worldwide analysis of the data
Efficient utilization of the resources: disks, mass storage systems, processing nodes, network…
Automatic bookkeeping: reproducibility + accountability
Extensibility to new standardized services and protocols via modularization and “plug-in” mechanisms
6
High Level Components
Information and Monitoring
Data Handling
Job Management
7
The Data Handling: SAM
[Diagram labels: Data Handling (DH Resource Management; Data Delivery and Caching; SAM); Information and Monitoring; Job Management. Legend: Principal Component, Service, Implementation or Library, Information.]
8
History
SAM stands for Sequential data Access via Meta-data.
It is a joint project between D0 and the Computing Division, started in 1997 to meet the Run II data handling needs.
SAM is integrated into DZero at all levels.
SAM is in its commissioning phase for CDF.
http://d0db.fnal.gov/sam
http://runIIcomputing.fnal.gov
9
SAM as a Distributed System
A Station is a collection of resources controlled by the SAM system.
SAM services can be accessed to monitor the status of the system.
The central Database Server has proven to be robust and reliable.
[Diagram labels: Database Server(s) (Central Database); Name Server; Global Resource Manager(s); Log server; Station 1, 2, 3, …, n Servers; Mass Storage System(s). Legend: Shared Globally, Local to Site, Shared Locally. Arrows indicate control and data flow.]
10
Components of a SAM Station
SAM is a distributed data movement and management service: data replication is achieved by the use of disk caches during file routing.
SAM is a fully functional meta-data catalog.
[Diagram labels: Station & Cache Manager; File Storage Server; File Stager(s); Project Managers; Producers/Consumers; eworkers; File Storage Clients; MSS or Other Station (twice); Cache Disk; Temp Disk. Legend: Data flow, Control.]
11
Accessibility of the Fabric via SAM Services
A station can access a remote resource via the services offered by other connected stations.
Service connectivity does not in general correspond to network connectivity.
Requests are routed from the originator to the destination.
File caching during routing leads to file replication.
More in Igor Terekhov’s talk: “Meta-Computing at DØ”
[Diagram labels: MSS1; Local Station 1 Cache1; Local Station 1 Cache2; Local Station 2 Cache1; Remote Station Cache1; MSS2; Remote Station Cache2.]
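The routing and caching behaviour described on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Station` class, `find_route`, `fetch`, the station names and the file name are all hypothetical and are not part of SAM. Service links are modelled as an undirected graph, and every station on the route keeps a copy of the file it forwards.

```python
from collections import deque


class Station:
    """Hypothetical stand-in for a station: a name, a disk cache, service links."""

    def __init__(self, name, files=()):
        self.name = name
        self.cache = set(files)
        self.peers = []

    def connect(self, other):
        # A service link, which need not mirror the network topology.
        self.peers.append(other)
        other.peers.append(self)


def find_route(origin, filename):
    """Breadth-first search over service links for a station holding the file."""
    queue = deque([[origin]])
    seen = {origin.name}
    while queue:
        path = queue.popleft()
        if filename in path[-1].cache:
            return path
        for peer in path[-1].peers:
            if peer.name not in seen:
                seen.add(peer.name)
                queue.append(path + [peer])
    return None


def fetch(origin, filename):
    """Route the request and cache the file at every station on the way back."""
    route = find_route(origin, filename)
    if route is None:
        raise LookupError(f"{filename} is not reachable from {origin.name}")
    for station in route:
        station.cache.add(filename)  # caching during routing creates replicas
    return [station.name for station in route]


local1 = Station("local-1")
local2 = Station("local-2")
remote = Station("remote", files={"run42.raw"})
local1.connect(local2)
local2.connect(remote)

first_route = fetch(local1, "run42.raw")   # goes through local-2 to remote
second_route = fetch(local1, "run42.raw")  # now served from the local cache
```

After the first request the intermediate station also holds a replica, so later requests from its neighbours no longer need to reach the remote station.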
12
Current Developments of SAM
Site Autonomy: the goal is to enable site installations of SAM and JIM to work even when disconnected from the network. Distributing the Replica and Meta-data Catalogs is a prerequisite for this.
Opportunistic deployment: to enable SAM and JIM to operate at full efficiency in a dynamic environment like the Grid, automatic deployment of stations at resources that are momentarily available is an interesting path to investigate.
13
The Job Management
[Diagram labels: Data Handling (DH Resource Management; Data Delivery and Caching; SAM); Job Management (Request Broker; Compute Element Resource; Site Gatekeeper; Job Scheduler; JH Client; Batch System; Condor-G; Condor MMS; GRAM; Grid sensors; (All) Job Status Updates); Information and Monitoring. Legend: Principal Component, Service, Implementation or Library, Information.]
14
The Job Description Language
User interface: the Job Description Language must be expressive enough to fully characterize the structure of the job (Monte Carlo and Analysis).
We are collaborating with the University of Texas Arlington to define the structure of a DZero (CDF) job.
[Diagram: Job Management component overview.]
15
The Request Broker
The Brokering Service is implemented using the Condor Match Making Service.
The idea is to use a stable technology in a new way.
Thanks to the collaboration with the Condor Team under PPDG, 2 features have been added to make this possible:
–Runtime selection of the remote execution site
–Execution of external code when negotiating the matches
[Diagram: Job Management component overview.]
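The two added features can be pictured with a toy matchmaker. This sketch is not the Condor Match Making Service or its ClassAd language: the dictionaries, the `match` function and the `external_data_locality` helper are hypothetical names chosen for illustration. A job states a requirement and a rank; the rank calls external code during the negotiation, and the execution site is picked only at match time.

```python
def external_data_locality(job, resource):
    """Hypothetical external code consulted while negotiating a match.

    Returns the fraction of the job's input files already cached at the site.
    """
    needed = set(job["input_files"])
    if not needed:
        return 0.0
    return len(needed & set(resource["cached_files"])) / len(needed)


def match(job, resources):
    """Select the execution site at run time: filter by requirements, then rank."""
    candidates = [r for r in resources if job["requirements"](r)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: job["rank"](job, r))


# Resource descriptions, as sensors might advertise them (made-up values).
resources = [
    {"site": "site-a", "free_cpus": 4, "cached_files": ["f1"]},
    {"site": "site-b", "free_cpus": 8, "cached_files": ["f1", "f2"]},
    {"site": "site-c", "free_cpus": 0, "cached_files": ["f1", "f2"]},
]

job = {
    "input_files": ["f1", "f2"],
    "requirements": lambda r: r["free_cpus"] > 0,  # site-c is excluded
    "rank": external_data_locality,                # prefer sites holding the data
}

chosen = match(job, resources)
```

Here the rank favours the site that already caches all input files, which is one plausible way a data handling system could inform brokering decisions.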
16
The Job Submission Service
The job submission service relies on standard Condor technologies.
It implements a high level of robustness to service failures and loss of connectivity.
[Diagram: Job Management component overview.]
17
The Job Submission Mechanism (I)
Physical job dispatch is achieved via the GRAM protocol from the Globus Toolkit.
When applicable, executables, configuration files, stdio and stderr are transported via GASS servers.
Gatekeepers deployed at each site serve client requests for job submission.
[Diagram: Job Management component overview.]
18
The Job Submission Mechanism (II)
A Gatekeeper authenticates and authorizes the client via the Globus Security Infrastructure.
After authentication and authorization (AA), the Gatekeeper spawns a Job Manager that submits the job to the local batch system, reports the status to the submission client (Condor-G), and cleans up after job termination.
[Diagram: Job Management component overview.]
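The sequence on this slide (authorize, submit, report, clean up) can be sketched as a small state flow. This is an illustrative model only, not the Globus Gatekeeper or GRAM API: the `Gatekeeper` class, the `grid_map` dictionary, the status strings and the example subject are assumptions, and authentication is reduced to a lookup of the client's certificate subject.

```python
class Gatekeeper:
    """Toy gatekeeper: maps a client identity to a local account, then runs a job."""

    def __init__(self, grid_map, batch_submit):
        self.grid_map = grid_map          # certificate subject -> local account
        self.batch_submit = batch_submit  # callable standing in for the batch system

    def handle(self, subject, job, report):
        local_user = self.grid_map.get(subject)
        if local_user is None:
            report("REJECTED")            # authorization failed, nothing is spawned
            return False
        # From here on we play the role of the spawned job manager.
        scratch = list(job["files"])      # files staged in for this job
        report("PENDING")
        try:
            self.batch_submit(local_user, job["executable"])
            report("DONE")
            return True
        except RuntimeError:
            report("FAILED")
            return False
        finally:
            scratch.clear()               # clean up after job termination


statuses = []
gatekeeper = Gatekeeper(
    {"/O=Grid/CN=Example User": "d0user"},
    batch_submit=lambda user, executable: None,  # pretend the job ran fine
)
job = {"executable": "mc_gen", "files": ["mc.cfg"]}
accepted = gatekeeper.handle("/O=Grid/CN=Example User", job, statuses.append)
denied = gatekeeper.handle("/O=Grid/CN=Unknown", job, statuses.append)
```

The `report` callback plays the part of the status updates sent back to the submission client.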
19
The Fabric (I)
The batch systems currently supported by the Gatekeeper include LSF, PBS, Condor and FBS.
In our architecture, Grid Sensors are deployed at the compute elements as well as at the local submission nodes. The Sensors report static and small-size dynamic states to the Information and Monitoring System.
[Diagram: Job Management component overview.]
20
The Fabric (II)
Which attributes best describe resources is still a research topic. The choice of such a schema has implications for the semantics of the JDL.
We are collaborating with DataTAG and EDG to find a common Glue Schema in order to enable interoperability of EU and US Grids.
[Diagram: Job Management component overview.]
21
Information Flow
[Diagram labels: User Interface; Parser; JDL; ClassAd; Condor Schedd; Condor-G; Condor Grid Manager; Condor Negotiator; Condor Collector; External Code; Information and Monitoring; Grid Sensors; GRAM; Gatekeeper; Batch System; Compute Resource; Execution Site; Cin; Cout.]
22
Monitoring and Information: the glue
[Diagram labels: Data Handling (DH Resource Management; Data Delivery and Caching; SAM); Job Management (Request Broker; Compute Element Resource; Site Gatekeeper; Job Scheduler; JH Client; Batch System; Condor-G; Condor MMS; GRAM; Grid sensors; (All) Job Status Updates); Monitoring and Information (Logging and Bookkeeping; Info Processor and Converter; Replica Catalog; Resource Info; AAA; GSI; MDS-2; Condor Class Ads; Grid RC). Legend: Principal Component, Service, Implementation or Library, Information.]
23
Status Monitor
Resource and Information Service implementations:
–Meta Directory Service from the Globus Toolkit (LDAP protocol)
–Condor Components (ClassAds)
MDS automatically discards old information and pulls new information from information providers. It is well suited for the run-time monitoring of the system.
[Diagram: Monitoring and Information component overview.]
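The "discard old information, pull new information" behaviour can be sketched as a cache with a time-to-live. This is a generic illustration and not the MDS implementation or its LDAP interface: `SoftStateCache`, the `free_cpus` provider and the 30-second lifetime are hypothetical. A fake clock is injected so the example is deterministic.

```python
import time


class SoftStateCache:
    """Serve cached values until they expire, then pull again from the provider."""

    def __init__(self, providers, ttl_seconds, clock=time.monotonic):
        self.providers = providers  # name -> callable returning the current value
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}           # name -> (timestamp, value)

    def get(self, name):
        now = self.clock()
        entry = self.entries.get(name)
        if entry is None or now - entry[0] > self.ttl:
            value = self.providers[name]()     # stale or missing: pull fresh data
            self.entries[name] = (now, value)  # the old entry is discarded
            return value
        return entry[1]


fake_now = [0.0]  # mutable cell acting as a controllable clock
pulls = []


def free_cpus():
    """Hypothetical information provider; each pull sees one CPU fewer."""
    pulls.append(fake_now[0])
    return 8 - len(pulls)


cache = SoftStateCache({"free_cpus": free_cpus}, ttl_seconds=30,
                       clock=lambda: fake_now[0])

first = cache.get("free_cpus")   # no entry yet, so the provider is queried
fake_now[0] = 10.0
second = cache.get("free_cpus")  # still fresh, served from the cache
fake_now[0] = 45.0
third = cache.get("free_cpus")   # expired, so the provider is queried again
```

Expiry keeps the view of the system current without the providers having to push updates, which suits small, frequently changing monitoring data.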
24
Logging and Bookkeeping
SAM provides a UDP-based message logger. Persistency is implemented via a plug-able back-end module.
SAM servers already use the logger, which results in a valuable debugging tool.
We are going to extend the use of this service to JIM. Messages will be stored in XML format.
[Diagram: Monitoring and Information component overview.]
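A minimal sketch of what a UDP logging client with XML-formatted messages could look like, using only the Python standard library. The element name `message`, its `source` and `severity` attributes and both function names are assumptions made for illustration; the slides do not specify the SAM logger's actual message schema or wire format.

```python
import socket
import xml.etree.ElementTree as ET


def format_message(source, severity, text):
    """Render one log record as a small XML document (hypothetical schema)."""
    elem = ET.Element("message", {"source": source, "severity": severity})
    elem.text = text
    return ET.tostring(elem, encoding="unicode")


def send_log(host, port, source, severity, text):
    """Fire-and-forget delivery: UDP does not wait for, or require, a reply."""
    payload = format_message(source, severity, text).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Formatting only; nothing is sent in this example.
record = format_message("station-1", "info", "file delivered <ok>")
parsed = ET.fromstring(record)
```

Because UDP is connectionless, a slow or unavailable log server does not block the sender; the trade-off is that individual messages may be lost.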
25
The Replica Catalog
The Replica Catalog is currently implemented with SAM.
We plan to migrate to the Grid Replica Catalog, in order to allow distribution of the service and to provide a set of standardized interfaces to external services.
[Diagram: Monitoring and Information component overview.]
26
Information Conversion and Accessibility
A translation service is responsible for converting between the 3 protocols used, when needed: LDAP, ClassAd, XML.
We are evaluating web portal frameworks to enable access to the system from the internet.
[Diagram: Monitoring and Information component overview.]
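As a rough picture of such a translation service, the sketch below renders one set of resource attributes in three simplified textual forms. These are illustrative approximations only: the function names, the `resource` element, the attribute names and the distinguished name are hypothetical, and the output is not guaranteed to be accepted by a real LDAP server or ClassAd parser (for instance, no escaping of special characters is done in the ClassAd and LDIF forms).

```python
import xml.etree.ElementTree as ET


def to_classad(attrs):
    """ClassAd-like text: one 'Name = value' per line, strings quoted."""
    lines = []
    for key, value in attrs.items():
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"{key} = {rendered}")
    return "\n".join(lines)


def to_ldif(dn, attrs):
    """LDIF-like text: a distinguished name followed by 'attribute: value' lines."""
    lines = [f"dn: {dn}"] + [f"{key}: {value}" for key, value in attrs.items()]
    return "\n".join(lines)


def to_xml(attrs):
    """XML text: one child element per attribute under a 'resource' root."""
    root = ET.Element("resource")
    for key, value in attrs.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


resource = {"Name": "site-a", "FreeCpus": 4}

as_classad = to_classad(resource)
as_ldif = to_ldif("cn=site-a", resource)
as_xml = to_xml(resource)
```

Keeping one neutral in-memory representation and rendering it per target format avoids writing a separate converter for every pair of formats.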
27
Site AAA
The Job Management Infrastructure and the Monitoring and Information System are built on top of standard grid tools and adopt the GSI security mechanisms.
The full integration of the Data Handling System with GSI is work in progress…
Open issue: the management of the AA map files
[Diagram: Monitoring and Information component overview.]
29
The Current Grid Infrastructure
[Diagram labels: sites FNAL, IC, UTA; Node_1 (GRAM, Condor-G); Node_2 (GRAM, PBS); Node_3 (GRAM, Fork); Node_4 (GRAM, Condor); Node_1 (GRAM, Condor, Condor-G); Node_1 (GRAM, LSF, Condor-G); client; Info.]
31
The Organization: a Collaborative Effort
We hold weekly meetings to coordinate efforts on the DZero/CDF SAM Grid Project.
Participants are from UK institutions, NIKHEF, INFN and US institutions.
We discuss deliverables, design, implementation.
The real pressure comes from the experiments that are taking data now!
32
The Short Term Project Goals
Deployment of JIM to enable execution of unstructured Monte Carlo jobs with basic brokering (end of Summer)
Status Monitoring of unstructured jobs (end of Summer)
Basic System Monitoring (end of Summer)
Execution of unstructured SAM analysis jobs with basic brokering (end of the year)
33
The 2yr-Term Project Goals
Reliable Execution of structured, locally distributed Monte Carlo and SAM analysis jobs with basic brokering.
Scheduling criteria for data-intensive jobs, full Job Handling – Data Handling interaction.
Fully Distributed Monitoring and Information Services for Structured Jobs and Data Handling.
34
The Milestones Dependencies
[Diagram: dependency graph of milestones along a timeline (Now, 6 Mo, 9-19 Mo). Labels: Use Cases; Study JDLs; UC doc; Arch. Doc; Tech. Rev. doc.; Job Def Doc; JDL; Condor; GMA, MDS; GSI; SAM; GSI In SAM; Condor In SAM; Basic SAM Res Info Service; MDS TestBed; CondorG TestBed; Toy Grid with JSS, basic Monitoring; Execute User-routed MC Jobs; Prototype Grid with RB, JSS, GMA-based MIS; SAM Grid-ready; Status Monitoring of unstructured jobs; Basic System Monitoring; Execute unstructured MC and SAM analysis jobs with basic brokering; Execute unstructured SAM analysis jobs; Reliable Execution of structured, locally distributed MC and SAM analysis jobs with basic brokering; Scheduling criteria for data-intensive jobs, JH-DH interaction design; Monitoring of structured jobs; DH Monitoring; JH, MIS fully distributed.]
35
Conclusions
SAM is the Data Handling System of the DZero experiment and is in its commissioning phase for CDF.
The SAM-Grid project has the goal of integrating SAM with standard grid technologies to enable fully distributed computing for DZero and CDF.
The Brokering service of the Grid Architecture of the project is based on the Condor Match Making Service.
We are funded by PPDG and GridPP and we collaborate with Grid groups in the US and EU to best tailor and develop the technologies for the experiments.
We are deploying a test bed in the US and EU to develop and test SAM and JIM.
The experiments are running now! The closest delivery milestones are at the end of the Summer and at the end of the year.
http://www-d0.fnal.gov/computing/grid/