Download presentation
Presentation is loading. Please wait.
Published byDonald Carr Modified over 9 years ago
1
April 25, 2006Parag Mhashilkar, Fermilab1 Resource Selection in OSG & SAM-On-The-Fly Parag Mhashilkar Fermi National Accelerator Laboratory Condor Week 2006 April 25, 2006
2
Parag Mhashilkar, Fermilab2 Resource Selection in OSG Overview Why Resource Selection Service? Resource Selection Service in OSG Collaborators Involved Resource Selection Service Architecture Current Status Future Work
3
April 25, 2006Parag Mhashilkar, Fermilab3 Why Resource Selection Service? A job can Have special requirements. Example: disk > 1GB, memory > 256MB Resources can Provide special services. Example: disk > 5GB, memory > 512 MB, Software toolkit-X installed, etc. Without a resource selection service User has to keep track of availability of every resource that can run the job. Resource selection service can Gather the information about the job and resources and make decision where the job should run. Dereference abstract attributes to bind to the job during match- making or execution time. Resource Selection Service 1 2 3 ……...N Resources Job
4
April 25, 2006Parag Mhashilkar, Fermilab4 Resource Selection Service (ReSS) in OSG The Resource Selector is a component of the OSG Job Management Infrastructure. Sponsored by PPDG, the project started in Sep 2005, with an aim to develop and deploy a Resource Selection Service that VOs with requirements on job management similar to DZero can use. Requirements that ReSS should support – community of 100 users, submitting jobs to 10 job schedulers. 10,000 jobs per day, with bursts of 2,000 per hour. 100 clusters job and resource descriptions in classad format with 200 attributes and 5Kb of information. With ReSS Emphasis is on supporting several Virtual Organizations (VO) based on policies. VOs can tag resources which are certified to run their jobs making resource selection more manageable.
5
April 25, 2006Parag Mhashilkar, Fermilab5 Collaborators Involved VOs DZero Atlas LIGO FermiGrid Fermilab OSG TG-MIG group CEMon group from INFN Condor group from UW Madison GLUE group from INFN
6
April 25, 2006Parag Mhashilkar, Fermilab6 Resource Selection Service Architecture Condor Match Maker Info Gatherer classads Condor Scheduler job What Gate? Gate 3 job CEMon CE Gate1 job-managers jobsinfo CLUSTER GIP CEMon CE Gate2 job-managers jobsinfo CLUSTER GIP CEMon CE Gate3 job-managers jobsinfo CLUSTER GIP
7
April 25, 2006Parag Mhashilkar, Fermilab7 Architecture … Generic Information Provider (GIP) describes resources in LDIF format using GLUE Schema. CEMon provides flexible plug-in mechanism to translate classads. Information Gatherer (IG) Subscribes to several CEMons to gather the information about the CEs and advertises it to several condor pools. It acts as an adapter between CEMons and Condor matchmaker. Support for callouts to external match-making functions. These functions can make match-making more extensible.
8
April 25, 2006Parag Mhashilkar, Fermilab8 Current Status First release of the ReSS is scheduled to be included in OSG ITB-0.5.0 Focus on testing functionality, scalability and stress test of Information Gatherer. Validate Classads from different sites so they can be used for common resource selection criteria. Study the scalability and investigate how IG handles O(10) CEMon registrations and O(100) classad processing and transferring to the condor_collector. Stress test study of the IG. Simulate the load of the production environment by increasing 10 times the frequency of classad publication by the O(10) CEMon's. Stress test the match making infrastructure submitting O(1) job/sec for 1 hour. In particular, and push the limits ….. Evaluate the efficiency of the condor_negotiator using call-out to external code for match-making.
9
April 25, 2006Parag Mhashilkar, Fermilab9 Future Work Working on deployment procedures for OSG production in context of VDT. Work with other VOs with requirements similar to mentioned earlier and extend the support of ReSS for other VOs. Improve the scalability of ReSS beyond the RunII experiments. Have end-to-end Samgrid-OSG integration by OSG 0.6.0
10
April 25, 2006Parag Mhashilkar, Fermilab10 Sam-on-the-fly Overview What is SAM? Why sam-on-the-fly? Addressing the Challenges Current Status
11
April 25, 2006Parag Mhashilkar, Fermilab11 What is SAM? Samgrid consists of Job Management (JIM) Data Management (SAM) SAM stands for ‘Sequential Access via Metadata’ (SAM). The project was started in 1997 by DZero SAM is organized around the concepts of a dataset (Catalog of file metadata). Experiments: DZero, CDF, MINOS Samgrid Job Management (JIM) Data Management (SAM)
12
April 25, 2006Parag Mhashilkar, Fermilab12 Why Sam-on-the-fly? Sites have resources that are available for longer duration. For example cluster at UW has 1TB disk for DZero users for next 2 months. SAM-on-the-fly tries to address the issue of making the resources available for the users dynamically. Before DZero users can use this resource, there is a need to Deploy and configure SAM services like Station (collection of resources controlled by SAM system) Stager (service to handle staging of files on disk used by SAM) FSS (service to interface with the FS) File transferring services like gridftp, sam_fcp, etc. Register SAM services with central SAM DB Start and Stop SAM station services. Do the cleanup when the lease period expires. Firewall and security configurations.
13
April 25, 2006Parag Mhashilkar, Fermilab13 Addressing the Challenges Job 1.Deploy and configure SAM 2.Register SAM services with the SAM system 3.Start SAM services for the duration of lease 4.When the lease expires, stop SAM 5.Do the cleanup Resource
14
April 25, 2006Parag Mhashilkar, Fermilab14 Current Status Automated the product deployment steps. Semi-Automated the SAM services registration steps. Automated starting and stopping of SAM services. This project is a work in progress. People: Fermi National Accelerator Laboratory University of Wisconsin Madison: Alain Roy and Hidayat Teonadi.
15
April 25, 2006Parag Mhashilkar, Fermilab15 References Resource Selection Service for OSG http://www.opensciencegrid.org http://osg.ivdgl.org/twiki/bin/view/ResourceSelection/WebHome SAM http://projects.fnal.gov/samgrid Thanks to Miron and Condor Group for all the support! Questions?
16
April 25, 2006Parag Mhashilkar, Fermilab16 SAM as a Distributed System Database Server(s) (Central Database) Station 1 Servers Station 2 Servers Station 3 Servers Station n Servers Mass Storage System(s) Shared Globally Local To Site Shared Locally Arrows indicate Control and data flow Name Server Global Resource Manager(s) services SAM services can be accessed to monitor the status of the system
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.