1
Gathering at the Well: Creating Communities for Grid I/O
Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny
Computer Sciences Department, UW-Madison
2
4 earth-shattering revelations
1) The grid is big.
2) Scientific data-sets are large.
3) Idle resources are available.
4) Locality is good.
3
How to optimize job placement on the grid?
› Move data to the jobs.
› Move jobs to the data.
› Allow jobs to access data remotely.
› Need framework for evaluation.
I/O communities
4
I/O Communities
[Diagram: the UW and INFN I/O communities]
5
I/O communities are an old idea
› File servers and administrative domains
› But, we want more flexible boundaries
  simple mechanism by which users can express I/O community relationships
6
I/O communities
› A mechanism that allows jobs to move to data, data to move to jobs, or data to be accessed remotely
› A framework to evaluate these policies
7
Grocers, butchers, cops
› Members of an I/O community:
  Storage appliances
  Interposition agents
  Scheduling systems
  Discovery systems
  Match-makers
  A collection of CPUs
8
Storage appliances
› Should run without special privilege
  Flexible and easily deployable
  Acceptable to nervous sys admins
› Should allow multiple access modes
  Low latency local accesses
  High bandwidth remote puts and gets
9
Interposition agents
› Thin software layer interposed between application and OS
› Allow applications to transparently interact with storage appliances
› Unmodified programs can run in grid environment
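PFS and Bypass do this interposition at the system-call level for arbitrary binaries. As a rough illustration of the idea only (not the actual PFS mechanism), the Python sketch below wraps the file-open call so that unmodified code naming a logical /cmsdata path is transparently redirected; the path names and local mount point are hypothetical.

```python
# Conceptual sketch of interposition (not PFS/Bypass itself): wrap the
# file-open call so unmodified code that opens logical "/cmsdata/..."
# paths is silently redirected to a hypothetical local view of the
# nearest storage appliance.
import builtins

_real_open = builtins.open
STORAGE_MOUNT = "/tmp/nearest-storage"      # hypothetical appliance mount point

def interposed_open(path, *args, **kwargs):
    """Redirect accesses under the logical /cmsdata namespace."""
    if isinstance(path, str) and path.startswith("/cmsdata/"):
        path = STORAGE_MOUNT + path         # e.g. /tmp/nearest-storage/cmsdata/db
    return _real_open(path, *args, **kwargs)

builtins.open = interposed_open             # application code below needs no changes
```

The real agent performs the same kind of redirection underneath the standard I/O calls, which is why the simulator binary never has to be recompiled or relinked.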
10
Scheduling systems and discovery
› Top level scheduler needs ability to discover diverse resources
› CPU discovery: Where can a job run?
› Device discovery: Where is my local storage appliance?
› Replica discovery: Where can I find my data?
11
Match-making
› Match-making is the glue which brings discovery systems together
› Allows participants to indirectly identify each other
  i.e. can locate resources without explicitly naming them
12
Mechanisms, not policies
› I/O communities are a mechanism, not a policy
› A higher layer is expected to choose application-appropriate policies
› We will, however, demonstrate the strength of the mechanism by defining appropriate policies for one particular application
13
Experimental results
› Implementation
› Environment
› Application
› Measurements
› Evaluation
14
Implementation
› NeST storage appliance
› Pluggable File System (PFS) interposition agent, built with Bypass
› Condor and ClassAds
  scheduling system
  discovery system
  match-maker
15
Two I/O communities
› INFN Condor pool
  236 machines, about 30 available at any one time
  Wide range of machines and networks spread across Italy
  Storage appliance in Bologna: 750 MIPS, 378 MB RAM
16
Two I/O communities
› UW Condor pool
  911 machines, 100 dedicated for us
  Each is 600 MIPS, 512 MB RAM
  Networked on 100 Mb/s switch
  One was used as a storage appliance
17
CMS simulator sample run
› Purposely chose a run with a high I/O-to-CPU ratio
› Accesses about 20 MB of data from a 300 MB database
› Writes about 1 MB of output
› ~160 seconds execution time on a 600 MIPS machine with local disk
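A back-of-the-envelope sketch (not from the original slides, derived only from the numbers above) makes the "high I/O-to-CPU ratio" concrete and shows why placement matters for the full workload:

```python
# Rough numbers for the sample run described above.
reads_mb, writes_mb = 20, 1        # data touched per job
runtime_s = 160                    # execution time per job on a 600 MIPS machine
jobs = 300                         # instances the scientist needs to run
db_mb = 300                        # size of the shared database

per_job_bw = (reads_mb + writes_mb) / runtime_s   # ~0.13 MB/s sustained per job
total_reads_mb = reads_mb * jobs                  # ~6000 MB read across the workload
reuse = total_reads_mb / db_mb                    # the 300 MB database is read ~20x over

print(f"{per_job_bw:.2f} MB/s per job, {total_reads_mb} MB total reads, "
      f"~{reuse:.0f}x re-reference of the database")
```

A single job's bandwidth demand is modest, but the workload as a whole re-reads the small database many times over, which is exactly the kind of locality that staging data near the CPUs can exploit.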
18
Assume the position
› We assumed the role of an Italian scientist
› Database stored in Bologna
› Need to run 300 instances of simulator
› How to take advantage of UW pool? Three-way matching
19
Three-way matching
[Diagram: a Job Ad, a Machine Ad, and a Storage Ad (from a NeST) are matched together; the Job Ad refers to NearestStorage, and the Machine Ad knows where NearestStorage is.]
20
Two-way ClassAds

Job ClassAd:
  Type = "job"
  TargetType = "machine"
  Cmd = "sim.exe"
  Owner = "thain"
  Requirements = (OpSys == "linux")

Machine ClassAd:
  Type = "machine"
  TargetType = "job"
  OpSys = "linux"
  Requirements = (Owner == "thain")
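As a rough sketch of what the match-maker does with such a pair of ads (simplified here in Python; real ClassAds use a declarative expression language), each side's Requirements is evaluated against the other ad, and a match needs both to hold:

```python
# Simplified two-way matching: both ads' Requirements must be satisfied.
job_ad = {
    "Type": "job", "Cmd": "sim.exe", "Owner": "thain",
    "Requirements": lambda other: other.get("OpSys") == "linux",
}
machine_ad = {
    "Type": "machine", "OpSys": "linux",
    "Requirements": lambda other: other.get("Owner") == "thain",
}

def two_way_match(a, b):
    """A match requires each ad to accept the other."""
    return a["Requirements"](b) and b["Requirements"](a)

print(two_way_match(job_ad, machine_ad))    # True: each side accepts the other
```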
21
Three-way ClassAds

Job ClassAd:
  Type = "job"
  TargetType = "machine"
  Cmd = "sim.exe"
  Owner = "thain"
  Requirements = (OpSys == "linux") && NearestStorage.HasCMSData

Machine ClassAd:
  Type = "machine"
  TargetType = "job"
  OpSys = "linux"
  Requirements = (Owner == "thain")
  NearestStorage = (Name == "turkey") && (Type == "Storage")

Storage ClassAd:
  Type = "storage"
  Name = "turkey.cs.wisc.edu"
  HasCMSData = true
  CMSDataPath = "/cmsdata"
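Extending the same sketch to the three-way case (again a simplification, not the actual ClassAd implementation): the machine ad's NearestStorage expression picks out a storage ad, and the job's Requirements are then evaluated with that storage ad in scope, so the job can demand HasCMSData without ever naming a particular server.

```python
# Simplified three-way matching: the machine identifies its nearest storage
# appliance, and the job's Requirements see that ad as NearestStorage.
storage_ad = {"Type": "storage", "Name": "turkey.cs.wisc.edu",
              "HasCMSData": True, "CMSDataPath": "/cmsdata"}

machine_ad = {
    "Type": "machine", "OpSys": "linux",
    # the machine knows which storage ad describes its nearest appliance
    "NearestStorage": lambda s: s["Type"] == "storage" and s["Name"].startswith("turkey"),
}

job_ad = {
    "Type": "job", "Cmd": "sim.exe", "Owner": "thain",
    # the OpSys constraint plus a reference to the machine's nearest storage
    "Requirements": lambda machine, storage:
        machine["OpSys"] == "linux" and storage["HasCMSData"],
}

def three_way_match(job, machine, storage_ads):
    """Bind NearestStorage from the machine ad, then test the job against both."""
    nearest = next((s for s in storage_ads if machine["NearestStorage"](s)), None)
    return nearest is not None and job["Requirements"](machine, nearest)

print(three_way_match(job_ad, machine_ad, [storage_ad]))    # True
```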
22
Policy specification
› Run anywhere the data is available
  Requirements = (NearestStorage.HasCMSData)
› Run local only
  Requirements = (NearestStorage.Name == "nestore.bologna")
› Run local first
  Requirements = (NearestStorage.HasCMSData)
  Rank = (NearestStorage.Name == "nestore.bologna") ? 10 : 0
› Arbitrarily complex
  Requirements = (NearestStorage.Name == "nestore.bologna") || (ClockHour < 7) || (ClockHour > 18)
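The "run local first" policy works because Requirements only decides which matches are acceptable, while Rank orders them; a small sketch of that ordering (hypothetical candidate list, simplified semantics):

```python
# Requirements filters acceptable matches; Rank chooses among them.
candidates = [
    {"NearestStorageName": "turkey.cs.wisc.edu", "HasCMSData": True},
    {"NearestStorageName": "nestore.bologna",    "HasCMSData": True},
]

def requirements(c):                 # run anywhere the data is available
    return c["HasCMSData"]

def rank(c):                         # but prefer the Bologna appliance
    return 10 if c["NearestStorageName"] == "nestore.bologna" else 0

acceptable = [c for c in candidates if requirements(c)]
best = max(acceptable, key=rank)
print(best["NearestStorageName"])    # nestore.bologna
```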
23
Policies evaluated
› INFN local
› UW remote
› UW stage first
› UW local (pre-staged)
› INFN local, UW remote
› INFN local, UW stage
› INFN local, UW local
24
Completion Time
[Graph: job completion time under each policy]
25
CPU Efficiency
[Graph: CPU efficiency under each policy]
26
Conclusions
› Locality is good
› I/O communities are a natural structure to expose this locality
› Users can use I/O communities to easily express different job placement policies
27
Future work
› Automation
  Configuration of communities
  Dynamically adjust size as load dictates
› Automation
  Selection of movement policy
› Automation
28
For more info
› Condor: http://www.cs.wisc.edu/condor
› ClassAds: http://www.cs.wisc.edu/condor/classad
› PFS: http://www.cs.wisc.edu/condor/pfs
› NeST: http://www.nestproject.org
29
Local only
30
Remote only
31
Both local and remote
32
Grid applications have demanding I/O needs
› Petabytes of data in tape repositories
› Scheduling systems have demonstrated that there are idle CPUs
› Some systems
  move jobs to data
  move data to jobs
  allow jobs remote access to data
› No one approach is always "best"
33
Easy come, easy go
› In a computation grid, resources are very dynamic
› Programs need rich methods for finding and claiming resources
  CPU discovery
  Device discovery
  Replica discovery
34
Bringing it all together
[Diagram: a Job and its Agent at an execution site, the CPU/Device/Replica discovery systems, a storage appliance reached by short-haul I/O, and a distributed repository reached by long-haul I/O.]
35
Conclusions
› Locality is good
› Balance point between staging data and accessing it remotely is not static
  depends on specific attributes of the job: data size, expected degree of re-reference, etc.
  depends on performance metric: CPU efficiency or job completion time