
Slide 1: Gathering at the Well: Creating Communities for Grid I/O
Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny
Computer Sciences Department, UW-Madison

Slide 2: New framework needed
› Remote I/O is possible anywhere
› Build a notion of locality into the system?
› What are the possibilities?
  • Move the job to the data
  • Move the data to the job
  • Allow the job to access the data remotely
› Need a framework to expose these policies

Slide 3: Key elements
› Storage appliances, interposition agents, schedulers, and match-makers
› Mechanism, not policies
› Policies are exposed to an upper layer
  • We will, however, demonstrate the strength of this mechanism

Slide 4: To infinity and beyond
› Speedups of 2.5x are possible when we are able to use locality intelligently
› This will continue to be important
  • Data sets are getting larger and larger
  • There will always be bottlenecks

Slide 5: Outline
› Motivation
› Components
› Expressing locality
› Experiments
› Conclusion

Slide 6: I/O communities
› A mechanism which allows either
  • jobs to move to data, or
  • data to move to jobs, or
  • data to be accessed remotely
› A framework to evaluate these policies

Slide 7: Grocers, butchers, cops
› Members of an I/O community:
  • Storage appliances
  • Interposition agents
  • Scheduling systems
  • Discovery systems
  • Match-makers
  • A collection of CPUs

Slide 8: Storage appliances
› Should run without special privilege
  • Flexible and easily deployable
  • Acceptable to nervous system administrators
› Should allow multiple access modes
  • Low-latency local accesses
  • High-bandwidth remote puts and gets

Slide 9: NeST
(Architecture diagram: a common protocol layer speaking GFTP, Chirp, HTTP, and FTP, with multiple concurrencies, feeds a dispatcher; the dispatcher coordinates a transfer manager and a storage manager above the physical storage layer, with control flow and data flow shown separately.)

Slide 10: Interposition agents
› A thin software layer interposed between the application and the OS
› Allows applications to transparently interact with storage appliances
› Unmodified programs can run in a grid environment

Slide 11: PFS: Pluggable File System

Slide 12: Scheduling systems and discovery
› The top-level scheduler needs the ability to discover diverse resources
› CPU discovery: where can a job run?
› Device discovery: where is my local storage appliance?
› Replica discovery: where can I find my data?
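In the three-way matching scheme this deck builds up (slides 21-23), the last two questions are answered with ClassAd attributes. A minimal sketch, reusing the example host and path from the ads on slide 23:

Device discovery, from a machine ad that names its community's appliance indirectly:

    NearestStorage = (Name == "turkey.cs.wisc.edu") && (Type == "Storage")

Replica discovery, from a storage ad that advertises the data it holds:

    HasCMSData = true
    CMSDataPath = "/cmsdata"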

Slide 13: Match-making
› Match-making is the glue that brings the discovery systems together
› Allows participants to identify each other indirectly
  • i.e., they can locate resources without explicitly naming them

Slide 14: Condor and ClassAds

Slide 15: Outline
› Motivation
› Components
› Expressing locality
› Experiments
› Conclusion

Slide 16: I/O Communities
(Diagram: the two example communities, UW and INFN.)

Slide 17: Two I/O communities
› INFN Condor pool
  • 236 machines, about 30 available at any one time
  • Wide range of machines and networks spread across Italy
  • Storage appliance in Bologna: 750 MIPS, 378 MB RAM

Slide 18: Two I/O communities
› UW Condor pool
  • ~900 machines, 100 dedicated for us
  • Each is 600 MIPS with 512 MB RAM
  • Networked on a 100 Mb/s switch
  • One machine was used as a storage appliance

Slide 19: Who Am I This Time?
› We assumed the role of an Italian scientist
› The database is stored in Bologna
› We need to run 300 instances of a simulator

Slide 20: Hmmm…

Slide 21: Three-way matching
(Diagram: a job ad, a machine ad, and a storage ad meet at the match-maker. The job ad refers to NearestStorage; the machine ad knows where NearestStorage is, pointing at its community's NeST.)

Slide 22: Two-way ClassAds

Job ClassAd:

    Type = "job"
    TargetType = "machine"
    Cmd = "sim.exe"
    Owner = "thain"
    Requirements = (OpSys == "linux")

Machine ClassAd:

    Type = "machine"
    TargetType = "job"
    OpSys = "linux"
    Requirements = (Owner == "thain")

Slide 23: Three-way ClassAds

Job ClassAd:

    Type = "job"
    TargetType = "machine"
    Cmd = "sim.exe"
    Owner = "thain"
    Requirements = (OpSys == "linux") && NearestStorage.HasCMSData

Machine ClassAd:

    Type = "machine"
    TargetType = "job"
    OpSys = "linux"
    Requirements = (Owner == "thain")
    NearestStorage = (Name == "turkey.cs.wisc.edu") && (Type == "Storage")

Storage ClassAd:

    Type = "storage"
    Name = "turkey.cs.wisc.edu"
    HasCMSData = true
    CMSDataPath = "/cmsdata"

Slide 24: Outline
› Motivation
› Components
› Expressing locality
› Experiments
› Conclusion

Slide 25: BOOM!

Slide 26: CMS simulator sample run
› We purposely chose a run with a high I/O-to-CPU ratio
› Accesses about 20 MB of data from a 300 MB database
› Writes about 1 MB of output
› ~160 seconds execution time
  • on a 600 MIPS machine with local disk
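These numbers frame the staging-versus-remote-access tradeoff. A rough back-of-the-envelope estimate (ignoring the 1 MB of output, re-reference, and contention):

    Remote access: ~20 MB crosses the wide area per job, so 300 jobs move ~6 GB
    Staging:       the 300 MB database crosses once per community, then reads are local
    Break-even:    300 MB / 20 MB = 15 jobs sharing one staged copy

A community expected to run many of the 300 instances should therefore stage the database, while one running only a handful is better off reading remotely; the experiments below probe exactly this balance.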

Slide 27: Policy specification
› Run only with locality:
    Requirements = (NearestStorage.HasCMSData)
› Run in only one particular community:
    Requirements = (NearestStorage.Name == "nestore.bologna")
› Prefer the home community first:
    Requirements = (NearestStorage.HasCMSData)
    Rank = (NearestStorage.Name == "nestore.bologna") ? 10 : 0
› Arbitrarily complex:
    Requirements = (NearestStorage.Name == "nestore.bologna") || (ClockHour > 18)
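Put together, such a policy fits in an ordinary Condor submit description. A minimal sketch for the 300-instance run on slide 19, assuming (as described above) that three-way matching makes NearestStorage visible to job expressions; the executable name comes from the example ads, and the rest is illustrative:

    universe     = vanilla
    executable   = sim.exe
    requirements = NearestStorage.HasCMSData
    rank         = (NearestStorage.Name == "nestore.bologna") ? 10 : 0
    queue 300

This runs only where the CMS data is nearby, preferring the home appliance in Bologna but overflowing to any other community that holds the data.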

Slide 28: Policies evaluated
› INFN local
› UW remote
› UW stage first
› UW local (pre-staged)
› INFN local, UW remote
› INFN local, UW stage
› INFN local, UW local

Slide 29: Completion Time (results graph)

Slide 30: CPU Efficiency (results graph)

Slide 31: Conclusions
› I/O communities expose locality policies
› Users can increase throughput
› Owners can maximize resource utilization

Slide 32: Future work
› Automation
  • Configuration of communities
  • Dynamically adjust size as load dictates
› Automation
  • Selection of movement policy
› Automation

Slide 33: For more info
› Condor: http://www.cs.wisc.edu/condor
› ClassAds: http://www.cs.wisc.edu/condor/classad
› PFS: http://www.cs.wisc.edu/condor/pfs
› NeST: http://www.nestproject.org

Slide 34: Local only

Slide 35: Remote only

Slide 36: Both local and remote

Slide 37: I/O communities are an old idea, right?
› File servers and administrative domains
› No, not really. We need:
  • more flexible boundaries
  • a simple mechanism by which users can express I/O community relationships
  • hooks into the system that let users exploit locality

Slide 38: Grid applications have demanding I/O needs
› Petabytes of data sit in tape repositories
› Scheduling systems have demonstrated that there are idle CPUs
› Some systems:
  • move jobs to data
  • move data to jobs
  • allow jobs to access data remotely
› No one approach is always "best"

Slide 39: Easy come, easy go
› In a computational grid, resources are very dynamic
› Programs need rich methods for finding and claiming resources:
  • CPU discovery
  • Device discovery
  • Replica discovery

Slide 40: Bringing it all together
(Diagram: a job's agent consults the CPU, device, and replica discovery systems; it does short-haul I/O to the storage appliance at its execution site and long-haul I/O to a distributed repository.)

Slide 41: Conclusions
› Locality is good
› The balance point between staging data and accessing it remotely is not static
  • It depends on specific attributes of the job: data size, expected degree of re-reference, etc.
  • It depends on the performance metric: CPU efficiency or job completion time

Slide 42: Implementation
› NeST: storage appliance
› Pluggable File System (PFS): interposition agent, built with Bypass
› Condor and ClassAds: scheduling system, discovery system, match-maker

Slide 43: Jim Gast and Bart say...
› Too many bullet slides
› Contributions:
  • The scientist doesn't want to name resources, because resources are dynamic and names are irrelevant
  • Hooks into the system allow users to express and take advantage of locality

Slide 44: Jim Gast and Bart say...
  • Everyone knows locality is good, but there has been no way to express this and run jobs on the grid
  • I/O communities are a mechanism by which a user can exploit locality and specify policies to optimize job performance

Slide 45: 4 earth-shattering revelations
1) The grid is big.
2) Scientific data sets are large.
3) Idle resources are available.
4) Locality is good.

Slide 46: Mechanisms, not policies
› I/O communities are a mechanism, not a policy
› A higher layer is expected to choose application-appropriate policies
› We will, however, demonstrate the strength of the mechanism by defining appropriate policies for one particular application

Slide 47: Experimental results
› Implementation
› Environment
› Application
› Measurements
› Evaluation

