Celebrating Diversity in Volunteer Computing David P. Anderson Space Sciences Lab U.C. Berkeley Sept. 1, 2008.


1 Celebrating Diversity in Volunteer Computing
David P. Anderson
Space Sciences Lab, U.C. Berkeley
Sept. 1, 2008

2 Background
- Volunteer computing: distributed scientific computing using volunteered resources (desktops, laptops, game consoles, cell phones, etc.)
- BOINC: middleware for volunteer (and desktop grid) computing

3 Diversity of resources
- CPU type, number, speed
- RAM, disk
- Coprocessors
- OS type and version
- Network
  - performance
  - availability
  - proxies
- System availability
- Reliability
  - crashes, invalid results, cheating

4 Diversity of applications
- Resource requirements
  - CPU, coprocessors, RAM, storage, network
- Completion time constraints
- Numerical properties
  - same result on all CPUs
  - a little different
  - unboundedly different

5 IBM World Community Grid
- “Umbrella” project sponsored by IBM
  - Rice genome study: Univ. of Washington
  - Protein X-ray crystallography: Ontario Cancer Inst.
  - African climate study: Univ. of Cape Town
  - Dengue fever drug discovery: Univ. of Texas
  - Human protein folding: NYU, Univ. of Washington
  - HIV drug discovery: Scripps Institute
- Started Nov. 2004
- 390,000 volunteers total
- 167,000 years of CPU time
- Currently ~170 TeraFLOPS

6 CPU type

7 # cores

8 OS type

9 RAM

10 Free disk space

11 Availability

12 Job error rate

13 Average turnaround time

14 Current WCG applications

15 Job dispatching
[Diagram: 1M jobs, scheduler, client]
- Goals
  - maximize system throughput
  - minimize time to batch completion
  - minimize time to grant credit
  - scale to >100 requests/sec

16 BOINC scheduler architecture
[Diagram: job queue (DB), feeder, job cache (shared memory), scheduler, client]
- Issues:
  - what if the cache fills up with unsendable jobs?
  - what if the client needs a job that is not in the cache?

17 Homogeneous replication
- Different platforms do FP math differently
  - makes result validation difficult
- Divide platforms into equivalence classes; send instances of a job to a single class
- “Census” program computes distribution
- Scheduler: send committed jobs if possible
[Diagram: jobs committed to a class (Win/Intel, Win/AMD, etc.) and uncommitted jobs]
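The class-commitment rule on this slide can be sketched roughly as follows. This is a minimal illustration, not BOINC's code: the `HrClass` values, the `Host` and `Job` structs, and the vendor strings used to classify hosts are all assumptions made for the example.

```cpp
#include <string>

// Hypothetical equivalence classes. A real project would choose classes
// whose members are known to produce matching floating-point results.
enum class HrClass { Uncommitted, WinIntel, WinAmd, Other };

struct Host { std::string os; std::string cpu_vendor; };
struct Job  { HrClass committed = HrClass::Uncommitted; };

// Map a host to its equivalence class (illustrative rules only).
HrClass hr_class_of(const Host& h) {
    if (h.os == "Windows" && h.cpu_vendor == "GenuineIntel") return HrClass::WinIntel;
    if (h.os == "Windows" && h.cpu_vendor == "AuthenticAMD") return HrClass::WinAmd;
    return HrClass::Other;
}

// A job may go to a host if it is uncommitted or already committed
// to that host's class.
bool can_send(const Job& j, const Host& h) {
    return j.committed == HrClass::Uncommitted || j.committed == hr_class_of(h);
}

// Sending the first instance commits the job to the host's class.
void commit_on_send(Job& j, const Host& h) {
    if (j.committed == HrClass::Uncommitted) j.committed = hr_class_of(h);
}
```

Once the first instance has gone to a Win/Intel host, `can_send` rejects Win/AMD hosts for that job, so all replicas are compared within one class.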

18 Retry acceleration
- Retries needed when:
  - job times out
  - error (crash) returned
  - results fail to validate
- Send retries to hosts that are:
  - fast (low turnaround)
  - reliable
- Shorten latency bound of retries
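One way to read this policy is as a host filter plus a tighter deadline for retries. The sketch below assumes per-host turnaround and error-rate statistics are available; the struct, the thresholds, and the 0.5 shortening factor are hypothetical values chosen only for illustration.

```cpp
// Hypothetical per-host statistics kept by the server.
struct HostStats {
    double avg_turnaround_s;  // average time from dispatch to report, in seconds
    double error_rate;        // fraction of recent jobs that failed
};

// A host qualifies for retries if it is both fast and reliable.
// The default thresholds (1 day, 5%) are assumptions, not values from the slides.
bool eligible_for_retry(const HostStats& h,
                        double max_turnaround_s = 86400.0,
                        double max_error_rate = 0.05) {
    return h.avg_turnaround_s <= max_turnaround_s && h.error_rate <= max_error_rate;
}

// Retries get a shortened latency bound; other jobs keep the normal one.
double latency_bound(double normal_bound_s, bool is_retry, double retry_factor = 0.5) {
    return is_retry ? normal_bound_s * retry_factor : normal_bound_s;
}
```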

19 Volunteer app selection
- Volunteers can
  - select apps
  - opt to accept jobs from non-selected apps

20 Fast feasibility checks (no DB)
- Client sends:
  - hardware spec
  - availability info
  - list of jobs queued, in progress
- Resource checks
- Completion time check
  - EDF simulation
  - deadlines missed?
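The completion time check can be illustrated with a small earliest-deadline-first (EDF) simulation. This sketch assumes a single CPU and a single availability fraction per host, which is simpler than what a real scheduler has to model; the `QueuedJob` struct and function names are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical summary of one job: CPU time still needed and its deadline,
// both in seconds measured from now.
struct QueuedJob { double remaining_cpu_s; double deadline_s; };

// Run the jobs in deadline order on one CPU that is available
// avail_frac of the time; report whether every deadline is met.
bool edf_feasible(std::vector<QueuedJob> jobs, double avail_frac) {
    assert(avail_frac > 0.0);
    std::sort(jobs.begin(), jobs.end(),
              [](const QueuedJob& a, const QueuedJob& b) {
                  return a.deadline_s < b.deadline_s;
              });
    double t = 0.0;  // simulated wall-clock time
    for (const auto& j : jobs) {
        t += j.remaining_cpu_s / avail_frac;
        if (t > j.deadline_s) return false;  // this job would miss its deadline
    }
    return true;
}

// A candidate job is feasible if adding it to the client's reported
// queue causes no deadline misses.
bool can_accept(const std::vector<QueuedJob>& queued,
                const QueuedJob& candidate, double avail_frac) {
    std::vector<QueuedJob> all = queued;
    all.push_back(candidate);
    return edf_feasible(all, avail_frac);
}
```

Because the check uses only data sent in the request, it needs no database access, which matches the "fast" label on the slide.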

21 Slow feasibility checks (DB)
- Is job still needed?
- Has another replica been sent to this volunteer?

22 Platform mechanism
- Jobs are associated with apps, not versions
- Request message:
  - platform 0: Win64
  - platform 1: Win32
[Diagram: job, application, app versions (Win/x86, Win/x64, Linux/x86)]
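Since jobs belong to apps rather than versions, the scheduler has to pick a version at dispatch time. A plausible sketch, assuming the client lists its platforms in order of preference and the scheduler takes the newest version for the first platform that has one; the `AppVersion` struct and the platform labels are taken from the slide's diagram, not from BOINC's actual platform names.

```cpp
#include <string>
#include <vector>

// Hypothetical app version record.
struct AppVersion { std::string app; std::string platform; int version_num; };

// Walk the client's platforms in preference order; for the first platform
// with any version of the app, return the newest such version.
// Returns nullptr if no platform matches.
const AppVersion* pick_version(const std::vector<AppVersion>& versions,
                               const std::string& app,
                               const std::vector<std::string>& client_platforms) {
    for (const auto& p : client_platforms) {
        const AppVersion* best = nullptr;
        for (const auto& v : versions) {
            if (v.app == app && v.platform == p &&
                (!best || v.version_num > best->version_num)) {
                best = &v;
            }
        }
        if (best) return best;
    }
    return nullptr;
}
```

With this rule a 64-bit Windows client that reports Win/x64 first and Win/x86 second still receives work when only a 32-bit version exists.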

23 Host punishment
- The problem: hosts that error out all jobs
- Maintain M(h): max jobs per day for host h
- On each error, decrement M(h)
- On valid job, double M(h)
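The update rule for M(h) is simple enough to write down directly. The slide does not say how M(h) is bounded, so the floor of 1 and the cap at a project-wide maximum in this sketch are assumptions.

```cpp
#include <algorithm>
#include <cassert>

// M(h): the maximum number of jobs host h may receive per day.
struct HostQuota { int max_jobs_per_day; };

// On each error, decrement M(h). The floor of 1 is an assumption so that
// a repaired host can still earn its quota back.
void on_error(HostQuota& h) {
    h.max_jobs_per_day = std::max(1, h.max_jobs_per_day - 1);
}

// On a valid job, double M(h). The cap at project_max is an assumption.
void on_valid(HostQuota& h, int project_max) {
    assert(project_max >= 1);
    h.max_jobs_per_day = std::min(project_max, h.max_jobs_per_day * 2);
}
```

Linear decrease and multiplicative increase mean a host that fails everything is throttled down over time, while a host that recovers regains its full quota after a handful of valid results.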

24 Anonymous platform mechanism
- Rather than downloading apps from server, client has preexisting local apps.
- Scheduler: if client has its own apps, only send it jobs for those apps.
- Usage scenarios:
  - Computers with unsupported platforms
  - People who optimize apps
  - Security-conscious people who want to inspect the source code

25 Old scheduling policy
- Job cache scan
  - start from random point
  - do fast feasibility checks
  - lock job, do slow feasibility checks
- Multiple scans
  - send jobs committed to an HR class
  - if fast host, send retries
  - send work for selected apps
  - if allowed, send work for non-selected apps
- Problems
  - rigid policy
  - app == 1 CPU

26 Coprocessor and multi-thread apps
- How to select the best version for a given host?
- How to estimate performance on the host?
[Diagram: Win/x86 app versions: single-threaded, multi-threaded, CUDA]

27 Multithread/coprocessor (cont.)
- How to decide which app version to use?
  - app versions have “plan class” string
  - scheduler has project-supplied function
    bool app_plan(SCHEDULER_REQUEST &sreq, char* plan_class, HOST_USAGE&);
  - returns:
    - whether host can run app
    - coprocessor usage
    - CPU usage (possibly fractional)
    - expected FLOPS
    - cmdline to pass to app
  - embodies knowledge about sublinear speedup, etc.
- Scheduler: call app_plan() for each version, use the one with highest expected FLOPS
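A simplified sketch of how a project-supplied plan function and the scheduler's selection loop might fit together. It does not use BOINC's real `SCHEDULER_REQUEST` or `HOST_USAGE` types: `HostInfo`, `HostUsage`, `app_plan_sketch`, the plan class strings, the 0.75 speedup factor, and the `--nthreads` flag are all stand-ins invented for this example.

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for the request and usage structures.
struct HostInfo { int ncpus; double cpu_flops; int ncuda; double cuda_flops; };
struct HostUsage {
    double avg_ncpus = 1.0;   // CPU usage, possibly fractional
    int ncudas = 0;           // coprocessor usage
    double flops = 0.0;       // expected FLOPS
    std::string cmdline;      // command line to pass to the app
};

// Hypothetical plan function: returns whether the host can run a version
// with the given plan class, and fills in its expected resource usage.
bool app_plan_sketch(const HostInfo& host, const std::string& plan_class, HostUsage& hu) {
    if (plan_class.empty()) {            // plain single-threaded version
        hu.flops = host.cpu_flops;
        return true;
    }
    if (plan_class == "mt") {            // multi-threaded version
        if (host.ncpus < 2) return false;
        hu.avg_ncpus = host.ncpus;
        // Assumed sublinear speedup: 75% efficiency per core.
        hu.flops = 0.75 * host.ncpus * host.cpu_flops;
        hu.cmdline = "--nthreads " + std::to_string(host.ncpus);
        return true;
    }
    if (plan_class == "cuda") {          // GPU version
        if (host.ncuda < 1) return false;
        hu.avg_ncpus = 0.5;              // part of a core feeds the GPU
        hu.ncudas = 1;
        hu.flops = host.cuda_flops;
        return true;
    }
    return false;                        // unknown plan class
}

// Scheduler side: evaluate every version, keep the one with the highest
// expected FLOPS. Returns its index, or -1 if none can run.
int best_version(const HostInfo& host, const std::vector<std::string>& plan_classes,
                 HostUsage& best) {
    int best_i = -1;
    for (int i = 0; i < static_cast<int>(plan_classes.size()); ++i) {
        HostUsage hu;
        if (!app_plan_sketch(host, plan_classes[i], hu)) continue;
        if (best_i < 0 || hu.flops > best.flops) { best = hu; best_i = i; }
    }
    return best_i;
}
```

Keeping the hardware-specific knowledge inside the plan function leaves the scheduler loop generic: it only compares expected FLOPS.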

28 Multithread/coprocessor (cont.)
- Client
  - coprocessor handling (currently just CUDA)
    - hardware check/report
    - scheduling (coprocessors not timesliced)
  - CPU scheduling
    - run enough apps to use at least N cores

29 Score-based scheduling
[Diagram: random N jobs, feasible jobs, rank by score, send M highest-scoring jobs]

30 Terms in the score function
- Bonus if
  - host is fast and job is a retry
  - job is committed to HR class
  - app was selected by volunteer
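Score-based selection with these bonus terms could look roughly like the sketch below. The slides do not give weights, so each bonus is assumed to be worth 1.0; the `Candidate` struct and function names are also invented for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical view of one feasible job from the host's perspective.
struct Candidate {
    int job_id;
    bool is_retry;
    bool committed_to_hr_class;
    bool app_selected;  // the volunteer selected this job's app
};

// Sum of bonus terms; equal weights of 1.0 are an assumption.
double score(const Candidate& c, bool host_is_fast) {
    double s = 0.0;
    if (host_is_fast && c.is_retry) s += 1.0;
    if (c.committed_to_hr_class) s += 1.0;
    if (c.app_selected) s += 1.0;
    return s;
}

// Rank the feasible jobs by score and return the ids of the m best.
// stable_sort keeps the original order among equal scores.
std::vector<int> pick_jobs(std::vector<Candidate> feasible, bool host_is_fast,
                           std::size_t m) {
    std::stable_sort(feasible.begin(), feasible.end(),
                     [host_is_fast](const Candidate& a, const Candidate& b) {
                         return score(a, host_is_fast) > score(b, host_is_fast);
                     });
    std::vector<int> ids;
    for (std::size_t i = 0; i < feasible.size() && i < m; ++i) {
        ids.push_back(feasible[i].job_id);
    }
    return ids;
}
```

Compared with a fixed sequence of cache scans, a single score makes it easy to add or reweight a policy by changing one term.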

31 Job size matching
- Goal: send large jobs to fast hosts, small jobs to slow hosts
  - reduce credit-granting delay
  - reduce server occupancy time
- Census program maintains host statistics
- Feeder maintains job size statistics
- Score penalty: |job - host|^2
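For the penalty |job - host|^2 to make sense, job size and host speed need a common scale. One plausible reading, assumed here and not stated on the slide, is that each is expressed as a number of standard deviations from its population mean, using the statistics kept by the census program and the feeder.

```cpp
#include <cassert>

// Express a raw value as standard deviations from the population mean.
// Treating both job size and host speed this way is an assumption.
double normalize(double value, double mean, double stddev) {
    assert(stddev > 0.0);
    return (value - mean) / stddev;
}

// Score penalty from the slide: squared distance between the normalized
// job size and the normalized host speed.
double size_penalty(double job_size_norm, double host_speed_norm) {
    const double d = job_size_norm - host_speed_norm;
    return d * d;
}
```

A job one standard deviation larger than average sent to a host one standard deviation slower than average gets a penalty of 4, while a well-matched pair gets 0.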

32 Adaptive replication
- Goal: achieve a target level of reliability while reducing replication to 1+ε
- Idea: replicate less (but always some) as a host becomes more trusted
- Policy:
  - maintain “invalid rate” E(h) per host.
  - if E(h) > X, replicate (e.g., 2-fold)
  - else replicate with probability E(h)/X
- Is there a counterstrategy?
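The replication policy translates almost directly into code. In this sketch the function names are invented, and how E(h) is updated over time is left out because the slide does not specify it.

```cpp
#include <cassert>
#include <random>

// Probability of replicating a job sent to a host with invalid rate E(h),
// given threshold X: always replicate above X, otherwise with
// probability E(h)/X.
double replication_probability(double invalid_rate, double threshold) {
    assert(threshold > 0.0);
    if (invalid_rate > threshold) return 1.0;
    return invalid_rate / threshold;
}

// Draw the actual decision for one job using a caller-supplied generator.
template <class Rng>
bool should_replicate(double invalid_rate, double threshold, Rng& rng) {
    std::bernoulli_distribution d(replication_probability(invalid_rate, threshold));
    return d(rng);
}
```

Note that a host whose E(h) reached exactly zero would never be checked under this rule, so "but always some" implies that E(h) has to stay strictly positive.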

33 Server simulation
- How do we know these policies are any good?
- How can we study alternatives?
- In situ study is difficult
- SIMBA emulator (U. of Delaware):
  - SIMBA (emulates N clients)
  - BOINC server (not emulated)

34 Upcoming scheduler changes
- Problems:
  - only use 1 app version
  - completion-time simulation is antiquated (doesn’t reflect multithread, coprocessor, RAM limitations)
- New concept: resource signature
  - #CPUs, #coprocessors, RAM
- Do simulation based on “greedy EDF scheduling” using resource signature
- Select app version that can use available resources

35 Conclusion
- Volunteer computing has diverse resources and workloads
- BOINC has mechanisms that deal effectively and efficiently with this diversity
- Lots of fun research problems here!
- davea@ssl.berkeley.edu

