Celebrating Diversity in Volunteer Computing David P. Anderson Space Sciences Lab U.C. Berkeley Sept. 1, 2008.

Background
- Volunteer computing: distributed scientific computing using volunteered resources (desktops, laptops, game consoles, cell phones, etc.)
- BOINC: middleware for volunteer (and desktop grid) computing

Diversity of resources
- CPU type, number, speed
- RAM, disk
- Coprocessors
- OS type and version
- Network
  - performance
  - availability
  - proxies
- System availability
- Reliability
  - crashes, invalid results, cheating

Diversity of applications
- Resource requirements
  - CPU, coprocessors, RAM, storage, network
- Completion time constraints
- Numerical properties
  - same result on all CPUs
  - a little different
  - unboundedly different

IBM World Community Grid
- “Umbrella” project sponsored by IBM
  - Rice genome study: Univ. of Washington
  - Protein X-ray crystallography: Ontario Cancer Inst.
  - African climate study: Univ. of Cape Town
  - Dengue fever drug discovery: Univ. of Texas
  - Human protein folding: NYU, Univ. of Washington
  - HIV drug discovery: Scripps Institute
- Started Nov. 2004
- 390,000 volunteers total
- 167,000 years of CPU time
- Currently ~170 TeraFLOPS

WCG statistics (chart slides): CPU type, # cores, OS type, free disk space, availability, job error rate, average turnaround time

Current WCG applications (chart slide)

Job dispatching
- Diagram: 1M jobs -> scheduler -> client
- Goals
  - maximize system throughput
  - minimize time to batch completion
  - minimize time to grant credit
  - scale to >100 requests/sec

BOINC scheduler architecture
- Diagram components: job queue (DB), feeder, job cache (shared memory), scheduler, client
- Issues:
  - what if the cache fills up with unsendable jobs?
  - what if a client needs a job that is not in the cache?

Homogeneous replication
- Different platforms do FP math differently
  - makes result validation difficult
- Divide platforms into equivalence classes, send instances of a job to a single class
- “Census” program computes distribution
- Scheduler: send committed jobs if possible
- Diagram: jobs committed to a class (Win/Intel, Win/AMD, etc.) and uncommitted jobs
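The commit rule can be sketched as follows. This is a minimal illustration, not BOINC's code: the `Host` and `Job` structs, `hr_class()`, and `try_send()` are hypothetical names, and defining a class as "OS/CPU vendor" is an assumption taken from the Win/Intel and Win/AMD labels on the slide.

```cpp
#include <string>

// Hypothetical host description. Hosts with the same OS family and CPU
// vendor are assumed to produce matching floating-point results.
struct Host {
    std::string os;          // e.g. "Win"
    std::string cpu_vendor;  // e.g. "Intel"
};

// Maps a host to its equivalence class, e.g. "Win/Intel".
std::string hr_class(const Host& h) {
    return h.os + "/" + h.cpu_vendor;
}

struct Job {
    std::string committed_class;  // empty while the job is uncommitted
};

// Returns true if an instance of the job may be sent to this host.
// The first host to receive an instance commits the job to its class;
// later instances may only go to hosts in that same class.
bool try_send(Job& job, const Host& h) {
    const std::string cls = hr_class(h);
    if (job.committed_class.empty()) {
        job.committed_class = cls;
        return true;
    }
    return job.committed_class == cls;
}
```

Once a job is committed, only hosts of the same class can receive its remaining instances, which is why the scheduler prefers to send already-committed jobs when it can.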

Retry acceleration
- Retries needed when:
  - job times out
  - error (crash) returned
  - results fail to validate
- Send retries to hosts that are:
  - fast (low turnaround)
  - reliable
- Shorten latency bound of retries

Volunteer app selection
- Volunteers can
  - select apps
  - opt to accept jobs from non-selected apps

Fast feasibility checks (no DB)
- Client sends:
  - hardware spec
  - availability info
  - list of jobs queued, in progress
- Resource checks
- Completion time check
  - EDF simulation
  - deadlines missed?
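A completion time check of this kind might look like the sketch below. It is deliberately simplified and rests on assumptions not stated on the slide: a single CPU, a constant availability fraction, and the hypothetical names `QueuedJob`, `edf_feasible()`, and `can_accept()`.

```cpp
#include <algorithm>
#include <vector>

struct QueuedJob {
    double remaining_secs;  // estimated CPU time still needed
    double deadline_secs;   // deadline, measured from now
};

// Simulates earliest-deadline-first execution on a single CPU that is
// available for computing a fraction `availability` (0..1] of wall-clock
// time. Returns true if every job finishes by its deadline.
bool edf_feasible(std::vector<QueuedJob> jobs, double availability) {
    if (availability <= 0.0) return jobs.empty();
    std::sort(jobs.begin(), jobs.end(),
              [](const QueuedJob& a, const QueuedJob& b) {
                  return a.deadline_secs < b.deadline_secs;
              });
    double now = 0.0;
    for (const QueuedJob& j : jobs) {
        now += j.remaining_secs / availability;  // wall-clock time to finish
        if (now > j.deadline_secs) return false;
    }
    return true;
}

// A candidate job passes the check only if adding it to the client's
// reported queue causes no deadline miss.
bool can_accept(const std::vector<QueuedJob>& queued,
                const QueuedJob& candidate, double availability) {
    std::vector<QueuedJob> all = queued;
    all.push_back(candidate);
    return edf_feasible(all, availability);
}
```

Because the client reports its queue and availability in the request, the whole check runs in memory, which is what keeps it in the "no DB" category.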

Slow feasibility checks (DB)
- Is job still needed?
- Has another replica been sent to this volunteer?

Platform mechanism
- Jobs are associated with apps, not versions
- Diagram: job -> application -> app versions (Win/x86, Win/x64, Linux/x86)
- Request message:
  - platform 0: Win64
  - platform 1: Win32
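One way to read the request message is that the client lists the platforms it supports, and the scheduler looks for an app version matching one of them. The sketch below assumes the list order is a preference order, which the slide does not state. `VersionTable` and `pick_platform()` are hypothetical names.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical table: app name -> platforms for which a version exists.
using VersionTable = std::map<std::string, std::set<std::string>>;

// Walks the platforms in the order the client listed them and returns the
// first one for which the app has a version. Returns an empty string if
// the app has no version for any listed platform.
std::string pick_platform(const VersionTable& table, const std::string& app,
                          const std::vector<std::string>& requested) {
    const auto app_it = table.find(app);
    if (app_it == table.end()) return "";
    for (const std::string& platform : requested) {
        if (app_it->second.count(platform) > 0) return platform;
    }
    return "";
}
```

Under this reading, a 64-bit Windows host that lists Win/x64 first and Win/x86 second still gets work from an app that only ships a 32-bit version.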

Host punishment
- The problem: hosts that error out all jobs
- Maintain M(h): max jobs per day for host h
- On each error, decrement M(h)
- On valid job, double M(h)
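The update rule for M(h) is small enough to sketch directly. The floor of 1 and the ceiling of 100 are assumptions added here so the value stays bounded; the slide gives neither. `HostQuota`, `on_error()`, and `on_valid()` are hypothetical names.

```cpp
#include <algorithm>

// Hypothetical per-host quota record holding M(h).
struct HostQuota {
    int max_jobs_per_day;
};

const int kQuotaFloor = 1;      // assumed: lets a bad host still prove itself
const int kQuotaCeiling = 100;  // assumed project-wide cap, not from the slides

// On each error, decrement M(h), but never below the floor.
void on_error(HostQuota& q) {
    q.max_jobs_per_day = std::max(kQuotaFloor, q.max_jobs_per_day - 1);
}

// On each valid job, double M(h), but never above the ceiling.
void on_valid(HostQuota& q) {
    q.max_jobs_per_day = std::min(kQuotaCeiling, q.max_jobs_per_day * 2);
}
```

The asymmetry is the point: a host that fails every job drifts down one step at a time, while a host that recovers gets its quota back within a few valid results.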

Anonymous platform mechanism
- Rather than downloading apps from the server, the client has preexisting local apps.
- Scheduler: if client has its own apps, only send it jobs for those apps.
- Usage scenarios:
  - Computers with unsupported platforms
  - People who optimize apps
  - Security-conscious people who want to inspect the source code

Old scheduling policy
- Job cache scan
  - start from random point
  - do fast feasibility checks
  - lock job, do slow feasibility checks
- Multiple scans
  - send jobs committed to an HR class
  - if fast host, send retries
  - send work for selected apps
  - if allowed, send work for non-selected apps
- Problems
  - rigid policy
  - app == 1 CPU

Coprocessor and multi-thread apps
- How to select the best version for a given host?
- How to estimate performance on the host?
- Diagram: Win/x86 app versions: single-threaded, multi-threaded, CUDA

Multithread/coprocessor (cont.)
- How to decide which app version to use?
  - app versions have “plan class” string
  - scheduler has project-supplied function
    bool app_plan(SCHEDULER_REQUEST &sreq, char* plan_class, HOST_USAGE&);
  - returns:
    - whether host can run app
    - coprocessor usage
    - CPU usage (possibly fractional)
    - expected FLOPS
    - cmdline to pass to app
  - embodies knowledge about sublinear speedup, etc.
- Scheduler: call app_plan() for each version, use the one with highest expected FLOPS
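A project-supplied plan function and the scheduler's selection loop could be sketched as follows. Everything here is illustrative: the struct fields are simplified stand-ins for the real scheduler types, `plan_class` is taken as `const char*` rather than `char*`, the plan class names "mt" and "cuda" are examples, the `--nthreads` flag is hypothetical, and the speedup factors are made-up numbers. `best_version()` is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Simplified, hypothetical stand-ins for the scheduler's types.
struct SCHEDULER_REQUEST {
    int ncpus;             // cores reported by the client
    double flops_per_cpu;  // benchmark FLOPS of one core
    int ncudas;            // CUDA devices reported by the client
};

struct HOST_USAGE {
    double avg_ncpus = 1.0;  // CPU usage, possibly fractional
    int ncudas = 0;          // coprocessor usage
    double flops = 0.0;      // expected FLOPS
    std::string cmdline;     // command line to pass to the app
};

// Returns whether the host can run a version with the given plan class and,
// if so, fills in the expected resource usage.
bool app_plan(SCHEDULER_REQUEST& sreq, const char* plan_class, HOST_USAGE& hu) {
    if (std::strcmp(plan_class, "") == 0) {    // single-threaded version
        hu.avg_ncpus = 1.0;
        hu.flops = sreq.flops_per_cpu;
        return true;
    }
    if (std::strcmp(plan_class, "mt") == 0) {  // multi-threaded version
        if (sreq.ncpus < 2) return false;
        hu.avg_ncpus = sreq.ncpus;
        // Assumed sublinear speedup: 80% efficiency per core.
        hu.flops = 0.8 * sreq.ncpus * sreq.flops_per_cpu;
        hu.cmdline = "--nthreads " + std::to_string(sreq.ncpus);
        return true;
    }
    if (std::strcmp(plan_class, "cuda") == 0) {
        if (sreq.ncudas < 1) return false;
        hu.avg_ncpus = 0.1;                    // assumed: part of a core feeds the GPU
        hu.ncudas = 1;
        hu.flops = 10.0 * sreq.flops_per_cpu;  // assumed GPU speedup factor
        return true;
    }
    return false;  // unknown plan class
}

// Calls app_plan() for each version and keeps the one with the highest
// expected FLOPS. Returns its index, or -1 if the host can run none.
int best_version(SCHEDULER_REQUEST& sreq,
                 const std::vector<std::string>& plan_classes,
                 HOST_USAGE& best) {
    int best_idx = -1;
    for (std::size_t i = 0; i < plan_classes.size(); ++i) {
        HOST_USAGE hu;
        if (!app_plan(sreq, plan_classes[i].c_str(), hu)) continue;
        if (best_idx < 0 || hu.flops > best.flops) {
            best = hu;
            best_idx = static_cast<int>(i);
        }
    }
    return best_idx;
}
```

Keeping the performance model inside `app_plan()` means the scheduler itself needs no knowledge of threads or GPUs; it only compares the expected FLOPS that each plan reports.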

Multithread/coprocessor (cont.)
Client
- coprocessor handling (currently just CUDA)
  - hardware check/report
  - scheduling (coprocessors not timesliced)
- CPU scheduling
  - run enough apps to use at least N cores

Score-based scheduling
- Diagram: random N jobs -> feasible jobs -> rank by score -> send M highest-scoring jobs

Terms in the score function
- Bonus if
  - host is fast and job is a retry
  - job is committed to HR class
  - app was selected by volunteer

Job size matching
- Goal: send large jobs to fast hosts, small jobs to slow hosts
  - reduce credit-granting delay
  - reduce server occupancy time
- Census program maintains host statistics
- Feeder maintains job size statistics
- Score penalty: |job - host|^2
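The bonus terms and the size-matching penalty can be combined into one score, as in the sketch below. The weights (1.0 per bonus), the normalized `job_size` and `speed` scales, and the names `Candidate`, `HostInfo`, `score()`, and `pick_top()` are all assumptions; the slides give the terms but not their values.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Candidate {
    int job_id;
    bool is_retry;
    bool committed_to_host_hr_class;
    bool app_selected_by_volunteer;
    double job_size;  // normalized job size (assumed scale)
};

struct HostInfo {
    bool is_fast;
    double speed;     // normalized host speed (assumed same scale as job_size)
};

// Sums the bonus terms and subtracts the size-matching penalty.
double score(const Candidate& c, const HostInfo& h) {
    double s = 0.0;
    if (h.is_fast && c.is_retry) s += 1.0;
    if (c.committed_to_host_hr_class) s += 1.0;
    if (c.app_selected_by_volunteer) s += 1.0;
    const double d = c.job_size - h.speed;
    s -= d * d;  // penalty |job - host|^2
    return s;
}

// Ranks already-feasible candidates by score and returns the ids of the
// m highest-scoring jobs.
std::vector<int> pick_top(std::vector<Candidate> feasible, const HostInfo& h,
                          std::size_t m) {
    std::stable_sort(feasible.begin(), feasible.end(),
                     [&h](const Candidate& a, const Candidate& b) {
                         return score(a, h) > score(b, h);
                     });
    if (feasible.size() > m) feasible.resize(m);
    std::vector<int> ids;
    for (const Candidate& c : feasible) ids.push_back(c.job_id);
    return ids;
}
```

Unlike the multi-scan policy, a new preference becomes one more term in `score()` rather than one more pass over the job cache.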

Adaptive replication
- Goal: achieve a target level of reliability while reducing replication to 1+ε
- Idea: replicate less (but always some) as a host becomes more trusted
- Policy:
  - maintain “invalid rate” E(h) per host
  - if E(h) > X, replicate (e.g., 2-fold)
  - else replicate with probability E(h)/X
- Is there a counterstrategy?
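The replication decision itself is a two-branch rule, sketched here. How E(h) is updated after each result is not given on the slide and is left out. `should_replicate()` is a hypothetical name; the first overload takes the random draw as a parameter so the rule can be checked deterministically.

```cpp
#include <random>

// Decides whether a job sent to host h needs a second replica.
// invalid_rate is E(h), threshold is X, and u is a uniform draw in [0, 1).
bool should_replicate(double invalid_rate, double threshold, double u) {
    if (invalid_rate > threshold) return true;  // untrusted host: always replicate
    return u < invalid_rate / threshold;        // trusted: probability E(h)/X
}

// Convenience overload that draws u from the supplied generator.
bool should_replicate(double invalid_rate, double threshold, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    return should_replicate(invalid_rate, threshold, uniform(rng));
}
```

With 2-fold replication, a host at E(h) = X/2 has about half of its jobs replicated, so the expected number of instances per job is roughly 1.5, and it approaches 1 as E(h) falls.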

Server simulation
- How do we know these policies are any good?
- How can we study alternatives?
- In situ study is difficult
- SIMBA emulator (U. of Delaware):
  - Diagram: SIMBA (emulates N clients) <-> BOINC server (not emulated)

Upcoming scheduler changes
- Problems:
  - only use 1 app version
  - completion-time simulation is antiquated (doesn’t reflect multithread, coprocessor, RAM limitations)
- New concept: resource signature
  - #CPUs, #coprocessors, RAM
- Do simulation based on “greedy EDF scheduling” using resource signature
- Select app version that can use available resources

Conclusion
- Volunteer computing has diverse resources and workloads
- BOINC has mechanisms that deal effectively and efficiently with this diversity
- Lots of fun research problems here!
- Contact: davea@ssl.berkeley.edu
