Celebrating Diversity in Volunteer Computing
David P. Anderson
Space Sciences Lab, U.C. Berkeley
Sept. 1, 2008

Background
- Volunteer computing: distributed scientific computing using volunteered resources (desktops, laptops, game consoles, cell phones, etc.)
- BOINC: middleware for volunteer (and desktop grid) computing

Diversity of resources
- CPU type, number, speed
- RAM, disk
- Coprocessors
- OS type and version
- Network: performance, availability, proxies
- System availability
- Reliability: crashes, invalid results, cheating

Diversity of applications
- Resource requirements: CPU, coprocessors, RAM, storage, network
- Completion time constraints
- Numerical properties: same result on all CPUs, a little different, or unboundedly different

IBM World Community Grid
- "Umbrella" project sponsored by IBM:
  - Rice genome study: Univ. of Washington
  - Protein X-ray crystallography: Ontario Cancer Inst.
  - African climate study: Univ. of Cape Town
  - Dengue fever drug discovery: Univ. of Texas
  - Human protein folding: NYU, Univ. of Washington
  - HIV drug discovery: Scripps Institute
- Started Nov. 2004
- …,000 volunteers
- Total of 167,000 years of CPU time
- Currently ~170 TeraFLOPS

CPU type

# cores

OS type

RAM

Free disk space

Availability

Job error rate

Average turnaround time

Current WCG applications

Job dispatching
(diagram: a queue of ~1M jobs, the scheduler, and clients)
Goals:
- maximize system throughput
- minimize time to batch completion
- minimize time to grant credit
- scale to >100 requests/sec

BOINC scheduler architecture
(diagram: job queue in the DB → feeder → job cache in shared memory → scheduler → client)
Issues:
- what if the cache fills up with unsendable jobs?
- what if a client needs a job that is not in the cache?
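A minimal sketch of the feeder/cache idea, using illustrative types and an in-memory stand-in for the database rather than the actual BOINC code: the feeder keeps a fixed-size job cache topped up so scheduler instances never scan the DB for candidate jobs.

```cpp
// Sketch of the feeder's role: keep a fixed-size job cache (in BOINC, a
// shared-memory segment) topped up from the job queue. The structures and
// the in-memory "database" below are illustrative stand-ins.
#include <deque>
#include <vector>

struct Job { long id = 0; bool present = false; };

struct JobCache {
    std::vector<Job> slots;          // in BOINC this lives in shared memory
    explicit JobCache(size_t n) : slots(n) {}
};

// stand-in for the job table in the DB
std::deque<Job> job_queue;

// One feeder pass: refill every empty slot from the queue.
void feeder_pass(JobCache& cache) {
    for (auto& slot : cache.slots) {
        if (!slot.present && !job_queue.empty()) {
            slot = job_queue.front();
            slot.present = true;
            job_queue.pop_front();
        }
    }
    // Issue from the slide: if the cache fills with jobs no current client
    // can take, scheduler instances starve even though the queue holds work.
}
```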

Homogeneous replication
- Different platforms do FP math differently, which makes result validation difficult
- Divide platforms into equivalence classes (Win/Intel, Win/AMD, etc.); send all instances of a job to a single class
- A "census" program computes the distribution of hosts over classes
- Scheduler: send jobs already committed to a class if possible; otherwise jobs are uncommitted
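A small sketch of the HR bookkeeping, assuming an OS-by-CPU-vendor classification: a workunit starts uncommitted, and once one replica is sent it becomes committed to that host's class.

```cpp
// Sketch of homogeneous-replication bookkeeping. The class list and matching
// rule are illustrative assumptions, not the actual BOINC HR types.
#include <string>

// Hypothetical classifier: OS family x CPU vendor.
int hr_class_of_host(const std::string& os, const std::string& cpu_vendor) {
    if (os == "Windows" && cpu_vendor == "GenuineIntel") return 1;  // Win/Intel
    if (os == "Windows" && cpu_vendor == "AuthenticAMD") return 2;  // Win/AMD
    if (os == "Linux"   && cpu_vendor == "GenuineIntel") return 3;
    if (os == "Linux"   && cpu_vendor == "AuthenticAMD") return 4;
    return 5;  // everything else
}

struct Workunit {
    int hr_class = 0;  // 0 = uncommitted: no replica sent yet
};

// Can a replica of this workunit be sent to a host in class host_class?
bool hr_ok(const Workunit& wu, int host_class) {
    return wu.hr_class == 0 || wu.hr_class == host_class;
}

// When a replica is actually sent, the workunit becomes committed to that class.
void commit(Workunit& wu, int host_class) { wu.hr_class = host_class; }
```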

Retry acceleration
Retries are needed when:
- a job times out
- an error (crash) is returned
- results fail to validate
Send retries to hosts that are:
- fast (low turnaround)
- reliable
Shorten the latency bound of retries.
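A hedged sketch of how a scheduler might pick hosts for retries; the thresholds and field names are assumptions, not BOINC's actual values.

```cpp
// Sketch of retry acceleration: when a job instance must be re-sent, prefer
// hosts with low average turnaround and a good validation record, and shrink
// the retry's deadline. Thresholds are illustrative.
struct HostStats {
    double avg_turnaround_sec;   // recent average time to return results
    double error_rate;           // fraction of recent jobs that failed or were invalid
};

bool is_reliable_fast_host(const HostStats& h) {
    return h.avg_turnaround_sec < 24 * 3600    // returns within a day
        && h.error_rate < 0.05;                // <5% bad results
}

double retry_deadline(double normal_delay_bound) {
    // Shorten the latency bound for retries so a stalled batch finishes sooner.
    return normal_delay_bound / 2;
}
```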

Volunteer app selection
Volunteers can:
- select apps
- opt to accept jobs from non-selected apps

Fast feasibility checks (no DB)
The client sends:
- its hardware spec
- availability info
- the list of jobs queued and in progress
Resource checks
Completion time check:
- EDF simulation
- are any deadlines missed?
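A simplified sketch of the completion-time check, assuming a single-CPU model in which per-job runtimes and availability are given: sort by deadline (EDF) and see whether anything would finish late.

```cpp
// Sketch of the fast completion-time check: simulate the client's queued and
// in-progress jobs plus the candidate job under earliest-deadline-first and
// reject the candidate if any deadline would be missed.
#include <algorithm>
#include <vector>

struct QueuedJob {
    double remaining_cpu_sec;  // estimated CPU time still needed
    double deadline_sec;       // seconds from now until the report deadline
};

bool edf_misses_deadline(std::vector<QueuedJob> jobs, double availability) {
    // availability = fraction of wall time the host is on and computing
    std::sort(jobs.begin(), jobs.end(),
              [](const QueuedJob& a, const QueuedJob& b) {
                  return a.deadline_sec < b.deadline_sec;
              });
    double wall_time = 0;
    for (const auto& j : jobs) {
        wall_time += j.remaining_cpu_sec / availability;
        if (wall_time > j.deadline_sec) return true;  // this job would be late
    }
    return false;
}

// Usage: append the candidate job to the client's reported list and call
// edf_misses_deadline(); if it returns true, the job is infeasible for now.
```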

Slow feasibility checks (DB)
- Is the job still needed?
- Has another replica of it been sent to this volunteer?

Platform mechanism
- Jobs are associated with applications, not with app versions
- An application has app versions for multiple platforms (e.g., Win/x86, Win/x64, Linux/x86)
- The request message lists the platforms the client can handle, e.g.:
  - platform 0: Win64
  - platform 1: Win32
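A sketch of version selection from the client's platform list; the types and the assumption that the client lists platforms in preference order are illustrative, not taken from the slide.

```cpp
// Sketch of the platform mechanism: pick, for an application, an app version
// whose platform appears in the client's platform list.
#include <string>
#include <vector>

struct AppVersion { std::string platform; int id; };

// Returns the first version matching the client's platform list (assumed to be
// ordered by preference, e.g. the 64-bit platform before the 32-bit one), or
// nullptr if the client can run none of them.
const AppVersion* choose_version(const std::vector<std::string>& client_platforms,
                                 const std::vector<AppVersion>& versions) {
    for (const auto& p : client_platforms) {
        for (const auto& v : versions) {
            if (v.platform == p) return &v;
        }
    }
    return nullptr;
}
```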

Host punishment
- The problem: hosts that error out every job
- Maintain M(h): max jobs per day for host h
- On each error, decrement M(h)
- On each valid job, double M(h)
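A direct sketch of that rule; the quota floor, ceiling, and default are assumed values.

```cpp
// Sketch of the "max jobs per day" backoff from the slide: each error shrinks
// a host's daily quota by one, each validated job doubles it.
struct HostQuota {
    int max_jobs_per_day = 100;   // M(h), assumed project default
};

const int QUOTA_FLOOR = 1;
const int QUOTA_CEILING = 100;

void on_job_error(HostQuota& h) {
    if (h.max_jobs_per_day > QUOTA_FLOOR) h.max_jobs_per_day--;
}

void on_job_valid(HostQuota& h) {
    h.max_jobs_per_day *= 2;
    if (h.max_jobs_per_day > QUOTA_CEILING) h.max_jobs_per_day = QUOTA_CEILING;
}
// Effect: a host that errors out everything decays toward one job per day,
// while a few good results quickly restore the full quota.
```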

Anonymous platform mechanism
- Rather than downloading apps from the server, the client has preexisting local apps
- Scheduler: if the client has its own apps, only send it jobs for those apps
- Usage scenarios:
  - computers with unsupported platforms
  - people who optimize apps
  - security-conscious people who want to inspect the source code

Old scheduling policy
- Job cache scan:
  - start from a random point
  - do fast feasibility checks
  - lock the job, then do slow feasibility checks
- Multiple scans:
  - send jobs committed to an HR class
  - if the host is fast, send retries
  - send work for selected apps; if allowed, send work for non-selected apps
- Problems:
  - rigid policy
  - assumes each app instance uses exactly 1 CPU

Coprocessor and multi-thread apps
- A platform (e.g., Win/x86) may have single-threaded, multi-threaded, and CUDA app versions
- How to select the best version for a given host?
- How to estimate its performance on that host?

Multithread/coprocessor (cont.)
How to decide which app version to use?
- App versions have a "plan class" string
- The scheduler has a project-supplied function:
  bool app_plan(SCHEDULER_REQUEST &sreq, char* plan_class, HOST_USAGE&);
- It returns:
  - whether the host can run the app version
  - coprocessor usage
  - CPU usage (possibly fractional)
  - expected FLOPS
  - a command line to pass to the app
- It embodies knowledge about sublinear speedup, etc.
- Scheduler: call app_plan() for each version and use the one with the highest expected FLOPS
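A sketch of what a project-supplied app_plan() might look like for hypothetical "cuda" and "mt" plan classes. The SCHEDULER_REQUEST and HOST_USAGE shown here are simplified stand-ins with assumed fields, not the real BOINC structures, and the speedup and GPU-efficiency numbers are placeholders.

```cpp
// Sketch of a project-supplied app_plan(). Field names and constants are
// illustrative assumptions.
#include <cstring>

struct SCHEDULER_REQUEST {
    bool have_cuda;          // client reported a usable CUDA GPU
    double cuda_flops;       // peak FLOPS of that GPU
    double cpu_flops;        // peak FLOPS of one CPU core
    int ncpus;               // number of usable CPU cores
};

struct HOST_USAGE {
    double ncudas;           // number of GPUs used (possibly fractional)
    double avg_ncpus;        // CPU cores used (possibly fractional)
    double projected_flops;  // expected speed, used to rank versions
    char cmdline[256];       // command line passed to the app
};

bool app_plan(SCHEDULER_REQUEST& sreq, char* plan_class, HOST_USAGE& hu) {
    if (!strcmp(plan_class, "cuda")) {
        if (!sreq.have_cuda) return false;           // host can't run this version
        hu.ncudas = 1;
        hu.avg_ncpus = 0.05;                         // a little CPU to feed the GPU
        hu.projected_flops = 0.2 * sreq.cuda_flops;  // assume ~20% of GPU peak
        strcpy(hu.cmdline, "--device 0");
        return true;
    }
    if (!strcmp(plan_class, "mt")) {                 // multi-threaded version
        hu.ncudas = 0;
        hu.avg_ncpus = sreq.ncpus;
        // sublinear speedup: assume ~80% efficiency per additional core
        hu.projected_flops = sreq.cpu_flops * (1 + 0.8 * (sreq.ncpus - 1));
        hu.cmdline[0] = 0;
        return true;
    }
    return false;                                    // unknown plan class
}
```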

Multithread/coprocessor (cont.)
Client:
- coprocessor handling (currently just CUDA):
  - hardware check/report
  - scheduling (coprocessors are not timesliced)
- CPU scheduling: run enough apps to use at least N cores

Score-based scheduling
- Take a random sample of N jobs from the cache
- Keep the feasible ones and rank them by score
- Send the M highest-scoring jobs

Terms in the score function
Bonus if:
- the host is fast and the job is a retry
- the job is committed to the host's HR class
- the app was selected by the volunteer

Job size matching
Goal: send large jobs to fast hosts and small jobs to slow hosts, to:
- reduce credit-granting delay
- reduce server occupancy time
The census program maintains host statistics; the feeder maintains job size statistics.
Score penalty: |job size - host speed|²
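A sketch of a score function combining the bonus terms from the previous slide with the size-matching penalty; the weights, normalizations, and struct fields are assumptions.

```cpp
// Sketch of job scoring for one (job, host) pair. The scheduler scores a
// random sample of feasible cached jobs and sends the M highest-scoring ones.
struct JobInfo {
    bool is_retry;
    bool committed_to_host_class;  // HR class matches this host
    bool app_selected_by_user;
    double size;                   // normalized job size
};

struct HostInfo {
    bool fast_and_reliable;
    double speed;                  // normalized host speed, same scale as size
};

double job_score(const JobInfo& j, const HostInfo& h) {
    double score = 0;
    if (h.fast_and_reliable && j.is_retry)  score += 3;   // accelerate retries
    if (j.committed_to_host_class)          score += 2;   // finish HR classes
    if (j.app_selected_by_user)             score += 1;   // respect preferences
    double d = j.size - h.speed;
    score -= d * d;               // |job - host|^2 penalty for size mismatch
    return score;
}
```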

Adaptive replication
- Goal: achieve a target level of reliability while reducing replication to 1+ε
- Idea: replicate less (but always sometimes) as a host becomes more trusted
- Policy:
  - maintain an "invalid rate" E(h) per host
  - if E(h) > X, replicate (e.g., 2-fold)
  - else replicate with probability E(h)/X
- Is there a counterstrategy?
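A sketch of the replication decision as stated on the slide; the value of X is illustrative, the 2-fold level follows the slide's example, and E(h) would come from the host's validation history.

```cpp
// Sketch of adaptive replication: untrusted hosts always get double-checked;
// trusted hosts are spot-checked with probability proportional to their
// invalid rate.
#include <random>

const double X = 0.05;  // invalid-rate threshold for "trusted" (assumed)

// Number of replicas to create for a job whose first instance goes to a host
// with invalid rate e_h.
int replication_level(double e_h, std::mt19937& rng) {
    if (e_h > X) return 2;                   // not trusted: always replicate
    std::bernoulli_distribution spot_check(e_h / X);
    return spot_check(rng) ? 2 : 1;          // trusted: replicate sometimes
}
// Expected replication approaches 1 + E(h)/X as hosts become reliable, which
// is the "1 + epsilon" the slide mentions.
```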

Server simulation
- How do we know these policies are any good? How can we study alternatives?
- In situ study is difficult
- SIMBA emulator (U. of Delaware): SIMBA emulates N clients interacting with a real (not emulated) BOINC server

Upcoming scheduler changes
- Problems:
  - only one app version is used
  - the completion-time simulation is antiquated (it doesn't reflect multithread, coprocessor, or RAM limitations)
- New concept: resource signature (#CPUs, #coprocessors, RAM)
- Do the completion-time simulation with "greedy EDF scheduling" over the resource signature
- Select the app version that can use the available resources

Conclusion
- Volunteer computing has diverse resources and workloads
- BOINC has mechanisms that deal effectively and efficiently with this diversity
- Lots of fun research problems here!