Volunteer Computing
David P. Anderson
Space Sciences Laboratory, University of California, Berkeley
Outline ● Volunteer computing ● BOINC: an OS for volunteer computing ● Applications ● Challenges and research directions
Where's the power? ● 2010: 1 billion Internet-connected PCs, 55% privately owned ● If 100M people participate: – 100 PetaFLOPS, 1 Exabyte (10^18 bytes) of storage ● Consumer products drive technology – GPUs (NVIDIA, Sony Cell) [Figure: computing capacity of academic and business systems vs. home PCs ("your computers")]
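A back-of-the-envelope check of these totals, assuming (not stated on the slide) roughly 1 GFLOPS of compute and 10 GB of spare disk per PC:

$$10^8 \text{ PCs} \times 10^9 \text{ FLOPS} = 10^{17} \text{ FLOPS} = 100 \text{ PetaFLOPS}$$
$$10^8 \text{ PCs} \times 10^{10} \text{ bytes} = 10^{18} \text{ bytes} = 1 \text{ Exabyte}$$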
Volunteer computing history ● Early projects: GIMPS, distributed.net, climateprediction.net ● Names used for the paradigm: volunteer computing, public-resource computing, Internet computing, screensaver computing, global computing, peer-to-peer computing, Grid computing ● BOINC
Volunteer/Grid differences
Save money! ● Suppose processing 1 GB of data takes X computer-days:

                cluster/Grid      volunteer computing
  computing:    $1 per CPU-day    free
  network:      free              $1 per 20 GB
  cost per GB:  $X                $1/20

● So volunteer computing is cheaper if X > 1/20 (e.g., X = 1,000)
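Spelling out the comparison behind the table:

$$\text{cluster/Grid: } X \text{ CPU-days} \times \$1/\text{CPU-day} = \$X \text{ per GB}$$
$$\text{volunteer: } \$\tfrac{1}{20} \text{ per GB (bandwidth only)}$$
$$\text{volunteer is cheaper} \iff X > \tfrac{1}{20}$$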
Educational discount ● Internet2 (free, underutilized) connects UCB to partner institutions such as UCLA and UIUC; the commodity Internet costs $$ ● Underutilized flat-rate ISP connections... so bandwidth may be effectively free also
Infrastructure software ● Roll your own ● XtremWeb, Cosm – not complete/robust ● United Devices, Entropia – not free ● Grid (Globus/Condor), JXTA – solve a different problem ● BOINC (Berkeley Open Infrastructure for Network Computing)
Projects and participants [Figure: participants (Joe, Alice, Jens) attach to multiple projects (SETI, physics, climate, biomedical)] ● Key issues: diversity, autonomy, heterogeneity, allocation, trust
Encourage participation in >1 project ● Better long-term resource utilization – project A works while project B thinks ● Better short-term resource utilization – communicate/compute in parallel – match applications to resources [Figure: a project's computing needs alternate between "think" and "work" phases over time]
Creating a BOINC project ● Install BOINC server software on Unix box ● Adapt or develop application – compile for various platforms ● Write scripts/programs to: – generate tasks – validate results – handle results ● Develop web site ● Get media coverage
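To make the task-generation step concrete, here is a minimal work-generator sketch modeled on BOINC's sample work generator. The app name, file names, paths, and resource bounds are placeholders, and the create_work() signature is from memory and may differ slightly across BOINC versions:

    // Sketch of a server-side work generator. "myapp", "input_0", and the
    // numeric bounds below are illustrative, not from the talk.
    #include <cstdio>
    #include <cstdlib>
    #include "boinc_db.h"      // DB_APP, DB_WORKUNIT
    #include "sched_config.h"  // SCHED_CONFIG
    #include "backend_lib.h"   // create_work()

    int make_job(DB_APP& app, SCHED_CONFIG& config, const char* wu_template) {
        DB_WORKUNIT wu;
        wu.clear();
        wu.appid = app.id;
        sprintf(wu.name, "myapp_wu_%ld", lrand48());  // must be unique
        wu.rsc_fpops_est = 1e12;      // estimated FLOPs per task
        wu.rsc_fpops_bound = 1e14;    // abort runaway tasks
        wu.rsc_disk_bound = 1e8;      // disk bound, bytes
        wu.delay_bound = 7*86400;     // report deadline: one week
        wu.min_quorum = 2;            // redundant computing (see below)
        wu.target_nresults = 2;
        const char* infiles[] = { "input_0" };  // already staged for download
        // wu_template holds the contents of the workunit template XML.
        return create_work(
            wu, wu_template,
            "templates/myapp_result",                    // result template (relative)
            "/path/to/project/templates/myapp_result",   // ... and its absolute path
            infiles, 1, config
        );
    }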
Structure of a BOINC project ● Scheduling server (C++) ● BOINC DB (MySQL) ● Data server (HTTP) ● Web interfaces (PHP) ● Back-end daemons: work generation, retry generation, result validation, result processing, garbage collection ● Ongoing tasks: – monitor server correctness – monitor server performance – develop and maintain applications
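These daemons are registered in the project's config.xml; a sketch (daemon names and flags are illustrative, and the exact set varies by project and BOINC version):

    <boinc>
      <daemons>
        <daemon><cmd>feeder -d 3</cmd></daemon>
        <daemon><cmd>transitioner -d 3</cmd></daemon>
        <daemon><cmd>myapp_work_generator</cmd></daemon>
        <daemon><cmd>myapp_validator -app myapp</cmd></daemon>
        <daemon><cmd>myapp_assimilator -app myapp</cmd></daemon>
        <daemon><cmd>file_deleter -d 3</cmd></daemon>
      </daemons>
    </boinc>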
Redundant computing ● Addresses hardware errors, hackers ● Issue 2 or more copies of each task – don't send to same host or user – timed retry up to a limit ● Result comparison approaches – Application-specific “fuzzy comparison” – Homogeneous redundancy ● send copies only to numerically equivalent hosts – Develop platform-independent app
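A sketch of application-specific "fuzzy comparison" using BOINC's validator framework: the project supplies hooks that the framework calls to decide whether a quorum of results match. The tolerance and output format here are assumptions, not from the talk:

    // Fuzzy result comparison for the BOINC validator framework.
    // init_result() (omitted) would parse a result's output file into a
    // std::vector<double>; cleanup_result() (omitted) would free it.
    #include <algorithm>
    #include <cmath>
    #include <vector>
    #include "validate_util2.h"  // declares the hooks below

    const double TOLERANCE = 1e-5;  // assumed; tune per application

    int compare_results(RESULT& r1, void* data1, RESULT const& r2, void* data2, bool& match) {
        std::vector<double>& a = *(std::vector<double>*)data1;
        std::vector<double>& b = *(std::vector<double>*)data2;
        match = (a.size() == b.size());
        for (size_t i = 0; match && i < a.size(); i++) {
            double denom = std::max(std::fabs(a[i]), std::fabs(b[i]));
            // equivalent if every value agrees within a relative tolerance,
            // absorbing cross-platform floating-point differences
            if (denom > 0 && std::fabs(a[i] - b[i])/denom > TOLERANCE) match = false;
        }
        return 0;
    }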
What do participants want? ● Incentives – contribute to science – get acknowledgement – community – screensaver graphics ● Invisibility, control of resource usage ● Involvement – translation, porting etc.
Credit accounting ● Credit is granted for – computation (CPU time x benchmark) – storage – network communication ● Cheat-resistance ● Accounting – user, host, team ● Credit DB export for 3rd-party web sites – cross-project identification
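Roughly, the benchmark-based claimed credit is (my reconstruction; BOINC scales by a fixed "cobblestone" constant so that a reference ~1 GFLOPS host earns a fixed amount per CPU-day, and details vary by version):

$$\text{claimed credit} \;\propto\; \frac{T_{\text{CPU}}}{86400 \text{ s}} \times \frac{B_{\text{Whetstone}} + B_{\text{Dhrystone}}}{2}$$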
Participating ● Select project(s) ● Create account(s) ● Download/install BOINC client software ● Interact via web: – preferences – leaderboards – profile – teams – message boards, dynamic FAQ
Anonymous platform mechanism ● Participant compiles software from source ● Scheduler RPC: platform is “anonymous” ● Purposes: – support obscure platforms – security-conscious participants – performance tuning of applications
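Mechanically, the participant describes the self-built binary in an app_info.xml file in the project directory; the client then reports its platform as "anonymous" and the scheduler sends tasks for the described app versions. App and file names below are placeholders:

    <app_info>
        <app>
            <name>myapp</name>
        </app>
        <file_info>
            <name>myapp_5.12_x86_64-pc-linux-gnu</name>
            <executable/>
        </file_info>
        <app_version>
            <app_name>myapp</app_name>
            <version_num>512</version_num>
            <file_ref>
                <file_name>myapp_5.12_x86_64-pc-linux-gnu</file_name>
                <main_program/>
            </file_ref>
        </app_version>
    </app_info>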
Client structure [Figure: applications and the screensaver run under the core client; the BOINC Manager provides the user interface; the core client communicates with project servers]
Applications ● Computation model – Workunits, results – Deadlines, resource estimates ● Data model – files, file references ● Mostly existing apps (FORTRAN, C) ● Categories – Physical simulation – Data processing – Distribution for its own sake
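A minimal sketch of adapting an existing C/C++ program to the BOINC client API (the logical file name "in" and the work loop are placeholders):

    // Skeleton BOINC application: initialize, resolve logical file names,
    // report progress, checkpoint when asked, report completion.
    #include <cstdio>
    #include "boinc_api.h"
    #include "filesys.h"   // boinc_fopen()

    int main(int argc, char** argv) {
        boinc_init();                                      // attach to core client
        char path[512];
        boinc_resolve_filename("in", path, sizeof(path));  // logical -> physical
        FILE* in = boinc_fopen(path, "r");
        const int NSTEPS = 1000;                           // placeholder loop
        for (int i = 0; i < NSTEPS; i++) {
            // ... do one chunk of the computation ...
            boinc_fraction_done((double)i / NSTEPS);       // progress display
            if (boinc_time_to_checkpoint()) {
                // ... write state to a checkpoint file ...
                boinc_checkpoint_completed();
            }
        }
        fclose(in);
        boinc_finish(0);                                   // does not return
    }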
SETI@home ● Analysis of radio telescope data from Arecibo – SETI: search for narrowband signals – Astropulse: search for short broadband signals ● 0.3 MB in, ~4 CPU hours, 10 KB out ● Enhancements under BOINC: – data archival on clients – direct data distribution from observatory
Climateprediction.net ● Climate change study (Oxford University) – Met Office model (FORTRAN, 1M lines) ● Input: ~10MB executable, 1MB data ● Output per workunit: – 10 MB summary (always upload) – 1 GB detail file (archive on client, may upload) ● CPU time: 2-3 months (can't migrate) – trickle messages – preemptive scheduling
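Trickle messages let these months-long tasks report progress (and receive credit) before completion. A sketch against the BOINC API; the message variety and payload are made up:

    // Periodic trickle-up from a long-running model. The core client
    // forwards the message to the server at the next scheduler contact.
    #include <cstdio>
    #include "boinc_api.h"

    void report_progress(double model_years_done, double cpu_time) {
        char buf[256];
        snprintf(buf, sizeof(buf),
            "<progress><years>%f</years><cpu>%f</cpu></progress>",
            model_years_done, cpu_time);
        boinc_send_trickle_up(
            const_cast<char*>("model_progress"),  // variety (placeholder)
            buf);
    }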
Biology projects ● Protein folding – (Scripps Institute) – (Stanford) ● Virtual drug discovery – ● Gene sequence analysis – NTT projects – Lattice (U. Maryland)
Einstein@Home ● Gravitational wave detection; LIGO ● UW Milwaukee / Caltech / Max Planck Inst. ● ~30 MB data sets ● Each data set is analyzed with 40,000 different parameter sets; each takes ~6 hrs CPU ● Locality scheduling – minimize data transfer, client disk usage – minimize credit-granting delay
CERN projects ● LHC@home – accelerator simulation (SixTrack) ● – collision data analysis
Others ● UCB Internet measurement – map/measure the Internet and home PCs ● BURP (Big and Ugly Rendering Project) – ray-tracing ● PlanetQuest – image analysis for planetary transit detection
Challenges and questions ● Get 100 million participants – simplified account management ● Get more projects ● Distributed file system support ● Use peer-to-peer communication – BitTorrent integration ● Use GPUs and other resources ● Integrate with Grid (Lattice, CERN)
Volunteer computing ● A new high-performance computing paradigm ● Benefits to projects: – enables otherwise infeasible computational research – economic advantage even for small projects ● Benefits to participants: – increase public scientific knowledge/interest – catalyze virtual communities – democratize resource allocation