Frontiers of Volunteer Computing
David Anderson
Space Sciences Lab, UC Berkeley
30 Dec. 2011
High-Throughput Computing (HTC)
● Thousands or millions of separate jobs
● What matters is the rate of job completion
  – not the turnaround time of individual jobs
● Can use commodity computers
  – don't need supercomputers
Scientific use of HTC
● Physical simulation
  – particle physics
  – atomic/molecular (bio, nano)
  – Earth climate system
● Compute-intensive data analysis
  – LHC (particle physics)
  – LIGO (gravitational waves)
  – genomics
● Bio-inspired optimization
  – genetic algorithms, flocking, ant colony, etc.
Measures of computing throughput
● Floating-point operations (FLOP)
  – benchmarks: Linpack, Whetstone
● GigaFLOPS (10^9/sec): 1 PC
● TeraFLOPS (10^12/sec): 1 GPU
● PetaFLOPS (10^15/sec): supercomputer
● ExaFLOPS (10^18/sec): the future
Approaches to HPC
● Cluster computing
  – commodity or rack-mount PCs in one room
● Grid computing
  – sharing of clusters among organizations
● Cloud computing
  – rent cluster nodes, e.g. Amazon EC2
● Volunteer computing
  – PC owners donate use of their resources
Computing capacity
● Cluster: 1,000 nodes = ~10 TeraFLOPS
● Grid: the largest is ~100,000 nodes
● Cloud: Amazon ~100,000 nodes; ~1 PetaFLOPS
● Volunteer (actual):
  – 700,000 PCs, 100,000 with GPUs; 12 PetaFLOPS
● Volunteer (current potential):
  – 1.5 billion PCs: 100 ExaFLOPS
  – 5 billion mobile devices
Cost (for 10 TeraFLOPS for one year)
● Cluster: $1.5M
● Amazon EC2 (5,000 instances): $4M
● Volunteer: ~$0.1M
Energy
All computing uses energy, but:
● In cold climates, volunteer computing replaces conventional heating
● GPUs are 10X more energy-efficient than CPUs
● Mobile-device CPUs are 10X more energy-efficient than desktop CPUs
Volunteer computing with BOINC
[Diagram: volunteers attach their computers to BOINC projects such as Climateprediction.net (CPDN) and World Community Grid (WCG)]
How to volunteer
Choose projects
Configure
Graphical interfaces
Community
Creating a BOINC project
● Install BOINC server software on a Linux box
● Compile apps for Windows/Mac/Linux
● Attract volunteers
  – develop a web site
  – generate publicity
  – communicate with volunteers
Some projects
● IBM World Community Grid
● Climateprediction.net
● [additional projects shown as logos]
Fundamental problems of volunteer computing
● Heterogeneity
  – need to compile apps for Windows and Mac
  – portability is hard even on Linux
● Security
  – currently: account-based sandboxing
  – not enough for untrusted apps
Virtual machine technology can solve both problems.
Virtual machines
[Diagrams: an application running directly on an operating system; the same application running in a guest operating system on top of a host operating system; a concrete example with a Debian Linux 2.6 guest running on a Windows 7 host]
VirtualBox: a VM system
● Open source (owned by Oracle)
● Rich feature set
● Low runtime overhead
● Easy to install
Process structure
[Diagram: the BOINC client talks to vboxwrapper via shared-memory message passing; vboxwrapper controls the VirtualBox daemon through its command-line tool; the application in the VM instance exchanges data with the host through file-based communication]
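To make the control path concrete, here is a minimal Python sketch (not the real vboxwrapper, which is a C++ BOINC application) of how a wrapper process can drive a VirtualBox VM through the VBoxManage command-line tool; the VM name and polling interval are illustrative assumptions.

```python
# Minimal sketch (not the real vboxwrapper) of controlling a VirtualBox VM
# through the VBoxManage command-line tool. Assumes a VM named "boinc_app_vm"
# has already been registered with VirtualBox, and that the guest shuts the
# VM down when the application finishes.
import subprocess
import time

VM_NAME = "boinc_app_vm"   # hypothetical VM name

def vbox(*args):
    """Run a VBoxManage command and return its standard output."""
    return subprocess.run(["VBoxManage", *args],
                          capture_output=True, text=True, check=True).stdout

def vm_is_running():
    """Check the VM state via 'VBoxManage showvminfo'."""
    info = vbox("showvminfo", VM_NAME, "--machinereadable")
    return 'VMState="running"' in info

# Start the VM without a display window, as a wrapper would on a volunteer host.
vbox("startvm", VM_NAME, "--type", "headless")

# Poll until the guest powers off; the real wrapper also reports progress to
# the BOINC client over shared memory and exchanges input/output files with
# the guest (file-based communication).
while vm_is_running():
    time.sleep(10)
```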
VM advantages
● Applications run in the developer's favorite environment (OS, libraries)
  – no need for multiple versions
● A VM is a strong “sandbox”
  – an application running in a VM can't access the host OS
  – can run untrusted applications
Volunteer storage
● A modern PC has a ~1 TB disk
● 1M PCs × 100 GB = 100 Petabytes
● Storing that much on Amazon: ~$120 million/year
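As a quick sanity check of the capacity arithmetic (decimal units assumed):

\[ 10^{6}\ \text{PCs} \times 100\,\text{GB} = 10^{6} \times 10^{11}\,\text{B} = 10^{17}\,\text{B} = 100\,\text{PB} \]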
BOINC storage architecture
[Diagram: a common BOINC file-management infrastructure supports several storage applications: dataset storage, data archival, data-stream buffering, and locality scheduling]
Data archival
● Goals:
  – store large files for long periods
  – arbitrarily high reliability
Recovery in volunteer storage
[Diagram sequence: the server uploads a file's data to clients and can then delete its own copy; when a client fails (X), the data is reassembled on the server from the remaining clients and downloaded to a new client, restoring redundancy]
Volunteer storage issues
● High churn rate of hosts
  – ~90-day mean lifetime
● High latency of file transfers
  – hours or days
● Modeling volunteer storage systems
  – overlapping failure and recovery
  – server storage and bandwidth may be bottlenecks
Replication
[Diagram: a file divided into N pieces, each replicated on M hosts]
● Advantages:
  – fast recovery (1 upload, 1 download)
  – increase N to reduce server storage needs
● But:
  – high space overhead
  – reliability decreases exponentially with N
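A rough reliability estimate under this interpretation (N pieces, M replicas per piece, each replica independently surviving a given period with probability s; the symbol s is illustrative, not from the slide):

\[ P_{\text{file survives}} = \bigl(1 - (1-s)^{M}\bigr)^{N} \]

A piece is lost only if all M of its replicas fail, but the file is lost if any of the N pieces is lost, so survival probability falls geometrically (exponentially) with N, while a larger M helps only that one piece.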
Coding
Divide the file into N blocks and generate K additional “checksum” blocks; the file can be recovered from any N of the N+K blocks.
● Advantages:
  – high reliability with low space overhead
● But:
  – recovering a block requires reassembling the entire file (network and space overhead)
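Under the same illustrative assumption of independent per-block survival probability s, an (N, K) code survives as long as at least N of the N+K blocks remain:

\[ P_{\text{file survives}} = \sum_{i=N}^{N+K} \binom{N+K}{i}\, s^{\,i} (1-s)^{N+K-i} \]

which is far higher than replication for comparable space overhead, at the cost of having to gather N blocks to rebuild anything.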
Multi-level coding
● Divide the file, encode each piece separately
● Use encoding for the top-level chunks as well
● Can extend to more than 2 levels
[Diagram: two-level encoding, with (N, K) coding applied at each level]
Hybrid coding/replication
● Use multi-level coding, but replicate each bottom-level block 2 or 3X
● Most failures will be recovered via replication
● The idea: get both the fast recovery of replication and the high reliability of coding
Distributed storage simulator
● Inputs:
  – host arrival rate, lifetime distribution, upload/download speeds, free disk space
  – parameters of the files to be stored
● Policies that can be simulated:
  – M-level coding, N and K coding values, R-fold replication
● Outputs:
  – statistics of server disk space usage, network bandwidth, “vulnerability” level
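The BOINC simulator itself is not shown here; as an illustration of the kind of question it answers, the following Python sketch estimates by Monte Carlo the probability that a file stored as N chunks with R replicas each survives a year, using the ~90-day mean host lifetime and a roughly one-day transfer latency from the previous slides. All other parameter values and names are assumptions.

```python
# Monte-Carlo sketch (not the actual BOINC simulator): estimate the chance
# that a file stored as n_chunks chunks, each with `replicas` replicas,
# survives `period` days given exponentially distributed host lifetimes and
# a fixed recovery latency for re-creating a lost replica.
import random

def simulate_once(n_chunks=40, replicas=2, mean_lifetime=90.0,
                  recovery_latency=1.0, period=365.0):
    """Return True if every chunk keeps at least one live replica for `period` days."""
    for _ in range(n_chunks):
        # Future failure times of this chunk's currently live replicas.
        deaths = sorted(random.expovariate(1.0 / mean_lifetime)
                        for _ in range(replicas))
        while True:
            t = deaths.pop(0)          # earliest replica failure
            if t > period:
                break                  # chunk survives the whole period
            if not deaths:
                return False           # last replica died: chunk (and file) lost
            # Re-create the lost replica after the recovery latency.
            # (Simplification: the replacement counts as live while in transfer.)
            deaths.append(t + recovery_latency +
                          random.expovariate(1.0 / mean_lifetime))
            deaths.sort()
    return True

runs = 10_000
survived = sum(simulate_once() for _ in range(runs))
print(f"estimated one-year survival probability: {survived / runs:.4f}")
```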
Multi-user projects
● Needed:
  – remote job submission mechanism
  – quota system
  – scheduling support for batches
[Diagram: scientists submit batches of jobs through a science portal to a BOINC server run by sysadmins]
Quota system
● Each user has a quota
● Batch prioritization goals:
  – enforce quotas over the long term
  – give priority to short batches
  – don't starve long batches
Batch prioritization
● Each user U has a “logical start time” LST(U)
● Prioritize batches by increasing logical end time
[Diagram: example timeline showing batches B1–B4 and the logical start times LST(U1) and LST(U2) of two users]
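The slide does not spell out the full algorithm; the sketch below shows one plausible interpretation in Python: each user has a resource share, a submitted batch gets a logical end time of LST(U) plus its estimated runtime divided by that share, LST(U) advances to that value, and batches are dispatched in order of increasing logical end time. Class and parameter names are invented for illustration.

```python
# Sketch of logical-start-time (LST) batch prioritization: an assumed
# interpretation of the slide, not the exact BOINC algorithm. Short batches
# get early logical end times and run first; heavy users accumulate large
# logical start times, enforcing quotas over the long term without starving
# long batches.
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Batch:
    logical_end_time: float
    name: str = field(compare=False)
    user: str = field(compare=False)

class BatchScheduler:
    def __init__(self, shares):
        self.shares = shares                          # user -> resource share
        self.lst = {u: time.time() for u in shares}   # logical start times
        self.queue = []                               # min-heap by logical end time

    def submit(self, user, name, est_runtime):
        """Assign the batch a logical interval and queue it."""
        start = max(time.time(), self.lst[user])
        end = start + est_runtime / self.shares[user]
        self.lst[user] = end                          # charge the user's quota
        heapq.heappush(self.queue, Batch(end, name, user))

    def next_batch(self):
        """Dispatch the batch with the earliest logical end time."""
        return heapq.heappop(self.queue) if self.queue else None

# Example: a short batch from one user jumps ahead of a much larger batch
# from another, but neither user is starved indefinitely.
sched = BatchScheduler(shares={"alice": 1.0, "bob": 1.0})
sched.submit("bob", "B1", est_runtime=100_000)
sched.submit("alice", "B2", est_runtime=1_000)
print(sched.next_batch().name)   # B2 (earlier logical end time)
```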
Thank you!