Frontiers of Volunteer Computing David Anderson Space Sciences Lab UC Berkeley 28 Nov. 2011.

The consumer digital infrastructure ● 1.5 billion PCs, 5 billion mobile devices – 100 ExaFLOPS – capable of most scientific computing ● Cost of sustained HPC: volunteer ≪ dedicated ≪ rented

Volunteer computing ● BOINC – 450,000 active computers – ~30 science projects ● Some areas of research and development – VM-based applications – Volunteer storage – Multi-user projects – Emulating scheduling policies

VM-based applications Fundamental problems of volunteer computing: ● Heterogeneity – need to compile apps for Win, Mac – portability is hard even on Linux ● Security – currently: account-based sandboxing – not enough for untrusted apps Virtual machine technology can solve both

VirtualBox ● Open source (owned by Oracle) ● Rich feature set – directory sharing – control of network activity in VM ● Low runtime overhead ● Easy to install

Process structure (diagram) ● Components: BOINC client, vboxwrapper, VirtualBox daemon, VM instance ● Links between them: shared-mem msg passing, cmdline tool, file-based communication

Directory structure ● Host OS: BOINC/slots/0/ – VM image file – shared/ – input/output files – application executable ● Guest OS: /shared/

Components of a VM app version ● VM image – developer’s preferred environment – whatever libraries they want – can be shared among any/all apps ● vboxwrapper – main program ● application executable ● VM config file – specifies memory size, etc.

Example: 2.0 ● CernVM – minimal image (~100 MB compressed) – science libraries (~10 GB) are paged in using HTTP – contains the client of existing CERN job queueing systems ● Goal: provide more computing power to physicists without requiring them to change anything.

Future work ● Bundle VirtualBox with BOINC installer ● Using GPUs within VMs ● Multiple VM jobs on multicore hosts: how to do it efficiently ● Streamlined mechanisms for deploying VM-based apps

Volunteer storage ● A modern PC has ~1 TB disk ● 1M PCs * 100GB = 100 Petabytes ● Amazon: $120 million/year
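A rough check of the figures on this slide, as a minimal sketch. It assumes decimal units (1 PB = 10^6 GB) and reads the $120 million/year figure as the cost of holding the same 100 PB in Amazon's cloud storage; the per-GB prices it prints are derived from those two numbers, not quoted from the slides.

```python
# Back-of-envelope check of the volunteer storage figures.
# Assumption: decimal units, 1 PB = 1e6 GB.
num_pcs = 1_000_000        # 1M volunteer PCs
gb_per_pc = 100            # 100 GB contributed per PC
total_gb = num_pcs * gb_per_pc
total_pb = total_gb / 1e6

# Assumption: $120M/year is the cloud cost of the same capacity.
annual_cost_usd = 120e6
usd_per_gb_year = annual_cost_usd / total_gb
usd_per_gb_month = usd_per_gb_year / 12

print(f"{total_pb:.0f} PB total")
print(f"${usd_per_gb_year:.2f}/GB/year, ${usd_per_gb_month:.2f}/GB/month")
```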

BOINC storage architecture (diagram) ● Storage applications: dataset storage, data archival, data stream buffering, locality scheduling ● Built on the BOINC file management infrastructure

Data archival ● Goals – store large files for long periods – arbitrarily high reliability ● Issues – high churn rate of hosts – high latency of file transfers ● Models – overlapping failure and recovery – server storage and bandwidth may be bottleneck

Replication Divide file into N chunks, store each chunk on M hosts Advantages: ● Fast recovery (1 upload, 1 download) ● Increase N to reduce server storage needs But: ● High space overhead ● Reliability decreases exponentially with N
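The claim that reliability decreases exponentially with N can be illustrated with a simple model. This is a sketch under assumptions not stated on the slide: each host holding a replica is lost independently with probability `p_fail` before recovery completes, and the file is lost as soon as any one chunk loses all M replicas. The function name is hypothetical.

```python
def replication_survival(n_chunks: int, m_replicas: int, p_fail: float) -> float:
    """Probability that every chunk keeps at least one replica.

    Assumes independent host failures with probability p_fail each.
    """
    chunk_lost = p_fail ** m_replicas          # all M replicas of a chunk fail
    return (1.0 - chunk_lost) ** n_chunks      # all N chunks must survive

# Survival probability shrinks geometrically as N grows (M and p fixed).
for n in (1, 10, 100, 1000):
    print(n, replication_survival(n, m_replicas=2, p_fail=0.1))
```

Under this model the space overhead is simply M, which is why raising reliability by replication alone is expensive.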

Coding Divide file into N blocks, generate K additional “checksum” blocks. Recover file from any N blocks. Advantages: ● High reliability with low space overhead But: ● Recovering a block requires reassembling the entire file (network, space overhead)
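To see why coding gives high reliability at low space overhead, the same kind of independent-failure model can be applied. This is a sketch, assuming each of the N + K blocks sits on a different host that is lost independently with probability `p_fail`; the file survives if at most K blocks are lost. The function names are hypothetical.

```python
from math import comb

def coding_survival(n_blocks: int, k_extra: int, p_fail: float) -> float:
    """Probability that at least N of the N+K blocks survive.

    Assumes one block per host and independent host failures.
    """
    total = n_blocks + k_extra
    return sum(
        comb(total, lost) * p_fail ** lost * (1.0 - p_fail) ** (total - lost)
        for lost in range(k_extra + 1)         # tolerate up to K lost blocks
    )

def space_overhead(n_blocks: int, k_extra: int) -> float:
    """Stored bytes divided by original file size."""
    return (n_blocks + k_extra) / n_blocks

print(space_overhead(10, 10), coding_survival(10, 10, 0.1))
print(space_overhead(10, 5), coding_survival(10, 5, 0.1))
```

With N = K = 10 the overhead is 2x, the same as two-fold replication, but the file is lost only if more than half of the 20 hosts fail.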

Multi-level coding ● Divide file, encode each piece separately ● Use encoding for top-level chunks as well ● Can extend to > 2 levels

Hybrid coding/replication ● Use multi-level coding, but replicate each bottom-level block 2 or 3X. ● Most failures will be recovered with replication ● The idea: get both the fast recovery of replication and the high reliability of coding.
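The storage cost of the hybrid scheme follows from multiplying the per-level coding overhead by the replication factor. The sketch below assumes the same N and K at every level and R replicas of each bottom-level block; the function names are hypothetical.

```python
def hybrid_overhead(n: int, k: int, levels: int, replicas: int) -> float:
    """Stored bytes / file size for multi-level coding plus replication.

    Each coding level multiplies the data by (N + K) / N; replication of
    the bottom-level blocks multiplies it by R.
    """
    return replicas * ((n + k) / n) ** levels

def bottom_level_blocks(n: int, k: int, levels: int) -> int:
    """Number of distinct bottom-level blocks, before replication."""
    return (n + k) ** levels

# Two-level coding with N=10, K=5 and 2-fold replication.
print(hybrid_overhead(10, 5, levels=2, replicas=2))
print(bottom_level_blocks(10, 5, levels=2))
```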

Distributed storage simulator ● Inputs: – host arrival rate, lifetime distribution, upload/download speeds, free disk space – parameters of files to be stored ● Policies that can be simulated – M-level coding, N and K coding values, R-fold replication ● Outputs – statistics of server disk space usage, network BW, “vulnerability” level

Multi-user projects ● Needed: – remote job submission mechanism – quota system – scheduling support for batches ● Diagram: scientists (users), science portal, sysadmins, BOINC server, batches of jobs

Quota system ● Each user has “quota” ● Batch prioritization goals: – enforce quotas over the long term – give priority to short batches – don’t starve long batches

Batch prioritization ● Each user has a ● For a user U: – Q(U) = fractional quota of U – LST(U) = “logical start time” ● For a batch B: – E(B) = estimated elapsed time of B given all resources – EQ(B) = E(B)/Q(U)

Batch prioritization ● When a user submits a batch B – logical end time LET(B) = LST(U) + EQ(B) – LST(U) += EQ(B) ● Prioritize batches by increasing LET(B) ● Example (timeline figure): batches B1, B2, B4, B3 along a time axis, with LET(B1) marked
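The submission rule above can be sketched in a few lines. The `User` and `Batch` classes and the `submit` function are hypothetical names for illustration, and the sketch assumes LST(U) starts at 0; the slides do not say how LST(U) is initialized or advanced while a user is idle.

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    quota: float          # Q(U): fractional quota, in (0, 1]
    lst: float = 0.0      # LST(U): logical start time (assumed to start at 0)

@dataclass
class Batch:
    name: str
    user: User
    est_elapsed: float    # E(B): estimated elapsed time given all resources
    let: float = 0.0      # LET(B): logical end time, set on submission

def submit(batch: Batch) -> Batch:
    """Apply LET(B) = LST(U) + EQ(B), then LST(U) += EQ(B)."""
    eq = batch.est_elapsed / batch.user.quota   # EQ(B) = E(B) / Q(U)
    batch.let = batch.user.lst + eq
    batch.user.lst += eq
    return batch

alice = User("alice", quota=0.5)
bob = User("bob", quota=0.5)
batches = [
    submit(Batch("B1", alice, est_elapsed=10)),
    submit(Batch("B2", alice, est_elapsed=10)),
    submit(Batch("B3", bob, est_elapsed=2)),
    submit(Batch("B4", bob, est_elapsed=12)),
]

# Prioritize batches by increasing logical end time.
ordered = sorted(batches, key=lambda b: b.let)
print([(b.name, b.let) for b in ordered])
```

In this example Bob's short batch B3 runs ahead of Alice's earlier submissions, while Alice's second batch is pushed back by her own first one, which is how the rule favors short batches and enforces quotas without starving long batches.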

Emulating scheduling policies ● Job scheduling policy – what jobs to run – whether to leave suspended jobs in memory ● Work fetch policy – when to get more jobs – what project to get them from – how much to request These policies have a big impact on system performance. They must work across a large space of scenarios.

Scenarios ● Preferences ● Hardware ● Availability (computing, network) ● # of projects ● For each project/application – distribution of job size – accuracy of runtime estimate – latency bound – resource usage – project availability

Issues ● How can we design good scheduling policies? ● How can we debug the BOINC client? ● How can we plan for the future? – many cores – faster GPUs – tight latency bounds – large-RAM applications

The BOINC client emulator (diagram) ● Components: main logic, scheduling policies, availability, job execution, scheduler RPC ● Each component is either emulated (same source code) or simulated

Inputs ● Client state file – describes hardware, availability, projects and their characteristics ● Preferences, configuration files ● Duration, time step of simulation

Outputs ● Figures of merit – idle fraction – wasted fraction – resource share violation – monotony – RPCs per job ● Timeline ● message log ● graphs of scheduling data

Interfaces ● Web-based – volunteers upload scenarios – can see all scenarios, run simulations against them, comment on them ● Scripted – sweep an input parameter – compare 2 policies across a set of scenarios

Future work ● Characterize the scenario population – Monte-Carlo sampling ● Study new policies – e.g. alternatives to EDF ● More features in emulator – memory usage – file transfer time – application checkpointing behavior ● Better model of scheduler behavior

Questions?