Explicit Control in a Batch-aware Distributed File System John Bent Douglas Thain Andrea Arpaci-Dusseau Remzi Arpaci-Dusseau Miron Livny University of Wisconsin, Madison

Grid computing Physicists invent distributed computing! Astronomers develop virtual supercomputers!

Grid computing Home storage Internet If it looks like a duck...

Are existing distributed file systems adequate for batch computing workloads? NO. Internal decisions inappropriate Caching, consistency, replication A solution: Batch-Aware Distributed File System (BAD-FS) Combines knowledge with external storage control Detailed information about the workload is known Storage layer allows external control External scheduler makes informed storage decisions Combining information and control results in Improved performance More robust failure handling Simplified implementation

Outline Introduction Batch computing Systems Workloads Environment Why not DFS? Our answer: BAD-FS Design Experimental evaluation Conclusion

Batch computing Not interactive computing Job description languages Users submit System itself executes Many different batch systems Condor LSF PBS Sun Grid Engine

Batch computing [diagram: a scheduler with a job queue dispatches numbered jobs over the Internet to compute nodes, each with a CPU and a manager; input and output data live on home storage]

Batch workloads General properties Large number of processes Process and data dependencies I/O intensive Different types of I/O Endpoint Batch Pipeline Our focus: Scientific workloads More generally applicable Many others use batch computing video production, data mining, electronic design, financial services, graphic rendering “Pipeline and Batch Sharing in Grid Workloads,” Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC 12, 2003.

Batch workloads [diagram: pipelines of processes linked by pipeline I/O, sharing a batch dataset, with endpoint I/O at the start and end of each pipeline]

Cluster-to-cluster (c2c) Not quite p2p More organized Less hostile More homogeneity Correlated failures Each cluster is autonomous Run and managed by different entities An obvious bottleneck is the wide-area link How to manage the flow of data into, within, and out of these clusters?

Why not DFS? A distributed file system would be ideal Easy to use Uniform name space Designed for wide-area networks But... Not practical Embedded decisions are wrong

DFSs make bad decisions Caching Must guess what and how to cache Consistency Output: Must guess when to commit Input: Needs a mechanism to invalidate the cache Replication Must guess what to replicate

BAD-FS makes good decisions Removes the guesswork Scheduler has detailed workload knowledge Storage layer allows external control Scheduler makes informed storage decisions Retains simplicity and elegance of DFS Practical and deployable

Outline Introduction Batch computing Systems Workloads Environment Why not DFS? Our answer: BAD-FS Design Experimental evaluation Conclusion

User-level; requires no privilege Packaged as a modified batch system A new batch system which includes BAD-FS General; will work on all batch systems Tested thus far on multiple batch systems Practical and deployable [diagram: a home store connected over the Internet to remote clusters, each running a batch system such as SGE alongside BAD-FS storage servers]

Contributions of BAD-FS 1) Storage managers 2) Batch-Aware Distributed File System 3) Expanded job description language 4) BAD-FS scheduler [diagram: the BAD-FS scheduler and job queue beside home storage, with a storage manager and BAD-FS running on each compute node]

BAD-FS knowledge Remote cluster knowledge Storage availability Failure rates Workload knowledge Data type (batch, pipeline, or endpoint) Data quantity Job dependencies

Control through volumes Guaranteed storage allocations Containers for job I/O Scheduler Creates volumes to cache input data Subsequent jobs can reuse this data Creates volumes to buffer output data Destroys pipeline, copies endpoint Configures workload to access containers
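
To make the volume idea concrete, here is a minimal sketch in Python with hypothetical names (the Volume class, its fields, and plan_mounts are illustrative, not BAD-FS's actual interface):

    from dataclasses import dataclass

    @dataclass
    class Volume:
        name: str          # volume identifier
        kind: str          # "cache": read-only view of home data; "scratch": private read-write space
        size_mb: int       # guaranteed storage allocation
        source: str = ""   # home-server URL backing a cache volume

    # The scheduler creates one cache volume for the shared batch input and one
    # scratch volume per pipeline to buffer its intermediate (pipe) output.
    batch_volume = Volume("B1", "cache", 500, "ftp://home/data")
    pipe_volume = Volume("P1", "scratch", 200)

    def plan_mounts():
        """Configure a job's I/O to happen inside containers: batch input is
        read from the cache volume, pipe data stays in the scratch volume,
        and only endpoint output is later extracted back to home storage."""
        return {"/data": batch_volume, "/tmp": pipe_volume}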

Knowledge plus control Enhanced performance I/O scoping Capacity-aware scheduling Improved failure handling Cost-benefit replication Simplified implementation No cache consistency protocol

I/O scoping Technique to minimize wide-area traffic Allocate storage to cache batch data Allocate storage for pipeline and endpoint Extract endpoint AMANDA: 200 MB pipeline 500 MB batch 5 MB endpoint BAD-FS Scheduler Compute node Internet Steady-state: Only 5 of 705 MB traverse wide-area.
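
To spell out the arithmetic from this slide: of the 200 + 500 + 5 = 705 MB of I/O per AMANDA pipeline, the 500 MB batch volume is fetched once and then cached at the cluster, the 200 MB of pipeline data never leaves the cluster, and only the 5 MB of endpoint data crosses the wide-area link in steady state.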

Capacity-aware scheduling Technique to avoid over-allocations Scheduler runs only as many jobs as fit
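
A rough Python sketch of that fit test (hypothetical names; a simplification, since in the real system shared batch volumes would be counted once per workload rather than once per job):

    def fits(job, free_mb):
        """A job is admitted only if its guaranteed volume allocations
        (batch cache plus pipe scratch) fit in the storage still free."""
        return job.batch_mb + job.pipe_mb <= free_mb

    def admit_jobs(queue, free_mb):
        """Run only as many jobs as fit; the rest wait in the queue."""
        running = []
        for job in queue:
            if fits(job, free_mb):
                running.append(job)
                free_mb -= job.batch_mb + job.pipe_mb
        return running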

Capacity-aware scheduling [diagram: several pipelines, each with endpoint, pipeline, and batch-dataset I/O, illustrating how much storage concurrent pipelines require]

Capacity-aware scheduling 64 batch-intensive synthetic pipelines Vary size of batch data 16 compute nodes

Improved failure handling Scheduler understands data semantics Data is not just a collection of bytes Losing data is not catastrophic Output can be regenerated by rerunning jobs Cost-benefit replication Replicates only data whose replication cost is cheaper than cost to rerun the job Results in paper

Simplified implementation Data dependencies known Scheduler ensures proper ordering No need for cache consistency protocol in cooperative cache

Real workloads AMANDA Astrophysics study of cosmic events such as gamma-ray bursts BLAST Biology search for proteins within a genome CMS Physics simulation of large particle colliders HF Chemistry study of non-relativistic interactions between atomic nuclei and electrons IBIS Ecology global-scale simulation of earth’s climate used to study effects of human activity (e.g. global warming)

Real workload experience Setup 16 jobs 16 compute nodes Emulated wide-area Configuration Remote I/O AFS-like with /tmp BAD-FS Result is an order of magnitude improvement

BAD Conclusions Existing DFSs are insufficient Schedulers have workload knowledge Schedulers need storage control Caching Consistency Replication Combining this control with knowledge Enhanced performance Improved failure handling Simplified implementation

For more information Questions?

Why not a BAD-scheduler and a traditional DFS? Cooperative caching Data sharing Traditional DFSs assume sharing is the exception and provision for arbitrary, unplanned sharing In batch workloads, sharing is the rule Sharing behavior is completely known Data committal Traditional DFSs must guess when to commit AFS uses close, NFS uses 30 seconds Batch workloads precisely define when

Is capacity-aware scheduling important in the real world? 1. Heterogeneity of remote resources 2. Shared disks 3. Workloads are changing; some are very, very large.

User burden Additional info is needed in the declarative language The user probably already knows this info Or can readily obtain it Typically, this info already exists Scattered across a collection of scripts, Makefiles, etc. BAD-FS improves the current situation by collecting this info into one central location

Enhanced performance I/O scoping Scheduler knows I/O types Creates storage volumes accordingly Only endpoint I/O traverses wide-area Capacity-aware scheduling Scheduler knows I/O quantities Throttles workloads, avoids over-allocations

Improved failure handling Scheduler understands data semantics Lost data is not catastrophic Pipe data can be regenerated Batch data can be refetched Enables cost-benefit replication Measure replication cost data generation cost failure rate Replicate only data whose replication cost is cheaper than expected cost to reproduce Improves workload throughput

Capacity-aware scheduling Goal Avoid overallocations Cache thrashing Write failures Method Breadth-first Depth-first Idleness

Capacity-aware scheduling evaluation Workload 64 synthetic pipelines Varied pipe size Environment 16 compute nodes Configuration Breadth-first Depth-first BAD-FS Failures directly correlate to workload throughput.

Workload example: AMANDA Astrophysics study of cosmic events such as gamma-ray bursts Four stage pipeline 200 MB pipeline I/O 500 MB batch I/O 5 MB endpoint I/O Focus Scientific workloads Many others use batch computing video production, data mining, electronic design, financial services, graphic rendering

BAD-FS and scheduler BAD-FS Allows external decisions via volumes A guaranteed storage allocation Size, lifetime, and a type Cache volumes Read-only view of an external server Can be bound together into cooperative cache Scratch volumes Private read-write name space Batch-aware scheduler Rendezvous of control and information Understands storage needs and availability Controls storage decisions

Scheduler controls storage decisions What and how to cache? Answer: batch data and cooperatively Technique: I/O scoping and capacity-aware scheduling What and when to commit? Answer: endpoint data when ready Technique: I/O scoping and capacity-aware scheduling What and when to replicate? Answer: data whose cost to regenerate is high Technique: cost-benefit replication

I/O scoping Goal Minimize wide-area traffic Means Information about data type Storage volumes Method Create coop cache volumes for batch data Create scratch volumes to contain pipe Result Only endpoint data traverses wide-area Improved workload throughput

I/O scoping evaluation Workload 64 synthetic pipelines 100 MB of I/O each Varied data mix Environment 32 compute nodes Emulated wide-area Configuration Remote I/O Cache volumes Scratch volumes BAD-FS Wide-area traffic directly correlates to workload throughput.

Capacity-aware scheduling Goal Avoid over-allocations of storage Means Information about data quantities Information about storage availability Storage volumes Method Use depth-first scheduling to free pipe volumes Use breadth-first scheduling to free batch Result No thrashing due to over-allocations of batch No failures due to over-allocations of pipe Improved throughput
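
Read as pseudocode, that method might look like the Python sketch below (hypothetical job fields; the real scheduler's policy is more involved):

    def next_job(jobs, pipe_space_low):
        """Depth-first when pipe space is tight: finish started pipelines so
        their scratch volumes can be freed sooner. Breadth-first otherwise:
        advance every pipeline's current stage so the shared batch volume can
        be released once all of them have read it."""
        runnable = [j for j in jobs if j.ready]
        if not runnable:
            return None
        if pipe_space_low:
            started = [j for j in runnable if j.stages_done > 0]
            if started:
                return max(started, key=lambda j: j.stages_done)  # go deep
        return min(runnable, key=lambda j: j.stages_done)         # go wide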

Capacity-aware scheduling evaluation Workload 64 synthetic pipelines Pipe-intensive Environment 16 compute nodes Configuration Breadth-first Depth-first BAD-FS

Capacity-aware scheduling evaluation Workload 64 synthetic pipelines Pipe-intensive Environment 16 compute nodes Configuration Breadth-first Depth-first BAD-FS Failures directly correlate to workload throughput.

Cost-benefit replication Goal Avoid wasted replication overhead Means Knowledge of data semantics Data loss is not catastrophic Can be regenerated or refetched Method Measure Failure rate, f, within each cluster Cost, p, to reproduce data − Time to rerun jobs to regenerate pipe data − Time to refetch batch data from home Cost, r, to replicate data Replicate only when p*f > r Result Data is replicated only when it should be Can improve throughput
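
The rule on this slide, written out as a small Python function (the function is just the inequality from the slide; the example numbers are illustrative, and the units only need to be consistent):

    def should_replicate(p, f, r):
        """p: cost to reproduce the data (rerun jobs for pipe data, refetch
        batch data from home); f: failure rate of the cluster holding it;
        r: cost to replicate it. Replicate only when the expected cost of
        losing the data exceeds the cost of copying it now."""
        return p * f > r

    # Illustrative numbers: data that takes 600 s to regenerate, on a cluster
    # with a 5% chance of failing while the data is needed, is worth copying
    # only if replication costs less than 600 * 0.05 = 30 s.
    assert should_replicate(p=600, f=0.05, r=20)
    assert not should_replicate(p=600, f=0.05, r=40)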

Cost-benefit replication evaluation Workload Synthetic pipelines of depth 3 Runtime 60 seconds Environment Artificially injected failures Configuration Always-copy Never-copy BAD-FS Trade-off overhead in environment without failure to gain throughput in environment with failure.

Real workloads Workload Real workloads 64 pipelines Environment 16 compute nodes Emulated wide-area Cold and warm First 16 are cold Subsequent 48 warm Configuration Remote I/O AFS-like BAD-FS

Experimental results not shown here I/O scoping Capacity planning Cost-benefit replication Other real workload results Large in-the-wild demonstration Works cluster-to-cluster (c2c) Works across multiple batch systems

Existing approaches Remote I/O Interpose and redirect all I/O home CON: Quickly saturates the wide-area connection Pre-staging Manually push all input (endpoint and batch) data Manually pull all endpoint output Manually configure the workload to find pre-staged data CON: Repetitive, error-prone, laborious Traditional distributed file systems Locate remote compute nodes within the same name space as home (e.g. AFS) Not truly an existing approach; impractical to deploy

Declarative language Existing languages express process specification requirements dependencies Add primitives to describe I/O behavior Modified language can express data dependencies type (i.e. endpoint, batch, pipe) quantities

Example: AMANDA on AFS Caching Batch data redundantly fetched Callback overhead Consistency Pipeline data committed on close Replication No idea which data is important AMANDA: 200 MB pipeline I/O 500 MB batch I/O 5 MB endpoint I/O This is the slide on which I’m most interested in feedback.

Overview

                 Practical   Ease of use   Caching   Consistency   Replication
    Remote I/O       √            √           X           √             X
    Pre-staging      √            X           √           √             X
    Trad. DFS        X            √           X           X             X

I/O Scoping

Capacity-aware scheduling, batch-intense

Capacity-aware scheduling evaluation Workload 64 synthetic pipelines Pipe-intensive Environment 16 compute nodes

Failure handling

Workload experience

In the wild

Example workflow language: Condor DAGMan The keyword job names a file with execute instructions The keywords parent and child express relations … no declaration of data
    job A “instructions.A”
    job B “instructions.B”
    job C “instructions.C”
    job D “instructions.D”
    parent A child B
    parent C child D
[diagram: DAG with A → B and C → D]

Adding data primitives to a workflow language New keywords for container operations volume: create a container scratch: specify container type mount: how the app addresses the container extract: the desired endpoint output User must provide complete, exact I/O information to the scheduler Specify which processes use which data Specify size of data read and written

Extended workflow language
    job A “instructions.A”
    job B “instructions.B”
    job C “instructions.C”
    job D “instructions.D”
    parent A child B
    parent C child D
    volume B1 ftp://home/data 1 GB
    volume P1 scratch 500 MB
    volume P2 scratch 500 MB
    A mount B1 /data
    C mount B1 /data
    A mount P1 /tmp
    B mount P1 /tmp
    C mount P2 /tmp
    D mount P2 /tmp
    extract P1/out ftp://home/out.1
    extract P2/out ftp://home/out.2
[diagram: pipelines A→B and C→D share batch volume B1 (ftp://home/data); each writes "out" in its scratch volume (P1, P2), which is extracted to ftp://home/out.1 and ftp://home/out.2]

Terminology Application Process Workload Pipeline I/O Batch I/O Endpoint I/O Pipe-depth Batch-width Scheduler Home storage Catalogue

Remote resources

Example scenario Workload Width 100, depth 2 1 GB batch 1 GB pipe 1 KB endpoint Environment Batch data archived at home Remote compute cluster available [diagram: the home store holding the batch data and the remote compute cluster that will run the pipelines]

Ideal utilization of remote storage Minimize wide-area traffic by scoping I/O Transfer batch data once and cache Contain pipe data within compute cluster Only endpoint data should traverse wide-area Improve throughput through space mgmt Avoid thrashing due to excessive batch Avoid failure due to excessive pipe Cost-benefit checkpointing and replication Track data generation and replication costs Measure failure rates Use cost-benefit checkpointing algorithm Apply independent policy for each pipeline
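
A back-of-the-envelope comparison for the example scenario above, as a Python sketch (assumptions: the 1 GB batch dataset is shared by all 100 pipelines, and without scoping each pipeline re-reads it over the wide-area and round-trips its pipe data through home storage):

    GB, KB = 1024**3, 1024
    width = 100
    batch, pipe, endpoint = 1 * GB, 1 * GB, 1 * KB

    # Without I/O scoping: every pipeline pulls the batch data from home and
    # writes then re-reads its pipe data across the wide-area link.
    unscoped = width * (batch + 2 * pipe + endpoint)   # roughly 300 GB

    # With I/O scoping: the batch data crosses once and is cached, pipe data
    # stays inside the cluster, and only endpoint data goes home.
    scoped = batch + width * endpoint                  # roughly 1 GB

    print(round(unscoped / GB), round(scoped / GB))    # -> 300 1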

Remote I/O Simplest conceptually Requires least amount of remote privilege But... Batch data fetched redundantly Pipe I/O unnecessarily crosses wide-area Wide-area bottleneck quickly saturates

Pre-staging Requires a large user burden Needs access to the local file system of each cluster Manually pushes batch data Manually configures the workload to use /tmp Manually pulls endpoint outputs Good performance through I/O scoping, but Tedious, repetitive, mistake-prone Availability of /tmp can’t be guaranteed Scheduler lacks knowledge to checkpoint