Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, Scott Shenker

Presentation transcript:

Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks. Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, Scott Shenker

How can I make this faster? Starting abstractly: we have a user, she's running a workload on some system and getting some result. Pretty quickly she asks: how can I make this faster?

How can I make this faster? She has lots of choices for how to do so. Lots of knobs to turn! To give a few examples…

Should I use a different cloud instance type? Change instance type! Can have significant impact (Shivaram’s work: 1.9x)

Should I trade more CPU for less I/O by using better compression? Change the software configuration. How do I know what the effect of a different tradeoff will be?

How can I make this faster? ??? The user has no way to know how to turn these knobs – yet she could get big performance improvements from doing so!

How can I make this faster? ??? What’s more, this is getting worse: with each new optimization, there are more knobs, and the systems are more complicated, so it’s harder to know how to turn them

How can I make this faster? ???

Architect for performance clarity: make it easy to reason about performance. We should architect for performance clarity. What do I mean by clarity? Much more than just "what's the bottleneck". We want to know how to change the knobs; this requires knowing how the system performs under different conditions. No measure of utilization will tell me what happens if the network is twice as fast. This goes way beyond measurement. I'm not aware of any system that was designed to provide this kind of clarity. Enabling the user to turn the existing N knobs is much more useful than adding an (N+1)th one!

For data analytics frameworks: Is it possible to architect for performance clarity? Does doing so require sacrificing performance? Intuitively, making performance easier to reason about might mean making the system simpler, undoing performance improvements, and making it slower as a result.

Key idea: use single-resource monotasks. Reasoning about performance. Today: why it's hard. Monotasks: why it's easy. Does using monotasks hurt performance? Using monotasks to predict job runtime.

Example Spark Job, Word Count: spark.textFile("hdfs://…") \ .flatMap(lambda l: l.split(" ")) \ .map(lambda w: (w, 1)) \ .reduceByKey(lambda a, b: a + b) \ .saveAsTextFile("hdfs://…") First part: split the input file into words and emit a count of 1 for each. The user writes a high-level description of the job. Two parts; first…

Example Spark Job, Word Count: spark.textFile("hdfs://…") \ .flatMap(lambda l: l.split(" ")) \ .map(lambda w: (w, 1)) \ .reduceByKey(lambda a, b: a + b) \ .saveAsTextFile("hdfs://…") Split the input file into words and emit a count of 1 for each. Second part: for each word, combine the counts and save the output.

Spark Word Count Job: spark.textFile("hdfs://…") .flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)) .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://…") Map Stage: split the input file into words and emit a count of 1 for each. Reduce Stage: for each word, combine the counts and save the output. [Diagram: each stage runs as tasks on Worker 1 … Worker n.] The framework's job is to handle the details of executing the job: break it into tasks that can be parallelized on a cluster, handle data transfer, etc. Aside: this is how Spark works, but lots of other frameworks work this way too!
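For reference, here is the word-count job as a self-contained sketch (assuming PySpark; the "hdfs://…" placeholders from the slide are left as-is and must be replaced with real paths):

```python
# Minimal runnable sketch of the word-count job from the slide (PySpark).
# The "hdfs://…" placeholders are kept from the slide and need real paths.
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs://…")                 # map stage: read the input file
            .flatMap(lambda l: l.split(" "))      # split each line into words
            .map(lambda w: (w, 1))                # emit a count of 1 per word
            .reduceByKey(lambda a, b: a + b))     # reduce stage: combine counts per word

counts.saveAsTextFile("hdfs://…")                 # save the output
```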

Spark Word Count Job, Reduce Stage: .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://…") For each word, combine the counts and save the output. [Diagram: reduce tasks running on Worker 1 … Worker n.] Focus on just the reduce stage.

Spark Word Count Job, Reduce Stage: .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://…") For each word, combine the counts and save the output. [Diagram: a single reduce task overlapping network reads, CPU (aggregate), and disk writes.] Tasks use pipelining to parallelize the use of many resources.

Challenge: tasks pipeline multiple resources, and resource use changes at fine time granularity. [Diagram: a reduce task's timeline alternating between network read, CPU (filter), and disk write; at some instants the task is using only the disk, only the network, or only the CPU.] Fine-grained pipelining makes it hard to reason about performance, because the task is doing different things at different times, and this can change at the granularity of the pipelining. This is how things have always been built: fine-grained pipelining is seen as fundamental to performance! But it's also a fundamental challenge to reasoning about performance. This results in a few issues…
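To make the fine-grained pipelining concrete, here is a minimal sketch (illustrative only, not Spark's actual implementation) of a single "today-style" task whose network, CPU, and disk stages run as a producer-consumer pipeline; at any instant the task may be using any of the three resources, which is exactly what makes per-resource attribution hard.

```python
# Illustrative sketch of one pipelined, multi-resource task (not Spark's code).
# Each stage runs in its own thread, so the task overlaps network, CPU, and disk.
import queue
import threading

fetched = queue.Queue()      # network -> CPU hand-off
processed = queue.Queue()    # CPU -> disk hand-off
DONE = object()              # sentinel marking the end of the stream

def network_reader(blocks):
    # Stand-in for fetching remote shuffle blocks over the network.
    for block in blocks:
        fetched.put(block)
    fetched.put(DONE)

def cpu_worker():
    # Stand-in for deserializing and aggregating each block on the CPU.
    while (block := fetched.get()) is not DONE:
        processed.put(block.upper())
    processed.put(DONE)

def disk_writer(path):
    # Stand-in for writing the task's output to disk.
    with open(path, "w") as f:
        while (block := processed.get()) is not DONE:
            f.write(block)

threads = [
    threading.Thread(target=network_reader, args=(["block-a ", "block-b ", "block-c "],)),
    threading.Thread(target=cpu_worker),
    threading.Thread(target=disk_writer, args=("/tmp/reduce-output.txt",)),  # illustrative path
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```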

[Diagram: 4 concurrent tasks on a worker, plotted over time.] First issue: tasks run concurrently!

Concurrent tasks may contend for the same resource (e.g., the network). [Diagram: Tasks 3, 4, 7, and 8 contending over time.] A single scheduler is responsible for scheduling these multi-resource tasks, so it can't avoid contention that occurs at fine time granularity.

Challenge: resource use is controlled by the operating system. [Diagram: a reduce task's network read, CPU (filter), and disk write alongside Tasks 18 and 19; the disk write goes through the OS buffer cache.] The disk write is controlled by the OS: the OS might reasonably hold all of the data in the buffer cache and actually write it to disk much later, when it will contend with other tasks running on the machine.

What's the bottleneck? [Diagram: Tasks 1-8 overlapping on a worker.] At any time t, different tasks may be bottlenecked on different resources, and a single task may be bottlenecked on different resources at different times. Now, given what I've said so far, say you wanted to answer a simple question: what's the bottleneck? This starts to look pretty hairy pretty quickly. Can *you* answer this from looking at this picture? If you look at one slice in time, different tasks may be bottlenecked on different things! You could think about this for one task, but the task has different bottlenecks at different times!

How much faster would my job be with 2x disk throughput? [Diagram: Tasks 1-8 overlapping on a worker.] How would the runtimes for these disk writes change? How would that change the timing of (and contention for) other resources? What if you wanted to ask a more complicated question? Emphasize: no amount of additional logging is going to make it easy to answer this question today.

Fundamental challenges: tasks have non-uniform resource use; concurrent tasks on a machine may contend; resource use is controlled outside of the framework; there is no model for performance. Contention is exacerbated by the fact that resource use is controlled outside of the framework.

Reasoning about performance. Today: why it's hard. Monotasks: why it's easy. Does using monotasks hurt performance? Using monotasks to predict job runtime.

Today: tasks use pipelining to parallelize multiple resources Proposal: build systems using monotasks that each consume just one resource

Today: tasks have non-uniform resource use; concurrent tasks may contend; resource use is outside of the framework; no model for performance.

Today vs. Monotasks. Today: tasks have non-uniform resource use; concurrent tasks may contend; resource use is outside of the framework; no model for performance. Monotasks: each task uses one resource. [Diagram: today's Task 1 interleaves a network read, CPU work, and a disk write; with monotasks it becomes a network monotask, a compute monotask, and a disk monotask.] Monotasks don't start until all of their dependencies complete. Dependencies are the key to uniform resource use: we don't want tasks to block.

Today vs. Monotasks. Today: tasks have non-uniform resource use; concurrent tasks may contend; resource use is outside of the framework; no model for performance. Monotasks: each task uses one resource; dedicated schedulers control contention. [Diagram: the monotasks for one of today's tasks are queued at a network scheduler, a CPU scheduler (1 monotask per core), and a disk drive scheduler (1 monotask per disk).]

Today vs. Monotasks. Today: tasks have non-uniform resource use; concurrent tasks may contend; resource use is outside of the framework (writes are buffered by the OS); no model for performance. Monotasks: each task uses one resource; dedicated schedulers control contention; per-resource schedulers have complete control, with a network scheduler, a CPU scheduler (1 monotask per core), and a disk drive scheduler (1 monotask per disk) that flushes all writes to disk.
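A minimal sketch of what dedicated per-resource schedulers could look like (the class and the concurrency limits below are illustrative, not the spark-monotasks implementation): each resource gets its own queue with a fixed concurrency, so the framework, not the OS, decides how many monotasks touch each resource at once.

```python
# Illustrative per-resource schedulers (not the spark-monotasks code).
import os
import queue
import threading

class ResourceScheduler:
    """One dedicated scheduler per resource; `concurrency` caps how many
    monotasks may use that resource at the same time."""

    def __init__(self, name, concurrency):
        self.name = name
        self.queue = queue.Queue()
        for _ in range(concurrency):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, monotask):
        # Monotasks wait in the resource's queue instead of contending in the OS.
        self.queue.put(monotask)

    def _run(self):
        while True:
            monotask = self.queue.get()
            monotask()                  # the monotask uses only this resource
            self.queue.task_done()

cpu_sched = ResourceScheduler("cpu", concurrency=os.cpu_count() or 1)  # 1 monotask / core
disk_sched = ResourceScheduler("disk", concurrency=1)                  # 1 monotask / disk
net_sched = ResourceScheduler("network", concurrency=8)                # illustrative limit

cpu_sched.submit(lambda: sum(range(1_000_000)))                                 # compute monotask
disk_sched.submit(lambda: open("/tmp/monotask-out", "wb").write(b"x" * 4096))   # disk monotask

cpu_sched.queue.join()
disk_sched.queue.join()
```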

Today vs. Monotasks (as on the previous slides). Monotasks: monotask times can be used to model performance. Ideal CPU time = total CPU monotask time / # CPU cores.

Monotask times can be used to model performance: ideal CPU time = total CPU monotask time / # CPU cores, and similarly for the ideal network runtime and the ideal disk runtime. Modeled job runtime = max of the ideal times.

How much faster would the job be with 2x disk throughput? Start from the ideal CPU time (total CPU monotask time / # CPU cores), the ideal network runtime, and the ideal disk runtime.

How much faster would the job be with 2x disk throughput? Recompute the ideal disk runtime assuming 2x disk concurrency; the modeled new job runtime is the max of the ideal CPU time, the ideal network runtime, and the new ideal disk runtime. Emphasize: the model is simple (and not perfectly accurate!), but as I'll show later, it is sufficiently accurate to answer what-if questions.
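A worked sketch of this calculation with made-up monotask times (the numbers are illustrative only, not measurements from the paper):

```python
# Worked example of the monotasks performance model with illustrative numbers.
def ideal_time(total_monotask_seconds, concurrency):
    # Ideal time on a resource = total monotask time on it / available concurrency.
    return total_monotask_seconds / concurrency

cpu_ideal  = ideal_time(total_monotask_seconds=800, concurrency=8)   # 8 cores  -> 100 s
net_ideal  = ideal_time(total_monotask_seconds=60,  concurrency=1)   #          ->  60 s
disk_ideal = ideal_time(total_monotask_seconds=120, concurrency=1)   # 1 disk   -> 120 s

modeled_runtime = max(cpu_ideal, net_ideal, disk_ideal)              # 120 s, disk-bound

# What if disk throughput doubled (e.g., 2 disks instead of 1)?
disk_ideal_2x = ideal_time(total_monotask_seconds=120, concurrency=2)   # 60 s
predicted_runtime = max(cpu_ideal, net_ideal, disk_ideal_2x)            # 100 s, now CPU-bound

print(modeled_runtime, predicted_runtime)   # 120.0 100.0
```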

How does this decomposition work? Today, 4 multi-resource tasks run concurrently; with monotasks, the same work is scheduled by per-resource schedulers. Note that simple performance questions (e.g., the bottleneck, or the time spent on each resource) become trivial.

How does this decomposition work? For .reduceByKey(lambda a, b: a + b).saveAsTextFile("hdfs://…"), today's reduce task pipelines network reads, CPU work, and disk writes. With monotasks: network monotasks request remote data; a CPU monotask deserializes, combines the counts, and serializes; and a disk monotask writes the output. The framework can do this under the hood! Key insight: this is possible for the same reason Spark is widely used: the high-level abstraction. The implementation is API-compatible with Spark and lives at the application layer.
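A sketch of what this decomposition could look like as data structures (hypothetical names, not the spark-monotasks code): the explicit dependencies are what keep each monotask's resource use uniform, because a monotask never starts until the data it needs is already local.

```python
# Hypothetical decomposition of one reduce task into single-resource monotasks
# with explicit dependencies (illustrative structure, not the spark-monotasks code).
from dataclasses import dataclass, field

@dataclass
class Monotask:
    name: str
    resource: str                      # "network", "cpu", or "disk"
    dependencies: list = field(default_factory=list)

# Network monotasks: request remote shuffle data from each map output.
fetches = [Monotask(f"fetch-shuffle-block-{i}", "network") for i in range(4)]

# CPU monotask: deserialize, combine the counts, serialize. It does not start
# until all of its network dependencies complete, so it uses only the CPU.
combine = Monotask("deserialize-combine-serialize", "cpu", dependencies=fetches)

# Disk monotask: write the serialized output, only after the CPU work finishes.
write = Monotask("write-output", "disk", dependencies=[combine])

# Each monotask would be handed to its resource's scheduler once its
# dependencies complete (see the per-resource scheduler sketch above).
```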

Reasoning about performance. Today: why it's hard. Monotasks: why it's easy. Does using monotasks hurt performance? Using monotasks to predict job runtime.

Does using monotasks hurt performance? Three benchmark workloads: the big data benchmark (10 queries run with scale factor 5), sort (600 GB sorted using 20 machines), and block coordinate descent (an ML workload on 16 machines). For all workloads, the runtime is comparable to Spark: at most 9% slower, and sometimes faster.

How much faster would jobs run if each machine had 2 disks instead of 1? (Sorting 600 GB of key-value pairs on 20 machines.) Takeaway: the monotask model correctly predicts the new runtimes, within 9% of the actual runtime. Note that this is non-trivial (e.g., runtimes did not simply improve by a factor of 2, because something else became the bottleneck), and the monotasks model gets it correct in both cases.

How much faster would the job run with 4x more machines, with the input stored in memory (no disk read, no CPU time to deserialize), or with flash drives instead of disks (faster shuffle read/write time)? A 10x improvement was predicted with at most 23% error.

Leveraging Performance Clarity to Automatically Improve Performance: the schedulers (a network scheduler, a CPU scheduler with 1 monotask per core, and a disk drive scheduler with 1 monotask per disk) have complete visibility over resource use, so they can be configured for the best performance.

Leveraging Performance Clarity to Automatically Improve Performance: monotask schedulers automatically select the ideal concurrency. Monotasks allow you to get performance improvements that would be hard to get any other way.

Why do we care about performance clarity? By using single-resource monotasks, a system can provide performance clarity without sacrificing performance. A typical performance evaluation is done by a group of experts; in practice, performance is in the hands of one novice. Insight from this talk: it *is* possible to architect for performance clarity, and to do so with single-resource tasks. Why do we care? We need to build systems that perform well for the average user.

Reflecting on Monotasks: it was painful to re-architect an existing system to use monotasks, because pipelining is deeply integrated (>20K lines of code changed), and the implementation sits at a high layer of the software stack. Should clarity be provided by the operating system?

Goal: provide performance clarity. The only way to improve performance is to know what to speed up. Using single-resource monotasks provides clarity without sacrificing performance, and with monotasks it is easier to improve system performance: the only way to know how to make things faster is to know what you need to speed up! Insight from this talk: it *is* possible to architect for performance clarity, and to do so with single-resource tasks. With this one architectural change, lots of things become easy – both for the user and via automated tools. github.com/NetSys/spark-monotasks