Orchestra: Managing Data Transfers in Computer Clusters. Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, Ion Stoica (UC Berkeley).

Presentation transcript:

Orchestra: Managing Data Transfers in Computer Clusters. Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, Ion Stoica (UC Berkeley)

Moving Data is Expensive. Typical MapReduce jobs at Facebook spend 33% of their running time in large data transfers. An application for training a spam classifier on Twitter data spends 40% of its time in communication.

Limits Scalability. The scalability of a Netflix-like recommendation system is bottlenecked by communication: it did not scale beyond 60 nodes. »Communication time increased faster than computation time decreased.

Transfer Patterns. Transfer: the set of all flows transporting data between two stages of a job »Acts as a barrier. Completion time: the time for the last receiver to finish. [Figure: Map, Shuffle, Reduce stages with shuffle, broadcast, and incast* transfer patterns]

Contributions: 1. Optimize at the level of transfers instead of individual flows. 2. Inter-transfer coordination.

Orchestra architecture [diagram]: an Inter-Transfer Controller (ITC) sits on top of per-transfer Transfer Controllers (TCs), one broadcast TC per broadcast and one shuffle TC per shuffle. The ITC applies an inter-transfer policy (fair sharing, FIFO, or priority); each broadcast TC chooses a broadcast mechanism (HDFS, tree, or Cornet); each shuffle TC chooses a shuffle mechanism (Hadoop shuffle or WSS).
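To make this hierarchy concrete, here is a minimal sketch in Python; the class and method names are hypothetical and this is not the actual Orchestra implementation, only an illustration of how the ITC and TCs relate.

```python
# Illustrative sketch only: hypothetical names, not Orchestra's real code.

class TransferController:
    """Manages one transfer (a broadcast or a shuffle) and picks its mechanism."""
    def __init__(self, transfer_id, kind, mechanism, weight=1):
        self.transfer_id = transfer_id
        self.kind = kind              # "broadcast" or "shuffle"
        self.mechanism = mechanism    # e.g. "HDFS", "Tree", "Cornet", "Hadoop", "WSS"
        self.weight = weight          # used by the ITC's sharing policy

class InterTransferController:
    """Coordinates all active transfers under one cluster-wide policy."""
    def __init__(self, policy="fair"):       # "fair", "fifo", or "priority"
        self.policy = policy
        self.transfers = []

    def register(self, tc):
        self.transfers.append(tc)

    def describe(self):
        # A real ITC would allocate link shares here; this just lists the plan.
        return [(tc.transfer_id, tc.kind, tc.mechanism, tc.weight)
                for tc in self.transfers]

itc = InterTransferController(policy="priority")
itc.register(TransferController("broadcast-1", "broadcast", "Cornet", weight=2))
itc.register(TransferController("shuffle-1", "shuffle", "WSS", weight=1))
print(itc.describe())
```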

Outline. Cooperative broadcast (Cornet) »Infer and utilize topology information. Weighted Shuffle Scheduling (WSS) »Assign flow rates to optimize shuffle completion time. Inter-Transfer Controller »Implement weighted fair sharing between transfers. End-to-end performance.

Cornet: Cooperative Broadcast. Goal: broadcast the same data to every receiver »Fast, scalable, adaptive to bandwidth, and resilient. Cornet is a peer-to-peer mechanism optimized for cooperative environments. Observations and the Cornet design decisions they lead to: 1. High-bandwidth, low-latency network → large block size (4-16 MB). 2. No selfish or malicious peers → no need for incentives (e.g., tit-for-tat), no (un)choking, everyone stays till the end. 3. Topology matters → topology-aware broadcast.
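As a rough illustration of the cooperative-broadcast idea (not the real Cornet protocol), the toy round-based simulation below splits the data into large blocks and lets every receiver fetch missing blocks from any peer that already has them, with all peers serving until the broadcast finishes. The block size, the one-upload-per-peer-per-round capacity model, and the random block choice are simplifying assumptions.

```python
import random

BLOCK_SIZE = 4 * 1024 * 1024          # 4 MB, the low end of the 4-16 MB range

def broadcast_rounds(data_size, num_receivers, seed=0):
    rng = random.Random(seed)
    num_blocks = (data_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    nodes = ["source"] + [f"node{i}" for i in range(num_receivers)]
    have = {n: set() for n in nodes}
    have["source"] = set(range(num_blocks))
    rounds = 0
    while any(len(have[n]) < num_blocks for n in nodes):
        rounds += 1
        busy = set()                   # each peer uploads at most one block per round
        for n in rng.sample(nodes, len(nodes)):
            missing = list(set(range(num_blocks)) - have[n])
            if not missing:
                continue
            block = rng.choice(missing)
            owners = [p for p in nodes
                      if p != n and block in have[p] and p not in busy]
            if owners:
                busy.add(rng.choice(owners))
                have[n].add(block)     # fetched from a free peer that has the block

    return rounds

# With peers re-serving blocks, all receivers finish in roughly num_blocks rounds
# plus a short tail, instead of num_blocks * num_receivers rounds if only the
# source ever uploaded.
print(broadcast_rounds(256 * BLOCK_SIZE, num_receivers=100))
```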

Cornet performance [figure: 1 GB of data to 100 receivers on EC2; 4.5x to 5x improvement over the status quo].

Topology-aware Cornet. Many data center networks employ tree topologies, so each rack should receive exactly one copy of the broadcast »Minimize cross-rack communication. Topology information reduces cross-rack data transfer »A mixture of spherical Gaussians is used to infer the network topology.
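A hedged sketch of that inference step: cluster nodes into likely racks by fitting a mixture of spherical Gaussians to per-node network measurements. The feature choice (transfer times to a few probe nodes), the synthetic data, and the cluster count are assumptions for illustration, not the exact procedure used in Orchestra.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# 30 nodes in 3 racks of 10; features = hypothetical transfer times (ms) to 4 probes.
racks = np.repeat(np.arange(3), 10)
features = rng.normal(loc=50, scale=5, size=(30, 4)) + racks[:, None] * 40

gmm = GaussianMixture(n_components=3, covariance_type="spherical", random_state=0)
inferred_rack = gmm.fit_predict(features)

# Nodes assigned to the same component are treated as being in the same rack, so a
# topology-aware broadcast can prefer in-rack peers and send roughly one copy of the
# data across each rack boundary.
print(inferred_rack)
```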

Topology-aware Cornet [figure: 200 MB of data to 30 receivers on DETER with 3 inferred clusters; ~2x faster than vanilla Cornet].

Status quo in Shuffle [figure: five senders s1-s5 shuffling to two receivers r1 and r2 under per-flow fair sharing]. The links into r1 and r2 are full for 3 time units, then the link out of s3 is full for 2 time units. Completion time: 5 time units.

Weighted Shuffle Scheduling (WSS). Allocate rates to each flow using weighted fair sharing, where the weight of a flow between a sender-receiver pair is proportional to the total amount of data to be sent [figure: the same shuffle as on the previous slide]. Completion time: 4 time units. Up to 1.5x improvement.
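The numbers on these two slides can be checked with a small worked example. The data layout below is an assumption read off the figures: s1, s2, s4, s5 each send one unit of data to a single receiver, s3 sends two units to each receiver, and every link has a capacity of one unit per time unit.

```python
# Per-flow fair sharing: each receiver link is shared equally by its 3 incoming flows.
t1 = 1 / (1/3)                    # 3 time units, until s1, s2, s4, s5 finish
remaining_s3 = 2 - t1 * (1/3)     # s3 still has 1 unit left for each receiver
t2 = remaining_s3 / (1/2)         # s3's outgoing link is now split between its 2 flows
print("fair sharing:", t1 + t2)   # 5.0 time units

# Weighted Shuffle Scheduling: flow rates proportional to data size.
# At each receiver the weights are 1:1:2, so s3's flows get rate 1/2, the others 1/4.
print("WSS:", max(1 / (1/4), 2 / (1/2)))   # 4.0 time units (1.25x better here)
```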

Inter-Transfer Controller (ITC), aka Conductor. Weighted fair sharing »Each transfer is assigned a weight »Congested links are shared proportionally to the transfers' weights. Implementation: Weighted Flow Assignment (WFA) »Each transfer gets a number of TCP connections proportional to its weight »Requires no changes in the network or in end host OSes.
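A minimal sketch of the Weighted Flow Assignment idea, assuming a fixed pool of TCP connections and illustrative transfer names; the rounding rule and connection budget are not Orchestra's exact parameters.

```python
# Give each transfer a share of a fixed pool of TCP connections proportional to its
# weight, so congested links end up shared roughly in proportion to transfer weights
# without any changes to the network or to host OSes.

def weighted_flow_assignment(transfer_weights, total_connections):
    total_weight = sum(transfer_weights.values())
    return {t: max(1, round(total_connections * w / total_weight))
            for t, w in transfer_weights.items()}

# e.g. a high-priority shuffle weighted 3x against a low-priority background shuffle
print(weighted_flow_assignment({"high-pri shuffle": 3, "low-pri shuffle": 1},
                               total_connections=8))
# -> {'high-pri shuffle': 6, 'low-pri shuffle': 2}
```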

Benefits of the ITC. Two priority classes »FIFO within each class. Low priority transfer »2 GB per reducer. High priority transfers »250 MB per reducer. Shuffle using 30 nodes on EC2 [figure: without inter-transfer scheduling vs. priority scheduling in Conductor]. Result: 43% reduction for the high priority transfers, with only a 6% increase for the low priority transfer.

End-to-end evaluation. Orchestra was developed in the context of Spark, an iterative, in-memory MapReduce-like framework, and evaluated using two iterative applications developed by ML researchers at UC Berkeley »Training a spam classifier on Twitter data »A recommendation system for the Netflix challenge.

Faster spam classification. Communication was reduced from 42% to 28% of the iteration time, giving an overall 22% reduction in iteration time.

Scalable recommendation system [figure: iteration times before vs. after Orchestra, showing the speedup at 90 nodes].

Related work. DCN architectures (VL2, fat-tree, etc.) »Mechanisms for a faster network, not policies for better sharing. Schedulers for data-intensive applications (Hadoop scheduler, Quincy, Mesos, etc.) »Schedule CPU, memory, and disk across the cluster. Hedera »Transfer-unaware flow scheduling. Seawall »Performance isolation among cloud tenants.

Summary. Optimize transfers instead of individual flows »Utilize knowledge about application semantics. Coordinate transfers »Orchestra enables policy-based transfer management »Cornet performs up to 4.5x better than the status quo »WSS can outperform default solutions by 1.5x. No changes in the network or in end host OSes.

BACKUP SLIDES

MapReduce logs. A week-long trace of 188,000 MapReduce jobs from a 3000-node cluster. The maximum number of concurrent transfers is several hundred; 33% of job time is spent in shuffle on average.

Monarch (Oakland '11). Real-time spam classification from 345,000 tweets with URLs »Logistic regression »Written in Spark. Spends 42% of the iteration time in transfers »30% broadcast »12% shuffle. Takes 100 iterations to converge.

Collaborative Filtering. Netflix challenge »Predict users' ratings for movies they haven't seen based on their ratings for other movies. 385 MB of data is broadcast in each iteration. Does not scale beyond 60 nodes.

Cornet performance [figure: 1 GB of data to 100 receivers on EC2; 4.5x to 6.5x improvement].

Shuffle bottlenecks. An optimal shuffle schedule must keep at least one link fully utilized throughout the transfer [figure panels: at a sender, at a receiver, in the network].
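That observation yields a simple lower bound on shuffle completion time: the shuffle cannot finish before the most heavily loaded sender or receiver link has moved all of its data. The sketch below computes this bound, reusing the 5-sender, 2-receiver data layout assumed earlier and unit link bandwidth.

```python
def shuffle_lower_bound(data, link_bw=1.0):
    """data[s][r] = units of data sender s must ship to receiver r."""
    per_sender = {s: sum(row.values()) for s, row in data.items()}
    per_receiver = {}
    for row in data.values():
        for r, d in row.items():
            per_receiver[r] = per_receiver.get(r, 0) + d
    # The busiest sender or receiver link sets the floor on completion time.
    return max(max(per_sender.values()), max(per_receiver.values())) / link_bw

data = {"s1": {"r1": 1}, "s2": {"r1": 1}, "s3": {"r1": 2, "r2": 2},
        "s4": {"r2": 1}, "s5": {"r2": 1}}
print(shuffle_lower_bound(data))   # 4.0 -- matched by WSS in the earlier example
```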

Current implementations [figure: shuffling 1 GB to 30 reducers on EC2 with existing implementations].