Running MapReduce in Non-Traditional Environments
Abhishek Chandra
Associate Professor, Department of Computer Science and Engineering
University of Minnesota
http://www.cs.umn.edu/~chandra
Talk Outline
- Big Data and MapReduce
- MapReduce Background
- MapReduce in Non-Traditional Environments
- Concluding Remarks
Big Data
- Data-rich enterprises and communities: both user-facing services and batch data processing
- Commercial, social, and scientific, e.g., Google, Facebook, Yahoo!, LHC, ...
- Data analysis is key!
- Need massive scalability and parallelism: PBs of data, millions of files, thousands of nodes, millions of users
- Need to do this cost-effectively and reliably: use commodity hardware, where failure is the norm, and share resources among multiple projects
Big Data and MapReduce
- Simple data-parallel programming model and framework
- Designed for scalability and fault tolerance
- Can express several data analysis algorithms
- Widely used:
  - Pioneered by Google: processes several petabytes of data per day
  - Popularized by the open-source Hadoop project: used at Yahoo!, Facebook, Amazon, ...
MapReduce Design Goals
- Scalability: thousands of machines, tens of thousands of disks, TBs-PBs of data
- Cost-efficiency:
  - Hardware: commodity machines and network
  - Administration: automatic fault tolerance, easy setup
  - Programming: easy to use and write applications
Image source: http://www.ibm.com
MapReduce Applications (Industry)
- Google: index construction for Google Search; article clustering for Google News
- Yahoo!: "Web map" powering Yahoo! Search; spam detection for Yahoo! Mail
- Facebook: ad optimization; spam detection; ...
MapReduce Applications (Research)
Wide interest in academia and research:
- High-energy physics (Indiana)
- Astronomical image analysis (Washington)
- Bioinformatics (Maryland)
- Analyzing Wikipedia conflicts (PARC)
- Natural language processing (CMU)
- Particle physics (Nebraska)
- Ocean climate simulation (Washington)
- ...
Talk Outline
- Big Data and MapReduce
- MapReduce Background
- MapReduce in Non-Traditional Environments
- Concluding Remarks
MapReduce Computation
[Diagram: Input Data → Data Push → MapReduce → Output Data]
MapReduce Programming Model
- Data: a sequence of key-value records
- Map function: converts input key-value pairs to intermediate key-value pairs
  (K_in, V_in) → list(K_inter, V_inter)
- Reduce function: converts intermediate key-value pairs to output key-value pairs
  (K_inter, list(V_inter)) → list(K_out, V_out)
Example: Word Count

def mapper(file, text):
    # Emit (word, 1) for every word in the input text
    for word in text.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum the partial counts for one word
    yield (word, sum(counts))
Word Count Example
Input (one record per map task):
  "the quick brown fox"
  "the fox ate the mouse"
  "how now brown cow"
Map output (intermediate key-value pairs):
  (the, 1) (quick, 1) (brown, 1) (fox, 1)
  (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
  (how, 1) (now, 1) (brown, 1) (cow, 1)
Shuffle & Sort: group values by key across all mappers
Reduce output (two reducers):
  (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
  (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
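The whole pipeline can be simulated on a single machine. The sketch below is purely illustrative: it reuses the mapper/reducer generators above, and an in-memory dictionary stands in for the distributed shuffle and sort; it is not how Hadoop actually executes the job.

from collections import defaultdict

def run_wordcount(lines):
    # Map phase: run the mapper over every input record
    intermediate = defaultdict(list)
    for i, line in enumerate(lines):
        for word, count in mapper("split-%d" % i, line):
            intermediate[word].append(count)  # stand-in for shuffle & sort

    # Reduce phase: one reduce call per distinct key
    results = {}
    for word, counts in sorted(intermediate.items()):
        for key, total in reducer(word, counts):
            results[key] = total
    return results

print(run_wordcount(["the quick brown fox",
                     "the fox ate the mouse",
                     "how now brown cow"]))
# {'ate': 1, 'brown': 2, 'cow': 1, 'fox': 2, 'how': 1, 'mouse': 1, 'now': 1, 'quick': 1, 'the': 3}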
MapReduce Workflow
[Diagram: Input Data → Data Push → MapReduce → Output Data]
MapReduce Stages
- Push: input is split into large chunks and placed on the local disks of cluster nodes
- Map: chunks are served to "mapper" tasks; prefer a mapper that has the data locally (see the sketch below); mappers save their outputs to local disk before serving them to reducers
- Reduce: "reducers" execute reduce tasks once the map phase is complete
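A minimal sketch of the "prefer local data" rule, assuming a hypothetical chunk_locations map from each chunk to the nodes holding its replicas; a real scheduler (e.g., Hadoop's) also considers rack locality and available task slots.

def pick_node_for_chunk(chunk, chunk_locations, idle_nodes):
    # Prefer an idle node that already stores a replica of the chunk
    for node in chunk_locations.get(chunk, []):
        if node in idle_nodes:
            return node
    # Otherwise fall back to any idle node and pull the chunk over the network
    return next(iter(idle_nodes))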
Partitioning/Shuffling
- Goal: divide the intermediate key space across reducers
- k reduce tasks => k partitions (simple hash function); e.g., k = 3, keys {1, ..., 6} => partitions {1, 2}, {3, 4}, {5, 6}
- Shuffle: send intermediate key-value pairs to the relevant reducers; this is all-to-all communication, since all mappers typically hold all intermediate keys
- Combine: local aggregation function for repeated keys produced by the same map (see the sketch below)
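A hash partitioner and a word-count combiner might look like the following sketch; this illustrates the idea only and is not Hadoop's actual Partitioner/Combiner API.

from collections import Counter

def partition(key, num_reducers):
    # Hash partitioning: every occurrence of a key lands on the same reducer
    return hash(key) % num_reducers

def combine(map_output):
    # Local aggregation: collapse repeated keys emitted by one mapper,
    # so ("the", 1), ("the", 1) is shuffled as a single ("the", 2) pair
    combined = Counter()
    for key, value in map_output:
        combined[key] += value
    return list(combined.items())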
Fault Tolerance
- Task re-execution: retry a task on another node after a task or node failure
  - OK for a map task because it has no dependencies
  - OK for a reduce task because the map outputs are already on disk
- Speculative execution: launch a copy of a task on another node to handle stragglers (slow tasks), and use the result from whichever copy finishes first (see the sketch below)
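Speculative execution can be illustrated with a toy "run two copies, keep the first result" helper; this is only a sketch of the idea, not how Hadoop's scheduler implements it.

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_speculatively(task, arg, nodes=("node-a", "node-b")):
    # Launch the same task on two "nodes" and keep whichever copy finishes first
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(task, arg, node) for node in nodes]
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        for straggler in not_done:
            straggler.cancel()  # best effort: abandon the slower copy
        return next(iter(done)).result()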
Hadoop
- Open-source Apache project: a software framework for distributed data processing
- Primary project: the MapReduce implementation; other projects are built on top of MapReduce
- Implemented in Java
- Primary data analysis platform at Yahoo!: 40,000+ machines running Hadoop
Hadoop: Primary Components
- HDFS: distributed file system
  - Combines the cluster's local storage into a single namespace
  - All data is replicated to multiple machines
  - Provides locality information to clients
- MapReduce: batch computation framework
  - Tasks are re-executed on failure
  - Optimizes for data locality of the input
Talk Outline
- Big Data and MapReduce
- MapReduce Background
- MapReduce in Non-Traditional Environments
- Concluding Remarks
Traditional MapReduce Environments
Assumptions:
- Tightly coupled clusters
- Dedicated compute nodes
- Data is centrally available / pre-placed
Image source: http://www.ibm.com
But... Data May Be Distributed
Data originates in a geographically distributed manner:
- Scientific instruments and sensors, e.g., oceanic and atmospheric data
- Public/social data, e.g., user blogs, traffic data
- Commercial data, e.g., warehouse, e-commerce data
- Monitoring data, e.g., CDN user access logs
- Mobile data, e.g., phone pictures, sensors
We may also want to combine multiple data sources, e.g., CDC data + Google Maps.
Computation May Be Distributed
- Distributed data centers/clouds, e.g., Amazon EC2 regions, Akamai CDN servers
- Computational grids, e.g., FutureGrid
- Volunteer computing platforms, e.g., BOINC
Highly-Distributed Environments
Question: how do we execute MapReduce in such non-traditional environments?
Research Overview
- Step 1: Understanding tradeoffs. Compare different deployment architectures for MapReduce execution.
- Step 2: Optimizing MapReduce execution. Choose data placement and task scheduling based on system and application characteristics.
Step 1: Understanding Tradeoffs
Goal: understand which deployment architectures would work best.
[Diagram: geographically distributed input data → data push → MapReduce → output data]
Architecture 1: Local MapReduce
[Diagram: data sources in the US and EU push all of their data to a single data center; the push from the nearby source is fast and the push from the remote source is slow; one MapReduce job runs there and produces the final result]
Architecture 2: Global MapReduce
[Diagram: the US and EU data centers form a single global MapReduce cluster; each data source pushes data both to the nearby data center (fast) and to the remote one (slow), and a single job produces the final result]
Architecture 3: Distributed MapReduce
[Diagram: each data source pushes its data only to its nearby data center (fast); a separate MapReduce job runs in each data center, and their outputs are combined into the final result]
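The tradeoff between these architectures can be sketched with a toy cost model. Everything below is an illustrative assumption rather than the model used in this work: pushing D bytes over a link with bandwidth B takes D/B seconds, alpha is the ratio of result size to input size, and compute time is ignored.

def push_time(data_bytes, bandwidth):
    # Time to move data over one link
    return data_bytes / bandwidth

def local_mapreduce(d_us, d_eu, fast_bw, slow_bw, alpha):
    # Architecture 1: everything is pushed to the US data center,
    # so the slow transatlantic push of the EU data dominates
    return max(push_time(d_us, fast_bw), push_time(d_eu, slow_bw))

def distributed_mapreduce(d_us, d_eu, fast_bw, slow_bw, alpha):
    # Architecture 3: each source is pushed locally (fast); only the
    # EU site's results cross the slow link to be combined
    local_push = max(push_time(d_us, fast_bw), push_time(d_eu, fast_bw))
    combine = push_time(alpha * d_eu, slow_bw)
    return local_push + combine

# With high aggregation (alpha << 1) the distributed plan avoids the slow
# initial push; with data expansion (alpha >> 1) shipping results dominates.
print(local_mapreduce(10e9, 10e9, 1e9, 1e8, 0.1))        # ~100 s
print(distributed_mapreduce(10e9, 10e9, 1e9, 1e8, 0.1))  # ~20 s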
Experimental Results: PlanetLab
Setup: PlanetLab, 4 compute / 1 data node in the US and 4 compute / 1 data node in the EU, Hadoop 0.20.1
[Charts: WordCount (Random), where the result-combine cost is dominant, and WordCount (Text), where the data-push cost is dominant]
Takeaway: performance depends on network and application characteristics.
Experimental Results: Amazon EC2
Setup: Amazon EC2, 6 US and 3 EU small instances, 1 data node each
[Charts: WordCount (Random) and WordCount (Text)]
Takeaway: performance depends on network and application characteristics.
Lessons Learnt
- Make MapReduce topology-aware: data placement and task scheduling should consider network locality
- Application-specific data aggregation is critical: with high aggregation, avoid the initial data push cost; with low aggregation, avoid the shuffle cost
- Make globally optimal decisions: "good" local decisions can adversely impact end-to-end performance
Step 2: Optimizing MapReduce Execution
- Framework for modeling MapReduce execution
- Optimizer to determine an optimal execution plan (data placement and task scheduling)
  - Topology-aware: uses information about network and node characteristics
  - Application-aware: uses data aggregation characteristics
  - Global optimization: performs end-to-end, multi-phase optimization
- Implemented in Hadoop 1.0.1
MapReduce Execution Model
MapReduce Execution Model: Parameters
- D_i: size of the data supplied at data source i
- B_ij: link bandwidth from node i to node j
- C_i: mapper/reducer compute rate at node i
- α: ratio of the size of the intermediate data to the input data
Execution plan:
- Each source: where to push its data
- All mappers: where to shuffle their data
MapReduce Execution Model: Constraints
- x_ij: fraction of node i's data pushed (shuffled) to node j
- Each data source (mapper) must push (shuffle) all of its data
- One reducer per key: y_k denotes the fraction of the intermediate data reduced at reducer k
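Written out in LaTeX, these constraints amount to the following; this is a reconstruction from the slide's definitions, not necessarily the exact formulation used in the work.

\begin{align}
  \sum_{j} x_{ij} &= 1 \quad \text{for every data source (and mapper) } i, \qquad 0 \le x_{ij} \le 1, \\
  \sum_{k} y_{k} &= 1, \qquad 0 \le y_{k} \le 1.
\end{align}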
MapReduce Execution Optimization
- Objective: minimize the makespan, subject to the model constraints
- Use the model parameters to compute the execution time:
  - Push/shuffle time: based on link bandwidths and the amount of data communicated over each link
  - Map/reduce time: based on compute rates and the amount of data computed at each node
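A toy evaluator for one candidate plan might look like the sketch below. The variable names follow the slides (D, B, C, alpha, x, y), but the formulas are simplifying assumptions (phases are strictly sequential, mappers and reducers are drawn from the same node set, B is a full node-by-node matrix), not the authors' actual model; an optimizer would then search over x and y to minimize this quantity.

def makespan(D, B, C, alpha, x, y):
    # D[i]: input bytes at source i; B[i][j]: bandwidth from i to j
    # C[j]: compute rate of node j; x[i][j]: fraction of D[i] pushed to node j
    # y[k]: fraction of intermediate data handled by reducer k
    nodes = range(len(C))
    sources = range(len(D))
    reducers = range(len(y))

    # Push + map: each mapper waits for its slowest incoming transfer,
    # then processes everything it received
    map_input = [sum(x[i][j] * D[i] for i in sources) for j in nodes]
    push = max(max(x[i][j] * D[i] / B[i][j] for i in sources) for j in nodes)
    map_t = max(map_input[j] / C[j] for j in nodes)

    # Shuffle + reduce: intermediate data is alpha times the input, split by y
    inter = alpha * sum(map_input)
    shuffle = max(y[k] * inter / min(B[j][k] for j in nodes) for k in reducers)
    reduce_t = max(y[k] * inter / C[k] for k in reducers)

    return push + map_t + shuffle + reduce_t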
Benefit of Optimization
Setup: PlanetLab measurements with 4 US, 2 Europe, and 2 Asia nodes; 1 data source each
[Charts comparing Uniform, Myopic, and Optimized plans for α = 0.1 (data aggregation) and α = 10 (data expansion)]
Takeaway: model-driven optimization achieves the minimum makespan under different scenarios.
Comparison to Hadoop
Setup: emulated PlanetLab; Hadoop 1.0.1, modified to use model-based execution plans
Concluding Remarks
- MapReduce: large-scale distributed data processing
  - Scalable: large numbers of machines and large volumes of data
  - Cheap: lower hardware, programming, and administration costs
  - Well suited to several data analysis applications
- Rich area for research: resource management, algorithms, programming models
  - Our focus: optimization in highly-distributed environments
- Acknowledgments: students, especially Ben Heintz; Jon Weissman (UMN); Ramesh Sitaraman (UMass)
Thank You!
http://www.cs.umn.edu/~chandra