Big Data Infrastructure
Week 3: From MapReduce to Spark (2/2)
CS 489/698 Big Data Infrastructure (Winter 2016)
Jimmy Lin
David R. Cheriton School of Computer Science, University of Waterloo
January 21, 2016
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

The datacenter is the computer!
What's the instruction set?
(Image source: Google)

MapReduce
map f: (K1, V1) ⇒ List[(K2, V2)]
reduce g: (K2, Iterable[V2]) ⇒ List[(K3, V3)]
Dataflow: List[(K1, V1)] → List[(K2, V2)] → List[(K3, V3)]

Spark
map (f: (T) ⇒ U): RDD[T] → RDD[U]
filter (f: (T) ⇒ Boolean): RDD[T] → RDD[T]
flatMap (f: (T) ⇒ TraversableOnce[U]): RDD[T] → RDD[U]
mapPartitions (f: (Iterator[T]) ⇒ Iterator[U]): RDD[T] → RDD[U]
groupByKey: RDD[(K, V)] → RDD[(K, Iterable[V])]
reduceByKey (f: (V, V) ⇒ V): RDD[(K, V)] → RDD[(K, V)]
aggregateByKey (seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U): RDD[(K, V)] → RDD[(K, U)]
sort
join: RDD[(K, V)] × RDD[(K, W)] → RDD[(K, (V, W))]
cogroup: RDD[(K, V)] × RDD[(K, W)] → RDD[(K, (Iterable[V], Iterable[W]))]
And more!

What's an RDD?
Resilient Distributed Dataset (RDD) = a partitioned, immutable collection of records
Wait, so how do you actually do anything?
Developers define transformations on RDDs
The framework keeps track of lineage

Spark Word Count
val textFile = sc.textFile(args.input())
textFile
  .flatMap(line => tokenize(line))  // tokenize: a tokenizer supplied elsewhere
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile(args.output())

RDD Lifecycle
RDD → Transformation → RDD → … → Action → values
Transformations are lazy: the framework just keeps track of lineage
Actions trigger actual execution

Spark Word Count
val textFile = sc.textFile(args.input())          // RDD
val a = textFile.flatMap(line => line.split(" ")) // transformation
val b = a.map(word => (word, 1))                  // transformation
val c = b.reduceByKey(_ + _)                      // transformation
c.saveAsTextFile(args.output())                   // action

RDDs and Lineage
textFile: RDD[String] (on HDFS)
  .flatMap(line => line.split(" ")) → a: RDD[String]
  .map(word => (word, 1)) → b: RDD[(String, Int)]
  .reduceByKey(_ + _) → c: RDD[(String, Int)]
Action!
Remember, transformations are lazy!
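Lineage is inspectable at runtime. A minimal sketch, assuming a live SparkContext sc (the input path is hypothetical): toDebugString prints the recorded chain of RDDs without triggering execution.

val textFile = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
val c = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(c.toDebugString)  // prints the lineage; no job has run yet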

RDDs and Optimizations
(Same chain as before: textFile on HDFS → flatMap → map → reduceByKey → Action!)
Want MM? RDDs don't need to be materialized!
Lazy evaluation creates optimization opportunities

RDDs and Caching
RDDs can be materialized in memory: cache it!
(Same chain as before: textFile on HDFS → flatMap → map → reduceByKey → Action!)
Fault tolerance? Spark works even if the RDDs are only partially cached: missing partitions are recomputed from lineage.
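A minimal caching sketch, assuming a live SparkContext sc and a hypothetical input path: an RDD reused by two actions is computed once, then served from memory.

val words = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
  .flatMap(line => line.split(" "))
words.cache()                          // request in-memory materialization
val total = words.count()              // first action: computes and caches
val unique = words.distinct().count()  // second action: reuses the cached partitions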

Spark Architecture

Compare: the Hadoop MapReduce architecture
Each slave node runs a datanode daemon and a tasktracker on top of the local Linux file system.
The namenode runs the namenode daemon; the job submission node runs the jobtracker.

YARN
Hadoop's (original) limitation: it can only run MapReduce. What if we want to run other distributed frameworks?
YARN = Yet Another Resource Negotiator
Provides an API for developing arbitrary distributed applications
Handles scheduling and resource requests
MapReduce (MR2) is one such YARN application
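Spark itself is another such YARN application. A hedged sketch of a submission command (Spark 1.x-era flags; the jar, class name, and paths are hypothetical):

spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 --executor-memory 2g \
  --class WordCount target/wordcount.jar \
  hdfs:///data/input.txt hdfs:///data/output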

YARN

Spark Programs
Your application (the driver program) creates a SparkContext backed by local threads; the context talks to a cluster manager, which allocates workers running Spark executors; executors read from and write to HDFS.
The Spark context tells the framework where to find the cluster; use the Spark context to create RDDs.
Launch a driver via spark-shell or spark-submit.
Bindings: Scala, Java, Python, R
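A minimal sketch of that setup in the driver (Spark 1.x-era API; the app name and master URL are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")          // placeholder application name
  .setMaster("yarn-client")     // or "local[*]" for local threads
val sc = new SparkContext(conf) // the handle to the cluster
val data = sc.textFile("hdfs:///data/input.txt")  // use the context to create RDDs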

Spark Driver
(Same architecture as before: driver program with a SparkContext, cluster manager, Spark executors on workers, HDFS; launched via spark-shell or spark-submit.)
val textFile = sc.textFile(args.input())
textFile
  .flatMap(line => tokenize(line))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile(args.output())
What's happening to the functions? (They are serialized and shipped to the executors that run them.)

Spark Driver
Note: you can run code "locally" and integrate cluster-computed values!
(Same word-count driver and architecture as above.)
Beware of the collect action!
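Why beware? collect ships the entire RDD back into the driver's memory. A hedged sketch of the alternatives, reusing the word-count chain from the slide above:

val counts = textFile
  .flatMap(line => tokenize(line))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

val everything = counts.collect()     // whole result in driver memory: risky at scale
val sample = counts.take(10)          // safer: only ten elements come back
counts.saveAsTextFile(args.output())  // safest: executors write output in parallel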

Spark Transformations
(The same table as before: map, filter, flatMap, mapPartitions, groupByKey, reduceByKey, aggregateByKey, sort, join, cogroup.)

Starting Points
Input File → InputSplits → RecordReader → "mapper"
The InputFormat defines the splits; a RecordReader turns each split into records for its "mapper".
(Source: redrawn from a slide by Cloudera, cc-licensed)

Physical Operators

Execution Plan

Can't avoid this!
(The shuffle: every map task's output crossing over to the reduce tasks.)

Spark Shuffle Implementations: hash shuffle
Each map task writes a separate output file per reducer: simple, but it creates many small files.

Spark Shuffle Implementations: sort shuffle
Each map task writes a single output file sorted by partition, plus an index that tells each reducer which range to fetch.

Remember this? The MapReduce shuffle: the mapper fills a circular buffer in memory, spills to disk, and merges the spills (running the combiner along the way); the merged spills become intermediate files on disk, fetched by this reducer and other reducers from this mapper and other mappers.
So, in MapReduce, why are key-value pairs processed in sorted order in the reducer?

Remember this? (The same MapReduce shuffle diagram, combiner included.)
Where are the combiners in Spark?

Reduce-like Operations
reduceByKey (f: (V, V) ⇒ V): RDD[(K, V)] → RDD[(K, V)]
aggregateByKey (seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U): RDD[(K, V)] → RDD[(K, U)]
What happened to combiners? How can we optimize?
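There is no separate combiner in Spark: reduceByKey and aggregateByKey apply their functions map-side before the shuffle, which is exactly the combiner's job. A minimal sketch of aggregateByKey building a per-key (sum, count), with hypothetical input data:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 5)))  // hypothetical data
val sumCounts = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: runs map-side, per partition
  (x, y) => (x._1 + y._1, x._2 + y._2)    // combOp: merges accumulators after the shuffle
)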

Spark #wins
Richer operators
RDD abstraction supports optimizations (pipelining, caching, etc.)
Scala, Java, Python, and R bindings

Algorithm design, redux
(Image source: Wikipedia, Mahout)

Two superpowers:
Associativity
Commutativity
(sorting)
What follows… very basic category theory…

The Power of Associativity
v₁ ⊕ v₂ ⊕ v₃ ⊕ v₄ ⊕ v₅ ⊕ v₆ ⊕ v₇ ⊕ v₈ ⊕ v₉
You can put parentheses wherever you want! For example, ((v₁ ⊕ v₂) ⊕ v₃) ⊕ … = v₁ ⊕ (v₂ ⊕ (v₃ ⊕ …)).
(Credit to Oscar Boykin for the idea behind these slides)

The Power of Commutativity
v₁ ⊕ v₂ ⊕ v₃ ⊕ v₄ ⊕ v₅ ⊕ v₆ ⊕ v₇ ⊕ v₈ ⊕ v₉
= v₄ ⊕ v₅ ⊕ v₆ ⊕ v₇ ⊕ v₁ ⊕ v₂ ⊕ v₃ ⊕ v₈ ⊕ v₉
= v₈ ⊕ v₉ ⊕ v₄ ⊕ v₅ ⊕ v₆ ⊕ v₇ ⊕ v₁ ⊕ v₂ ⊕ v₃
You can swap the order of operands however you want!

Implications for distributed processing?
You don't know when the tasks begin
You don't know when the tasks end
You don't know when the tasks interrupt each other
You don't know when intermediate data arrive
…
It's okay!

Fancy Labels for Simple Concepts…
Semigroup = (M, ⊕): an operation ⊕ : M × M → M such that (m₁ ⊕ m₂) ⊕ m₃ = m₁ ⊕ (m₂ ⊕ m₃) for all m₁, m₂, m₃ ∈ M
Monoid = Semigroup + identity: an ε ∈ M such that ε ⊕ m = m ⊕ ε = m for all m ∈ M
Commutative Monoid = Monoid + commutativity: m₁ ⊕ m₂ = m₂ ⊕ m₁ for all m₁, m₂ ∈ M
A few examples?
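A few examples, sketched in Scala (the trait and instances are illustrative, not a library API): integer addition is a commutative monoid; string concatenation is a monoid but not commutative.

trait Monoid[M] {
  def identity: M
  def op(a: M, b: M): M  // must be associative
}

val intAddition = new Monoid[Int] {      // commutative monoid
  def identity = 0
  def op(a: Int, b: Int) = a + b
}

val stringConcat = new Monoid[String] {  // monoid, but not commutative
  def identity = ""
  def op(a: String, b: String) = a + b
}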

Back to these…
reduceByKey (f: (V, V) ⇒ V): RDD[(K, V)] → RDD[(K, V)]
aggregateByKey (seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U): RDD[(K, V)] → RDD[(K, U)]
Wait, I've seen this before?

Computing the mean v1 (again)

Computing the mean v2 (again)

Computing the mean v3 (again)
Wait, I've seen this before?
reduceByKey (f: (V, V) ⇒ V): RDD[(K, V)] → RDD[(K, V)]
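The v3 trick carries straight over: (sum, count) pairs form a commutative monoid, so the mean fits reduceByKey. A hedged sketch over a hypothetical RDD of (key, value) observations:

val observations = sc.parallelize(Seq(("x", 4.0), ("x", 6.0), ("y", 3.0)))  // hypothetical data
val means = observations
  .map { case (k, v) => (k, (v, 1)) }  // lift each value into the (sum, count) monoid
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }  // associative and commutative
  .mapValues { case (sum, count) => sum / count }  // divide only at the very end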

Co-occurrence Matrix: Stripes
Wait, I've seen this before?
reduceByKey (f: (V, V) ⇒ V): RDD[(K, V)] → RDD[(K, V)]
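In Spark, a stripe is just a Map, and summing two stripes element-wise is associative and commutative, so reduceByKey again applies. A hedged sketch with a simplified neighbor definition (all other words on the same line; the input path is hypothetical):

def stripes(words: Seq[String]): Seq[(String, Map[String, Int])] =
  words.map { w =>
    val neighbors = words.filter(_ != w)
    (w, neighbors.groupBy(identity).mapValues(_.size).toMap)
  }

val coOccur = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
  .flatMap(line => stripes(line.split(" ")))
  .reduceByKey { (m1, m2) =>  // element-wise sum of two stripes
    m2.foldLeft(m1) { case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0) + v)) }
  }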

Synchronization: Pairs vs. Stripes
Approach 1: turn synchronization into an ordering problem
Sort keys into the correct order of computation
Partition the key space so that each reducer gets the appropriate set of partial results
Hold state in the reducer across multiple key-value pairs to perform the computation
Illustrated by the "pairs" approach
Approach 2: construct data structures that bring partial results together
Each reducer receives all the data it needs to complete the computation
Illustrated by the "stripes" approach
Merging stripes is a commutative monoid! But what about approach 1?

f(B|A): "Pairs"
For this to work:
Must emit an extra (a, *) for every bₙ in the mapper
Must make sure all a's get sent to the same reducer (use a partitioner)
Must make sure (a, *) comes first (define the sort order)
Must hold state in the reducer across different key-value pairs

(a, *) → 32   (the reducer holds this value in memory)
(a, b₁) → 3   becomes  (a, b₁) → 3 / 32
(a, b₂) → 12  becomes  (a, b₂) → 12 / 32
(a, b₃) → 7   becomes  (a, b₃) → 7 / 32
(a, b₄) → 1   becomes  (a, b₄) → 1 / 32
…
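In Spark the sort-order trick is unnecessary: compute the marginals as their own RDD and join. A hedged sketch, assuming a pairCounts RDD of ((a, b), count) built elsewhere (e.g., with reduceByKey):

// pairCounts: RDD[((String, String), Int)], assumed to exist
val marginals = pairCounts
  .map { case ((a, _), c) => (a, c) }
  .reduceByKey(_ + _)                       // the (a, *) totals, as their own RDD

val relFreq = pairCounts
  .map { case ((a, b), c) => (a, (b, c)) }  // re-key by the left element
  .join(marginals)                          // yields (a, ((b, c), total))
  .map { case (a, ((b, c), total)) => ((a, b), c.toDouble / total) }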

Two superpowers:
Associativity
Commutativity
(sorting)

Because you can't avoid the shuffle…
And sort-based shuffling is pretty efficient!

The datacenter is the computer!
What's the instruction set?
(Image source: Google)

Algorithm design in a nutshell…
Exploit associativity and commutativity via commutative monoids (if you can)
Exploit framework-based sorting to sequence computations (if you can't)
(Image source: Wikipedia, Walnut)

Questions?
Remember: Assignment 2 is due next Tuesday at 8:30am
(Image source: Wikipedia, Japanese rock garden)