Published by Joleen Miller. Modified over 9 years ago.
1
Big Data Infrastructure
Week 3: From MapReduce to Spark (2/2)
CS 489/698 Big Data Infrastructure (Winter 2016)
Jimmy Lin
David R. Cheriton School of Computer Science, University of Waterloo
January 21, 2016
These slides are available at http://lintool.github.io/bigdata-2016w/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
2
The datacenter is the computer! What’s the instruction set?
Source: Google
3
MapReduce
List[(K1, V1)] → List[(K2, V2)]
map f: (K1, V1) ⇒ List[(K2, V2)]
reduce g: (K2, Iterable[V2]) ⇒ List[(K3, V3)]
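The two signatures above can be exercised on in-memory collections. The following is a minimal sketch, not how Hadoop executes a job: `LocalMapReduce`, `run`, and `wordCount` are hypothetical names introduced here, and the "shuffle" is just a local group-and-sort.

```scala
object LocalMapReduce {
  // Runs a MapReduce-style job on in-memory collections (no cluster involved).
  def run[K1, V1, K2, V2, K3, V3](
      input: List[(K1, V1)],
      f: (K1, V1) => List[(K2, V2)],
      g: (K2, Iterable[V2]) => List[(K3, V3)]
  )(implicit ord: Ordering[K2]): List[(K3, V3)] = {
    // Map phase: apply f to every input pair.
    val mapped: List[(K2, V2)] = input.flatMap { case (k, v) => f(k, v) }
    // Shuffle phase: group values by intermediate key, keys in sorted order.
    val grouped: List[(K2, List[V2])] =
      mapped.groupBy(_._1).toList.sortBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    // Reduce phase: apply g to each key and all of its values.
    grouped.flatMap { case (k, vs) => g(k, vs) }
  }

  // Word count expressed with the two user-supplied functions.
  def wordCount(docs: List[(Int, String)]): List[(String, Int)] =
    run[Int, String, String, Int, String, Int](
      docs,
      (_, line) => line.split(" ").toList.filter(_.nonEmpty).map(w => (w, 1)),
      (word, ones) => List((word, ones.sum))
    )
}
```

For example, `LocalMapReduce.wordCount(List((1, "a b a"), (2, "b c")))` groups the emitted `(word, 1)` pairs by word before summing them.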
4
Spark
RDD[T] → filter f: (T) ⇒ Boolean → RDD[T]
RDD[T] → map f: (T) ⇒ U → RDD[U]
RDD[T] → flatMap f: (T) ⇒ TraversableOnce[U] → RDD[U]
RDD[T] → mapPartitions f: (Iterator[T]) ⇒ Iterator[U] → RDD[U]
RDD[(K, V)] → groupByKey → RDD[(K, Iterable[V])]
RDD[(K, V)] → reduceByKey f: (V, V) ⇒ V → RDD[(K, V)]
RDD[(K, V)] → aggregateByKey seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U → RDD[(K, U)]
RDD[(K, V)] → sort → RDD[(K, V)]
RDD[(K, V)], RDD[(K, W)] → join → RDD[(K, (V, W))]
RDD[(K, V)], RDD[(K, W)] → cogroup → RDD[(K, (Iterable[V], Iterable[W]))]
And more!
5
What’s an RDD?
Resilient Distributed Dataset (RDD) = partitioned = immutable
Wait, so how do you actually do anything?
Developers define transformations on RDDs
Framework keeps track of lineage
6
Spark Word Count
val textFile = sc.textFile(args.input())
textFile
  .flatMap(line => tokenize(line))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile(args.output())
7
RDD Lifecycle
RDD → Transformation → RDD
RDD → Action → values
Transformations are lazy: the framework keeps track of lineage
Actions trigger actual execution
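Laziness can be observed without Spark. The sketch below uses a plain Scala `Iterator` as a stand-in for an RDD, which is only an analogy (an `Iterator` is single-use and tracks no lineage); `LazyDemo`, `pipeline`, and the `calls` counter are hypothetical names.

```scala
object LazyDemo {
  // Counts how many times the transformation function actually runs.
  var calls = 0

  def tokenize(line: String): List[String] = {
    calls += 1
    line.split(" ").toList
  }

  // Building the pipeline only records what to do: Iterator.flatMap and map are lazy.
  def pipeline(lines: List[String]): Iterator[(String, Int)] =
    lines.iterator.flatMap(line => tokenize(line)).map(word => (word, 1))

  // Returns (calls before the action, number of pairs produced, calls after the action).
  def demo(): (Int, Int, Int) = {
    calls = 0
    val p = pipeline(List("a b", "c d e"))
    val before = calls // nothing has been tokenized yet
    val n = p.size     // plays the role of the action: forces the whole pipeline
    val after = calls
    (before, n, after)
  }
}
```

Defining the pipeline does no work; only the call to `size`, which plays the role of an action here, runs `tokenize`.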
8
Spark Word Count
val textFile = sc.textFile(args.input())
val a = textFile.flatMap(line => line.split(" "))
val b = a.map(word => (word, 1))
val c = b.reduceByKey(_ + _)
c.saveAsTextFile(args.output())
RDDs: textFile, a, b, c
Transformations: flatMap, map, reduceByKey
Action: saveAsTextFile
9
RDDs and Lineage
textFile: RDD[String] (on HDFS)
a: RDD[String] = textFile.flatMap(line => line.split(" "))
b: RDD[(String, Int)] = a.map(word => (word, 1))
c: RDD[(String, Int)] = b.reduceByKey(_ + _)
Action!
Remember, transformations are lazy!
10
RDDs and Optimizations
textFile: RDD[String] (on HDFS)
a: RDD[String] = textFile.flatMap(line => line.split(" "))
b: RDD[(String, Int)] = a.map(word => (word, 1))
c: RDD[(String, Int)] = b.reduceByKey(_ + _)
Action!
Want MM?
RDDs don’t need to be materialized!
Lazy evaluation creates optimization opportunities
11
RDDs and Caching
RDDs can be materialized in memory!
textFile: RDD[String] (on HDFS)
a: RDD[String] = textFile.flatMap(line => line.split(" "))
b: RDD[(String, Int)] = a.map(word => (word, 1))
c: RDD[(String, Int)] = b.reduceByKey(_ + _)
Action!
Cache it!
Fault tolerance? ✗
Spark works even if the RDDs are partially cached!
12
Spark Architecture
13
Compare
[Figure: Hadoop MapReduce architecture. The namenode runs the namenode daemon; the job submission node runs the jobtracker; each slave node runs a tasktracker and a datanode daemon on top of the Linux file system.]
14
YARN
Hadoop’s (original) limitations:
  Can only run MapReduce
  What if we want to run other distributed frameworks?
YARN = Yet-Another-Resource-Negotiator
  Provides an API to develop any generic distributed application
  Handles scheduling and resource requests
  MapReduce (MR2) is one such application in YARN
15
YARN
16
Spark Programs
[Figure: your application (driver program) holds the SparkContext and local threads; it talks to the cluster manager and to workers, each running a Spark executor; data lives on HDFS.]
Spark context: tells the framework where to find the cluster
Use the Spark context to create RDDs
spark-shell, spark-submit
Scala, Java, Python, R
17
Spark Driver
val textFile = sc.textFile(args.input())
textFile
  .flatMap(line => tokenize(line))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile(args.output())
What’s happening to the functions?
[Figure: your application (driver program) holds the SparkContext and local threads; it talks to the cluster manager and to workers, each running a Spark executor; data lives on HDFS. Launched via spark-shell or spark-submit.]
18
Spark Driver
val textFile = sc.textFile(args.input())
textFile
  .flatMap(line => tokenize(line))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile(args.output())
Note: you can run code “locally” and integrate cluster-computed values!
Beware of the collect action!
[Figure: your application (driver program) holds the SparkContext and local threads; it talks to the cluster manager and to workers, each running a Spark executor; data lives on HDFS. Launched via spark-shell or spark-submit.]
19
Spark Transformations
RDD[T] → filter f: (T) ⇒ Boolean → RDD[T]
RDD[T] → map f: (T) ⇒ U → RDD[U]
RDD[T] → flatMap f: (T) ⇒ TraversableOnce[U] → RDD[U]
RDD[T] → mapPartitions f: (Iterator[T]) ⇒ Iterator[U] → RDD[U]
RDD[(K, V)] → groupByKey → RDD[(K, Iterable[V])]
RDD[(K, V)] → reduceByKey f: (V, V) ⇒ V → RDD[(K, V)]
RDD[(K, V)] → aggregateByKey seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U → RDD[(K, U)]
RDD[(K, V)] → sort → RDD[(K, V)]
RDD[(K, V)], RDD[(K, W)] → join → RDD[(K, (V, W))]
RDD[(K, V)], RDD[(K, W)] → cogroup → RDD[(K, (Iterable[V], Iterable[W]))]
20
Starting Points
[Figure: an InputFormat divides the input file into InputSplits; a RecordReader reads each split and feeds a “mapper”.]
Source: redrawn from a slide by Cloudera, cc-licensed
21
Physical Operators
22
Execution Plan
23
Can’t avoid this! … …
24
Spark Shuffle Implementations Hash shuffle Source: http://0x0fff.com/spark-architecture-shuffle/
25
Spark Shuffle Implementations Sort shuffle Source: http://0x0fff.com/spark-architecture-shuffle/
26
Remember this?
[Figure labels: Mapper, circular buffer (in memory), spills (on disk), merged spills (on disk), Combiner, intermediate files (on disk), Reducer, other mappers, other reducers]
So, in MapReduce, why are key-value pairs processed in sorted order in the reducer?
27
Remember this?
[Figure labels: Mapper, circular buffer (in memory), spills (on disk), merged spills (on disk), Combiner, intermediate files (on disk), Reducer, other mappers, other reducers]
Where are the combiners in Spark?
28
Reduce-like Operations
RDD[(K, V)] → reduceByKey f: (V, V) ⇒ V → RDD[(K, V)]
RDD[(K, V)] → aggregateByKey seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U → RDD[(K, U)]
… …
What happened to combiners? How can we optimize?
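One way to read the `aggregateByKey` signature is that `seqOp` does the combiner's job inside each partition and `combOp` merges the per-partition results. The sketch below simulates that on local collections; it is an illustration under that assumption, not Spark's implementation, and `LocalAggregate` and `sumAndCount` are hypothetical names.

```scala
object LocalAggregate {
  // Simulates aggregateByKey over explicitly given partitions.
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U
  ): Map[K, U] = {
    // "Map side": fold each partition's values per key, like a combiner would.
    val partials: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // "Reduce side": merge the per-partition partial results for each key.
    partials.flatten.foldLeft(Map.empty[K, U]) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
    }
  }

  // Example: per-key (sum, count), with V = Int and U = (Int, Int).
  def sumAndCount(partitions: Seq[Seq[(String, Int)]]): Map[String, (Int, Int)] =
    aggregateByKey(partitions, (0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2)
    )
}
```

Only one partial value per key and partition crosses the simulated shuffle, which is the saving a MapReduce combiner provides.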
29
Spark #wins
Richer operators
RDD abstraction supports optimizations (pipelining, caching, etc.)
Scala, Java, Python, R bindings
30
Algorithm design, redux
Source: Wikipedia (Mahout)
31
Two superpowers:
Associativity
Commutativity (sorting)
What follows… very basic category theory…
32
The Power of Associativity
$v_1 \oplus v_2 \oplus v_3 \oplus v_4 \oplus v_5 \oplus v_6 \oplus v_7 \oplus v_8 \oplus v_9$
$v_1 \oplus v_2 \oplus v_3 \oplus v_4 \oplus v_5 \oplus v_6 \oplus v_7 \oplus v_8 \oplus v_9$
You can put parentheses wherever you want!
Credit to Oscar Boykin for the idea behind these slides
33
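The Power of Commutativity
$v_1 \oplus v_2 \oplus v_3 \oplus v_4 \oplus v_5 \oplus v_6 \oplus v_7 \oplus v_8 \oplus v_9$
$v_4 \oplus v_5 \oplus v_6 \oplus v_7 \oplus v_1 \oplus v_2 \oplus v_3 \oplus v_8 \oplus v_9$
$v_8 \oplus v_9 \oplus v_4 \oplus v_5 \oplus v_6 \oplus v_7 \oplus v_1 \oplus v_2 \oplus v_3$
You can swap the order of operands however you want!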
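Both properties can be checked on nine integers. In this small sketch addition stands in for ⊕, and subtraction serves as a counterexample that is neither associative nor commutative; `Regroup` and its members are hypothetical names.

```scala
object Regroup {
  val values: Vector[Int] = Vector(1, 2, 3, 4, 5, 6, 7, 8, 9)
  val op: (Int, Int) => Int = _ + _

  // Associativity: reduce chunks independently, then reduce the partial results.
  def chunked(vs: Vector[Int], chunkSize: Int): Int =
    vs.grouped(chunkSize).map(_.reduce(op)).reduce(op)

  // Commutativity: process the operands in a different order.
  def reordered(vs: Vector[Int]): Int = vs.reverse.reduce(op)

  // Subtraction has neither property, so regrouping changes the answer.
  def chunkedMinus(vs: Vector[Int], chunkSize: Int): Int =
    vs.grouped(chunkSize).map(_.reduce(_ - _)).reduce(_ - _)
}
```

With addition, `chunked` and `reordered` agree with a plain left-to-right reduce for any chunk size. With subtraction, `chunkedMinus(values, 3)` differs from `values.reduce(_ - _)`.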
34
Implications for distributed processing?
You don’t know when the tasks begin
You don’t know when the tasks end
You don’t know when the tasks interrupt each other
You don’t know when intermediate data arrive
…
It’s okay!
35
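Fancy Labels for Simple Concepts…
Semigroup $= (M, \oplus)$, with $\oplus : M \times M \to M$ such that $\forall m_1, m_2, m_3 \in M$: $(m_1 \oplus m_2) \oplus m_3 = m_1 \oplus (m_2 \oplus m_3)$
Monoid = Semigroup + identity $\varepsilon$ such that $\varepsilon \oplus m = m \oplus \varepsilon = m$, $\forall m \in M$
Commutative Monoid = Monoid + commutativity: $\forall m_1, m_2 \in M$, $m_1 \oplus m_2 = m_2 \oplus m_1$
A few examples?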
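A few examples, written as a minimal Scala sketch. The `Monoid` trait and the instance names below are hypothetical, loosely modelled on how libraries such as Algebird present the idea. Integer addition, integer max, and set union are commutative monoids; list concatenation is a monoid but not commutative.

```scala
object Monoids {
  trait Monoid[M] {
    def empty: M               // identity element
    def combine(a: M, b: M): M // associative binary operation
  }

  val intAddition: Monoid[Int] = new Monoid[Int] {
    def empty: Int = 0
    def combine(a: Int, b: Int): Int = a + b
  }

  val intMax: Monoid[Int] = new Monoid[Int] {
    def empty: Int = Int.MinValue
    def combine(a: Int, b: Int): Int = math.max(a, b)
  }

  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def empty: Set[A] = Set.empty[A]
    def combine(a: Set[A], b: Set[A]): Set[A] = a ++ b
  }

  // A monoid, but not commutative: operand order matters.
  def listConcat[A]: Monoid[List[A]] = new Monoid[List[A]] {
    def empty: List[A] = Nil
    def combine(a: List[A], b: List[A]): List[A] = a ::: b
  }

  // Folding with a monoid needs no special case for empty input.
  def combineAll[M](xs: Seq[M])(m: Monoid[M]): M = xs.foldLeft(m.empty)(m.combine)
}
```

Because each monoid supplies an identity, `combineAll` is well defined even for an empty sequence.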
36
Back to these…
RDD[(K, V)] → reduceByKey f: (V, V) ⇒ V → RDD[(K, V)]
RDD[(K, V)] → aggregateByKey seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U → RDD[(K, U)]
Wait, I’ve seen this before?
37
Computing the mean v1 (again)
38
Computing the mean v2 (again)
39
Computing the mean v3 (again)
Wait, I’ve seen this before?
RDD[(K, V)] → reduceByKey f: (V, V) ⇒ V → RDD[(K, V)]
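The slide's own code is an image and is not reproduced here. As a hedged sketch of the idea it points at, the mean can be computed by reducing `(sum, count)` pairs, which combine associatively and commutatively, and dividing only at the end. `MeanByKey` and its local `reduceByKey` are hypothetical stand-ins for the Spark operator.

```scala
object MeanByKey {
  // Local stand-in for Spark's reduceByKey, with the signature f: (V, V) => V.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }

  def mean(pairs: Seq[(String, Double)]): Map[String, Double] = {
    // Lift each value to (sum, count) so that partial results can be merged in any order.
    val sumCounts = reduceByKey(pairs.map { case (k, v) => (k, (v, 1L)) }) {
      case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2)
    }
    // Divide once, after all partial results have been combined.
    sumCounts.map { case (k, (s, c)) => (k, s / c) }
  }
}
```

Averaging the raw values pairwise would not work, since a mean of means is not the overall mean; the `(sum, count)` pair is what makes the reduction safe.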
40
Co-occurrence Matrix: Stripes
Wait, I’ve seen this before?
RDD[(K, V)] → reduceByKey f: (V, V) ⇒ V → RDD[(K, V)]
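In the stripes approach the value for each word is a map of neighbour counts, and element-wise map addition is the `(V, V) ⇒ V` function. The sketch below assumes, for illustration only, that two words co-occur when they appear in the same line; `Stripes`, `emit`, and `cooccurrence` are hypothetical names.

```scala
object Stripes {
  type Stripe = Map[String, Int]

  // Element-wise sum of two stripes: associative and commutative.
  def merge(a: Stripe, b: Stripe): Stripe =
    b.foldLeft(a) { case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n) }

  // Emit one stripe per token: counts of every other token in the same line.
  def emit(tokens: Seq[String]): Seq[(String, Stripe)] =
    tokens.indices.map { i =>
      val others = tokens.patch(i, Nil, 1) // every token except the one at position i
      (tokens(i), others.groupBy(identity).map { case (w, ws) => (w, ws.size) })
    }

  // Local stand-in for reduceByKey(merge) over all emitted stripes.
  def cooccurrence(lines: Seq[Seq[String]]): Map[String, Stripe] =
    lines.flatMap(emit).foldLeft(Map.empty[String, Stripe]) { case (acc, (w, s)) =>
      acc.updated(w, acc.get(w).map(merge(_, s)).getOrElse(s))
    }
}
```

Since `merge` is a commutative monoid operation (with the empty map as identity), stripes can be combined in any order and grouping.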
41
Synchronization: Pairs vs. Stripes
Approach 1: turn synchronization into an ordering problem
  Sort keys into correct order of computation
  Partition key space so that each reducer gets the appropriate set of partial results
  Hold state in reducer across multiple key-value pairs to perform computation
  Illustrated by the “pairs” approach
  What about this?
Approach 2: construct data structures that bring partial results together
  Each reducer receives all the data it needs to complete the computation
  Illustrated by the “stripes” approach
  Commutative monoids!
42
f(B|A): “Pairs”
(a, *) → 32 (reducer holds this value in memory)
(a, b_1) → 3, giving (a, b_1) → 3 / 32
(a, b_2) → 12, giving (a, b_2) → 12 / 32
(a, b_3) → 7, giving (a, b_3) → 7 / 32
(a, b_4) → 1, giving (a, b_4) → 1 / 32
…
For this to work:
Must emit extra (a, *) for every b_n in mapper
Must make sure all a’s get sent to same reducer (use partitioner)
Must make sure (a, *) comes first (define sort order)
Must hold state in reducer across different key-value pairs
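The reducer side of this pattern can be sketched locally. The code assumes the counts for each key, including the `(a, *)` marginal, have already been summed, and it runs in a single process, so the partitioner requirement is trivially met. `RelativeFrequency`, `sortKey`, and `Marginal` are hypothetical names.

```scala
object RelativeFrequency {
  val Marginal = "*"

  // Sort order: by left word, with the (a, *) marginal ahead of every (a, b).
  def sortKey(pair: (String, String)): (String, Int, String) =
    (pair._1, if (pair._2 == Marginal) 0 else 1, pair._2)

  def relativeFrequencies(
      counts: Seq[((String, String), Int)]
  ): Seq[((String, String), Double)] = {
    var marginal = 0.0 // state held by the "reducer" across key-value pairs
    counts.sortBy { case (key, _) => sortKey(key) }.flatMap {
      // The marginal arrives first for each left word: remember it, emit nothing.
      case ((_, Marginal), n) => marginal = n.toDouble; None
      // Every later pair with the same left word is divided by the held marginal.
      case (key, n) => Some((key, n / marginal))
    }
  }
}
```

The computation is correct only because the sort order guarantees `(a, *)` is seen before any `(a, b)`; this is the sense in which synchronization is turned into an ordering problem.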
43
Two superpowers:
Associativity
Commutativity (sorting)
44
Because you can’t avoid this… … … And sort-based shuffling is pretty efficient!
45
The datacenter is the computer! What’s the instruction set?
Source: Google
46
Algorithm design in a nutshell…
Exploit associativity and commutativity via commutative monoids (if you can)
Exploit framework-based sorting to sequence computations (if you can’t)
Source: Wikipedia (Walnut)
47
Questions?
Remember: Assignment 2 due next Tuesday at 8:30am
Source: Wikipedia (Japanese rock garden)