
1 Other Map-Reduce (ish) Frameworks: Spark (William Cohen)

2 Recap: Last month
More concise languages for map-reduce pipelines
Abstractions built on top of map-reduce
– General comments
– Specific systems: Cascading, Pipes; PIG, Hive; Spark, Flink

3 Recap: Issues with Hadoop
Too much typing – programs are not concise
Too low level – missing abstractions – hard to specify a workflow
Not well suited to iterative operations
– e.g., E/M, k-means clustering, …
– workflow and memory-loading issues

4 Spark
Too much typing / too low level: Spark provides a set of concise dataflow operations ("transformations"), embedded in an API together with "actions".
Not well suited to iterative operations: sharded files are replaced by "RDDs" – resilient distributed datasets – which can be cached in cluster memory and recreated to recover from errors.

5 Spark examples
spark is a SparkContext object
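The slide's code is not reproduced in this transcript; a minimal sketch of what the setup might look like, assuming PySpark and a placeholder HDFS path:

```python
from pyspark import SparkConf, SparkContext

# create the SparkContext the slides call "spark" (app name is an assumption)
conf = SparkConf().setAppName("examples")
spark = SparkContext(conf=conf)

# RDDs are created from sharded files, e.g. on HDFS (path elided, as on the slide)
lines = spark.textFile("hdfs://...")
```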

6 Spark examples
errors is a transformation, and thus a data structure that explains HOW to do something.
count() is an action: it will actually execute the plan for errors and return a value.
errors.filter() is a transformation; collect() is an action.
Everything is sharded, like in Hadoop and GuineaPig.
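The code itself is missing from the transcript; a sketch of the kind of pipeline these annotations describe, reusing the lines RDD from the sketch above (the filter strings are assumptions, not from the slide):

```python
# transformation: lazily describes HOW to compute the result
errors = lines.filter(lambda s: s.startswith("ERROR"))

# actions: actually execute the plan and return values to the driver
n_errors = errors.count()
mysql_errors = errors.filter(lambda s: "MySQL" in s).collect()
```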

7 Spark examples
# modify errors to be stored in cluster memory; subsequent actions will be much faster
Everything is sharded, and the shards are stored in the memory of worker machines, not on local disk (if possible).
You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark's not smart about persisting data.
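A sketch of the two options, continuing the errors RDD from above (the choice of storage level is an assumption):

```python
from pyspark import StorageLevel

# keep the errors RDD in cluster memory; later actions reuse the cached shards
errors.cache()

# or explicitly persist to disk (roughly GuineaPig's opts(stored=True))
errors.persist(StorageLevel.DISK_ONLY)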

8 Spark examples: wordcount
Annotations on the slide's code: one arrow marks the action; another marks a transformation on (key,value) pairs, which are special.
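The wordcount code is not in the transcript; a sketch of the classic version in this style (input/output paths are placeholders, not from the slide):

```python
text = spark.textFile("hdfs://.../input")               # placeholder path

counts = (text.flatMap(lambda line: line.split())       # one record per word
              .map(lambda word: (word, 1))               # build (key, value) pairs
              .reduceByKey(lambda a, b: a + b))          # transformation on (key,value) pairs

counts.saveAsTextFile("hdfs://.../output")               # the action: forces execution
```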

9 Spark examples: batch logistic regression
reduce is an action – it produces a numpy vector.
p.x and w are vectors, from the numpy package. Python overloads operations like * and + for vectors.
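The transcript drops the code; a sketch close to the standard Spark batch logistic-regression example (parsePoint is a hypothetical parser, and D, ITERATIONS, and the path are assumptions):

```python
import numpy as np

D = 10            # number of features (assumed for the sketch)
ITERATIONS = 5    # number of passes (assumed)

# each parsed point p has p.x (a numpy feature vector) and p.y (a label in {-1, +1});
# parsePoint is hypothetical, not shown in the transcript
points = spark.textFile("hdfs://.../points").map(parsePoint)

w = 2 * np.random.rand(D) - 1
for t in range(ITERATIONS):
    # map: per-point gradient contribution; reduce (an action) sums them into one numpy vector
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p.y * w.dot(p.x))) - 1.0) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient
```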

10 Spark examples: batch logistic regression
Important note: numpy vectors/matrices are not just "syntactic sugar".
– They are much more compact than something like a list of Python floats.
– numpy operations like dot, *, + are calls to optimized C code.
– A little Python logic around a lot of numpy calls is pretty efficient.
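For instance, the per-point gradient above is just a handful of numpy calls on whole vectors (a toy illustration with made-up values, not from the slides):

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])   # one point's features
w = np.zeros(3)
y = 1.0

# dot, exp, scalar *, and vector arithmetic all run in optimized C code
g = (1.0 / (1.0 + np.exp(-y * w.dot(x))) - 1.0) * y * x
```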

11 Spark examples: batch logistic regression
w is defined outside the lambda function, but used inside it.
So Python builds a closure – code including the current value of w – and Spark ships it off to each worker. So w is copied, and must be read-only.

12 Spark examples: batch logistic regression
The dataset of points is cached in cluster memory to reduce I/O.

13 Spark logistic regression example

14 Spark

15 Spark details: broadcast
Recall: Python builds a closure – code including the current value of w – and Spark ships it off to each worker. So w is copied, and must be read-only.

16 Spark details: broadcast
Alternative: create a broadcast variable, e.g., w_broad = spark.broadcast(w), which is accessed by the worker via w_broad.value.
What's sent is a small pointer to w (e.g., the name of a file containing a serialized version of w), and when value is accessed, some clever all-reduce-like machinery is used to reduce network load.
There is little penalty for distributing something that's not used by all workers.
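A sketch of the change in the logistic-regression loop, carrying over names from the earlier sketch (the per-iteration unpersist is my addition, not from the slides):

```python
# broadcast the current weights instead of closing over them
w_broad = spark.broadcast(w)

gradient = points.map(
    lambda p: (1.0 / (1.0 + np.exp(-p.y * w_broad.value.dot(p.x))) - 1.0) * p.y * p.x
).reduce(lambda a, b: a + b)

w -= gradient
w_broad.unpersist()   # release the old broadcast before the next iteration
```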

17 Spark details: mapPartitions
Common issue: a map task requires loading in some small shared value; more generally, a map task requires some sort of initialization before processing a shard.
– GuineaPig: special Augment … sideview … pattern for shared values; can kludge up any initializer using Augment
– Raw Hadoop: mapper.configure() and mapper.close() methods

18 Spark details: mapPartitions
Spark: rdd.mapPartitions(f) will call f(iteratorOverShard) once per shard, and return an iterator over the mapped values. f() can do any setup/close steps it needs.
Also: there are transformations to partition an RDD with a user-selected function, like in Hadoop. Usually you partition and persist/cache.
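A sketch of the setup/close pattern (load_resource and score are hypothetical helpers, not from the slides):

```python
def process_shard(records):
    resource = load_resource()        # hypothetical: one-time setup per shard
    for rec in records:
        yield score(rec, resource)    # hypothetical per-record work
    # any close/cleanup steps would go here, after the loop

scored = rdd.mapPartitions(process_shard)   # f is called once per shard
```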

19 Spark – from logistic regression to matrix factorization (William Cohen)

20 Recovering latent factors in a matrix (Recap)
[Figure: an n (users) × m (movies) ratings matrix V, with V[i,j] = user i's rating of movie j, approximated as V ≈ W H, where W is n × r (one row of factors per user, e.g. (x1,y1) … (xn,yn) for r = 2) and H is r × m (one column of factors per movie).]

21 Recap

22 Strata (Recap)

23 MF HW
How do you define the strata?
– first assign rows and columns to blocks
– then assign blocks to strata
With K = 5 blocks per side, the stratum of each block (rows = rowBlock, columns = colBlock):
1 2 3 4 5
5 1 2 3 4
4 5 1 2 3
3 4 5 1 2
2 3 4 5 1
The diagonal colBlock = rowBlock is one stratum; the blocks with colBlock – rowBlock = 1 mod K are another; in general, stratum i is defined by colBlock – rowBlock = i mod K.
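A tiny sketch of that assignment (the 0-based indexing convention is my assumption):

```python
K = 5   # number of row blocks = number of column blocks = number of strata

def block_of(index, n, K):
    # map a row index (or column index) in 0..n-1 to a block in 0..K-1
    return index * K // n

def stratum_of(row_block, col_block, K):
    # blocks with the same (colBlock - rowBlock) mod K land in the same stratum
    return (col_block - row_block) % K
```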

24 MF HW
Their algorithm:
for epoch t = 1, …, T in sequence
  – for stratum s = 1, …, K in sequence
    – for block b = 1, … in stratum s, in parallel
      – for triple (i, j, rating) in block b, in sequence
        » do SGD step for (i, j, rating)

25 MF HW
Our algorithm:
cache the rating matrix into cluster memory (like the logistic regression training data)
for epoch t = 1, …, T in sequence (like the outer loop for logistic regression)
  – for stratum s = 1, …, K in sequence
    – distribute H, W to workers (broadcast numpy matrices, sort of like in logistic regression)
    – for block b = 1, … in stratum s, in parallel
      » run SGD and collect the updates that are performed, i.e., the deltas (i, j, deltaH) or (i, j, deltaW)
    – aggregate the deltas and apply the updates to H, W
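A rough sketch of that outer structure in Spark, under stated assumptions: parse_and_key, sgd_on_partition, and apply_deltas are hypothetical helpers, W, H, T, and K are assumed to be set up already, and this is a sketch of the loop shape, not the homework solution:

```python
# each rating cell is keyed by its block: ((row_block, col_block), (i, j, rating))
cells = spark.textFile("hdfs://.../ratings").map(parse_and_key).cache()

for t in range(T):                                   # epochs, in sequence
    for s in range(K):                               # strata, in sequence
        W_b, H_b = spark.broadcast(W), spark.broadcast(H)    # distribute the factors
        deltas = (cells
                  .filter(lambda kv: (kv[0][1] - kv[0][0]) % K == s)   # blocks in stratum s
                  .mapPartitions(lambda part: sgd_on_partition(part, W_b.value, H_b.value))
                  .collect())                        # gather the (i, j, deltaW) / (i, j, deltaH) updates
        apply_deltas(W, H, deltas)                   # aggregate and apply on the driver
        W_b.unpersist(); H_b.unpersist()
```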

