Other Map-Reduce (ish) Frameworks: Spark
William Cohen

Recap: Last month
More concise languages for map-reduce pipelines
Abstractions built on top of map-reduce
– general comments
– specific systems: Cascading, Pipes; PIG, Hive; Spark, Flink

Recap: Issues with Hadoop
Too much typing – programs are not concise
Too low level – missing abstractions – hard to specify a workflow
Not well suited to iterative operations
– e.g., E/M, k-means clustering, …
– workflow and memory-loading issues

Spark
How Spark addresses the issues above:
A set of concise dataflow operations ("transformations"); the dataflow operations are embedded in an API together with "actions"
Sharded files are replaced by "RDDs" – resilient distributed datasets
RDDs can be cached in cluster memory and recreated to recover from errors

Spark examples
In the examples that follow, spark is a SparkContext object.
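The slide's code itself is not in the transcript; a minimal sketch of creating that context in PySpark (the application name is made up) might look like:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("spark-examples")   # made-up application name
    spark = SparkContext(conf=conf)                   # "spark" names the context, as on these slides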

Spark examples
errors is a transformation, and thus a data structure that explains HOW to do something.
count() is an action: it will actually execute the plan for errors and return a value.
errors.filter() is a transformation; collect() is an action.
Everything is sharded, like in Hadoop and GuineaPig.
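The code these annotations refer to is not in the transcript; a reconstruction along the lines they describe (the file path and the filter strings are made up) is:

    lines = spark.textFile("hdfs://.../log.txt")                   # made-up path
    errors = lines.filter(lambda line: line.startswith("ERROR"))   # transformation: a plan, not data
    print(errors.count())                                          # action: runs the plan, returns a number
    mysql_errors = errors.filter(lambda line: "mysql" in line)     # another transformation
    print(mysql_errors.collect())                                  # action: ships matching lines to the driver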

Spark examples
Modify errors to be stored in cluster memory; subsequent actions will be much faster.
Everything is sharded … and the shards are stored in the memory of worker machines, not on local disk (if possible).
You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark's not smart about persisting data.
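Continuing the sketch above, caching might look like this (StorageLevel is PySpark's; the disk-only variant is left as a comment because an RDD's storage level can only be assigned once):

    errors.cache()     # mark errors to be kept in cluster memory once computed
    errors.count()     # the first action computes errors and fills the cache
    errors.count()     # later actions reuse the cached, in-memory shards

    # To keep shards on disk instead (roughly opts(stored=True) in GuineaPig), an RDD
    # that has not already been cached can be persisted explicitly:
    #   from pyspark import StorageLevel
    #   some_rdd.persist(StorageLevel.DISK_ONLY)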

Spark examples: wordcount
The annotations on this slide point out the action, and a transformation on (key, value) pairs, which are special.
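The wordcount code itself is not in the transcript; a typical PySpark version (paths made up), with reduceByKey as the pair transformation and saveAsTextFile as the action, is:

    counts = (spark.textFile("hdfs://.../corpus.txt")        # made-up path
              .flatMap(lambda line: line.split())            # transformation: words
              .map(lambda word: (word, 1))                   # transformation: (key, value) pairs
              .reduceByKey(lambda a, b: a + b))              # transformation on pairs: sum counts per word
    counts.saveAsTextFile("hdfs://.../counts")               # the action: writes sharded output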

Spark examples: batch logistic regression
reduce is an action – it produces a numpy vector.
p.x and w are vectors, from the numpy package. Python overloads operations like * and + for vectors.
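The gradient-descent code that the next few slides annotate is not in the transcript; a reconstruction close to the standard Spark example (the names points, D, and ITERATIONS, and the random initialization, are assumptions):

    import numpy as np

    # points: a cached RDD of records, each with a numpy feature vector p.x and a label p.y in {-1, +1}
    w = np.random.rand(D)                                  # initial weights; D assumed defined
    for i in range(ITERATIONS):
        gradient = points.map(
            lambda p: (1.0 / (1.0 + np.exp(-p.y * w.dot(p.x))) - 1.0) * p.y * p.x
        ).reduce(lambda a, b: a + b)                       # reduce is the action: returns a numpy vector
        w -= gradient                                      # update w on the driver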

Spark examples: batch logistic regression
Important note: numpy vectors/matrices are not just "syntactic sugar".
– They are much more compact than something like a list of Python floats.
– numpy operations like dot, *, + are calls to optimized C code.
– A little Python logic around a lot of numpy calls is pretty efficient.

Spark examples: batch logistic regression
w is defined outside the lambda function, but used inside it.
So: Python builds a closure – code including the current value of w – and Spark ships it off to each worker. So w is copied, and must be read-only.

Spark examples: batch logistic regression
The dataset of points is cached in cluster memory to reduce I/O.

Spark logistic regression example

Spark details: broadcast
Recall: Python builds a closure – code including the current value of w – and Spark ships it off to each worker. So w is copied, and must be read-only.

Spark details: broadcast
The alternative: create a broadcast variable, e.g., w_broad = spark.broadcast(w), which is accessed by the worker via w_broad.value.
– What's actually sent is a small pointer to w (e.g., the name of a file containing a serialized version of w); when the value is accessed, some clever all-reduce-like machinery is used to reduce network load.
– There is little penalty for distributing something that's not used by all workers.
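In the logistic regression sketch above, the broadcast version of the loop might look like this (a sketch; w is re-broadcast every iteration because it changes):

    for i in range(ITERATIONS):
        w_broad = spark.broadcast(w)                       # ship w to the workers once per iteration
        gradient = points.map(
            lambda p: (1.0 / (1.0 + np.exp(-p.y * w_broad.value.dot(p.x))) - 1.0) * p.y * p.x
        ).reduce(lambda a, b: a + b)
        w -= gradient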

Spark details: mapPartitions
Common issue: a map task requires loading some small shared value; more generally, a map task requires some sort of initialization before processing a shard.
– GuineaPig: the special Augment … sideview … pattern for shared values; you can kludge up any initializer using Augment.
– Raw Hadoop: the mapper.configure() and mapper.close() methods.

Spark details: mapPartitions
Spark: rdd.mapPartitions(f) will call f(iteratorOverShard) once per shard, and return an iterator over the mapped values. f() can do any setup/close steps it needs.
Also: there are transformations to partition an RDD with a user-selected function, like in Hadoop. Usually you partition and persist/cache.
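A minimal sketch of that pattern, reusing the lines RDD from the log example above (the regular expression just stands in for whatever per-shard initialization is really needed):

    import re

    def process_shard(rows):
        pattern = re.compile(r"ERROR\s+(\S+)")   # setup once per shard, before any rows are processed
        for row in rows:
            m = pattern.search(row)
            if m:
                yield m.group(1)
        # any close/cleanup steps would go here, after the loop

    codes = lines.mapPartitions(process_shard)   # f is called once per shard with an iterator over its rows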

Spark – from logistic regression to matrix factorization
William Cohen

Recovering latent factors in a matrix (recap)
V is an n × m matrix (n users, m movies), with V[i,j] = user i's rating of movie j. Approximate V ≈ W H, where W is an n × r matrix of user factors (the slide's example has r = 2, with rows (x_i, y_i)) and H is an r × m matrix of movie factors (with columns (a_j, b_j)).
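Written out, the standard squared-loss formulation of this recap (any regularization terms the course used are omitted here) is:

    V_{ij} \approx \sum_{k=1}^{r} W_{ik} H_{kj},
    \qquad
    \min_{W,H} \sum_{(i,j)\ \mathrm{observed}} \Big( V_{ij} - \sum_{k=1}^{r} W_{ik} H_{kj} \Big)^2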

Recap: strata.

MF HW: How do you define the strata?
First assign rows and columns to blocks, then assign blocks to strata:
– colBlock = rowBlock
– colBlock – rowBlock = 1 mod K
– in general, stratum i is defined by colBlock – rowBlock = i mod K
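A minimal sketch of those assignments (hypothetical helper functions, not taken from the homework handout):

    def block_of(index, n, K):
        # map a row or column index in [0, n) to one of K roughly equal-width blocks
        return index * K // n

    def stratum_of(row_block, col_block, K):
        # stratum i holds the blocks with colBlock - rowBlock = i mod K
        return (col_block - row_block) % K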

MF HW: Their algorithm:
for epoch t = 1, …, T in sequence
  – for stratum s = 1, …, K in sequence
    – for block b = 1, … in stratum s, in parallel
      – for triple (i, j, rating) in block b, in sequence
        » do an SGD step for (i, j, rating)
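A sketch of the per-rating SGD step referred to above, assuming squared loss, learning rate eta, and no regularization (the factor of 2 in the gradient is absorbed into eta; the homework's exact update may differ):

    def sgd_step(W, H, i, j, rating, eta):
        # one stochastic gradient step on (V[i,j] - W[i,:].H[:,j])^2, with numpy arrays W, H
        err = rating - W[i].dot(H[:, j])
        w_i_old = W[i].copy()            # both updates must use the old W[i]
        W[i]    += eta * err * H[:, j]
        H[:, j] += eta * err * w_i_old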

MF HW: Our algorithm:
cache the rating matrix into cluster memory (like the logistic regression training data)
for epoch t = 1, …, T in sequence (like the outer loop for logistic regression)
  – for stratum s = 1, …, K in sequence
    – distribute H, W to workers (broadcast numpy matrices, sort of like in logistic regression)
    – for block b = 1, … in stratum s, in parallel
      » run SGD and collect the updates that are performed, i.e., the deltas (i, j, deltaH) or (i, j, deltaW)
    – aggregate the deltas and apply the updates to H, W
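Putting the pieces together, a PySpark sketch of that loop might look as follows. It reuses the block_of / stratum_of helpers sketched earlier and assumes ratings is a cached RDD of (i, j, rating) triples, with n, m, T, K and the numpy matrices W (n x r) and H (r x m) defined on the driver. It is a sketch, not the homework solution: the workers return updated rows/columns rather than true deltas, and in practice you would also partition the RDD so each partition holds exactly one block, as the mapPartitions slides suggest.

    def sgd_on_partition(triples, W, H, eta=0.01):
        # sequential SGD over one partition's triples, against local copies of the
        # broadcast W and H; return the touched rows of W and columns of H
        W, H = W.copy(), H.copy()
        rows, cols = set(), set()
        for i, j, v in triples:
            err = v - W[i].dot(H[:, j])
            w_i_old = W[i].copy()
            W[i]    += eta * err * H[:, j]
            H[:, j] += eta * err * w_i_old
            rows.add(i)
            cols.add(j)
        return [('W', i, W[i]) for i in rows] + [('H', j, H[:, j]) for j in cols]

    for t in range(T):                                        # epochs, in sequence
        for s in range(K):                                    # strata, in sequence
            Wb, Hb = spark.broadcast(W), spark.broadcast(H)   # distribute W, H to workers
            in_stratum = ratings.filter(
                lambda r: stratum_of(block_of(r[0], n, K),
                                     block_of(r[1], m, K), K) == s)
            updates = in_stratum.mapPartitions(
                lambda part: sgd_on_partition(part, Wb.value, Hb.value)).collect()
            for kind, idx, vec in updates:                    # apply the collected updates on the driver
                if kind == 'W':
                    W[idx] = vec
                else:
                    H[:, idx] = vec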