Other Map-Reduce (ish) Frameworks
William Cohen

Outline
– More concise languages for map-reduce pipelines
– Abstractions built on top of map-reduce
  – General comments
  – Specific systems: Cascading, Pipes; PIG, Hive; Spark, Flink

Y: Y = Hadoop + X, or Hadoop ~= Y
What else are people using?
– instead of Hadoop
– on top of Hadoop

Issues with Hadoop
– Too much typing: programs are not concise
– Too low level: missing abstractions; hard to specify a workflow
– Not well suited to iterative operations
  – e.g., EM, k-means clustering, …
  – workflow and memory-loading issues

STREAMING AND MRJOB: MORE CONCISE MAP-REDUCE PIPELINES

Hadoop streaming
Start with a stream & sort pipeline:

    cat data | mapper.py | sort -k1,1 | reducer.py

Run with Hadoop streaming instead:

    bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -D mapred.map.tasks=20 \
        -D mapred.reduce.tasks=20 \
        -file mapper.py -file reducer.py \
        -mapper mapper.py -reducer reducer.py \
        -input /hdfs/path/to/inputDir \
        -output /hdfs/path/to/outputDir

(The generic -D options set the number of map and reduce tasks; they go before the streaming-specific options.)
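For concreteness, here is a minimal mapper/reducer pair of the kind the pipeline above expects; the file names match the command, but the code itself is our sketch, not the lecture's.

    #!/usr/bin/env python
    # mapper.py: emit one "word <tab> 1" line per token
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    #!/usr/bin/env python
    # reducer.py: input arrives sorted by key (sort -k1,1, or the shuffle),
    # so all counts for a word are adjacent
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))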

mrjob word count
– Python layer over map-reduce: very concise
– Can run locally in Python
– Allows a single job or a linear chain of steps
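The job on the slide is an image; in mrjob's API a word count is conventionally written like this (the class name is ours):

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):        # the key is unused for plain text input
            for word in line.split():
                yield word, 1
        def reducer(self, word, counts):  # counts is an iterator of 1s
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

Run it locally with "python wordcount.py input.txt", or on a cluster by adding "-r hadoop".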

mrjob: most frequent word
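That slide's code is also an image; a plausible two-step version chains a word count with a max, using mrjob's MRStep to express the linear chain (all names here are ours):

    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MRMostFreqWord(MRJob):
        def steps(self):
            return [
                MRStep(mapper=self.mapper_get_words,
                       reducer=self.reducer_count_words),
                MRStep(reducer=self.reducer_find_max),
            ]
        def mapper_get_words(self, _, line):
            for word in line.split():
                yield word.lower(), 1
        def reducer_count_words(self, word, counts):
            # single shared key, so the next reducer sees every (count, word) pair
            yield None, (sum(counts), word)
        def reducer_find_max(self, _, count_word_pairs):
            count, word = max(count_word_pairs)
            yield word, count

    if __name__ == '__main__':
        MRMostFreqWord.run()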

MAP-REDUCE ABSTRACTIONS: CASCADING, PIPES, SCALDING

Y: Y = Hadoop + X
Cascading
– Java library for map-reduce workflows
– Also some library operations for common mappers/reducers

Cascading WordCount Example
Callouts from the slide:
– input format, and output format: pairs
– bind to an HDFS path
– a pipeline of map-reduce jobs
– append a step: apply a function to the "line" field
– append a step: group a (flattened) stream of "tuples"
– replace line with a bag of words
– append a step: aggregate grouped values
– run the pipeline
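The code itself didn't survive the transcript; a Cascading 1.x-era word count conventionally looks like this, with the callouts above mapped onto lines (the paths and token regex are assumptions):

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.Aggregator;
    import cascading.operation.Function;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        // input/output formats, bound to HDFS paths via Hfs taps
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

        // a pipeline of map-reduce jobs
        Pipe assembly = new Pipe("wordcount");
        // append a step: apply a function to the "line" field,
        // replacing each line with a (flattened) stream of word tuples
        Function splitter = new RegexGenerator(new Fields("word"), "\\w+");
        assembly = new Each(assembly, new Fields("line"), splitter);
        // append a step: group the stream of tuples by word
        assembly = new GroupBy(assembly, new Fields("word"));
        // append a step: aggregate the grouped values with a count
        Aggregator count = new Count(new Fields("count"));
        assembly = new Every(assembly, count);

        // run the pipeline
        Flow flow = new FlowConnector().connect(source, sink, assembly);
        flow.complete();
      }
    }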

Cascading WordCount Example
Is this inefficient? We explicitly form a group for each word, and then count the elements. We could be saved by careful optimization: we know we don't need the GroupBy intermediate result when we run the assembly. Many of the Hadoop abstraction levels have a similar flavor:
– define a pipeline of tasks declaratively
– optimize it automatically
– run the final result
The key question: does the system successfully hide the details from you?

Y: Y = Hadoop + X
Cascading
– Java library for map-reduce workflows, expressed as "Pipe"s to which you add Each, Every, GroupBy, …
– Also some library operations for common mappers/reducers, e.g. RegexGenerator
– Turing-complete, since it's an API for Java
Pipes
– C++ library for map-reduce workflows on Hadoop
Scalding
– More concise Scala library based on Cascading (sketch below)
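To illustrate the conciseness claim, a hedged Scalding word count in the classic fields-based API (input/output via args are assumptions):

    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))                                            // read lines
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }  // tokenize
        .groupBy('word) { _.size }                                       // count per word
        .write(Tsv(args("output")))                                      // write (word, size)
    }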

MORE DECLARATIVE LANGUAGES

Hive and PIG: word count
– Declarative
– Fairly stable
A PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc. are allowed.
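The program on the slide is an image; a Pig Latin word count conventionally reads like this (file names assumed):

    lines  = LOAD 'input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO 'wordcount-output';

Note the shape: every statement assigns a relation, and there is no control flow.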

More on Pig
Pig Latin
– atomic types, plus compound types like tuple, bag, map
– execute locally/interactively or on Hadoop
– can embed Pig in Java (and Python and …)
– can call out to Java from Pig
Similar (ish) system from Microsoft: DryadLINQ

TOKENIZE – a built-in function
FLATTEN – a special keyword, which applies to the next step in the process, i.e. the FOREACH is transformed from a MAP to a FLATMAP

PIG Features
– LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias -- debugging
– GROUP alias BY …
– FOREACH alias GENERATE group, SUM(…)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
– JOIN r BY field, s BY field, …
  – inner join, producing rows: r::f1, r::f2, … s::f1, s::f2, …
– CROSS r, s, …
  – use with care unless all but one of the relations are singleton
– user-defined functions as operators
  – also for loading, aggregates, …
PIG parses and optimizes a sequence of commands before it executes them. It's smart enough to turn GROUP … FOREACH … SUM … into a map-reduce.

Issues with Hadoop
– Too much typing: programs are not concise
– Too low level: missing abstractions; hard to specify a workflow
– Not well suited to iterative operations
  – e.g., EM, k-means clustering, …
  – workflow and memory-loading issues
First: an iterative algorithm in Pig Latin

Julien Le Dem - Yahoo
How to use loops, conditionals, etc.? Embed PIG in a real programming language.


An example from Ron Bekkerman

Example: k-means clustering
An EM-like algorithm:
– Initialize k cluster centroids
– E-step: associate each data instance with the closest centroid
  – find expected values of cluster assignments given the data and centroids
– M-step: recalculate centroids as an average of the associated data instances
  – find new centroids that maximize that expectation

k-means Clustering (figure: data points and cluster centroids)

Parallelizing k-means (a sequence of three figure slides)

k-means on MapReduce
– Mappers read data portions and centroids
– Mappers assign data instances to clusters
– Mappers compute new local centroids and local cluster sizes
– Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
– Reducers write the new centroids
(Panda et al., Chapter 2)

k-means in Apache Pig: input data
Assume we need to cluster documents
– stored in a 3-column table D:

    Document  Word      Count
    doc1      Carnegie  2
    doc1      Mellon    2

Initial centroids are k randomly chosen docs
– stored in table C in the same format as above

k-means in Apache Pig: E-step
(Subscripted names were flattened by the transcript; below they are written with underscores: i_d is the weight of word w in document d, and i_c the weight of w in centroid c. The script computes, for each document, the centroid with the highest length-normalized dot product, and keeps the top one.)

    D_C      = JOIN C BY w, D BY w;
    PROD     = FOREACH D_C GENERATE d, c, i_d * i_c AS i_d_i_c;
    PRODg    = GROUP PROD BY (d, c);
    DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(i_d_i_c) AS dXc;
    SQR      = FOREACH C GENERATE c, i_c * i_c AS i_c2;
    SQRg     = GROUP SQR BY c;
    LEN_C    = FOREACH SQRg GENERATE c, SQRT(SUM(i_c2)) AS len_c;
    DOT_LEN  = JOIN LEN_C BY c, DOT_PROD BY c;
    SIM      = FOREACH DOT_LEN GENERATE d, c, dXc / len_c;
    SIMg     = GROUP SIM BY d;
    CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);


k-means in Apache Pig: M-step

    D_C_W   = JOIN CLUSTERS BY d, D BY d;
    D_C_Wg  = GROUP D_C_W BY (c, w);
    SUMS    = FOREACH D_C_Wg GENERATE c, w, SUM(i_d) AS sum;
    D_C_Wgg = GROUP D_C_W BY c;
    SIZES   = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;
    SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
    C       = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS i_c;

Finally, embed in Java (or Python or …) to do the looping; a sketch of such a driver follows.
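A hedged sketch of that driver, using Pig's embedded (Jython) scripting API; the script and parameter names here are ours:

    #!/usr/bin/python
    # run with: pig kmeans_driver.py
    from org.apache.pig.scripting import Pig

    P = Pig.compileFromFile("kmeans_step.pig")   # the E-step + M-step above,
                                                 # reading $C_IN, writing $C_OUT
    centroids = "centroids_0"                    # HDFS path of initial centroids
    for i in range(10):                          # fixed number of iterations
        out = "centroids_%d" % (i + 1)
        result = P.bind({"C_IN": centroids, "C_OUT": out}).runSingle()
        if not result.isSuccessful():
            raise RuntimeError("iteration %d failed" % i)
        centroids = out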

The problem with k-means in Hadoop: I/O costs

Data is read, and the model is written, with every iteration
– Mappers read data portions and centroids
– Mappers assign data instances to clusters
– Mappers compute new local centroids and local cluster sizes
– Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
– Reducers write the new centroids
(Panda et al., Chapter 2)

SCHEMES DESIGNED FOR ITERATIVE HADOOP PROGRAMS: SPARK AND FLINK

Spark word count example
– Research project, based on Scala and Hadoop
– Now APIs in Java and Python as well
Familiar-looking API for abstract operations (map, flatMap, reduceByKey, …). Most API calls are "lazy": counts is a data structure defining a pipeline, not a materialized table. Includes the ability to store a sharded dataset in cluster memory as an RDD (resilient distributed dataset).
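The code on the slide is an image; in the Python API the example conventionally reads (paths assumed):

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("hdfs:///path/to/input")     # RDD of lines
                .flatMap(lambda line: line.split())    # RDD of words
                .map(lambda word: (word, 1))           # RDD of (word, 1) pairs
                .reduceByKey(lambda a, b: a + b))      # lazily defines the pipeline
    counts.saveAsTextFile("hdfs:///path/to/output")    # action: triggers execution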

Spark logistic regression example

Spark logistic regression example: allows caching data in memory
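A sketch of the logistic-regression loop, after the example in the Spark paper; the parsing, dimensionality, and iteration count are assumptions:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="SparkLR")
    D = 10            # feature dimensionality (assumed)
    ITERATIONS = 10   # fixed number of gradient steps (assumed)

    def parse_point(line):
        v = np.array([float(x) for x in line.split()])
        return v[1:], v[0]          # (features x, label y in {-1, +1})

    # the key trick: the points RDD is cached in cluster memory, so each
    # iteration re-reads it from RAM rather than from disk
    points = sc.textFile("hdfs:///path/to/points").map(parse_point).cache()

    w = np.random.randn(D)          # initial weight vector
    for _ in range(ITERATIONS):
        # gradient of the logistic loss, summed over all points
        grad = points.map(
            lambda p: -p[1] * p[0] / (1.0 + np.exp(p[1] * np.dot(w, p[0])))
        ).reduce(lambda a, b: a + b)
        w -= grad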


FLINK
– Recent Apache project, formerly Stratosphere

FLINK
– Apache project, just getting started
– Java API


FLINK
– Like Spark, in-memory or on disk
– Everything is a Java object
– Unlike Spark, contains operations for iteration, allowing query optimization
– Very easy to use and install in local mode
  – very modular
  – only needs Java
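To make the Java API concrete, a hedged word count against the classic Flink DataSet API (the class and path handling are ours):

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> text = env.readTextFile(args[0]);
        DataSet<Tuple2<String, Integer>> counts = text
            .flatMap(new Tokenizer())
            .groupBy(0)     // group by the word field
            .sum(1);        // sum the counts
        counts.writeAsCsv(args[1], "\n", " ");
        env.execute("Word Count");
      }

      public static final class Tokenizer
          implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
          for (String token : line.toLowerCase().split("\\W+")) {
            if (token.length() > 0) out.collect(new Tuple2<>(token, 1));
          }
        }
      }
    }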

MORE EXAMPLES IN PIG

Phrase Finding in PIG

Phrase Finding 1 - loading the input


PIG Features
– comments: -- like this, or /* like this */
– 'shell-like' commands:
  – fs -ls … -- any hadoop fs … command
  – some shortcuts: ls, cp, …
  – sh ls -al -- escape to the shell


PIG Features
– comments: -- like this, or /* like this */
– 'shell-like' commands:
  – fs -ls … -- any hadoop fs … command
  – some shortcuts: ls, cp, …
  – sh ls -al -- escape to the shell
– LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, …
  – schemas can include complex types: bag, map, tuple, …
– FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
  – operators include +, -, and, or, …
  – can extend this set easily (more later)
– DESCRIBE alias -- shows the schema
– ILLUSTRATE alias -- derives a sample tuple

Phrase Finding 1 - word counts


PIG Features
– LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias -- debugging
– GROUP r BY x
  – like a shuffle-sort: produces a relation with fields group and r, where r is a bag
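To see what GROUP produces, a small hedged sketch (relation and field names are ours):

    wc  = LOAD 'wc.txt' AS (word:chararray, count:int);
    grp = GROUP wc BY word;
    DESCRIBE grp;
    -- grp: {group: chararray, wc: {(word: chararray, count: int)}}
    -- each row holds one key ("group") plus a bag ("wc") of all matching rows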

PIG parses and optimizes a sequence of commands before it executes them. It's smart enough to turn GROUP … FOREACH … SUM … into a map-reduce.

PIG Features
– LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias -- debugging
– GROUP alias BY …
– FOREACH alias GENERATE group, SUM(…)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
  – aggregates: COUNT, SUM, AVERAGE, MAX, MIN, …
  – you can write your own

PIG parses and optimizes a sequence of commands before it executes them. It's smart enough to turn GROUP … FOREACH … SUM … into a map-reduce.

Phrase Finding 3 - assembling phrase- and word-level statistics



PIG Features
– LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias -- debugging
– GROUP alias BY …
– FOREACH alias GENERATE group, SUM(…)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
– JOIN r BY field, s BY field, …
  – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
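A small hedged illustration of the joined schema (files and fields are ours):

    r = LOAD 'r.txt' AS (f1:chararray, f2:int);
    s = LOAD 's.txt' AS (f1:chararray, f3:int);
    j = JOIN r BY f1, s BY f1;
    DESCRIBE j;
    -- j: {r::f1: chararray, r::f2: int, s::f1: chararray, s::f3: int}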

Phrase Finding 4 - adding total frequencies


How do we add the totals to the phraseStats relation?
STORE triggers execution of the query plan; it also limits optimization.

Comment: the schema is lost when you store.

PIG Features
– LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias -- debugging
– GROUP alias BY …
– FOREACH alias GENERATE group, SUM(…)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
– JOIN r BY field, s BY field, …
  – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
– CROSS r, s, …
  – use with care unless all but one of the relations are singleton
  – newer Pigs allow a singleton relation to be cast to a scalar
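A hedged sketch of the scalar cast, which is one way to attach a grand total to every row without a CROSS (names are ours):

    wc      = LOAD 'wc.txt' AS (word:chararray, count:int);
    total   = FOREACH (GROUP wc ALL) GENERATE SUM(wc.count) AS n;
    -- total is a singleton relation; project its field as a scalar:
    relfreq = FOREACH wc GENERATE word, (double)count / (double)total.n;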

Phrase Finding 5 - phrasiness and informativeness

How do we compute some complicated function? With a "UDF".
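A minimal hedged example of a Pig EvalFunc UDF; the class name and the particular score are ours, not the lecture's:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // score(pPhrase, pIndep) = log(pPhrase / pIndep), a PMI-style score
    public class PmiScore extends EvalFunc<Double> {
      @Override
      public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) return null;
        double pPhrase = (Double) input.get(0);  // observed phrase probability
        double pIndep  = (Double) input.get(1);  // probability under independence
        return Math.log(pPhrase / pIndep);
      }
    }

After REGISTER'ing the jar, the UDF is invoked by its (fully qualified) class name inside a FOREACH … GENERATE, like any built-in.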


PIG Features
– LOAD 'hdfs-path' AS (schema)
  – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias -- debugging
– GROUP alias BY …
– FOREACH alias GENERATE group, SUM(…)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
– JOIN r BY field, s BY field, …
  – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
– CROSS r, s, …
  – use with care unless all but one of the relations are singleton
– user-defined functions as operators
  – also for loading, aggregates, …

The full phrase-finding pipeline


GUINEA PIG

GuineaPig: PIG in Python
– Pure Python (< 1500 lines)
– Streams Python data structures
  – strings, numbers, tuples (a,b), lists [a,b,c]
  – no records: operations defined functionally
– Compiles to a Hadoop streaming pipeline
  – optimizes sequences of MAPs
– Runs locally without Hadoop
  – compiles to a stream-and-sort pipeline
  – intermediate results can be viewed
– Can easily run parts of a pipeline

GuineaPig: PIG in Python
– Pure Python, streams Python data structures
  – not too much new to learn (e.g., field/record notation, special string operations, UDFs, …)
  – codebase is small and readable
– Compiles to Hadoop or stream-and-sort; can easily run parts of a pipeline
  – intermediate results often are (and always can be) stored and inspected
  – the plan is fairly visible
– Syntax includes high-level operations but also a fairly detailed description of an optimized map-reduce step
  – Flatten | Group(by=…, retaining=…, reducingTo=…)

A wordcount example: class variables in the planner are data structures
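The example itself is an image in the transcript; GuineaPig's tutorial word count looks roughly like this (the corpus file name is assumed):

    import sys
    from guineapig import *

    def tokens(line):
        for tok in line.split():
            yield tok.lower()

    class WordCount(Planner):
        # a class variable in the planner: a data structure defining the pipeline
        wc = ReadLines('corpus.txt') | Flatten(by=tokens) \
             | Group(by=lambda x: x, reducingTo=ReduceToCount())

    if __name__ == "__main__":
        WordCount().main(sys.argv)

Materialize the view locally with "% python wordcount.py --store wc".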

Wordcount example: the data structure can be converted to a series of "abstract map-reduce tasks".

More examples of GuineaPig
– Join syntax, macros, Format command
– Incremental debugging, when intermediate views are stored:

    % python wrdcmp.py --store result
    …
    % python wrdcmp.py --store result --reuse cmp

More examples of GuineaPig
Full syntax for Group:

    Group(wc, by=lambda (word,count): word[:k],
          retaining=lambda (word,count): count,
          reducingTo=ReduceToSum())

is equivalent to:

    Group(wc, by=lambda (word,count): word[:k],
          reducingTo=ReduceTo(int, lambda accum,(word,count): accum + count))

(Python 2 tuple-parameter lambdas, as on the original slides.)