Other Map-Reduce (ish) Frameworks William Cohen 1.

Slides:



Advertisements
Similar presentations
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Advertisements

Hui Li Pig Tutorial Hui Li Some material adapted from slides by Adam Kawa the 3rd meeting of WHUG June 21, 2012.
Overview of this week Debugging tips for ML algorithms
Rocchio’s Algorithm 1. Motivation Naïve Bayes is unusual as a learner: – Only one pass through data – Order doesn’t matter 2.
MapReduce.
CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
© Hortonworks Inc Daniel Dai Thejas Nair Page 1 Making Pig Fly Optimizing Data Processing on Hadoop.
Working with pig Cloud computing lecture. Purpose  Get familiar with the pig environment  Advanced features  Walk though some examples.
High Level Language: Pig Latin Hui Li Judy Qiu Some material adapted from slides by Adam Kawa the 3 rd meeting of WHUG June 21, 2012.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture VII: 2014/04/21.
Pig Contributors Workshop Agenda Introductions What we are working on Usability Howl TLP Lunch Turing Completeness Workflow Fun (Bocci ball)
Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olsten, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Acknowledgement.
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
O’Reilly – Hadoop: The Definitive Guide Ch.5 Developing a MapReduce Application 2 July 2010 Taewhi Lee.
Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.
Spark: Cluster Computing with Working Sets
Other Map-Reduce (ish) Frameworks William Cohen. Y:Y=Hadoop+X or Hadoop~=Y What else are people using? – instead of Hadoop – on top of Hadoop.
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
HADOOP ADMIN: Session -2
Hadoop Ida Mele. Parallel programming Parallel programming is used to improve performance and efficiency In a parallel program, the processing is broken.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Cloud Computing Other High-level parallel processing languages Keke Chen.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Using Large-Vocabulary Classifiers William W. Cohen.
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
Using Large-Vocabulary Classifiers William W. Cohen.
Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.
Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
MAP-REDUCE ABSTRACTIONS 1. Abstractions On Top Of Hadoop We’ve decomposed some algorithms into a map-reduce “workflow” (series of map-reduce steps) –
Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Other Map-Reduce (ish) Frameworks: Spark William Cohen 1.
Other Map-Reduce (ish) Frameworks William Cohen 1.
Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
Other Map-Reduce (ish) Frameworks William Cohen. Y:Y=Hadoop+X or Hadoop~=Y What else are people using? – instead of Hadoop – on top of Hadoop.
MapReduce Joins Shalish.V.J. A Refresher on Joins A join is an operation that combines records from two or more data sets based on a field or set of fields,
What is Pig ???. Why Pig ??? MapReduce is difficult to program. It only has two phases. Put the logic at the phase. Too many lines of code even for simple.
Image taken from: slideshare
Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.
Pig, Making Hadoop Easy Alan F. Gates Yahoo!.
Hadoop.
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Spark Presentation.
Spark vs Hadoop.
Announcements Working AWS codes will be out soon
Pig Latin - A Not-So-Foreign Language for Data Processing
Workflows 3: Graphs, PageRank, Loops, Spark
Introduction to Spark.
10-605: Map-Reduce Workflows
Overview of big data tools
CSE 491/891 Lecture 21 (Pig).
CSE 491/891 Lecture 24 (Hive).
Charles Tappert Seidenberg School of CSIS, Pace University
Big ML.
Introduction to Spark.
Recitation #4 Tel Aviv University 2017/2018 Slava Novgorodov
MapReduce: Simplified Data Processing on Large Clusters
Lecture 29: Distributed Systems
Map Reduce, Types, Formats and Features
Presentation transcript:

Other Map-Reduce (ish) Frameworks William Cohen 1

Outline More concise languages for map-reduce pipelines Abstractions built on top of map-reduce – General comments – Specific systems Cascading, Pipes PIG, Hive Spark, Flink 2

Y:Y=Hadoop+X or Hadoop~=Y What else are people using? – instead of Hadoop – on top of Hadoop 3

Issues with Hadoop Too much typing – programs are not concise Too low level – missing abstractions – hard to specify a workflow Not well suited to iterative operations – E.g., E/M, k-means clustering, … – Workflow and memory-loading issues 4

STREAMING AND MRJOB: MORE CONCISE MAP-REDUCE PIPELINES 5

Hadoop streaming start with stream & sort pipeline cat data | mapper.py | sort –k1,1 | reducer.py run with hadoop streaming instead bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file mapper.py –file reducer.py -mapper mapper.py -reducer reducer.py -input /hdfs/path/to/inputDir -output /hdfs/path/to/outputDir -mapred.map.tasks=20 -mapred.reduce.tasks=20 6

mrjob word count Python level over map- reduce – very concise Can run locally in Python Allows a single job or a linear chain of steps 7

mrjob most freq word 8

MAP-REDUCE ABSTRACTIONS 9

Abstractions On Top Of Hadoop MRJob and other tools to make Hadoop pipelines more concise (Dumbo, …) still keep the same basic language of map-reduce jobs How else can we express these sorts of computations? Are there some common special cases of map-reduce steps we can parameterize and reuse? 10

Abstractions On Top Of Hadoop Some obvious streaming processes: – for each row in a table Transform it and output the result Decide if you want to keep it with some boolean test, and copy out only the ones that pass the test 11 Example: stem words in a stream of word-count pairs: (“aardvarks”,1)  (“aardvark”,1) Example: stem words in a stream of word-count pairs: (“aardvarks”,1)  (“aardvark”,1) Proposed syntax: table2 = MAP table1 TO λ row : f(row)) Proposed syntax: table2 = MAP table1 TO λ row : f(row)) f(row)  row’ Example: apply stop words (“aardvark”,1)  (“aardvark”,1) (“the”,1)  deleted Example: apply stop words (“aardvark”,1)  (“aardvark”,1) (“the”,1)  deleted Proposed syntax: table2 = FILTER table1 BY λ row : f(row)) Proposed syntax: table2 = FILTER table1 BY λ row : f(row)) f(row)  {true,false}

Abstractions On Top Of Hadoop A non-obvious? streaming processes: – for each row in a table Transform it to a list of items Splice all the lists together to get the output table (flatten) 12 Example: tokenizing a line “I found an aardvark”  [“i”, “found”,”an”,”aardvark”] “We love zymurgy”  [“we”,”love”,”zymurgy”]..but final table is one word per row Example: tokenizing a line “I found an aardvark”  [“i”, “found”,”an”,”aardvark”] “We love zymurgy”  [“we”,”love”,”zymurgy”]..but final table is one word per row “i” “found” “an” “aardvark” “we” “love” … “i” “found” “an” “aardvark” “we” “love” … Proposed syntax: table2 = FLATMAP table1 TO λ row : f(row)) Proposed syntax: table2 = FLATMAP table1 TO λ row : f(row)) f(row)  list of rows

Abstractions On Top Of Hadoop Another example from the Naïve Bayes test program… 13

NB Test Step (Can we do better?) X=w 1 ^Y=sports X=w 1 ^Y=worldNews X=.. X=w 2 ^Y=… X=… … … Event counts How: Stream and sort: for each C[X=w^Y=y]=n print “w C[Y=y]=n” sort and build a list of values associated with each key w Like an inverted index wCounts associated with W aardvarkC[w^Y=sports]=2 agentC[w^Y=sports]=1027,C[w^Y=world News]=564 …… zyngaC[w^Y=sports]=21,C[w^Y=worldNe ws]=4464

NB Test Step 1 (Can we do better?) X=w 1 ^Y=sports X=w 1 ^Y=worldNews X=.. X=w 2 ^Y=… X=… … … Event counts wCounts associated with W aardvarkC[w^Y=sports]=2 agentC[w^Y=sports]=1027,C[w^Y=world News]=564 …… zyngaC[w^Y=sports]=21,C[w^Y=worldNe ws]=4464 The general case: We’re taking rows from a table In a particular format (event,count) Applying a function to get a new value The word for the event And grouping the rows of the table by this new value  Grouping operation Special case of a map-reduce The general case: We’re taking rows from a table In a particular format (event,count) Applying a function to get a new value The word for the event And grouping the rows of the table by this new value  Grouping operation Special case of a map-reduce Proposed syntax: GROUP table BY λ row : f(row) Could define f via: a function, a field of a defined record structure, … Proposed syntax: GROUP table BY λ row : f(row) Could define f via: a function, a field of a defined record structure, … f(row)  field

NB Test Step 1 (Can we do better?) The general case: We’re taking rows from a table In a particular format (event,count) Applying a function to get a new value The word for the event And grouping the rows of the table by this new value  Grouping operation Special case of a map-reduce The general case: We’re taking rows from a table In a particular format (event,count) Applying a function to get a new value The word for the event And grouping the rows of the table by this new value  Grouping operation Special case of a map-reduce Proposed syntax: GROUP table BY λ row : f(row) Could define f via: a function, a field of a defined record structure, … Proposed syntax: GROUP table BY λ row : f(row) Could define f via: a function, a field of a defined record structure, … f(row)  field Aside: you guys know how to implement this, right? 1.Output pairs (f(row),row) with a map/streaming process 2.Sort pairs by key – which is f(row) 3.Reduce and aggregate by appending together all the values associated with the same key

Abstractions On Top Of Hadoop And another example from the Naïve Bayes test program… 17

Request-and-answer id 1 w 1,1 w 1,2 w 1,3 …. w 1,k1 id 2 w 2,1 w 2,2 w 2,3 …. id 3 w 3,1 w 3,2 …. id 4 w 4,1 w 4,2 … id 5 w 5,1 w 5,2 …... Test data Record of all event counts for each word wCounts associated with W aardvarkC[w^Y=sports]=2 agentC[w^Y=sports]=1027,C[w^Y=worldNews]=564 …… zyngaC[w^Y=sports]=21,C[w^Y=worldNews]=4464 Step 2: stream through and for each test case id i w i,1 w i,2 w i,3 …. w i,ki request the event counters needed to classify id i from the event-count DB, then classify using the answers Classification logic

Request-and-answer Break down into stages – Generate the data being requested (indexed by key, here a word) Eg with group … by – Generate the requests as (key, requestor) pairs Eg with flatmap … to – Join these two tables by key Join defined as (1) cross-product and (2) filter out pairs with different values for keys This replaces the step of concatenating two different tables of key-value pairs, and reducing them together – Postprocess the joined result

wCounters aardvarkC[w^Y=sports]=2 agentC[w^Y=sports]=1027,C[w^Y=worldNews]=564 …… zyngaC[w^Y=sports]=21,C[w^Y=worldNews]=4464 wCountersRequests aardvarkC[w^Y=sports]=2~ctr to id1 agentC[w^Y=sports]=…~ctr to id345 agentC[w^Y=sports]=…~ctr to id9854 agentC[w^Y=sports]=…~ctr to id345 …C[w^Y=sports]=…~ctr to id34742 zyngaC[…]~ctr to id1 zyngaC[…]… wRequest found~ctr to id1 aardvark~ctr to id1 … zynga ~ctr to id1 … ~ctr to id2

wCounters aardvarkC[w^Y=sports]=2 agentC[w^Y=sports]=1027,C[w^Y=worldNews]=564 …… zyngaC[w^Y=sports]=21,C[w^Y=worldNews]=4464 wCountersRequests aardvarkC[w^Y=sports]=2id1 agentC[w^Y=sports]=…id345 agentC[w^Y=sports]=…id9854 agentC[w^Y=sports]=…id345 …C[w^Y=sports]=…id34742 zyngaC[…]id1 zyngaC[…]… wRequest foundid1 aardvarkid1 … zynga id1 … id2

MAP-REDUCE ABSTRACTIONS: CASCADING, PIPES, SCALDING 22

Y:Y=Hadoop+X Cascading – Java library for map-reduce workflows – Also some library operations for common mappers/reducers 23

Cascading WordCount Example 24 Input format Output format: pairs Bind to HFS path A pipeline of map-reduce jobs Append a step: apply function to the “line” field Append step: group a (flattened) stream of “tuples” Replace line with bag of words Append step: aggregate grouped values Run the pipeline

Cascading WordCount Example Is this inefficient? We explicitly form a group for each word, and then count the elements…? We could be saved by careful optimization: we know we don’t need the GroupBy intermediate result when we run the assembly…. Many of the Hadoop abstraction levels have a similar flavor: Define a pipeline of tasks declaratively Optimize it automatically Run the final result The key question: does the system successfully hide the details from you?

Y:Y=Hadoop+X Cascading – Java library for map-reduce workflows expressed as “ Pipe ”s, to which you add Each, Every, GroupBy, … – Also some library operations for common mappers/reducers e.g. RegexGenerator – Turing-complete since it’s an API for Java Pipes – C++ library for map-reduce workflows on Hadoop Scalding – More concise Scala library based on Cascading 26

MORE DECLARATIVE LANGUAGES 27

Hive and PIG: word count Declarative ….. Fairly stable PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc allowed. PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc allowed. 28

More on Pig Pig Latin – atomic types + compound types like tuple, bag, map – execute locally/interactively or on hadoop can embed Pig in Java (and Python and …) can call out to Java from Pig Similar (ish) system from Microsoft: DryadLinq 29

30 Tokenize – built-in function Flatten – special keyword, which applies to the next step in the process – ie foreach is transformed from a MAP to a FLATMAP

PIG Features LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, … FOREACH alias GENERATE … AS …, … – transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias -- debugging GROUP alias BY … FOREACH alias GENERATE group, SUM(….) – GROUP/GENERATE … aggregate op together act like a map- reduce JOIN r BY field, s BY field, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, … CROSS r, s, … – use with care unless all but one of the relations are singleton User defined functions as operators – also for loading, aggregates, … 31 PIG parses and optimizes a sequence of commands before it executes them It’s smart enough to turn GROUP … FOREACH… SUM … into a map-reduce PIG parses and optimizes a sequence of commands before it executes them It’s smart enough to turn GROUP … FOREACH… SUM … into a map-reduce

ANOTHER EXAMPLE: COMPUTING TFIDF IN PIG LATIN 32

(docid,token)  (docid,token,tf(token in doc)) (docid,token,tf)  (docid,token,tf,length(doc)) (docid,token,tf,n)  (…,tf/n) (docid,token,tf,n,tf/n)  (…,df) ndocs.total_docs (docid,token,tf,n,tf/n)  (docid,token,tf/n * id) relation-to-scalar casting 33

Other PIG features … Macros, nested queries, FLATTEN “operation” – transforms a bag or a tuple into its individual elements – this transform affects the next level of the aggregate STREAM and DEFINE … SHIP DEFINE myfunc `python myfun.py` SHIP(‘myfun.py’) … r = STREAM s THROUGH myfunc AS (…); DEFINE myfunc `python myfun.py` SHIP(‘myfun.py’) … r = STREAM s THROUGH myfunc AS (…); 34

TF-IDF in PIG - another version 35

Issues with Hadoop Too much typing – programs are not concise Too low level – missing abstractions – hard to specify a workflow Not well suited to iterative operations – E.g., E/M, k-means clustering, … – Workflow and memory-loading issues 36 First: an iterative algorithm in Pig Latin

Julien Le Dem - Yahoo How to use loops, conditionals, etc? Embed PIG in a real programming language. How to use loops, conditionals, etc? Embed PIG in a real programming language. 37

38

An example from Ron Bekkerman 39

Example: k-means clustering An EM-like algorithm: Initialize k cluster centroids E-step: associate each data instance with the closest centroid – Find expected values of cluster assignments given the data and centroids M-step: recalculate centroids as an average of the associated data instances – Find new centroids that maximize that expectation 40

k-means Clustering centroids 41

Parallelizing k-means 42

Parallelizing k-means 43

Parallelizing k-means 44

k-means on MapReduce Mappers read data portions and centroids Mappers assign data instances to clusters Mappers compute new local centroids and local cluster sizes Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids Reducers write the new centroids Panda et al, Chapter 2 45

k-means in Apache Pig: input data Assume we need to cluster documents – Stored in a 3-column table D: Initial centroids are k randomly chosen docs – Stored in table C in the same format as above DocumentWordCount doc1Carnegie2 doc1Mellon2 46

D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dXc; SQR = FOREACH C GENERATE c, i c * i c AS i c 2 ; SQR g = GROUP SQR BY c; LEN_C = FOREACH SQR g GENERATE c, SQRT(SUM(i c 2 )) AS len c ; DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dXc / len c ; SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); k-means in Apache Pig: E-step 47

D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dXc; SQR = FOREACH C GENERATE c, i c * i c AS i c 2 ; SQR g = GROUP SQR BY c; LEN_C = FOREACH SQR g GENERATE c, SQRT(SUM(i c 2 )) AS len c ; DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dXc / len c ; SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); k-means in Apache Pig: E-step 48

D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dXc; SQR = FOREACH C GENERATE c, i c * i c AS i c 2 ; SQR g = GROUP SQR BY c; LEN_C = FOREACH SQR g GENERATE c, SQRT(SUM(i c 2 )) AS len c ; DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dXc / len c ; SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); k-means in Apache Pig: E-step 49

D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dXc; SQR = FOREACH C GENERATE c, i c * i c AS i c 2 ; SQR g = GROUP SQR BY c; LEN_C = FOREACH SQR g GENERATE c, SQRT(SUM(i c 2 )) AS len c ; DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dXc / len c ; SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); k-means in Apache Pig: E-step 50

D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dXc; SQR = FOREACH C GENERATE c, i c * i c AS i c 2 ; SQR g = GROUP SQR BY c; LEN_C = FOREACH SQR g GENERATE c, SQRT(SUM(i c 2 )) AS len c ; DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dXc / len c ; SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); k-means in Apache Pig: E-step 51

k-means in Apache Pig: E-step D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dXc; SQR = FOREACH C GENERATE c, i c * i c AS i c 2 ; SQR g = GROUP SQR BY c; LEN_C = FOREACH SQR g GENERATE c, SQRT(SUM(i c 2 )) AS len c ; DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dXc / len c ; SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); 52

k-means in Apache Pig: M-step D_C_W = JOIN CLUSTERS BY d, D BY d; D_C_W g = GROUP D_C_W BY (c, w); SUMS = FOREACH D_C_W g GENERATE c, w, SUM(i d ) AS sum; D_C_W gg = GROUP D_C_W BY c; SIZES = FOREACH D_C_W gg GENERATE c, COUNT(D_C_W) AS size; SUMS_SIZES = JOIN SIZES BY c, SUMS BY c; C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS i c ; Finally - embed in Java (or Python or ….) to do the looping 53

The problem with k-means in Hadoop I/O costs 54

Data is read, and model is written, with every iteration Mappers read data portions and centroids Mappers assign data instances to clusters Mappers compute new local centroids and local cluster sizes Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids Reducers write the new centroids Panda et al, Chapter 2 55

SCHEMES DESIGNED FOR ITERATIVE HADOOP PROGRAMS: SPARK AND FLINK 56

Spark word count example Research project, based on Scala and Hadoop Now APIs in Java and Python as well 57 Familiar-looking API for abstract operations (map, flatMap, reduceByKey, …) Most API calls are “lazy” – ie, counts is a data structure defining a pipeline, not a materialized table. Includes ability to store a sharded dataset in cluster memory as an RDD (resiliant distributed database) Familiar-looking API for abstract operations (map, flatMap, reduceByKey, …) Most API calls are “lazy” – ie, counts is a data structure defining a pipeline, not a materialized table. Includes ability to store a sharded dataset in cluster memory as an RDD (resiliant distributed database)

Spark logistic regression example 58

Spark logistic regression example Allows caching data in memory 59

Spark logistic regression example 60

FLINK Recent Apache Project – just moved to top- level at 0.8 – formerly Stratosphere 61 ….

FLINK Apache Project – just getting started 62 …. Java API

FLINK 63

FLINK Like Spark, in-memory or on disk Everything is a Java object Unlike Spark, contains operations for iteration – Allowing query optimization Very easy to use and install in local model – Very modular – Only needs Java 64

MORE EXAMPLES IN PIG 65

Phrase Finding in PIG 66

Phrase Finding 1 - loading the input 67

… 68

PIG Features comments -- like this /* or like this */ ‘shell-like’ commands: – fs -ls … -- any hadoop fs … command – some shorter cuts: ls, cp, … – sh ls -al -- escape to shell 69

… 70

PIG Features comments -- like this /* or like this */ ‘shell-like’ commands: – fs -ls … -- any hadoop fs … command – some shorter cuts: ls, cp, … – sh ls -al -- escape to shell LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, … – schemas can include complex types: bag, map, tuple, … FOREACH alias GENERATE … AS …, … – transforms each row of a relation – operators include +, -, and, or, … – can extend this set easily (more later) DESCRIBE alias -- shows the schema ILLUSTRATE alias -- derives a sample tuple 71

Phrase Finding 1 - word counts 72

73

PIG Features LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, … FOREACH alias GENERATE … AS …, … – transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias -- debugging GROUP r BY x – like a shuffle-sort: produces relation with fields group and r, where r is a bag 74

PIG parses and optimizes a sequence of commands before it executes them It’s smart enough to turn GROUP … FOREACH… SUM … into a map-reduce PIG parses and optimizes a sequence of commands before it executes them It’s smart enough to turn GROUP … FOREACH… SUM … into a map-reduce 75

PIG Features LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, … FOREACH alias GENERATE … AS …, … – transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias -- debugging GROUP alias BY … FOREACH alias GENERATE group, SUM(….) – GROUP/GENERATE … aggregate op together act like a map-reduce – aggregates: COUNT, SUM, AVERAGE, MAX, MIN, … – you can write your own 76

PIG parses and optimizes a sequence of commands before it executes them It’s smart enough to turn GROUP … FOREACH… SUM … into a map-reduce PIG parses and optimizes a sequence of commands before it executes them It’s smart enough to turn GROUP … FOREACH… SUM … into a map-reduce 77

Phrase Finding 3 - assembling phrase- and word-level statistics 78

79

80

PIG Features LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, … FOREACH alias GENERATE … AS …, … – transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias -- debugging GROUP alias BY … FOREACH alias GENERATE group, SUM(….) – GROUP/GENERATE … aggregate op together act like a map-reduce JOIN r BY field, s BY field, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, … 81

Phrase Finding 4 - adding total frequencies 82

83

How do we add the totals to the phraseStats relation? STORE triggers execution of the query plan…. it also limits optimization 84

Comment: schema is lost when you store…. 85

PIG Features LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, … FOREACH alias GENERATE … AS …, … – transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias -- debugging GROUP alias BY … FOREACH alias GENERATE group, SUM(….) – GROUP/GENERATE … aggregate op together act like a map- reduce JOIN r BY field, s BY field, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, … CROSS r, s, … – use with care unless all but one of the relations are singleton – newer pigs allow singleton relation to be cast to a scalar 86

Phrase Finding 5 - phrasiness and informativeness 87

How do we compute some complicated function? With a “UDF” 88

89

PIG Features LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, … FOREACH alias GENERATE … AS …, … – transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias -- debugging GROUP alias BY … FOREACH alias GENERATE group, SUM(….) – GROUP/GENERATE … aggregate op together act like a map- reduce JOIN r BY field, s BY field, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, … CROSS r, s, … – use with care unless all but one of the relations are singleton User defined functions as operators – also for loading, aggregates, … 90

The full phrase-finding pipeline 91

92