
MAP-REDUCE ABSTRACTIONS

Abstractions On Top Of Hadoop

We’ve decomposed some algorithms into a map-reduce “workflow” (a series of map-reduce steps):
– naïve Bayes training
– naïve Bayes testing
– phrase scoring

How else can we express these sorts of computations? Are there some common special cases of map-reduce steps we can parameterize and reuse?

Abstractions On Top Of Hadoop

Some obvious streaming processes: for each row in a table,
– transform it and output the result, or
– decide if you want to keep it with some boolean test, and copy out only the ones that pass the test.

Proposed syntax: table2 = MAP table1 TO λ row : f(row), where f(row) → row’
Example: stem words in a stream of word-count pairs: (“aardvarks”,1) → (“aardvark”,1)

Proposed syntax: table2 = FILTER table1 BY λ row : f(row), where f(row) → {true,false}
Example: apply stop words: (“aardvark”,1) → (“aardvark”,1); (“the”,1) → deleted
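For concreteness, here is a minimal sketch of how these two operations look in Pig Latin, which is covered later in these slides. The path, relation, and field names are made up for illustration, and LOWER stands in for a real stemmer (stemming proper would need a UDF):

  pairs  = LOAD 'wordcounts' AS (word:chararray, n:int);
  -- MAP: transform each row
  mapped = FOREACH pairs GENERATE LOWER(word) AS word, n;
  -- FILTER: keep only the rows that pass a boolean test
  kept   = FILTER mapped BY word != 'the';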

Abstractions On Top Of Hadoop

A less obvious streaming process: for each row in a table,
– transform it to a list of items, then
– splice all the lists together to get the output table (flatten).

Proposed syntax: table2 = FLATMAP table1 TO λ row : f(row), where f(row) → list of rows
Example: tokenizing a line:
  “I found an aardvark” → [“i”, “found”, “an”, “aardvark”]
  “We love zymurgy” → [“we”, “love”, “zymurgy”]
…but the final table is one word per row: “i”, “found”, “an”, “aardvark”, “we”, “love”, …
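Previewing Pig again (details later), FLATMAP corresponds to generating a bag of values per row and flattening it into multiple rows. A minimal sketch with a hypothetical input file:

  lines = LOAD 'docs.txt' AS (line:chararray);
  -- one input row becomes many output rows: a FLATMAP
  words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;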

Abstractions On Top Of Hadoop

Another example from the Naïve Bayes test program…

NB Test Step (Can we do better?)

Event counts: rows of the form X=w1^Y=sports, X=w1^Y=worldNews, X=w2^Y=…, …

How, with stream and sort:
– for each C[X=w^Y=y]=n, print “w C[Y=y]=n”
– sort, and build a list of values associated with each key w

The result is like an inverted index:

  w          counts associated with w
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

NB Test Step 1 (Can we do better?)

Event counts (X=w1^Y=sports, X=w1^Y=worldNews, X=w2^Y=…, …) become:

  w          counts associated with w
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

The general case: we’re taking rows from a table in a particular format (event, count), applying a function to get a new value (the word for the event), and grouping the rows of the table by this new value. This is a grouping operation: a special case of a map-reduce.

Proposed syntax: GROUP table BY λ row : f(row), where f(row) → field
We could define f via a function, a field of a defined record structure, …

NB Test Step 1 (Can we do better?)

Again, the general case: rows in a particular format (event, count), a function applied to get a new value (the word for the event), and the rows grouped by this new value. Proposed syntax: GROUP table BY λ row : f(row), where f(row) → field.

Aside: you guys know how to implement this, right?
1. Output pairs (f(row), row) with a map/streaming process.
2. Sort pairs by key, which is f(row).
3. Reduce and aggregate by appending together all the values associated with the same key.
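In the proposed syntax this whole step is a single GROUP. As a sketch in Pig Latin, assuming (hypothetically) that the event counts are stored with the word and label as separate fields:

  events = LOAD 'eventCounts' AS (word:chararray, label:chararray, n:int);
  -- group the rows by word; each output row is (group, bag of matching rows)
  byWord = GROUP events BY word;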

Abstractions On Top Of Hadoop

And another example from the Naïve Bayes test program…

Request-and-answer

Test data: rows of the form
  id1  w1,1 w1,2 w1,3 … w1,k1
  id2  w2,1 w2,2 w2,3 …
  id3  w3,1 w3,2 …
  id4  w4,1 w4,2 …
  id5  w5,1 w5,2 …
  …

Record of all event counts for each word:

  w          counts associated with w
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

Step 2: stream through, and for each test case idi wi,1 wi,2 wi,3 … wi,ki, request the event counters needed to classify idi from the event-count DB, then classify using the answers (the classification logic).

Request-and-answer

Break the step down into stages:
– Generate the data being requested (indexed by key, here a word), e.g. with GROUP … BY.
– Generate the requests as (key, requestor) pairs, e.g. with FLATMAP … TO.
– Join these two tables by key. A join is defined as (1) a cross-product followed by (2) filtering out pairs with different values for the keys. This replaces the step of concatenating two different tables of key-value pairs and reducing them together. A sketch follows below.
– Postprocess the joined result.
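A sketch of the join stage in Pig Latin, with hypothetical paths and field layouts:

  counters = LOAD 'eventCounterDB' AS (word:chararray, ctrs:chararray);
  requests = LOAD 'requests' AS (word:chararray, docid:chararray);
  -- conceptually: cross-product, then keep the pairs whose keys agree
  joined   = JOIN counters BY word, requests BY word;
  -- postprocess: fields are counters::word, ctrs, requests::word, docid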

Counters:
  w          counters
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

Requests:
  w          request
  found      ~ctr to id1
  aardvark   ~ctr to id1
  …
  zynga      ~ctr to id1
  …          ~ctr to id2

Joined:
  w          counters             requests
  aardvark   C[w^Y=sports]=2      ~ctr to id1
  agent      C[w^Y=sports]=…      ~ctr to id345
  agent      C[w^Y=sports]=…      ~ctr to id9854
  agent      C[w^Y=sports]=…      ~ctr to id345
  …          C[w^Y=sports]=…      ~ctr to id34742
  zynga      C[…]                 ~ctr to id1
  zynga      C[…]                 …

The same counters table and join, with the requests reduced to bare ids:

Requests:
  w          request
  found      id1
  aardvark   id1
  …
  zynga      id1
  …          id2

Joined:
  w          counters             requests
  aardvark   C[w^Y=sports]=2      id1
  agent      C[w^Y=sports]=…      id345
  agent      C[w^Y=sports]=…      id9854
  agent      C[w^Y=sports]=…      id345
  …          C[w^Y=sports]=…      id34742
  zynga      C[…]                 id1
  zynga      C[…]                 …

IMPLEMENTATIONS OF MAP-REDUCE ABSTRACTIONS

Hive and PIG: word count

– Declarative
– Fairly stable

A PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc. are allowed.
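The word-count program shown on this slide is lost from the transcript; a standard Pig Latin version looks roughly like this (input/output paths are illustrative):

  lines  = LOAD 'input.txt' AS (line:chararray);
  words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
  grps   = GROUP words BY word;
  counts = FOREACH grps GENERATE group AS word, COUNT(words) AS n;
  STORE counts INTO 'wordcounts';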

More on Pig

Pig Latin:
– atomic types, plus compound types like tuple, bag, map
– executes locally/interactively or on Hadoop
– can embed Pig in Java (and Python and …)
– can call out to Java from Pig

A similar(ish) system from Microsoft: DryadLINQ.

Tokenize is a built-in function. Flatten is a special keyword which applies to the next step in the process, i.e. the FOREACH is transformed from a MAP to a FLATMAP.

PIG Features

– LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, … – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias – debugging
– GROUP alias BY …; FOREACH alias GENERATE group, SUM(…) – GROUP/GENERATE plus an aggregate op together act like a map-reduce
– JOIN r BY field, s BY field, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
– CROSS r, s, … – use with care unless all but one of the relations are singleton
– User-defined functions as operators – also for loading, aggregates, …

PIG parses and optimizes a sequence of commands before it executes them. It’s smart enough to turn GROUP … FOREACH … SUM … into a map-reduce.

Phrase Finding in PIG

Phrase Finding 1 - loading the input


PIG Features

– comments: -- like this, /* or like this */
– ‘shell-like’ commands:
  – fs -ls … (any hadoop fs … command)
  – some shorter cuts: ls, cp, …
  – sh ls -al (escape to shell)
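For example, in a Pig script or at the Grunt prompt (paths hypothetical):

  fs -ls /user/me/data    -- any hadoop fs command
  ls /user/me/data        -- the shorter cut
  sh ls -al               -- escape to the local shell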


PIG Features

– comments: -- like this, /* or like this */
– ‘shell-like’ commands:
  – fs -ls … (any hadoop fs … command)
  – some shorter cuts: ls, cp, …
  – sh ls -al (escape to shell)
– LOAD ‘hdfs-path’ AS (schema)
  – schemas can include int, double, …
  – schemas can include complex types: bag, map, tuple, …
– FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
  – operators include +, -, and, or, …
  – can extend this set easily (more later)
– DESCRIBE alias – shows the schema
– ILLUSTRATE alias – derives a sample tuple
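A small sketch of LOAD plus the two debugging commands (path and schema are illustrative):

  r  = LOAD 'phrases.txt' AS (phrase:chararray, n:int);
  r2 = FOREACH r GENERATE phrase, n + 1 AS n1;
  DESCRIBE r2;    -- prints r2's schema
  ILLUSTRATE r2;  -- derives and shows a sample tuple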

Phrase Finding 2 - word counts


PIG Features

– LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, … – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias – debugging
– GROUP r BY x – like a shuffle-sort: produces a relation with fields group and r, where r is a bag

PIG parses and optimizes a sequence of commands before it executes them. It’s smart enough to turn GROUP … FOREACH … SUM … into a map-reduce.

PIG Features

– LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, … – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias – debugging
– GROUP alias BY …; FOREACH alias GENERATE group, SUM(…)
  – GROUP/GENERATE plus an aggregate op together act like a map-reduce
  – aggregates: COUNT, SUM, AVG, MAX, MIN, …
  – you can write your own
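As a sketch, the kind of grouped aggregate that Pig compiles into a single map-reduce job (relation and fields hypothetical):

  wc = LOAD 'bigramCounts' AS (word:chararray, n:int);
  g  = GROUP wc BY word;    -- g has fields: group, wc (a bag)
  t  = FOREACH g GENERATE group AS word, SUM(wc.n) AS total;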


Phrase Finding 3 - assembling phrase- and word-level statistics


PIG Features

– LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, … – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias – debugging
– GROUP alias BY …; FOREACH alias GENERATE group, SUM(…) – GROUP/GENERATE plus an aggregate op together act like a map-reduce
– JOIN r BY field, s BY field, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
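A sketch of the join and the resulting disambiguated field names (inputs hypothetical):

  r = LOAD 'r.txt' AS (key:chararray, x:int);
  s = LOAD 's.txt' AS (key:chararray, y:int);
  j = JOIN r BY key, s BY key;
  -- j's fields are r::key, r::x, s::key, s::y; the prefix is only
  -- needed where a bare name (here, key) would be ambiguous
  out = FOREACH j GENERATE r::key, x + y;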

Phrase Finding 4 - adding total frequencies


How do we add the totals to the phraseStats relation? STORE triggers execution of the query plan… but it also limits optimization.

Comment: the schema is lost when you STORE…
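A sketch of the store-then-reload pattern this implies (field names hypothetical): STORE writes plain records, so the schema has to be restated when the data is loaded back:

  phraseStats = LOAD 'phraseStats' AS (phrase:chararray, n:long);
  allStats = GROUP phraseStats ALL;
  totals = FOREACH allStats GENERATE SUM(phraseStats.n) AS total;
  STORE totals INTO 'totals';
  -- later: the stored schema is gone and must be restated
  totalsIn = LOAD 'totals' AS (total:long);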

PIG Features

– LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, … – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias – debugging
– GROUP alias BY …; FOREACH alias GENERATE group, SUM(…) – GROUP/GENERATE plus an aggregate op together act like a map-reduce
– JOIN r BY field, s BY field, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
– CROSS r, s, … – use with care unless all but one of the relations are singleton
  – newer versions of Pig allow a singleton relation to be cast to a scalar
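With the scalar cast, the store-and-reload dance above can be avoided; a sketch with the same hypothetical fields:

  phraseStats = LOAD 'phraseStats' AS (phrase:chararray, n:long);
  allStats = GROUP phraseStats ALL;
  totals = FOREACH allStats GENERATE SUM(phraseStats.n) AS total;
  -- reference the singleton relation's single field as a scalar
  freqs = FOREACH phraseStats GENERATE phrase, (double)n / (double)totals.total;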

Phrase Finding 5 - phrasiness and informativeness

How do we compute some complicated function? With a “UDF” (user-defined function).
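One of several UDF mechanisms is a jython UDF; a minimal sketch with hypothetical names (scoring.py would define score() decorated with @outputSchema('s:double')):

  REGISTER 'scoring.py' USING jython AS scoring;
  phrases = LOAD 'phraseStats' AS (phrase:chararray, phrasiness:double, info:double);
  -- the user-defined function is called like any built-in operator
  ranked  = FOREACH phrases GENERATE phrase, scoring.score(phrasiness, info) AS s;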


PIG Features

– LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, bag, map, tuple, …
– FOREACH alias GENERATE … AS …, … – transforms each row of a relation
– DESCRIBE alias / ILLUSTRATE alias – debugging
– GROUP alias BY …; FOREACH alias GENERATE group, SUM(…) – GROUP/GENERATE plus an aggregate op together act like a map-reduce
– JOIN r BY field, s BY field, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
– CROSS r, s, … – use with care unless all but one of the relations are singleton
– User-defined functions as operators – also for loading, aggregates, …

The full phrase-finding pipeline
