Part II NoSQL Database (MapReduce) Yuan Xue (yuan.xue@vanderbilt.edu)
Outline
- Why MapReduce is needed
- What is MapReduce
- How to use MapReduce (Programming Model)
- How MapReduce works (Under the hood / Implementation)
- Best practice with MapReduce (Design Patterns)
- More on MapReduce
  - MapReduce and NoSQL
  - MapReduce and SQL (Hive/Pig)
  - MapReduce for machine learning (Mahout)
Motivation Example: Word Count
Classical example: given a large number of files of words, count the number of times each distinct word appears.
- How to do it if all documents are on the same machine?
- How to do it if documents are distributed across multiple machines?
  - Computation
  - Synchronization
  - Failure handling
Knowing the Field: Cluster Environment
- Cheap nodes fail, especially if you have many
  - Mean time between failures (MTBF) for 1 node = 3 years
  - MTBF for 1000 nodes = 1 day
  - Solution: build fault tolerance into the system
- Commodity network = low bandwidth
  - Solution: push computation to the data
Need for a Programming Model and Execution Framework
MapReduce provides both:
- Programming interface: Map() and Reduce() — simple APIs that allow program development with limited knowledge of the execution platform
- Execution framework: runs on a distributed computing environment, coordinates computation, and handles failures
MapReduce and Hadoop
- Pioneered by Google
- Popularized by the open-source Hadoop project
  - Used by Yahoo!, Facebook, Amazon (AWS EMR cloud service)
- Applications
  - At Google: index building for Google Search, article clustering for Google News, statistical machine translation
  - At Yahoo!: index building for Yahoo! Search, spam detection for Yahoo! Mail
  - At Facebook: data mining, ad optimization, spam detection
MapReduce Overview
- Large-scale data processing: want to use 1000s of CPUs, but don't want the hassle of managing things
- Map/Reduce programming model
  - Model from Lisp (and other functional languages)
  - Many problems can be phrased this way
  - Easy to distribute across nodes
  - Nice retry/failure semantics
- MapReduce execution framework
  - Automatic parallelization & distribution
  - Fault tolerance
  - I/O scheduling
  - Monitoring & status updates
MapReduce Programming Model (basic version)
- Input: a set of key/value pairs
- User supplies two functions:
  - map(k, v) → list(k1, v1)
  - reduce(k1, list(v1)) → v2
- (k1, v1) is an intermediate key/value pair
- Output is the set of (k1, v2) pairs
MapReduce Execution Semantics
Example Application – Word Count

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
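The pseudocode above can be exercised as a single-process Python sketch. The `run_mapreduce` helper is purely illustrative (it is not part of Hadoop): it simulates the shuffle phase with an in-memory dictionary.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word; values: the list of counts collected for that word
    yield (word, sum(counts))

def run_mapreduce(inputs, mapper, reducer):
    # Shuffle: group every intermediate value by its intermediate key
    groups = defaultdict(list)
    for k, v in inputs:
        for ik, iv in mapper(k, v):
            groups[ik].append(iv)
    # Reduce: one call per intermediate key
    out = {}
    for ik, ivs in groups.items():
        for ok, ov in reducer(ik, ivs):
            out[ok] = ov
    return out

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
# counts["the"] == 3, counts["fox"] == 2
```

In a real cluster the grouping step is done by the framework's sort-and-shuffle, not by a dictionary, but the map and reduce functions are exactly the ones a user would supply.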
Check out the code in Hadoop https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Example Execution Illustration
Programming Model Extended – Combiner
- Often a map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k
  - E.g., popular words in Word Count
- Can save network time by pre-aggregating at the mapper:
  - combine(k1, list(v1)) → v2
  - Usually the same as the reduce function
- Works only if the reduce function is commutative and associative
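A minimal sketch of the idea in plain Python (not Hadoop's actual Combiner API): because word-count's reduce is a commutative, associative sum, the mapper can apply the same logic locally and emit one pair per distinct word instead of one pair per occurrence.

```python
from collections import Counter

def map_with_combiner(text):
    # Without a combiner the mapper emits (word, 1) for every occurrence.
    # The combiner pre-aggregates those pairs locally, so only one
    # (word, count) pair per distinct word leaves the map task.
    combined = Counter()
    for word in text.split():
        combined[word] += 1  # same logic as the reduce function
    return dict(combined)

out = map_with_combiner("to be or not to be")
# out == {"to": 2, "be": 2, "or": 1, "not": 1}
```

For a popular word appearing thousands of times in one split, this replaces thousands of intermediate pairs with a single one before anything crosses the network.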
Execution Workflow
(Figure: the user program forks a master and a set of workers. The master assigns map and reduce tasks. Map workers read input splits (Split 0, 1, 2) and write intermediate results to local disk; reduce workers perform remote reads and sorts, then write Output File 0 and Output File 1.)
Data Flow
- Input and final output are stored on a distributed file system
  - The scheduler tries to schedule map tasks "close" to the physical storage location of the input data
  - Push computation to data; minimize network use
- Intermediate results are stored on the local FS of map and reduce workers
  - Sync point: allows recovery if a reducer crashes
- Output is often the input to another MapReduce task
Partition Function
- Inputs to map tasks are created by contiguous splits of the input file
- For reduce, we need to ensure that records with the same intermediate key end up at the same worker
- The system uses a default partition function, e.g., hash(key) mod R
- Sometimes useful to override
  - E.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file
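The two partitioners above can be sketched in a few lines of Python; the function names and the URL parsing are illustrative, not any framework's API.

```python
R = 4  # number of reduce tasks (hypothetical value)

def default_partition(key):
    # Default: hash(key) mod R. Equal keys always land on the same
    # reducer within one run, which is the only guarantee a job needs.
    # (Python salts str hashes across runs; a real framework uses a
    # stable hash.)
    return hash(key) % R

def url_host_partition(url):
    # Override: partition by hostname so all URLs from one host end up
    # in the same output file. Naive parse for "http://host/path" URLs.
    host = url.split("/")[2]
    return hash(host) % R

p1 = url_host_partition("http://example.com/a")
p2 = url_host_partition("http://example.com/b")
# p1 == p2: both URLs share a host, so they reach the same reducer
```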
Coordination and Failure Handling
- Master data structures
  - Task status: idle, in-progress, completed
  - Idle tasks get scheduled as workers become available
  - When a map task completes, it sends the master the locations and sizes of its R intermediate files; the master pushes this info to reducers
- The master pings workers periodically to detect failures
- If a task crashes:
  - Retry on another node
    - OK for a map because it has no dependencies
    - OK for a reduce because map outputs are on disk
  - If the same task repeatedly fails, fail the job or ignore that input block
  - For fault tolerance to work, map and reduce tasks must be side-effect-free
- If a node crashes:
  - Relaunch its current tasks on other nodes
  - Relaunch any map tasks the node previously ran (their output files were lost along with the crashed node)
Execution Illustration
How Many Map and Reduce Tasks?
- M map tasks, R reduce tasks
- Rule of thumb: make M and R much larger than the number of nodes in the cluster
  - One DFS chunk per map task is common
  - Improves dynamic load balancing and speeds recovery from worker failure
- Usually R is smaller than M, because the output is spread across R files
Execution Summary
How is this distributed?
- Partition input key/value pairs into chunks; run map() tasks in parallel
- After all map()s are complete, consolidate all emitted values for each unique emitted key
- Now partition the space of output map keys, and run reduce() in parallel
- If map() or reduce() fails, re-execute!
Announcement
- Lab 2 extension
- Lab 2 help session (NoSQL data modeling) – March 20
MapReduce: Recap
Programmers must specify:
- map(k, v) → list(k', v')
- reduce(k', list(v')) → list(k', v'')
  - All values with the same key are reduced together
Optionally, also:
- partition(k', number of partitions) → partition for k'
  - Often a simple hash of the key, e.g., hash(k') mod n
  - Divides up the key space for parallel reduce operations
- combine(k', list(v')) → list(k', v')
  - Mini-reducers that run in memory after the map phase
  - Used as an optimization to reduce network traffic
The execution framework handles everything else…
"Everything Else"
The execution framework handles everything else…
- Scheduling: assigns workers to map and reduce tasks
- "Data distribution": moves processes to data
- Synchronization: gathers, sorts, and shuffles intermediate data
- Errors and faults: detects worker failures and restarts
Limited control over data and execution flow:
- All algorithms must be expressed in m, r, c, p
You don't know:
- Where mappers and reducers run
- When a mapper or reducer begins or finishes
- Which input a particular mapper is processing
- Which intermediate key a particular reducer is processing
Relational Algebra and SQL
- An algebra is a formal structure consisting of sets and operations on those sets.
- Relational algebra is a formal system for manipulating relations.
  - Operands of this algebra are relations.
  - Operations include the usual set operations (since relations are sets of tuples): union, intersection, difference
  - and special operations defined for relations: selection, projection, join, aggregate
Union
Mappers are fed all records of the two sets to be united. The Reducer is used to eliminate duplicates.
Intersection
Mappers are fed all records of the two sets to be intersected. The Reducer emits only records that occurred twice. This is possible only if both sets contain the record, because a record includes the primary key and can occur in each set at most once.
Difference
Suppose we have two sets of records, R and S, and we want to compute the difference R – S. The Mapper emits every tuple together with a tag naming the set the record came from. The Reducer emits only records that came from R but not from S.
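The tagging scheme above can be sketched as a single-process Python function (the name `difference` and the in-memory grouping are illustrative, not a framework API):

```python
from collections import defaultdict

def difference(r_set, s_set):
    # Map: tag each tuple with the name of the set it came from
    tagged = [(t, "R") for t in r_set] + [(t, "S") for t in s_set]
    # Shuffle: group the tags by tuple
    groups = defaultdict(set)
    for t, tag in tagged:
        groups[t].add(tag)
    # Reduce: keep tuples tagged only 'R' (present in R but not in S)
    return {t for t, tags in groups.items() if tags == {"R"}}

d = difference({1, 2, 3}, {2, 3, 4})
# d == {1}
```

The same grouping-by-tag pattern underlies the union and intersection reducers: union keeps every key, intersection keeps keys whose tag set is {"R", "S"}.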
Select-Project-Join Query in SQL

User
ID       | Name  | Email             | Password
Alice00  | Alice | alice00@gmail.com | Aadf1234
Bob2013  | Bob   | bob13@gmail.com   | qwer6789
Cathy123 | Cathy | cath@vandy        | Tyuoa~!@

Tweet
ID   | Timestamp          | Author  | Content
0001 | 2013.12.20.11.20.2 | Alice00 | Hello
0002 | 2013.12.20.11.23.6 | Bob2013 | Nice weather
0003 | 2014.1.6.1.25.2    | Alice00 | @Bob Not sure..

Follow
Followee | Follower | Timestamp
Alice00  | Bob2013  | 2011.1.1.3.6.6
Bob2013  | Cathy123 | 2012.10.2.6.7.7
Alice00  | Cathy123 | 2012.11.1.2.3.3
Cathy123 | Alice00  | 2012.11.1.2.6.6
Bob2013  | Alice00  | 2012.11.1.2.6.7

SELECT content
FROM Tweet, Follow
WHERE Follower = 'Alice00'   -- select condition
  AND Author = Followee;     -- join condition
Relational Algebra Quick Overview
http://www.databasteknik.se/webbkursen/relalg-lecture/
https://www.cs.rochester.edu/~nelson/courses/csc_173/relations/algebra.html

Implementation on a single machine – Projection
To implement projection we must:
- process every tuple in the relation
- remove any duplicates that result
To avoid duplicates we can either:
- sort the result and remove consecutive tuples that are equal
  - requires time O(N log N), where N is the size of the original relation
- or implement the result as a set
  - set insertion guarantees no duplicates
  - using a hash table, insertion is O(1), so projection is O(N)
Projection in MapReduce Use a Reducer to eliminate possible duplicates.
Implementing Selection
In the absence of an index we:
- apply the predicate to every tuple in the relation
- insert matches into the resulting relation (duplicates can't occur)
- take O(N) time
Given an index, and a predicate that uses the index key, we:
- look up tuples using the key
- evaluate only those tuples against the predicate
- take O(K) time, where K tuples match the key
Selection in MapReduce
No Reducer is needed, or the Reducer function simply passes the key-value pairs through to the output.
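As a minimal sketch (the helper name and the driver loop are illustrative), selection is a map-only job: the mapper applies the predicate and emits only matching records.

```python
def select_map(record, predicate):
    # Map-only job: emit the record iff it satisfies the predicate;
    # an (optional) identity reducer would pass pairs through unchanged.
    if predicate(record):
        yield record

# Tuples are (Followee, Follower) pairs from the Follow relation
follows = [("Alice00", "Bob2013"), ("Bob2013", "Cathy123"), ("Alice00", "Cathy123")]
# SELECT * FROM Follow WHERE Follower = 'Cathy123'
selected = [r for rec in follows
            for r in select_map(rec, lambda t: t[1] == "Cathy123")]
# selected == [("Bob2013", "Cathy123"), ("Alice00", "Cathy123")]
```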
Join Implementation – Nested Loops
A nested-loop join on relations R1 (with N domains) and R2 (with M domains) considers all |R1| × |R2| pairs of tuples.
Index Join An index join exploits the existence of an index for one of the domains used in the join to find matching tuples more quickly.
Sort Join
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using a sort join. For R3 = join(R1, Ai, R2, Bj):
- Merge the tuples of both relations into a single list; list elements must identify the original relation
- Sort the list based on the value in the join domains Ai and Bj; all tuples on the sorted list with a common value for the join domains are now consecutive
- Pair all (consecutive) tuples from the two relations with the same value in the join domains
Joins in MapReduce
1. Repartition join – a reduce-side join for situations where you're joining two or more large datasets together
2. Replicated join – a map-side join that works in situations where one of the datasets is small enough to cache
3. Semi-join – another map-side join where one dataset is initially too large to fit into memory, but after some filtering can be reduced down to a size that fits in memory
(details in the Hadoop in Practice book)
Repartition Join (Reduce Join, Sort-Merge Join)
This algorithm joins two sets R and L on some key k:
- The Mapper goes through all tuples from R and L, extracts key k from each tuple, marks the tuple with a tag indicating the set it came from ('R' or 'L'), and emits the tagged tuple using k as the key.
- The Reducer receives all tuples for a particular key k and puts them into two buckets – one for R and one for L. When the two buckets are filled, the Reducer runs a nested loop over them and emits a cross join of the buckets. Each emitted tuple is a concatenation of an R-tuple, an L-tuple, and the key k.
This approach has the following disadvantages:
- The Mapper emits absolutely all data, even for keys that occur only in one set and have no pair in the other.
- The Reducer must hold all data for one key in memory. If the data doesn't fit in memory, it is the Reducer's responsibility to handle this by some kind of swap.
The repartition join is the most generic technique:
- It handles multiple large data sets joined by a foreign key. (If all but one of the data sets fit into memory, try the replicated join instead.)
- It offers the flexibility of executing any join operation.
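The tag-then-bucket scheme can be sketched as a single-process Python generator. The function name and the dict-of-buckets shuffle are illustrative; the sample relations reuse the User/Tweet tables from the SQL slide.

```python
from collections import defaultdict

def repartition_join(r_tuples, l_tuples, r_key, l_key):
    # Map: emit every tuple under its join key, tagged with its source set
    buckets = defaultdict(lambda: {"R": [], "L": []})
    for t in r_tuples:
        buckets[r_key(t)]["R"].append(t)
    for t in l_tuples:
        buckets[l_key(t)]["L"].append(t)
    # Reduce: per key, nested loop (cross join) over the two buckets
    for k, b in buckets.items():
        for rt in b["R"]:
            for lt in b["L"]:
                yield (k, rt, lt)

users = [("Alice00", "Alice"), ("Bob2013", "Bob")]
tweets = [("0001", "Alice00", "Hello"), ("0003", "Alice00", "@Bob Not sure..")]
joined = sorted(repartition_join(users, tweets, lambda u: u[0], lambda t: t[1]))
# Two joined tuples, both under key "Alice00"; Bob2013 has no tweets here
```

Note the first disadvantage in action: ("Bob2013", "Bob") is emitted and shuffled even though no tweet matches it.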
Repartition Join Optimization
The repartition join is not space efficient; it requires all of the output values for a given join key to be loaded into memory before it can perform the multiway join. It is more efficient to load the smaller of the datasets into memory and then iterate over the larger one, performing the join along the way.
Replicated Join (Map Join, Hash Join)
Scenario: join a small set with a large one (say, a list of users with a list of log records). Two sets, R and L, where R is relatively small.
Method: R is distributed to all Mappers; each Mapper loads it and indexes it by the join key, using a hash table. The Mapper then goes through the tuples of the set L and joins them with the corresponding tuples from R stored in the hash table.
Advantage: no need for sorting or transmission of the set L over the network.
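A minimal sketch of the method (the function name and sample data are illustrative; in Hadoop the broadcast of R would use the distributed cache):

```python
def replicated_join(small_r, large_l, r_key, l_key):
    # The small set is "broadcast" to every mapper and indexed by join key
    index = {}
    for t in small_r:
        index.setdefault(r_key(t), []).append(t)
    # Map-only phase: stream the large set and probe the hash table; the
    # large set is never sorted or shuffled across the network
    for lt in large_l:
        for rt in index.get(l_key(lt), []):
            yield (rt, lt)

users = [("Alice00", "Alice")]  # small set R
logs = [("Alice00", "login"), ("Bob2013", "login"), ("Alice00", "logout")]
pairs = list(replicated_join(users, logs, lambda u: u[0], lambda l: l[0]))
# Only the two Alice00 log records find a match in the hash table
```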
Aggregate Functions in SQL
Aggregate functions summarize information from multiple tuples. Basic aggregate operations in SQL: COUNT, SUM, MAX, MIN, AVG.

SELECT COUNT(*) FROM Follow WHERE Follower = 'Alice00';
– Retrieve the number of people that "Alice00" is following

SELECT COUNT(*) FROM Follow WHERE Followee = 'Alice00';
– Retrieve the number of people who are following "Alice00"
Aggregation
Aggregation can be performed in one MapReduce job as follows: the Mapper extracts from each tuple the values to group by and to aggregate, and emits them. The Reducer receives the values to be aggregated already grouped and calculates the aggregation function.
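A single-process sketch of this one-job pattern (the `aggregate` helper and sample Follow tuples are illustrative):

```python
from collections import defaultdict

def aggregate(tuples, group_key, value, agg_fn):
    # Map: emit (group-by value, value to aggregate) for each tuple
    groups = defaultdict(list)
    for t in tuples:
        groups[group_key(t)].append(value(t))
    # Reduce: values arrive already grouped; apply the aggregate function
    return {k: agg_fn(vs) for k, vs in groups.items()}

# Tuples are (Followee, Follower) pairs from the Follow relation
follows = [("Alice00", "Bob2013"), ("Bob2013", "Cathy123"), ("Alice00", "Cathy123")]
# SELECT Followee, COUNT(*) FROM Follow GROUP BY Followee
counts = aggregate(follows, lambda t: t[0], lambda t: 1, sum)
# counts == {"Alice00": 2, "Bob2013": 1}
```

Swapping `sum` for `max`, `min`, or a mean function gives the other basic SQL aggregates over the same grouped values.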
More on Design Patterns – Summarization Patterns
Numerical summarizations, inverted index, and counting with counters.
Design Patterns
- Summarization
- Filtering
- Data organization
- Joins
- Metapatterns
- Input and output
Filtering patterns: extract interesting subsets – "I only want some of my data!"
- Filtering
- Bloom filtering
- Top ten
- Distinct
Summarization patterns: top-down summaries – "I only want a top-level view of my data!"
- Numerical summarizations
- Inverted index
- Counting with counters
Data organization patterns: reorganize, restructure – "I want to change the way my data is organized!"
- Structured to hierarchical
- Partitioning
- Binning
- Total order sorting
- Shuffling
Join patterns: bringing data sets together – "I want to mash my different data sources together!"
- Reduce-side join
- Replicated join
- Composite join
- Cartesian product
Metapatterns: patterns of patterns – "I want to solve a complex problem with multiple patterns!"
- Job chaining
- Chain folding
- Job merging
Input and output patterns: custom input and output – "I want to get data or put data in an unusual place!"
- Generating data
- External source output
- External source input
- Partition pruning
Summary
- Mapper object: one object per task; map() is called once per input key-value pair and can keep state across calls
- Reducer object: reduce() is called once per intermediate key, with its (key, value) group
- Checkpoints, global state
MapReduce and HBase
HBase as a source
MapReduce and HBase
HBase as a sink
MapReduce and HBase
HBase as a shared resource: join with HBase