
1 MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015

2 What does Scalable Mean?
Operationally
– In the past: works even if the data does not fit in main memory
– Now: can make use of 1000s of cheap computers
Algorithmically
– In the past: if you have N data items, you must do no more than N^m operations – polynomial-time algorithms
– Now: if you have N data items, you should do no more than N^m / k operations, for some large k
– Polynomial-time algorithms must be parallelized

3 Example: Find Matching DNA Sequences
Given a set of sequences, find all sequences equal to “GATTACGATTATTA”.

4 Sequential (Linear) Search: scan the records one at a time, comparing each against the query (at Time = 0 the first record is checked: Match? NO).

5 Sequential (Linear) Search
– 40 records, 40 comparisons
– N records, N comparisons
– The algorithmic complexity is order N: O(N)

6 What if the Sequences Are Sorted?
– Compare against the middle record: GATATTTTAAGC < GATTACGATTATTA
– No match – keep searching, but only in the half that can still contain the query…
– This is binary search: O(log N)
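
The slide's step-through animation is not in the transcript; below is a minimal sketch of the binary-search idea it illustrates, assuming the sequences are kept in a sorted array (the function name and sample data are illustrative, not from the slides).

```ocaml
(* Binary search over a sorted array of DNA sequences:
   each comparison halves the remaining search range, so O(log N). *)
let contains (sorted : string array) (query : string) : bool =
  let rec go lo hi =
    if lo > hi then false
    else
      let mid = (lo + hi) / 2 in
      let c = compare sorted.(mid) query in
      if c = 0 then true                  (* match found *)
      else if c < 0 then go (mid + 1) hi  (* query can only be in the upper half *)
      else go lo (mid - 1)                (* query can only be in the lower half *)
  in
  go 0 (Array.length sorted - 1)

let () =
  let db = [| "GATATTTTAAGC"; "GATTACGATTATTA"; "TTTACGTAA" |] in
  Printf.printf "%b\n" (contains db "GATTACGATTATTA")   (* prints: true *)
```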

7 New Task: Read Trimming
– Given a set of DNA sequences (reads)
– Trim the final n bps of each sequence
– Generate a new dataset
Can we use an index?
– No, we have to touch every record no matter what: O(N)
Can we do any better?
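
The trimming step itself is just a per-record function, which is exactly what makes it parallelizable. A minimal sketch (the function and record names are illustrative):

```ocaml
(* Trim the final n base pairs from one read.
   Applied independently to every record, so records can be processed in parallel. *)
let trim_read (n : int) (read_id, seq) =
  let keep = max 0 (String.length seq - n) in
  (read_id, String.sub seq 0 keep)

let () =
  let (id, trimmed) = trim_read 3 ("read42", "GATTACGATTATTA") in
  Printf.printf "%s: %s\n" id trimmed   (* prints: read42: GATTACGATTA *)
```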

8 Parallelization O(?)

9 New Task: Convert 405K TIFF Images to PNG

10 Another Example: Computing the Word Frequency of Every Word in a Single Document

11

12 There is a pattern here…
– A function that maps a read to a trimmed read
– A function that maps a TIFF image to a PNG image
– A function that maps a document to its most common word
– A function that maps a document to a histogram of word frequencies
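
All four tasks share the same shape: apply one pure, per-record function to every record, with no dependence between records. A minimal sketch of that shared shape (the names are illustrative):

```ocaml
(* The common pattern: apply one pure function independently to every record. *)
let map_over_records (f : 'a -> 'b) (records : 'a list) : 'b list =
  List.map f records

(* Example: upper-casing documents fits the same shape as read trimming
   or TIFF-to-PNG conversion -- only the function f changes. *)
let () =
  map_over_records String.uppercase_ascii [ "doc one"; "doc two" ]
  |> List.iter print_endline
```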

13 Compute Word Frequency Across all Documents

14

15 (word, count) pairs

16 MAPREDUCE
– How to split things into pieces
– How to write map and reduce

17 MapReduce
– A high-level programming model and implementation for large-scale data processing
– The programming model is based on functional programming
– Every record is assumed to be in the form of a (key, value) pair
– Google: paper published in 2004
– Free variant: Hadoop – Java – Apache

18 Example: Upper-case Mapper in ML
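
The slide's ML code is an image and is missing from the transcript; here is a sketch of the same idea in OCaml (an ML dialect), with illustrative names — one input pair produces exactly one output pair:

```ocaml
(* Upper-case mapper: for each (key, value) pair, emit one pair with the
   value upper-cased. One input pair -> exactly one output pair. *)
let uppercase_mapper (key, value) =
  [ (key, String.uppercase_ascii value) ]

(* uppercase_mapper ("tweet1", "i love pancakes")
   = [ ("tweet1", "I LOVE PANCAKES") ] *)
```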

19 Example: Explode Mapper
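
Again the slide's code is an image; the explode mapper's defining property is that one input pair can produce many output pairs. A hedged sketch (splitting the value into characters is an illustrative choice):

```ocaml
(* Explode mapper: one input pair produces many output pairs --
   here, one pair per character of the value. *)
let explode_mapper (key, value) =
  List.init (String.length value) (fun i -> (key, String.make 1 value.[i]))

(* explode_mapper ("doc1", "cat")
   = [ ("doc1", "c"); ("doc1", "a"); ("doc1", "t") ] *)
```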

20 Example: Filter Mapper
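
The filter mapper's defining property is that it may emit nothing: a pair is passed through only if it satisfies a predicate. A sketch, assuming an is-even predicate purely for illustration:

```ocaml
(* Filter mapper: emit the input pair unchanged if it satisfies a
   predicate, otherwise emit nothing (the empty list). *)
let filter_mapper (key, value) =
  if value mod 2 = 0 then [ (key, value) ] else []

(* filter_mapper ("a", 4) = [ ("a", 4) ]
   filter_mapper ("b", 3) = [] *)
```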

21 Example: Chaining Keyspaces – the output key is an int
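
The point of this slide is that a mapper's output keys need not live in the same keyspace as its input keys, which is what allows map stages to be chained. A hedged sketch in which string-keyed input produces int-keyed output (the word-length choice is illustrative, not necessarily the slide's example):

```ocaml
(* A mapper whose output keyspace differs from its input keyspace:
   input keys are strings (document ids), output keys are ints (word lengths),
   so the output can feed a later stage that is keyed by length. *)
let length_keyed_mapper ((_doc_id : string), word) =
  [ (String.length word, word) ]

(* length_keyed_mapper ("doc1", "pancakes") = [ (8, "pancakes") ] *)
```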

22 Data Model
Files: a file = a bag of (key, value) pairs
A map-reduce program:
– Input: a bag of (inputkey, value) pairs
– Output: a bag of (outputkey, value) pairs

23 Step 1: Map Phase
The user provides the Map function:
– Input: one (input key, value) pair
– Output: a bag of (intermediate key, value) pairs
The system applies the map function in parallel to all (input key, value) pairs in the input file

24 Step 2: Reduce Phase
The user provides the Reduce function:
– Input: (intermediate key, bag of values)
– Output: a bag of output values
The system groups all pairs with the same intermediate key and passes the bag of values to the reduce function

25 Reduce
– After the map phase is over, all the intermediate values for a given output key are combined together into a list
– Reduce() combines those intermediate values into one or more final values for that same output key
– In practice, usually only one final value per key

26 Example: Sum Reducer
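
The slide's ML code is an image; a sketch of a sum reducer in OCaml — for one intermediate key and its bag of values, emit the key with the sum of the values:

```ocaml
(* Sum reducer: combine all the values collected for one key into their sum. *)
let sum_reducer (key, values) =
  [ (key, List.fold_left ( + ) 0 values) ]

(* sum_reducer ("pancakes", [1; 1]) = [ ("pancakes", 2) ] *)
```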

27 In summary
– Input and output: each a set of key/value pairs
– The programmer specifies two functions:
– Map(in_key, in_value) -> list(out_key, intermediate_value): processes an input key/value pair and produces a set of intermediate pairs
– Reduce(out_key, list(intermediate_value)) -> list(out_value): combines all intermediate values for a particular key and produces a set of merged output values (usually just one)
– Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
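
To make the two signatures concrete, here is a minimal single-machine sketch of the execution model: map every input pair, group the intermediate pairs by key, then reduce each group. This illustrates the model only; it is not Hadoop's API, and the names are illustrative.

```ocaml
(* A toy, sequential MapReduce driver: map every record, group the
   intermediate (key, value) pairs by key, then reduce each group. *)
let run_mapreduce
    ~(mapper : 'k1 * 'v1 -> ('k2 * 'v2) list)
    ~(reducer : 'k2 * 'v2 list -> ('k2 * 'v3) list)
    (input : ('k1 * 'v1) list) : ('k2 * 'v3) list =
  (* Map phase: apply the mapper to every input pair. *)
  let intermediate = List.concat_map mapper input in
  (* Shuffle: group intermediate values by key. *)
  let groups = Hashtbl.create 64 in
  List.iter
    (fun (k, v) ->
       let existing = try Hashtbl.find groups k with Not_found -> [] in
       Hashtbl.replace groups k (v :: existing))
    intermediate;
  (* Reduce phase: apply the reducer to each (key, bag of values) group. *)
  Hashtbl.fold (fun k vs acc -> reducer (k, List.rev vs) @ acc) groups []
```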

28 Example: What does this do? – the word count application of MapReduce
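
The slide's code is an image; a sketch of the standard word-count mapper and reducer (names are illustrative) that would run on the toy driver above:

```ocaml
(* Word count: the mapper emits (word, 1) for every word in a document;
   the reducer sums the counts collected for each word. *)
let wordcount_mapper (_doc_id, text) =
  String.split_on_char ' ' text
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w -> (w, 1))

let wordcount_reducer (word, counts) =
  [ (word, List.fold_left ( + ) 0 counts) ]

(* run_mapreduce ~mapper:wordcount_mapper ~reducer:wordcount_reducer
     [ ("tweet1", "I love pancakes for breakfast");
       ("tweet2", "I dislike pancakes") ]
   yields pairs such as ("pancakes", 2) and ("I", 2). *)
```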

29 Example: Word Length Histogram

30 Word-length bins:
– Big = Yellow = 10+ letters
– Medium = Red = 5..9 letters
– Small = Blue = 2..4 letters
– Tiny = Pink = 1 letter
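
A sketch of the histogram job using the bins above (the bin boundaries follow the slide; function names are illustrative): the mapper assigns every word to a bin and emits (bin, 1); the reducer sums the counts per bin.

```ocaml
(* Word-length histogram: map each word to its bin, then count per bin. *)
let bin_of_word w =
  match String.length w with
  | n when n >= 10 -> "Big"      (* 10+ letters *)
  | n when n >= 5  -> "Medium"   (* 5..9 letters *)
  | n when n >= 2  -> "Small"    (* 2..4 letters *)
  | _              -> "Tiny"     (* 1 letter *)

let histogram_mapper (_doc_id, text) =
  String.split_on_char ' ' text
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w -> (bin_of_word w, 1))

let histogram_reducer (bin, counts) =
  [ (bin, List.fold_left ( + ) 0 counts) ]
```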

31

32

33

34 More Examples: Building an Inverted Index
Input:
– Tweet1, (“I love pancakes for breakfast”)
– Tweet2, (“I dislike pancakes”)
– Tweet3, (“what should I eat for breakfast”)
– Tweet4, (“I love to eat”)
Desired output:
– “pancakes”, (tweet1, tweet2)
– “breakfast”, (tweet1, tweet3)
– “eat”, (tweet3, tweet4)
– “love”, (tweet1, tweet4)
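
A sketch of the inverted-index job for the input above (names are illustrative): the mapper emits (word, tweet id) for every word in a tweet; the reducer collects, for each word, the list of tweets that contain it.

```ocaml
(* Inverted index: word -> list of tweets containing that word. *)
let index_mapper (tweet_id, text) =
  String.split_on_char ' ' text
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w -> (w, tweet_id))

let index_reducer (word, tweet_ids) =
  [ (word, List.sort_uniq compare tweet_ids) ]

(* index_reducer ("pancakes", ["tweet2"; "tweet1"])
   = [ ("pancakes", ["tweet1"; "tweet2"]) ] *)
```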

35 More Examples: Relational Joins

36 Relational Join MapReduce: Before Map Phase

37 Relational Join MapReduce: Map Phase

38 Relational Join MapReduce: Reduce Phase

39 Relational Join in MapReduce: Another Example (MAP and REDUCE phases shown on the slide)
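
The join slides are images; below is a sketch of the reduce-side join idea they walk through. Each mapper tags a tuple with the name of its relation and emits it under the join key; the reducer then pairs every tuple from relation "R" with every tuple from relation "S" that shares that key. The relation names and record shapes are illustrative.

```ocaml
(* Reduce-side relational join on a shared key.
   Mapper: tag each tuple with its relation name, keyed by the join attribute.
   Reducer: for one join key, combine every R-tuple with every S-tuple. *)
let join_mapper (relation, (join_key, tuple)) =
  [ (join_key, (relation, tuple)) ]

let join_reducer (join_key, tagged_tuples) =
  let r = List.filter_map (fun (rel, t) -> if rel = "R" then Some t else None) tagged_tuples in
  let s = List.filter_map (fun (rel, t) -> if rel = "S" then Some t else None) tagged_tuples in
  List.concat_map (fun rt -> List.map (fun st -> (join_key, (rt, st))) s) r
```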

40 Simple Social Network Analysis: Count Friends (MAP and SHUFFLE phases shown on the slide)
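
A sketch of the count-friends job, assuming the input is one record per friendship edge (person, friend); the mapper emits (person, 1) per edge and the reducer sums per person (the shuffle groups the 1s by person, as on the slide):

```ocaml
(* Count friends: each input record is one friendship edge (person, friend). *)
let friends_mapper (person, _friend) = [ (person, 1) ]

let friends_reducer (person, ones) =
  [ (person, List.fold_left ( + ) 0 ones) ]

(* Input  [ ("Jim", "Sue"); ("Jim", "Lin"); ("Sue", "Jim") ]
   yields ("Jim", 2) and ("Sue", 1). *)
```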

41 Taxonomy of Parallel Architectures

42 Cluster Computing
– A large number of commodity servers, connected by a high-speed commodity network
– A rack holds a small number of servers; a data center holds many racks
Massive parallelism
– 100s or 1000s of servers
– Many hours
Failure
– If the mean time between failures of a single server is 1 year,
– then a cluster of 1,000 servers sees roughly one failure every 9 hours (8,760 hours/year ÷ 1,000 servers ≈ 8.8 hours), i.e. several failures per day

43 Distributed File System (DFS)
– For very large files: TBs, PBs
– Each file is partitioned into chunks, typically 64 MB
– Each chunk is replicated several times (>2), on different racks, for fault tolerance
– Implementations: Google’s DFS (GFS, proprietary); Hadoop’s DFS (HDFS, open source)

44 HDFS: Motivation
– Based on Google’s GFS
– Redundant storage of massive amounts of data on cheap and unreliable computers
Why not use an existing file system?
– Different workload and design priorities
– Handles much bigger dataset sizes than other file systems

45 Assumptions
– High component failure rates: inexpensive commodity components fail all the time
– Modest number of HUGE files: just a few million, each 100 MB or larger; multi-GB files are typical
– Files are write-once, mostly appended to (perhaps concurrently)
– Large streaming reads
– High sustained throughput favored over low latency

46 HDFS Design Decisions
– Files are stored as blocks: much larger than in most file systems (default is 64 MB)
– Reliability through replication: each block is replicated across 3+ DataNodes
– A single master (the NameNode) coordinates access and metadata: simple, centralized management
– No data caching: little benefit, due to large data sets and streaming reads
– Familiar interface, but customized API: simplify the problem, focus on distributed apps

47 Based on GFS Architecture

48 References
– https://class.coursera.org/datasci-001/lecture
– https://www.youtube.com/watch?v=xWgdny19yQ4

