COP5725 Advanced Database Systems: MapReduce
Spring 2016, Tallahassee, Florida
What is MapReduce?
Programming model
– expressing distributed computations at a massive scale
– “the computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce”
Execution framework
– organizing and performing data-intensive computations
– processing parallelizable problems across huge datasets using a large number of computers (nodes)
Open-source implementation: Hadoop and others
How Much Data?
Google processes 20 PB a day (2008)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s LHC (Large Hadron Collider) will generate 15 PB a year
(“640K ought to be enough for anybody”)
Who Cares?
Ready-made large-data problems
– Lots of user-generated content, even more user behavior data; examples: Facebook friend suggestions, Google ad placement
– Business intelligence: gather everything in a data warehouse and run analytics to generate insight
Utility computing
– Provision Hadoop clusters on demand in the cloud
– Lower barriers to entry for tackling large-data problems
– Commoditization and democratization of large-data capabilities
Spread Work Over Many Machines
Challenges
– Workload partitioning: how do we assign work units to workers?
– Load balancing: what if we have more work units than workers?
– Synchronization: what if workers need to share partial results?
– Aggregation: how do we aggregate partial results?
– Termination: how do we know all the workers have finished?
– Fault tolerance: what if workers die?
Common theme
– Communication between workers (e.g., to exchange state)
– Access to shared resources (e.g., data)
We need a synchronization mechanism
Current Methods
Programming models
– Shared memory (pthreads)
– Message passing (MPI)
Design patterns
– Master-slaves
– Producer-consumer flows
– Shared work queues
[Diagrams: shared memory (processes P1–P5 sharing a common memory); message passing (processes P1–P5 exchanging messages); master-slaves; producer-consumer; shared work queue]
Problem with Current Solutions
Lots of programming work
– communication and coordination
– workload partitioning
– status reporting
– optimization
– locality
Repeat for every problem you want to solve
Stuff breaks
– One server may stay up three years (~1,000 days)
– If you have 10,000 servers, expect to lose 10 a day
What We Need
A distributed system
– Scalable
– Fault-tolerant
– Easy to program
– Applicable to many problems
– …
How Do We Scale Up?
Divide & Conquer
[Diagram: the “Work” is partitioned into units w1, w2, w3, each handled by a “worker”; the partial results r1, r2, r3 are combined into the final “Result”]
General Ideas
Iterate over a large number of records
Extract something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output
Key idea: provide a functional abstraction for these two operations (Map and Reduce)
– map (k, v) → [(k', v')]
– reduce (k', [v']) → [(k'', v'')]
All values with the same key are sent to the same reducer
– The execution framework handles everything else…
General Ideas
[Figure: mappers emit intermediate (key, value) pairs; the shuffle-and-sort stage aggregates values by key; reducers consume each key’s group of values and produce the final results]
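The data flow in this figure can be simulated in a few lines. The sketch below is framework-agnostic Python (not the Hadoop API); run_mapreduce, map_fn, and reduce_fn are illustrative names, and the shuffle is simulated with an in-memory dictionary.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, single-machine simulation of the map -> shuffle/sort -> reduce flow."""
    grouped = defaultdict(list)
    # Map phase: apply map_fn to every input (key, value) record.
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            grouped[k2].append(v2)        # shuffle: collect values by intermediate key
    # Reduce phase: one reduce_fn call per distinct intermediate key, in sorted key order.
    results = []
    for k2 in sorted(grouped):
        results.extend(reduce_fn(k2, grouped[k2]))
    return results
```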
Two More Functions
Apart from Map and Reduce, the execution framework handles everything else… not quite. Usually, programmers can also specify:
– partition (k', number of partitions) → partition for k'
Divides up the key space for parallel reduce operations
Often a simple hash of the key, e.g., hash(k') mod n
– combine (k', [v']) → [(k', v')]
Mini-reducers that run in memory after the map phase
Used as an optimization to reduce network traffic
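As a rough illustration (function names here are illustrative, not a specific framework's API), the default partitioning strategy described above and a word-count-style combiner might look like this:

```python
def partition(key, num_partitions):
    # Default strategy: a simple hash of the key, modulo the number of reducers,
    # so all pairs with the same key land in the same partition.
    return hash(key) % num_partitions

def combine(key, values):
    # A "mini-reducer" applied to map output before it crosses the network;
    # for word count it simply pre-sums the local counts for each word.
    yield key, sum(values)
```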
[Figure: the complete data flow with combiners and partitioners — map output is locally combined, partitioned, then shuffled and sorted by key, and finally reduced]
Motivation for Local Aggregation
Ideal scaling characteristics
– Twice the data, twice the running time
– Twice the resources, half the running time
Why can’t we achieve this?
– Synchronization requires communication
– Communication kills performance
Thus… avoid communication!
– Reduce intermediate data via local aggregation
– Combiners can help
Word Count v1.0
Input: (document id, document text) pairs
Output: (word, count) pairs
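The baseline algorithm is a mapper that emits (term, 1) for every term occurrence and a reducer that sums the counts per term. A minimal Python-style sketch (not the Hadoop API; names are illustrative):

```python
def map_word_count(doc_id, doc_text):
    # Emit (term, 1) for every term occurrence in the document.
    for term in doc_text.split():
        yield term, 1

def reduce_word_count(term, counts):
    # counts is the list of all 1s emitted for this term across all documents.
    yield term, sum(counts)
```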
[Example: the mappers’ (word, 1) pairs are grouped by reduce key, e.g. <“obama”, {1}>, <“the”, {1, 1}>, <“is”, {1, 1, 1}>]
Word Count v2.0
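In the usual presentation of this example, v2.0 adds local aggregation within each map call: terms are tallied in an associative array and one pair is emitted per distinct term per document. A sketch under that assumption:

```python
from collections import Counter

def map_word_count_v2(doc_id, doc_text):
    # Per-document aggregation: count terms locally first, then emit
    # one (term, count) pair per distinct term instead of one per occurrence.
    for term, count in Counter(doc_text.split()).items():
        yield term, count

def reduce_word_count_v2(term, counts):
    yield term, sum(counts)
```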
Word Count v3.0
Key: preserve state across input key-value pairs!
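A sketch of in-mapper combining, assuming a framework that calls setup() once before any input, map() once per input pair, and cleanup() once at the end of the task (hook names are illustrative):

```python
from collections import defaultdict

class WordCountMapperV3:
    """In-mapper combining: the tally survives across map() calls."""

    def setup(self):
        self.counts = defaultdict(int)

    def map(self, doc_id, doc_text):
        for term in doc_text.split():
            self.counts[term] += 1      # accumulate; emit nothing yet

    def cleanup(self):
        # Emit one (term, count) pair per distinct term seen by this map task.
        for term, count in self.counts.items():
            yield term, count
```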
Combiner Design
Combiners and reducers share the same method signature
– Sometimes, reducers can serve as combiners
– Often, not…
Remember: combiners are optional optimizations
– They should not affect algorithm correctness
– They may be run 0, 1, or multiple times
Example: find the average of all integers associated with the same key
Computing the Mean v1.0
Why can’t we use the reducer as a combiner?
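A sketch of the baseline (illustrative Python, not the slide's exact code): an identity mapper and a reducer that averages. The comment spells out why this reducer cannot be reused as a combiner.

```python
def map_mean_v1(key, value):
    yield key, value                    # identity mapper: pass values through

def reduce_mean_v1(key, values):
    values = list(values)
    yield key, sum(values) / len(values)

# Why the reducer cannot serve as a combiner: a mean of partial means is not,
# in general, the overall mean. For example,
#   mean(1, 2, 3, 4, 5) = 3, but
#   mean(mean(1, 2), mean(3, 4, 5)) = mean(1.5, 4.0) = 2.75
```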
Computing the Mean v2.0
Why doesn’t this work? Combiners must have the same input and output key-value types, and these must also match the mapper’s output type and the reducer’s input type.
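The broken variant critiqued here is commonly one where a combiner is bolted on that emits (sum, count) pairs while the mapper still emits plain values. A sketch under that assumption, with the type mismatch called out in comments:

```python
def map_mean_v2(key, value):
    yield key, value                          # mapper output: plain numbers

def combine_mean_v2(key, values):
    values = list(values)
    yield key, (sum(values), len(values))     # combiner output: (sum, count) pairs
    # Problem: the combiner's output type differs from its input type (and from
    # the mapper's output type), violating the contract stated above.

def reduce_mean_v2(key, pairs):
    # Assumes (sum, count) pairs -- but combiners may run zero times, in which
    # case this reducer receives plain numbers and breaks.
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield key, total / count
```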
Computing the Mean v3.0
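The standard fix is for the mapper to emit (value, 1) pairs so that the combiner and reducer both consume and produce (sum, count) pairs; the types line up, and the combiner may run any number of times without changing the result. A sketch:

```python
def map_mean_v3(key, value):
    yield key, (value, 1)                     # (partial sum, count)

def combine_mean_v3(key, pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield key, (total, count)                 # same type in and out

def reduce_mean_v3(key, pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield key, total / count                  # final mean
```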
Computing the Mean v4.0
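v4.0 is presumably the in-mapper-combining version of v3.0: the mapper accumulates per-key (sum, count) state across input pairs and emits it once at the end, while the reducer stays the same as in v3.0. A sketch under that assumption (setup/map/cleanup hooks are illustrative):

```python
from collections import defaultdict

class MeanMapperV4:
    """In-mapper combining for the mean: per-key (sum, count) kept across inputs."""

    def setup(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def map(self, key, value):
        self.totals[key] += value
        self.counts[key] += 1

    def cleanup(self):
        for key in self.totals:
            yield key, (self.totals[key], self.counts[key])
```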
MapReduce Runtime
Handles scheduling
– Assigns workers to map and reduce tasks
Handles “data distribution”
– Moves processes to data
Handles synchronization
– Gathers, sorts, and shuffles intermediate data
Handles errors and faults
– Detects worker failures and restarts their tasks
Everything happens on top of a distributed file system
Execution
[Figure: MapReduce execution overview. (1) The user program submits the job to the master. (2) The master schedules map tasks and reduce tasks onto workers. (3) Map workers read their input splits (split 0–4). (4) Map output is written to intermediate files on local disk. (5) Reduce workers remotely read the intermediate data. (6) Reduce workers write output file 0 and output file 1.]
Implementation
Google has a proprietary implementation in C++
– Bindings in Java, Python
Hadoop is an open-source implementation in Java
– Development led by Yahoo, used in production
– Now an Apache project
– Rapidly expanding software ecosystem
Lots of custom research implementations
– For GPUs, Cell processors, etc.
Distributed File System
Don’t move data to workers… move workers to the data!
– Store data on the local disks of nodes in the cluster
– Start up the workers on the node that has the data local
Why?
– Not enough RAM to hold all the data in memory
– Disk access is slow, but disk throughput (data transfer rate) is reasonable
A distributed file system is the answer
– GFS (Google File System) for Google’s MapReduce
– HDFS (Hadoop Distributed File System) for Hadoop
GFS
Commodity hardware over “exotic” hardware
– Scale “out”, not “up”: scale out (horizontally) by adding more nodes to a system; scale up (vertically) by adding resources to a single node
High component failure rates
– Inexpensive commodity components fail all the time
“Modest” number of huge files
– Multi-gigabyte files are common, if not encouraged
Files are write-once, mostly appended to
– Perhaps concurrently
Large streaming reads over random access
– High sustained throughput over low latency
Seeks vs. Scans
Consider a 1 TB database with 100-byte records
– We want to update 1 percent of the records
Scenario 1: random access
– Each update takes ~30 ms (seek, read, write)
– 10^8 updates = ~35 days
Scenario 2: rewrite all records
– Assume 100 MB/s throughput
– Time = 5.6 hours(!)
Lesson: avoid random seeks!
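The figures on this slide follow from simple arithmetic, assuming the rewrite both reads and writes the full terabyte at 100 MB/s:

```python
records = 10**12 // 100            # 1 TB of 100-byte records = 10^10 records
updates = records // 100           # update 1% of them = 10^8 updates

random_access_s = updates * 0.030  # ~30 ms per update (seek, read, write)
print(random_access_s / 86400)     # ~34.7 days

rewrite_s = 2 * 10**12 / 100e6     # read 1 TB + write 1 TB at 100 MB/s
print(rewrite_s / 3600)            # ~5.6 hours
```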
GFS
Files stored as chunks
– Fixed size (64 MB)
Reliability through replication
– Each chunk replicated across 3+ chunk servers
Single master to coordinate access and keep metadata
– Simple centralized management
No data caching
– Little benefit due to large datasets, streaming reads
Simplify the API
– Push some of the issues onto the client (e.g., data layout)
Relational Databases vs. MapReduce
Relational databases
– Multipurpose: analysis and transactions; batch and interactive
– Data integrity via ACID transactions
– Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
– Supports SQL (and SQL integration, e.g., JDBC)
– Automatic SQL query optimization
MapReduce (Hadoop)
– Designed for large clusters, fault tolerant
– Data is accessed in “native format”
– Supports many query languages
– Programmers retain control over performance
– Open source
Workloads
OLTP (online transaction processing)
– Typical applications: e-commerce, banking, airline reservations
– User-facing: real-time, low latency, highly concurrent
– Tasks: relatively small set of “standard” transactional queries
– Data access pattern: random reads, updates, writes (involving relatively small amounts of data)
OLAP (online analytical processing)
– Typical applications: business intelligence, data mining
– Back-end processing: batch workloads, less concurrency
– Tasks: complex analytical queries, often ad hoc
– Data access pattern: table scans, large amounts of data involved per query
OLTP/OLAP Integration
OLTP database for user-facing transactions
– Retain records of all activity
– Periodic ETL (e.g., nightly)
Extract-Transform-Load (ETL)
– Extract records from the source
– Transform: clean data, check integrity, aggregate, etc.
– Load into the OLAP database
OLAP database for data warehousing
– Business intelligence: reporting, ad hoc queries, data mining, etc.
– Feedback to improve OLTP services
Relational Algebra in MapReduce
Projection
– Map over tuples, emit new tuples with the appropriate attributes
– No reducers, unless for regrouping or resorting tuples
– Alternatively: perform in the reducer, after some other processing
Selection
– Map over tuples, emit only tuples that meet the criteria
– No reducers, unless for regrouping or resorting tuples
– Alternatively: perform in the reducer, after some other processing
Relational Algebra in MapReduce
Group by
– Example: what is the average time spent per URL?
– In SQL: SELECT url, AVG(time) FROM visits GROUP BY url
– In MapReduce: map over tuples, emitting time keyed by url; the framework automatically groups values by key; compute the average in the reducer; optimize with combiners
Join in MapReduce
Reduce-side join: group by join key
– Map over both sets of tuples
– Emit each tuple as the value, with its join key as the intermediate key
– The execution framework brings together tuples sharing the same key
– Perform the actual join in the reducer
– Similar to a “sort-merge join” in database terminology
Reduce-side Join: Example
[Figure: tuples R1, R4, S2, S3 are mapped to (key, value) pairs keyed by the join attribute; the reducer receives, for each key, the R and S tuples that share it]
Note: there is no guarantee whether R tuples or S tuples arrive first
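A framework-agnostic sketch of a reduce-side join (illustrative names, not a specific API): the mapper tags each tuple with the relation it came from, and the reducer buffers both sides before pairing them, since, as noted above, there is no guarantee which relation's tuples arrive first.

```python
def map_reduce_side_join(relation_name, tup):
    # Assume the join attribute is the first field of each tuple.
    yield tup[0], (relation_name, tup)

def reduce_reduce_side_join(join_key, tagged_tuples):
    r_tuples, s_tuples = [], []
    for name, tup in tagged_tuples:            # R and S may arrive in any order
        (r_tuples if name == "R" else s_tuples).append(tup)
    for r in r_tuples:
        for s in s_tuples:
            yield join_key, (r, s)             # one output per matching pair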
Join in MapReduce
Map-side join: parallel scans
– Assume the two datasets are sorted by the join key
– A sequential scan through both datasets performs the join (called a “merge join” in database terminology)
[Figure: R1–R4 and S1–S4 laid out side by side, both sorted by the join key]
Join in MapReduce
Map-side join
– If the datasets are sorted by join key, the join can be accomplished by a scan over both datasets
– How can we accomplish this in parallel? Partition and sort both datasets in the same manner
– In MapReduce: map over one dataset, reading from the other dataset’s corresponding partition; no reducers necessary (unless to repartition or resort)
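Within one map task, the join over a pair of corresponding partitions is an ordinary merge join. A sketch, assuming both inputs are lists of tuples sorted on their first field (the join key):

```python
def merge_join(r_sorted, s_sorted):
    """Sequential scan over two sorted inputs, emitting matching tuple pairs."""
    i, j = 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        r_key, s_key = r_sorted[i][0], s_sorted[j][0]
        if r_key < s_key:
            i += 1
        elif r_key > s_key:
            j += 1
        else:
            # Pair the current R tuple with every S tuple sharing the key,
            # then advance R (so duplicate keys on either side are handled).
            k = j
            while k < len(s_sorted) and s_sorted[k][0] == r_key:
                yield r_sorted[i], s_sorted[k]
                k += 1
            i += 1
```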
Join in MapReduce
In-memory join
– Basic idea: load one dataset into memory, stream over the other dataset
Works if R << S and R fits into memory
Called a “hash join” in database terminology
– MapReduce implementation
Distribute R to all nodes
Map over S; each mapper loads R into memory, hashed by join key
For every tuple in S, look up its join key in R
No reducers, unless for regrouping or resorting tuples
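A sketch of the in-memory (hash) join: the smaller relation R is loaded into a hash table keyed on the join attribute (in a real job R would be shipped to every node, e.g. via a distributed cache), and the mapper streams over S and probes the table. Names are illustrative.

```python
from collections import defaultdict

def build_r_table(r_tuples):
    # Hash the smaller relation R in memory, keyed by the join attribute (first field).
    table = defaultdict(list)
    for r in r_tuples:
        table[r[0]].append(r)
    return table

def map_in_memory_join(s_tuple, r_table):
    # Stream over S: for each S tuple, look up matching R tuples by join key.
    for r in r_table.get(s_tuple[0], []):
        yield s_tuple[0], (r, s_tuple)
```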
Which Join Algorithm to Use?
In-memory join > map-side join > reduce-side join
– Why? What are the limitations of each?
– In-memory join: limited by memory
– Map-side join: requires a specific sort order and partitioning
– Reduce-side join: general purpose
Processing Relational Data: Summary
MapReduce algorithms for processing relational data
– Group by, sorting, and partitioning are handled automatically by the shuffle/sort phase
– Selection, projection, and other computations (e.g., aggregation) are performed in either the mapper or the reducer
– Multiple strategies for relational joins
Complex operations require multiple MapReduce jobs
– Example: the top ten URLs in terms of average time spent
– Opportunities for automatic optimization