Published by Brian Cummings. Modified over 9 years ago.
CSC 536 Lecture 3
Outline Akka example: mapreduce Distributed transactions
MapReduce Framework: Motivation Want to process lots of data ( > 1 TB) Want to parallelize the job across hundreds/thousands of commodity CPUs connected by a commodity network Want to make this easy, re-usable
Example Uses at Google Pagerank wordcount distributed grep distributed sort web link-graph reversal term-vector per host web access log stats inverted index construction document clustering machine learning statistical machine translation …
Programming Model Users implement interface of two functions: mapper (in_key, in_value) -> list((out_key, intermediate_value)) reducer (out_key, intermediate_values list) -> (out_key, out_value)
Map phase Records from the data source are fed into the mapper function as (key, value) pairs (filename, content) (goal: wordcount) (web page URL, web page content) (goal: web link-graph reversal) mapper produces one or more intermediate (output key, intermediate value) pairs from the input (word, 1) (link URL, web page URL)
Reduce phase After the Map phase is over, all the intermediate values for a given output key are combined together into a list (“hello”, 1), (“hello”, 1), (“hello”, 1) -> (“hello”, [1,1,1]) Done by intermediate aggregator step of MapReduce reducer function combines those intermediate values into one or more final values for that same output key (“hello”, [1,1,1]) -> (“hello”, 3)
Parallelism mapper functions run in parallel, creating different intermediate values from different input data sets reducer functions also run in parallel, each working on a different output key All values are processed independently
MapReduce example: wordcount Problem: Count the number of occurrences of words in a set of files Input to any MapReduce job: A set of (input_key, input_value) pairs In wordcount: (input_key, input_value) = (filename, content)
    filenames = ["a.txt", "b.txt", "c.txt"]
    content = {}
    for filename in filenames:
        # read the whole file; the with-block closes it afterwards
        with open(filename) as f:
            content[filename] = f.read()
MapReduce example: wordcount The content of the input files a.txt: The quick brown fox jumped over the lazy grey dogs. b.txt: That's one small step for a man, one giant leap for mankind. c.txt: Mary had a little lamb, Its fleece was white as snow; And everywhere that Mary went, The lamb was sure to go.
MapReduce example: wordcount Map phase: Function mapper is applied to every (filename, content) pair mapper moves through the words in the file for each word it encounters, it returns the intermediate key and value (word, 1) A call to mapper("a.txt", content["a.txt"]) returns: [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)] The output of the Map phase is the concatenation of the lists for mapper("a.txt", content["a.txt"]), mapper("b.txt", content["b.txt"]), and mapper("c.txt", content["c.txt"])
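The slides describe mapper but do not show its body. A minimal sketch follows, assuming that words are lowercased and that punctuation is simply deleted (which is how "That's" would become 'thats' in the intermediate data); the tokenization rule is an assumption, not something the slides specify.

```python
import string

def mapper(filename, text):
    """Return one (word, 1) pair per word occurrence in text."""
    # Assumption: lowercase everything and delete punctuation characters.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [(word, 1) for word in cleaned.split()]

pairs = mapper("a.txt", "The quick brown fox jumped over the lazy grey dogs.")
```

The filename key is unused for wordcount; it is kept only so that mapper matches the (in_key, in_value) interface.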
MapReduce example: wordcount The output of the Map phase [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), ('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1), ('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1), ('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1), ('mankind', 1)]
MapReduce example: wordcount The Map phase of MapReduce is logically trivial But when the input dictionary has, say, 10 billion keys, and those keys point to files held on thousands of different machines, implementing the map phase is actually quite non-trivial. The MapReduce library should handle: knowing which files are stored on what machines, making sure that machine failures don’t affect the computation, making efficient use of the network, and storing the output in a useable form. The programmer only writes the mapper function The MapReduce framework takes care of everything else
MapReduce example: wordcount In preparation for the Reduce phase, the MapReduce library groups together all the intermediate values which have the same key to obtain this intermediate dictionary: {'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 'that': [1], 'little': [1], 'small': [1], 'step': [1], 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
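On a single machine, the grouping step can be sketched as a dictionary of lists. The helper name group_by_key is hypothetical; a real MapReduce library performs this step across machines.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all intermediate values that share the same output key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

intermediate = group_by_key([("the", 1), ("quick", 1), ("the", 1)])
```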
MapReduce example: wordcount In the Reduce phase, a programmer-defined function reducer(out_key, intermediate_value_list) is applied to each entry in the intermediate dictionary. For wordcount, reducer sums up the list of intermediate values, and returns both out_key and the sum as the output.
    def reducer(out_key, intermediate_value_list):
        return (out_key, sum(intermediate_value_list))
MapReduce example: wordcount The output from the Reduce phase, and from the complete MapReduce computation, is: [('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
MapReduce example: wordcount Map and Reduce can be done in parallel... but how is the grouping step that takes place between the Map phase and the Reduce phase done? For the reducer functions to work in parallel, we need to ensure that all the intermediate values corresponding to the same key get sent to the same machine The general idea: Imagine you’ve got 1000 machines that you’re going to use to run reduce on. As the mapper functions compute the output keys and intermediate value lists, they compute hash(out_key) mod 1000 for some hash function. This number is used to identify the machine in the cluster that the corresponding reducer will be run on, and the resulting output key and value list is then sent to that machine. Because every machine running mapper uses the same hash function, this ensures that value lists corresponding to the same output key all end up at the same machine. Furthermore, by using a hash we ensure that the output keys end up pretty evenly spread over machines in the cluster
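The hash-partitioning idea can be sketched as follows. The names partition and shuffle are hypothetical, and CRC32 is only one possible choice of hash. Python's built-in hash() is deliberately avoided here because, for strings, it is randomized per process by default, so two mapper processes would not agree on it.

```python
import zlib

def partition(out_key, num_reducers):
    """Map an output key to a reducer index, identically on every machine."""
    return zlib.crc32(out_key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

buckets = shuffle([("the", 1), ("fox", 1), ("the", 1)], num_reducers=4)
```

Both ('the', 1) pairs land in the same bucket, whichever mapper produced them.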
mapreduce example: see project mapreduce in the lecture 3 code
MapReduce optimizations Locality Fault Tolerance Time optimization Bandwidth optimization
Locality Master program divvies up tasks based on location of data tries to have mapper tasks on same machine as physical file data, or at least same rack mapper task inputs are divided into 64 MB blocks same size as Google File System chunks
Redundancy for Fault Tolerance Master detects worker failures via periodic heartbeats Re-executes completed & in-progress mapper tasks Re-executes in-progress reducer tasks
Redundancy for time optimization Reduce phase can’t start until Map phase is complete Slow workers significantly lengthen completion time A single slow disk controller can rate-limit the whole process Other jobs consuming resources on machine Bad disks with soft errors transfer data very slowly Weird things: processor caches disabled Solution: Near end of phase, spawn backup copies of tasks Whichever one finishes first “wins” Effect: Dramatically shortens job completion time
Bandwidth Optimizations “Aggregator” function can run on same machine as a mapper function Causes a mini-reduce phase to occur before the real Reduce phase, to save bandwidth
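For wordcount, the mini-reduce can be sketched as a local sum over one mapper's output. The name combine is hypothetical. Pre-summing is safe here only because addition is associative and commutative, so the final reducer gets the same totals.

```python
from collections import Counter

def combine(pairs):
    """Pre-sum the counts produced by a single mapper before sending them."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

combined = combine([("the", 1), ("quick", 1), ("the", 1)])
```

Here three pairs shrink to two, and the reducer later adds the partial count 2 for 'the' to the partial counts from other machines.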
Distributed Transactions
Distributed transactions Transactions, like mutual exclusion, protect shared data against simultaneous access by several concurrent processes. Transactions allow a process to access and modify multiple data items as a single atomic transaction. If the process backs out halfway during the transaction, everything is restored to the point just before the transaction started.
Distributed transactions: example 1 A customer dials into her bank web account and does the following: Withdraws amount x from account 1. Deposits amount x to account 2. If telephone connection is broken after the first step but before the second, what happens? Either both or neither should be completed. Requires special primitives provided by the DS.
The Transaction Model Examples of primitives for transactions
Primitive            Description
BEGIN_TRANSACTION    Mark the start of a transaction
END_TRANSACTION      Terminate the transaction and try to commit
ABORT_TRANSACTION    Kill the transaction and restore the old values
READ                 Read data from a file, a table, or otherwise
WRITE                Write data to a file, a table, or otherwise
Distributed transactions: example 2
(a) Transaction to reserve three flights commits
    BEGIN_TRANSACTION
        reserve WP -> JFK;
        reserve JFK -> Nairobi;
        reserve Nairobi -> Malindi;
    END_TRANSACTION
(b) Transaction aborts when third flight is unavailable
    BEGIN_TRANSACTION
        reserve WP -> JFK;
        reserve JFK -> Nairobi;
        reserve Nairobi -> Malindi full => ABORT_TRANSACTION
ACID Transactions are Atomic: to the outside world, the transaction happens indivisibly. Consistent: the transaction does not violate system invariants. Isolated (or serializable): concurrent transactions do not interfere with each other. Durable: once a transaction commits, the changes are permanent.
Flat, nested and distributed transactions a)A nested transaction b)A distributed transaction
Implementation of distributed transactions For simplicity, we consider transactions on a file system. Note that if each process executing a transaction just updates the file in place, transactions will not be atomic, and changes will not vanish if the transaction aborts. Other methods required.
Atomicity If each process executing a transaction just updates the file in place, transactions will not be atomic, and changes will not vanish if the transaction aborts.
Solution 1: Private Workspace a)The file index and disk blocks for a three-block file b)The situation after a transaction has modified block 0 and appended block 3 c)After committing
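The private-workspace idea can be sketched in memory as below. This is a simplification under stated assumptions: the file is modeled as a dict from block number to contents, and the class name PrivateWorkspace and its methods are hypothetical. A real implementation would copy the file index and swap it in atomically on commit.

```python
class PrivateWorkspace:
    def __init__(self, store):
        self.store = store    # shared file: block number -> contents
        self.shadow = {}      # private copies of blocks this transaction wrote

    def read(self, block):
        # Prefer the private copy; fall back to the shared block.
        return self.shadow.get(block, self.store.get(block))

    def write(self, block, data):
        self.shadow[block] = data      # the shared file is not touched

    def commit(self):
        self.store.update(self.shadow) # publish all private blocks
        self.shadow = {}

    def abort(self):
        self.shadow = {}               # simply discard the private copies

shared_file = {0: "a", 1: "b", 2: "c"}
txn = PrivateWorkspace(shared_file)
txn.write(0, "A")            # modify block 0
txn.write(3, "d")            # append block 3
before_commit = dict(shared_file)
txn.commit()
after_commit = dict(shared_file)
```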
Solution 2: Writeahead Log (a) A transaction (b) – (d) The log before each statement is executed. Each log entry records [variable = old value / new value].
(a) x = 0; y = 0;
    BEGIN_TRANSACTION;
        x = x + 1;
        y = y + 2;
        x = y * y;
    END_TRANSACTION;
(b) Log: [x = 0/1]
(c) Log: [x = 0/1] [y = 0/2]
(d) Log: [x = 0/1] [y = 0/2] [x = 1/4]
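A write-ahead log with undo can be sketched as follows. The class name LoggedStore is hypothetical and the log lives in memory only; a real system would force each log record to stable storage before changing the data.

```python
class LoggedStore:
    def __init__(self, **initial):
        self.values = dict(initial)
        self.log = []                  # entries of (name, old value, new value)

    def write(self, name, new):
        # Log first, then update in place.
        self.log.append((name, self.values[name], new))
        self.values[name] = new

    def abort(self):
        # Undo by replaying the log backwards, restoring old values.
        for name, old, _new in reversed(self.log):
            self.values[name] = old
        self.log.clear()

    def commit(self):
        self.log.clear()               # changes are already in place

store = LoggedStore(x=0, y=0)
store.write("x", store.values["x"] + 1)
store.write("y", store.values["y"] + 2)
store.write("x", store.values["y"] * store.values["y"])
log_snapshot = list(store.log)
store.abort()
```

After the third write the log holds three entries, and the abort walks them in reverse to bring x and y back to 0.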
Concurrency control (1) We just learned how to achieve atomicity; we will learn about durability when discussing fault tolerance Need to handle consistency and isolation Concurrency control allows several transactions to be executed simultaneously, while making sure that the data is left in a consistent state This is done by scheduling operations on data in an order whereby the final result is the same as if all transactions had run sequentially
Concurrency control (2) General organization of managers for handling transactions
Concurrency control (3) General organization of managers for handling distributed transactions.
Serializability The main issue in concurrency control is the scheduling of conflicting operations (operating on same data item and one of which is a write operation) Read/Write operations can be synchronized using: Mutual exclusion mechanisms, or Scheduling using timestamps Pessimistic/optimistic concurrency control
The lost update problem Accounts a, b, and c start with $100, $200, and $300, respectively
Transaction T:                            Transaction U:
  balance = b.getBalance();                 balance = b.getBalance();
  b.setBalance(balance*1.1);                b.setBalance(balance*1.1);
  a.withdraw(balance/10)                    c.withdraw(balance/10)
Interleaving:
  balance = b.getBalance();      $200
                                            balance = b.getBalance();      $200
                                            b.setBalance(balance*1.1);     $220
  b.setBalance(balance*1.1);     $220
  a.withdraw(balance/10)         $80
                                            c.withdraw(balance/10)         $280
The inconsistent retrievals problem Accounts a and b start with $200 each.
Transaction V:                            Transaction W:
  a.withdraw(100)                           aBranch.branchTotal()
  b.deposit(100)
Interleaving:
  a.withdraw(100);               $100
                                            total = a.getBalance()           $100
                                            total = total+b.getBalance()     $300
                                            total = total+c.getBalance()
  b.deposit(100)                 $300
A serialized interleaving of T and U
Transaction T:                            Transaction U:
  balance = b.getBalance()                  balance = b.getBalance()
  b.setBalance(balance*1.1)                 b.setBalance(balance*1.1)
  a.withdraw(balance/10)                    c.withdraw(balance/10)
Interleaving:
  balance = b.getBalance()       $200
  b.setBalance(balance*1.1)      $220
                                            balance = b.getBalance()       $220
                                            b.setBalance(balance*1.1)      $242
  a.withdraw(balance/10)         $80
                                            c.withdraw(balance/10)         $278
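The lost-update interleaving and the serialized interleaving of T and U can be replayed with a small simulation. The run helper and its step names are hypothetical, and integer arithmetic (balance * 11 // 10) stands in for balance*1.1 to avoid floating-point rounding.

```python
def run(schedule):
    """Replay a list of (transaction, step) pairs against fresh accounts."""
    accounts = {"a": 100, "b": 200, "c": 300}
    target = {"T": "a", "U": "c"}   # account each transaction withdraws from
    balance = {}                    # each transaction's private local variable
    for txn, step in schedule:
        if step == "get":
            balance[txn] = accounts["b"]
        elif step == "set":
            accounts["b"] = balance[txn] * 11 // 10
        elif step == "withdraw":
            accounts[target[txn]] -= balance[txn] // 10
    return accounts

lost_update = run([("T", "get"), ("U", "get"), ("U", "set"),
                   ("T", "set"), ("T", "withdraw"), ("U", "withdraw")])
serialized = run([("T", "get"), ("T", "set"), ("U", "get"),
                  ("U", "set"), ("T", "withdraw"), ("U", "withdraw")])
```

In the first schedule both transactions read b = $200, so one of the two 10% increases is lost and b ends at $220. In the second, U reads T's result and b ends at $242.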
A serialized interleaving of V and W
Transaction V:                            Transaction W:
  a.withdraw(100);                          aBranch.branchTotal()
  b.deposit(100)
Interleaving:
  a.withdraw(100);               $100
  b.deposit(100)                 $300
                                            total = a.getBalance()           $100
                                            total = total+b.getBalance()     $400
                                            total = total+c.getBalance()     ...
Read and write operation conflict rules
Operations of different transactions    Conflict    Reason
read     read                           No          Because the effect of a pair of read operations does not depend on the order in which they are executed
read     write                          Yes         Because the effect of a read and a write operation depends on the order of their execution
write    write                          Yes         Because the effect of a pair of write operations depends on the order of their execution
Serializability Two transactions are serialized if and only if All pairs of conflicting operations of the two transactions are executed in the same order at all objects they both access.
A non-serialized interleaving of operations of transactions T and U
Transaction T:                 Transaction U:
  x = read(i)
  write(i, 10)
                                 y = read(j)
                                 write(j, 30)
  write(j, 20)
                                 z = read(i)
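The serializability condition for two transactions can be checked mechanically: collect the order of every conflicting pair and verify that all pairs point the same way. This is a sketch for the two-transaction case only; the function names and the (transaction, operation, object) encoding are assumptions. With more transactions one would test a precedence graph for cycles instead.

```python
def conflict_orders(schedule):
    """Return (first, second) transaction pairs for all conflicting operations."""
    orders = set()
    for i, (t1, op1, obj1) in enumerate(schedule):
        for t2, op2, obj2 in schedule[i + 1:]:
            # Conflict: different transactions, same object, at least one write.
            if t1 != t2 and obj1 == obj2 and "write" in (op1, op2):
                orders.add((t1, t2))
    return orders

def serially_equivalent(schedule):
    orders = conflict_orders(schedule)
    return not any((b, a) in orders for a, b in orders)

interleaved = [("T", "read", "i"), ("T", "write", "i"),
               ("U", "read", "j"), ("U", "write", "j"),
               ("T", "write", "j"), ("U", "read", "i")]
serial = [op for op in interleaved if op[0] == "T"] + \
         [op for op in interleaved if op[0] == "U"]
```

For the interleaving above, T comes before U on object i but U comes before T on object j, so the check fails; the serial schedule passes.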
Recoverability of aborts Aborted transactions must be prevented from affecting other concurrent transactions Dirty reads Cascading aborts
A dirty read when transaction T aborts
Transaction T:                            Transaction U:
  a.getBalance()                            a.getBalance()
  a.setBalance(balance + 10)                a.setBalance(balance + 20)
Interleaving:
  balance = a.getBalance()       $100
  a.setBalance(balance + 10)     $110
                                            balance = a.getBalance()       $110
                                            a.setBalance(balance + 20)     $130
                                            commit transaction
  abort transaction
Cascading aborts Suppose: Transaction U has seen the effects of transaction T Transaction V has seen the effects of transaction U T decides to abort V and U must abort
Transactions T and U with locks
Two-phase locking (2) Idea: the scheduler grants locks in a way that creates only serializable schedules. In 2-phase-locking, the transaction acquires all the locks it needs in the first phase, and then releases them in the second. This will ensure a serializable schedule. Dirty reads and cascading aborts are still possible Under strict 2-phase locking, a transaction that needs to read or write an object must be delayed until other transactions that wrote the same object have committed or aborted Locks are held until transaction commits or aborts Example: CORBA Concurrency Control Service
Two-phase locking in a distributed system The data is assumed to be distributed across multiple machines Centralized 2PL: central scheduler grants locks Primary 2PL: local scheduler is coordinator for local data Distributed 2PL: (data may be replicated) the local schedulers use a distributed mutual exclusion algorithm to obtain a lock The local scheduler forwards Read/Write operations to data managers holding the replicas
Two-phase locking issues Exclusive locks reduce concurrency more than necessary. It is sometimes preferable to allow concurrent transactions to read an object; two types of locks may be needed (read locks and write locks) Deadlocks are possible. Solution 1: acquire all locks in the same order. Solution 2: use a graph to detect potential deadlocks.
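Solution 1 can be sketched with threads that always acquire their locks in one global (here alphabetical) order, whatever order the transaction itself would naturally touch the objects. The account names, amounts, and the run_transaction helper are hypothetical, and for simplicity every lock is taken up front and held until the transaction finishes.

```python
import threading

locks = {"A": threading.Lock(), "B": threading.Lock()}
balances = {"A": 100, "B": 200}

def run_transaction(needed, updates):
    ordered = sorted(needed)            # same global order for every transaction
    for name in ordered:
        locks[name].acquire()           # growing phase
    try:
        for name, delta in updates:
            balances[name] += delta
    finally:
        for name in reversed(ordered):  # shrinking phase
            locks[name].release()

# T touches A then B; U touches B then A. Without ordering this could deadlock.
t = threading.Thread(target=run_transaction,
                     args=(["A", "B"], [("A", 100), ("B", -100)]))
u = threading.Thread(target=run_transaction,
                     args=(["B", "A"], [("B", 200), ("A", -200)]))
t.start(); u.start()
t.join(); u.join()
```

Because both threads request A before B, neither can hold one lock while waiting for the other in a cycle.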
Deadlock with write locks
Transaction T                                   Transaction U
Operations         Locks                        Operations         Locks
a.deposit(100);    write lock A
                                                b.deposit(200)     write lock B
b.withdraw(100)    waits for U's lock on B
                                                a.withdraw(200);   waits for T's lock on A
The wait-for graph [Figure: T waits for B, which is held by U; U waits for A, which is held by T]
A cycle in a wait-for graph [Figure: transactions T, U, and V waiting on one another in a cycle]
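Deadlock detection amounts to finding a cycle in the wait-for graph. A depth-first search sketch follows, assuming the graph is given as a dict that maps each transaction to the set of transactions it waits for; the function name find_cycle and this encoding are assumptions.

```python
def find_cycle(wait_for):
    """Return one cycle as a list of transactions, or None if there is none."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge: the cycle is the path from node back to itself.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in sorted(wait_for.get(node, ())):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in sorted(wait_for):
        cycle = visit(start)
        if cycle:
            return cycle
    return None

two_way = find_cycle({"T": {"U"}, "U": {"T"}})
no_deadlock = find_cycle({"T": {"U"}, "U": set()})
```

Once a cycle is found, the lock manager would pick one transaction in it to abort so the others can proceed.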
Deadlock prevention with timeouts
Transaction T                                   Transaction U
Operations         Locks                        Operations         Locks
a.deposit(100);    write lock A
                                                b.deposit(200)     write lock B
b.withdraw(100)    waits for U's lock on B
                                                a.withdraw(200);   waits for T's lock on A
                   (timeout elapses)
                   T's lock on A becomes vulnerable, unlock A, abort T
                                                a.withdraw(200);   write locks A
                                                                   unlock A, B
Disadvantages of locking High overhead Deadlocks Locks cannot be released until the end of the transaction, which reduces concurrency In most applications, the likelihood of two clients accessing the same object is low
Pessimistic timestamp concurrency control A transaction’s request to write an object is valid only if that object was last read and written by an earlier transaction A transaction’s request to read an object is valid only if that object was last written by an earlier transaction Advantage: Non-blocking and deadlock-free Disadvantage: Transactions may need to abort and restart
Operation conflicts for timestamp ordering
Rule   Tc      Ti
1.     write   read    Tc must not write an object that has been read by any Ti where Ti > Tc; this requires that Tc ≥ the maximum read timestamp of the object.
2.     write   write   Tc must not write an object that has been written by any Ti where Ti > Tc; this requires that Tc > write timestamp of the committed object.
3.     read    write   Tc must not read an object that has been written by any Ti where Ti > Tc; this requires that Tc > write timestamp of the committed object.
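The three rules can be sketched as checks on a per-object read and write timestamp. This is a simplification under stated assumptions: there are no tentative versions, every write is treated as committed immediately, and a transaction may overwrite or re-read its own write (so the comparisons reject only strictly later timestamps). The class and exception names are hypothetical.

```python
class Abort(Exception):
    """Raised when an operation arrives too late and its transaction must restart."""

class TimestampedObject:
    def __init__(self, value):
        self.value = value
        self.read_ts = 0     # largest timestamp of any transaction that read it
        self.write_ts = 0    # timestamp of the transaction that last wrote it

    def read(self, tc):
        if tc < self.write_ts:                       # rule 3
            raise Abort("object already written by a later transaction")
        self.read_ts = max(self.read_ts, tc)
        return self.value

    def write(self, tc, value):
        if tc < self.read_ts or tc < self.write_ts:  # rules 1 and 2
            raise Abort("object already read or written by a later transaction")
        self.value = value
        self.write_ts = tc

item = TimestampedObject(100)
item.read(tc=2)
try:
    item.write(tc=1, value=50)     # transaction 1 is older than reader 2
    early_write_ok = True
except Abort:
    early_write_ok = False
item.write(tc=3, value=70)
try:
    item.read(tc=2)                # transaction 2 is older than writer 3
    late_read_ok = True
except Abort:
    late_read_ok = False
```

No operation ever blocks; a transaction that arrives too late is aborted and restarted with a new timestamp.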
Pessimistic Timestamp Ordering Concurrency control using timestamps.
Optimistic timestamp ordering Idea: just go ahead and do the operations without paying attention to what concurrent transactions are doing: Keep track of when each data item has been read and written. Before committing, check whether any item has been changed since the transaction started. If so, abort. If not, commit. Advantage: deadlock free and fast. Disadvantage: it can fail and transactions must be run again. Example: Scala Software Transactional Memory (next week)
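The optimistic scheme can be sketched with version numbers standing in for timestamps. Assumptions: the store is a dict from key to (version, value), the class name OptimisticTransaction is hypothetical, only items the transaction has read are validated, and the code is single-threaded, so the validate-then-write step in commit is not made atomic as a real system would require.

```python
class OptimisticTransaction:
    def __init__(self, store):
        self.store = store          # shared dict: key -> (version, value)
        self.read_versions = {}     # version seen at first read of each key
        self.writes = {}            # buffered writes, invisible to others

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        version, value = self.store[key]
        self.read_versions.setdefault(key, version)
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validate: abort if anything we read has changed since we read it.
        for key, seen in self.read_versions.items():
            if self.store[key][0] != seen:
                return False
        for key, value in self.writes.items():
            version = self.store[key][0] if key in self.store else 0
            self.store[key] = (version + 1, value)
        return True

store = {"b": (0, 200)}
t = OptimisticTransaction(store)
u = OptimisticTransaction(store)
t.write("b", t.read("b") * 11 // 10)
u.write("b", u.read("b") * 11 // 10)
t_committed = t.commit()
u_committed = u.commit()
```

T commits first, so U's validation fails and U must be rerun against the new value. That prevents a lost update without any locks.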