Data-Intensive Text Processing with MapReduce
J. Lin & C. Dyer
Chapter 1
MapReduce
Programming model for distributed computations on massive amounts of data
Execution framework for large-scale data processing on clusters of commodity servers
Developed by Google – built on long-standing principles of parallel and distributed processing
Hadoop – open-source implementation, adopted by Yahoo (now an Apache project)
Big Data
Big data – an issue we must grapple with
Web-scale processing is synonymous with data-intensive processing
Vast public and private data repositories
Behavioral data is important for business intelligence (BI)
4th Paradigm
Manipulating, exploring, and mining massive data – the 4th paradigm of science (after theory, experiments, and simulations)
In CS, systems must be able to scale
Increases in storage capacity have outpaced improvements in bandwidth
Problems/Solutions: NLP and IR
Data-driven algorithmic approach to capture statistical regularities (toy sketch below)
– Data – corpora (NLP), collections (IR)
– Representations of data – features (superficial, deep)
– Method – algorithms
– Examples – is a message spam or not, is a word part of an address or a location
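A minimal sketch of that data-driven recipe (illustrative only – the features, weights, and function names below are invented for this example; a real system would learn its weights from a labeled corpus):

```python
import re

def extract_features(message: str) -> dict:
    """Superficial features for a toy spam detector."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return {
        "num_tokens": len(tokens),
        "has_money_term": any(t in {"free", "winner", "prize"} for t in tokens),
        "num_exclamations": message.count("!"),
    }

# Hand-set weights purely for illustration.
WEIGHTS = {"has_money_term": 2.0, "num_exclamations": 0.5}

def spam_score(message: str) -> float:
    """Score a message: higher means more spam-like."""
    feats = extract_features(message)
    return sum(WEIGHTS.get(k, 0.0) * float(v) for k, v in feats.items())

print(spam_score("You are a WINNER! Claim your FREE prize now!!!"))
```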
Problems/Solutions
Who shot Lincoln?
– NLP – sophisticated linguistic, syntactic, and semantic analysis
– Redundancy-based approach – look at what appears to the left of “shot Lincoln” across many documents and tally up the candidates
Probability distribution over sequences of words
– Training, smoothing
– Markov assumption
– N-gram language model – the conditional probability of a word given the n-1 previous words (see the formula below)
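For reference, the Markov assumption behind an n-gram language model can be written as (standard formulation, not the book's exact notation):

$$P(w_1, \dots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$

For a bigram model (n = 2), each word is conditioned only on the single preceding word.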
MapReduce (MR)
MapReduce – a level of abstraction and a beneficial division of labor
– Programming model – a powerful abstraction that separates the what from the how of data-intensive processing (word-count sketch below)
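As a minimal sketch of the programming model (the canonical word-count example; the helper names here are illustrative stand-ins, not Hadoop's actual API), the programmer supplies only a mapper and a reducer, while the framework handles distribution, grouping, and fault tolerance:

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

# What the programmer writes: a mapper and a reducer.
def map_fn(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    for word in text.split():
        yield (word, 1)                 # emit an intermediate (key, value) pair

def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    return (word, sum(counts))          # aggregate all values for one key

# A grossly simplified, single-process stand-in for what the framework does.
def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):   # map phase
            groups[k].append(v)           # shuffle: group values by key
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # reduce phase

docs = [("d1", "the cat sat"), ("d2", "the cat ran")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]
```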
Big Ideas behind MapReduce
Scale out, not up
– Purchasing symmetric multi-processing (SMP) machines with a large number of processor sockets (100s) and large shared memory (GBs) is not cost effective
– Why? A machine with 2x the processors costs far more than 2x as much
– Barroso & Hölzle analysis using TPC benchmarks
– Within an SMP, communication is an order of magnitude faster than across a cluster
– Even so, a cluster of low-end servers is about 4x more cost effective than the high-end approach
– However, even low-end machines run at only 10-50% utilization – not energy efficient
Big Ideas behind MapReduce
Assume failures are common
– Assume cluster machines have a mean time between failures of 1,000 days
– In a 10,000-server cluster, expect roughly 10 failures a day (simple arithmetic below)
– MR is designed to cope with failure
Move processing to the data
– MR assumes an architecture where processors and storage are co-located
– Run code on the processor attached to the data
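The failure estimate is just arithmetic, assuming failures are independent and the 1,000-day figure is a per-machine mean time between failures:

$$\frac{10{,}000 \text{ machines}}{1{,}000 \text{ days between failures per machine}} \approx 10 \text{ failures per day}$$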
Big Ideas behind MapReduce
Process data sequentially, not randomly
– Consider a 1 TB database with 10^10 100-byte records
– Updating 1% of the records via random access takes about a month
– Reading the entire database sequentially and rewriting all records with the updates takes less than a work day on a single machine (back-of-the-envelope estimate below)
– Solid state won't help
– MR is designed for batch processing – trade latency for throughput
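A back-of-the-envelope calculation shows why; the disk parameters below are assumed ballpark figures, not numbers from the book:

```python
# Assumed ballpark parameters (illustrative only).
NUM_RECORDS = 10**10        # 10^10 records
RECORD_SIZE = 100           # bytes -> ~1 TB total
SEEK_TIME = 0.010           # ~10 ms per random disk access
SEQ_THROUGHPUT = 100e6      # ~100 MB/s sequential read/write

# Random access: update 1% of records, each needing a seek to read and a seek to write.
updates = 0.01 * NUM_RECORDS
random_seconds = updates * 2 * SEEK_TIME
print(f"random updates: {random_seconds / 86400:.1f} days")    # ~23 days (about a month)

# Sequential: read the whole 1 TB once and write it back once.
total_bytes = NUM_RECORDS * RECORD_SIZE
seq_seconds = 2 * total_bytes / SEQ_THROUGHPUT
print(f"sequential rewrite: {seq_seconds / 3600:.1f} hours")    # ~5.6 hours (< one work day)
```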
Big Ideas behind MapReduce
Hide system-level details from the application developer
– Writing distributed programs is difficult: details span threads, processes, and machines, and code that runs concurrently is unpredictable
– Deadlocks, race conditions, etc.
– MR isolates the developer from system-level details: no locking, starvation, etc.
– Well-defined interfaces (signatures below) separate the what (programmer) from the how (responsibility of the execution framework)
– Framework designed once and verified for correctness
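The well-defined interface amounts to two function signatures (the standard MapReduce formulation); everything else – partitioning, shuffling, grouping values by key, scheduling, and recovery from failures – is the framework's job:

$$\text{map}: (k_1, v_1) \rightarrow \text{list}(k_2, v_2)$$
$$\text{reduce}: (k_2, \text{list}(v_2)) \rightarrow \text{list}(k_3, v_3)$$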
Big Ideas behind MapReduce
Seamless scalability
– Given 2x the data, the algorithm should take at most 2x as long to run
– Given a cluster 2x as large, it should take half the time to run
– This ideal is unobtainable for most parallel algorithms: nine women can't have a baby in one month
– Increasing the degree of parallelization increases communication overhead, so speedups fall short of linear
– MR is a small step toward attaining the ideal (illustrative scaling relation below)
– The algorithm stays fixed; the framework executes it at whatever scale is available
– If 10 machines take 10 hours, 100 machines should take 1 hour
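One way to make the caveat concrete (an illustrative relation, not the book's notation): with total work $W$ split across $N$ machines,

$$T(N) \approx \frac{W}{N} + C(N),$$

where $C(N)$ is coordination and communication overhead that grows with $N$. The 10-machines-in-10-hours versus 100-machines-in-1-hour ideal holds only while $C(N)$ stays negligible relative to $W/N$.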
Motivation for MapReduce
Still waiting for parallel processing to replace sequential processing
Progress of Moore's law – most problems could be solved by a single computer, so parallelism was largely ignored
Around 2005, this was no longer true
– The semiconductor industry ran out of opportunities to improve single-core performance: faster clocks, deeper pipelines, superscalar architectures
– Then came multi-core, not matched by advances in software
Motivation
Parallel processing is the only way forward
MapReduce to the rescue
– Anyone can download the open-source Hadoop implementation of MapReduce
– Rent a cluster from a utility cloud
– Process TBs within the week
Multiple cores in a chip, multiple machines in a cluster
Motivation
MapReduce: an effective data analysis tool
– First widely adopted step away from the von Neumann model
– Can't treat a multi-core processor or a cluster as a conglomeration of many von Neumann machine images communicating over a network – that's the wrong abstraction
MR – organize computations not over individual machines, but over entire clusters
– The datacenter is the computer
Motivation
Previous models of parallel computation
– PRAM: an arbitrary number of processors sharing an unboundedly large memory, operating synchronously on a shared input
– LogP, BSP
MR is the most successful abstraction for large-scale computing resources
– Manages complexity, hides details, presents well-defined behavior
– Makes certain tasks easier, but others harder
MapReduce is the first in a new class of programming models