Data-Intensive Text Processing with MapReduce J. Lin & C. Dyer Chapter 1.


1 Data-Intensive Text Processing with MapReduce J. Lin & C. Dyer Chapter 1

2 MapReduce
- Programming model for distributed computations on massive amounts of data
- Execution framework for large-scale data processing on clusters of commodity servers
- Developed by Google; built on long-established principles of parallel and distributed processing
- Hadoop: open-source implementation, adopted by Yahoo (now an Apache project)

3 Big Data
- Big data is an issue everyone must grapple with
- "Web-scale" has become synonymous with data-intensive processing
- Vast public and private data repositories
- Behavioral data is especially important for business intelligence (BI)

4 The 4th Paradigm
- Manipulating, exploring, and mining massive data is the 4th paradigm of science (after theory, experiments, and simulations)
- In CS, systems must be able to scale
- Increases in storage capacity have outpaced improvements in bandwidth

5 Problems/Solutions: NLP and IR
- Data-driven algorithmic approach: capture the statistical regularities of language
- Data: corpora (NLP), collections (IR)
- Representations of the data: features (superficial or deep)
- Method: algorithms
- Examples: is an email spam or not? Is a word part of an address or a location? (See the sketch below.)
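A minimal sketch of the data-driven recipe above, using a toy corpus, bag-of-words features, and a Naive Bayes classifier (the corpus, the feature choice, and the classifier are illustrative assumptions, not from the book):

```python
from collections import Counter
import math

# Toy labeled corpus (invented for illustration): data + features + algorithm.
train = [
    ("win money now", "spam"),
    ("limited offer win prize", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

# Superficial features: per-class bag-of-words counts.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def log_prob(text, label):
    """Naive Bayes log-score with add-one smoothing for unseen words."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for w in text.split():
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

msg = "win a prize now"
print(max(("spam", "ham"), key=lambda c: log_prob(msg, c)))  # -> spam
```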

6 Problems/Solutions
- Who shot Lincoln?
- NLP approach: sophisticated linguistic (syntactic, semantic) analysis
- Redundancy-based approach (circa 2001): find occurrences of "shot Lincoln" and tally up the words that appear to their left
- Language modeling: a probability distribution over sequences of words
- Training and smoothing
- Markov assumption: in an n-gram language model, the conditional probability of a word depends only on the n-1 previous words (see the bigram sketch below)
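A minimal bigram (n = 2) sketch of the Markov assumption, with maximum-likelihood estimates over a toy corpus (the corpus is invented for illustration; real language models train on far larger text and need smoothing for unseen n-grams):

```python
from collections import Counter

# Toy corpus; real language models train on web-scale text.
corpus = "who shot lincoln booth shot lincoln".split()

# Markov assumption for n = 2: P(w_i | w_1..w_{i-1}) ~= P(w_i | w_{i-1}).
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p(word, prev):
    """MLE bigram probability; unsmoothed, so unseen pairs get probability 0."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("lincoln", "shot"))  # 2/2 = 1.0: in this corpus "shot" is always followed by "lincoln"
```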

7 MapReduce (MR)
- MapReduce provides a level of abstraction and a beneficial division of labor
- Programming model: a powerful abstraction that separates the what from the how of data-intensive processing (see the word-count sketch below)
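The canonical illustration of this what/how split is word count: the programmer writes only a mapper and a reducer, and the execution framework handles everything else. A single-process Python simulation (the `run` driver below is a toy stand-in for the real framework, which shuffles intermediate pairs across a cluster):

```python
from collections import defaultdict

# The "what": all the programmer writes is a mapper and a reducer.
def mapper(doc_id, text):
    for word in text.split():
        yield (word, 1)

def reducer(word, counts):
    yield (word, sum(counts))

# The "how": a toy in-process stand-in for the execution framework,
# which in reality distributes map tasks, shuffles by key, and runs reducers.
def run(inputs):
    groups = defaultdict(list)
    for doc_id, text in inputs:
        for k, v in mapper(doc_id, text):
            groups[k].append(v)          # group intermediate values by key
    for k in sorted(groups):
        yield from reducer(k, groups[k])

docs = [(1, "the quick fox"), (2, "the lazy dog")]
print(dict(run(docs)))  # {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```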

8 Big Ideas behind MapReduce
- Scale out, not up
- Purchasing symmetric multi-processing (SMP) machines with a large number of processor sockets (100s) and large shared memory is not cost effective
- Why? A machine with 2x the processors costs far more than 2x as much
- Barroso & Hölzle's analysis using TPC benchmarks: communication within an SMP is an order of magnitude faster, yet a cluster of low-end machines is about 4x more cost effective than the high-end approach
- However, even low-end machines run at only 10-50% utilization, which is not energy efficient

9 Big Ideas behind MapReduce
- Assume failures are common
- If each cluster machine has a mean time between failures of 1,000 days, a 10,000-server cluster sees about 10 failures a day (see the check below)
- MR is designed to cope with failure
- Move processing to the data
- MR assumes an architecture where processors and storage are co-located: run the code on the processor attached to the data
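A back-of-envelope check of the failure arithmetic, using the numbers on the slide:

```python
servers = 10_000
mtbf_days = 1_000              # assumed mean time between failures per machine
print(servers / mtbf_days)     # 10.0 expected machine failures per day
```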

10 Big Ideas behind MapReduce
- Process data sequentially, not randomly
- Consider a 1 TB database of 10^10 100-byte records
- Randomly updating 1% of the records takes about a month
- Reading the entire database sequentially and rewriting all records with the updates takes less than one work day on a single machine (see the estimate below)
- Solid-state storage won't change this fundamentally
- MR is designed for batch processing: trade latency for throughput
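A back-of-envelope version of this estimate, assuming roughly 10 ms per random disk access and 100 MB/s of sequential throughput (these rates are illustrative assumptions, not from the slide):

```python
records = 10**10                 # 100-byte records -> 1 TB total
updates = records // 100         # update 1% of the records

seek_s = 0.01                    # assumed ~10 ms per random access
random_days = updates * seek_s / 86_400
print(f"random updates: ~{random_days:.0f} days")
# ~12 days of pure seeking; roughly a month if each update needs a read and a write

seq_mb_per_s = 100               # assumed sequential throughput
scan_rewrite_hours = 2 * (10**6 / seq_mb_per_s) / 3_600   # read 1 TB, then write 1 TB
print(f"sequential scan + rewrite: ~{scan_rewrite_hours:.1f} hours")
# ~5.6 hours: comfortably under one work day
```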

11 Big Ideas behind MapReduce
- Hide system-level details from the application developer
- Writing distributed programs is difficult: details span threads, processes, and machines, and concurrently running code is unpredictable (deadlocks, race conditions, etc.)
- MR isolates the developer from system-level details: no locking, starvation, etc.
- Well-defined interfaces separate the what (the programmer) from the how (the responsibility of the execution framework)
- The framework is designed once and verified for correctness

12 Big Ideas behind MapReduce
- Seamless scalability
- Ideal: given 2x the data, the algorithm takes at most 2x as long to run; given a cluster 2x as large, it takes half the time
- The above is unobtainable for most algorithms: nine women can't have a baby in one month
- E.g., running on 2x the machines can even take longer, since a higher degree of parallelization increases communication (see the sketch below)
- MR is a small step toward attaining it: the algorithm stays fixed, and the framework executes it at whatever scale is available
- Ideally, if 10 machines take 10 hours, 100 machines take 1 hour
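One standard way to formalize why ideal speedup is unobtainable is Amdahl's law, which is not named on the slide but captures the communication/coordination point: if a fraction s of the work is inherently serial, p machines can speed it up by at most 1 / (s + (1 - s)/p).

```python
def amdahl_speedup(p, serial_fraction):
    """Upper bound on speedup with p machines when a fraction of work is serial."""
    return 1 / (serial_fraction + (1 - serial_fraction) / p)

# Assuming even a modest 5% serial/coordination share:
print(amdahl_speedup(10, 0.05))    # ~6.9x, not the ideal 10x
print(amdahl_speedup(100, 0.05))   # ~16.8x, nowhere near 100x
```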

13 Motivation for MapReduce
- We have long been waiting for parallel processing to replace sequential processing
- Thanks to the progress of Moore's law, most problems could be solved by a single computer, so parallelism was largely ignored
- Around 2005 this was no longer true: the semiconductor industry ran out of opportunities to improve performance through faster clocks, deeper pipelines, and superscalar architectures
- Then came multi-core, which has not been matched by advances in software

14 Motivation
- Parallel processing is the only way forward: multiple cores on a chip, multiple machines in a cluster
- MapReduce to the rescue
- Anyone can download the open-source Hadoop implementation of MapReduce
- Rent a cluster from a utility cloud
- Process terabytes within the week

15 Motivation
- MapReduce is an effective data analysis tool
- The first widely adopted step away from the von Neumann model
- We can't treat a multi-core processor or a cluster as a conglomeration of many von Neumann machine images communicating over a network; that is the wrong abstraction
- MR organizes computations not over individual machines but over entire clusters
- "The datacenter is the computer"

16 Motivation
- Previous models of parallel computation:
- PRAM: an arbitrary number of processors sharing an unboundedly large memory, operating synchronously on a shared input
- LogP, BSP
- MR is the most successful abstraction for large-scale resources to date
- It manages complexity, hides details, and presents well-defined behavior
- It makes certain tasks easier and others harder
- MapReduce is the first in a new class of programming models

