Data-Intensive Text Processing with MapReduce J. Lin & C. Dyer Chapter 1.


MapReduce
– Programming model for distributed computations on massive amounts of data
– Execution framework for large-scale data processing on clusters of commodity servers
– Developed by Google – built on well-established principles of parallel and distributed processing
– Hadoop – open-source implementation, adopted by Yahoo! (now an Apache project)

Big Data
– Big data – an issue we must grapple with
– Web-scale processing is synonymous with data-intensive processing
– Vast data held in public and private repositories
– Behavioral data is important for business intelligence (BI)

4th Paradigm
– Manipulating, exploring, and mining massive data – the 4th paradigm of science (after theory, experiment, and simulation)
– In CS, systems must be able to scale
– Increases in storage capacity have outpaced improvements in bandwidth

Problems/Solutions: NLP and IR
– Data-driven, algorithmic approach to capture statistical regularities
– Data – corpora (NLP), collections (IR)
– Representations of data – features (superficial, deep)
– Method – algorithms
– Examples – is a message spam or not; is a word part of an address or a location

Problems/Solutions
– Who shot Lincoln?
– NLP – sophisticated linguistics: syntactic and semantic analysis
– Alternative: tally up what appears to the left of "shot Lincoln" – a redundancy-based approach
– Probability distribution over sequences of words – training, smoothing
– Markov assumption: in an n-gram language model, the conditional probability of a word is given by the n-1 previous words
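The Markov assumption above can be illustrated with a toy bigram (n = 2) model. This is a minimal sketch, not from the book: the corpus, function names, and add-k smoothing constant are all illustrative.

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count unigrams and bigrams to estimate P(word | previous word)."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            unigrams[prev] += 1          # count of the conditioning word
            bigrams[(prev, cur)] += 1    # count of the word pair
    return unigrams, bigrams

def prob(unigrams, bigrams, prev, cur, vocab_size, k=1.0):
    """Add-k smoothed conditional probability P(cur | prev)."""
    return (bigrams[(prev, cur)] + k) / (unigrams[prev] + k * vocab_size)

corpus = ["who shot lincoln", "booth shot lincoln"]
uni, bi = train_bigram(corpus)
vocab = {w for s in corpus for w in s.split()} | {"<s>", "</s>"}
p = prob(uni, bi, "shot", "lincoln", len(vocab))  # P(lincoln | shot)
```

Smoothing matters because unseen n-grams would otherwise get probability zero; add-k is the simplest of the smoothing methods the slide alludes to.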

MapReduce (MR)
– MapReduce – a level of abstraction and a beneficial division of labor
– Programming model – a powerful abstraction that separates the what from the how of data-intensive processing
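The what/how separation can be made concrete with the canonical word-count example. The single-process runner below is a toy stand-in for the execution framework, not Hadoop: real frameworks handle partitioning, shuffling across machines, and failures.

```python
from itertools import groupby
from operator import itemgetter

# The "what": the programmer supplies only map and reduce functions.
def map_fn(doc):
    for word in doc.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

# The "how": a toy, single-process execution framework.
def run_mapreduce(docs, map_fn, reduce_fn):
    intermediate = [kv for doc in docs for kv in map_fn(doc)]
    intermediate.sort(key=itemgetter(0))  # the shuffle/sort phase
    out = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_fn(key, (v for _, v in group)):
            out[k] = v
    return out

counts = run_mapreduce(["to be or not to be"], map_fn, reduce_fn)
```

The programmer never touches the sort, grouping, or (in a real system) the distribution of work: that is the division of labor the slide describes.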

Big Ideas behind MapReduce: Scale out, not up
– Purchasing symmetric multi-processing (SMP) machines with a large number of processor sockets (100s) and large shared memory is not cost effective
– Why? A machine with 2x the processors costs more than 2x as much
– Barroso & Hölzle analysis using TPC benchmarks: within an SMP, communication is an order of magnitude faster
– A cluster of low-end machines is ~4x more cost effective than the high-end approach
– However, even low-end clusters see only 10–50% utilization – not energy efficient

Big Ideas behind MapReduce: Assume failures are common
– Assume cluster machines have a mean time between failures of 1,000 days
– In a 10,000-server cluster, that means ~10 failures a day
– MR copes with failure
Move processing to the data
– MR assumes an architecture where processors and storage are co-located
– Run code on the processor attached to the data
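The failure arithmetic above is a simple ratio; a minimal sketch (assuming independent failures, as the slide implicitly does):

```python
def expected_failures_per_day(cluster_size, mtbf_days):
    """With independent failures, each machine fails on a given day
    with probability ~1/MTBF, so expected daily failures scale
    linearly with cluster size."""
    return cluster_size / mtbf_days

daily = expected_failures_per_day(10_000, 1_000)  # the slide's scenario
```

At this rate a framework that restarts the whole job on any failure would rarely finish, which is why MR must tolerate failures as a matter of course.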

Big Ideas behind MapReduce: Process data sequentially, not randomly
– Consider a 1 TB database of 10^10 100-byte records
– Updating 1% of the records with random access takes about a month
– Reading the entire DB sequentially and rewriting all records with the updates takes less than 1 work day on a single machine
– Solid-state storage won't fundamentally change this
– MR is designed for batch processing – trade latency for throughput
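The month-versus-day claim is a back-of-envelope calculation. The disk figures below (10 ms per random seek, 100 MB/s sequential throughput) are typical magnetic-disk assumptions, not numbers from the text:

```python
# Assumed disk characteristics (illustrative, not from the slides)
SEEK_S = 0.010            # ~10 ms per random seek
SEQ_BYTES_PER_S = 100e6   # ~100 MB/s sequential throughput

records = 10**10
record_bytes = 100
updates = records // 100  # update 1% of the records

# Random access: roughly one seek per updated record
random_s = updates * SEEK_S

# Sequential: read the whole 1 TB once and rewrite it once
sequential_s = 2 * records * record_bytes / SEQ_BYTES_PER_S

random_days = random_s / 86_400
sequential_hours = sequential_s / 3_600
```

Under these assumptions random access takes on the order of weeks while the full sequential rewrite finishes in hours, which is the latency-for-throughput trade the slide describes.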

Big Ideas behind MapReduce: Hide system-level details from the application developer
– Writing distributed programs is difficult: details span threads, processes, and machines, and the order in which concurrent code runs is unpredictable – deadlocks, race conditions, etc.
– MR isolates the developer from system-level details: no locking, starvation, etc.
– Well-defined interfaces separate the what (programmer) from the how (responsibility of the execution framework)
– The framework is designed once and verified for correctness

Big Ideas behind MapReduce: Seamless scalability
– Given 2x the data, an algorithm should take at most 2x as long to run
– Given a cluster 2x as large, it should take half the time to run
– The above is unobtainable for most algorithms: 9 women can't have a baby in 1 month, and doubling the machines can even make a program take longer, since a higher degree of parallelization increases communication
– MR is a small step toward attaining it: the algorithm stays fixed and the framework executes it – if 10 machines take 10 hours, 100 machines take 1 hour
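The gap between the ideal (100 machines finishing in 1 hour what 10 machines do in 10) and reality can be sketched with an Amdahl-style model. The 5% serial fraction below is an illustrative assumption, not a figure from the text:

```python
def runtime(total_work_hours, machines, serial_fraction=0.0):
    """Amdahl-style model: the serial part of the job does not
    parallelize, so it is paid in full regardless of cluster size."""
    serial = total_work_hours * serial_fraction
    parallel = total_work_hours * (1 - serial_fraction)
    return serial + parallel / machines

# Ideal linear scaling, as on the slide: 100 machine-hours of work
ideal_10 = runtime(100, 10)    # 10 machines -> 10 hours
ideal_100 = runtime(100, 100)  # 100 machines -> 1 hour

# With even 5% serial work (e.g., coordination and communication),
# 100 machines fall well short of the 10x speedup over 10 machines:
real_100 = runtime(100, 100, serial_fraction=0.05)
```

This is why seamless scalability is "unobtainable" in general: any fixed serial or communication cost bounds the achievable speedup, no matter how many machines are added.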

Motivation for MapReduce
– We are still waiting for parallel processing to replace sequential processing
– While Moore's law held, most problems could be solved by a single computer, so parallelism was largely ignored
– Around 2005, this was no longer true – the semiconductor industry ran out of opportunities to improve: faster clocks, deeper pipelines, superscalar architectures
– Then came multi-core – not matched by advances in software

Motivation
– Parallel processing is the only way forward
– MapReduce to the rescue: anyone can download the open-source Hadoop implementation of MapReduce, rent a cluster from a utility cloud, and process TBs of data within the week
– Multiple cores in a chip, multiple machines in a cluster

Motivation
– MapReduce: an effective data analysis tool – the first widely-adopted step away from the von Neumann model
– We can't treat a multi-core processor or a cluster as a conglomeration of many von Neumann machines communicating over a network – that's the wrong abstraction
– MR organizes computations not over individual machines, but over clusters
– The datacenter is the computer

Motivation Previous models of parallel computation – PRAM Arbitrary number of processors, share unbounded large memory, operate synchronously on shared input – LogP, BSP MR most successful abstraction for large-scale resources – Manages complexity, hides details, presents well-defined behavior – Makes certain tasks easier, others harder MapReduce first in new class of programming models