Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative.

Presentation transcript:

Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See for details Chris Dyer Department of Linguistics University of Maryland Tutorial at 2009 North American Chapter of the Association for Computational Linguistics―Human Language Technologies Conference (NAACL HLT 2009) (Bonus session)

Agenda
- Hadoop “nuts and bolts”
- “Hello World” Hadoop example (distributed word count)
- Running Hadoop in “standalone” mode
- Running Hadoop on EC2
- Open-source Hadoop ecosystem
- Exercises and “office hours”

Hadoop “nuts and bolts”

Hadoop Zen
- Don’t get frustrated (take a deep breath)…
- Remember this when you experience those moments
- This is bleeding edge technology:
  - Lots of bugs
  - Stability issues
  - Even lost data
  - To upgrade or not to upgrade (damned either way)?
  - Poor documentation (or none)
- But… Hadoop is the path to data nirvana?

Cloud 9
- Library used for teaching cloud computing courses at Maryland
- Demos, sample code, etc.
  - Computing conditional probabilities
  - Pairs vs. stripes
  - Complex data types
  - Boilerplate code for working with various IR collections
- Dog food for research
- Open source, anonymous svn access
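
The "pairs vs. stripes" and "conditional probabilities" demos named above can be illustrated with a small in-memory sketch. This is not Cloud 9 code: the function names are hypothetical, and treating every other word in the same sentence as a neighbor is an assumption made only for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def pairs_counts(sentences):
    """Pairs: keep one count per (word, neighbor) pair."""
    counts = Counter()
    for words in sentences:
        # Assumption: every other word in the sentence is a neighbor.
        for a, b in permutations(words, 2):
            counts[(a, b)] += 1
    return counts


def stripes_counts(sentences):
    """Stripes: keep one associative array of neighbor counts per word."""
    stripes = defaultdict(Counter)
    for words in sentences:
        for a, b in permutations(words, 2):
            stripes[a][b] += 1
    return stripes


def conditional(stripes, a, b):
    """Relative frequency f(b | a) = count(a, b) / total count of a's neighbors."""
    total = sum(stripes[a].values())
    return stripes[a][b] / total if total else 0.0


sentences = [["a", "b", "c"], ["a", "b"]]
pairs = pairs_counts(sentences)
stripes = stripes_counts(sentences)
```

With stripes, everything needed to normalize the counts for one word sits in a single record, which is why conditional probabilities are simple to compute in that representation.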

Diagram labels: Client; Master node (JobTracker); Slave node (TaskTracker)

From Theory to Practice
Actors: You, Hadoop Cluster
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster

Data Types in Hadoop
- Writable: Defines a de/serialization protocol. Every data type in Hadoop is a Writable.
- WritableComparable: Defines a sort order. All keys must be of this type (but not values).
- IntWritable, LongWritable, Text, …: Concrete classes for different data types.
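
The two contracts can be sketched outside Hadoop. The Python class below is a hypothetical stand-in, not the Hadoop API: IntValue, write and read_fields are illustrative names that only mirror the idea of a type that serializes itself to a byte stream, reads itself back, and defines a sort order so it can serve as a key.

```python
import io
import struct
from functools import total_ordering


@total_ordering
class IntValue:
    """Hypothetical stand-in for an IntWritable-style key type."""

    def __init__(self, value=0):
        self.value = value

    def write(self, out):
        # Serialize as a 4-byte big-endian signed integer.
        out.write(struct.pack(">i", self.value))

    def read_fields(self, inp):
        # Deserialize in place, overwriting the current value.
        (self.value,) = struct.unpack(">i", inp.read(4))

    # Sort order: this is what makes the type usable as a key.
    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        return self.value < other.value

    def __hash__(self):
        return hash(self.value)


buf = io.BytesIO()
IntValue(42).write(buf)
buf.seek(0)
copy = IntValue()
copy.read_fields(buf)
ordered = [v.value for v in sorted([IntValue(3), IntValue(1), IntValue(2)])]
```

Reading into an existing object, rather than constructing a new one, follows the readFields convention that the next slide lists among the methods a custom type must implement.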

Complex Data Types in Hadoop
How do you implement complex data types?
- The easiest way:
  - Encode it as Text, e.g., (a, b) = “a:b”
  - Use regular expressions to parse and extract data
  - Works, but pretty hack-ish
- The hard way:
  - Define a custom implementation of WritableComparable
  - Must implement: readFields, write, compareTo
  - Computationally efficient, but slow for rapid prototyping
- Alternatives:
  - Cloud 9 offers two other choices: Tuple and JSON
  - Plus, a number of frequently-used data types
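
A minimal sketch of the "easiest way", written in Python for illustration. The helper names encode and decode are hypothetical, and rejecting components that contain the delimiter is an assumption added here, not something the slide specifies.

```python
import re

# One pair per string: two components separated by a single colon.
PAIR = re.compile(r"([^:]*):([^:]*)")


def encode(a, b):
    """Encode the pair (a, b) as the text 'a:b'."""
    if ":" in a or ":" in b:
        # Assumption: refuse components that contain the delimiter,
        # since they could not be parsed back unambiguously.
        raise ValueError("components must not contain ':'")
    return f"{a}:{b}"


def decode(text):
    """Parse 'a:b' back into the pair (a, b) with a regular expression."""
    match = PAIR.fullmatch(text)
    if match is None:
        raise ValueError(f"not an encoded pair: {text!r}")
    return match.group(1), match.group(2)


round_trip = decode(encode("a", "b"))
# Text keys sort as strings, so numeric components sort unexpectedly.
text_order = sorted([encode("9", "x"), encode("10", "x")])
```

The last line hints at why this approach is hack-ish: encoded keys are compared as text, so "10:x" sorts before "9:x" unless the numbers are padded or a custom type with its own compareTo is used.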

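Data flow: Input file (on HDFS) → InputSplit → RecordReader → Mapper → Partitioner → Reducer → RecordWriter → Output file (on HDFS)
InputFormat covers the input side (InputSplit, RecordReader); OutputFormat covers the output side (RecordWriter).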
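
The stages in this data flow can be imitated in memory. The sketch below is a toy model, not Hadoop code: all function names are hypothetical, the job (counting distinct words per first letter) is made up for the example, and the byte-sum partitioner is a deterministic stand-in for a hash partitioner.

```python
from collections import defaultdict


def record_reader(split):
    """RecordReader stand-in: turn a split (list of lines) into (line_no, line) records."""
    for line_no, line in enumerate(split):
        yield line_no, line


def mapper(key, line):
    """Emit (first letter, word) for every word in the line."""
    for word in line.split():
        yield word[0], word


def partitioner(key, num_reducers):
    """Deterministic stand-in for a hash partitioner: pick the reducer for a key."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Emit the number of distinct words seen for this key."""
    yield key, len(set(values))


def run_job(splits, num_reducers):
    # Map phase: each split is read and mapped, and every intermediate
    # pair is routed to the partition of the reducer that owns its key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, line in record_reader(split):
            for out_key, out_value in mapper(key, line):
                partitions[partitioner(out_key, num_reducers)][out_key].append(out_value)
    # Reduce phase: each reducer sees its keys in sorted order and writes
    # one output "file" (here, a list of tab-separated lines).
    outputs = []
    for partition in partitions:
        part_file = []
        for key in sorted(partition):
            for out_key, out_value in reducer(key, partition[key]):
                part_file.append(f"{out_key}\t{out_value}")
        outputs.append(part_file)
    return outputs


splits = [["apple avocado banana"], ["blueberry apple cherry"]]
output = run_job(splits, num_reducers=2)
```

All values for one key reach the same reducer, and the job produces one output list per reducer, which mirrors how a job writes one output file per reduce task.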

What version should I use?

“Hello World” Hadoop example
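
The agenda identifies this example as distributed word count. The Python sketch below shows only the logic of the two functions and runs them in a single process; it is not the code from the tutorial. The names map_fn, reduce_fn and word_count are hypothetical, and lowercasing and whitespace tokenization are assumptions.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    yield word, sum(counts)


def word_count(documents):
    # Shuffle stand-in: group all emitted values by key.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in map_fn(doc_id, text):
            grouped[word].append(one)
    # Run the reducer once per distinct word.
    result = {}
    for word in sorted(grouped):
        for key, total in reduce_fn(word, grouped[word]):
            result[key] = total
    return result


counts = word_count({"d1": "hello world", "d2": "Hello Hadoop"})
```

In a real job, the grouping step in word_count is done by the framework between the map and reduce phases, so only the two small functions are written by the programmer.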

Hadoop in “standalone” mode

Hadoop in EC2

On Amazon: With EC2
Actors: You, EC2, Your Hadoop Cluster
0. Allocate Hadoop cluster
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
7. Clean up!
Uh oh. Where did the data go?

On Amazon: EC2 and S3
Diagram labels: EC2 (The Cloud), containing Your Hadoop Cluster; S3 (Persistent Store)
- Copy from S3 to HDFS
- Copy from HDFS to S3

Open-source Hadoop ecosystem

Hadoop/HDFS

Hadoop streaming
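
Hadoop streaming runs the mapper and reducer as external programs that read lines on standard input and write tab-separated key/value lines on standard output, with the reducer's input sorted by key. The sketch below imitates that protocol for word count using in-memory streams. The function names are hypothetical and simulate only stands in for the framework's sort between the two phases.

```python
import io
from itertools import groupby


def run_mapper(stdin, stdout):
    """Mapper: read text lines, write one 'word<TAB>1' line per word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


def run_reducer(stdin, stdout):
    """Reducer: input lines arrive sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")


def simulate(text):
    """Stand-in for the framework: map, sort the lines by key, then reduce."""
    mapped = io.StringIO()
    run_mapper(io.StringIO(text), mapped)
    shuffled = "".join(sorted(mapped.getvalue().splitlines(keepends=True)))
    reduced = io.StringIO()
    run_reducer(io.StringIO(shuffled), reduced)
    return reduced.getvalue()


result = simulate("b a\na c\n")
```

In a script meant for actual streaming use, the same two functions would be called with sys.stdin and sys.stdout in separate mapper and reducer programs.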

HDFS/FUSE

EC2/S3/EBS

EMR

Pig

HBase

Hypertable

Hive

Mahout

Cassandra

Dryad

CUDA

CELL

Beware of toys!

Exercises

Questions? Comments? Thanks to the organizations who support our work: