Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative.

Similar presentations


Presentation on theme: "Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative."— Presentation transcript:

1 Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details Chris Dyer Department of Linguistics University of Maryland Tutorial at 2009 North American Chapter of the Association for Computational Linguistics―Human Language Technologies Conference (NAACL HLT 2009) (Bonus session)

2 Agenda Hadoop “nuts and bolts” “Hello World” Hadoop example (distributed word count) Running Hadoop in “standalone” mode Running Hadoop on EC2 Open-source Hadoop ecosystem Exercises and “office hours”

3 Hadoop “nuts and bolts”

4 Source: http://davidzinger.wordpress.com/2007/05/page/2/

5 Hadoop Zen Don’t get frustrated (take a deep breath)… Remember this when you experience those W$*#T@F! moments This is bleeding edge technology: Lots of bugs Stability issues Even lost data To upgrade or not to upgrade (damned either way)? Poor documentation (or none) But… Hadoop is the path to data nirvana?

6 Cloud 9 Library used for teaching cloud computing courses at Maryland Demos, sample code, etc. Computing conditional probabilities Pairs vs. stripes Complex data types Boilerplate code for working various IR collections Dog food for research Open source, anonymous svn access

7 JobTracker TaskTracker Master node Slave node Client

8 From Theory to Practice Hadoop Cluster You 1. Scp data to cluster 2. Move data into HDFS 3. Develop code locally 4. Submit MapReduce job 4a. Go back to Step 3 5. Move data out of HDFS 6. Scp data from cluster

9 Data Types in Hadoop WritableDefines a de/serialization protocol. Every data type in Hadoop is a Writable. WritableComprableDefines a sort order. All keys must be of this type (but not values). IntWritable LongWritable Text … Concrete classes for different data types.

10 Complex Data Types in Hadoop How do you implement complex data types? The easiest way: Encoded it as Text, e.g., (a, b) = “a:b” Use regular expressions to parse and extract data Works, but pretty hack-ish The hard way: Define a custom implementation of WritableComprable Must implement: readFields, write, compareTo Computationally efficient, but slow for rapid prototyping Alternatives: Cloud 9 offers two other choices: Tuple and JSON Plus, a number of frequently-used data types

11 Input file (on HDFS) InputSplit RecordReader Mapper Partitioner Reducer RecordWriter Output file (on HDFS) InputFormat OutputFormat

12 What version should I use?

13 “Hello World” Hadoop example

14 Hadoop in “standalone” mode

15 Hadoop in EC2

16 From Theory to Practice Hadoop Cluster You 1. Scp data to cluster 2. Move data into HDFS 3. Develop code locally 4. Submit MapReduce job 4a. Go back to Step 3 5. Move data out of HDFS 6. Scp data from cluster

17 On Amazon: With EC2 You 1. Scp data to cluster 2. Move data into HDFS 3. Develop code locally 4. Submit MapReduce job 4a. Go back to Step 3 5. Move data out of HDFS 6. Scp data from cluster 0. Allocate Hadoop cluster EC2 Your Hadoop Cluster 7. Clean up! Uh oh. Where did the data go?

18 On Amazon: EC2 and S3 Your Hadoop Cluster S3 (Persistent Store) EC2 (The Cloud) Copy from S3 to HDFS Copy from HFDS to S3

19 Open-source Hadoop ecosystem

20 Hadoop/HDFS

21 Hadoop streaming

22 HDFS/FUSE

23 EC2/S3/EBS

24 EMR

25 Pig

26 HBase

27 Hypertable

28 Hive

29 Mahout

30 Cassandra

31 Dryad

32 CUDA

33 CELL

34 Beware of toys!

35 Exercises

36 Questions? Comments? Thanks to the organizations who support our work:


Download ppt "Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative."

Similar presentations


Ads by Google