Published by Anissa Wilkerson. Modified over 9 years ago.
GOOGLE N-GRAMS ON AMAZON WEB SERVICES, PART 3
Thomas Tiahrt, MA, PhD
Computer Science 482 – Introduction to Text Analytics
Slide 2: Version 1
Data created July 2009
Version 1 file format:
  ngram \t year \t match_count \t page_count \t volume_count \n
Fields:
  ngram is the 1-gram, 2-gram, 3-gram, 4-gram, or 5-gram
  year is the publication year
  match_count is the number of occurrences for that year
  page_count is the number of pages on which the n-gram appeared
  volume_count is the number of books in which the n-gram occurred
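The record layout above can be read with a few lines of Python. This is a minimal sketch, assuming each line holds exactly the five tab-separated fields listed; the `NgramRecord` and `parse_record` names are hypothetical, and the sample line is made up for illustration rather than taken from the dataset.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One Version 1 line: an n-gram and its counts for a single year."""
    ngram: str
    year: int
    match_count: int
    page_count: int
    volume_count: int


def parse_record(line: str) -> NgramRecord:
    """Split one Version 1 line into its five tab-separated fields."""
    ngram, year, match_count, page_count, volume_count = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(match_count), int(page_count), int(volume_count))


# Made-up sample line, for illustration only (not real dataset values).
sample = "analysis is often described as\t1991\t1\t1\t1\n"
record = parse_record(sample)
print(record.ngram, record.year, record.match_count)
```

Only the n-gram itself contains spaces; the fields are separated by tabs, so splitting on "\t" keeps the words of a multi-word n-gram together.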
Slide 3: AWS Public Dataset
http://aws.amazon.com/datasets/8172056142375670
Stored in AWS Simple Storage Service (S3)
Slide 4: AWS Public Dataset
Stored as compressed data
Luckily, Hadoop supports:
  GZIP
  BZIP2
  LZO (see below)
  DEFLATE (zlib implementation)
But Hadoop does not support WinZip
And Hadoop supports LZO only if you build a version of Hadoop with it yourself
Slide 5: Hadoop Compression Formats
Compression Format  Tool          Algorithm  Filename Extension  Multiple files?  Able to be Split?
DEFLATE (zlib)      No CLI tools  DEFLATE    .deflate            No               No
gzip                gzip          DEFLATE+   .gz                 No               No
bzip2               bzip2         bzip2      .bz2                No               Yes
LZO                 lzop          LZO        .lzo                No               No
Source: Hadoop: The Definitive Guide
Slide 6: Hadoop Compression Formats
Compression Format  Hadoop Codec Class
DEFLATE (zlib)      org.apache.hadoop.io.compress.DefaultCodec
gzip                org.apache.hadoop.io.compress.GzipCodec
bzip2               org.apache.hadoop.io.compress.BZip2Codec
LZO                 com.hadoop.compression.lzo.LzopCodec
Source: Hadoop: The Definitive Guide
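Hadoop chooses a codec from the filename extension of the input. The sketch below mirrors that lookup in Python for illustration only; it is not Hadoop code. The class names are the codec classes as they exist in Hadoop and hadoop-lzo (BZip2Codec for bzip2), and `CODEC_BY_EXTENSION`, `codec_for`, and the example filenames are hypothetical.

```python
import os

# Filename extension -> Hadoop codec class name.
CODEC_BY_EXTENSION = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".lzo": "com.hadoop.compression.lzo.LzopCodec",
}


def codec_for(filename):
    """Return the codec class name for a filename, or None if uncompressed/unknown."""
    extension = os.path.splitext(filename)[1].lower()
    return CODEC_BY_EXTENSION.get(extension)


# Hypothetical filenames, for illustration only.
print(codec_for("5gram-part-0001.gz"))
print(codec_for("5gram-part-0001.bz2"))
print(codec_for("5gram-part-0001.txt"))
```

A `.zip` file returns None here, which matches the point that Hadoop does not read WinZip archives as compressed input.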
Slide 7: Project Assignment I
Use the nwcdatabucket as the bucket for input
Use the tmp folder in nwcdatabucket
  Input is nwcdatabucket/tmp
Write Python code (in more than one .py file)
Find the twenty most frequently occurring 5-grams for a 10-year period
  You may hard-code the 10-year period, e.g. 1950 to 1959
  You need not worry about error checking the range
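One possible shape for the map step is sketched below. It assumes the input lines follow the Version 1 layout (five tab-separated fields), that "most frequently occurring" is measured by match_count, and that the job runs under Hadoop Streaming, where a mapper reads lines and writes key/value lines separated by a tab. The `map_lines` name and the sample records are hypothetical.

```python
START_YEAR, END_YEAR = 1950, 1959  # hard-coded 10-year period, inclusive


def map_lines(lines):
    """Yield 'ngram<TAB>match_count' for records inside the year range."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip lines that do not match the Version 1 layout
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        if START_YEAR <= year <= END_YEAR:
            yield f"{ngram}\t{match_count}"


# Made-up sample records (not real dataset values).
SAMPLE = [
    "the end of the war\t1949\t10\t9\t8\n",
    "the end of the war\t1950\t12\t11\t10\n",
    "the end of the war\t1959\t7\t7\t6\n",
    "the end of the war\t1960\t5\t5\t5\n",
]

# In a Hadoop Streaming job, sys.stdin would take the place of SAMPLE.
for out in map_lines(SAMPLE):
    print(out)
```

Filtering by year in the mapper keeps records outside the period from ever reaching the reduce step.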
Slide 8: Project Assignment II
Setting reducers
  Use the extra arguments at the bottom of the first page
  The following creates 1 reducer:
    -D mapred.reduce.tasks=1
Upload your results as a text file
Upload your Python code modules
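With a single reducer, every key reaches the same process, so one script can compute a global top twenty. The following is a minimal sketch of such a reduce step, assuming its input lines have the form n-gram, tab, count; the `top_ngrams` name and the sample lines are hypothetical, and keeping all totals in a dictionary is a simplification.

```python
import heapq
from collections import defaultdict

TOP_N = 20


def top_ngrams(lines, n=TOP_N):
    """Sum the counts per n-gram and return the n largest (ngram, total) pairs."""
    totals = defaultdict(int)
    for line in lines:
        # Split on the last tab so spaces inside the n-gram are untouched.
        ngram, count = line.rstrip("\n").rsplit("\t", 1)
        totals[ngram] += int(count)
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


# Made-up mapper output (not real dataset values).
SAMPLE = [
    "at the end of the\t12\n",
    "at the end of the\t7\n",
    "in the middle of the\t5\n",
]

# In a Hadoop Streaming job, sys.stdin would take the place of SAMPLE.
for ngram, total in top_ngrams(SAMPLE):
    print(f"{ngram}\t{total}")
```

Because Hadoop Streaming delivers reducer input sorted by key, a running sum per key would use less memory than the dictionary; the dictionary version is shown because it is shorter.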
Slide 9: The end has come.
End of the Part 3 PowerPoint