Ubiquitous Application Systems: Lab Guide
Artificial Intelligence Lab, 이남기 (beohemian@gmail.com)
Environment
Cloudera QuickStart VM with CDH 5.4.2
Guide for download: http://ailab.ssu.ac.kr/rb/?c=8/29&cat=2015_2_%EC%9C%A0%EB%B9%84%EC%BF%BC%ED%84%B0%EC%8A%A4+%EC%9D%91%EC%9A%A9%EC%8B%9C%EC%8A%A4%ED%85%9C&uid=660
Contents
Using HDFS
  How To Use
  How To Upload File
  How To View and Manipulate File
  Exercise
Running MapReduce Job : WordCount
  Goal
  Remind MapReduce
  Code Review
  Run WordCount Program
Extra Exercise : Number of Connection per Hour
  Meaningful Data from ‘Access Log’
  Foundation of Regular Expression
  Run MapReduce Job
Importing Data With Sqoop
  Review MySQL
Using HDFS With Exercise
Using HDFS
  How to Use HDFS
  How to Upload File
  How to View and Manipulate File
Using HDFS – How To Use (1)
Run the command with no arguments to see a help message describing all the commands associated with HDFS:
  $ hadoop fs
Using HDFS – How To Use (2)
List the contents of a directory in HDFS:
  $ hadoop fs -ls /
  $ hadoop fs -ls /user
  $ hadoop fs -ls /user/cloudera
Exercise How To Use
Using HDFS – How To Upload File (1)
Unzip ‘shakespeare.tar.gz’:
  $ cd ~/training_materials/developer/data
  $ tar zxvf shakespeare.tar.gz
Using HDFS – How To Upload File (2)
Insert the ‘shakespeare’ directory into HDFS:
  $ hadoop fs -put shakespeare /user/cloudera/shakespeare
Exercise How To Upload
Using HDFS – How To View and Manipulate Files (1)
List the ‘shakespeare’ directory, then remove the ‘glossary’ file from it:
  $ hadoop fs -ls shakespeare
  $ hadoop fs -rm shakespeare/glossary
Using HDFS – How To View and Manipulate Files (2)
Print the last 50 lines of Henry IV:
  $ hadoop fs -cat shakespeare/histories \
      | tail -n 50
Using HDFS – How To View and Manipulate Files (3)
Download a file to the local file system and view it:
  $ hadoop fs -get shakespeare/poems \
      ~/shakepoems.txt
  $ less ~/shakepoems.txt
To see the other available commands:
  $ hadoop fs
Exercise How To View and Manipulate Files
Running a MapReduce Job With Exercise
Running a MapReduce Job
  Goal
  Remind MapReduce
  Code Review
  Run WordCount Program
Running a MapReduce Job – Goal
We will submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
Input (works of Shakespeare):
  ALL'S WELL THAT ENDS WELL
  DRAMATIS PERSONAE
  KING OF FRANCE (KING:)
  DUKE OF FLORENCE (DUKE:)
  BERTRAM Count of Rousillon.
  LAFEU an old lord.
  PAROLLES a follower of Bertram.
  Steward | | servants to the Countess of Rousillon.
  Clown | A Page. (Page:)
  COUNTESS OF ROUSILLON mother to Bertram. (COUNTESS:)
  HELENA a gentlewoman protected by the Countess.
  …
Final result after running WordCount (Key Value):
  A 2027 ADAM 16 AARON 72 ABATE 1 ABIDE ABOUT 18 ACHIEVE ACKNOWN …
Running a MapReduce Job – Mapper
Running a MapReduce Job – Shuffle & Sort
Running a MapReduce Job – SumReducer
Running a MapReduce Job – WordCount Code Review
  WordCount.java    A simple MapReduce driver class
  WordMapper.java   A Mapper class for the job
  SumReducer.java   A Reducer class for the job
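Running a MapReduce Job – Code Review : WordCount.java
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // Driver class: configures the job and submits it to Hadoop.
  public class WordCount {
    public static void main(String[] args) throws Exception {
      // Expect exactly two arguments: input directory and output directory.
      if (args.length != 2) {
        System.out.printf("Usage: WordCount <input dir> <output dir>\n");
        System.exit(-1);
      }
      // new Job() is deprecated in Hadoop 2; Job.getInstance() is its replacement.
      Job job = new Job();
      job.setJarByClass(WordCount.class);
      job.setJobName("Word Count");
      FileInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.setMapperClass(WordMapper.class);
      job.setReducerClass(SumReducer.class);
      // Types of the final (reducer) output key and value.
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      // Block until the job finishes and report success through the exit code.
      boolean success = job.waitForCompletion(true);
      System.exit(success ? 0 : 1);
    }
  }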
Running a MapReduce Job – Code Review : WordMapper.java
The map method runs once for each line of text in the input file.
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      // Split on runs of non-word characters and emit (word, 1) for each word.
      for (String word : line.split("\\W+")) {
        if (word.length() > 0) {
          context.write(new Text(word), new IntWritable(1));
        }
      }
    }
  }
Ex) Text file:
  the cat sat on the mat
  the aardvark sat on the sofa
Input to map:
  Key (LongWritable)   Value (Text)
  1000055              the cat sat on the mat
  1000257              the aardvark sat on the sofa
Written to the Context object:
  Key (Text)   Value (IntWritable)
  the          1
  cat          1
  …
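The splitting step in WordMapper can be tried without a cluster. The following is only a sketch in plain Java: the class name MapStepSketch and the "word<TAB>1" string format are made up for illustration, and Hadoop's Text and IntWritable types are replaced by ordinary strings.

```java
import java.util.ArrayList;
import java.util.List;

public class MapStepSketch {
    // Mimics the loop in WordMapper.map(): split a line on runs of
    // non-word characters and emit one "word<TAB>1" pair per non-empty token.
    static List<String> map(String line) {
        List<String> pairs = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            // A line that starts with punctuation yields an empty first token,
            // which is why the length check is needed.
            if (word.length() > 0) {
                pairs.add(word + "\t1");
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : map("the cat sat on the mat")) {
            System.out.println(pair);
        }
    }
}
```

Note that the split does not change case, so "The" and "the" would be emitted as different keys unless the line is lowercased first.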
Running a MapReduce Job – Code Review : SumReducer.java
The reduce method runs once for each key received from the shuffle and sort phase of the MapReduce framework.
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Add up every count emitted for this key.
      int wordCount = 0;
      for (IntWritable value : values) {
        wordCount += value.get();
      }
      context.write(key, new IntWritable(wordCount));
    }
  }
Input data (to SumReducer):
  Key        Values
  aardvark   [1]
  cat        [1]
  mat        [1]
  on         [1,1]
  sat        [1,1]
  sofa       [1]
  the        [1,1,1,1]
  …
Output data (from SumReducer):
  Key        Value
  aardvark   1
  cat        1
  mat        1
  on         2
  sat        2
  sofa       1
  the        4
  …
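To see how the grouped input reaches the reducer, the shuffle and sort phase and the summing loop can be imitated in memory. This is a rough sketch, not how Hadoop implements the phase: the class ShuffleReduceSketch and its method names are hypothetical, and a TreeMap stands in for the framework's sorting and grouping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleReduceSketch {
    // Shuffle and sort: collect every 1 emitted for a word, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(String[] lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\W+")) {
                if (word.length() > 0) {
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
                }
            }
        }
        return grouped;
    }

    // Reduce: the same summing loop as SumReducer.reduce(), on plain ints.
    static int reduce(List<Integer> values) {
        int wordCount = 0;
        for (int value : values) {
            wordCount += value;
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String[] lines = { "the cat sat on the mat", "the aardvark sat on the sofa" };
        for (Map.Entry<String, List<Integer>> e : shuffle(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue() + "\t" + reduce(e.getValue()));
        }
    }
}
```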
Running a MapReduce Job – Run WordCount in HDFS
Compile the three Java classes and collect the compiled class files into a JAR file:
  $ cd ~/{Your Workspace}
  $ javac -classpath `hadoop classpath` *.java
  $ jar cvf wc.jar *.class
Running a MapReduce Job – Run WordCount in HDFS
Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:
  $ hadoop jar wc.jar WordCount shakespeare \
      wordcounts
  wc.jar        JAR file
  WordCount     name of the class containing the main method (driver class)
  shakespeare   input directory
  wordcounts    output directory
Exercise MapReduce Job : WordCount
Extra Exercise MapReduce Job : Number of Connection per Hour
Extra Exercise – Meaningful data from ‘Access Log’ data
Let's extract meaningful data from ‘Access Log’ data.
  10.223.157.186 - - [15/Jul/2009:20:50:39 -0700] "GET /assets/img/closelabel.gif HTTP/1.1" 304 -
  10.223.157.186 - - [15/Jul/2009:20:50:39 -0700] "GET /assets/img/loading.gif HTTP/1.1" 304 -
  10.223.157.186 - - [15/Jul/2009:20:50:39 -0700] "GET /favicon.ico HTTP/1.1" 404 209
  10.223.157.186 - - [15/Jul/2009:21:04:42 -0700] "GET / HTTP/1.1" 200 524
  10.223.157.186 - - [15/Jul/2009:21:04:43 -0700] "GET /favicon.ico HTTP/1.1" 404 209
  10.223.157.186 - - [15/Jul/2009:21:06:22 -0700] "GET / HTTP/1.1" 200 524
  10.223.157.186 - - [15/Jul/2009:21:06:23 -0700] "GET /favicon.ico HTTP/1.1" 404 209
  10.223.157.186 - - [15/Jul/2009:21:12:41 -0700] "GET / HTTP/1.1" 200 524
  10.223.157.186 - - [15/Jul/2009:21:12:41 -0700] "GET /favicon.ico HTTP/1.1" 404 209
  10.223.157.186 - - [15/Jul/2009:21:13:28 -0700] "GET / HTTP/1.1" 200 524
  …
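Extra Exercise – Meaningful data from ‘Access Log’ data
  127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Fields, from left to right:
  127.0.0.1                        IP address of the client
  -                                a hyphen means the requested information (here, the client identity) is not available
  frank                            userid
  [10/Oct/2000:13:55:36 -0700]     the time that the request was received
  "GET /apache_pb.gif HTTP/1.0"    the request line from the client, given in double quotes
  200                              status code
  2326                             size of the object returned to the client
Reference: http://httpd.apache.org/docs/2.2/logs.html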
Extra Exercise – Meaningful data from ‘Access Log’ data
Question: how many connections were made in each hour of the day?
Extra Exercise – Regular Expression
  10.223.157.186 - - [15/Jul/2009:20:50:39 -0700] "GET /assets/img/closelabel.gif HTTP/1.1" 304 -
Using regular expressions:
  \d   Matches any digit character (0-9).            Ex) +1-(444)-555-1234
  \w   Matches any word character.                   Ex) Hello World!
  \s   Matches any whitespace character.             Ex) Hello World!
  +    Matches 1 or more of the preceding token.     Ex) \w+ on Hello World!
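The tokens above can be tried directly with java.util.regex. The sketch below is only an illustration: the class RegexTokensSketch and the helper findAll are hypothetical names, and the inputs are the example strings from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokensSketch {
    // Return every substring of the input that matches the pattern, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d", "+1-(444)-555-1234"));   // one match per digit
        System.out.println(findAll("\\d+", "+1-(444)-555-1234"));  // one match per run of digits
        System.out.println(findAll("\\w+", "Hello World!"));       // one match per word
        System.out.println(findAll("\\s", "Hello World!").size()); // count of whitespace characters
    }
}
```

In Java source each backslash of the pattern is doubled, so the regular expression \d+ is written as the string "\\d+".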
Extra Exercise – Regular Expression
  Target:  [15/Jul/2009:20:50:39 -0700]
  Pattern: \[\d+\/\w+\/\d+:\d+:\d+:\d+\s+-\w+\]
Extra Exercise – Regular Expression
  Matched text: [15/Jul/2009:20:50:39 -0700]
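The timestamp pattern can be applied to a whole log line in Java as sketched below. The class TimestampSketch and its methods are hypothetical, and the capture group around the hour is an addition for illustration that is not part of the slide's pattern. Because the pattern contains a literal "-" before the offset, it is assumed here that every timestamp carries a negative UTC offset such as -0700.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimestampSketch {
    // The pattern from the slide, unchanged (backslashes doubled for Java).
    static final Pattern TIMESTAMP =
        Pattern.compile("\\[\\d+\\/\\w+\\/\\d+:\\d+:\\d+:\\d+\\s+-\\w+\\]");

    // Same pattern with a capture group added around the hour field.
    static final Pattern HOUR =
        Pattern.compile("\\[\\d+\\/\\w+\\/\\d+:(\\d+):\\d+:\\d+\\s+-\\w+\\]");

    // Returns the bracketed timestamp, or null if the line has none.
    static String timestamp(String logLine) {
        Matcher m = TIMESTAMP.matcher(logLine);
        return m.find() ? m.group() : null;
    }

    // Returns the two-digit hour as text, or null if the line has no timestamp.
    static String hour(String logLine) {
        Matcher m = HOUR.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "10.223.157.186 - - [15/Jul/2009:20:50:39 -0700] "
            + "\"GET /assets/img/closelabel.gif HTTP/1.1\" 304 -";
        System.out.println(timestamp(line));
        System.out.println(hour(line));
    }
}
```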
Extra Exercise – Run MapReduce
Extracted timestamps:
  [15/Jul/2009:20:50:39 -0700]
  [15/Jul/2009:20:50:39 -0700]
  [15/Jul/2009:20:50:39 -0700]
  [15/Jul/2009:21:04:42 -0700]
  [15/Jul/2009:21:04:43 -0700]
  [15/Jul/2009:21:06:22 -0700]
  [15/Jul/2009:21:06:23 -0700]
  [15/Jul/2009:21:12:41 -0700]
  [15/Jul/2009:21:12:41 -0700]
  [15/Jul/2009:21:13:28 -0700]
  …
Mapper output (one pair per timestamp; the key is the hour):
  Key   Value
  20    1
  21    1
  …
Extra Exercise – Run MapReduce
Shuffle & Sort output (all values for the same hour are grouped):
  Key   Values
  20    [1,1,1]
  21    [1,1,1,1,1, …]
  …
Extra Exercise – Run MapReduce
Reducer output (sum of the values for each hour, over the full log):
  Key   Value
  20    87681
  21    85914
  …
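Before writing the Hadoop classes, the whole hour-count pipeline can be checked on a few lines in memory. This is a sketch under stated assumptions rather than the exercise solution: HourCountSketch and countByHour are hypothetical names, the three phases are collapsed into a single pass over a TreeMap, and the hour is captured with a group added to the slide's pattern.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HourCountSketch {
    // Slide pattern with a capture group around the hour (an addition for illustration).
    static final Pattern HOUR =
        Pattern.compile("\\[\\d+/\\w+/\\d+:(\\d+):\\d+:\\d+\\s+-\\w+\\]");

    // Map: extract the hour of each line. Shuffle and reduce: merge() adds 1
    // to the running total for that hour. Lines without a timestamp are skipped.
    static Map<String, Integer> countByHour(String[] logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = HOUR.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sample = {
            "10.223.157.186 - - [15/Jul/2009:20:50:39 -0700] \"GET /assets/img/closelabel.gif HTTP/1.1\" 304 -",
            "10.223.157.186 - - [15/Jul/2009:20:50:39 -0700] \"GET /assets/img/loading.gif HTTP/1.1\" 304 -",
            "10.223.157.186 - - [15/Jul/2009:20:50:39 -0700] \"GET /favicon.ico HTTP/1.1\" 404 209",
            "10.223.157.186 - - [15/Jul/2009:21:04:42 -0700] \"GET / HTTP/1.1\" 200 524",
            "10.223.157.186 - - [15/Jul/2009:21:04:43 -0700] \"GET /favicon.ico HTTP/1.1\" 404 209"
        };
        for (Map.Entry<String, Integer> e : countByHour(sample).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In the real job the same extraction would sit in the Mapper, emitting the hour as the key and 1 as the value, while SumReducer could be reused unchanged.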
Extra Exercise – Run MapReduce
Final Output
  Key   Value
  0     119827
  1     165533
  2     246174
  3     273089
  4     273020
  5     264181
  6     294837
  7     312028
  8     327732
  9     300460
  …
Extra Exercise
Exercise MapReduce Job : Number of Connection per Hour
Importing Data With Sqoop Review MySQL and Exercise
Importing Data With Sqoop – Review MySQL (1)
Log on to MySQL:
  $ mysql --user=root \
      --password=cloudera
Show databases:
  > show databases;
Select a database:
  > use retail_db;
Show tables:
  > show tables;
Importing Data With Sqoop – Review MySQL (2) Review ‘customers’ table schema: > DESCRIBE customers;
Importing Data With Sqoop – Review MySQL (3)
Review the ‘customers’ table:
  > DESCRIBE customers;
  …
  > SELECT * FROM customers LIMIT 5;
Importing Data With Sqoop – How To Use (1)
List the databases (schemas) in your database server:
  $ sqoop list-databases \
      --connect jdbc:mysql://localhost \
      --username root --password cloudera
List the tables in the ‘retail_db’ database:
  $ sqoop list-tables \
      --connect jdbc:mysql://localhost/retail_db \
      --username root --password cloudera
Importing Data With Sqoop – How To Use (2)
Import the ‘customers’ table into HDFS, using the same MySQL account as in the previous commands:
  $ sqoop import \
      --connect jdbc:mysql://localhost/retail_db \
      --table customers --fields-terminated-by '\t' \
      --username root --password cloudera
Verify that the command has worked:
  $ hadoop fs -ls customers
  $ hadoop fs -tail customers/part-m-00000