© 2006 KDnuggets
3b: Gawk for Web Log Analysis

Gawk: introduction
gawk is a very powerful text-processing and pattern-matching language.
gawk is the GNU version of awk.
Its syntax is similar to C.
For the manual, see http://
Many awk/gawk tutorials are available.
Note: the name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977.

Gawk: running
There are several ways of running gawk from the Unix prompt:
    % gawk 'commands' file
    % cat file | gawk 'commands'
    % cat file | gawk -f prog.gawk
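
The three ways of running gawk can be tried side by side. This is a minimal sketch, assuming a POSIX shell; the input file, its contents, and the program file are made up for illustration, and the first line is only a convenience so the sketch also runs where plain awk is installed instead of gawk.

```shell
# Fall back to plain awk if gawk is not installed (assumption: only POSIX awk features are used).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical two-line input file.
data=$(mktemp)
printf 'alpha beta\ngamma delta\n' > "$data"

# Hypothetical program file holding the same commands.
prog=$(mktemp)
printf '{ print $1 }\n' > "$prog"

from_arg=$(gawk '{ print $1 }' "$data")          # commands on the command line, file as argument
from_pipe=$(cat "$data" | gawk '{ print $1 }')   # input arrives through a pipe
from_file=$(cat "$data" | gawk -f "$prog")       # commands are read from a program file

echo "$from_arg"
rm -f "$data" "$prog"
```

All three variables hold the same text, the first word of each input line.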

Gawk: fields and records
Gawk divides the file into records and fields.
Each line is a record (by default).
Fields are delimited by a special character.
Default delimiter: white space (blank or tab).
The delimiter can be changed with the -F option.
E.g. to use a comma as the delimiter:
    % gawk -F"," 'commands' file.csv

Gawk: fields and variables
Fields are accessed with the $ prefix.
$1 is the first field, $2 is the second, and so on.
$0 is a special field: the entire line.
NF is a special variable: the number of fields in the current record.
NR is a special variable: the current record number.
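
A small sketch of fields and special variables on a comma-delimited file. The file contents are invented for illustration; the first line only provides a fallback to plain awk where gawk is not installed.

```shell
# Fall back to plain awk if gawk is not installed (assumption: only POSIX awk features are used).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical CSV file with three fields per record.
csv=$(mktemp)
printf 'id,name,city\n1,Ann,Boston\n2,Bob,Paris\n' > "$csv"

# For each record print: record number, number of fields, second field, whole line.
fields=$(gawk -F"," '{ print NR, NF, $2, $0 }' "$csv")

echo "$fields"
rm -f "$csv"
```

Each output line starts with NR (1, 2, 3) and NF (always 3 here), followed by the second field and then the unchanged record from $0.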

Gawk conditions
    gawk -F"d" 'condition' file
gawk processes each line of file, using the delimiter d (default: whitespace) to split the line into fields.
When only a condition is given, the default action is to print the entire line, so every line that satisfies the condition is printed.
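
A sketch of a bare condition with a custom delimiter. The colon-delimited data is made up; the first line only provides a fallback to plain awk where gawk is not installed.

```shell
# Fall back to plain awk if gawk is not installed (assumption: only POSIX awk features are used).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical colon-delimited input.
data=$(mktemp)
printf 'a:1\nb:2\nc:3\n' > "$data"

# No action is given, so every line whose second field is at least 2 is printed whole.
matched=$(gawk -F":" '$2 >= 2' "$data")

echo "$matched"
rm -f "$data"
```

Only the lines b:2 and c:3 satisfy the condition and are printed.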

Sample log file
We will use the file d100.log: the first 100 lines of the Nov 16, 2005 KDnuggets log file.
We will give useful code examples; for a full gawk introduction, see the manual and tutorials.
You are encouraged to try the code examples in this lecture on this file.
You should get the same answers!

Sample log file d100.log
ip1664.com - - [16/Nov/2005:00:00: ] "GET /robots.txt HTTP/1.0" "-" "msnbot/1.0 (+
ip1664.com - - [16/Nov/2005:00:00: ] "GET /gpspubs/sigkdd-kdd99-panel.html HTTP/1.0" "-" "msnbot/1.0 (+
ip2283.unr - - [16/Nov/2005:00:01: ] "GET /dmcourse/data_mining_course/assignments/assignment-3.html HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip2283.unr - - [16/Nov/2005:00:01: ] "GET /dmcourse/dm.css HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip1389.net - - [16/Nov/2005:00:02: ] "GET /gpspubs/kdd99-est-ben-lift/sld021.htm HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1389.net - - [16/Nov/2005:00:02: ] "GET /gpspubs/kdd99-est-ben-lift/img021.gif HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1389.net - - [16/Nov/2005:00:02: ] "GET /favicon.ico HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1946.com - - [16/Nov/2005:00:02: ] "GET /news/2001/n10/15i.html HTTP/1.0" "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; …
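
Example 1: Lines with status not equal to 200
The status code is field $9 in the log file.
How many lines had a status code other than 200?
    % gawk '$9 != 200' d100.log | wc
Result: 27 (the first number printed by wc is the line count)
Note: to count lines with status code equal to 200, use '$9 == 200', not '$9 = 200'. The latter assigns 200 to $9; because the assigned value is nonzero, the condition is always true and every line is printed.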
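
The difference between != , == and = can be checked on a tiny log. This is a sketch only: the four log lines below are invented (they are not taken from d100.log) and merely assume the same field layout, with the status code in $9. The first line is a fallback to plain awk where gawk is not installed.

```shell
# Fall back to plain awk if gawk is not installed (assumption: only POSIX awk features are used).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical log lines; hosts, paths, sizes and referrers are made up.
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "AgentA/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.htm HTTP/1.1" 404 300 "http://www.google.com/search?q=gawk" "AgentB/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "GET / HTTP/1.0" 200 2000 "http://www.example.org/links.html" "AgentA/1.0"
host4.example - - [16/Nov/2005:00:00:04 -0500] "POST /form.html HTTP/1.1" 302 0 "-" "AgentC/1.0"
EOF

not_ok=$(gawk '$9 != 200' "$log" | wc -l)   # comparison: matches the 404 and 302 lines
ok=$(gawk '$9 == 200' "$log" | wc -l)       # comparison: matches the two 200 lines
oops=$(gawk '$9 = 200' "$log" | wc -l)      # assignment: always true, so all four lines print

echo "$not_ok $ok $oops"
rm -f "$log"
```

Using wc -l instead of plain wc prints only the line count, which is the number of interest here.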

Example 2: Count referrals from Google
Gawk has powerful pattern matching:
    variable ~ "pattern"
Example: how many log lines had a referral (field $11 in the log line) from google?
    % gawk '$11 ~ "google"' d100.log | wc
Result: 2

Example 3: complex condition
How many hits had the GET method and status 404? (Status 404 is an error code.)
The method is field $6 in the log, but because the request is enclosed in double quotes, $6 is "GET with a leading quote. We therefore use a pattern match instead of an equality test:
    % gawk '$6 ~ "GET" && $9 == 404' d100.log | wc
Result: 1
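
A sketch of the referrer match and the compound condition on a small invented log. The lines are hypothetical and only assume the field layout used in this lecture ($6 method, $9 status, $11 referrer); the first line is a fallback to plain awk where gawk is not installed.

```shell
# Fall back to plain awk if gawk is not installed (assumption: only POSIX awk features are used).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical log lines; hosts, paths, sizes and referrers are made up.
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "AgentA/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.htm HTTP/1.1" 404 300 "http://www.google.com/search?q=gawk" "AgentB/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "GET / HTTP/1.0" 200 2000 "http://www.example.org/links.html" "AgentA/1.0"
host4.example - - [16/Nov/2005:00:00:04 -0500] "POST /form.html HTTP/1.1" 302 0 "-" "AgentC/1.0"
EOF

first_method=$(gawk 'NR == 1 { print $6 }' "$log")              # shows the leading quote in $6
google=$(gawk '$11 ~ "google"' "$log" | wc -l)                  # referrer contains "google"
get404=$(gawk '$6 ~ "GET" && $9 == 404' "$log" | wc -l)         # both conditions must hold
exact_get=$(gawk '$6 == "\"GET"' "$log" | wc -l)                # equality works once the quote is included

echo "$first_method $google $get404 $exact_get"
rm -f "$log"
```

The equality test in the last command is an alternative to the pattern match: it succeeds only because the escaped quote is part of the compared string.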

Example 4a: Counting ".html" requests
The requested file is field $7. We can use this condition to match files that end in .html.
Note: $ in the pattern matches the end of the string.
Note: an unescaped . in a pattern matches any single character, not only a literal dot.
    % gawk '$7 ~ ".html$"' d100.log | wc
Result: 21

Example 4b: Counting htm or html requests
Some files may also end in .htm, so we can use
    % gawk '$7 ~ ".html$|.htm$"' d100.log | wc
Result: 22

Example 4c: Counting directory requests
Some requests can be for a directory, e.g. a request for the homepage would contain the string "GET / HTTP/1.1".
We can count these requests by
    % gawk '$7 ~ "/$"' d100.log | wc
Result: 6

Example 4d: Counting all HTML pages
We can count html, htm, and directory pages together by
    % gawk '$7 ~ "(html|htm|/)$"' d100.log | wc
Result: 28
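
The patterns from Examples 4a to 4d can be compared on a small invented log. This is a sketch: the five request paths are hypothetical, chosen so that each pattern gives a different count, and /docs/xhtml is included to show the effect of the unescaped dot. The first line is a fallback to plain awk where gawk is not installed.

```shell
# Fall back to plain awk if gawk is not installed (assumption: only POSIX awk features are used).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical log lines; hosts, paths and sizes are made up.
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "AgentA/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.htm HTTP/1.1" 404 300 "-" "AgentB/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "GET / HTTP/1.0" 200 2000 "-" "AgentA/1.0"
host4.example - - [16/Nov/2005:00:00:04 -0500] "GET /images/logo.gif HTTP/1.1" 200 500 "-" "AgentC/1.0"
host5.example - - [16/Nov/2005:00:00:05 -0500] "GET /docs/xhtml HTTP/1.1" 200 700 "-" "AgentC/1.0"
EOF

html_any=$(gawk '$7 ~ ".html$"' "$log" | wc -l)          # "." matches any character, so /docs/xhtml counts too
html_dot=$(gawk '$7 ~ /\.html$/' "$log" | wc -l)         # escaped dot: only names ending in a literal ".html"
htm_html=$(gawk '$7 ~ ".html$|.htm$"' "$log" | wc -l)    # either ending
dirs=$(gawk '$7 ~ "/$"' "$log" | wc -l)                  # requests ending in "/"
pages=$(gawk '$7 ~ "(html|htm|/)$"' "$log" | wc -l)      # all three kinds in one pattern

echo "$html_any $html_dot $htm_html $dirs $pages"
rm -f "$log"
```

Writing the pattern as a regular expression constant between slashes, as in the second command, makes it easy to escape the dot when a literal "." is required.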

Gawk computations
A more general form of gawk statements is
    gawk '{statements;…}' file
The statements are executed for each line of file.
Statements include the usual conditionals, loops, etc.
Details are in the gawk manual and tutorials.

Example 5: External referrers
Example: count referrers to html pages, excluding direct access (where the referrer is "-").
Note: the referrer field includes its surrounding double quotes, so to test whether $11 is "-" we need to escape each double quote as \".
Code (all on one line):
    % gawk '{if ($7~"html$" && $11!="\"-\"") print $11}' d100.log | wc
Result: 7
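
A sketch of the external-referrer test on a small invented log, printing the referrers themselves rather than counting them. The log lines are hypothetical and only assume the field layout used in this lecture; the first line is a fallback to plain awk where gawk is not installed.

```shell
# Fall back to plain awk if gawk is not installed (assumption: only POSIX awk features are used).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical log lines; hosts, paths, sizes and referrers are made up.
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "AgentA/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /jobs/index.html HTTP/1.1" 200 1500 "http://www.google.com/search?q=jobs" "AgentB/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "GET / HTTP/1.0" 200 2000 "http://www.example.org/links.html" "AgentA/1.0"
host4.example - - [16/Nov/2005:00:00:04 -0500] "GET /news/today.html HTTP/1.1" 200 800 "http://www.example.org/links.html" "AgentC/1.0"
EOF

# Statement form: an if inside the braces.
refs=$(gawk '{ if ($7 ~ "html$" && $11 != "\"-\"") print $11 }' "$log")

# Equivalent condition-then-action form.
refs2=$(gawk '$7 ~ "html$" && $11 != "\"-\"" { print $11 }' "$log")

echo "$refs"
rm -f "$log"
```

Line 1 is skipped because its referrer is "-", and line 3 is skipped because the requested path does not end in html, so two referrers are printed.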

Gawk statements: BEGIN, END
To execute statements before the first line is read, we use the BEGIN keyword.
To execute statements after the last line is read, we use the END keyword.
    gawk 'BEGIN{stat1;…}{stat2;…}END{stat3;…}' file
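
A sketch showing when each of the three blocks runs. The three-line input and the variable name n are made up for illustration; the first line is a fallback to plain awk where gawk is not installed.

```shell
# Fall back to plain awk if gawk is not installed (assumption: only POSIX awk features are used).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical three-line input.
data=$(mktemp)
printf 'one\ntwo\nthree\n' > "$data"

# BEGIN runs once before any input, the middle block once per line, END once after the last line.
report=$(gawk 'BEGIN { print "start" } { n++; print NR ": " $0 } END { print "lines: " n }' "$data")

echo "$report"
rm -f "$data"
```

The output starts with the BEGIN line, then lists each numbered input line, and ends with the count printed by END.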

Example 6
Sum all the object sizes for status code 200:
    % gawk '{if ($9 == 200) sumsize+=$10} END{print sumsize}' d100.log
Result:
Note: we did not initialize sumsize. An uninitialized variable is treated as zero in arithmetic, so the sum starts from 0.
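
A sketch of the summation on a small invented log, including the case where no line matches. The log lines are hypothetical and only assume the field layout used in this lecture ($9 status, $10 size); the first line is a fallback to plain awk where gawk is not installed.

```shell
# Fall back to plain awk if gawk is not installed (assumption: only POSIX awk features are used).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical log lines; hosts, paths, sizes and referrers are made up.
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "AgentA/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.htm HTTP/1.1" 404 300 "-" "AgentB/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "GET / HTTP/1.0" 200 2000 "-" "AgentA/1.0"
EOF

# Sum of sizes for status 200: 1000 + 2000.
total=$(gawk '{ if ($9 == 200) sumsize += $10 } END { print sumsize }' "$log")

# No line has status 500, so sumsize is never assigned and prints as an empty string.
none=$(gawk '{ if ($9 == 500) sumsize += $10 } END { print sumsize }' "$log")

# Adding 0 forces a numeric value, so an untouched sum prints as 0.
none_as_zero=$(gawk '{ if ($9 == 500) sumsize += $10 } END { print sumsize + 0 }' "$log")

echo "total=$total none=[$none] none_as_zero=$none_as_zero"
rm -f "$log"
```

Printing sumsize + 0 in the END block is a small safeguard for reports where a blank total would be confusing.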