© 2006 KDnuggets 152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N"


3b: Gawk for Web Log Analysis

[16/Nov/2005:16:32: ] "GET /jobs/ HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR )"
[16/Feb/2006:00:06: ] "GET / HTTP/1.1" " 740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
[16/Feb/2006:00:06: ] "GET /kdr.css HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
[16/Feb/2006:00:06: ] "GET /images/KDnuggets_logo.gif HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"

Gawk - introduction
- A very powerful text-processing and pattern-matching language
- gawk is the GNU version of awk
- Syntax similar to C
- See http:// for the manual
- Many awk/gawk tutorials are available
Note: The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977.

Gawk - running
Several ways of running from the Unix prompt:
    % gawk 'commands' file
    % cat file | gawk 'commands'
    % cat file | gawk -f prog.gawk
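
The three invocation styles can be compared side by side. This is a minimal sketch: the file names data.txt and prog.awk and their contents are made up for illustration, and plain awk is used so that it also runs where gawk is not installed.

```shell
# Hypothetical input and program files; awk is used here, gawk accepts the same commands.
workdir=$(mktemp -d)
printf 'alpha 1\nbeta 2\n' > "$workdir/data.txt"
printf '{ print $1 }\n' > "$workdir/prog.awk"

# 1) program on the command line, file as an argument
out_arg=$(awk '{ print $1 }' "$workdir/data.txt")
# 2) program on the command line, input from a pipe
out_pipe=$(cat "$workdir/data.txt" | awk '{ print $1 }')
# 3) program read from a file with -f, input from a pipe
out_file=$(cat "$workdir/data.txt" | awk -f "$workdir/prog.awk")

printf '%s\n' "$out_arg"
rm -rf "$workdir"
```

All three variables should hold the same two lines, alpha and beta.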

Gawk - fields and records
- Gawk divides the file into records and fields
- Each line is a record (by default)
- Fields are delimited by a special character
  - Default: whitespace (blanks or tabs)
  - Can be changed with the -F option
  - E.g. to use a comma as the delimiter:
    % gawk -F"," 'commands' file.csv
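
The effect of -F is easiest to see on a value that contains a blank. A small sketch, assuming a made-up three-line CSV file (the column names and rows are illustrative only):

```shell
# Hypothetical CSV data; awk is used here, gawk accepts the same commands.
csv=$(mktemp)
printf 'name,city,visits\nann,Boston,12\nbob,New York,7\n' > "$csv"

# Default splitting on whitespace: line 3 becomes "bob,New" and "York,7"
second_ws=$(awk 'NR == 3 { print $2 }' "$csv")
# Splitting on commas: line 3 becomes "bob", "New York", "7"
second_csv=$(awk -F"," 'NR == 3 { print $2 }' "$csv")

echo "whitespace: $second_ws"
echo "comma:      $second_csv"
rm -f "$csv"
```

Note that -F"," splits on every comma, so it is not a full CSV parser: quoted values that themselves contain commas would be split too.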

Gawk fields and variables
- Fields are accessed with the $ prefix
  - $1 is the first field, $2 is the second, ...
  - $0 is a special field holding the entire line
- Special variables:
  - NF: number of fields in the current record
  - NR: current record number
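
A short sketch that prints these fields and variables for a two-line input (the input text is made up):

```shell
# awk is used here; gawk accepts the same command.
summary=$(printf 'a b c\nd e\n' |
  awk '{ print "record " NR " has " NF " fields; $1=" $1 "; $0=" $0 }')

printf '%s\n' "$summary"
# record 1 has 3 fields; $1=a; $0=a b c
# record 2 has 2 fields; $1=d; $0=d e
```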

Gawk conditions
    % gawk -F"d" 'condition' file
- gawk processes each line of file, using the delimiter d (default: whitespace) to split the line into fields.
- For each line where condition is true, the default action is to print the entire line.
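
As an illustration of a condition with no action, the sketch below filters colon-delimited records on their third field. The three input records are invented for the example.

```shell
# Hypothetical colon-delimited input; awk is used here, gawk accepts the same commands.
input=$(printf 'root:x:0\nalice:x:1000\nbob:x:1001\n')

# Condition only: matching lines are printed in full by the default action
users=$(printf '%s\n' "$input" | awk -F":" '$3 >= 1000')
# The same program with the default action written out
explicit=$(printf '%s\n' "$input" | awk -F":" '$3 >= 1000 { print $0 }')

printf '%s\n' "$users"
```

Both forms should print the alice and bob records and skip root.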

Sample log file
- We will use the file d100.log: the first 100 lines of the Nov 16, 2005 KDnuggets log file.
- We give useful code examples here; for a full gawk introduction, see the gawk manual or a tutorial.
- You are encouraged to try the code examples in this lecture on this file.
- You should get the same answers!

Sample log file d100.log

ip1664.com - - [16/Nov/2005:00:00: ] "GET /robots.txt HTTP/1.0" "-" "msnbot/1.0 (+
ip1664.com - - [16/Nov/2005:00:00: ] "GET /gpspubs/sigkdd-kdd99-panel.html HTTP/1.0" "-" "msnbot/1.0 (+
ip2283.unr - - [16/Nov/2005:00:01: ] "GET /dmcourse/data_mining_course/assignments/assignment-3.html HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip2283.unr - - [16/Nov/2005:00:01: ] "GET /dmcourse/dm.css HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip1389.net - - [16/Nov/2005:00:02: ] "GET /gpspubs/kdd99-est-ben-lift/sld021.htm HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1389.net - - [16/Nov/2005:00:02: ] "GET /gpspubs/kdd99-est-ben-lift/img021.gif HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1389.net - - [16/Nov/2005:00:02: ] "GET /favicon.ico HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1946.com - - [16/Nov/2005:00:02: ] "GET /news/2001/n10/15i.html HTTP/1.0" "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; …

Example 1: Lines with status not equal to 200
- The status code is field $9 in the log file
- How many lines had a status code other than 200?
    % gawk '$9 != 200' d100.log | wc
  Result: 27 (the line count reported by wc)
Note: to count lines with status code equal to 200, use '$9 == 200', not '$9 = 200' (which assigns 200 to $9).
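
The difference between == and = can be checked on a small file. The sketch below uses five invented log lines in the same field layout (status in $9); they are not taken from d100.log, so the counts differ from the result above.

```shell
# Hypothetical log lines; awk is used here, gawk accepts the same commands.
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "TestAgent/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 300 "-" "TestAgent/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "POST /form.cgi HTTP/1.1" 200 500 "-" "TestAgent/1.0"
host4.example - - [16/Nov/2005:00:00:04 -0500] "GET / HTTP/1.1" 200 2000 "-" "TestAgent/1.0"
host5.example - - [16/Nov/2005:00:00:05 -0500] "GET /old/page.htm HTTP/1.0" 304 0 "-" "TestAgent/1.0"
EOF

not_200=$(awk '$9 != 200' "$log" | wc -l | tr -d ' ')   # comparison: status other than 200
is_200=$(awk '$9 == 200' "$log" | wc -l | tr -d ' ')    # comparison: status equal to 200
assigned=$(awk '$9 = 200' "$log" | wc -l | tr -d ' ')   # assignment: value 200 is true for every line

echo "not 200: $not_200, is 200: $is_200, after assignment: $assigned"
rm -f "$log"
```

With the assignment, every line is printed (with $9 overwritten), so the count equals the total number of lines rather than the number of 200 responses.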

Example 2: Count referrals from Google
- Gawk has powerful pattern matching: variable ~ "pattern"
- Example: how many log lines had a referral (field $11 in the log line) from Google?
    % gawk '$11 ~ "google"' d100.log | wc
  Result: 2

Example 3: complex condition
- How many hits used the GET method and returned status 404 (an error code)?
- The method is field $6 in the log, but the request is enclosed in double quotes, so $6 is "GET with a leading quote. A pattern match handles this:
    % gawk '$6 ~ "GET" && $9 == 404' d100.log | wc
  Result: 1
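
Why a pattern match is used instead of an equality test can be seen on a few invented log lines (not from d100.log):

```shell
# Hypothetical log lines; awk is used here, gawk accepts the same commands.
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "TestAgent/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 300 "-" "TestAgent/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "POST /form.cgi HTTP/1.1" 404 500 "-" "TestAgent/1.0"
EOF

# Equality against GET never succeeds, because $6 is "GET (quote included)
exact=$(awk '$6 == "GET" && $9 == 404' "$log" | wc -l | tr -d ' ')
# Pattern match, as in the example
matched=$(awk '$6 ~ "GET" && $9 == 404' "$log" | wc -l | tr -d ' ')
# Equality also works once the leading quote is escaped
quoted=$(awk '$6 == "\"GET" && $9 == 404' "$log" | wc -l | tr -d ' ')

echo "exact: $exact, matched: $matched, quoted: $quoted"
rm -f "$log"
```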

Example 4a: Counting ".html" requests
- The requested file is field $7. We can use this condition to match files that end in .html
- Note: $ in the pattern matches the end of the string
    % gawk '$7 ~ ".html$"' d100.log | wc
  Result: 21
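
One detail worth knowing: in the pattern ".html$" the dot is a regular-expression metacharacter that matches any character, so the condition is slightly broader than "ends in .html". A sketch with three made-up paths, where $1 stands in for $7:

```shell
# Hypothetical request paths; awk is used here, gawk accepts the same commands.
paths=$(printf '/index.html\n/docs/xhtml\n/a.htm\n')

# Unescaped dot: also matches /docs/xhtml, where "x" satisfies the dot
loose=$(printf '%s\n' "$paths" | awk '$1 ~ ".html$"' | wc -l | tr -d ' ')
# Regex literal with an escaped dot: only a real ".html" suffix matches
strict=$(printf '%s\n' "$paths" | awk '$1 ~ /\.html$/' | wc -l | tr -d ' ')

echo "loose: $loose, strict: $strict"
```

On real web logs the two usually give the same count, but the escaped form states the intent exactly.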

Example 4b: Counting htm or html requests
- Some files may also end in .htm, so we can use
    % gawk '$7 ~ ".html$|.htm$"' d100.log | wc
  Result: 22

Example 4c: Counting directory requests
- Some requests are for a directory, e.g. a request for the homepage contains the string "GET / HTTP/1.1".
- We can count these requests with
    % gawk '$7 ~ "/$"' d100.log | wc
  Result: 6

Example 4d: Counting all HTML pages
- Or count html, htm, and directory pages together with
    % gawk '$7 ~ "(html|htm|/)$"' d100.log | wc
  Result: 28
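
The three patterns from Examples 4b to 4d can be tried together on a handful of invented paths ($1 stands in for $7 here). The combined pattern should count exactly the pages matched by the other two.

```shell
# Hypothetical request paths; awk is used here, gawk accepts the same commands.
paths=$(printf '/index.html\n/a.htm\n/\n/dir/\n/img.gif\n/style.css\n')

pages=$(printf '%s\n' "$paths" | awk '$1 ~ ".html$|.htm$"' | wc -l | tr -d ' ')    # html or htm
dirs=$(printf '%s\n' "$paths" | awk '$1 ~ "/$"' | wc -l | tr -d ' ')               # directory requests
all=$(printf '%s\n' "$paths" | awk '$1 ~ "(html|htm|/)$"' | wc -l | tr -d ' ')     # all of the above

echo "pages: $pages, dirs: $dirs, all: $all"
```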

Gawk computations
- A more general form of a gawk program is
    gawk '{statements;…}' file
- The statements are executed for each line of file
- Statements include the usual conditionals, loops, etc.
- Details are in the gawk manual and tutorials
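
A small sketch of statements with a loop and a conditional: it sums the fields of each line and flags sums above 6. The input numbers and the threshold are arbitrary choices for illustration.

```shell
# awk is used here; gawk accepts the same program.
totals=$(printf '1 2 3\n4 5\n' | awk '{
  sum = 0
  for (i = 1; i <= NF; i++)      # loop over all fields of the current line
    sum += $i
  if (sum > 6) {                 # conditional on the computed value
    print NR ": " sum " (large)"
  } else {
    print NR ": " sum
  }
}')

printf '%s\n' "$totals"
# 1: 6
# 2: 9 (large)
```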

Example 5: External referrers
- Example: print referrers to html pages, excluding direct access (where the referrer is "-")
- Note: field $11 includes its surrounding double quotes, so to compare it with "-" each quote must be escaped as \"
- Code (all on one line):
    % gawk '{if ($7~"html$" && $11!="\"-\"") print $11}' d100.log | wc
  Result: 7
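
The quoting can be checked on four invented log lines (not from d100.log). The awk program is the one from the example; only the input differs.

```shell
# Hypothetical log lines; awk is used here, gawk accepts the same commands.
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "TestAgent/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /about.html HTTP/1.1" 200 800 "http://www.example.org/links" "TestAgent/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "GET /logo.gif HTTP/1.1" 200 500 "http://www.example.org/links" "TestAgent/1.0"
host4.example - - [16/Nov/2005:00:00:04 -0500] "GET /faq.html HTTP/1.1" 200 900 "http://www.google.com/search?q=faq" "TestAgent/1.0"
EOF

# html pages whose referrer field is not the quoted dash "-"
referrers=$(awk '{if ($7~"html$" && $11!="\"-\"") print $11}' "$log")
count=$(printf '%s\n' "$referrers" | wc -l | tr -d ' ')

printf '%s\n' "$referrers"
rm -f "$log"
```

The printed referrers keep their surrounding double quotes, since those are part of field $11.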

Gawk statements: BEGIN, END
- To execute statements before the first line is read, use the BEGIN keyword
- To execute statements after the last line is read, use the END keyword
    gawk 'BEGIN{stat1;…}{stat2;…}END{stat3;…}' file
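
A minimal sketch of the three parts working together; the input lines and message texts are made up.

```shell
# awk is used here; gawk accepts the same program.
report=$(printf 'a\nb\nc\n' |
  awk 'BEGIN { print "start" } { n++ } END { print "lines read: " n }')

printf '%s\n' "$report"
# start
# lines read: 3
```

The BEGIN block runs once before any input, the middle block runs once per line, and the END block runs once after the last line.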

Example 6
- Sum the sizes of all objects returned with status code 200
    gawk '{if ($9 == 200) sumsize+=$10} END{print sumsize}' d100.log
  Result:
Note: we did not initialize sumsize; an uninitialized variable is treated as zero in arithmetic (and as an empty string when printed).
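
The same sum can be tried on three invented log lines (not from d100.log). The sketch also shows what happens when no line matches: the variable is never assigned, so printing it yields an empty line unless it is forced to a number with + 0.

```shell
# Hypothetical log lines; awk is used here, gawk accepts the same commands.
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "TestAgent/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 300 "-" "TestAgent/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "POST /form.cgi HTTP/1.1" 200 500 "-" "TestAgent/1.0"
EOF

# Sum of sizes ($10) over lines with status 200: 1000 + 500
total=$(awk '{if ($9 == 200) sumsize+=$10} END{print sumsize}' "$log")
# No line has status 999, so sumsize stays unset and prints as an empty string
unset_sum=$(awk '{if ($9 == 999) sumsize+=$10} END{print sumsize}' "$log")
# Adding 0 forces a numeric result even when nothing matched
forced_sum=$(awk '{if ($9 == 999) sumsize+=$10} END{print sumsize + 0}' "$log")

echo "total: $total, unset: [$unset_sum], forced: $forced_sum"
rm -f "$log"
```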