Published by Shanon McBride. Modified over 9 years ago.
1
© 2006 KDnuggets
Module 4b: Perl for Web Log Analysis
    152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;.NET CLR 1.1.4322)"
    252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400 740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
    252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
    252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
2
Perl - introduction
A full-featured, fast, and easy-to-use scripting language
Very powerful pattern-matching facilities
More powerful than gawk; very popular for web programming and CGI scripts
Many Perl tutorials, e.g.
  learn.perl.org/
  www.perl.com/pub/a/2000/10/begperl1.html
  www.perlmonks.org/index.pl?node=Tutorials
3
Perl - historical note
PERL is commonly expanded as "Practical Extraction and Report Language"
Developed by Larry Wall
Perl 1.0 was released on Usenet in 1987
Perl became one of the most popular web programming languages, thanks to powerful text manipulation and quick development
Perl is widely known as "the duct tape of the Internet"
4
Perl - running
First Perl script (on Unix), file1.pl:
    #!/usr/local/bin/perl -w
    print "Hi there!\n";
Note: on Windows, the first line is usually #!c:/Perl/bin/perl.exe -w
Run it from the shell prompt:
    % file1.pl
Result:
    Hi there!
5
Perl for Windows
ActivePerl: a ready-to-install Perl distribution
Runs on Windows, Linux, Mac OS, and other operating systems
Free download: www.activestate.com/Products/ActivePerl/
6
Perl basics
Two data types: numbers and strings
Perl uses many special characters ($, @, %) as part of its syntax
Perl variables:
  Scalars (simple variables, things) start with $, e.g. $count
  Arrays (lists) start with @, e.g. @array1
  Hashes (associative arrays) start with %
Usual control structures
A full introduction to Perl is beyond the scope of this module
7
What does this code do?
    @P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
    @p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
    ($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
    close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print
Answer: We do NOT want to know!
8
The Tao of Coding
Human time is MUCH more precious than computer time
It is much better (and faster) to develop programs using methods that AVOID mistakes than to hunt for bugs in badly written programs
9
Perl style: understandability first
Perl lets you write tricky programs that save a few lines of text
AVOID this approach
Use careful, step-by-step development
Test after every step
A good program should be easy to understand
Improve efficiency only after you have an understandable program, and only if you need to
10
Perl coding
Variables can be declared implicitly by their first use, e.g.
    $oldvar = $nevar + 27;
If $nevar was not declared before, it is undefined and is treated as zero in this expression
Danger! This can lead to hard-to-find errors (what if the variable was misspelled and was supposed to be $newvar?)
Much better to declare variables explicitly, e.g.
    my $newvar = 0;
Explicit declaration is enforced by the pragma use strict
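The effect of use strict on a misspelled variable can be seen with a small experiment. This is only a sketch: check_snippet is a hypothetical helper name, and it compiles the snippet at run time with a string eval purely so the compile error can be caught and printed.

```perl
use strict;
use warnings;

# check_snippet is a hypothetical helper for this illustration only.
# It compiles and runs a code snippet under "use strict" and reports
# either "ok" or the error message Perl produced.
sub check_snippet {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? "ok" : "error: $@";
}

# $nevar is a typo for $newvar, so strict refuses to compile the snippet.
print check_snippet('my $newvar = 0; my $oldvar = $nevar + 27;');

# The correctly spelled version compiles and runs.
print check_snippet('my $newvar = 0; my $oldvar = $newvar + 27;'), "\n";
```

Without use strict, the misspelled version would run silently and $oldvar would simply be 27.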
11
Sample log file
We will again use file d100.log, the first 100 lines of the Nov 16, 2005 KDnuggets log file
We will give useful code examples
You are encouraged to try the code examples in this lecture on this file
You should get the same answers!
12
Perl for parsing a web log file
Program 0: logparse0.pl - read and print log file
    #!c:/Perl/bin/perl.exe -w
    use strict;
    while (<>) {
        my $line = $_;   # current line
        print $line;
    }
13
Perl regular expressions, 1
Usage: $var =~ /regex/ where regex is a regular expression
E.g. $line =~ /google/ will match all lines containing "google"
Note: the slashes / delimit the regular expression, so / can't be used inside it (unless escaped like this: \/ )
14
Perl log parsing, 1
Check how many lines refer to google:
    #!c:/Perl/bin/perl.exe -w
    use strict;
    my $cnt = 0;
    while (<>) {
        my $line = $_;
        if ($line =~ /google/) { $cnt++; }
    }
    print "$cnt lines matched google\n";
Applying this code to d100.log, you get:
    2 lines matched google
15
Perl regular expressions, 2
Special characters:
  .  : matches any one character (except a newline)
  a* : matches zero or more repeats of "a"
  a+ : matches 1 or more repeats of "a"
  \S : matches any non-whitespace character
  ^  : anchor, matches the beginning of the string
  $  : anchor, matches the end of the string
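A few quick checks can make these special characters concrete. The helper name matches is hypothetical and the test strings are made up for illustration; each call prints 1 for a match and 0 otherwise.

```perl
use strict;
use warnings;

# matches is a hypothetical helper: 1 if $text matches $regex, else 0.
sub matches {
    my ($text, $regex) = @_;
    return $text =~ $regex ? 1 : 0;
}

print matches("cat",    qr/c.t/),         "\n";  # . stands in for the "a"
print matches("ct",     qr/ca*t/),        "\n";  # a* accepts zero repeats of "a"
print matches("ct",     qr/ca+t/),        "\n";  # a+ needs at least one "a"
print matches("a b",    qr/^\S+$/),       "\n";  # the space is not matched by \S
print matches("/jobs/", qr/^\/jobs\/$/),  "\n";  # anchors plus escaped slashes
```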
16
Log parse 2: IP address
The IP address is the first item on the log line
In almost all log files it is followed by " - - ", representing the missing "ident_user" and "auth_user" fields
Regular expression for matching these 3 fields:
    $line =~ /^(\S+) - - /;
17
Perl regex: parentheses capture match variables
Perl regex items enclosed in parentheses () correspond to special match variables
Variable $1 contains the value matched by the regular expression in the first pair of parentheses, $2 the second, and so on
18
Perl regex: match variables
This program shows how to assign the IP address to variable $ip; it also shows error processing if the match is not successful
    #!c:/Perl/bin/perl.exe -w
    use strict;
    my $cnt = 0;
    while (<>) {
        my $line = $_;
        if ($line =~ /^(\S+) - - /) {
            my $ip = $1;          # value captured by the first parentheses
            print "ip $ip\n";
            $cnt++;
        }
        else {
            print "bad line $line\n";
        }
    }
    print "processed $cnt log lines\n";
Note: the first line (the path to Perl) is probably different on your machine
19
Perl regular expression 4: brackets
Brackets [ ] allow you to match any one of the characters inside
Example: [cmt]an will match can, man or tan, but will not match ban or dan
20
Perl regular expression 4b: brackets [^ ]
[^x] will match any character except x (note: here ^ is not the beginning-of-text anchor)
Example: [^:]* will match any string that does not include a colon
Example: if $date is 16/Nov/2005:031415, then after
    $date =~ /([^:]*):.*/
[^:]* will have matched 16/Nov/2005
Because it was enclosed in (), the match result is stored in $1
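As a small sketch of the negated bracket in use, the hypothetical helper date_part below returns everything before the first colon of a timestamp. The sample value is the timestamp from the log line on the opening slide.

```perl
use strict;
use warnings;

# date_part is a hypothetical helper: it returns the text before the
# first colon, or undef if the string contains no colon at all.
sub date_part {
    my ($stamp) = @_;
    return $stamp =~ /^([^:]*):/ ? $1 : undef;
}

print date_part("16/Nov/2005:16:32:50"), "\n";
```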
21
Parsing log: Date, Time
Date and time are specified in the log as [DD/Mon/YYYY:HH:MM:SS timezone]
Matching regular expression:
    \[([^:]+):(..):(..):(..) -0500\]
22
Parsing log: Date, Time
Matching regular expression in detail:
    \[([^:]+):(..):(..):(..) -0500\]
  \[ and \] match the literal square brackets
  [^:]+ matches one or more characters that are not a colon
  ([^:]+) will match DD/Mon/YYYY; value in $1
  first (..) will match HH (hours); value in $2
  second (..) will match MM (minutes); value in $3
  third (..) will match SS (seconds); value in $4
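The date and time expression can be tried on its own before it is combined with the other fields. A minimal sketch, assuming a hypothetical helper parse_timestamp and using the timestamp from the opening slide:

```perl
use strict;
use warnings;

# parse_timestamp is a hypothetical helper: it returns the list
# (date, hours, minutes, seconds), or an empty list if there is no match.
sub parse_timestamp {
    my ($field) = @_;
    if ($field =~ /\[([^:]+):(..):(..):(..) -0500\]/) {
        return ($1, $2, $3, $4);
    }
    return;
}

my ($date, $hh, $mm, $ss) = parse_timestamp("[16/Nov/2005:16:32:50 -0500]");
print "date=$date hour=$hh min=$mm sec=$ss\n";
```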
23
Parsing log: Time Zone
The time zone is given as an offset from GMT
The time zone in the log file is that of the SERVER, not of the visitor, so it is nearly always the same throughout the log, but it changes during daylight saving time
In our test log file the time zone is -0500, the US Eastern time zone
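Because the offset changes during daylight saving time, a pattern with a hard-coded -0500 will reject lines logged in summer. One possible variation, not taken from the slides, captures the offset instead. The helper name parse_zone and the conversion to minutes are assumptions made for this sketch.

```perl
use strict;
use warnings;

# parse_zone is a hypothetical helper: it returns the server's offset
# from GMT in minutes, or undef if no offset is found.
sub parse_zone {
    my ($field) = @_;
    # sign, two digits of hours, two digits of minutes, then the closing bracket
    if ($field =~ /\[[^\]]+ ([+-])(\d\d)(\d\d)\]/) {
        my $minutes = $2 * 60 + $3;
        return $1 eq '-' ? -$minutes : $minutes;
    }
    return;
}

print parse_zone("[16/Nov/2005:16:32:50 -0500]"), "\n";
```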
24
Parsing log: Request
Regular expression for parsing the Request field:
    "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"
  " at both ends : opening and closing quotes
  (GET|HEAD|POST|OPTIONS) : method
  first (\S+) : URL, captures any string of 1 or more non-blanks
  HTTP(\S+) : HTTP version, usually ignored
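The request expression can be checked against the request from the opening slide. This is a sketch with a hypothetical helper parse_request; the ^ and $ anchors are added here only because the field is tested in isolation.

```perl
use strict;
use warnings;

# parse_request is a hypothetical helper: it returns the list
# (method, URL, version), or an empty list if the field does not match.
sub parse_request {
    my ($field) = @_;
    if ($field =~ /^"(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"$/) {
        return ($1, $2, $3);
    }
    return;
}

my ($method, $url, $version) = parse_request('"GET /jobs/ HTTP/1.1"');
print "method=$method url=$url version=$version\n";
```

Note that the third capture includes the slash after HTTP, so the version comes back as /1.1.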
25
Parsing log: Status code and Object size
Status (Response) code is always a 3-digit number followed by a space, so it can be matched with (\d\d\d)
Object size is either a number or "-", followed by a space
The simplest regex to match it is (\S+)
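Since (\S+) accepts both a number and "-", code that adds up sizes has to decide what "-" means. The sketch below treats it as zero bytes, which is an assumption rather than something stated in the slides; parse_status_size is a hypothetical helper name and the "304 -" input is a made-up example.

```perl
use strict;
use warnings;

# parse_status_size is a hypothetical helper: it returns (status, size),
# or an empty list if the text does not start with the two fields.
sub parse_status_size {
    my ($text) = @_;
    if ($text =~ /^(\d\d\d) (\S+)/) {
        my ($status, $size) = ($1, $2);
        $size = 0 if $size eq '-';   # assumption: "-" means no bytes were sent
        return ($status, $size);
    }
    return;
}

print join(" ", parse_status_size("200 15140")), "\n";
print join(" ", parse_status_size("304 -")), "\n";
```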
26
Parsing log: Referrer
The Referrer is a string enclosed in double quotes "…"
It can have anything inside except a double quote
It can also be "-" in the case of a direct request
Not documented, but it can be "" (nothing between the quotes)
Referrer can be matched by:
    "([^"]*)"
  " at both ends : opening and closing quotes
  [^"]* : anything except a double quote, appearing zero or more times
27
Parsing log: User agent
User agent is also a string enclosed in double quotes "…" that can have anything inside except a double quote
It can also be "-"
User agent can be matched by:
    "([^"]+)"
  " at both ends : opening and closing quotes
  [^"]+ : anything except a double quote, appearing one or more times
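The two quoted fields at the end of a log line can be tested together. A minimal sketch, assuming a hypothetical helper parse_ref_agent; the sample text is the tail of one of the log lines on the opening slide.

```perl
use strict;
use warnings;

# parse_ref_agent is a hypothetical helper: it returns (referrer, agent)
# taken from the two quoted strings at the end of the text.
sub parse_ref_agent {
    my ($text) = @_;
    if ($text =~ /"([^"]*)" "([^"]+)"$/) {
        return ($1, $2);
    }
    return;
}

my $tail = '"http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"';
my ($referrer, $agent) = parse_ref_agent($tail);
print "referrer=$referrer\nagent=$agent\n";
```

The referrer pattern uses * so that an empty "" still matches, while the user agent pattern uses + and therefore needs at least one character.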
28
Parsing a web log line: putting it all together
The matching is done by the following (should be all on one line):
    if ($line =~ /^(\S+) - - \[([^:]+):(..):(..):(..) -0500\] "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)" (\d\d\d) (\S+) "([^"]*)" "([^"]+)"/ ) { … }
Full code is in program weblog_parse.pl
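The program weblog_parse.pl is not reproduced here, so the following is only a sketch of how such a parser might look, not the author's code. It spreads the expression over several lines with the /x modifier (which ignores unescaped whitespace, so literal spaces are written as [ ]), and it captures the time zone instead of hard-coding -0500. The helper name parse_line and the hash keys are assumptions.

```perl
use strict;
use warnings;

# parse_line is a hypothetical helper: it splits one log line into named
# fields and returns a hash reference, or undef if the line does not match.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ m{
        ^(\S+) [ ] - [ ] - [ ]                                  # IP address, ident, auth user
        \[ ([^:]+) : (..) : (..) : (..) [ ] ([+-]\d{4}) \] [ ]  # date, time, zone
        "(GET|HEAD|POST|OPTIONS) [ ] (\S+) [ ] HTTP(\S+)" [ ]   # request
        (\d\d\d) [ ] (\S+) [ ]                                  # status, size
        "([^"]*)" [ ] "([^"]+)"                                 # referrer, user agent
    }x;
    return {
        ip     => $1, date => $2, hour    => $3, min    => $4,  sec  => $5,
        zone   => $6,
        method => $7, url  => $8, version => $9, status => $10, size => $11,
        referrer => $12, agent => $13,
    };
}

# First log line from the opening slide.
my $sample = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;.NET CLR 1.1.4322)"';

my $rec = parse_line($sample);
print "$rec->{ip} requested $rec->{url} (status $rec->{status})\n";
```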
29
Perl arrays
A Perl array is an ordered list of items
Array names begin with @
Array initialization:
    my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");
30
Perl arrays, number of items
When referring to a single array item, the name begins with "$"
E.g. we print the first array item (index 0) using
    print $days[0];
The index of the last item in an array is $#array, so $#days is 6
The number of items is scalar(@days), which is 7
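The item count and the last index are easy to confuse: for any non-empty array, $#array is one less than the number of items. A short sketch, with count_and_last as a hypothetical helper name:

```perl
use strict;
use warnings;

# count_and_last is a hypothetical helper: it returns the number of items
# in the list it is given and the index of the last item.
sub count_and_last {
    my @items = @_;
    return (scalar(@items), $#items);
}

my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");
my ($count, $last) = count_and_last(@days);
print "items: $count, last index: $last, last item: $days[$last]\n";
```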
31
Perl array iteration
Iterating over the entire array
    foreach my $day (@days) { print $day, "\n"; }
is the same as
    for (my $n = 0; $n < 7; $n++) { print $days[$n], "\n"; }
32
Perl hash
A hash is an unordered list of key, value pairs
Hash names begin with %
Hash initialization:
    my %capitals = ("USA", "Washington D.C.", "France", "Paris", "China", "Beijing");
33
Perl hash reference
When referring to a single hash item, the name begins with "$"
To get the capital of China from %capitals we use
    $capitals{"China"}
To add the capital of the UK, we use
    $capitals{"UK"} = "London";
34
Perl hash iteration
Iteration over the entire hash:
    foreach my $country (keys %capitals) {
        print "$country capital $capitals{$country}\n";
    }
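Hashes are a natural fit for counting in web log analysis. As an illustrative sketch (the helper count_by_ip is hypothetical), the code below counts requests per IP address using the IP expression from earlier in this module. The sample lines are truncated copies of the log lines on the opening slide.

```perl
use strict;
use warnings;

# count_by_ip is a hypothetical helper: it returns a hash that maps each
# IP address to the number of log lines that start with it.
sub count_by_ip {
    my (@lines) = @_;
    my %hits;
    for my $line (@lines) {
        $hits{$1}++ if $line =~ /^(\S+) - - /;
    }
    return %hits;
}

# Truncated log lines (referrer and user agent omitted for brevity).
my @sample = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140',
    '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453',
    '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145',
);

my %hits = count_by_ip(@sample);
foreach my $ip (sort keys %hits) {
    print "$ip $hits{$ip}\n";
}
```

Sorting the keys gives a stable output order, since the order returned by keys is not defined.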
35
Additional tools for Web log analysis
Perl for web log analysis:
  www.oreilly.com/catalog/perlwsmng/chapter/ch08.html
Some web log analysis tools:
  Analog: www.analog.cx/
  AWstats: awstats.sourceforge.net/
  Webalizer: www.mrunix.net/webalizer/
  FTPweblog: www.nihongo.org/snowhare/utilities/ftpweblog/