Interpreting logs and reports
IIPC GA 2014, Crawl engineers and operators workshop
Bert Wendland / BnF
Introduction

Job logs and reports created by Heritrix contain a lot of information, much more than is visible at first view. Information can be obtained by extracting, filtering and evaluating certain fields of the log files. Specialised evaluation tools are available for everything; however, it is sometimes difficult to find the right one and to adapt it to actual needs. This presentation shows some examples of how information can be obtained with standard unix tools. They are available by default on every unix installation and are ready to be used immediately. This brings a flexibility into the evaluation process that no specialised tool can provide. The list of examples is by no means exhaustive; it is intended to show some possibilities as inspiration for further work.

The unix tools used here are: cat, grep, sort, uniq, sed, awk, wc, head, regular expressions, and pipelining (i.e. the output of one command is used as input for the next command on the same command line).

The crawl.log used in the examples comes from a typical medium-sized job of a selective crawl. The job ran between T10:26:08.273Z and T16:16:30.214Z. The crawl.log contains 3,205,936 lines.
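As a first taste of this pipelining style, the following one-liner counts the fetched URIs per fetch status code (a sketch; the status code sits in the 2nd column of the crawl.log):

$ cat crawl.log | awk '{print $2}' | sort | uniq -c | sort -nr | head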
clm

The extraction of columns from log files is a basic action which is heavily used in the evaluation process. It can be realised with the awk command. Extracted columns can be rearranged in arbitrary order. They are separated by default by "white space" (one or several space or tab characters), or optionally by any other character indicated by the -F option. To facilitate daily operations, an alias "clm" (the name stands for "columns") has been created which shortens the use of the awk command:

$ awk '{print $3}'                  $ clm 3
$ awk '{print $1 $3}'               $ clm 1 3
$ awk '{print $3 $1}'               $ clm 3 1
$ awk -F ':' '{print $1 $2}'        $ clm -F ':' 1 2
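The definition of clm itself is not shown here. One possible implementation as a small shell function (an illustrative sketch, not necessarily the alias used at BnF; this variant separates the printed columns with a space):

clm() {
    # pass an optional "-F <separator>" straight through to awk
    local fs=()
    if [ "$1" = "-F" ]; then
        fs=(-F "$2")
        shift 2
    fi
    # build an awk program like '{print $1, $3}' from the remaining arguments
    local prog='{print' col
    for col in "$@"; do
        prog="$prog \$$col,"
    done
    awk "${fs[@]}" "${prog%,}}"
}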
sum_col

A perl script "sum_col" calculates the sum of all numerical values found in the first column of every line of an input data stream:

#!/usr/bin/env perl
use strict;
use warnings;

my $sum = 0;
while (<>) {
    chomp;
    my @parts = split;
    if (@parts > 0) {
        # add the leading numerical value (integer or decimal) of the first column
        if ($parts[0] =~ /^(\d+(\.\d+)?)/) {
            $sum += $1;
        }
    }
}
print $sum . "\n";
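A quick sanity check on hand-made input (the numbers are arbitrary; lines that do not start with a number are ignored):

$ printf '3 foo\n4.5 bar\nabc\n' | sum_col
7.5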
avg URI fetch duration

The crawl.log holds in its 9th column a timestamp indicating when a network fetch was begun, together with the millisecond duration of the fetch, separated from the begin-time by a "+" character:

[...]T10:26:09.345Z [...] P [...] text/plain #[...] [...]+[...] sha1:EZ6YOU7YB3VVAGOD4PPMQGG3VKZN42D2 [...] content-size:3534

One can extract the duration of all fetches, optionally limited via the 4th field (the URI of the downloaded document) to a particular host or domain, to compute the average URI fetch duration of a job. For the whole job:

$ cat crawl.log | clm 9 | clm -F '+' 2 | sum_col
$ cat crawl.log | clm 9 | grep -cE '[0-9]+\+[0-9]+'

2,582,697,481 / 3,197,842 ≈ 808 [ms]

For a single host or domain (<host> stands for the host of interest):

$ cat crawl.log | clm 4 9 | grep <host> | clm 2 | \
  clm -F '+' 2 | sum_col
$ cat crawl.log | clm 4 9 | grep <host> | wc -l

70,041,498 / 72,825 ≈ 962 [ms]
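The same average can also be computed in a single pass with awk alone (a sketch; it splits the 9th column at the "+" and skips lines where no duration is present):

$ awk '{ if (split($9, a, "+") == 2) { sum += a[2]; cnt++ } } END { if (cnt) print sum/cnt }' crawl.log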
nb of images per host

To know the number of images fetched per host:

$ grep -i image mimetype-report.txt
  image/jpeg
  image/png
  image/gif
  image/jpg

$ cat crawl.log | clm 4 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
  clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//' | sort | uniq -c
      1 -gof.pagesperso-orange.fr
      4 0.academia-assets.com
  [...] gravatar.com
     10 0.media.collegehumor.cvcdn.com
      4 0.static.collegehumor.cvcdn.com
  [...]

The top 5:

$ cat crawl.log | clm 4 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
  clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//' | sort | uniq -c | \
  sort -nr | head -n 5
  [...] upload.wikimedia.org
  [...] fr.cdn.v5.futura-sciences.com
  [...] pbs.twimg.com
  [...] media.meltybuzz.fr
  [...] s-
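The clm -F / 3 step works because the host is the third "/"-separated field of an absolute URL; the sed expression then strips a leading "www", optionally followed by one digit or letter. A quick check with a made-up URL:

$ echo 'http://www2.example.com/images/logo.png' | clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//'
example.com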
nb of images per seed

Very often, images embedded in a web page are not fetched from the same host as the page itself. So, instead of counting the number of images per host, it is more interesting to count the number of images collected per referring site or, even better, per seed. To make this possible, the "source-tag-seeds" option, which is off by default, must be activated:

order.xml:  <boolean name="source-tag-seeds">true</boolean>

The crawl.log will then contain in its 11th column a tag for the seed from which the URI being treated originated:

[...]-02-24T10:28:10.291Z [...] [...]sphotos-f-a.akamaihd.net/hphotos-ak-frc1/t1/s261x260/[...]_[...]_n.jpg X [...]ook.com/brice.roger1 image/jpeg #[...] [...] sha1:DABRRLQPPAKH3QOW7MHGSMSIDDDDRY7D [...] content-size:8035,3t

$ cat crawl.log | clm 11 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
  clm 1 | sort | uniq -c | sort -nr | head -n 5
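Once the seed tag is present in the 11th column, it can be used for other per-seed statistics as well, for example the total number of URIs collected per seed (a sketch):

$ cat crawl.log | clm 11 | sort | uniq -c | sort -nr | head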
timeouts per time period

The -2 fetch status code of a URI stands for "HTTP connect failed":

[...]-02-24T17:48:22.350Z -2 - [...]leaksactu.wordpress.com/2013/04/07/wikileaks-et-les-jeux-olympiques-de-sotchi-de-2014/ EX [...] no-type #185 [...]

Its cause may be on the server side but also on the network. An increase in the number of -2 codes in a certain time period might indicate network problems which can then be investigated further. This is the number of -2 codes per hour:

$ grep 'Z -2 ' crawl.log | clm -F : 1 | uniq -c

The output shows one line per hour of the crawl: the count of -2 codes, followed by the timestamp truncated after the hour (everything before the first ":").
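Splitting at the first ":" keeps exactly the date and the hour of the leading timestamp, which is the granularity wanted here. Illustrated on a made-up log line:

$ echo '2014-02-24T17:48:22.350Z -2 - http://example.org/ ...' | clm -F : 1
2014-02-24T17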
timeouts per time period

It is more meaningful to extract the -2 codes from the local-errors.log, as every URI is retried several times there before it ends up as "given up" in the crawl.log:

$ grep 'Z -2' local-errors.log | clm -F : 1 | uniq -c

The number of -2 codes can also be counted per minute, here restricted to one hour of the crawl day:

$ grep 'Z -2' local-errors.log | grep '^[...]T17' | \
  clm -F : 1 2 | uniq -c

The output then shows one line per minute: the count of -2 codes, followed by the date and hour (first field) and the minute (second field).
timeouts per time period

If all running Heritrix instances use a common workspace to store their logs, an extraction over all the log files is possible, which makes peaks easier to detect:

$ cat /.../jobs/*/logs/crawl.log | grep 'Z -2' | clm -F : 1 | uniq -c

Again the output shows one count per hour, now drawn from the crawl.logs of all jobs.
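Since every instance contributes its own chronological block, the same hour can appear several times in the concatenated stream; adding a sort before uniq -c merges those blocks into a single count per hour (a sketch, reusing the truncated path from above):

$ cat /.../jobs/*/logs/crawl.log | grep 'Z -2' | clm -F : 1 | sort | uniq -c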
detect crawler traps via URI patterns

The number of URIs fetched from the [...]-filmfest.com host is higher than expected; this shows up in hosts-report.txt, which lists the number of fetched URIs per host (with dns: as a pseudo-host):

$ head hosts-report.txt

Fetched URLs are extracted (4th column), sorted and written into a new file:

$ cat crawl.log | clm 4 | grep -v dns: | sort > crawl.log.4-sort

Crawler traps can then be detected by looking at the list:

$ grep '[...]-filmfest.com' crawl.log.4-sort
[...]-filmfest.com//04_centredoc/01_archives/autour_fest/galerie/2011/04_centredoc/01_archives/autour_fest/galerie/2011/commun/merc9/6gd.jpg
[...]-filmfest.com//04_centredoc/01_archives/autour_fest/galerie/2011/04_centredoc/01_archives/autour_fest/galerie/2011/commun/cloture/1.jpg
[...]
[...]-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15
[...]-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03
[...]-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15
[...]-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03
[...]-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=15
[...]-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=43
[...]-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=63

The first two lines show the same directory sequence repeated inside a single path; the index.php lines show the same parameters being appended over and over again.
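Traps of the first kind, where a directory sequence repeats inside a single path, can also be flagged automatically. A rough heuristic (a sketch; the back-reference pattern matches any two consecutive path segments that occur twice in the same URI and will also produce false positives):

$ grep '\(/[^/][^/]*/[^/][^/]*\)/.*\1' crawl.log.4-sort | head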
detect crawler traps via URI patterns

Another approach to detect crawler traps is to extract URIs having a high number of URL parameters (separated by "&"). To find the highest number of URL parameters:

$ cat crawl.log | clm 4 | grep '&' | sed -r 's/[^&]//g' | wc -L
68

To extract URIs having 20 or more URL parameters:

$ grep -E '(\&.*){20,}' crawl.log | grep --color=always '\&'
[...]-inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=annuairesup&PROC=SAISIE_DEFAULTSTRUCTUREKSUP&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&STATS=Y&STATS=Y[...]&STATS=Y
[...]-inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=cataloguelien&PROC=SAISIE_LIEN&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&STATS=Y&STATS=Y[...]&STATS=Y
[...]-inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=core&PROC=SAISIE_NEWSGW&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&STATS=Y&STATS=Y[...]&STATS=Y
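The number of parameters can also be printed per URI, which makes the threshold easier to choose (a sketch; the count is simply the number of "&" characters in each URI):

$ cat crawl.log | clm 4 | grep '&' | awk -F '&' '{print NF-1, $0}' | sort -nr | head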
detect crawler traps via URI patterns

URL parameters can be sorted and counted to detect repetitions more easily:

$ cat crawl.log | clm 4 | \
  grep '[...]' | \
  grep --color=always '\&'
[...]se&ajax=true&show=liste_articles&numero=1110&pre&id_param=[...]&java=false&ajax=true&show=liste_articles&numero=1110&pre&id_param=[...]&java=false&ajax=true&show=liste_articles&numero=1110&pre&id_param=[...]&java=false&ajax=true&show=liste_articles&numero=1110&preaction=mymodule&id_param=[...]&java=false&ajax=true&show=liste_articles&numero=1110

$ cat crawl.log | clm 4 | \
  grep '[...]' | \
  sed 's/&/\n/g' | sort | uniq -c
      5 ajax=true
        id_param=[...]
        java=false
      6 numero=[...]
        pre
      1 preaction=mymodule
      5 show=liste_articles
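The same split-and-count idea, shown on a hand-made query string (both the URL and its parameters are invented for the illustration):

$ echo 'http://example.org/index.php?a=1&b=2&a=1&c=3&a=1' | sed 's/&/\n/g' | sort | uniq -c
      2 a=1
      1 b=2
      1 c=3
      1 http://example.org/index.php?a=1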
arcfiles-report.txt

Heritrix does not provide a report for the ARC files written by a completed crawl job. If the option "org.archive.io.arc.ARCWriter.level" in the heritrix.properties file is set to INFO, Heritrix logs the opening and closing of ARC files in the heritrix.out file. This information can then be transformed into an arcfiles report. The same works for WARC files.

$ grep "Opened.*arc.gz" heritrix.out
[...]:26:[...] INFO thread-14 org.archive.io.WriterPoolMember.createFile() Opened /dlweb/data/NAS510/jobs/current/g110high/9340_[...]/arcs/BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz.open

$ grep "Closed.*arc.gz" heritrix.out
[...]:30:[...] INFO thread-171 org.archive.io.WriterPoolMember.close() Closed /dlweb/data/NAS510/jobs/current/g110high/9340_[...]/arcs/BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz, size [...]

$ cat arcfiles-report.txt
[ARCFILE]                                   [Opened]             [Closed]             [Size]
BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz   [...]T10:26:08.254Z  [...]T10:30:40.822Z  [...]
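A minimal sketch of how such a report could be assembled from heritrix.out. It assumes the "Opened"/"Closed" lines look as above, with the log timestamp in the first field, the ARC path as the last path-like token and the Closed line ending in ", size <bytes>"; the exact column layout of the resulting report is a local convention:

$ awk '
    # remember the timestamp of each "Opened ....arc.gz.open" line, keyed by ARC file name
    /Opened .*\.arc\.gz\.open/ {
        n = split($NF, p, "/"); f = p[n]; sub(/\.open$/, "", f)
        opened[f] = $1
    }
    # on the matching "Closed" line, print one report row: name, opened, closed, size
    /Closed .*\.arc\.gz,/ {
        path = $(NF-2); sub(/,$/, "", path)
        n = split(path, p, "/"); f = p[n]
        print f, opened[f], $1, $NF
    }
' heritrix.out > arcfiles-report.txt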