Presentation is loading. Please wait.

Presentation is loading. Please wait.

Interpreting logs and reports IIPC GA 2014 Crawl engineers and operators workshop Bert Wendland/BnF.

Similar presentations


Presentation on theme: "Interpreting logs and reports IIPC GA 2014 Crawl engineers and operators workshop Bert Wendland/BnF."— Presentation transcript:

1 Interpreting logs and reports IIPC GA 2014 Crawl engineers and operators workshop Bert Wendland/BnF

2 Introduction Job logs and reports created by Heritrix contain a lot of information, much more than visible on the first view. Information can be obtained by extracting, filtering and evaluating certain fields of log files. Specialised evaluation tools are available for everything. However, it is sometimes difficult to look for the right one and to adapt it to actual needs. This presentation shows some examples of how information can be obtained by using standard unix tools. They are available by default on every unix installation and are ready to be used immediately. This brings a flexibility into the evaluation process that no specialised tool can provide. The list of examples is not exhaustive at all. It is intended to show some possibilities as inspiration for further work. The unix tools used here are: cat, grep, sort, uniq, sed, awk, wc, head, regular expressions, and pipelining (i.e. the output data of one command is used as input data for the next command in the same command line). The crawl.log used in the examples comes from a typical medium-sized job of a selective crawl. The job run between 2014-02-24T10:26:08.273Z and 2014-03-04T16:16:30.214Z. The crawl.log contains 3,205,936 lines.

3 clm The extraction of columns from log files is a basic action which is heavily used in the evaluation process. It can be realised with the awk command. Extracted columns can be rearranged in arbitrary order. They are separated by default by a “white space” (one or several space or tab characters) or optionally by any other character indicated by the -F option. To facilitate daily life operations, an alias “clm” (the name stands for “columns”) has been created which shortens the use of the awk command. $ awk '{print $3}'  clm 3 $ awk '{print $1 $3}'  clm 1 3 $ awk '{print $3 $1}'  clm 3 1 $ awk -F ':' '{print $1 $2}'  clm -F ':' 1 2

4 sum_col A perl script “sum_col” calculates the sum of all numerical values that can be found in the first column of every line of an input data stream. #!/usr/bin/env perl use warnings; my $sum = 0; while ( ) { chomp; my @parts = split; if (scalar @parts > 0) { if ($parts[0] =~ /^(\d+(\.\d+)?)/) { $sum += $1; } print $sum. "\n";

5 avg URI fetch duration The crawl.log holds in its 9 th column a timestamp indicating when a network fetch was begun and the millisecond duration of the fetch, separated from the begin-time by a “+” character. 2014-02-24T10:26:09.345Z 200 3339 http://www.facebook.com/robots.txt P http://www.facebook.com/antoine.adelisse text/plain #042 20140224102608798+543 sha1:EZ6YOU7YB3VVAGOD4PPMQGG3VKZN42D2 http://www.facebook.com/antoine.adelisse content-size:3534 One can extract the duration of all the fetches, limited optionally in the 4 th field (the URI of the document downloaded) to a particular host or domain, to compute the average URI fetch duration of a job. $ cat crawl.log | clm 9 | clm -F '+' 2 | sum_col 2582697481 $ cat crawl.log | clm 9 | grep -cE '[0-9]+\+[0-9]+' 3197842  2,582,697,481 / 3,197,842 = 805.6 [ms] $ cat crawl.log | clm 4 9 | grep www.facebook.com | clm 2 | \ clm -F '+' 2 | sum_col 70041498 $ cat crawl.log | clm 4 9 | grep www.facebook.com | wc -l 72825  70,041,498 / 72,825 = 952.4 [ms]

6 nb of images per host To know the number of images fetched per host: $ grep -i image mimetype-report.txt 1139893 33332978181 image/jpeg 296992 7276505356 image/png 110204 380490180 image/gif 79039 663127491 image/jpg $ cat crawl.log | clm 4 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \ clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//' | sort | uniq –c 1 -gof.pagesperso-orange.fr 4 0.academia-assets.com 6396 0.gravatar.com 10 0.media.collegehumor.cvcdn.com 4 0.static.collegehumor.cvcdn.com The top 5: $ cat crawl.log | clm 4 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \ clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//' | sort | uniq -c | \ sort -nr | head -n 5 89030 upload.wikimedia.org 78764 fr.cdn.v5.futura-sciences.com 44992 pbs.twimg.com 38781 media.meltybuzz.fr 36998 s-www.ledauphine.com

7 nb of images per seed Very often, images embedded into a web page are not fetched from the same host as the page itself. So, instead counting the number of images per host, it is more interesting to have the number of images collected per referring site or – even better – per seed. To make this possible, the “source-tag-seed” option, which is off by default, must be activated: $ order.xml: true The crawl.log will then contain in its 11 th column a tag for the seed the URI being treated originated from. 2014 ‑ 02 ‑ 24T10:28:10.291Z 200 7802 https://fbcdn ‑ sphotos ‑ f ‑ a.akamaihd.net /hphotos ‑ ak ‑ frc1/t1/s261x260/1621759_10202523932754072_n.jpg X https://www.faceb ook.com/brice.roger1 image/jpeg #190 20140224102801374+8902 sha1:DABRRLQPPAKH3QO W7MHGSMSIDDDDRY7D https://www.facebook.com/brice.roger1 content ‑ size:8035,3t $ cat crawl.log | clm 11 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \ clm 1 | sort | uniq -c | sort -nr | head -n 10 110510 http://fr.oakley.com/sports 86479 http://www.futura-sciences.com/magazines/high-tech/infos/actu 74621 http://fr.wikipedia.org/wiki/Sotchi_2014 74007 http://www.linternaute.com/sport/jeux-olympiques/ 55742 http://www.tuxboard.com/les-tenues-officielles-des-j-o-de-sotchi/ 49135 http://www.meltybuzz.fr/sochi-2014/ 42308 http://www.legorafi.fr/tag/jeux-olympiques/ 39503 http://www.madmoizelle.com/sexisme-france-televisions-jeux-olymp 38426 http://www.ledauphine.com/sport/sports-hautes-alpes 32830 https://www.facebook.com/geoffrey.mattei

8 timeouts per time period The -2 fetch status code of a URI stands for “HTTP connect failed”. 2014 ‑ 02 ‑ 24T17:48:22.350Z ‑ 2 ‑ http://buy.itunes.apple.com/ EX http://s.skimresources.com/js/725X1342.skimlinks.js no ‑ type #185 ‑ ‑ http://wiki leaksactu.wordpress.com/2013/04/07/wikileaks ‑ et ‑ les ‑ jeux ‑ olympiques ‑ de-sotchi- de-2014/ le:SocketTimeoutException@HTTP,30t Its cause may be on the server side but also on the network. An increase of the number of -2 codes in a certain time period might indicate network problems which can then be further investigated. This is the number of -2 codes per hour: $ grep 'Z -2 ' crawl.log | clm -F : 1 | uniq -c 9 2014-02-24T17 34 2014-02-24T18 17 2014-02-24T19 19 2014-02-24T20 10 2014-02-24T21 26 2014-02-24T22 36 2014-02-24T23 182 2014-02-25T00 3 2014-02-25T01 4 2014-02-25T02 3 2014-02-25T03 2 2014-02-25T04 1 2014-02-25T05

9 timeouts per time period It is more meaningful to extract the -2 codes from the local-errors.log as every URI is repeatedly tried before it arrives as “given up” in the crawl.log: $ grep 'Z -2' local-errors.log | clm -F : 1 | uniq –c 376 2014-02-24T15 586 2014-02-24T16 1320 2014-02-24T17 1234 2014-02-24T18 1162 2014-02-24T19 1101 2014-02-24T20 892 2014-02-24T21 1008 2014-02-24T22 928 2014-02-24T23 999 2014-02-25T00 The number of -2 codes can be also be counted per minute: $ grep 'Z -2' local-errors.log | grep ^2014-02-24T17 | \ clm -F : 1 2 | uniq –c 12 2014-02-24T17 06 19 2014-02-24T17 07 12 2014-02-24T17 08 157 2014-02-24T17 09 12 2014-02-24T17 10 15 2014-02-24T17 11 6 2014-02-24T17 12 5 2014-02-24T17 13

10 timeouts per time period If all the running instances of Heritrix use a common workspace to store their logs, an extraction from all the log files is possible to better detect peaks: $ cat /.../jobs/*/logs/crawl.log | grep 'Z -2' | clm -F : 1 | uniq -c 199 2012-10-15T11 446 2012-10-15T12 793 2012-10-15T13 970 2012-10-15T14 677 2012-10-15T15 632 2012-10-15T16 553 2012-10-15T17 606 2012-10-15T18 801 2012-10-15T19 754 2012-10-15T20 701 2012-10-15T21 705 2012-10-15T22 539 2012-10-15T23 508 2012-10-16T00 899 2012-10-16T01 2099 2012-10-16T02 1298 2012-10-16T03 1064 2012-10-16T04 929 2012-10-16T05 983 2012-10-16T06 1274 2012-10-16T07 2131 2012-10-16T08 2639 2012-10-16T09 2288 2012-10-16T10 2596 2012-10-16T11 1950 2012-10-16T12

11 detect crawler traps via URI patterns The number of URIs fetched from www.clermont-filmfest.com is higher than expected: $ head hosts-report.txt 163438 16114773 dns: 0 13 163438 16114773 0 0 0 0 138317 15279274975 www.clermont-filmfest.com 0 125252 138317 15279274975 133786 6131883679 www.ffsa.org 0 314002 133786 6131883679 0 0 0 0 133757 6411525222 www.actu-environnement.com 0 18990 133757 6411525222 Fetched URLs are extracted (4 th column), sorted and written into a new file: $ cat crawl.log | clm 4 | grep -v dns: | sort > crawl.log.4-sort Crawler traps can then be detected by looking at the list: $ grep www.clermont-filmfest.com crawl.log.4-sort http://www.clermont ‑ filmfest.com//04_centredoc/01_archives/autour_fest/galerie/2011/04_centredo c/01_archives/autour_fest/galerie/2011/commun/merc9/6gd.jpg http://www.clermont ‑ filmfest.com//04_centredoc/01_archives/autour_fest/galerie/2011/04_centredo c/01_archives/autour_fest/galerie/2011/commun/cloture/1.jpg [...] http://www.clermont ‑ filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15 http://www.clermont ‑ filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03 http://www.clermont ‑ filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15 http://www.clermont ‑ filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03 http://www.clermont ‑ filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=15 http://www.clermont ‑ filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=43 http://www.clermont ‑ filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=63

12 detect crawler traps via URI patterns Another approach to detect crawler traps is the extraction of URIs having a high number of URL parameters (separated by “&”). To find the highest number of URL parameters: $ cat crawl.log | clm 4 | grep '&' | sed -r 's/[^&]//g' | wc –L 68 To extract URIs having 20 or more URL parameters: $ grep -E '(\&.*){20,}' crawl.log | grep --color=always '\&' http://arvise.grenoble ‑ inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=annuairesup&PR OC=SAISIE_DEFAULTSTRUCTUREKSUP&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&ST ATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&ST ATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&ST ATS=Y&STATS=Y&STATS=Y http://arvise.grenoble ‑ inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=cataloguelien& PROC=SAISIE_LIEN&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&STATS=Y&STATS=Y& STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y& STATS=Y&STATS=Y&STATS=Y&STATS=Y http://arvise.grenoble ‑ inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=core&PROC=SAIS IE_NEWSGW&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&STATS=Y&STATS=Y&STATS=Y &STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y &STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y &STATS=Y&STATS=Y

13 detect crawler traps via URI patterns URL parameters can be sorted and counted to easier detect repetitions: $ cat crawl.log | clm 4 | \ grep 'http://cnb.avocat.fr/index.php?start=0&numero=1110' | \ grep --color=always '\&' http://cnb.avocat.fr/index.php?start=0&numero=1110&pre&id_param=1330690&java=fal se&ajax=true&show=liste_articles&numero=1110&pre&id_param=1330690&java=false&aja x=true&show=liste_articles&numero=1110&pre&id_param=1330690&java=false&ajax=true &show=liste_articles&numero=1110&pre&id_param=1330690&java=false&ajax=true&show= liste_articles&numero=1110&preaction=mymodule&id_param=1330690&java=false&ajax=t rue&show=liste_articles&numero=1110 $ cat crawl.log | clm 4 | \ grep 'http://cnb.avocat.fr/index.php?start=0&numero=1110' | \ sed 's/&/\n/g' | sort | uniq –c 5 ajax=true 1 http://cnb.avocat.fr/index.php?start=0 5 id_param=1330690 5 java=false 6 numero=1110 4 pre 1 preaction=mymodule 5 show=liste_articles

14 arcfiles-report.txt Heritrix does not provide a report for the ARC files written by the completed crawl job. If the option “org.archive.io.arc.ARCWriter.level” in the heritrix.properties file is set to INFO, Heritrix will log opening and closing of ARC files in the heritrix.out file. These information can then be transformed into an arcfiles report. Same for WARC files. $ grep "Opened.*arc.gz" heritrix.out 2014-02-24 10:26:08.254 INFO thread-14 org.archive.io.WriterPoolMember.createFile() Opened /dlweb/data/NAS510/jobs/current/g110high/9340_1393236979459/arcs/9340-32- 20140224102608-00000-BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz.open $ grep "Closed.*arc.gz" heritrix.out 2014-02-24 10:30:40.822 INFO thread-171 org.archive.io.WriterPoolMember.close() Closed /dlweb/data/NAS510/jobs/current/g110high/9340_1393236979459/arcs/9340-32- 20140224102608-00000-BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz, size 100010570 $ cat arcfiles-report.txt [ARCFILE] [Opened] [Closed] [Size] 9340-32-20140224102608-00000-BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz 2014-02- 24T10:26:08.254Z 2014-02-24T10:30:40.822Z 100010570


Download ppt "Interpreting logs and reports IIPC GA 2014 Crawl engineers and operators workshop Bert Wendland/BnF."

Similar presentations


Ads by Google