Interpreting logs and reports IIPC GA 2014 Crawl engineers and operators workshop Bert Wendland/BnF

Introduction

Job logs and reports created by Heritrix contain a lot of information, much more than is visible at first view. Information can be obtained by extracting, filtering and evaluating certain fields of the log files. Specialised evaluation tools are available for everything; however, it is sometimes difficult to find the right one and to adapt it to actual needs. This presentation shows some examples of how information can be obtained by using standard Unix tools. They are available by default on every Unix installation and are ready to be used immediately. This brings a flexibility into the evaluation process that no specialised tool can provide. The list of examples is not exhaustive at all; it is intended to show some possibilities as inspiration for further work.

The Unix tools used here are: cat, grep, sort, uniq, sed, awk, wc, head, regular expressions, and pipelining (i.e. the output of one command is used as input for the next command in the same command line).

The crawl.log used in the examples comes from a typical medium-sized job of a selective crawl. The job ran between T10:26:08.273Z and T16:16:30.214Z. The crawl.log contains 3,205,936 lines.
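
As a first illustration of this pipelining style (a generic sketch, not an example from the presentation; it assumes the standard crawl.log layout in which the 2nd column holds the fetch status code, as seen in the -2 examples later on), the distribution of fetch status codes of a job can be obtained with one command line:

$ cat crawl.log | awk '{print $2}' | sort | uniq -c | sort -nr | head
# counts how often each fetch status code occurs and prints the ten most frequent ones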

clm

The extraction of columns from log files is a basic action which is heavily used in the evaluation process. It can be realised with the awk command. Extracted columns can be rearranged in arbitrary order. Columns are separated by default by "white space" (one or several space or tab characters) or, optionally, by any other character indicated by the -F option. To facilitate daily operations, an alias "clm" (the name stands for "columns") has been created which shortens the use of the awk command. The following awk commands and clm invocations are equivalent:

$ awk '{print $3}'               clm 3
$ awk '{print $1, $3}'           clm 1 3
$ awk '{print $3, $1}'           clm 3 1
$ awk -F ':' '{print $1, $2}'    clm -F ':' 1 2
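
The definition of clm itself is not shown in the presentation. A minimal bash sketch with the same interface (an assumption; BnF's actual alias may be implemented differently) could look like this:

# sketch of a possible "clm" helper, not BnF's actual definition
# usage: clm [-F <separator>] <column> [<column> ...]
clm() {
    local fsopt=()
    if [ "$1" = "-F" ]; then
        fsopt=(-F "$2")              # optional field separator, passed through to awk
        shift 2
    fi
    local prog="{print" sep=" "
    for col in "$@"; do              # build e.g. '{print $3, $1}' from "3 1"
        prog+="$sep\$$col"
        sep=", "
    done
    prog+="}"
    awk "${fsopt[@]}" "$prog"
}

# e.g.: cat crawl.log | clm 4 7    prints the URI and mime type columns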

sum_col

A Perl script "sum_col" calculates the sum of all numerical values found in the first column of every line of an input data stream.

#!/usr/bin/env perl
use warnings;

my $sum = 0;
while (<>) {
    chomp;
    my @parts = split;
    if (@parts > 0) {
        if ($parts[0] =~ /^(\d+(\.\d+)?)/) {
            $sum += $1;
        }
    }
}
print $sum . "\n";
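
A quick usage illustration with hypothetical input (not from the presentation):

$ printf '3 foo\n5 bar\n2.5 baz\n' | sum_col
10.5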

avg URI fetch duration

The crawl.log holds in its 9th column a timestamp indicating when a network fetch was begun and the millisecond duration of the fetch, separated from the begin-time by a "+" character:

T10:26:09.345Z P text/plain # sha1:EZ6YOU7YB3VVAGOD4PPMQGG3VKZN42D2 content-size:3534

One can extract the duration of all fetches, optionally limited in the 4th field (the URI of the downloaded document) to a particular host or domain, to compute the average URI fetch duration of a job.

$ cat crawl.log | clm 9 | clm -F '+' 2 | sum_col
$ cat crawl.log | clm 9 | grep -cE '[0-9]+\+[0-9]+'

  2,582,697,481 / 3,197,842 ≈ 808 [ms]

$ cat crawl.log | clm 4 9 | grep <host> | clm 2 | \
  clm -F '+' 2 | sum_col
$ cat crawl.log | clm 4 9 | grep <host> | wc -l

  70,041,498 / 72,825 ≈ 962 [ms]
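
The same average can also be computed in a single pass with awk alone (a sketch, not part of the presentation; it relies on the begin-time+duration format of the 9th column shown above):

$ cat crawl.log | clm 9 | \
  awk -F '+' 'NF == 2 { sum += $2; n++ } END { if (n) printf "%.1f ms over %d fetches\n", sum / n, n }'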

nb of images per host

To know the number of images fetched per host, first identify the image mime types present in the crawl:

$ grep -i image mimetype-report.txt
  image/jpeg
  image/png
  image/gif
  image/jpg

$ cat crawl.log | clm 4 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
  clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//' | sort | uniq -c
      1 -gof.pagesperso-orange.fr
      4 0.academia-assets.com
        gravatar.com
     10 0.media.collegehumor.cvcdn.com
      4 0.static.collegehumor.cvcdn.com

The top 5:

$ cat crawl.log | clm 4 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
  clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//' | sort | uniq -c | \
  sort -nr | head -n 5
  upload.wikimedia.org
  fr.cdn.v5.futura-sciences.com
  pbs.twimg.com
  media.meltybuzz.fr
  s-

nb of images per seed

Very often, images embedded in a web page are not fetched from the same host as the page itself. So, instead of counting the number of images per host, it is more interesting to count the number of images collected per referring site or, even better, per seed. To make this possible, the "source-tag-seed" option, which is off by default, must be activated in order.xml (set to true). The crawl.log will then contain in its 11th column a tag for the seed from which the URI being treated originated:

-02-24T10:28:10.291Z  sphotos-f-a.akamaihd.net/hphotos-ak-frc1/t1/s261x260/_ _n.jpg X ook.com/brice.roger1 image/jpeg # sha1:DABRRLQPPAKH3QOW7MHGSMSIDDDDRY7D content-size:8035,3t

$ cat crawl.log | clm 11 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
  clm 1 | sort | uniq -c | sort -nr | head -n

timeouts per time period

The -2 fetch status code of a URI stands for "HTTP connect failed":

-02-24T17:48:22.350Z -2 -EX no-type #185 - - leaksactu.wordpress.com/2013/04/07/wikileaks-et-les-jeux-olympiques-de-sotchi-de-2014/

Its cause may lie on the server side but also in the network. An increase in the number of -2 codes in a certain time period might indicate network problems, which can then be further investigated. This is the number of -2 codes per hour:

$ grep 'Z -2 ' crawl.log | clm -F : 1 | uniq -c
  [one line per hour: the number of -2 codes, followed by the date and hour]

timeouts per time period

It is more meaningful to extract the -2 codes from the local-errors.log, as every URI is retried several times there before it arrives as "given up" in the crawl.log:

$ grep 'Z -2' local-errors.log | clm -F : 1 | uniq -c
  [one line per hour: the number of -2 codes, followed by the date and hour]

The number of -2 codes can also be counted per minute:

$ grep 'Z -2' local-errors.log | grep "^<date>T17" | \
  clm -F : 1 2 | uniq -c
  [one line per minute within hour T17: the number of -2 codes, followed by the date, hour and minute]

timeouts per time period

If all running instances of Heritrix use a common workspace to store their logs, an extraction from all the log files at once is possible, which makes it easier to detect peaks:

$ cat /.../jobs/*/logs/crawl.log | grep 'Z -2' | clm -F : 1 | uniq -c
  [one line per hour: the number of -2 codes, followed by the date and hour, over all jobs]
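
To highlight only the problematic hours across all jobs, the per-hour counts can be aggregated and filtered against a threshold (a sketch, not from the presentation; the threshold of 1000 is an arbitrary example):

$ cat /.../jobs/*/logs/crawl.log | grep 'Z -2' | clm -F : 1 | \
  sort | uniq -c | awk '$1 > 1000'
# sort before uniq -c merges identical hours coming from different jobs;
# awk keeps only the hours whose -2 count exceeds the threshold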

detect crawler traps via URI patterns

The number of URIs fetched from one particular host is higher than expected:

$ head hosts-report.txt
  [the first lines of the hosts report: the number of fetched URIs per host, including the dns: pseudo-host]

The fetched URLs are extracted (4th column), sorted and written into a new file:

$ cat crawl.log | clm 4 | grep -v dns: | sort > crawl.log.4-sort

Crawler traps can then be detected by looking at the list:

$ grep <host> crawl.log.4-sort
  -filmfest.com//04_centredoc/01_archives/autour_fest/galerie/2011/04_centredoc/01_archives/autour_fest/galerie/2011/commun/merc9/6gd.jpg
  -filmfest.com//04_centredoc/01_archives/autour_fest/galerie/2011/04_centredoc/01_archives/autour_fest/galerie/2011/commun/cloture/1.jpg
  [...]
  -filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15
  -filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03
  -filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15
  -filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03
  -filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=15
  -filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=43
  -filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=63
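
The first two URLs above show a typical trap signature: the same sequence of path segments repeated inside one URL. A possible heuristic for spotting such URLs in the sorted list (a sketch, not from the presentation; the back-reference \1 works with GNU grep):

$ grep -E '(/[^/]+/[^/]+)/.*\1' crawl.log.4-sort | head
# prints URLs in which a pair of consecutive path segments occurs again later in the same URL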

detect crawler traps via URI patterns

Another approach to detect crawler traps is the extraction of URIs having a high number of URL parameters (separated by "&"). To find the highest number of URL parameters:

$ cat crawl.log | clm 4 | grep '&' | sed -r 's/[^&]//g' | wc -L
68

To extract URIs having 20 or more URL parameters:

$ grep -E '(\&.*){20,}' crawl.log | grep --color=always '\&'
  -inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=annuairesup&PROC=SAISIE_DEFAULTSTRUCTUREKSUP&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE
    &STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y
    &STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y
  -inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=cataloguelien&PROC=SAISIE_LIEN&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE
    &STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y
    &STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y
  -inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=core&PROC=SAISIE_NEWSGW&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE
    &STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y
    &STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y
    &STATS=Y&STATS=Y
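
To see which hosts emit such parameter-heavy URIs, the commands introduced above can be combined (a sketch, not from the presentation):

$ grep -E '(\&.*){20,}' crawl.log | clm 4 | clm -F / 3 | \
  sort | uniq -c | sort -nr | head
# counts, per host, the fetched URIs carrying 20 or more URL parameters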

detect crawler traps via URI patterns

URL parameters can be sorted and counted to detect repetitions more easily:

$ cat crawl.log | clm 4 | \
  grep '<URL>' | \
  grep --color=always '\&'
  se&ajax=true&show=liste_articles&numero=1110&pre&id_param= &java=false&ajax=true&show=liste_articles&numero=1110&pre&id_param= &java=false&ajax=true&show=liste_articles&numero=1110&pre&id_param= &java=false&ajax=true&show=liste_articles&numero=1110&preaction=mymodule&id_param= &java=false&ajax=true&show=liste_articles&numero=1110

$ cat crawl.log | clm 4 | \
  grep '<URL>' | \
  sed 's/&/\n/g' | sort | uniq -c
      5 ajax=true
        id_param=
        java=false
      6 numero=
        pre
      1 preaction=mymodule
      5 show=liste_articles

arcfiles-report.txt

Heritrix does not provide a report for the ARC files written by a completed crawl job. If the option "org.archive.io.arc.ARCWriter.level" in the heritrix.properties file is set to INFO, Heritrix will log the opening and closing of ARC files in the heritrix.out file. This information can then be transformed into an arcfiles report. The same applies to WARC files.

$ grep "Opened.*arc.gz" heritrix.out
  :26: INFO thread-14 org.archive.io.WriterPoolMember.createFile() Opened /dlweb/data/NAS510/jobs/current/g110high/9340_/arcs/BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz.open

$ grep "Closed.*arc.gz" heritrix.out
  :30: INFO thread-171 org.archive.io.WriterPoolMember.close() Closed /dlweb/data/NAS510/jobs/current/g110high/9340_/arcs/BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz, size

$ cat arcfiles-report.txt
  [ARCFILE]                                   [Opened]           [Closed]           [Size]
  BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz   T10:26:08.254Z     T10:30:40.822Z
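
The presentation does not show how the report itself is built. A minimal bash sketch using the tools introduced above (an assumption: it relies on the "Opened ..." and "Closed ..., size ..." line layout visible in the two grep examples and takes everything before " INFO " as the timestamp) could look like this:

$ { printf '[ARCFILE]\t[Opened]\t[Closed]\t[Size]\n'
    join -t "$(printf '\t')" \
      <(grep "Opened.*arc.gz" heritrix.out | \
          awk '{ f=$NF; sub(/\.open$/,"",f); sub(/.*\//,"",f);
                 ts=$0; sub(/ INFO .*/,"",ts); print f "\t" ts }' | sort) \
      <(grep "Closed.*arc.gz" heritrix.out | \
          awk '{ f=$(NF-2); sub(/,$/,"",f); sub(/.*\//,"",f);
                 ts=$0; sub(/ INFO .*/,"",ts); print f "\t" ts "\t" $NF }' | sort)
  } > arcfiles-report.txt
# one line per ARC file: file name, time opened, time closed, size in bytes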