WebWatch Ian Peacock UKOLN University of Bath Bath BA2 7AY UK +44 1225 323570


1 WebWatch Ian Peacock UKOLN University of Bath Bath BA2 7AY UK +44 1225 323570 Email: i.peacock@ukoln.ac.uk

2 WebWatching the UK: Robot software for analysing UK web resources UKOLN is funded by the British Library Research and Innovation Centre, the Joint Information Systems Committee of the Higher Education Funding Councils, as well as by project funding from the JISC’s Electronic Libraries Programme and the European Union. UKOLN also receives support from the University of Bath where it is based.

3 Robot software WebWatch. WebWatch experiences. General robot issues. The need for robots. Bad press. Awareness.

4 The WebWatch project A one year post funded by RIC. “..to develop a set of tools to audit and monitor design practice and use of technologies on the web..”. Communities. UK web communities. Information to benefit institutions/communities.

5 The WebWatch project Information on the project can be found at.

6 WebWatch aims Evaluation of robot technologies. Making recommendations on appropriate technologies. Working within UK web communities. Analysis of the results of web crawling and liaising with various communities in interpreting the results.

7 WebWatch aims Working with the web robot community. Analysing other related resources, such as web logs.

8 WebWatch robot Experimentation. Harvest. Perl based robot.

9 WebWatch analyses Production of a report. SOIF records. CSV. Excel, SPSS,… Current developments.

10 WebWatch benefits Benefits Communities. Web managers and designers. Knowledge base.

11 WebWatch robot History –Harvest –Experiences with Perl –? Features Future plans

12 WebWatch robot Examples of robot output: HTML element information
Type{4}: HTML
Type-recognition by{4}: MIME
Linked from{23}: http://www.ukoln.ac.uk/
Context{4}: Link
Element-referrer{5}: LINKS
p-count{1}: 3
a-21-attrib{55}: href=http://www.ukoln.ac.uk/services/elib/papers/other/
img-9-attrib{110}: width=87|src=http://www.ukoln.ac.uk/resources/images/ukoln-logo/logo|height=101|alt=UKOLN|align=right|border=0
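The output on this slide follows the SOIF attribute convention used by Harvest, `name{byte-length}: value`. A minimal sketch of reading such lines is given here, assuming one attribute per line and single-line values (real SOIF values may span lines). The function names `parse_soif_attributes` and `split_element_attributes` are hypothetical and are not taken from the WebWatch robot.

```python
import re

# One attribute per line, in the form  name{length}: value
ATTRIBUTE = re.compile(r"^(?P<name>[^{]+)\{(?P<size>\d+)\}:\s(?P<value>.*)$")


def parse_soif_attributes(text):
    """Return {name: value} for every well-formed attribute line in text."""
    attributes = {}
    for line in text.splitlines():
        match = ATTRIBUTE.match(line)
        if match is None:
            continue  # not an attribute line; ignore it
        value = match.group("value")
        # The number in braces is the length of the value in bytes.
        if len(value.encode("utf-8")) != int(match.group("size")):
            raise ValueError("length mismatch for " + match.group("name"))
        attributes[match.group("name")] = value
    return attributes


def split_element_attributes(value):
    """Split a 'key=value|key=value' string into a dictionary."""
    return dict(pair.split("=", 1) for pair in value.split("|"))


SAMPLE = """Type{4}: HTML
Linked from{23}: http://www.ukoln.ac.uk/
p-count{1}: 3
img-9-attrib{110}: width=87|src=http://www.ukoln.ac.uk/resources/images/ukoln-logo/logo|height=101|alt=UKOLN|align=right|border=0"""

record = parse_soif_attributes(SAMPLE)
image = split_element_attributes(record["img-9-attrib"])
```

Checking the declared length against the value is a cheap way to catch records damaged in transit or extraction.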

13 Robot issues Definition of a (web) robot. The need for robots

14 Robot issues The need for robots? Web expansion and increasing non-linearity. Understanding the nature of the web to help solve problems. Maintenance. Construction of index-space. Navigable document-space.

15 Increasing non-linearity URL A URL B URL C URL D

16 Benefits of robots End-user satisfaction. Reduced network traffic in document space. Populating caches, archiving, mirroring. Monitoring changes relevant to users. ‘Schooling’ network traffic into localised neighbourhoods.

17 Benefits of robots A user view (as opposed to a file-system view). Non-fatiguing. Next generation. Do these properties offer a feasible solution to web problems?

18 Robot design Is it necessary? Traversal algorithm (depth vs breadth first). Black holes and correct implementations (e.g. redirects). Bounds on activity. Multiple requests.
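The difference between depth-first and breadth-first traversal, and a simple bound on activity, can be sketched as follows. This is an illustration over an in-memory link graph, not the WebWatch robot's algorithm: the `LINKS` table and the `traverse` function are hypothetical, and no network requests are made.

```python
from collections import deque


def traverse(links, start, breadth_first=True, max_pages=100):
    """Visit pages reachable from start, stopping after max_pages visits.

    links maps each URL to the list of URLs it links to.
    """
    frontier = deque([start])
    seen = {start}          # never queue the same URL twice
    order = []
    while frontier and len(order) < max_pages:
        # A queue gives breadth-first order, a stack gives depth-first order.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


# Hypothetical link structure: D links back to A, so the graph has a cycle.
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}

breadth_order = traverse(LINKS, "A")
depth_order = traverse(LINKS, "A", breadth_first=False)
bounded_order = traverse(LINKS, "A", max_pages=2)
```

The `seen` set keeps the cycle from trapping the robot, and `max_pages` is one possible bound on activity; a depth limit or a time limit would serve the same purpose.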

19 Example of a ‘black-hole’ Client requests: http://www.foo.bar/generate_report?date=02021998&time=1250 Server returns document with this link:
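One possible defence against this kind of black hole is to cap how many distinct query strings a robot will request for the same host and path. The sketch below assumes the black hole arises from a script that keeps generating links to itself with new parameters; the class name `BlackHoleGuard` and the `max_variants` limit are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class BlackHoleGuard:
    """Refuse URLs once a path has produced too many query-string variants."""

    def __init__(self, max_variants=3):
        self.max_variants = max_variants
        self._queries = defaultdict(set)   # (host, path) -> query strings seen

    def allow(self, url):
        parts = urlsplit(url)
        queries = self._queries[(parts.netloc.lower(), parts.path)]
        if parts.query in queries:
            return True                    # already admitted earlier
        if len(queries) >= self.max_variants:
            return False                   # probably a generated, endless space
        queries.add(parts.query)
        return True


guard = BlackHoleGuard(max_variants=2)
base = "http://www.foo.bar/generate_report?date=02021998&time="
decisions = [guard.allow(base + str(minute)) for minute in (1250, 1251, 1252)]
```

The limit is a heuristic: set too low it will also cut off legitimate script-driven pages, so it is best combined with an overall bound on activity.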

20 Robot design (continued) Caching directives

21 Ethical robots Reuse of robot code. Appropriate identification. Thorough testing (locally!). Speed/frequency bounding. Selective retrieval. Performance monitoring. Dissemination of results.
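Two of these points, appropriate identification and speed/frequency bounding, can be sketched in a few lines. The robot name, contact address, and the `HostThrottle` class below are hypothetical placeholders, not the identity or settings used by WebWatch; the request is only constructed, never sent.

```python
import urllib.request

ROBOT_NAME = "ExampleRobot/0.1"          # hypothetical robot identifier
CONTACT = "robot-admin@example.org"      # hypothetical operator address


def build_request(url):
    """Build a request that identifies the robot and its operator."""
    return urllib.request.Request(
        url, headers={"User-Agent": ROBOT_NAME, "From": CONTACT}
    )


class HostThrottle:
    """Space out requests to each host by at least min_interval seconds."""

    def __init__(self, min_interval=60.0):
        self.min_interval = min_interval
        self._next_allowed = {}            # host -> earliest permitted time

    def delay_for(self, host, now):
        """Return how long to wait before contacting host, and book the slot."""
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_interval
        return start - now


request = build_request("http://www.example.org/")
throttle = HostThrottle(min_interval=10.0)
first = throttle.delay_for("www.example.org", now=0.0)
second = throttle.delay_for("www.example.org", now=1.0)
```

Passing the current time in explicitly keeps the throttle easy to test locally; a running robot would pass `time.monotonic()` and sleep for the returned delay.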

22 Ethical web crawling Advantages vs disadvantages. Guidelines.

23 Robot Exclusion Refers to means available to users and server administrators to control robot navigation through a particular server. Advantages. Disadvantages. Currently two kinds of Robot Exclusion Protocol (REP).

24 Robot exclusion protocols Server-wide method (/robots.txt): directives for the whole server must be under the top level /robots.txt. META element method (per page): directives are inserted per page with the META element. Directives allow for indexing (or not) and parsing for links (or not).
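Both methods can be checked from a robot with the Python standard library. The sketch below uses `urllib.robotparser` for the server-wide file and a small `html.parser` subclass for the META element; the sample `ROBOTS_TXT`, the page text, the agent name, and the `MetaRobots` class are illustrative assumptions.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical server-wide exclusion file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def robots_txt_allows(robots_txt, agent, url):
    """Apply the server-wide /robots.txt rules to one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class MetaRobots(HTMLParser):
    """Read the per-page directives from <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.index = True     # may the page be indexed?
        self.follow = True    # may its links be parsed and followed?

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag != "meta" or (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        directives = {item.strip().lower() for item in content.split(",")}
        if directives & {"noindex", "none"}:
            self.index = False
        if directives & {"nofollow", "none"}:
            self.follow = False


page = MetaRobots()
page.feed('<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head></html>')
```

A page with no robots META element leaves both flags at their permissive defaults, which matches the usual reading that absence of a directive means no restriction.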

25 Other methods of robot control Blocking at the server configuration level (e.g. Apache’s allow from, deny from). Blocking at the TCP level (TCP wrappers?) Page design?

26 Network performance Bandwidth issues. Comparison with a human user. Bottlenecks. New developments in robots: good or bad? Decentralisation.

27 Server concerns Rapid fire requests (TCP, HTTP). Skewing of server logs. Identification of robots.
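One rough way to identify robots in a server log, and so reduce the skew they introduce, is to treat any host that requested /robots.txt as a robot. The sketch below assumes the log is in the Combined Log Format; the function names and the sample lines are hypothetical, and the heuristic will miss robots that never fetch /robots.txt.

```python
import re

# Combined Log Format: host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def robot_hosts(lines):
    """Return the hosts that asked for /robots.txt."""
    hosts = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == "/robots.txt":
            hosts.add(match.group("host"))
    return hosts


def split_log(lines):
    """Separate log lines into (probably human, probably robot)."""
    robots = robot_hosts(lines)
    human, robot = [], []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue                      # skip lines in another format
        (robot if match.group("host") in robots else human).append(line)
    return human, robot


SAMPLE_LOG = [
    '10.0.0.1 - - [02/Feb/1998:12:50:00 +0000] "GET /robots.txt HTTP/1.0" 200 64 "-" "ExampleRobot/0.1"',
    '10.0.0.1 - - [02/Feb/1998:12:50:01 +0000] "GET /index.html HTTP/1.0" 200 2048 "-" "ExampleRobot/0.1"',
    '10.0.0.2 - - [02/Feb/1998:12:51:00 +0000] "GET /index.html HTTP/1.0" 200 2048 "-" "Mozilla/4.0"',
]

human_lines, robot_lines = split_log(SAMPLE_LOG)
```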

28 The future of web robots Intelligent agents. Metadata standards (XML, RDF, CDF, embedded metadata). Robots becoming part of the web.

29 WebWatch findings Analysis of URLs Domains for public library web sites

30 WebWatch findings Server software Servers used to serve eLib project pages

31 WebWatch findings File size analyses HTML file sizes for UK University entry-points

32 WebWatch findings Top ten tags used within the eLib community HTML analyses

33 WebWatch findings Hyperlink profiles Top ten external domains linked to from all eLib pages

34 WebWatch findings Analysis of other document content Use of metadata in UK university homepages

