The WebWatch Project

About WebWatch
The WebWatch project is funded by BLRIC (the British Library Research and Innovation Centre). The project involves the development and use of web robot software for monitoring the use of web technologies. Papers, reports, articles and presentations of the findings are produced by the WebWatch project.

UKOLN is funded by the British Library Research and Innovation Centre, the Joint Information Systems Committee of the Higher Education Funding Councils, as well as by project funding from the JISC's Electronic Libraries Programme and the European Union. UKOLN also receives support from the University of Bath, where it is based.

A WebWatch Trawl
A simple model of how the WebWatch robot trawls communities: the robot reads an input file of URLs and retrieves each resource (which could be an individual page or an entire website), writing its findings to a summary file. Analysis and statistical programs then process the summary file to produce reports, such as the report for UK Universities.
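As an illustration of this trawl model, the sketch below shows a minimal Perl robot that reads a list of URLs, fetches each one and appends a line per resource to a summary file. It is a sketch only, not the project's actual robot; the user-agent string, file names and summary format are illustrative assumptions.

    #!/usr/bin/perl -w
    # Minimal sketch of the trawl model: read an input file of URLs,
    # fetch each resource and append a line to a summary file.
    use strict;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(agent => 'WebWatch-sketch/0.1');   # illustrative agent string
    $ua->timeout(30);

    open my $in,  '<', 'urls.txt'    or die "Cannot read input file: $!";
    open my $out, '>', 'summary.txt' or die "Cannot write summary file: $!";

    while (my $url = <$in>) {
        chomp $url;
        next unless $url;
        my $response = $ua->get($url);
        # Record fields of the kind later analysed: status, server software and size.
        printf $out "%s\t%s\t%s\t%d\n",
            $url,
            $response->code,
            ($response->header('Server') || 'unknown'),
            length($response->content || '');
    }
    close $in;
    close $out;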

WebWatch Trawl of UK University Entry Pages

Background
The WebWatch project carried out a trawl of UK University entry points on 24 October 1997. The trawl was repeated on 31 July 1998.

Web Servers
The most popular web server was Apache. It has grown in popularity, with a decline in the CERN, NCSA and other smaller servers. Microsoft's IIS server has also grown in popularity, perhaps indicating growth in the use of Windows NT.

Size of Entry Points
The file sizes of HTML resources (including frame sets) and images (but excluding background images) were analysed. Four pages were smaller than 5 Kb; the largest page was 193 Kb. The largest pages contained animated GIF images.
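As a rough illustration of how the size of an entry point (HTML plus embedded images) might be measured, the Perl sketch below fetches a page and its IMG resources and totals their sizes. It is a simplified sketch: background images are not excluded and frame sets are not followed, so its figures would not match the survey exactly.

    #!/usr/bin/perl -w
    # Sketch: approximate the size of an entry point as the HTML page
    # plus its embedded <img> resources.
    use strict;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $url = shift or die "Usage: $0 <entry-point-url>\n";
    my $ua  = LWP::UserAgent->new;

    my $page = $ua->get($url);
    die "Failed to fetch $url\n" unless $page->is_success;
    my $total = length $page->content;

    # Collect the SRC attributes of IMG elements.
    my @images;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @images, $attr{src} if $tag eq 'img' && $attr{src};
    });
    $extor->parse($page->content);

    # Fetch each image, resolving relative URLs against the entry point.
    for my $src (@images) {
        my $img = $ua->get(URI->new_abs($src, $url));
        $total += length $img->content if $img->is_success;
    }

    printf "Approximate entry point size: %.1f Kb\n", $total / 1024;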

WebWatch Trawl of UK University Entry Pages

Web Technologies
An analysis of some of the technologies used in UK University entry points is given below.

Java
None of the institutions trawled appeared to make use of Java. It was subsequently found that one institution did use Java; this institution used the Robot Exclusion Protocol to stop robots from trawling its site, so the robot did not detect it. Liverpool University is probably the only university entry page using Java, where Java provides a scrolling news facility.

JavaScript
In October 1997 a number of institutions used client-side scripting, such as JavaScript; by July 1998 about 38 institutions were using JavaScript. The University of Northumbria at Newcastle is one of these: JavaScript is used to display picture fragments when the cursor moves over a menu option.
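Detecting the use of Java and client-side scripting can be as simple as looking for APPLET and SCRIPT markup in the fetched page. The sketch below illustrates the idea; it is not the WebWatch robot's actual detection code and the patterns used are deliberately crude.

    #!/usr/bin/perl -w
    # Sketch: report whether a fetched HTML page appears to use
    # Java applets or client-side scripting such as JavaScript.
    use strict;
    use LWP::Simple qw(get);

    my $url  = shift or die "Usage: $0 <url>\n";
    my $html = get($url) or die "Could not fetch $url\n";

    my $uses_java   = $html =~ /<applet\b/i;
    my $uses_script = $html =~ /<script\b/i
                   || $html =~ /\bon(?:mouseover|load|click)\s*=/i;

    print "Java applet detected\n"           if $uses_java;
    print "Client-side scripting detected\n" if $uses_script;
    print "Neither detected\n" unless $uses_java || $uses_script;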

WebWatch Trawl of UK University Entry Pages

Metadata
In October 1997 a number of institutions used "Alta Vista" type metadata on their main entry point; by July 1998 this metadata was used on 74 entry points. In contrast, Dublin Core metadata was used on only 2 pages on both occasions. (Chart: possible use of Alta Vista and Dublin Core metadata.)

Cachability
Interest in cache-friendly web resources has grown since the introduction of network charging on 1 August 1998. Over 50% of institutional HTML resources were found to be cachable, with only 1% not cachable. Further analysis is needed for the other resources.

Telnet can be used to analyse HTTP headers, including caching information:

    % telnet <host> 80
    GET / HTTP/1.0

    HTTP/1.1 200 OK
    Date: Fri, 28 Aug 1998 ...:22:51 GMT
    Server: Apache/1.2b8
    Content-Type: text/html

A WebWatch service is being developed to provide a web interface to the telnet command, to give more helpful information. A possible interface would report, for a given URL, information such as: "This resource uses HTTP/1.1. The resource is cachable. The resource was last updated on …".
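The headers obtained by hand with telnet can also be retrieved programmatically. The sketch below, an illustration rather than the HTTP-info service itself, issues a HEAD request with Perl's LWP and prints the headers that bear on cachability.

    #!/usr/bin/perl -w
    # Sketch: fetch the HTTP headers for a resource and print those
    # relevant to caching, much as one would do by hand with telnet.
    use strict;
    use LWP::UserAgent;

    my $url = shift or die "Usage: $0 <url>\n";
    my $ua  = LWP::UserAgent->new;

    my $response = $ua->head($url);
    die "Request failed: ", $response->status_line, "\n" unless $response->is_success;

    for my $header ('Server', 'Last-Modified', 'Expires', 'Cache-Control', 'Pragma') {
        printf "%-15s %s\n", "$header:", ($response->header($header) || '-');
    }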

WebWatch Trawl of UK University Entry Pages

Frames
In July 1998 the following 19 sites used frames, compared with 12 in October 1997: Essex, Bretton Coll., UCE, Royal College of Music, Keele, King Alfred's Coll., Middlesex, Nottingham Trent, Portsmouth, Ravensbourne Coll., Teesside, Birkbeck Coll., UMIST, Uni. Coll. of St Martin, Thames Valley, Queen Margaret Coll., Westhill, Scottish Agricultural Coll., and the Kent Institute of Art and Design. UMIST is an example of a framed website. Liverpool University also uses frames, but this was not detected by the robot owing to its use of the Robot Exclusion Protocol.

"Splash Screens"
In July 1998 a number of sites used client-side requests to provide redirects or "splash screens". De Montfort University, for example, displays a screen with a yellow background; after 8 seconds a new screen is displayed. "Splash screens" are typically created using a client-side redirect such as an HTML META refresh.

WebWatch Trawl of UK University Entry Pages

Hyperlinking Issues
The WebWatch trawls revealed some interesting hyperlinking issues, which are described below.

Numbers of Hyperlinks
The histogram of the numbers of hyperlinks from institutional entry points shows an approximately normal distribution. Six sites were found to have fewer than 5 links, while one site contained over 75 links.

Limitations of the Survey
The analyses do not give a completely accurate view, for a variety of reasons: the address of one of the sites with a small number of links was given incorrectly in the input file (obtained from HESA); the analysis did not exclude duplicate links; and sites containing "splash screens" were reported as having a small number of links, although arguably the links on the second screen should also be counted.

Discussion
Many links: provide useful "short cuts" for experienced users, and can minimise the number of levels to navigate. Few links: can be confusing for the new user, and can cause accessibility problems (e.g. for the visually impaired). What is your view?
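The duplicate-link limitation noted above is straightforward to address when counting hyperlinks. The Perl sketch below counts both total links and unique link targets on a page; it is illustrative only and does not reproduce the survey's counting rules.

    #!/usr/bin/perl -w
    # Sketch: count hyperlinks on a page, in total and with
    # duplicate targets excluded.
    use strict;
    use LWP::Simple qw(get);
    use HTML::LinkExtor;
    use URI;

    my $url  = shift or die "Usage: $0 <url>\n";
    my $html = get($url) or die "Could not fetch $url\n";

    my (@links, %seen);
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, URI->new_abs($attr{href}, $url)->canonical
            if $tag eq 'a' && $attr{href};
    });
    $extor->parse($html);

    $seen{$_}++ for @links;
    printf "%d links in total, %d unique targets\n",
        scalar @links, scalar keys %seen;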

Trends in UK University Entry Points

Trawls of UK University Entry Points
The WebWatch project has surveyed UK University web site entry points on three occasions: 24 October 1997, 31 July 1998 and 25 November 1998. A summary of significant trends is given below.

Metadata Usage
Use of Dublin Core (DC) metadata grew during the summer of 1998 from 2 sites to 11. DC metadata is still dwarfed by "Alta Vista" style metadata.

"Splash Screens"
The number of entry points using "splash" screens has increased from 5 (Oct 97) to 7 (Jul 98) to 10 (Nov 98).

Server Usage
The Apache and Microsoft web servers are both growing in popularity, at the expense of the CERN and Netscape servers and a number of more specialist servers.

Size of Entry Points
Trends in the sizes (HTML plus embedded images) have been analysed. The majority of entry points have not changed in size significantly, although one or two have grown (by around 100 Kb) or decreased in size (by around 50 Kb) substantially.

WebWatch Services

HTTP-info Service
A web form is available which can be used to obtain the HTTP headers sent when a resource is accessed. This service is useful for obtaining information such as the name of the server software and the HTTP version in use.

Doc-info Service
A web form is available which can be used to obtain information on web resources. The Doc-info service is integrated with the HTTP-info service, enabling the HTTP headers of all objects contained in a resource to be analysed.

WebWatch provides access to various tools and utilities which have been developed to support its work. These services can be accessed using a web browser.

WebWatch Technologies

Technologies
The WebWatch project has made use of the following technologies:
- The Harvest indexing and analysis suite
- Perl, for developing the WebWatch robot
- Locally-developed indexing and analysis software
- A series of Unix Perl utilities for analysing and filtering the data
- Excel, Minitab and SPSS for statistical analysis

Trawling Software
The Harvest software was used originally. Harvest is widely used within the research community for indexing resources; for example, the ACDC project uses Harvest to provide a distributed index of UK.AC web resources. Unfortunately, as Harvest was designed for indexing, it is limited in its ability to audit and monitor web technologies. The current version of the WebWatch robot is written in Perl.

Restricting Access

Why Restrict Access?
Administrators may wish to restrict access by automated robot software to web resources for a variety of reasons: to prevent resources from being indexed, to minimise load on the web server, or to minimise network load.

Robot Exclusion Protocol
The Robot Exclusion Protocol is a set of rules which robot software should obey. A robots.txt file located in the root of the web server can specify areas which robots should not access, and particular robots which are not allowed access. A typical robots.txt file:

    User-agent: *
    Disallow: /images/
    Disallow: /cgi-bin/

Issues
Some issues to be aware of: prohibiting robots will mean that web resources will not be found on search engines such as Alta Vista; restricting access to the main search engine robots may mean that valuable new services cannot access the resources; the existence of a small robots.txt file can have performance benefits; and it may be desirable to restrict access to certain areas, such as the cgi-bin and images directories.

WebWatch hosts a robots.txt checker service.
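A robot can honour the Robot Exclusion Protocol with very little code. The sketch below uses Perl's standard WWW::RobotRules module to check whether a given URL may be fetched; the robot name is an illustrative assumption.

    #!/usr/bin/perl -w
    # Sketch: check whether a robot is permitted to fetch a URL
    # according to the site's robots.txt file.
    use strict;
    use LWP::Simple qw(get);
    use WWW::RobotRules;
    use URI;

    my $url   = shift or die "Usage: $0 <url>\n";
    my $rules = WWW::RobotRules->new('WebWatch-sketch/0.1');   # illustrative robot name

    # Fetch and parse the robots.txt file for the URL's server.
    my $robots_url = URI->new($url);
    $robots_url->path('/robots.txt');
    $robots_url->query(undef);
    my $robots_txt = get($robots_url) || '';
    $rules->parse($robots_url, $robots_txt);

    if ($rules->allowed($url)) {
        print "Access to $url is allowed\n";
    } else {
        print "Access to $url is disallowed by robots.txt\n";
    }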

WebWatch Recommendations

Recommendations
The final WebWatch report makes a number of recommendations, based on its trawls, including advice for information providers, web administrators and robot software developers.

Information Providers
Directory Structure: Directory structures can provide a form of metadata about a resource. It is recommended that information providers make consistent use of directories.
Metadata: The use of "Alta Vista" type metadata is recommended on key entry points.
Frames: Frames can prevent indexing robots from accessing resources. If frames are used, there should be an alternative route to the resources for robots.

System Administrators
The robots.txt File: Web system administrators should ensure that web servers contain a robots.txt file. This may be used to restrict access by robots.
HTTP/1.1: Web system administrators should ensure that their server software supports HTTP/1.1.
Analysis of Robot Usage: Web system administrators should periodically check log files for access by robot software.

Software Developers
Memory Leaks: Memory leaks can cause problems, especially when accessing large numbers of resources. Robot software should include checkpoints, to facilitate restarts.
User-Agent Negotiation: Robot developers should be aware of server use of "User-Agent negotiation", which may provide different information to robots and browsers.

Further Information
Further recommendations are included in the final WebWatch report.
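The recommendation to check log files for robot access can be followed with a short script. The Perl sketch below tallies User-Agent strings from an Apache "combined" format access log, so that robot visits stand out; the log location is an assumed example.

    #!/usr/bin/perl -w
    # Sketch: tally User-Agent strings in an Apache combined-format
    # access log, so that visits by robot software stand out.
    use strict;

    my $log = shift || '/var/log/apache/access.log';   # assumed log location
    open my $fh, '<', $log or die "Cannot open $log: $!";

    my %agents;
    while (<$fh>) {
        # The User-Agent is the last quoted field in the combined log format.
        $agents{$1}++ if /"([^"]*)"\s*$/;
    }
    close $fh;

    for my $agent (sort { $agents{$b} <=> $agents{$a} } keys %agents) {
        printf "%6d  %s\n", $agents{$agent}, $agent;
    }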

Finding Out More About WebWatch

Ariadne
Occasional WebWatch reports are published in the online version of the Ariadne magazine.

WebWatch Staff
The WebWatch Officer is Ian Peacock. Ian's responsibilities include software development, running the robot trawls, analysing the data and producing reports. The WebWatch project is managed by Brian Kelly.

Publications
The following WebWatch articles have been published:
- "Robot Seeks Public Library Web Sites", LA Record, Dec 1997, Vol 99 (12)
- "Academic and Public Library Web Sites", Library Technology, Aug 1998
- "WebWatching Academic Library Web Sites", Library Technology, Jun 1998
- "WebWatching Public Library Web Site Entry Points", Library Technology, Apr 1998
- "Public Library Domain Names", Library Technology, Feb 1998
- "How is My Web Community Doing? Monitoring Trends In Web Service Provision", Journal of Documentation, Vol. 55 No. 1, Jan 1999

The final WebWatch report is also available online.