1
Web site archiving by capturing all unique responses
Kent Fitch, Project Computing Pty Ltd
Archiving the Web Conference Information Day
National Library of Australia, 12 November 2004
2
Reasons for archiving web sites
● They are important
  – The main public and internal communication mechanism
  – Australian "Government Online", US Government Paperwork Elimination Act
● Legal
  – Act of publication
  – Context as well as content
● Reputation, community expectations
● Commercial advantage
● Provenance
3
Web site characteristics
● Increasingly dynamic content
● Content changes relatively slowly as a % of the total
● A small set of pages accounts for most hits
● Most responses have been seen before
4
Web site characteristics
5
Desirable attributes of an archiving methodology
● Coverage
  – Temporal
  – Responses to searches, forms, scripted links
● Robustness
  – Simple
  – Adaptable
● Cost
  – Feasible
  – Scalable
● Ability to recreate the web site at a point in time
  – Exactly as originally delivered
  – Support analysis and recovery
6
Approaches to archiving
● Content archiving
  – Input side: capture all changes
● "Snapshot"
  – Crawl
  – Backup
● Response archiving
  – Output side: capture all unique requests and responses
7
Content archiving vs. snapshot vs. response archiving
● Content archiving
  – Cost: often part of a CMS; small volumes ✔
  – Coverage: assumes all content is perfectly managed; dynamic content hard to capture; subvertable ✘
  – Robust: often part of a CMS; small volumes ✔
  – Recreate web site: requires a "live site" (hardware, software, content, data, authentication, ...) ✘
● Snapshot
  – Cost: a complete crawl is large ✘
  – Coverage: incomplete (no forms, scripts, ...); gaps between crawls ✘
  – Robust: simple ✔
  – Recreate web site: faithful but incomplete ✘
● Response archiving
  – Cost: small volumes ✔; collection overhead in the critical path ✘
  – Coverage: address space and temporally complete; not subvertable; "too complete!" ✔
  – Robust: conceptually simple; independent of content type ✔
  – Recreate web site: faithful and complete ✔
8
Is response archiving feasible?
● Yes, because:
  – Only a small % of responses are unique (see the digest sketch below)
  – Overhead and robustness can be addressed by design
  – Non-material changes can be defined
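The deck shows no code, so here is a minimal sketch of the dedup idea, assuming a SHA-1 checksum (the slides say only "checksum") and hypothetical names (UniqueResponseStore, archiveIfUnique): each response is digested, and only first-seen content is written to the archive.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: archive a response only if its content is new for that URL. */
public class UniqueResponseStore {
    private final Set<String> seenDigests = new HashSet<>();

    /** Returns true if the body was new and archived, false if an exact duplicate. */
    public synchronized boolean archiveIfUnique(String url, byte[] body) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1"); // algorithm is an assumption
        md.update(url.getBytes(StandardCharsets.UTF_8));       // uniqueness is per URL
        md.update(body);
        String digest = java.util.HexFormat.of().formatHex(md.digest());
        if (!seenDigests.add(digest)) {
            return false;          // identical response already archived: nothing to do
        }
        store(url, body);          // persistence elided in this sketch
        return true;
    }

    private void store(String url, byte[] body) { /* write to the archive */ }
}
```

Because most responses repeat, the common path is one digest plus one set lookup, which is what keeps the approach cheap.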
9
Approaches to response archiving
● Network sniffer
  – Not in the critical path
  – Cannot support HTTPS
● Proxy
  – End-to-end problems (HTTPS, client IP address)
  – Extra latency (an additional TCP/IP session)
● Filter
  – Runs within the web server
  – Full access to the request/response
10
A Filter implementation: pageVault
● Simple filter "gatherer"
  – Uses the Apache 2 or IIS server filter architecture
  – Big problems with Apache 1
● Does as little as possible within the server (an analogous servlet-filter sketch follows)
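pageVault's gatherer is a native Apache 2 / IIS filter, which the deck does not reproduce; as a stand-alone illustration of the same output-side idea, here is an analogous Java servlet filter (Servlet 2.x-era API to match the talk's vintage; class names are hypothetical) that buffers each response body, offers it to the archiver, then delivers it unchanged:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

/** Illustrative capture filter: buffers the body, archives it, then sends it on. */
public class CaptureFilter implements Filter {
    private final UniqueResponseStore store = new UniqueResponseStore(); // earlier sketch

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        BufferingResponse buffered = new BufferingResponse((HttpServletResponse) resp);
        chain.doFilter(req, buffered);                   // let the application run
        byte[] body = buffered.copy.toByteArray();
        try {
            store.archiveIfUnique(((HttpServletRequest) req).getRequestURI(), body);
        } catch (Exception e) {
            // archiving must never break the response being served
        }
        resp.getOutputStream().write(body);              // deliver the original bytes
    }

    /** Diverts the body into an in-memory buffer (getWriter() handling elided). */
    static class BufferingResponse extends HttpServletResponseWrapper {
        final ByteArrayOutputStream copy = new ByteArrayOutputStream();
        BufferingResponse(HttpServletResponse r) { super(r); }
        @Override public ServletOutputStream getOutputStream() {
            return new ServletOutputStream() {
                @Override public void write(int b) { copy.write(b); }
            };
        }
    }
}
```

A real gatherer would hand the digesting and storage off to a separate thread or process, which is the "does as little as possible within the server" goal above.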
11
pageVault Architecture
12
pageVault design goals
● Filter must be simple, efficient, robust
  – Negligible impact on server performance
  – No changes to web applications
● Selection of responses to archive based on URL and content type
  – Supports definition of "non material" differences (see the sketch below)
● Flexible archiving
  – Union archives, split archives
● Complete "point in time" viewing experience
  – Plus support for analysis
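One way "non material" differences can be defined is to strip excluded substrings before checksumming, so responses differing only in, say, a per-request timestamp hash identically. The sketch below is an illustration only; the class name and both patterns are invented, and pageVault's actual exclusion syntax is not shown in the deck.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: remove "non material" substrings before checksumming. */
public class NonMaterialNormaliser {
    // Illustrative exclusions only; real rules would be per-site configuration.
    private final List<Pattern> exclusions = List.of(
        Pattern.compile("<!-- generated at .*? -->"),  // per-request timestamp comment
        Pattern.compile("sessionid=[A-Za-z0-9]+")      // session token embedded in links
    );

    public String normalise(String body) {
        for (Pattern p : exclusions) {
            body = p.matcher(body).replaceAll("");
        }
        return body;               // checksum this, but archive the original body
    }
}
```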
13
Sample pageVault archive capabilities
● What did this page/this site look like at 9:30 on 4 May last year? (see the version-index sketch below)
● How many times, and exactly how, has this page changed over the last 6 months?
● Which images in the "logos" directory have changed this week?
● Show these versions of this URL side by side
● Which media releases on our site have mentioned "John Laws", however briefly available?
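The first two questions map naturally onto a per-URL index from capture time to archived content. In the sketch below, an in-memory java.util.TreeMap stands in for pageVault's on-disk B+Tree (the deck credits JDBM): floorEntry() answers "as at time T" with the newest version captured at or before T, and subMap() counts changes in a window. All names here are illustrative.

```java
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative version index for a single URL: capture time -> archived body. */
public class VersionIndex {
    private final TreeMap<Instant, byte[]> versions = new TreeMap<>();

    public void record(Instant capturedAt, byte[] body) {
        versions.put(capturedAt, body);
    }

    /** "What did this page look like at 9:30 on 4 May last year?" */
    public byte[] asAt(Instant when) {
        Map.Entry<Instant, byte[]> e = versions.floorEntry(when);
        return e == null ? null : e.getValue();  // null: not yet captured at that time
    }

    /** "How many times has this page changed over the last 6 months?" */
    public int changesBetween(Instant from, Instant to) {
        return versions.subMap(from, true, to, true).size();
    }
}
```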
14
Performance impact
● Determining "uniqueness" requires calculating a checksum
  – ~0.2 ms per 10 KB [*] (a micro-benchmark sketch follows)
● pageVault adds 0.3–0.4 ms to servicing a typical request
  – A "minimal" static page takes ~1.1 ms
  – Typical scripted pages take ~5–100 ms
  – The performance impact of determining which strings to exclude as "non material" is negligible
[*] Apache 2.0.40, 750 MHz SPARC processor, Solaris 8
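The quoted figure is from 2004 hardware; as a rough way to re-check it on a current machine, a micro-benchmark along these lines digests a 10 KB buffer repeatedly (SHA-1 is an assumption, as before):

```java
import java.security.MessageDigest;

/** Rough micro-benchmark of the per-response checksum cost. */
public class ChecksumCost {
    public static void main(String[] args) throws Exception {
        byte[] page = new byte[10 * 1024];            // a 10 KB response body
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        int iterations = 100_000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            md.update(page);
            md.digest();                              // digest() also resets the state
        }
        double msPerDigest = (System.nanoTime() - start) / 1e6 / iterations;
        System.out.printf("%.4f ms per 10 KB digest%n", msPerDigest);
    }
}
```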
15
Comparison with Vignette's WebCapture

WebCapture:
● Enterprise-sized, integrated, strategic
● Large investment
● Focused on transactions
● Aims to be able to replay transactions

pageVault:
● Simple, standalone, lightweight
● Inexpensive
● Targets all responses
● Aims to recreate all responses on the entire web site
16
pageVault applicability
● Simple web site archives
● Notary service
  – Independent archive of delivered responses
● Union archive
  – Organisation-wide (multiple sites)
  – National archive
  – Thematic collection
17
Summary
● Effective web site archiving is an unmet need
  – Legal
  – Reputation, community expectations
  – Provenance
● Complete archiving with input-side and snapshot approaches is impractical
● An output-side approach can be scalable, complete, and inexpensive
18
Thanks to...
● Russell McCaskie, Records Manager, CSIRO
  – Russell brought the significant issues in preserving and managing web-site content to our attention in 1999
● The JDBM team
  – An open-source B+Tree implementation used by pageVault
19
More information
http://www.projectcomputing.com
kent.fitch@projectcomputing.com