Australian web domain harvests 2005, 2006 & 2007
Igor Ranitovic, Internet Archive engineer, with the Petabox rack for the Australian domain harvest
PANDORA : Domain Harvesting
Australian domain harvest
–.au domain, located on Australian servers
–Internet Archive
1st harvest June/July 2005
–4 weeks, 185m files, 6.69 TBs
2nd harvest Aug/Sept 2006
–5 weeks, 596m files, TBs
3rd harvest Aug/Sept 2007
–4 weeks, 516m files, TBs
Comparative statistics
PANDORA
–Files: 51 million
–Size: 2.12 TB
Domain Harvest
                 2005          2006          2007
Unique files     185,549,662   596,238,990   516,064,820
Hosts crawled    811,523       1,046,038     1,247,614
Size             6.69 TB       TB
Domain Harvests (combined)
–Files: 1,297 million
–Size: 44.2 TB
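As a quick check on the figures above, the sketch below sums the three per-harvest unique-file counts and compares the result with the quoted total of 1,297 million files. It also derives a rough files-per-host ratio for each harvest. The dictionary layout and function names are illustrative only, and the mapping of the three columns to 2005, 2006 and 2007 is assumed from the order of the harvests.

```python
# Figures copied from the comparative statistics slide.
# Assumption: the three columns are the 2005, 2006 and 2007 harvests, in that order.
harvests = {
    2005: {"unique_files": 185_549_662, "hosts": 811_523},
    2006: {"unique_files": 596_238_990, "hosts": 1_046_038},
    2007: {"unique_files": 516_064_820, "hosts": 1_247_614},
}


def total_unique_files(data):
    """Sum the unique-file counts across all harvests."""
    return sum(h["unique_files"] for h in data.values())


def files_per_host(data):
    """Average number of unique files per crawled host, per harvest year."""
    return {year: h["unique_files"] / h["hosts"] for year, h in data.items()}


total = total_unique_files(harvests)
print(f"Total unique files: {total:,}")  # Total unique files: 1,297,853,472

for year, ratio in files_per_host(harvests).items():
    print(f"{year}: about {ratio:.0f} files per host")
```

The sum agrees with the 1,297 million quoted for the domain harvests, which suggests that figure simply adds the three crawls together with no de-duplication between years.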
PANDORA : Domain Harvesting
Some pros
–Retains linkages and context
–Large scale – more bytes for the buck
–Less selectively discriminating
Some cons
–High dependence on the crawler technology
–Domain and geo-location bias (.au, geoIP)
–Limitations in timeliness, quality assurance, scoping, site complexity, deep web
–Legal and access issues to resolve
PANDORA : Australia’s Web Archive
–Enormous growth and volume of material
–Everyone can be creators and publishers
–Virtually instantaneous publication
–Dynamic content and format
–Multiplicity of formats
–Technology dependent
–Hyperlinked and interconnected
–Highly accessible but hard to identify
–Ephemeral
–Interactivity, re-use, personalisation (web 2.0)