Documenting Internet2: an IT perspective. Eric Celeste, University of Minnesota (Twin Cities) Libraries, for the Coalition for Networked Information, 6 December. Or... A joyful romp with Heritrix, JavaScript, & Spotlight!
background... DI2 brought together: University of Minnesota (CBI), University of Michigan (SI), Internet2; web crawling only a small part; the “save everything” approach
briefly… on crawling with spiders; on Heritrix and JavaScript; on Spotlight and local files; on sinkholes and strategies
spiders on the web
pages
links
hosts & domains
robots.txt
scope
seeds
excluded pages
done!
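
The crawl walk-through above (pages, links, hosts, robots.txt, scope, seeds, excluded pages) can be sketched in a few lines. This is only an illustration of the general idea, not how Heritrix is implemented: the host names, the excluded prefix, the `example-crawler` agent string, and the in-memory `PAGES` table are all made up, and `fetch_links` stands in for real page fetching so nothing touches the network.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def in_scope(url, allowed_hosts, excluded_prefixes):
    """True if the URL is on an allowed host and not under an excluded prefix."""
    host = urlparse(url).hostname or ""
    if host not in allowed_hosts:
        return False
    return not any(url.startswith(prefix) for prefix in excluded_prefixes)


def crawl(seeds, fetch_links, robots, allowed_hosts, excluded_prefixes,
          agent="example-crawler"):
    """Breadth-first walk from the seeds; returns URLs in visit order.

    fetch_links(url) is a caller-supplied function returning the links
    found on a page (hypothetical stand-in for downloading and parsing).
    """
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue:
        url = queue.popleft()
        if not in_scope(url, allowed_hosts, excluded_prefixes):
            continue  # wrong host or explicitly excluded page
        if not robots.can_fetch(agent, url):
            continue  # the site's robots.txt asks crawlers to stay out
        visited.append(url)
        for link in fetch_links(url):
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited  # queue is empty: done!


# Tiny in-memory "web" for demonstration (all hypothetical).
PAGES = {
    "http://www.example.org/": ["/about", "/private/x", "http://other.example.com/"],
    "http://www.example.org/about": ["/", "/calendar/2006"],
    "http://www.example.org/calendar/2006": [],
}
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

result = crawl(
    seeds=["http://www.example.org/"],
    fetch_links=lambda url: PAGES.get(url, []),
    robots=robots,
    allowed_hosts={"www.example.org"},
    excluded_prefixes=["http://www.example.org/calendar/"],
)
```

Here the crawl visits the seed and `/about`, and skips `/private/x` (robots.txt), the other host (out of scope), and the calendar page (excluded).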
our crawler: Heritrix, from the Internet Archive (IA); aiming for broad deployment, Archive-It; cross-platform, many users; simple setup, sophisticated options; generates ARC files
from ARC to archive: keep originals intact; a few large files to manage; can serve a mirror from the master; can extract files for research; solution requires Perl, PHP, JavaScript, MySQL
processing... for mirroring online: optimizing and indexing with Perl, loading into MySQL database, presenting via PHP; for using on local disk: extracting files from ARC
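
To make the indexing and extraction steps concrete, here is a rough Python sketch (the project itself used Perl, MySQL, and PHP; this is not that code). It assumes an uncompressed ARC version 1 stream, where each record starts with a header line of the form `URL IP-address archive-date content-type length`, and it assumes no field contains a space. The sample data, `make_record`, and `read_at` are invented for illustration, and a plain dict stands in for the database index.

```python
import io


def iter_arc_records(stream):
    """Yield (offset, url, content_type, body) for each record in an
    uncompressed ARC v1 stream opened in binary mode."""
    while True:
        offset = stream.tell()
        header = stream.readline()
        if not header:
            break  # end of file
        if header.strip() == b"":
            continue  # blank separator line between records
        fields = header.rstrip(b"\r\n").split(b" ")
        url = fields[0].decode("latin-1")
        content_type = fields[-2].decode("latin-1")
        length = int(fields[-1])  # body size in bytes
        body = stream.read(length)
        yield offset, url, content_type, body


def read_at(stream, offset):
    """Extract one record by seeking to an offset taken from the index."""
    stream.seek(offset)
    return next(iter_arc_records(stream))


def make_record(url, content_type, body):
    """Build a synthetic record for the demo (hypothetical helper)."""
    header = f"{url} 192.0.2.1 20061206000000 {content_type} {len(body)}\n"
    return header.encode("latin-1") + body + b"\n"


sample = io.BytesIO(
    make_record("filedesc://sample.arc", "text/plain", b"1 0 Example\n")
    + make_record("http://www.example.org/", "text/html", b"<html>hi</html>")
)

# The index maps each URL to the byte offset of its record, so a mirror
# can be served from the master file without unpacking it.
index = {url: offset for offset, url, _type, _body in iter_arc_records(sample)}
record = read_at(sample, index["http://www.example.org/"])
```

Crawler output is often gzip-compressed; the offsets above apply only to the uncompressed stream, so a real index would need to account for that.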
joys of JavaScript... modifies the page after loading; HTML almost unmolested; changes explicit in code
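
One way to read "HTML almost unmolested": the server adds a single script reference to each archived page and leaves everything else byte-for-byte as captured, so any visible changes happen in the browser after loading. The Python sketch below illustrates that idea only; the function name and the `/archive-banner.js` path are assumptions, not the project's actual code.

```python
import re


def add_archive_script(html, script_url="/archive-banner.js"):
    """Insert a single <script> tag just before </head>.

    script_url is a hypothetical location for the client-side code
    that would mark the page as archived after it loads.
    """
    tag = f'<script src="{script_url}"></script>'
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        return html + tag  # no </head> found: append at the end
    return html[:match.start()] + tag + html[match.start():]


original = "<html><head><title>Home</title></head><body>Hello</body></html>"
served = add_archive_script(original)
```

Because the only difference is one known tag, removing it recovers the original page exactly.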
are we there yet? make the archive obvious; yet intrude as little as possible
global research locally: a web site in your pocket; applying local tools; maintaining browse-ability; Apple’s Spotlight one of many
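
For extracted files to stay browsable and searchable with local tools, each captured URL needs a predictable place on disk. The sketch below shows one possible mapping, assuming a `host/path` layout, `index.html` for directory URLs, and a short hash suffix for query strings; none of these choices come from the project itself.

```python
import hashlib
from urllib.parse import urlparse


def local_path(url):
    """Map a captured URL to a relative file path (hypothetical layout)."""
    parts = urlparse(url)
    host = parts.hostname or "unknown-host"
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # directory URLs need a file name
    if parts.query:
        # Keep pages that differ only by query string in separate files.
        digest = hashlib.sha256(parts.query.encode("utf-8")).hexdigest()[:8]
        path += "_" + digest
    return host + path


home = local_path("http://www.example.org/")
team = local_path("http://www.example.org/about/team.html")
```

A real extractor would also need to guard against path components such as `..` before writing to disk.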
sinkholes / strategies: partnership with institution (config, IP, retention); crawling far from perfect (no creation dates, exclusions, sticky traps, scripted pages like AJAX); scripts still immature (better demarcation; more self-contained, not at /)
still... capture & save what we can; keep it as “original” as possible; stay flexible for the future; have fun in the present!
more information: Eric Celeste