Documenting Internet2 an IT perspective Eric Celeste University of Minnesota (Twin Cities) Libraries for the Coalition for Networked Information 6 December.


1 Documenting Internet2: an IT perspective
Eric Celeste, University of Minnesota (Twin Cities) Libraries
for the Coalition for Networked Information, 6 December 2005
...or... A joyful romp with Heritrix, JavaScript, & Spotlight!

2 background...
DI2 brought together
–University of Minnesota (CBI)
–University of Michigan (SI)
–Internet2
web crawling only a small part
the “save everything” approach

3 briefly…
on crawling with spiders
on Heritrix and JavaScript
on Spotlight and local files
on sinkholes and strategies

4 spiders on the web

5 pages

6 links

7 hosts & domains

8 robots.txt
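
As a rough illustration of what a polite spider does with robots.txt, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the crawler name `example-crawler`, and the example.org URLs are all made up for illustration; they are not taken from the DI2 crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this from the host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)

# The crawler asks before fetching each URL.
allowed_home = checker.can_fetch("example-crawler", "http://www.example.org/index.html")
allowed_private = checker.can_fetch("example-crawler", "http://www.example.org/private/notes.html")
print(allowed_home, allowed_private)
```

The check is advisory: robots.txt only works because well-behaved crawlers choose to honor it.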

9 scope

10 seeds

11 excluded pages

12

13

14

15

16

17

18 done!

19 our crawler
Heritrix, from the IA
aiming for broad deployment, Archive-It
cross-platform, many users
simple setup, sophisticated options
generates ARC files

20 from ARC to archive
keep originals intact
a few large files to manage
can serve a mirror from the master
can extract files for research
solution requires Perl, PHP, JavaScript, MySQL

21 processing...
for mirroring online
–optimizing and indexing with Perl
–loading into MySQL database
–presenting via PHP
for using on local disk
–extracting files from ARC
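
The extraction step can be pictured with a short Python sketch. The talk's own tooling used Perl; this is a stand-in that assumes an uncompressed stream of version 1 ARC records, each with a one-line header (URL, IP address, archive date, content type, length) followed by that many bytes of content. Real ARC files are usually gzip-compressed and this sketch skips that, so treat it as an outline only. The helper names and sample data are hypothetical.

```python
import io
from typing import Iterator, NamedTuple


class ArcRecord(NamedTuple):
    url: str
    ip: str
    date: str
    content_type: str
    body: bytes


def iter_arc_records(stream) -> Iterator[ArcRecord]:
    """Yield records from an uncompressed ARC-style byte stream (assumed layout)."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of stream
        if not header.strip():
            continue  # blank separator line between records
        # Assumes no spaces inside the URL or content type.
        url, ip, date, content_type, length = header.decode("utf-8").split()
        body = stream.read(int(length))
        yield ArcRecord(url, ip, date, content_type, body)


def make_record(url: str, content_type: str, body: bytes) -> bytes:
    """Build a sample record for the demonstration below (illustrative values)."""
    header = f"{url} 192.0.2.1 20051206000000 {content_type} {len(body)}\n"
    return header.encode("utf-8") + body + b"\n"


SAMPLE = (
    make_record("http://www.example.org/", "text/html", b"<html>hello</html>")
    + make_record("http://www.example.org/logo.gif", "image/gif", b"GIF89a...")
)

records = list(iter_arc_records(io.BytesIO(SAMPLE)))
for record in records:
    print(record.url, record.content_type, len(record.body))
```

Because each header carries the content length, a reader can pull individual files out of one large ARC file while the original stays intact.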

22 joys of javascript...
modifies the page after loading
HTML almost unmolested
changes explicit in code
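
One way to read "HTML almost unmolested" is that the mirror serves each archived page with a single added script tag and leaves every other byte alone, so that any rewriting happens in the browser after the page loads. The Python sketch below shows that idea under that assumption; the slides do not say how the DI2 mirror actually did this, and the function name and `/archive.js` path are hypothetical.

```python
import re

# Matches the closing head tag regardless of letter case.
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def inject_archive_script(html: str, script_url: str) -> str:
    """Insert one script tag before </head>, leaving the rest of the page untouched."""
    tag = f'<script src="{script_url}"></script>'
    match = HEAD_CLOSE.search(html)
    if match is None:
        return html + tag  # no head element: append rather than alter the page
    return html[: match.start()] + tag + html[match.start():]


page = "<html><head><title>t</title></head><body>x</body></html>"
served = inject_archive_script(page, "/archive.js")
print(served)
```

Removing the one inserted tag gives back the original page exactly, which keeps every change explicit and in one place.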

23 are we there yet?
make the archive obvious
yet intrude as little as possible

24 global research locally
a web site in your pocket
applying local tools
maintaining browse-ability
Apple’s Spotlight one of many
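
Once files have been extracted to local disk, a tool such as Spotlight can search them. On macOS, Spotlight can be queried from the command line with `mdfind`, and the sketch below wraps that call in Python. The folder path and query are hypothetical, and the search itself only works on a Mac where the folder has been indexed.

```python
import subprocess


def build_mdfind_command(query: str, folder: str) -> list:
    """Build an mdfind command that limits the search to one folder."""
    return ["mdfind", "-onlyin", folder, query]


def search_extracted_files(query: str, folder: str) -> list:
    """Run the search and return matching file paths (macOS only)."""
    result = subprocess.run(
        build_mdfind_command(query, folder),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Hypothetical folder holding files extracted from the ARC archive.
command = build_mdfind_command("Abilene", "/Users/researcher/DI2-extract")
print(" ".join(command))
```

Any other desktop search or text-analysis tool could take Spotlight's place, since the extracted files are ordinary files on disk.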

25 sinkholes / strategies
partnership with institution
–config, IP, retention
crawling far from perfect
–no creation dates, exclusions
–sticky traps, scripted pages (AJAX)
scripts still immature
–better demarcation
–more self-contained (not at /)

26 still...
capture & save what we can
keep it as “original” as possible
stay flexible for the future
have fun in the present!

27 more information
http://wiki.lib.umn.edu/DI2/
Eric Celeste

