

1 Plugin Development & Standards Monika Mevenkamp MetaArchive Annual Membership Meeting Atlanta, Georgia Friday October 24, 2008

2 Plugin/Manifest Resources https://www.metaarchive.org/metawiki On the MetaArchive Wiki Main Page: Writing Plugins (Plugins, Rules, Manifest Pages, ...), Plugin/Manifest Standards and Recommendations, Plugin Development Cycle, Plugin Examples, Plugin Resources on the Web, Plugins at Virginia Tech, Subversion Configuration and Usage

3 Remember: Title Database (lockss.xml), Keystore for Plugins, Plugin Repositories holding Signed Jar Files (edu_somewhere_allcontent.jar, edu_xyz_other.jar, ...), and manifest.html pages. Daemons magically harvest web sites as you want, e.g. http://somewhere.edu/someStuff

4 Daemon Crawl: Start at the manifest page. Find links to new URLs. Apply filtering rules from the plugin. Fetch the survivors. Look inside html files. Go on until there are no more links to look at.

5 Example: allContent Minor variation of Metawiki: Writing Plugins | Plugin Examples | harvestEverything The Website: http://somewhere.edu/smallSite The Archival Unit: plugin: edu.somewhere.smallSite Base_Url: http://somewhere.edu/smallSite The Plugin: name: edu.somewhere.smallSite start_url: Base_Url/manifest.html rules: Exclude No Match ^Base_Url Include ^Base_Url/manifest.html$ Include ^Base_Url/
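Rule order matters: the daemon walks the rules in sequence and the first decisive rule wins. Below is a minimal Python sketch of that evaluation for these three rules with Base_Url substituted in; the regexes and the passes_rules helper are illustrative stand-ins, not the daemon's actual API.

```python
import re

# The three allContent-style rules with Base_Url already substituted.
# Order matters: the first rule that decides wins.
BASE = r"http://somewhere\.edu/smallSite"
RULES = [
    ("exclude_no_match", "^" + BASE),                       # Exclude No Match ^Base_Url
    ("include",          "^" + BASE + r"/manifest\.html$"), # Include ^Base_Url/manifest.html$
    ("include",          "^" + BASE + "/"),                 # Include ^Base_Url/
]

def passes_rules(url):
    for action, pattern in RULES:
        matched = re.search(pattern, url) is not None
        if action == "exclude_no_match" and not matched:
            return False          # off-site url: excluded immediately
        if action == "include" and matched:
            return True
        if action == "exclude" and matched:
            return False
    return False                  # end of rules reached: exclude

print(passes_rules("http://somewhere.edu/smallSite/manifest.html"))  # True
print(passes_rules("http://www.google.com/"))                        # False
```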

6 Example: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite/index.html

7 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite The Archival Unit: plugin: edu.somewhere.smallSite Base_Url: http://.../smallSite

8 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Base_Url = http://..../smallSite The Plugin says start_url: Base_Url/manifest.html start_url: http://somewhere.edu/.../smallSite/manifest.html Start off with Host = somewhere.edu (the manifest page has the permission statement) Pending: url: http://..../smallSite/manifest.html

9 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Host = somewhere.edu Base_Url = http://..../smallSite Pending: url: http://..../smallSite/manifest.html Remove the first url and apply filtering rules to url: http://..../smallSite/manifest.html Rule1: Exclude No Match ^Base_Url: ^http://..../smallSite matches, so the rule does not apply => not excluded Rule2: Include ^Base_Url/manifest.html$: ^http://..../smallSite/manifest.html$ applies => include Fetch url http://..../smallSite/manifest.html

10 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Host = somewhere.edu Base_Url = http://..../smallSite Pending: NONE Fetch url http://..../smallSite/manifest.html Parse html Find: url: http://..../smallSite/index.html Add to Pending list url: http://..../smallSite/index.html

11 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Host = somewhere.edu Base_Url = http://..../smallSite Pending: url: http://..../smallSite/index.html Remove and apply filtering rules to url: http://..../smallSite/index.html Rule1: Exclude No Match ^Base_Url: ^http://..../smallSite matches, so the rule does not apply => not excluded Rule2: Include ^Base_Url/manifest.html$: ^http://..../smallSite/manifest.html$ does not apply => go on Rule3: Include ^Base_Url/: ^http://..../smallSite/ applies => include Fetch url http://..../smallSite/index.html

12 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Host = somewhere.edu Base_Url = http://..../smallSite Pending Urls: NONE Fetch http://..../smallSite/index.html Parse html Find: url: http://en.wikipedia.org/wiki/Main_Page url: http://www.google.com url: http://.../smallSite/frank.html url: http://.../smallSite/corbusier.html Do not add the google.com & wikipedia.org urls: they are not on the host Add the new urls to the Pending List

13 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Base_Url = http://..../smallSite Pending: url: http://.../smallSite/frank.html url: http://.../smallSite/corbusier.html Remove the first and apply filtering rules to url: http://.../smallSite/frank.html Rule1: Exclude No Match ^Base_Url: ^http://..../smallSite matches, so the rule does not apply => not excluded Rule2: Include ^Base_Url/manifest.html$: does not apply => go on Rule3: Include ^Base_Url/: ^http://..../smallSite/ applies => include Fetch url http://.../smallSite/frank.html

14 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Base_Url = http://..../smallSite Pending: url: http://.../smallSite/corbusier.html Fetch http://..../smallSite/frank.html Parse html Find: url: http://.../smallSite/index.html url: http://.../smallSite/wright.jpg index.html is an old url Add wright.jpg to the pending list

15 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Base_Url = http://..../smallSite Pending: url: http://.../smallSite/corbusier.html url: http://.../smallSite/wright.jpg Remove and apply rules to url: http://.../smallSite/corbusier.html Fetch corbusier.html Parse html Find: url: http://.../smallSite/index.html url: http://.../smallSite/corbusier.jpg index.html is an old url Add the image to the Pending List

16 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Base_Url = http://..../smallSite Pending: url: http://.../smallSite/wright.jpg url: http://.../smallSite/corbusier.jpg Remove and Apply Rules to url: http://.../smallSite/wright.jpg Fetch wright.jpg Remove and Apply Rules to url: http://.../smallSite/corbusier.jpg Fetch corbusier.jpg Pending List is Empty

17 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Nobody links to README

18 Crawl: smallSite http://www.metaarchive.org/lockss/help/plugins/site/smallSite Nobody links to README

19 Daemon Crawl Refined
Start with the manifest / start_url.
Apply filtering rules in sequence: stop when there is an Include or Exclude decision; exclude when reaching the end of the rules.
If a url passes the filtering rules: fetch the url; parse it if it is in html format and collect the urls found in links, but only if a found url is on the permitted host and is a new url.
Go on until there are no more urls.
Special handling for the start_url: look for the LOCKSS permission statement; do not continue if it is not there; otherwise use the manifest's host to decide whether to add urls to the pending list.
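Putting the refined loop into code: a compact Python sketch, reusing passes_rules from the earlier sketch. The fetch, extract_links, and has_permission callbacks are hypothetical stand-ins for daemon internals, not LOCKSS APIs.

```python
from urllib.parse import urlparse

def crawl(start_url, fetch, extract_links, has_permission):
    """Sketch of the refined daemon crawl: filter, fetch, parse, repeat."""
    host = urlparse(start_url).netloc
    store, seen, pending = {}, {start_url}, [start_url]
    while pending:
        url = pending.pop(0)                  # remove the first pending url
        if not passes_rules(url):             # first decisive rule wins
            continue
        content = fetch(url)
        if url == start_url and not has_permission(content):
            return {}                         # no permission statement: no harvest
        store[url] = content
        if url.endswith(".html"):             # look inside html files only
            for link in extract_links(content):
                # only urls on the permitted host, and only new ones
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    pending.append(link)      # new urls go at the end
    return store
```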

20 Content -> MetaArchive: Define the collection in the Conspectus Tool. Create the Manifest Page (manifest.html). Create & Test the Plugin.

21 Conspectus http://www.metaarchive.org/conspectus/index.php Create New Metadata Description: Web Crawl Plugin, Base_URL, no additional params... LOCKSS meta data... Title, Description... general meta data...

22 Plugin Creation edu/somewhere/allContent.xml If your institution is somewhere.edu, keep all your plugins in files edu/somewhere/pluginThisName.xml edu/somewhere/institute/pluginThatOne.xml Where possible, create plugins by copying a similar one. Edit/create plugins with the plugintool. The plugin name must match the filename: edu/somewhere/pluginThisName.xml Name: edu.somewhere.pluginThisName edu/somewhere/institute/pluginThatOne.xml Name: edu.somewhere.institute.pluginThatOne
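The file-path/plugin-name correspondence is mechanical, so it is easy to check. A small Python sketch (plugin_name_for is an illustrative helper, not part of the plugintool):

```python
from pathlib import Path

def plugin_name_for(path):
    """Derive the expected plugin name from its file path,
    e.g. edu/somewhere/pluginThisName.xml -> edu.somewhere.pluginThisName."""
    p = Path(path)
    return ".".join(p.parts[:-1] + (p.stem,))

assert plugin_name_for("edu/somewhere/pluginThisName.xml") \
    == "edu.somewhere.pluginThisName"
assert plugin_name_for("edu/somewhere/institute/pluginThatOne.xml") \
    == "edu.somewhere.institute.pluginThatOne"
```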

23 Manifest – Where to put it http://somewhere.edu/someStuff/manifest.html The conspectus says: Collection URI: http://somewhere.edu/someStuff Plugin Identifier: edu.somewhere.allContent The Plugin in file edu/somewhere/allContent.xml will be signed, jarred, and deployed in a plugin repository. It has the name edu.somewhere.allContent. Its start_url pattern determines the name of the manifest page: start_url: Base_URL/manifest.html So the manifest must be called: http://somewhere.edu/someStuff/manifest.html

24 Manifest http://somewhere.edu/someStuff/manifest.html The manifest page seeds the web crawl with the urls it links to. It needs to cooperate with the plugin. It is under the control of the content owner, readily accessible, and unlike plugins it may be changed easily.

25 Manifest http://somewhere.edu/someStuff/manifest.html based on: metasource-svn /trunk/manifests/manifest_template.html
Talks By The Famous Guy LOCKSS Manifest Page
Collection Info: * Conspectus Collection(s): Talks By The Famous Guy * Institution: Famous Institute, Some Where University * Contact Info: Edna Krabappel, digital librarian * Contact Info: Milhouse van Houten, plugin developer
This collection contains transcripts of lectures given by the famous guy, starting with his PhD defense given in 1856 at the Current Institute.... It contains plain text, scanned images, and pdf files. The whole site is preserved.... Links to dublin core XML files are part of each lecture page.
Links for LOCKSS to start its crawl: * index.html - the home page of the Famous Guy Web Site
LOCKSS system has permission to collect, preserve, and serve this Archival Unit.
http://www.metaarchive.org/conspectus/view_rdf.php?collection=39 Conspectus Tool | View RDF
Slide annotations: Mail links OK. Keep updated. Human readable. Whatever is relevant/special. Coordinate links with plugin. No Permission, No Harvest.

26 Preserving Big Collections [diagram contrasting a Hard approach with an Easier one]

27 Example: bigSite
Index: bigSite: Parent Directory, dnaSearch/, microbes/, viruses/, ...
Index: bigSite/dnaSearch: Parent Directory, branchAndBound.java, branchAndBound.pdf, branchAndBound.tgz, divideAndConquerChart1.tif, divideAndConquerChart2.tif, divideAndConquer.java, divideAndConquer.pdf, divideAndConquer.txt
Index: bigSite/microbes: Parent Directory, dc, pdf, tif
Index: bigSite/microbes/dc: Parent Directory, actinomycetes.dc, leptospira.dc, ...
Index: bigSite/microbes/pdf: Parent Directory, actinomycetes.pdf, leptospira.pdf, ...
Index: bigSite/microbes/tif: Parent Directory, actinomycetes.tif, leptospira.tif, ...
Index: bigSite/viruses: Parent Directory, dc, wav

28 Example: bigSite Plugin Minor variation of Metawiki: Writing Plugins | Plugin Examples | harvestAllFilesInSubdirectory
The Plugin: name: edu.somewhere.bigSite param: volume_name start_url: Base_Url/manifest.html rules: Exclude No Match ^Base_Url Include ^Base_Url/manifest.html$ Include ^Base_Url/?$ Include ^Base_Url/volume_name$ Include ^Base_Url/volume_name/
Archival Unit: plugin: edu.somewhere.bigSite Base_Url: http://.../bigSite volume_name: dnaSearch
Archival Unit: plugin: edu.somewhere.bigSite Base_Url: http://.../bigSite volume_name: microbes
Archival Unit: plugin: edu.somewhere.bigSite Base_Url: http://.../bigSite volume_name: viruses
http://.../bigSite/manifest.html links to http://.../bigSite
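One plugin serves three archival units; only the volume_name value differs per AU. A toy Python sketch of how the parameter placeholders expand into concrete patterns (the expand helper is illustrative; the daemon performs this substitution internally when an AU is configured):

```python
def expand(template, params):
    """Substitute plugin parameter values into a rule or start_url template."""
    for name, value in params.items():
        template = template.replace(name, value)
    return template

params = {"Base_Url": "http://somewhere.edu/bigSite", "volume_name": "microbes"}
templates = [
    ("Exclude No Match", "^Base_Url"),
    ("Include", "^Base_Url/manifest.html$"),
    ("Include", "^Base_Url/?$"),
    ("Include", "^Base_Url/volume_name$"),
    ("Include", "^Base_Url/volume_name/"),
]
for action, pattern in templates:
    print(action, expand(pattern, params))
# e.g. Include ^http://somewhere.edu/bigSite/microbes/
```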

29 Example: bigSite Plugin Minor variation of Metawiki: Writing Plugins | Plugin Examples | harvestAllFilesInSubdirectory
Archival Unit: plugin: edu.somewhere.bigSite Base_Url: http://somewhere.edu/bigSite volume_name: dnaSearch
http://somewhere.edu/bigSite/manifest.html
http://somewhere.edu/bigSite/
http://somewhere.edu/
http://somewhere.edu/bigSite/
http://somewhere.edu/bigSite/dnaSearch
http://somewhere.edu/bigSite/microbes
http://somewhere.edu/bigSite/viruses
http://somewhere.edu/bigSite/
http://somewhere.edu/bigSite/dnaSearch/
http://somewhere.edu/bigSite/dnaSearch/branchAndBound.java
http://somewhere.edu/bigSite/dnaSearch/branchAndBound.pdf
http://somewhere.edu/bigSite/dnaSearch/branchAndBound.tgz
http://somewhere.edu/bigSite/dnaSearch/divideAndConquerChart1.tif
http://somewhere.edu/bigSite/dnaSearch/divideAndConquerChart2.tif
http://somewhere.edu/bigSite/dnaSearch/divideAndConquer.java
http://somewhere.edu/bigSite/dnaSearch/divideAndConquer.pdf
http://somewhere.edu/bigSite/dnaSearch/divideAndConquer.txt

30 Example: bigSite Plugin Minor variation of Metawiki: Writing Plugins | Plugin Examples | harvestAllFilesInSubdirectory
Archival Unit: plugin: edu.somewhere.bigSite Base_Url: http://somewhere.edu/bigSite volume_name: microbes
http://somewhere.edu/bigSite/manifest.html
http://somewhere.edu/bigSite/
http://somewhere.edu/
http://somewhere.edu/bigSite/
http://somewhere.edu/bigSite/dnaSearch
http://somewhere.edu/bigSite/viruses
http://somewhere.edu/bigSite/microbes
http://somewhere.edu/bigSite/
http://somewhere.edu/bigSite/microbes/
http://somewhere.edu/bigSite/microbes/dc
http://somewhere.edu/bigSite/microbes/pdf
http://somewhere.edu/bigSite/microbes/tif
http://somewhere.edu/bigSite/microbes/
http://somewhere.edu/bigSite/microbes/dc/
http://somewhere.edu/bigSite/microbes/dc/actinomycetes.dc
http://somewhere.edu/bigSite/microbes/dc/leptospira.dc
http://somewhere.edu/bigSite/microbes/
http://somewhere.edu/bigSite/microbes/pdf/
http://somewhere.edu/bigSite/microbes/pdf/actinomycetes.pdf
http://somewhere.edu/bigSite/microbes/pdf/leptospira.pdf
http://somewhere.edu/bigSite/microbes/
http://somewhere.edu/bigSite/microbes/tif/
http://somewhere.edu/bigSite/microbes/tif/actinomycetes.tif
http://somewhere.edu/bigSite/microbes/tif/leptospira.tif

31 Example: bigSite Harvest
Index: bigSite: Parent Directory, algorithms/, microbes/, proprietary/, ...
Index: bigSite/dnaSearch: Parent Directory, branchAndBound.java, branchAndBound.pdf, branchAndBound.tgz, divideAndConquerChart1.tif, divideAndConquerChart2.tif, divideAndConquer.java, divideAndConquer.pdf, divideAndConquer.txt
Index: bigSite/microbes: Parent Directory, dc, pdf, tif
Index: bigSite/microbes/dc: Parent Directory, actinomycetes.dc, leptospira.dc, ...
Index: bigSite/microbes/pdf: Parent Directory, actinomycetes.pdf, leptospira.pdf, ...
Index: bigSite/microbes/tif: Parent Directory, actinomycetes.tif, leptospira.tif, ...
Index: bigSite/viruses: Parent Directory, dc, wav
manifest.html

32 WYFISWYP: server processing can get in the way
SERVER has -> CLIENT/CACHE sees:
html files / docs -> html files / docs
css files -> css files
server side includes -> expanded files
cgi processing -> whatever cgi-scripts generate
database + code -> whatever code generates
xml + xsl -> transformed xml

33 WYFISWYP: server processing can get in the way. What You Fetch Is What You Preserve. http://.../site: ruby/cgi + database -> html. The database has all metadata; the html has links to documents (pdf, wav, ...). Decide: Is html / documents enough? If yes, have manifest.html link to http://.../site. If no, explore options: is a metadata export available? can you modify the site? database dump (XML, SQL)?

34 WYFISWYP: server processing can get in the way. What You Fetch Is What You Preserve. http://.../site: ruby/cgi + database -> rdf/mets. The database has all metadata; the rdf/mets has links to documents (pdf, wav, ...). Decide: Do you want the docs? Be aware that LOCKSS does not parse XML. Solution? Site dependent.

35 WYFISWYP: server processing can get in the way. What You Fetch Is What You Preserve. http://.../site: ruby converts xml -> html with the help of xsl (http://.../site/xml, http://.../site/xsl). Decide: Do you want html or XML/XSL? If html, have manifest.html link to http://.../site. If XML/XSL, have manifest.html link to http://.../site/xml/ and http://.../site/xsl

36 WYFISWYP: server processing... Metawiki: Writing Plugins | Plugin Examples | harvestXMLSite What You Fetch Is What You Preserve. The site is XML/XSL based; transformation to html is left to the client; the xml contains references to document urls. Decide: Is xml enough? If yes, have manifest.html link to http://.../site. If no: xml files are not parsed by the LOCKSS harvester; if you want the xsl files, add links to the manifest page; if you want the document files, add links to the manifest page; if the site is well structured, link from the manifest page to the directory containing all xsl/document files.

37 Re-Harvest is different. When harvesting the first time: everything is new; all urls that pass the plugin filter must be fetched; crawl all relevant parts of the site. When re-harvesting later: goal: fetch only changes; avoid: a complete crawl; method: visit only the updated parts of the website. You need to know how the site changes over time to get this right.

38 Crawl Depth The LOCKSS harvester crawls breadth-first: it adds new urls at the end of the pending url list. manifest.html is at depth 0; links found in manifest.html are at depth 1; new urls found in depth 1 pages are at depth 2; new urls found in depth 2 pages are at depth 3; and so on.
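A small Python sketch of that depth assignment, assuming a hypothetical links_from callback that returns the links on a page; the depth a url gets is fixed the first time it is seen.

```python
from collections import deque

def crawl_depths(start_url, links_from):
    """Breadth-first depth assignment: manifest at 0, its links at 1, ..."""
    depth = {start_url: 0}
    pending = deque([start_url])
    while pending:
        url = pending.popleft()
        for link in links_from(url):
            if link not in depth:        # first sighting fixes the depth
                depth[link] = depth[url] + 1
                pending.append(link)
    return depth

# Toy bigSite-like structure (illustrative paths, not the real site):
site = {
    "manifest.html": ["bigSite/"],
    "bigSite/": ["bigSite/microbes/"],
    "bigSite/microbes/": ["bigSite/microbes/dc/"],
    "bigSite/microbes/dc/": [],
}
print(crawl_depths("manifest.html", lambda u: site.get(u, [])))
# {'manifest.html': 0, 'bigSite/': 1, 'bigSite/microbes/': 2, 'bigSite/microbes/dc/': 3}
```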

39 Crawl Depth: bigSite
[Same bigSite directory indexes as slide 31, annotated with crawl depths: manifest.html = 0; the bigSite index = 1; the dnaSearch, microbes, and viruses indexes = 2; the microbes dc/pdf/tif indexes = 3]

40 Daemon Crawl Again Metawiki: Writing Plugins | Plugin/Manifest Standards and Recommendations | PluginCrawlDepth
Start with the manifest / start_url.
Apply filtering rules in sequence: stop when there is an Include or Exclude decision; exclude when reaching the end of the rules.
If a url passes the filtering rules: fetch the url; parse it if it is in html format and collect the urls found in links. For each found url that is on the permitted host (the enqueue decision is sketched in code after this list):
if it has never been fetched before, add it to the pending list;
otherwise, if the crawl depth of the found url < the plugin crawl depth: if the server gives modified info and the server url is newer than the stored url, add it to the pending list; if the server does not give modified info, add it to the pending list.
Go on until there are no more urls.
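A hedged Python sketch of just that enqueue decision; the function name and argument shapes are illustrative, not the daemon's internals.

```python
def should_enqueue(found_depth, plugin_crawl_depth,
                   fetched_before, server_modified, stored_modified):
    """Decide whether a found url goes on the pending list on a re-crawl."""
    if not fetched_before:
        return True                       # never fetched: always take it
    if found_depth >= plugin_crawl_depth:
        return False                      # beyond the plugin crawl depth
    if server_modified is None:
        return True                       # server gives no modified info
    return server_modified > stored_modified  # only if the server copy is newer
```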

41 Plugin Crawl Depth By default plugins always fetch the manifest page but continue on only if they find changes. Set a plugin's crawl depth in the plugintool: Plugin | Expert Mode, set the Default Crawl Depth value. Set it to 99 to force a complete crawl each time. Beware: the plugintool's Test Depth is not related to the plugin's Default Crawl Depth.

42 Re-Harvest: bigSite
[Same bigSite directory indexes and crawl depth annotations as slide 39, with a Plugin Crawl Depth marker showing where the re-crawl stops]

43 Harvesting Tips/Standards Metawiki: Writing Plugins | Plugin/Manifest Standards and Recommendations
Once ingested by a LOCKSS daemon, Archival Units are uniquely defined by: the plugin name (the name cannot change), the Base_Url value (the value cannot change), and any extra parameters and their values (params and values cannot change). Only the plugin definition, start_url, and filter rules can change.
Example collection:
edu.somewhere.allContent Base_URL: http://somewhere.edu/someStuff
edu.somewhere.allContent Base_URL: http://somewhere.edu/thatStuff
edu.somewhere.someJournal Base_URL: http://somewhere.edu/someJournal year: 1998
edu.somewhere.someJournal Base_URL: http://somewhere.edu/someJournal year: 1999
edu.somewhere.someJournal Base_URL: http://somewhere.edu/someJournal year: 2000
edu.somewhere.otherStuffPlugin Base_URL: http://somewhere.edu/otherStuff
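A way to picture AU identity as a value: two AUs of the same plugin that differ only in a parameter value are distinct, and none of the identifying parts may ever change. The au_id helper below is purely illustrative; the real daemon encodes AU ids differently.

```python
# Illustrative only: an AU's identity as (plugin name, defining params).
def au_id(plugin, **params):
    return (plugin, tuple(sorted(params.items())))

au_1999 = au_id("edu.somewhere.someJournal",
                base_url="http://somewhere.edu/someJournal", year="1999")
au_2000 = au_id("edu.somewhere.someJournal",
                base_url="http://somewhere.edu/someJournal", year="2000")
assert au_1999 != au_2000   # same plugin, different archival units
```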

44 Harvesting Tips/Standards Metawiki: Writing Plugins | Plugin/Manifest Standards and Recommendations Shoot for an AU size between 1GB and 10GB. Remember that LOCKSS daemons never delete anything: even if your site deletes a file for every file it creates, its AU grows. Choose your Base_URL wisely: do not include port numbers (http://somewhere.edu:8081/foobar) and do not go full length (http://somewhere.edu/long/path/to/some/stuff). You can always include port numbers or long paths in the start_url and rule patterns: start_url: Base_Url/manifest.html start_url: Base_Url:8081/manifest.html start_url: Base_Url/long/path/to/some/stuff/manifest.html Do not share plugins across collections unless you are sure that site structure and preservation needs will remain the same. Collections may change over time; keep it flexible. It's often better to exclude what you don't want than to list what you want. Periodically make sure plugin/manifest produce correct preservation behavior.

45 Manifest Tips / Standards Metawiki: Writing Plugins | Plugin/Manifest Standards and Recommendations Use the manifest template as a starting point (svn copy from metasource subversion). Include the LOCKSS permission statement. Include relevant information about the collection in the manifest page. Include a link to the collection's conspectus entry. Include information about what is preserved, how content is organized, and where to find metadata. One collection, one manifest page, even if there are multiple archival units. Include few links in the manifest page.

46 Manifest Tips / Standards Metawiki: Writing Plugins | Plugin/Manifest Standards and Recommendations Include few links: if a collection has multiple archival units, design plugin rules that include different sections of the site at Base_URL, and/or include links in the manifest that point to the 'bases' of the archival units. Never Ever Ever write a script that crawls your site and generates url lists for the manifest. Almost Never Ever Ever use the manifest page to list individual urls for preservation.

47 Plugin Tips / Standards Metawiki: Writing Plugins | Plugin/Manifest Standards and Recommendations If there is a similar plugin, svn copy it; all metaarchive plugins are in metasource subversion. Start the plugin name with the reverse of your institution's domain name, e.g. edu.gatech.smartechPlugin. Make sure the plugin is stored in a file that matches its name; capitalization matters, e.g. edu/gatech/smartechPlugin.xml. Use the plugin notes for documenting the plugin's intent and expectations for the manifest page.

48 Plugin Tips / Standards Metawiki: Writing Plugins | Plugin/Manifest Standards and Recommendations Start rules with Exclude No Match ^Base_Url Include ^whatever start_url says$ Plugins for collections with multiple archival units have additional parameter(s): Volume No., Year, Year (2 digit); these parameters can assume only numerical values. The plugin's start_url should be the same for all archival units.

49 Plugin Tips / Standards Test, Test, TEST: plugintool for basic operation / ingest; harvest with run_one_daemon; test whether web site changes are picked up; use the audit proxy to inspect preserved content; check the content of the cache/ subdirectory.

50 Plugin Development Editor: plugintool http://www.lockss.org/lockss/Plugin_Tool Testing: plugintool, run_one_daemon Source control: subversion https://svn.library.emory.edu/oss/metasource MetaArchive Project Wiki about subversion: https://www.metaarchive.org/metawiki/index.php?title=Subversion Subversion Book: svnbook.red-bean.com/ RapidSVN: http://www.rapidsvn.org/index.php/Main_Page Documentation: MetaArchive Project Wiki: Writing Plugins https://www.metaarchive.org/metawiki/index.php?title=Plugins LOCKSS web site: http://www.lockss.org/lockss/Plugin_Tool#Tutorial

