Heritrix 3: librarian features BnF proposal March 2015
Context
Follow-up to our NetarchiveSuite workshop in Tallinn:
– Identified work packages:
  – tests
  – template migration
  – implementation of important but missing curator features for common operations in Heritrix 3
BnF will further describe use cases, share them with the community for feedback, and implement the following features as a minimal Heritrix UI add-on.
From H1…
… to H3
Common curator operations
– Search crawl.log
– Add filter on current job (job configuration)
– Change domains/hosts budget (job configuration)
– View or delete frontier URIs
Search crawl.log (NASC61)
Add a page with the same layout but with 2 additional form fields:
– Regular expression:
– Show matches: 1000 (default # of matching URIs)
– Action => Display URIs (reversed order by default)
Possibility to refresh display (F5)
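The search behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the planned implementation: the function name `search_crawl_log` and its parameters are hypothetical, the regular expression is assumed to match anywhere in a log line, and "reversed order" is taken to mean "most recent matching lines first".

```python
import re
from collections import deque
from itertools import islice


def search_crawl_log(lines, pattern, limit=1000, reverse=True):
    """Return up to `limit` crawl.log lines matching `pattern`.

    Hypothetical helper: `lines` is any iterable of log lines
    (for example an open file), oldest line first.
    """
    regex = re.compile(pattern)
    matches = (line.rstrip("\n") for line in lines if regex.search(line))
    if reverse:
        # Keep only the last `limit` matches, then show the newest first.
        return list(reversed(deque(matches, maxlen=limit)))
    # Forward order: the first `limit` matches, oldest first.
    return list(islice(matches, limit))
```

Refreshing the page (F5) would simply call the function again on the current state of the log.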
[Figure: draft UI for « Search crawl log ». Elements shown: Status + job ID; Home; Display URIs; Forward / Reversed; Matching lines: 1000; Lines: displaying out of 12345]
Add filter on current job (DecideRule) (NASC60)
Not necessary to view active filters that were included from job start (NASC59)
Add a page containing a rejectTemporarily area working with the following parameters:
– Decision: REJECT
– List-logic: OR
– Regexp-list: empty at job start, free textarea which can be manually edited and sorted (440 px wide, 20 lines)
– Action => Save: save current filters and activate them for current job
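The REJECT / OR / regexp-list combination above can be illustrated with a small Python model. It is a sketch only, not Heritrix code: the class name `RejectTemporarily` and its methods are hypothetical, each regular expression is assumed to be matched against the whole URI, and the string "NONE" stands for "this rule expresses no decision".

```python
import re


class RejectTemporarily:
    """Hypothetical model of the rejectTemporarily area."""

    def __init__(self):
        # Empty at job start, as in the proposal.
        self.patterns = []

    def save(self, textarea):
        """Parse the textarea (one regex per line) and activate it."""
        # Compile everything first so that an invalid regex raises
        # re.error before the currently active list is replaced.
        compiled = [re.compile(line.strip())
                    for line in textarea.splitlines() if line.strip()]
        self.patterns = compiled

    def decide(self, uri):
        """List-logic OR: a single matching regex is enough to reject."""
        if any(p.fullmatch(uri) for p in self.patterns):
            return "REJECT"
        return "NONE"
```

With the textarea content `.*\.pdf` on one line and `.*calendar.*` on the next, a URI ending in `.pdf` or containing `calendar` would be rejected, while all other URIs are left to the rest of the job configuration.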
[Figure: draft UI for « Add filter on current job ». Elements shown: Status + job ID; Home; the notice "All URIs matching any of the following regular expressions will be rejected from the current job."; a "Regular expressions:" textarea; Save]
Change domains/hosts budget
Works with the queue-total-budget and quota-enforcer systems
Add a page containing:
– a list of domains/hosts (in domain alphabetical order)
– their associated budget value (which can be edited)
– only those whose budget differs from the default
– and a form field to add a new domain/host
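The listing rules above (only non-default budgets, sorted by domain) could look like the following Python sketch. All names are hypothetical, and "domain alphabetical order" is interpreted here as sorting hosts by their last two labels, which is a simplification that ignores multi-label public suffixes such as `co.uk`.

```python
def domain_key(host):
    """Sort key: approximate domain (last two labels), then full host."""
    host = host.lower()
    labels = host.split(".")
    return (".".join(labels[-2:]), host)


def changed_budgets(budgets, default_budget):
    """Return (host, budget) pairs whose budget differs from the default."""
    changed = [(host, budget) for host, budget in budgets.items()
               if budget != default_budget]
    return sorted(changed, key=lambda item: domain_key(item[0]))


def set_budget(budgets, host, budget):
    """Add a new domain/host or edit an existing budget value."""
    budgets[host.lower()] = budget
```

Adding a new entry through the form field would call `set_budget` and then redisplay the result of `changed_budgets`.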
[Figure: draft UI for « Change domains/hosts budget ». Elements shown: Status + job ID; Home; "Budget defined in job configuration: queue-total-budget of URIs."; "Budgets of following domains/hosts have been changed in the current job:" with bnf.fr, ina.fr, cnc.fr; "New domain/host:" field (example: toto.fr); Save]
View or delete frontier URIs (NASC56 + NASC57 + NASC58)
Add a page containing 2 form fields:
– Regular expression:
– Show matches: 1000 (default # of matching URIs)
– Action A => Display URIs: displays the matching URIs and the # of matching URIs, and gives the possibility to view the next block of matching URIs
– Action B => Delete URIs: deletes the matching URIs and indicates the # of deleted URIs
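Actions A and B can be sketched on an in-memory list of URIs. This Python example is illustrative only and does not use the Heritrix frontier API: the function names are hypothetical, the frontier is modelled as a plain list, and the regular expression is assumed to match the whole URI.

```python
import re


def view_frontier(uris, pattern, offset=0, limit=1000):
    """Action A: return one block of matching URIs and the total count.

    Pass offset=offset+limit to view the next block.
    """
    regex = re.compile(pattern)
    matching = [u for u in uris if regex.fullmatch(u)]
    return matching[offset:offset + limit], len(matching)


def delete_from_frontier(uris, pattern):
    """Action B: return the remaining URIs and the number deleted."""
    regex = re.compile(pattern)
    kept = [u for u in uris if not regex.fullmatch(u)]
    return kept, len(uris) - len(kept)
```

Running Action A with the same regular expression before Action B gives the curator a preview of exactly what would be deleted.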
[Figure: draft UI for « View or delete frontier URIs ». Elements shown: Status + job ID; Home; Matching lines: 1000; URIs: displaying out of; the notice "Pause the job first to view frontier"]
search Job configuration add filter – change budget
Comparison with BAnQ