Published by Annice Perkins. Modified over 9 years ago.
1
Heritrix 3: librarian features
BnF proposal, March 2015
2
Context
Follow-up to our NetarchiveSuite workshop in Tallinn:
– https://sbforge.org/display/NAS/2015+Workshop+Conclusion
Identified work packages:
– tests
– template migration
– implementation of important but missing curator features for common operations in Heritrix 3
BnF will further describe the use cases, share them with the community for feedback, and implement the following features as a minimal Heritrix UI add-on.
3
From H1…
4
… to H3
7
Common curator operations
– Search crawl.log
– Add filter on current job (job configuration)
– Change domains/hosts budget (job configuration)
– View or delete frontier URIs
10
Search crawl.log (NASC61)
Add a page with the same layout but with 2 additional form fields:
– Regular expression:
– Show matches: 1000 (default # of matching URIs)
– Action => Display URIs (reversed order by default)
Possibility to refresh the display (F5)
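The search described on this slide can be sketched roughly as follows. This is only an illustration of the intended behaviour (regular expression, cap of 1000 matches, reversed order by default), not the proposed add-on itself. The function name `search_crawl_log` and the sample lines are hypothetical, and the sample lines are simplified rather than real crawl.log entries.

```python
import re
from typing import Iterable, List


def search_crawl_log(lines: Iterable[str], pattern: str,
                     max_matches: int = 1000,
                     reverse: bool = True) -> List[str]:
    """Return up to max_matches log lines that match the regular expression."""
    regex = re.compile(pattern)
    # Keep every line in which the pattern occurs somewhere.
    matches = [line.rstrip("\n") for line in lines if regex.search(line)]
    if reverse:
        # Reversed order: the most recently logged lines come first.
        matches.reverse()
    return matches[:max_matches]


# Simplified, made-up log lines (status code followed by URI).
sample_log = [
    "200 http://example.org/index.html",
    "404 http://example.org/missing.png",
    "200 http://example.net/about.html",
    "200 http://example.org/news.html",
]

result = search_crawl_log(sample_log, r"example\.org", max_matches=2)
print(result)
```

Refreshing the page (F5) would simply mean calling the search again on the current contents of the log.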
11
Draft UI for « Search crawl log »
Mock-up elements: Status + job ID; Home; Forward / Reversed; Matching lines: 1000; Display URIs; Lines: displaying 1-1000 out of 12345
12
Common curator operations
– Search crawl.log
– Add filter on current job (job configuration)
– Change domains/hosts budget (job configuration)
– View or delete frontier URIs
14
Add filter on current job (DecideRule) (NASC60)
Not necessary to view active filters that were included from job start (NASC59)
Add a page containing a rejectTemporarily area working with the following parameters:
– Decision: REJECT
– List-logic: OR
– Regexp-list: empty at job start, free textarea which can be manually edited and sorted (440 px wide, 20 lines)
– Action => Save: save current filters and activate them for current job
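The parameters above (decision REJECT, OR logic over a regexp list that starts empty) can be illustrated with a small sketch. The class `RejectTemporarilyRule` is a hypothetical stand-in written for this example, not Heritrix code. It assumes one regular expression per textarea line and whole-URI matching; swap `fullmatch` for `search` if substring matching is wanted. The "NONE" result stands for "no decision, leave the URI to the other rules".

```python
import re
from typing import List


class RejectTemporarilyRule:
    """Hypothetical model of the proposed rejectTemporarily filter."""

    def __init__(self) -> None:
        # The regexp list is empty at job start.
        self.patterns: List[re.Pattern] = []

    def save(self, textarea: str) -> None:
        """Parse one regex per non-blank line and activate the new list."""
        lines = [ln.strip() for ln in textarea.splitlines() if ln.strip()]
        self.patterns = [re.compile(ln) for ln in lines]

    def decide(self, uri: str) -> str:
        """Return REJECT if ANY pattern matches the whole URI (OR logic)."""
        if any(p.fullmatch(uri) for p in self.patterns):
            return "REJECT"
        return "NONE"  # no decision from this rule


rule = RejectTemporarilyRule()
before = rule.decide("http://example.org/calendar/2015")

# Curator pastes two expressions into the textarea and presses Save.
rule.save(".*/calendar/.*\n\n.*\\.iso")
after = rule.decide("http://example.org/calendar/2015")
other = rule.decide("http://example.org/about.html")
print(before, after, other)
```

Because the list is replaced as a whole on Save, removing a line from the textarea also removes that filter from the running job in this model.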
15
Draft UI for « Add filter on current job »
Mock-up elements: Status + job ID; Home; text "All URIs matching any of the following regular expressions will be rejected from the current job."; Regular expressions: (textarea); Save
17
Common curator operations
– Search crawl.log
– Add filter on current job (job configuration)
– Change domains/hosts budget (job configuration)
– View or delete frontier URIs
18
Change domains/hosts budget
Works with the queue-total-budget and quota-enforcer systems
Add a page containing:
– a list of domains/hosts (in domain alphabetical order)
– their associated budget value (which can be edited)
– only those whose budget is not set by default
– and a form field to add a new domain/host
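The page content listed above amounts to a table of per-host overrides on top of a default budget. The sketch below is a hypothetical model of that table (the class `BudgetTable` and its methods are invented for illustration and do not correspond to Heritrix classes). The numbers are the ones shown in the draft UI, and plain alphabetical sorting of host names is assumed.

```python
from typing import Dict, List, Tuple


class BudgetTable:
    """Hypothetical model of per-domain/host budget overrides."""

    def __init__(self, default_budget: int) -> None:
        self.default_budget = default_budget
        self.overrides: Dict[str, int] = {}

    def set_budget(self, host: str, budget: int) -> None:
        """Add or edit an override; a default value removes the override."""
        if budget == self.default_budget:
            self.overrides.pop(host, None)
        else:
            self.overrides[host] = budget

    def budget_for(self, host: str) -> int:
        """Effective budget: the override if present, else the default."""
        return self.overrides.get(host, self.default_budget)

    def listing(self) -> List[Tuple[str, int]]:
        """Only hosts with a non-default budget, in alphabetical order."""
        return sorted(self.overrides.items())


table = BudgetTable(default_budget=100_000)  # queue-total-budget of the job
table.set_budget("bnf.fr", 140_000)
table.set_budget("ina.fr", 139_000)
table.set_budget("cnc.fr", 139_500)
table.set_budget("toto.fr", 130_000)  # the "new domain/host" form field

for host, budget in table.listing():
    print(host, budget)
```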
19
Draft UI for « Change domains/hosts budget »
Mock-up elements: Status + job ID; Home
Budget defined in job configuration: queue-total-budget of 100 000 URIs.
Budgets of following domains/hosts have been changed in the current job:
– bnf.fr 140 000
– ina.fr 139 000
– cnc.fr 139 500
Save
New domain/host: toto.fr – 130 000
Save
20
Common curator operations
– Search crawl.log
– Add filter on current job (job configuration)
– Change domains/hosts budget (job configuration)
– View or delete frontier URIs
22
View or delete frontier URIs (NASC56 + NASC57 + NASC58)
Add a page containing 2 form fields:
– Regular expression:
– Show matches: 1000 (default # of matching URIs)
– Action A => Display URIs: displays the matching URIs and the # of matching URIs, and gives the possibility to view the next block of matching URIs
– Action B => Delete URIs: deletes the matching URIs and indicates the # of deleted URIs
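Actions A and B above can be sketched as two small functions. This is a simplified model in which the frontier is a plain Python list; the real Heritrix frontier is a persistent queue structure, so the function names and the list-based storage here are assumptions made only to show the paging and counting behaviour.

```python
import re
from typing import List, Tuple


def view_frontier(frontier: List[str], pattern: str,
                  offset: int = 0, limit: int = 1000) -> Tuple[List[str], int]:
    """Action A: return one block of matching URIs and the total match count."""
    regex = re.compile(pattern)
    matching = [uri for uri in frontier if regex.search(uri)]
    return matching[offset:offset + limit], len(matching)


def delete_from_frontier(frontier: List[str], pattern: str) -> int:
    """Action B: remove matching URIs in place and return how many were removed."""
    regex = re.compile(pattern)
    before = len(frontier)
    frontier[:] = [uri for uri in frontier if not regex.search(uri)]
    return before - len(frontier)


# Made-up frontier contents for the example.
frontier = [
    "http://example.org/calendar/1",
    "http://example.org/about",
    "http://example.org/calendar/2",
    "http://example.org/calendar/3",
]

page, total = view_frontier(frontier, r"/calendar/", offset=0, limit=2)
next_page, _ = view_frontier(frontier, r"/calendar/", offset=2, limit=2)
deleted = delete_from_frontier(frontier, r"/calendar/")
print(page, total, next_page, deleted, frontier)
```

Viewing the next block is just a second call with a larger `offset`, which corresponds to the "displaying 1-1000 out of 12345" counter in the draft UI.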
23
Draft UI for « View or delete frontier URIs »
Mock-up elements: Status + job ID; Home; Matching lines: 1000; URIs: displaying 1-1000 out of 12345; notice "Pause the job first to view frontier"
24
search Job configuration add filter – change budget
25
Comparison with BAnQ