RoMEO and CRIS: Technical Issues & Efficiency Tips
Peter Millington, Centre for Research Communications, University of Nottingham
RoMEO and CRIS in Practice, Birmingham, 1st April 2011
Outline
- Patterns of usage
- Do we have a crisis?
- Approaches to using RoMEO in CRIS
  - Real time queries
  - Caching and reusing RoMEO query results
- Rates of change – Reality Check
  - And their implications
- Other efficiency tips
Usage of Interactive RoMEO
[Usage charts not reproduced]
- Similar curve shapes for other measures
- Distinct weekly pattern
  - ~4,500 page views per day
  - ~1,000 visits per day
  - ~700 unique visitors per day
- Seems to be a stable seasonal pattern
Usage of the RoMEO API
[Usage charts not reproduced: All Users; Requests]
- Much more variable pattern
  - Weekly cycle of visits less distinct
  - Number of requests very highly variable
- More usage by fewer users
  - ~60 unique visitors per day
  - Over 250,000 hits per day (>50 times interactive)
- Significant growth
  - Steady growth in number of API users
  - Rapid growth in number of requests
Do we have a Crisis?
- Do you ever think RoMEO is slow?
- Most API usage is by CRIS-like applications
- How can we improve things?
  - Higher capacity server? Funding? Unnecessary?
  - Improve efficiency? Optimise the API? More efficient usage?
  - Put a cap on number of requests per day? What level? 1000? 2000?
  - Block commercial software users? N.B. Creative Commons License
API approaches in CRIS applications
- Real time requests when displaying data
  - Acceptable for individual article displays
  - Latency too high for lists of articles
- Caching RoMEO data for rapid local re-use
  - Initial (bulk) checks against RoMEO
  - Store the results locally
  - Periodically recheck for updated policies, either:
    - Whole bibliography, or
    - Additions and updates only
Real Time Usage Pattern
[Usage chart not reproduced]
- Levels vary day by day
  - Arguably high usage for one installation
- Occasional peaks
  - Special system jobs
  - Special end user projects
Caching with Monthly Updates
[Usage chart not reproduced]
- Rechecking the whole database each cycle
  - Seems to take three days. Low priority setting?
- Scheduled job – starts 1st of the month
  - Could it be a weekend instead? Faster. Less intrusive.
- What is being checked?
  - Each reference?
  - Groups of records for each journal title?
- What about additions between cycles?
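The "what is being checked?" question matters because many references share one journal, so checking per journal needs far fewer requests than checking per reference. A minimal sketch of that grouping, assuming each bibliographic record is a dict with hypothetical `id` and `issn` keys (the sample ISSNs are made up):

```python
from collections import defaultdict

def group_by_issn(records):
    """Map each normalised ISSN to the IDs of the records that cite it."""
    groups = defaultdict(list)
    for rec in records:
        issn = (rec.get("issn") or "").strip().upper()
        if issn:  # records without an ISSN would need a separate title lookup
            groups[issn].append(rec["id"])
    return dict(groups)

# Made-up sample bibliography: four references, two distinct journals.
records = [
    {"id": 1, "issn": "1234-5678"},
    {"id": 2, "issn": "1234-5678"},
    {"id": 3, "issn": "8765-432x"},
    {"id": 4, "issn": None},
]
groups = group_by_issn(records)
print(len(records), "records ->", len(groups), "journal lookups")
```

One lookup per key of `groups` then serves every record listed under that key.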
Caching with Daily Updates (1)
[Usage charts not reproduced]
- Whole database checked every day
  - Institutions can easily have lists of 50,000 items!
  - Lists constantly growing, slowing things down
- What is being checked?
  - Each reference? Probably
- Additions and updates between checks?
  - No accuracy problems
- Sledgehammer to crack a nut
Is the nut cracking the sledgehammer?
Caching with Daily Updates (2)
[Usage chart not reproduced]
- Note the logarithmic scale
- Large initial check of the whole database
- Daily check of added & changed items only
- Welcome low loading on the API
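The "added & changed items only" pattern can be sketched as a filter on a last-modified timestamp. This is an illustration only: the `modified` field and the record layout are assumptions, not part of any particular CRIS.

```python
from datetime import datetime

def changed_since(records, last_check):
    """Return IDs of records added or modified after the previous check."""
    return [r["id"] for r in records if r["modified"] > last_check]

# Made-up sample data: only two of the three records changed since the last run.
records = [
    {"id": "a1", "modified": datetime(2011, 3, 1, 9, 0)},
    {"id": "a2", "modified": datetime(2011, 3, 30, 14, 30)},
    {"id": "a3", "modified": datetime(2011, 3, 31, 8, 15)},
]
last_check = datetime(2011, 3, 30, 0, 0)
todo = changed_since(records, last_check)
print(todo)
```

Only the IDs in `todo` would be sent to the API on the daily run; the rest keep their cached result.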
Rates of Change – Reality Check
- Institutional Bibliographies
  - Up to 2,000 additions per year (<40 per week)
  - Few bibliographic changes after initial QA
- RoMEO Publishers' Policies
  - c.25 additions or substantive changes per week
- Journal - Publisher Correlations
  - Change of publisher - infrequent - mostly January
  - Bulk changes - business take-over or name change
- Expiry of archiving embargoes
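To put these rates next to the daily whole-database approach, here is a rough worked comparison. It uses the 2,000 additions per year figure above and the 50,000-item list size quoted earlier, and assumes (my assumption, not stated in the slides) one API request per item checked.

```python
ADDITIONS_PER_YEAR = 2_000   # upper figure for an institutional bibliography
WEEKS_PER_YEAR = 52
LIST_SIZE = 50_000           # list size quoted for the daily whole-database check

# New items that genuinely need a first check each week.
additions_per_week = ADDITIONS_PER_YEAR / WEEKS_PER_YEAR

# Requests generated by rechecking the whole list every day for a week,
# assuming one request per item.
full_recheck_requests_per_week = LIST_SIZE * 7

# How many times more requests the full recheck sends than the additions need.
ratio = full_recheck_requests_per_week * WEEKS_PER_YEAR // ADDITIONS_PER_YEAR

print(f"{additions_per_week:.1f} additions per week")
print(f"{full_recheck_requests_per_week:,} requests per week for daily full rechecks")
print(f"roughly {ratio:,} requests per genuinely new item")
```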
RoMEO Implications of Change Rates
- Institutional Bibliographies
  - Only need to check additions & changes
  - Weekly check probably sufficient, or on first use
- RoMEO Publishers' Policies
  - Recheck when the RoMEO record changes
  - Store RoMEO ID with article/journal for bulk updates
- Journal - Publisher Correlations
  - Full recheck annually on rolling cycle
  - Specific rechecks for known business/name changes
- Expiry of archiving embargoes
  - Scope for improvement in RoMEO
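Storing the RoMEO ID with each cached journal is what makes a bulk update cheap: when one publisher's policy changes, every linked journal can be updated locally without further API calls. A minimal sketch, assuming a hypothetical cache of dicts with `issn`, `romeo_id` and `colour` keys (the IDs and ISSNs are made up):

```python
def apply_policy_change(journals, romeo_id, new_colour):
    """Update every cached journal linked to one RoMEO publisher ID.

    Returns the ISSNs of the journals that were touched.
    """
    changed = []
    for journal in journals:
        if journal["romeo_id"] == romeo_id:
            journal["colour"] = new_colour
            changed.append(journal["issn"])
    return changed

# Made-up cache: two journals from publisher 101, one from publisher 202.
cache = [
    {"issn": "1111-1111", "romeo_id": 101, "colour": "yellow"},
    {"issn": "2222-2222", "romeo_id": 101, "colour": "yellow"},
    {"issn": "3333-3333", "romeo_id": 202, "colour": "white"},
]
touched = apply_policy_change(cache, romeo_id=101, new_colour="green")
print(touched)
```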
Caching of RoMEO Publisher Data
- Download the whole database with "?all=yes"
  - Relatively fast
  - Download as often as you wish
  - Suggest weekly
- And/Or…
- Store key RoMEO data with bibliographic records
- Provide links to interactive RoMEO
  - Full publisher records using RoMEO ID, or
  - Journal level data using ISSN
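A sketch of the weekly full download, under stated assumptions: only the `all=yes` parameter comes from the slide, `API_ENDPOINT` is a placeholder to be replaced with the real API address from the RoMEO documentation, and the seven-day threshold simply encodes the "suggest weekly" advice. No request is actually sent here.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode

API_ENDPOINT = "https://example.org/romeo/api"  # placeholder, not the real endpoint

def full_download_url(endpoint):
    """Build the URL that asks for the whole publisher database."""
    return endpoint + "?" + urlencode({"all": "yes"})

def is_stale(downloaded_at, now, max_age=timedelta(days=7)):
    """True when the local copy is at least max_age old and should be refreshed."""
    return now - downloaded_at >= max_age

url = full_download_url(API_ENDPOINT)
refresh = is_stale(datetime(2011, 3, 25), now=datetime(2011, 4, 1))
print(url, refresh)
```

A scheduled job could call `is_stale` first and fetch `url` only when it returns true.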
Caching Journal-level Data
- Schema/Organisation
  - Per journal (efficient)
  - Per article (probably inefficient)
- Fields
  - Journal title
  - ISSN and ESSN
  - RoMEO Persistent Publisher ID
  - RoMEO Colour and/or version-specific permissions
    - Normal, i.e. at the time of publication
    - Adjusted after the completion of any embargo period
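One possible shape for a per-journal cache record covering the fields listed above. The class and field names are illustrative choices, not a RoMEO schema, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JournalPolicy:
    """One cached row per journal, shared by all of that journal's articles."""
    title: str
    issn: Optional[str]
    essn: Optional[str]
    romeo_publisher_id: Optional[int]          # RoMEO persistent publisher ID
    colour: Optional[str]                      # normal: at the time of publication
    colour_after_embargo: Optional[str] = None # adjusted once any embargo ends

# Made-up example, keyed by ISSN so every article in the journal reuses it.
entry = JournalPolicy(
    title="Journal of Examples",
    issn="1234-5678",
    essn=None,
    romeo_publisher_id=101,
    colour="yellow",
    colour_after_embargo="green",
)
cache = {entry.issn: entry}
print(cache["1234-5678"].colour_after_embargo)
```

Version-specific permissions could be added as further fields in the same way.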
Most Efficient RoMEO Queries
- Journals
  - ISSN/ESSN or exact title
    - Unique or far fewer results, so faster
    - May avoid the overhead of needing to search Zetoc
- Publishers
  - RoMEO ID
    - Unique result. It gets no faster.
  - Exact publisher name
    - May sometimes find multiple results.
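The preference order above can be sketched as a pair of helpers that pick the most specific query available. The parameter names used here (`issn`, `jtitle`, `qtype`, `id`, `pub`) are assumptions based on my recollection of the RoMEO API and should be checked against the API documentation before use.

```python
def journal_query(issn=None, essn=None, title=None):
    """Prefer ISSN/ESSN; fall back to an exact title match."""
    if issn or essn:
        return {"issn": issn or essn}
    if title:
        return {"jtitle": title, "qtype": "exact"}
    raise ValueError("need an ISSN, ESSN or journal title")

def publisher_query(romeo_id=None, name=None):
    """Prefer the RoMEO ID (unique result); fall back to an exact name match."""
    if romeo_id is not None:
        return {"id": str(romeo_id)}
    if name:
        return {"pub": name, "qtype": "exact"}
    raise ValueError("need a RoMEO ID or publisher name")

# Made-up inputs for illustration.
print(journal_query(issn="1234-5678", title="Journal of Examples"))
print(publisher_query(name="Example Press"))
```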
What to do with failed requests?
- Don't just keep rechecking!
- Not a journal article?
  - Outside RoMEO's scope. Prevent rechecking
- Data error (e.g. typo, bad abbreviation)?
  - Correct the source data, then recheck
- No publisher or no policy in RoMEO?
  - Feedback to RoMEO, if important
  - Recheck infrequently, say annually or quarterly
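These rules amount to a small decision table. A minimal sketch, in which the reason labels and the annual interval are illustrative choices (quarterly would fit the advice equally well):

```python
from datetime import datetime, timedelta

def next_recheck(reason, failed_at, source_corrected=False):
    """Return when a failed lookup may be retried, or None for 'do not retry'."""
    if reason == "out_of_scope":      # not a journal article: never recheck
        return None
    if reason == "data_error":        # retry only after the source data is fixed
        return failed_at if source_corrected else None
    if reason == "no_policy":         # no publisher or policy yet: recheck rarely
        return failed_at + timedelta(days=365)
    raise ValueError(f"unknown failure reason: {reason}")

failed = datetime(2011, 4, 1)
print(next_recheck("out_of_scope", failed))
print(next_recheck("data_error", failed, source_corrected=True))
print(next_recheck("no_policy", failed))
```

Storing the returned date with the record lets a scheduled job skip anything not yet due.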
Any Questions?
RoMEO: http://www.sherpa.ac.uk/romeo
API: http://www.sherpa.ac.uk/romeo/api
Blog: http://romeoblog.jiscinvolve.org
E-mail: romeo@sherpa.ac.uk
Twitter: @SHERPAServices
Peter Millington: peter.millington@nottingham.ac.uk, 0115 84 68481