IWIR-CRIS '06

Data retrieval in PURE
Data retrieval in the 4-year-old PURE CRIS project at 9 universities

atira, Niels Jernes Vej 10, DK-9220 Aalborg

Agenda
■ Overview
■ Retrieval
    ■ Validated manual data gathering
    ■ Dynamic integration to local back-end systems
    ■ Aggregation, enrichment and import of historic data
    ■ Experiments with automated imports of historic data
■ Exposure
    ■ Two web services
    ■ OAI
    ■ Z39.50
    ■ Reports
    ■ Portal framework
■ Archiving
■ Near future

Overview: Brief overview
■ … in order to discuss ingestion, integration, conversion and import in a specific context

Overview: Brief overview
■ History
    ■ Development began in 2002
■ Users
    ■ 9 universities (DK+SE), several hospitals + other research institutions
■ Platform and architecture
    ■ J2EE enterprise application
    ■ Release management: all users have instances of the same release version, same code base
■ Business model
    ■ Commercial software licenses, powerful user group, shared budgets
■ Modular
    ■ Basic module, Reporting module, Student thesis module, External publications module, Bibliometrics module, Press module

Retrieval: Manual data gathering
■ User roles/rights + workflow:
    ■ = decentralized data gathering
    ■ = validated data gathering
    ■ = continuous data gathering
■ GUI example
■ Management focus is necessary
    ■ Reports and statistics, KPI management, etc.
■ Adding value to researchers is necessary
    ■ Instantly in Google indexes, instantly updated personal websites, instantly updated CV, increased citations (source in paper), etc.

Retrieval: Dynamic integration
■ Dynamic integration to local back-end systems:
    ■ Personnel systems, payroll systems (for data retrieval)
    ■ LDAPs, Active Directories (for data retrieval + authentication)
    ■ Single sign-on systems (for authentication)
    ■ … to automatically create object types such as “person” or “organization”
■ … and yes, PURE hosts data, too
    ■ We need complete objects according to the meta-data model
■ Plug-in architecture in PURE:
    ■ Pro = individually adapted integration
    ■ Con = individually programmed plug-in necessary
    ■ Future = GUI, standardized plug-ins
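
A minimal sketch of what a plug-in contract for back-end integration could look like. PURE itself is a J2EE application and its real plug-in API is not shown in these slides; `BackendPlugin`, `PayrollCsvPlugin`, `Person` and `synchronize` are hypothetical names used only to illustrate "one individually programmed plug-in per local system".

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical minimal "person" object; a real meta-data model is richer.
    employee_id: str
    name: str
    organization: str


class BackendPlugin(ABC):
    """Hypothetical contract: one subclass per local back-end system."""

    @abstractmethod
    def fetch_persons(self) -> list[Person]:
        ...


class PayrollCsvPlugin(BackendPlugin):
    """Illustrative plug-in that maps rows exported from a payroll system."""

    def __init__(self, rows: list[dict[str, str]]):
        self.rows = rows

    def fetch_persons(self) -> list[Person]:
        return [Person(r["id"], r["name"], r["dept"]) for r in self.rows]


def synchronize(plugins: list[BackendPlugin]) -> dict[str, Person]:
    # Later plug-ins overwrite earlier ones for the same employee id.
    persons: dict[str, Person] = {}
    for plugin in plugins:
        for person in plugin.fetch_persons():
            persons[person.employee_id] = person
    return persons
```

The "Con" on the slide shows up here as the per-system subclass; a standardized plug-in would replace the hand-written mapping with configuration.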

Retrieval: Import
■ Historic data
■ Many sources
    ■ More or less useful data
    ■ More or less consistent use of formats :-)
■ The PXA format
    ■ PURE XML Archive format, .zip based
    ■ Meta-data, relations between entities, binary files
■ Aggregation > enrichment > conversion > import
    ■ The process is external to PURE
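
A rough sketch of a .zip-based archive that carries meta-data, relations and binary files together. The actual PXA schema is not given in the slides, so the file names (`metadata.xml`, `files/`) and the XML element names are assumptions for illustration only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_archive(records, relations, binaries):
    """Pack meta-data, relations and binary files into one in-memory .zip."""
    root = ET.Element("archive")  # hypothetical element names throughout
    for rec in records:
        obj = ET.SubElement(root, "object", id=rec["id"], type=rec["type"])
        ET.SubElement(obj, "title").text = rec["title"]
    for source, target, kind in relations:
        ET.SubElement(root, "relation", source=source, target=target, kind=kind)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # One XML document for meta-data and relations, binaries alongside it.
        archive.writestr("metadata.xml", ET.tostring(root, encoding="unicode"))
        for name, payload in binaries.items():
            archive.writestr(f"files/{name}", payload)
    return buffer.getvalue()
```

Keeping relations in the same XML document as the objects means a single archive can be imported without further lookups.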

Retrieval: Experiments
■ Experiments with automated imports of historic data from specific, identified sources
■ [source format] > PXA conversion > import > enrichment/validation
■ Very poor data quality demands the concept of “draft objects” in PURE
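
The "draft object" idea can be sketched as follows: every imported record is kept, but only records that pass validation are promoted out of draft status. The required fields and the names `ImportedObject`, `validate` and `import_records` are assumptions, not PURE's actual rules.

```python
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("title", "year", "authors")  # assumed, for illustration


@dataclass
class ImportedObject:
    data: dict
    status: str = "draft"
    problems: list = field(default_factory=list)


def validate(obj):
    # Record which required fields are missing or empty.
    obj.problems = [f for f in REQUIRED_FIELDS if not obj.data.get(f)]
    obj.status = "valid" if not obj.problems else "draft"
    return obj


def import_records(records):
    # Nothing is rejected: poor-quality records stay as drafts for enrichment.
    return [validate(ImportedObject(dict(r))) for r in records]
```

A draft keeps its list of problems, so a later manual enrichment step knows what to fix before re-validating.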

Exposure: Web services
■ RPC/encoded + document/literal
■ Rich libraries of methods
■ Including format-specific methods: APA, MLA, HARVARD, VANCOUVER and CBE
■ Free and near-instant adding of methods
■ WS code example (if time)

Exposure: OAI support
■ OAI-PMH data provider
■ OAI-PMH formats
    ■ DC
    ■ DDF-MXD (Danish national format)
    ■ SVEP (Swedish national format)
    ■ … more to come
■ Also used to harvest other PURE repositories for “external publications”
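
As an illustration of how a harvester addresses an OAI-PMH data provider, the sketch below builds a `ListRecords` request. The verb and the `metadataPrefix`, `from` and `set` arguments come from the OAI-PMH protocol; the base URL is a placeholder, and the prefixes a PURE instance uses for DDF-MXD or SVEP are not stated in the slides, so only the standard `oai_dc` prefix is used.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, not a real PURE installation.
print(list_records_url("https://pure.example.org/oai", from_date="2006-01-01"))
```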

Exposure: Z39.50
■ Enabling of searches in PURE from library systems
■ SRW/SRU
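
For the SRU side, a library system would typically issue a `searchRetrieve` request with a CQL query. The parameter names below follow SRU 1.1; the endpoint URL is a placeholder and `sru_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve request URL."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,  # CQL, e.g. a bare term or an index/relation/term triple
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"
```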

Exposure: Reports
■ PURE reporting module
■ GUI example

Exposure: Reference manager
■ Export of data to local Reference Manager installation
■ Using RM-formatted export file
■ Promotes registering to the repository rather than in RM
■ GUI example
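
The slide does not specify the export file layout. Assuming the tagged RIS format that Reference Manager imports, a minimal export of one record might look like this; `to_ris` and the record keys are hypothetical.

```python
def to_ris(record):
    """Serialize one publication record as a RIS entry (assumed export format)."""
    lines = [f"TY  - {record.get('type', 'JOUR')}"]  # reference type comes first
    for author in record.get("authors", []):
        lines.append(f"AU  - {author}")
    if "title" in record:
        lines.append(f"TI  - {record['title']}")
    if "year" in record:
        lines.append(f"PY  - {record['year']}")
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines) + "\n"


print(to_ris({"authors": ["Doe, J."], "title": "Example title", "year": 2006}))
```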

Exposure: Portal framework
■ PUREportal – free PURE-specific framework for custom development of research exhibition portals
■ Online example
■ Typical cost scenario € 20,000
■ Typical delivery time 1 month
■ Little need for requirements specification
■ Automatic PURE-API maintenance

Archiving: Data archiving – 2 levels
■ SQL environment
    ■ Meta-data and relations
    ■ Binary files just stored in server file system
■ FEDORA via connector (not PURE-specific, Open Source)
    ■ Facilitates:
        ■ Higher quality archival of binary files
        ■ Long term preservation in general
        ■ Adoption of PURE in institutions’ general FEDORA strategies

Near future: The near future regarding data retrieval
■ More automated imports using increasingly advanced converters
■ Automated data delivery (push and harvest) to:
    ■ Industry-specific search services (e.g. PubMed, Nordicom)
    ■ Documentary data collections (such as clinicaltrials.org), and national collections (such as DDF (DK), ForskDok (NO), etc.)
■ Temporary import objects
    ■ When imported data are not of sufficient quality to create valid objects
    ■ When data cannot be properly related to other objects upon import