Published by Agnes Chambers. Modified over 6 years ago.
1
aDORe: A Modular, Standards-based Digital Object Repository
Herbert Van de Sompel
Digital Library Research & Prototyping Team
Research Library, Los Alamos National Laboratory
2
aDORe repository architecture : an overview
3
context
Initial motivation:
- Undo tight integration between data and application
- Uniform approach for ingesting, storing, and disseminating LANL RL data collections
Bigger picture:
- Allow for multiple, parallel applications on top of stored content
- Create an environment that provides guarantees regarding long-term accessibility of stored content
4
context
Core characteristics of the aDORe architecture:
- Standards-based: XML, XML Schema, MPEG-21 Digital Item Declaration, MPEG-21 Digital Item Identification, MPEG-21 Digital Item Processing, OAI-PMH, NISO OpenURL Framework for Context-Sensitive Services, Internet Archive ARC file format, OAIS concepts
- Component-based, modular design
- Interaction between components is protocol-based
- Dynamic attachment of dissemination methods to stored content
Resulting benefits:
- future proof-ness
- ability to use off-the-shelf software
- ability to replace components while maintaining stability
- increase interoperability with information environment at large
- scale
5
core aDORe modules
Ingestion process:
- Representing Digital Objects using MPEG-21 DID
- Identification
- Storing Digital Objects: XMLtape & ARC files
Autonomous OAI-PMH Repositories
Repository Index
Identifier Locator
OAI-PMH Federator
OpenURL Gateway
6
overview of the aDORe architecture
[Figure: aDORe architecture with numbered components 1 to 7. Labels: Pre-Ingest, Ingest, the TechReport, A&I and FTXT repositories (each exposed via OAI-PMH), Repo Index, Identifier Resolver, OAI-PMH and OpenURL front-ends, MPEG-21 DIP Engine, Registry of transformations, Profile/Behavior Registry, DID with DIM, application, publishers; CNRI handle, JAVA, C.]
7
Pre-Ingest: data input from information provider
Data feeds from third parties:
- Delivered in various ways (http, ftp, OAI-PMH, ...)
- Many different formats
- Typically contain many assets in a single feed
- Assets are typically 'complex', i.e. they consist of multiple datastreams
8
Pre-Ingest: sample Digital Object
                            Type              MIME              identifier
Digital Object              scholarly paper   N/A               DOI
Constituent Datastream 1    metadata record   application/xml   PMID
Constituent Datastream 2    fulltext file     application/pdf   –
9
Ingest: representing Digital Objects using MPEG-21 DID
- Ingest process creates a Package per Digital Object
- The Package is an XML document compliant with the MPEG-21 Digital Item Declaration Language ~ DIDL document
- The DIDL document is the OAIS Archival Information Package in aDORe
- A new DIDL document is created when a new version of a previously ingested Digital Object is ingested
- The DIDL document typically contains:
  - By-Value: metadata (Digital Object & Ingest-related)
  - By-Reference: other constituent datastreams of the Digital Object
10
MPEG-21 DID - 1. Data Model abstract definitions + W3C XML Schema
DID entities + DIDL XML representation:
  a container     didl:Container
  an item         didl:Item
  a component     didl:Component
  a resource      didl:Resource
  a descriptor    didl:Descriptor
  …
remarks:
- we defined a DIDL profile for LANL repository
- we define a profile 'per collection'
- all profiles are fully DIDL compliant
11
MPEG-21 DID - Data Model + XML representation
12
MPEG-21 DID - Descriptors
secondary information pertaining to entities
MPEG-21 defined uses:
- identification information: MPEG-21 Part 3 : DII
- rights information: MPEG-21 Part 5 : REL / Part 4 : IPMP
- processing information: MPEG-21 Part 10 : DIP
community/application specific uses:
- cf. use of Descriptors in LANL profile
13
MPEG-21 DID - Descriptors - identifiers
MPEG-21 dii:Identifier

<didl:Item>
  <didl:Descriptor>
    <didl:Statement mimeType="text/xml; charset=UTF-8">
      <dii:Identifier xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS"> urn:isbn: </dii:Identifier>
    </didl:Statement>
  </didl:Descriptor>
  …
</didl:Item>
14
MPEG-21 DID - Descriptors - rights
MPEG-21 r:license

<didl:Item>
  …
  <didl:Descriptor>
    <didl:Statement mimeType="text/xml; charset=UTF-8">
      <r:license xmlns:r="urn:mpeg:mpeg21:2003:01-REL-R-NS">
        <!-- optionally, specific rights can be added here.-->
        <r:otherInfo>
          <dc:rights xmlns:dc="…">Copyright 2003; American Physical Society</dc:rights>
        </r:otherInfo>
      </r:license>
    </didl:Statement>
  </didl:Descriptor>
</didl:Item>
15
MPEG-21 DID - Descriptors - processing information
Content: MPEG-21 dip:ObjectType

<didl:Component>
  …
  <didl:Descriptor>
    <didl:Statement mimeType="text/xml; charset=UTF-8">
      <dip:ObjectType xmlns:dip="urn:mpeg:mpeg21:2002:01-DIP-NS"> urn:foobar:Argument</dip:ObjectType>
    </didl:Statement>
  </didl:Descriptor>
</didl:Component>

Processing Item: MPEG-21 dip:Argument

<didl:Item>
  …
  <didl:Descriptor>
    <didl:Statement mimeType="text/xml; charset=UTF-8">
      <dip:Argument xmlns:dip="urn:mpeg:mpeg21:2002:01-DIP-NS"> urn:foobar:Argument</dip:Argument>
    </didl:Statement>
  </didl:Descriptor>
  <didl:Resource> function PlayTrack() { } </didl:Resource>
</didl:Item>
16
profiling MPEG-21 DIDL for the aDORe architecture
questions:
- how to map datastreams of compound objects to the DID data model: local choices
- how to use Descriptors to meet the design goals of the repository and its associated applications: core aDORe characteristics
- how to convey a variety of non-core secondary information: local choices
17
Construction of DIDL documents in aDORe
- Each Digital Object is mapped to a top-level DIDL Item element. Constituent datastreams are provided in child elements of this top-level Item. An identifier of the Digital Object is expected at this level.
- A constituent datastream of a Digital Object is provided in a Component/Resource construct:
  - If the datastream has an identifier, the Component/Resource construct is embedded in a sub-Item of the top-level Item.
  - If the datastream has no identifier, the Component/Resource construct is a child of the top-level Item.
- The top-level Item is embedded in a Container element (transformations of DIDL documents).
- The top-level Item and its parent Container element are then embedded in the DIDL root element.
=> DIDL XML document == OAIS AIP that represents the Digital Object.
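The construction rules above can be sketched in a few lines of Python. This is a minimal illustration, not the aDORe ingest code: the DIDL namespace URI, the helper names, the example identifiers and the URLs are assumptions, and every datastream is included by reference for brevity (the slides place metadata by value).

```python
import xml.etree.ElementTree as ET

# The DII namespace appears on the slides; the DIDL namespace URI is an assumption.
DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"
DII_NS = "urn:mpeg:mpeg21:2002:01-DII-NS"
ET.register_namespace("didl", DIDL_NS)
ET.register_namespace("dii", DII_NS)


def _didl(tag):
    """Qualify a tag name with the DIDL namespace."""
    return f"{{{DIDL_NS}}}{tag}"


def add_identifier(parent, identifier):
    """Attach a Descriptor/Statement carrying a dii:Identifier to parent."""
    desc = ET.SubElement(parent, _didl("Descriptor"))
    stmt = ET.SubElement(desc, _didl("Statement"),
                         {"mimeType": "text/xml; charset=UTF-8"})
    ET.SubElement(stmt, f"{{{DII_NS}}}Identifier").text = identifier


def build_didl(object_id, datastreams):
    """datastreams: list of (identifier or None, mime_type, ref_url) tuples."""
    root = ET.Element(_didl("DIDL"))
    container = ET.SubElement(root, _didl("Container"))
    top_item = ET.SubElement(container, _didl("Item"))
    add_identifier(top_item, object_id)
    for identifier, mime_type, ref in datastreams:
        parent = top_item
        if identifier is not None:
            # Identified datastream: wrap Component/Resource in a sub-Item.
            parent = ET.SubElement(top_item, _didl("Item"))
            add_identifier(parent, identifier)
        component = ET.SubElement(parent, _didl("Component"))
        ET.SubElement(component, _didl("Resource"),
                      {"mimeType": mime_type, "ref": ref})
    return root


if __name__ == "__main__":
    # Hypothetical identifiers and URLs mirroring the sample Digital Object.
    doc = build_didl(
        "info:doi/10.0000/example",
        [("info:pmid/00000000", "application/xml", "http://example.org/meta.xml"),
         (None, "application/pdf", "http://example.org/paper.pdf")],
    )
    print(ET.tostring(doc, encoding="unicode"))
```

The metadata record has an identifier (PMID) and therefore lands in a sub-Item, while the unidentified fulltext file becomes a direct Component child of the top-level Item.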
19
Ingest: representing Digital Objects using MPEG-21 DID
[Figure: the Package (DIDL <Container>) wrapping the Digital Object]
20
Ingest: Identification (core)
[Figure: <Container>]
- Package Identifiers: @DIDid
- Content Identifiers: MPEG-21 DII
21
Ingest: DIDL Creation Dates (core)
[Figure: <Container>]
- @DIDcreated T15:42:16Z
22
Ingest: Formats (core)
[Figure: <Container>]
- dc.format - info:lanl-repo/pro/DID
- dc.format - info:lanl-repo/pro/pub
- dc.format info:lanl-repo/fmt/1
- dc.format info:lanl-repo/fmt/456
23
‘Formats’ as placeholder for dynamic behaviors

stored DID:

  …
  <didl:Descriptor>
    <didl:Statement>
      <dc:format> info:lanl-repo/fmt/1 </dc:format>
    </didl:Statement>
  </didl:Descriptor>

Profile/Behavior Registry: dynamic insertion of behaviors

disseminated DID, Content Item (MPEG-21 dip:ObjectType):

  <didl:Item>
    …
    <didl:Descriptor>
      <didl:Statement mimeType="text/xml; charset=UTF-8">
        <dip:ObjectType xmlns:dip="urn:mpeg:mpeg21:2002:01-DIP-NS"> urn:foobar:Argument</dip:ObjectType>
      </didl:Statement>
    </didl:Descriptor>
  </didl:Item>

disseminated DID, Processing Item (MPEG-21 dip:Argument):

  …
  <didl:Item>
    <didl:Descriptor>
      <didl:Statement mimeType="text/xml; charset=UTF-8">
        <dip:Argument xmlns:dip="urn:mpeg:mpeg21:2002:01-DIP-NS"> urn:foobar:Argument</dip:Argument>
      </didl:Statement>
    </didl:Descriptor>
    <didl:Resource> function PlayTrack() { } </didl:Resource>
  </didl:Item>
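To make the idea of a format identifier acting as a placeholder concrete, here is a toy Python sketch. It models a DID as a plain dictionary and the Profile/Behavior Registry as a lookup table; both representations, the function name and the registry contents are assumptions made for illustration, not the aDORe data structures.

```python
import copy

# Hypothetical registry: format identifier -> names of behaviors (methods)
# that may be attached at dissemination time.
BEHAVIOR_REGISTRY = {
    "info:lanl-repo/fmt/1": ["PlayTrack"],
}


def disseminate(stored_did, registry=BEHAVIOR_REGISTRY):
    """Return a copy of the stored DID with behaviors looked up per format.

    The stored DID only records its formats; behaviors are resolved when
    the DID is disseminated, so the registry can change without touching
    stored content.
    """
    disseminated = copy.deepcopy(stored_did)
    disseminated["behaviors"] = []
    for fmt in stored_did.get("formats", []):
        disseminated["behaviors"].extend(registry.get(fmt, []))
    return disseminated


if __name__ == "__main__":
    stored = {"id": "did-1", "formats": ["info:lanl-repo/fmt/1"]}
    print(disseminate(stored))
```

Because the lookup happens per request, adding an entry to the registry changes what every matching DID offers without re-ingesting anything.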
24
Ingest: Digests (core)
[Figure: <Container>, with W3C XML Signature shown three times]
25
Ingest: Bitstream Creation Dates (local)
[Figure: <Container>]
- dc.created T12:05:33Z
- dc.created T14:22:54Z
26
Ingest: Collection Membership (local)
[Figure: <Container>]
- dcterms.isPartOf info:sid/library.lanl.gov:Elsevier
27
Ingest: Rights Information (local)
[Figure: <Container>]
- dc.rights - textual statement
28
Ingest: Storing DIDL documents in XMLtapes & ARC files
File-based storage approach combines:
- XMLtapes: Valid XML file that concatenates multiple DIDL documents (all metadata & identifiers)
- Internet Archive ARC files: File that concatenates multiple bitstreams
- Connection XMLtapes & ARC files: Pointers from DIDL documents into ARC files
29
XMLTape: sequential storage of DIDs
XMLTape:
- XML wrapper for batch of DIDs
- index based on byte offset and byte count in XML file
DID content:
- inline XML (typically including descriptive metadata)
- secondary information
- pointers to bitstreams in ARC files
[Figure: an XMLTape as a sequence of DIDs, each listed with its DID-identifier and datestamp of creation.]
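The byte-offset indexing idea can be sketched as follows. This is a simplified illustration under stated assumptions: the wrapper element name, the in-memory buffer and the index layout are hypothetical and do not reproduce the actual XMLtape file format.

```python
import io


def write_tape(records):
    """Concatenate records into one XML wrapper and index them.

    records: iterable of (identifier, created, xml_bytes).
    Returns (tape_bytes, index), where index maps each identifier to its
    creation datestamp, byte offset and byte count within the tape.
    """
    tape = io.BytesIO()
    index = {}
    tape.write(b"<tape>\n")  # hypothetical wrapper element
    for identifier, created, payload in records:
        index[identifier] = {
            "created": created,
            "offset": tape.tell(),
            "count": len(payload),
        }
        tape.write(payload)
        tape.write(b"\n")
    tape.write(b"</tape>\n")
    return tape.getvalue(), index


def read_record(tape_bytes, index, identifier):
    """Fetch one record by slicing the tape, without parsing the whole file."""
    entry = index[identifier]
    return tape_bytes[entry["offset"]:entry["offset"] + entry["count"]]


if __name__ == "__main__":
    tape, index = write_tape([
        ("did-1", "2004-08-01T00:00:00Z", b"<DIDL>one</DIDL>"),
        ("did-2", "2004-08-02T00:00:00Z", b"<DIDL>two</DIDL>"),
    ])
    print(read_record(tape, index, "did-2"))
```

With a file on disk the same lookup would be a seek to the offset followed by a read of the byte count, which is what keeps retrieval cheap even for very large tapes.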
30
ARC files: sequential storage of bitstreams
ARC file:
- Internet Archive file format
- index (arc identifier) based on byte offset and byte count in ARC file
- content: bitstreams
[Figure: an ARC file as a sequence of resources.]
31
XMLtapes & ARC files
XMLtape Index (DID-created 1, DID-created 2, ... are indexed as well):
  DID-id 1    (Byte offset 1, Byte Count 1)
  DID-id 2    (Byte offset 2, Byte Count 2)
  DID-id 3    (Byte offset 3, Byte Count 3)
ARC Index:
  arc id 1    (Byte offset 1, Byte Count 1)
  arc id 2    (Byte offset 2, Byte Count 2)
  arc id 3    (Byte offset 3, Byte Count 3)
Pointers from a DID (e.g. DID-id 8) in the XMLtape to resources in the ARC file are OpenURLs.
33
Autonomous OAI-PMH Repositories
- OAI-PMH identifier = @DIDid
- OAI-PMH datestamp = @DIDcreated
- OAI-PMH response = DIDs
- OAI-PMH sets:
  - collection = dcterms.isPartOf
  - profile ~ Digital Format Identifier = dc.format
[Figure: Ingest writes XMLtapes (or other storage); each repository is exposed at its own baseURL: techReport at baseURL(1), A&I at baseURL(2), FTXT at baseURL(3).]
35
Repository Index: Registry of Autonomous OAI-PMH repositories
[Figure: the Repo Index, exposed at baseURL(index), registers baseURL(1), baseURL(2), baseURL(3) of the techReport, A&I, ... repositories.]
STEP 1: ListIdentifiers (OAI-PMH) on the Repository Index -> baseURL(1), baseURL(2), ...
STEP 2: ListRecords (OAI-PMH) on baseURL(1) -> List of DIDs
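The two-step harvest can be expressed as plain OAI-PMH request URLs. The sketch below only builds the URLs; the base URLs and the `didl` metadata prefix are assumptions, while `verb`, `ListIdentifiers`, `ListRecords` and `metadataPrefix` are standard OAI-PMH request parameters.

```python
from urllib.parse import urlencode


def oai_request(base_url, verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    return f"{base_url}?{urlencode({'verb': verb, **params})}"


def harvest_plan(index_base_url, repo_base_urls, metadata_prefix="didl"):
    """Return the STEP 1 URL and the list of STEP 2 URLs.

    STEP 1 asks the Repository Index which repositories exist.
    STEP 2 asks each registered repository for its DIDs.
    In a real harvester, repo_base_urls would come from the STEP 1 response.
    """
    step1 = oai_request(index_base_url, "ListIdentifiers",
                        metadataPrefix=metadata_prefix)
    step2 = [oai_request(url, "ListRecords", metadataPrefix=metadata_prefix)
             for url in repo_base_urls]
    return step1, step2


if __name__ == "__main__":
    first, second = harvest_plan(
        "http://example.org/index",  # hypothetical baseURL(index)
        ["http://example.org/repo1", "http://example.org/repo2"],
    )
    print(first)
    for url in second:
        print(url)
```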
37
Identifier Locator: Locating DIDL documents, Digital Objects, constituent datastreams
  identifier      repository
  DID-id 1        baseURL(1) & DID-id 1
  Content-id 1    baseURL(2) & DID-id x
  Content-id 2    baseURL(x) & DID-id y
  Content-id      baseURL(9) & DID-id p
IN: DID-id or content-id
OUT: baseURL & DID-id
[Figure: the Identifier Locator monitors the repositories registered in the Repo Index (techReport, A&I at baseURL(2), ...) and records identifier, datestamp and repository for each DID-id and Content-id.]
38
Identifier Locator
Columns: Identifier; Repository Location (baseURL, protocol); Repository Id; extension (XML ID)
Sample values: info:lanl-repo/i/UUID1, baseURL1, OAI-PMH; info:lanl-repo/opac/LANLb, UUID2; info:lanl-repo/tr/LA-9870, UUID3
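The lookup behavior of the Identifier Locator can be sketched as an in-memory table. The class, method names and example identifiers are hypothetical; the sketch only shows the contract described on the slides: a DID-id or a content-id goes in, a baseURL plus DID-id comes out.

```python
from typing import Dict, NamedTuple, Optional


class Location(NamedTuple):
    base_url: str  # OAI-PMH baseURL of the repository holding the DID
    did_id: str    # identifier of the DIDL document (package identifier)


class IdentifierLocator:
    """Maps package and content identifiers to the DID that contains them."""

    def __init__(self) -> None:
        self._table: Dict[str, Location] = {}

    def register_did(self, base_url, did_id, content_ids=()):
        """Record a DID and every content identifier found inside it."""
        location = Location(base_url, did_id)
        self._table[did_id] = location
        for content_id in content_ids:
            self._table[content_id] = location

    def locate(self, identifier) -> Optional[Location]:
        """Resolve a DID-id or content-id; None if the identifier is unknown."""
        return self._table.get(identifier)


if __name__ == "__main__":
    locator = IdentifierLocator()
    locator.register_did("http://example.org/repo1", "did-1",
                         content_ids=["content-1", "content-2"])
    print(locator.locate("content-2"))
```

A later version of a Digital Object would be registered under a new DID-id, so this simple table would point a content identifier at the most recently registered DID; keeping all versions would need a list per identifier.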
40
OAI-PMH Federator: Retrieve (batches of) OAIS DIPs
- OAIS Package level access ~ DIDL documents & transforms
- Key = Package Identifier
- OAI-PMH sets: baseURL, Collection, Format (e.g. set = baseURL(1), set = baseURL(2), set = baseURL(3))
- Disseminated formats: DID, METS, SCORM, …
[Figure: the OAI-PMH Federator sits in front of the techReport, A&I and FTXT repositories; a DID retrieved via OAI-PMH passes through the Profile/Behavior Registry (DID with PI), the MPEG-21 DIP Engine and the Registry of transformations before it is exposed.]
41
DIM Inserter: dynamic insertion of behaviors
42
OpenURL Resolver: Retrieve OAIS Result Sets
- OAIS Result Set level access: Digital Object, contained datastreams & services
- Key = Content Identifier or Key = Package Identifier
- OpenURL request carries: Requester, …, ServiceType, Referent
[Figure: the OpenURL Resolver sits in front of the techReport, A&I and FTXT repositories; a DID retrieved via OAI-PMH passes through the Profile/Behavior Registry (DID with PI), the MPEG-21 DIP Engine and the Registry of transformations, and transformed content is exposed.]
43
OpenURL-based disseminations
disseminate DIDs, contained datastreams and transforms thereof
  & rfr_id=info:sid/library.lanl.gov
  & url_ver=Z
  & rft_id=info:lanl-repo/biosis/PREV
  & svc_id=info:lanl-repo/svc/tomods.marc
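A dissemination request of this shape can be assembled with the standard library. In the sketch below the resolver address is hypothetical, and the parameter values are copied exactly as they appear on the slide, including the ones that arrived truncated; only the key names (rfr_id, url_ver, rft_id, svc_id) are taken as given.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_openurl(resolver, url_ver, rft_id, svc_id, rfr_id):
    """Build a dissemination request: which referent (rft_id), which
    service (svc_id), and who is asking (rfr_id)."""
    query = urlencode({
        "url_ver": url_ver,
        "rft_id": rft_id,
        "svc_id": svc_id,
        "rfr_id": rfr_id,
    })
    return f"{resolver}?{query}"


def parse_openurl(url):
    """Recover the key/value pairs on the resolver side."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


if __name__ == "__main__":
    url = build_openurl(
        "http://example.org/openurl",         # hypothetical resolver address
        url_ver="Z",                          # truncated on the slide
        rft_id="info:lanl-repo/biosis/PREV",  # truncated on the slide
        svc_id="info:lanl-repo/svc/tomods.marc",
        rfr_id="info:sid/library.lanl.gov",
    )
    print(url)
    print(parse_openurl(url))
```

Changing only `svc_id` requests a different dissemination of the same referent, which is how one stored object can serve several transformations.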
44
OAI-PMH Federator & OpenURL Resolver
aDORe front-end: OAI-PMH Federator
  Interface standard: OAI-PMH
  Identifier: Package Identifier
  OAIS Access Type: OAIS DIP
  # items in response: 1 or more
aDORe front-end: OpenURL Resolver
  Interface standard: NISO OpenURL
  Identifier: Content Identifier, Package Identifier (with XML ID fragment)
  OAIS Access Type: Result Set
  # items in response: 1
45
aDORe architecture : papers
- Using MPEG-21 DIDL to Represent Complex Digital Objects in LANL
- Using MPEG-21 DIP and NISO OpenURL for the Dynamic Dissemination of Complex Digital Objects in LANL
- The multi-faceted use of the OAI-PMH in the LANL Repository
- aDORe: a modular, standards-based Digital Object Repository arXiv:cs.DL/
46
aDORe architecture : conclusions
aDORe & scale:
- Modular nature, storage of DIDL documents in Autonomous OAI-PMH Repositories, storage of bitstreams in ARC files
- Dynamic binding of behaviors
- Create new DIDL document in case of updates
- First large-scale use of MPEG-21 technologies
aDORe & standards:
- Use off-the-shelf software
- Migration to other implementations without major disruptions
- When new generation standards emerge, probably/hopefully migration tools will be available
aDORe & protocols:
- Distributed implementation (cf. Federation of Institutional Repositories)
- Novel use of OpenURL: contextual capabilities, generic DL front-end
47
aDORe architecture : conclusions
- In production since 08/2004
- Currently 30,000,000 DIDL documents
- Various downstream applications harvesting from aDORe (search engines, de-duplication component)
- New version ~ Summer 2005
48
Dynamic de-duplication of bibliographic information
49
LANL De-duplication Problem
LANL Research Library locally hosts a large data collection:
- A&I databases: ISI Citation Databases, Inspec, BIOSIS, Engineering Index, …
- Full-text collections: Elsevier, Wiley, APS, IOP, …
Duplicates in LANL data collection:
- amongst bibliographic records
- between bibliographic records and citations
- amongst citations
De-duplication need:
- join records from several databases that describe the same work
- find works that cite a given work
50
                        Bibliographic Items    Citation Items
Biosis                  13,947,365             -
Inspec                  7,510,299
Engineering Index       5,241,479
ISI Science             25,453,618             414,983,407
ISI Arts & Humanities   3,012,800              20,856,114
ISI Social Sciences     3,738,926              53,915,890
Total                   58,904,487             489,755,411
Annual Growth           ~ 2,500,000            ~ 26,000,000
51
DAVIS BJ ANN NY ACAD SCI
DAVIS BJ ANN NY ACAD SCI
DAVIS BJ ANN NY ACAD SCI
DAVIS BJ ANN NY ACAD SCI
DAVIS BJ ANN NEY YORK ACAD SC 1964 ___
DAVIS BJ ANN NY ACAD SCI
CLARK BJ ANN N Y ACAD SCI
DALLNER BJ ANN NY ACAD SCI
DAVIES BJ ANNALS NY ACAD SCI
52
Current LANL De-duplication Approach
Strategy:
- Batch processing
- Bibliographic key matching
- Complex heuristics
Issues:
- Extensive processing time
- Scalability problem in light of growing data collection
- Revision of heuristics requires reprocessing of collection
Explore alternative:
- On-the-fly de-duplication
- De-duplication approach that is appropriate for citation matching
- Flexibility regarding revision of matching approach
53
Netrics Software
Netrics in the literature:
- C. Lee Giles, Steve Lawrence, Kurt D. Bollacker. CiteSeer: an automatic citation indexing system. International Conference on Digital Libraries. Proceedings of the third ACM conference on Digital libraries, Pittsburgh, Pennsylvania. Pages: 89 – 98. DOI /
- C. Lee Giles, Steve Lawrence, Kurt D. Bollacker. Autonomous Citation Matching. International Conference on Autonomous Agents. Proceedings of the third annual conference on Autonomous Agents, Seattle, Washington. Pages: 392 – 393. DOI /
- Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. Symposium on Discrete Algorithms. Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, Austin, Texas. Pages: 311 – 321.
- Various papers at
54
Netrics Software
Netrics properties:
- Forgiving with respect to errors in dataset
- Forgiving with respect to errors in query
- Compares strings like humans do
- Response can be optimized for specific datasets: machine-learning module
- Performance scales well with growing dataset
- RAM-based index
55
De-duplication component: database setup
bibliographic database:
- indexed key: aulast auinit – stitle – year – volume – issue – spage – epage
- value: identifiers of bibliographic records with given key
- example: DAVIS BJ - ANN NY ACAD SCI – 121 – A2 – || info:lanl-repo/biosis/PREV
citation database:
- indexed key: aulast auinit – stitle – year – volume – spage
- value: identifiers of bibliographic records in which citation is found
- example: DAVIS BJ - ANN NY ACAD SCI – || info:lanl-repo/isi/A #10 ; info:lanl-repo/isi/A #3
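The key layout described above can be sketched as two small functions plus a lookup table. The normalization rules (upper-casing, whitespace collapsing), the " - " separator, the class name and the record identifiers are assumptions for illustration; the slides specify only which fields make up each key.

```python
from collections import defaultdict


def normalize(value):
    """Upper-case and collapse whitespace; missing fields become empty strings."""
    return "" if value is None else " ".join(str(value).upper().split())


def bib_key(aulast, auinit, stitle, year, volume, issue, spage, epage):
    """Key for a bibliographic record (author, short title, year, volume, issue, pages)."""
    fields = [f"{aulast} {auinit}", stitle, year, volume, issue, spage, epage]
    return " - ".join(normalize(field) for field in fields)


def cit_key(aulast, auinit, stitle, year, volume, spage):
    """Key for a citation, which typically carries fewer fields."""
    fields = [f"{aulast} {auinit}", stitle, year, volume, spage]
    return " - ".join(normalize(field) for field in fields)


class KeyDatabase:
    """Maps a key to the identifiers of the records that carry it."""

    def __init__(self):
        self._records = defaultdict(list)

    def add(self, key, record_id):
        self._records[key].append(record_id)

    def lookup(self, key):
        return list(self._records.get(key, []))


if __name__ == "__main__":
    bibliographic = KeyDatabase()
    # Page numbers and record identifiers below are made up for the example.
    key = bib_key("Davis", "BJ", "Ann NY Acad Sci", 1964, 121, "A2", 404, 427)
    bibliographic.add(key, "record-1")
    print(key, "||", bibliographic.lookup(key))
```

This exact-match table covers only the storage side; tolerating misspelled keys is the job of the approximate matching engine.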
56
De-duplication component: database setup
bibliographic database:
- IN: bib key, cit key, or bib id
- OUT: list (matching bib key, bib id)
citation database:
- IN: bib key or cit key
- OUT: list (citing bib id)
57
Query: OTT HR – PHYS REV LETT – 1983 – 50 - 1595
Response: a list of keys, each with a likelihood; the client application decides on the cut-off point
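The query/response pattern (ranked keys with a score, cut-off chosen by the client) can be imitated with Python's standard library. Note the substitution: `difflib.SequenceMatcher` is used here purely as a stand-in similarity measure; it is not the Netrics engine and will not reproduce its rankings. The function names and the cut-off value are likewise assumptions.

```python
from difflib import SequenceMatcher


def rank_keys(query, keys):
    """Score every stored key against the query, best match first.

    The score is difflib's similarity ratio in [0, 1], used here as a
    rough proxy for a match likelihood.
    """
    scored = [(SequenceMatcher(None, query, key).ratio(), key) for key in keys]
    return sorted(scored, reverse=True)


def apply_cutoff(ranked, cutoff):
    """Client-side decision: keep only keys scoring at or above the cut-off."""
    return [key for score, key in ranked if score >= cutoff]


if __name__ == "__main__":
    stored_keys = [
        "DAVIS BJ ANN NY ACAD SCI",
        "DAVIS BJ ANN NEY YORK ACAD SC",
        "CLARK BJ ANN N Y ACAD SCI",
        "DALLNER BJ ANN NY ACAD SCI",
        "DAVIES BJ ANNALS NY ACAD SCI",
    ]
    ranked = rank_keys("DAVIS BJ ANN NY ACAD SCI", stored_keys)
    for score, key in ranked:
        print(f"{score:.2f}  {key}")
    print(apply_cutoff(ranked, cutoff=0.9))  # 0.9 is an arbitrary example value
```

Leaving the cut-off to the client means a strict application (joining records) and a lenient one (suggesting candidate citations) can share the same index.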
58
De-duplication component: populating the database
[Figure: a Harvester retrieves MPEG-21 DID XML documents via OAI-PMH from the OAI-PMH Federator, which exposes the ISI 1, ISI 2 and BIOSIS repositories, and loads them into the Netrics bibliographic and citation databases.]
59
Repository crawling
61
Repository crawling
[Figure: a crawler feeds Nutch Search. Its seed list consists of identifiers of bib records, turned into (Open)URLs pointing at bib records in the LANL repository (ISI 1, ISI 2). The OpenURL DIP Engine returns XHTML pages with biblio info and links for citation 1, citation 2, citation 3 (e.g. bib 13, bib 26).]
62
[Figure: PageRank over the crawled XHTML pages (biblio info, citation links), with nodes numbered 1 to 5.]