FEDORA Rathachai Chawuthai Information Management CSIM / AIT Repository Issued document 1.0
Outline: Overview, Data Model, Services, Architecture, Fedora in Use, Challenges
Flexible Extensible Digital Object Repository Architecture (FEDORA)
A system that serves as a digital content repository for a wide variety of uses
– E.g. institutional repository, digital archive, content management system, scholarly publishing enterprise, and digital library.
Sponsored by Fedora Commons
– A non-profit organization that provides the product for free
Source: fedora-commons.org, wikipedia.org
Fedora has a core component that is responsible for durable storage of, and access to, digital content.
Fedora can run either as a stand-alone server or as a component of another system.
A complete repository solution needs third-party components to add features
– E.g. authoring, search engine, workflow management, security component
Source: fedora-commons.org
Fedora’s Digital Object Model
– The model can be applied to many domains and digital object types.
Distributed Repositories
– It can integrate access to digital resources from other repositories.
– It provides an interface that multiple repositories can access.
Preservation & Archiving
– Represents preservation objects in XML format
– Offers content versioning
– Can define object-to-object relationships
– Logs an event history for every change to the object
Web Service
– Provides SOAP and REST access
Easy to integrate with other applications and systems
– It can function as a generic repository that other applications or systems build on.
Source: fedora-commons.org
Stores all types of content and its metadata
– Digital content of any type can be managed and maintained
– Metadata about content in any format can be managed and maintained
Scales to millions of objects
Access to data via Web APIs (REST/SOAP)
RDF search (SPARQL)
Rebuilder Utility (for disaster recovery and data migration)
– The entire repository can be rebuilt from the digital object and content files.
Content Model Architecture (defines "types" of objects by their content)
Many storage options (database and file systems)
JMS messaging (your apps can "listen" to repository events)
Web-based Administrator GUI (low-level object editing)
OAI-PMH Provider Service
GSearch (full-text) Search Service
Multiple, customer-driven front ends
Source: fedora-commons.org
Archival Repository Systems
Fedora – A flexible system based on the fundamentals of SOA. It can preserve and provide access to any type of digital content. However, it does not provide a rich user interface.
DSpace – A complex system that covers all repository functions and focuses mainly on user experience. However, it is not a flexible system.
EPrints – A system used primarily for scientific publications (digital objects are not modified very often). It commonly provides the features of a document management system.
Source: wikipedia.org
Criteria
User Interface – Provides complete user interfaces that support all types of end users
System Security – Provides authentication and authorization for access to information and functions
RDF searchable – Provides a service for semantic search over RDF
Customizable metadata formats – Allows users to work with many metadata formats
Flexibility of the system – Has a flexible architecture that allows many alternative behaviors to be built
SOA principle – Exposes services according to the principles of Service Oriented Architecture
Process based approach – Supports the document management process
Support Preservation Strategies – Supports preservation activities for digital information based on the OAIS reference model
Result
[Table: Fedora (F), DSpace (D), and EPrints (E) compared on User Interface, System Security, RDF searchable, Customizable metadata formats, Flexibility of the system, SOA principle, Process based approach, and Support Preservation Strategies]
Source: Michal, 2010
Fedora Object XML (FOXML)
FOXML is the XML format that describes a Fedora digital object and is stored with the object.
The Fedora repository requires FOXML.
Digital Object Identifier: a unique, persistent identifier for the digital object.
System Properties: a set of system-defined descriptive properties that are necessary to manage and track the object in the repository.
– Object Properties describe the object’s type, its state, the content model to which it subscribes, the created and last-modified dates of the object, and its label.
Datastream(s): the element in a Fedora digital object that represents a content item.
Source: fedora-commons.org
The content of the System Properties is generated by the system.
[Figure: example of system properties]
A datastream is an element of a Fedora object.
A Fedora object can have one or more datastreams
– (A datastream can be treated as a bitstream)
The Fedora model supports versioning of datastreams
– Every change to a datastream creates a new version
Source: fedora-commons.org
Datastreams can be:
– A digital resource that needs to be preserved
– Metadata about the digital resource, e.g. Dublin Core (DC), METS, PREMIS, etc.
More:
– DC is a default datastream of the Fedora system
– AUDIT records changes to the digital object. AUDIT content is controlled by the system (not editable by humans)
– RELS-EXT holds relationships between digital objects
Source: fedora-commons.org
An easy way to picture it
– Fedora repository as a warehouse
– Fedora digital object as a cabinet
– Datastream as a drawer
Digital Object Identifier
Keeps the persistent ID of an object, called the PID (e.g. ID:38493).
Object Property
A set of system-defined descriptive properties that are necessary to manage and track the object in the repository. It is controlled by the system.
Datastream Dublin Core (DC)
A reserved datastream (key object metadata).
DC content is generated for Fedora objects by default. However, an administrator can extend it by inserting DC information into the stream.
Example values: Example 1, demo:555
Datastream AUDIT
A reserved datastream (key object metadata).
Records every change to the digital object. The record is controlled by the system.
Example (audit:record, audit:process): addDatastream, HTML, fedoraAdmin, T16:47:12.679Z
Datastream Relationship-External (RELS-EXT)
A reserved datastream (key object metadata).
Defines relationships with other digital objects in RDF syntax.
Example:
<foxml:datastreamVersion ID="RELS-EXT1.0" LABEL="RDF Statements about this object" CREATED=" T05:09:44.406Z" MIMETYPE="application/rdf+xml" > <rdf:RDF xmlns:fedora-model="info:fedora/fedora-system:def/model#" xmlns:rdf="
Datastream Content, e.g. IMG
Users can add the content that needs to be preserved. In this case the user needs to preserve an image file, so they upload it to the system and name the datastream themselves (e.g. IMG).
Datastream Content, e.g. Metadata “PREMIS”
A datastream can hold an external metadata schema, e.g. PREMIS. The user defines the PREMIS content, adds it as a datastream, and names the datastream, e.g. “PREMIS”.
Example: pms:object, pms:event, pms:agent, pms:rights
Users can change a datastream directly in the digital object. After each change, the system stores a new version of the content, so users can still access previous versions (e.g. DC V. 1 and DC V. 2).
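The object, datastream, and version idea can be sketched in a few lines of Python. This is only an illustration of the model, not Fedora's implementation: the class names, the `modify` method, and the `DC.0`-style version IDs are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Datastream:
    """One content item; keeps every version that was ever stored."""
    ds_id: str
    versions: List[bytes] = field(default_factory=list)

    def modify(self, content: bytes) -> str:
        """Store the content as a new version and return a version ID."""
        self.versions.append(content)
        # Hypothetical ID format: datastream ID plus a running number.
        return f"{self.ds_id}.{len(self.versions) - 1}"

    def get(self, version: int = -1) -> bytes:
        """Return one stored version; the default is the latest."""
        return self.versions[version]


@dataclass
class DigitalObject:
    """A digital object: a PID plus named datastreams."""
    pid: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)

    def add_datastream(self, ds_id: str, content: bytes) -> str:
        """Create the datastream if needed, then store a new version."""
        ds = self.datastreams.setdefault(ds_id, Datastream(ds_id))
        return ds.modify(content)


obj = DigitalObject(pid="demo:555")
first = obj.add_datastream("DC", b"<title>Example 1</title>")
second = obj.add_datastream("DC", b"<title>Example 1 (revised)</title>")
```

After the second call, `obj.datastreams["DC"].get(0)` still returns the original content, which is the point of versioning.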
Datastream Identifier: an identifier for the Datastream that is unique within the digital object (but not necessarily globally unique)
State: the Datastream state of Active, Inactive, or Deleted
Created Date: the date/time that the Datastream was created (assigned by the repository service)
Modified Date: the date/time that the Datastream was modified (assigned by the repository service)
Versionable: an indicator (true/false) as to whether the repository service should version the Datastream. By default the repository versions all Datastreams.
Label: a descriptive label for the Datastream
MIME Type: the MIME type of the Datastream (required)
Format Identifier: an optional format identifier for the Datastream. Examples of emerging schemes are PRONOM and the Global Digital Format Registry (GDFR).
Alternate Identifiers: one or more alternate identifiers for the Datastream. Such identifiers could be local identifiers or global identifiers such as Handles or DOI.
Checksum: an integrity stamp for the Datastream, which can be calculated using one of many standard algorithms (MD5, SHA-1, etc.)
Bytestream Content: the "stuff" that the Datastream is about (such as a document, digital image, video, or metadata record)
Source: fedora-commons.org
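As an illustration of the Checksum property, the sketch below computes an integrity stamp with Python's standard `hashlib` and compares it with a stored value. The function names are hypothetical; the sketch shows the idea only and is not how the repository itself computes checksums.

```python
import hashlib


def datastream_checksum(content: bytes, algorithm: str = "MD5") -> str:
    """Return the hex digest of the bytestream for the named algorithm."""
    # hashlib expects names such as "md5" or "sha1", so "SHA-1" is normalized.
    name = algorithm.lower().replace("-", "")
    return hashlib.new(name, content).hexdigest()


def is_intact(content: bytes, expected: str, algorithm: str = "MD5") -> bool:
    """Compare a freshly computed checksum with the stored one."""
    return datastream_checksum(content, algorithm) == expected


stored = datastream_checksum(b"page image bytes", "SHA-1")
```

A later integrity check would call `is_intact(content, stored, "SHA-1")` and treat a `False` result as a sign that the bytestream changed.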
Control Group: pertaining to the bytestream content, a new Datastream can be defined as one of four types, or control groups, as follows:
– Internal XML Metadata – Stores XML content
– Managed Content – Stores content that needs to be preserved and is uploaded to the system, such as image files, video, PDF, etc.
– Redirect Referenced Content – Stores the URL of a digital object in an external repository
– External Referenced Content – Same purpose as Redirect, but when users access the content, they see a content URL in the same domain as the repository
Source: fedora-commons.org
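The four control groups can be written down as a small enumeration. The one-letter codes (X, M, R, E) follow the codes used in FOXML as far as we know; the `stored_in_repository` property is a hypothetical helper added for this sketch.

```python
from enum import Enum


class ControlGroup(Enum):
    """The four datastream control groups and their one-letter codes."""
    INTERNAL_XML = "X"   # XML content kept inline in the object
    MANAGED = "M"        # content uploaded to and stored by the repository
    REDIRECT = "R"       # URL of content in an external repository
    EXTERNAL = "E"       # external content served through the repository's URL

    @property
    def stored_in_repository(self) -> bool:
        """True when the bytes themselves live inside the repository."""
        return self in (ControlGroup.INTERNAL_XML, ControlGroup.MANAGED)


image_group = ControlGroup("M")
```

Looking a group up by its code, as in `ControlGroup("M")`, makes it easy to decide whether a datastream's content needs to be backed up with the repository or only its reference.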
Example: displaying digital object information
– Shows the Fedora object information and the list of datastreams
– A new datastream can be added
– The object can be exported
Example: XML datastream (control group XML)
– For XML, the user enters the XML content directly. It might be an ontology, RDF, metadata, etc.
Example: Managed Content datastream (control group M)
– Content that the user needs to preserve
– The user sets the MIME type; the form shows the URL of the image in the repository
Example: External Referenced datastream (control group E)
– References a digital object from an external source. The digital content is not stored in the repository itself.
– The user sets the MIME type and the reference URL; the form shows the URL used for access
Example: View XML Object
– Shows the digital object in XML format
Example: exporting a digital object to another format
– FOXML 1.1 (the most current FOXML format)
– FOXML 1.0 (the FOXML format used with pre-3.0 Fedora repositories)
– METS 1.1 (the most current Fedora extension of METS)
– METS 1.0 (the METS format used with pre-3.0 Fedora repositories)
– ATOM (the Fedora extension of Atom)
– ATOM ZIP (an ATOM based format which packages all datastreams along with the object XML in a ZIP file)
FOXML example (object changeme:2)
<foxml:digitalObject VERSION="1.1" PID="changeme:2" xmlns:foxml="info:fedora/fedora-system:def/foxml#" xmlns:xsi=" xsi:schemaLocation="info:fedora/fedora-system:def/foxml# modifyObject fedoraAdmin T15:15:22.428Z modifyObject fedoraAdmin T15:16:37.065Z changeme:2 … …
The parts of the FOXML document:
– Object identifier: information about the PID
– System Properties of the digital object, generated by the system
– Datastream “AUDIT”: audit records, each with a date, an audit ID, and an action
– Datastream “DC”: the DC content
– Datastream of the preservation content: name + version number, Managed Content (stored in the repository), version record, and URL to the resource file
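To make the FOXML layout concrete, the sketch below reads a small FOXML-like document with Python's standard `xml.etree.ElementTree` and lists the PID, datastreams, and versions. The sample document is hypothetical and heavily reduced (only the `foxml` namespace URI and the PID come from the slides), and `summarize` is a name invented for this sketch.

```python
import xml.etree.ElementTree as ET

FOXML_NS = {"foxml": "info:fedora/fedora-system:def/foxml#"}

# Hypothetical, reduced sample; a real FOXML file carries more elements.
SAMPLE = """<foxml:digitalObject VERSION="1.1" PID="changeme:2"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#">
  <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
    <foxml:datastreamVersion ID="DC1.0" MIMETYPE="text/xml"/>
  </foxml:datastream>
  <foxml:datastream ID="IMG" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
    <foxml:datastreamVersion ID="IMG.0" MIMETYPE="image/jpeg"/>
    <foxml:datastreamVersion ID="IMG.1" MIMETYPE="image/jpeg"/>
  </foxml:datastream>
</foxml:digitalObject>"""


def summarize(foxml_text: str) -> dict:
    """Collect the PID and, per datastream, its control group and version IDs."""
    root = ET.fromstring(foxml_text)
    summary = {"pid": root.get("PID"), "datastreams": {}}
    for ds in root.findall("foxml:datastream", FOXML_NS):
        versions = [v.get("ID")
                    for v in ds.findall("foxml:datastreamVersion", FOXML_NS)]
        summary["datastreams"][ds.get("ID")] = {
            "control_group": ds.get("CONTROL_GROUP"),
            "versions": versions,
        }
    return summary


summary = summarize(SAMPLE)
```

The same walk over `foxml:datastream` elements would work on an exported object, assuming the export follows the FOXML 1.1 layout.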
A Fedora digital object can be related to other Fedora objects, and Fedora records these relationships.
Relationship types:
– isPartOf, hasPart
– isMemberOf, hasMember
Relationship ontology
– relsext-ontology.rdfs
Source: fedora-commons.org
Example fedora-commons.org 41
Benefits of relationships
– Support OAI harvesting and user search/browse
– Define bibliographic relationships among objects
– Define semantic relationships among resources
– Link resources together based on contextual information
Source: fedora-commons.org
Pattern Example fedora-commons.org 43
Relationships between one object and another use the RELS-EXT datastream.
Relationships between datastreams inside the same Fedora digital object use the RELS-INT datastream.
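A RELS-EXT datastream is RDF/XML, so it can be written and read with ordinary XML tools. The sketch below builds one statement and reads it back. The `relations-external#` namespace URI is our understanding of Fedora's relationship ontology and should be treated as an assumption; the function names and the example PIDs are illustrative only.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed namespace of the Fedora relationship ontology.
REL = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(pid, relationships):
    """Serialize (predicate, target PID) pairs about one object as RDF/XML."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("rel", REL)
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description",
                         {f"{{{RDF}}}about": f"info:fedora/{pid}"})
    for predicate, target in relationships:
        ET.SubElement(desc, f"{{{REL}}}{predicate}",
                      {f"{{{RDF}}}resource": f"info:fedora/{target}"})
    return ET.tostring(root, encoding="unicode")


def read_rels_ext(rdf_xml):
    """Return (subject, predicate, object) triples found in the RDF/XML."""
    triples = []
    for desc in ET.fromstring(rdf_xml).iter(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for stmt in desc:
            predicate = stmt.tag.split("}", 1)[1]  # strip the namespace part
            triples.append((subject, predicate,
                            stmt.get(f"{{{RDF}}}resource")))
    return triples


rdf_xml = build_rels_ext("demo:page1", [("isPartOf", "demo:book1")])
triples = read_rels_ext(rdf_xml)
```

RELS-INT would follow the same pattern, with datastream URIs instead of object URIs as subjects.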
Open Source - The Fedora repository system is open source software.
Flexible Digital Object Model - The Fedora digital object model is flexible enough to create many kinds of objects, including documents, images, electronic books, multimedia learning objects, datasets, metadata, etc.
Content Versioning - Any modification made to a Datastream through the Fedora management interface (API-M) automatically creates a new version of that Datastream or Disseminator.
XML Ingest and Export - Digital objects can be submitted to a Fedora repository as XML-encoded files (FOXML).
XML Storage - By default, Fedora digital objects are stored in a Fedora repository as XML-encoded files.
Object-to-Object Relationships - Fedora provides the ability to assert object-to-object relationships.
Access Control and Authentication - Includes a simple form of access control that restricts access based on IP address/range.
Source: fedora-commons.org
Simple Search - Fedora automatically creates simple indexes of the repository. The search engine also uses information from DC and relationships.
RDF-based Resource Index - Includes more information about objects plus object-to-object relationships.
OAI Metadata Harvesting Provider - The OAI Protocol for Metadata Harvesting is a standard for sharing metadata across repositories.
Migration Utility - A migration utility is provided to perform mass export and mass ingest of objects.
Batch Utility - A Fedora Administrator client that enables the mass creation and modification of Fedora digital objects.
Reporting Utility - A reporting utility provides different management views of the contents of the Fedora repository.
Source: web services from wiki.duraspace.org
API
Management API (API-M) – A SOAP-enabled web service that defines an administration interface for managing the repository. It provides the functions an administrator needs to create and maintain digital objects and their components.
Access API (API-A) – A SOAP-enabled web service that defines an interface for accessing digital objects stored in the repository.
Resource Index Search API – An RDF-based index search over the Resource Index, which holds the following for each digital object:
– object properties
– object-to-object relationships
– metadata about datastreams and disseminations
– default Dublin Core record
Source: web services from wiki.duraspace.org
API-LITE: a light-weight version of the Fedora Access Service
Access-Lite API (API-A-Lite) – A REST-based web service for invoking access functions on digital objects.
Management-Lite API (API-M-Lite) – A future REST-based web service that is responsible for management functions.
Search API (part of API-A-Lite) – A REST-based web service that includes search operations.
Source: web services from wiki.duraspace.org
API-M methods
Datastream Management
– addDatastream
– compareDatastreamChecksum
– getDatastream
– getDatastreamHistory
– getDatastreams
– modifyDatastreamByReference
– modifyDatastreamByValue
– setDatastreamState
– setDatastreamVersionable
– purgeDatastream
Relationship Management
– addRelationship
– getRelationships
– purgeRelationship
Object Management
– modifyObject
– purgeObject
– export
– getNextPID
– getObjectXML
– ingest
– validate
API-A methods
Repository Access
– describeRepository
Object Access
– findObjects
– resumeFindObjects
– getObjectHistory
– getObjectProfile
Datastream Access
– getDatastreamDissemination
– listDatastreams
Dissemination Access
– getDissemination
– listMethods
Source: web services from wiki.duraspace.org
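For REST-style access, each method name maps to a URL. The sketch below only builds such URLs and makes no network call. The path layout is our recollection of the Fedora 3.x REST interface and should be checked against the API documentation for the installed version; the base URL and the `access_url` helper are hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8080/fedora"  # hypothetical server location

# Assumed path templates for a few access methods (verify for your version).
PATHS = {
    "getObjectProfile": "/objects/{pid}",
    "getObjectHistory": "/objects/{pid}/versions",
    "listDatastreams": "/objects/{pid}/datastreams",
    "getDatastreamDissemination": "/objects/{pid}/datastreams/{ds_id}/content",
}


def access_url(method: str, pid: str, ds_id: str = "", **params: str) -> str:
    """Build the URL for one access method; extra params become the query string."""
    path = PATHS[method].format(pid=quote(pid, safe=":"), ds_id=quote(ds_id))
    query = "?" + urlencode(params) if params else ""
    return BASE_URL + path + query


history_url = access_url("getObjectHistory", "demo:1", format="xml")
image_url = access_url("getDatastreamDissemination", "demo:1", ds_id="IMG")
```

Keeping the templates in one table makes it simple to adjust the sketch if the installed server uses different paths.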
Example: getting the history of a specific object
Source: web services from wiki.duraspace.org
Semantic search is provided as a web service.
– Queries can be written in SPARQL
– Responses are returned in RDF-supporting formats such as N-Triples, RDF/XML, Turtle, etc.
For example:
select $object $modified from where $object and $object $modified
Result:
info:fedora/demo:1, T19:39:28.859Z
info:fedora/demo:12, T19:39:17.843Z
info:fedora/demo:19, T19:39:20.375Z
info:fedora/demo:22, T19:39:20.671Z
Source: web services from wiki.duraspace.org
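A Resource Index query is sent as an HTTP request with the query text as a parameter. The sketch below only assembles such a request URL. The endpoint path, the parameter names (`type`, `lang`, `format`, `query`, `limit`), and the `lastModifiedDate` property URI reflect our understanding of the Fedora 3.x Resource Index search service and are assumptions; the SPARQL text is an illustrative query, not the one from the slide.

```python
from urllib.parse import parse_qs, urlencode, urlparse

RISEARCH_URL = "http://localhost:8080/fedora/risearch"  # hypothetical endpoint

# Illustrative query: every object with its last-modified date (assumed URI).
QUERY = """SELECT ?object ?modified WHERE {
  ?object <info:fedora/fedora-system:def/view#lastModifiedDate> ?modified .
}"""


def risearch_url(query, result_format="CSV", limit=None):
    """Encode a tuple query as a GET URL for the Resource Index search service."""
    params = {"type": "tuples", "lang": "sparql",
              "format": result_format, "query": query}
    if limit is not None:
        params["limit"] = str(limit)
    return RISEARCH_URL + "?" + urlencode(params)


url = risearch_url(QUERY, limit=4)
decoded = parse_qs(urlparse(url).query)
```

Decoding the URL again, as in the last line, is a quick way to confirm that the query survives URL encoding unchanged.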
fedora-commons.org 57
Repository Service: the core service that enables functions for manipulating digital objects, such as creation, management, storage, access, and reuse.
OAI Provider Service: a service that harvests metadata from other repositories and provides metadata to them.
Directory Ingest Service: a service used to ingest a digital object and store it.
Search Service: a search service that can be enhanced by adding other search engines.
Source: fedora-commons.org
Object Reuse and Exchange (ORE) Access Point: provides a cross-repository service.
Workflow and Orchestration: (future plan)
Preservation Integrity Service: (future plan)
Preservation Monitoring and Alerting Service: (future plan)
Source: fedora-commons.org
fedora-commons.org 60
The service framework applies the concepts of OAIS.
It provides interfaces to the core repository services via web services
– API-M (Management)
– API-A (Access)
– Basic Search
– RDF Search
The Fedora Repository server runs on Tomcat.
Data is stored in a database
– Objects (XML) and byte streams hold the preservation data and metadata
– SQL registry + metadata
– RDF-based index
Source: fedora-commons.org
A case study of Islandora
Overview
Institution – University of Prince Edward Island's Robertson Library
Description – Islandora is an open source project underway at the Robertson Library at the University of Prince Edward Island. Islandora combines the Drupal and Fedora software applications to create a robust digital asset management system that can be used for any requirement where collaboration and digital data stewardship, for the short and long term, are critical.
Tools – Fedora Repository + GSearch, Drupal, and Solr
Link –
Source: example from fedora-commons.org
Features
Provides an administration panel
– View, ingest, and purge any digital object
– Helps users understand the relationships between digital objects
Provides lightning-fast search of the Fedora database
– Including full-text search
– Integrates with Solr for better search performance
Supports many metadata formats
– Allows users to define a metadata model for each digital object category
Supports many types of digital object
– Supports collections of both born-digital and digitized materials
Source: islandora from wiki.duraspace.org
System Architecture
[Figure: Service Provider side with a web server (Fedora Core Service, GSearch Generic Search, Drupal Servlet Filter), a database server, and a search server (Solr); Service Consumer side with a web server (Drupal with the Islandora module) and a database server]
66 islandora.ca
Components
Drupal Servlet Filter – The Drupal Servlet Filter allows the Fedora Repository to use Drupal’s database for authentication, including integration with Drupal user roles.
The Islandora Module – The Islandora module is a Drupal module written to allow the Drupal content management system to act as a front end to a Fedora Digital Repository. The module enables viewing and management of Fedora objects. This includes inserting, updating, and deleting datastreams, as well as browsing and searching.
Enabling Indexing/Searching with Solr – Islandora uses the Solr open-source search platform to enable flexible and configurable indexing and searching. Solr uses the Lucene Java search library at its core for full-text indexing and search, and offers hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g. Word, PDF) handling as additional features.
GSearch – The Fedora Generic Search Service, or GSearch, is a search service installed with Fedora that allows for automatic updating of the Lucene/Solr index. GSearch relies on JMS to receive the messages that are sent when Fedora objects are ingested, modified, or purged. This keeps the Lucene index in sync with the Fedora repository.
Source: islandora from wiki.duraspace.org
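The message-driven index update can be pictured with a small in-memory sketch. It stands in for the idea only: there is no JMS, Lucene, or Solr here, and the class names, the `on_message` method, and the event fields are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RepositoryEvent:
    """A simplified stand-in for a message sent when an object changes."""
    method: str      # "ingest", "modifyObject" or "purgeObject"
    pid: str
    text: str = ""   # text to index; unused for purge events


class SearchIndex:
    """A toy index that stays in sync by reacting to repository events."""

    def __init__(self):
        self.documents = {}

    def on_message(self, event: RepositoryEvent) -> None:
        if event.method == "purgeObject":
            self.documents.pop(event.pid, None)       # drop purged objects
        else:
            self.documents[event.pid] = event.text    # add or refresh

    def search(self, term: str) -> list:
        term = term.lower()
        return sorted(pid for pid, text in self.documents.items()
                      if term in text.lower())


index = SearchIndex()
index.on_message(RepositoryEvent("ingest", "islandora:1", "History of PEI"))
index.on_message(RepositoryEvent("ingest", "islandora:6", "Harbour maps"))
index.on_message(RepositoryEvent("modifyObject", "islandora:6", "Harbour history"))
index.on_message(RepositoryEvent("purgeObject", "islandora:1"))
hits = index.search("history")
```

Because the index only reacts to events, the repository never has to call the search service directly, which is the decoupling that the messaging approach gives.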
[Screenshot: a collection of digital objects that holds many PDF documents]
[Screenshot: hierarchy of collections, from the administration page]
The collections are represented as a Fedora object hierarchy.
[Figure: islandora:top, islandora:demos, islandora:pdf_collection, and islandora:demo_image_collection linked by isMemberOfCollection; content models islandora:collectionCModel and demo:DualResImageCollection linked by hasModel]
[Screenshot: a digital object in a collection, displayed after a collection is selected]
The digital objects are represented in the same Fedora object hierarchy.
[Figure: islandora:6 and islandora:1 linked to their collection by isMemberOfCollection, below islandora:top, islandora:demos, islandora:pdf_collection, and islandora:demo_image_collection; content models islandora:collectionCModel and demo:DualResImageCollection linked by hasModel]
[Screenshot: open a digital object to see its standard metadata]
[Screenshot: steps 1) and 2) to get a datastream]
[Screenshot: the datastream of the digital object can be retrieved]
Example structure of a digitized object such as a book in which each page is scanned in TIFF format.
[Figure: islandora:top, islandora:demos, and islandora:book_collection linked by isMemberOfCollection; islandora:book1 and islandora:book2 linked to the collection by isMemberOfCollection; islandora:book1-page1 and islandora:book1-page2 linked to islandora:book1 by isPartOf]
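The book structure can be expressed as relationship triples and walked as a tree. The object names come from the figure, but the exact direction of every link is an assumption made for this sketch, and the helper functions are hypothetical.

```python
# (subject, predicate, object) triples; link directions are assumed.
TRIPLES = [
    ("islandora:demos", "isMemberOfCollection", "islandora:top"),
    ("islandora:book_collection", "isMemberOfCollection", "islandora:demos"),
    ("islandora:book1", "isMemberOfCollection", "islandora:book_collection"),
    ("islandora:book2", "isMemberOfCollection", "islandora:book_collection"),
    ("islandora:book1-page1", "isPartOf", "islandora:book1"),
    ("islandora:book1-page2", "isPartOf", "islandora:book1"),
]


def children(parent: str) -> list:
    """Objects that point at the parent through any relationship."""
    return sorted(s for s, _, o in TRIPLES if o == parent)


def descendants(root: str) -> list:
    """Depth-first list of everything below the root object."""
    result = []
    for child in children(root):
        result.append(child)
        result.extend(descendants(child))
    return result


pages = children("islandora:book1")
below_collection = descendants("islandora:book_collection")
```

A front end can use the same walk to render a collection page: books first, then the pages under each book.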
Users can add a digital object under a selected collection, e.g. islandora:pdf_collection. Steps:
1) Click “Add”
2) Select a content model for the digital object. The choice leads to a different ingestion UI
3) Click “Next”
4) Enter the metadata in the form
5) Scroll down and click “Ingest” to finish the ingest process
[Screenshot, Fedora admin: all information (including datastreams) of the created digital resource has been ingested into the Fedora repository]
Users can open the page for managing collections and models from the menu: Administer > Content Management > Islandora Content Modeler
[Screenshot: the collections of digital objects that are displayed on the first page of the digital repository; more collections can be added with the + button]
[Screenshot: all content models, which are templates for digital objects; more can be created with the input form for a new content model]
[Screenshot: details of a selected content model]
[Screenshot: the input form elements, which can be customized for display on the ingestion form; a new input element can be created from the panel after clicking + to add an element]
Users can search and get a result list of digital objects.
After viewing a digital object, users can click a metadata value to search by that value, e.g. dc.subject: History-Conoe Cove (PEI).
The user then gets the related digital objects that are relevant to dc.subject:History and dc.subject:Conoe Cove (PEI).
More information about installation and the user manual is available at ISLANDORA/Islandora+Guide
DSpace
– Founded at MIT
– Open source software that provides functions for managing digital resources and generally supports institutional repositories
– Supports many kinds of digital formats and controls the process of managing digital resources
– The DSpace data model supports preservation
– Also supports the OAI-PMH protocol
Source: dspace.org
Fedora-DSpace integration
[Figure: combined features] Flexible architecture; RDF based semantic search; SOA; Customizable metadata format; Flexible information model; Disaster recovery; Preservation activities; Data migration; Versioning; Many storage options (database or file system); Web-based application; Documentation process control; Authentication & authorization; Localization; Browse and search UI; Configurable UI; Preservation process management; Administration UI
DURASPACE
In May 2009, Fedora Commons and the DSpace Foundation merged to align their strategies and missions. The merged organization is “DuraSpace”, a non-profit organization that provides leadership and innovation in open source technologies for the preservation and dissemination of digital library and institutional repository resources.
Source: duraspace.org
Objective
Running DSpace on Fedora
– Current status: in progress
Source: duraspace.org
Benefits
DURASPACE, a Fedora-DSpace integration, becomes a complete archival repository solution. Users still work with the rich features that build on the strong points of DSpace. In addition, the advantageous back-end features of Fedora are improved to serve the enhanced DSpace functions. Therefore, DURASPACE can provide:
– A rich user experience, such as rich administration functions and rich search functions that support semantic search
– An excellent back-end system that is based on a flexible service architecture, fully supports preservation activities, supports versioning of digital objects and datastreams, and has a flexible information model that uses semantic technology to support relationships between digital objects
– Longevity and re-use of digital content
Possible ideas
Identify high-impact scholarly applications to integrate with other repositories
Demonstrate DSpace running on top of Fedora
Define a common information model
– Map the information model of DSpace to Fedora
– Define a common information model for institutional repository purposes
Define a common standard protocol for repositories
Define scenarios for integrating open repositories
Develop shared services/modules to enable the exchange of information among repositories
Define a common storage API
Integrate semantic technology
Share user interface approaches through Manakin (the enhanced DSpace XMLUI) and build lightweight applications on top of repositories
Move toward a common architecture based on well-defined design patterns
Source: duraspace.org
97 DURACLOUD from duraspace.org
DURACLOUD
A hosted service and open technology developed by DuraSpace under the concept “Store and Do More”.
DuraCloud lets libraries and institutes manage their own repositories without building their own technical infrastructure.
It offers storage across commercial and non-commercial providers.
Progress: pilot phase
Benefits
– “Digital contents are stored in the cloud”
– “Backing up, preserving, and updating content yourself can be an uphill battle”
– “Maintaining several copies in different locations is a lot safer”
– “Let you add image viewing and media streaming services to your site without messing with new servers or software”
Source: duraspace.org
PREMIS
– PREservation Metadata: Implementation Strategies
– Focuses on developing metadata for use in digital preservation
– Sponsored by the Library of Congress (LOC)
– Objectives:
To store technical information that supports decisions and actions for preservation
To document actions taken, such as migration
To record the effects of preservation strategies
To ensure the authenticity of digital resources over the long term
To note information about collection management and rights management
Source: LOC.gov
PREMIS
Store PREMIS in a Fedora digital object
Source: fedora-commons.org
Challenge! What should the input form look like?
Could it be? (input form mockup, step by step)
1) The PREMIS Object form shows objectIdentifier with the required fields objectIdentifierType * and objectIdentifierValue *, plus the sections Environment, storage, and relationship.
2) The user enters ISBN in the identifier fields.
3) Under Environment, the software section shows swName *, swVersion, swType *, and hwInformation.
4) As the user types "i" and then "ibm", the form suggests dc:ibm, dp:ibm, dp:ibmserver, and foaf:ibm.
5) The user selects dp:ibmserver, and the form shows it as IBM Server under hwInformation.
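One way to think about the form is as a function from field values to a PREMIS-like XML fragment that could be stored as a datastream. The sketch below uses the field names from the mockup and treats the fields marked * as required. It is not a schema-valid PREMIS serialization: namespaces are omitted, the nesting is an assumption, and every value other than ISBN and dp:ibmserver is a placeholder.

```python
import xml.etree.ElementTree as ET

# Fields marked with * in the mockup are treated as required.
REQUIRED = {"objectIdentifierType", "objectIdentifierValue", "swName", "swType"}


def build_premis_object(identifier: dict, software: dict) -> str:
    """Validate the form values and serialize them as a simple XML fragment."""
    filled = {k for k, v in {**identifier, **software}.items() if v}
    missing = sorted(REQUIRED - filled)
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))

    obj = ET.Element("object")
    ident = ET.SubElement(obj, "objectIdentifier")
    for name in ("objectIdentifierType", "objectIdentifierValue"):
        ET.SubElement(ident, name).text = identifier[name]

    env = ET.SubElement(obj, "environment")
    sw = ET.SubElement(env, "software")   # nesting assumed from the mockup
    for name in ("swName", "swVersion", "swType", "hwInformation"):
        if software.get(name):            # optional fields may be left out
            ET.SubElement(sw, name).text = software[name]
    return ET.tostring(obj, encoding="unicode")


premis_xml = build_premis_object(
    {"objectIdentifierType": "ISBN", "objectIdentifierValue": "placeholder-value"},
    {"swName": "placeholder-name", "swType": "placeholder-type",
     "hwInformation": "dp:ibmserver"},
)
```

Raising an error for missing required fields mirrors what the form would do before it lets the user ingest the object.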
Reference: Michal Kökörčený, Agáta Bodnárová. Comparison of digital libraries systems. University of Hradec Králové, Faculty of Informatics and Management.