Cluj Napoca, 28 August
IEEE International Conference on Intelligent Computer Communication and Processing
Digital Libraries Workshop

Towards a GRID-Based Digital Library Management System

Gheorghe Sebestyén-Pál (1), Doina Banciu (2), Tünde Bálint (1), Bogdan Moscaiuc (1), and Ágnes Sebestyén-Pál (1)
(1) Technical University of Cluj-Napoca
(2) ICI Bucharest

Debrecen, 3-5 September 2008, DAPSYS’08 7th INTERNATIONAL CONFERENCE ON DISTRIBUTED AND PARALLEL SYSTEMS

Content
- Classical vs. Digital Libraries
- Recent research on Digital Libraries (DL)
- Main issues and requirements for DLs
- An ontology-based DL model
- Grid-enabled DL
- Implementation considerations of a pilot DL
- Experiments
- Conclusions

Classical vs. Digital Libraries
- Classical library: a repository of knowledge organized mainly on paper
- Digital library:
  - Not only a digitized version of a classical library
  - Adds a new set of functionalities and services (e.g. access control, resource management and allocation, complex search and processing services)
  - A data exchange and cooperation environment
- DLs are becoming digital content management systems:
  - They incorporate a wide variety of formats and data types (text, audio, video, multi-document complex digital objects)
  - They use a variety of communication and data-exchange protocols and standards

IT and communication technologies involved in the implementation of digital libraries

Goals for modern DLs
- The DELOS project’s vision: “to enable any person to access all human knowledge anytime and anywhere, in a friendly, multi-modal, efficient, and effective way, by overcoming barriers of distance, language, and culture and by using multiple Internet-connected devices”
- DL: a knowledge repository and an information exchange infrastructure that allows data generation, processing and seamless access to relevant information, regardless of the geographic distribution of hardware resources, databases or persons

Research in digital libraries
- DELOS Network of Excellence
  - Goals: to define and implement digital libraries on new computing and communication technologies
  - Achievements: definition of functional and architectural requirements for DL implementation
- BRICKS project
  - Goals: to design a user- and service-oriented space for sharing knowledge and resources in a multi-cultural heritage context
  - Achievements: definition of a digital library architecture for a very broad and heterogeneous user community; automatic indexing and annotation functionalities
- OpenDLib project
  - Goal: development of a software toolkit for generating dedicated DLs
  - Achievements: tools for content harvesting from existing resources
- Fedora, DSpace: open source software for DLs
- Lucene: open source search engine

Research in digital libraries (cont.)
- Diligent project (part of the EGEE project)
  - Goal: the use of GRID infrastructure for DL implementation
  - Achievements:
    - A new vision of the DL concept: DL = a dynamic digital content repository and management system dedicated to a purpose (e.g. a project, an art collection, an academic course)
    - Definition of generic DL services mapped onto GRID services
    - DLs dedicated to different domains, with powerful processing capabilities
- SINRED project (National Excellence project)
  - Goal: development of a national framework for DLs specialized in technical sciences and research
  - Achievements: evaluation of requirements, evaluation of existing software, infrastructure development, DL model definition, implementation of a pilot DL
- SIPADOC project (national research program)
  - Goal: reevaluation of the national patrimony through DLs
  - Achievements: evaluation of digitizing tools

Key issues in DL implementation
- Architectural issues:
  - Distributed nature of storage, processing and access resources
  - Scalability, flexibility, interoperability
- Functional requirements:
  - Core functions: storage, indexing and annotation, data search, content retrieval, user management
  - Content organization should reflect semantic connections
- Processing facilities:
  - Data processing services specialized for different fields
  - Pattern search and recognition
- QoS issues:
  - Bounded time to obtain relevant information
  - Reasonable time for complex data processing
- User and access control management:
  - Virtual organizations
  - Role-based access

DL = Essence & Metadata Management
- Content types: text, audio, video
- Digital content generation and harvesting
- Management of essence
- Automatic feature (metadata) extraction
- Metadata management: cataloging, indexing, annotation
- Access and visualization
- Cataloging information system

An ontology-based Digital Library approach
- Ontology: concepts and relations together with a reasoning engine
- Ontology for technical and scientific domains
- Main concepts:
  - Digital objects: association of content, metadata and procedures
    - Examples: articles, technical reports, prospectuses, PhD theses, patents
  - Digital collections: sets of digital objects structured for a given goal/purpose or based on a given criterion
    - Examples: articles of an author, documents of a domain
  - Events: conferences, workshops, seminars
  - Processes: projects, courses
  - Virtual organizations: roles, users
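
The two central concepts on this slide, digital objects and digital collections, could be modelled roughly as follows. This is only a minimal sketch: the class names, field names, the `collect_by` helper and the sample records are all hypothetical and are not taken from the pilot DL. It shows a collection being built "based on a given criterion" (here, all objects by one author).

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Association of content, metadata and procedures (all names hypothetical)."""
    identifier: str
    content: bytes
    metadata: dict = field(default_factory=dict)
    procedures: dict = field(default_factory=dict)  # procedure name -> callable


@dataclass
class Collection:
    """A set of digital objects grouped for a purpose or by a criterion."""
    name: str
    objects: list = field(default_factory=list)


def collect_by(objects, key, value, name):
    """Build a collection from all objects whose metadata[key] equals value."""
    selected = [o for o in objects if o.metadata.get(key) == value]
    return Collection(name=name, objects=selected)


# Invented sample objects, for illustration only.
paper = DigitalObject("obj-1", b"...", {"creator": "A. Author", "type": "article"})
report = DigitalObject("obj-2", b"...", {"creator": "B. Writer", "type": "technical report"})
thesis = DigitalObject("obj-3", b"...", {"creator": "A. Author", "type": "PhD thesis"})

by_author = collect_by([paper, report, thesis], "creator", "A. Author", "Works of A. Author")
print([o.identifier for o in by_author.objects])  # ['obj-1', 'obj-3']
```

Events, processes and virtual organizations could be added as further classes in the same style; a real ontology would also need explicit relations and a reasoning engine, which this sketch leaves out.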

Grid-enabled digital library services
Why DLs on GRID infrastructure?
- Huge volume of documents/digital objects
- Concurrent access and multiple search engines (see Google)
- Multimedia streaming
- Automatic indexing and annotation
- Complex processing requires prohibitive time
- User management through virtual organizations
- Job distribution facilities offered by GRID

DL functions mapped on GRID services
- Computing, storage and communication resources
- Digital Library: collections management; catalog and metadata management; digital objects management; user management; data visualization
- GRID services: virtual organizations management; resource management; task distribution; processing; data distribution and replication; data processing

Experiments
Two approaches:
- DL implementation on Alchemi GRID (Microsoft .NET based)
  - Job distribution at thread level
  - Explicit GRID programming
  - Experiments with multimedia streaming (multimedia content distribution)
- DL implementation on Condor GRID (open source)
  - Job distribution at task level
  - Job and data distribution is transparent to the DL application (distribution is done through separate scripts)
  - Experiments with keyword search over the whole DL content:
    - Execution time decreased as the number of executor computers increased
    - For more than 5 executors, the scheduling and communication time becomes comparable with the execution time
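
The keyword-search experiment splits the DL content among several executors and merges their results. The sketch below illustrates only that partition-and-merge idea, using local Python threads as a stand-in for grid executors; it does not use Condor or Alchemi, and the function names and sample documents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, keyword):
    """Return ids of documents in one partition whose text contains the keyword."""
    needle = keyword.lower()
    return [doc_id for doc_id, text in partition if needle in text.lower()]


def split(items, n_parts):
    """Split items into n_parts round-robin slices (some may be empty)."""
    return [items[i::n_parts] for i in range(n_parts)]


def distributed_search(documents, keyword, n_executors):
    """Search each partition in its own worker and merge the hits, sorted by id."""
    partitions = split(list(documents.items()), n_executors)
    with ThreadPoolExecutor(max_workers=n_executors) as pool:
        results = pool.map(search_partition, partitions, [keyword] * n_executors)
    return sorted(hit for part in results for hit in part)


# Invented sample content, for illustration only.
documents = {
    "doc1": "Grid computing for digital libraries",
    "doc2": "Metadata harvesting tools",
    "doc3": "Scheduling jobs on a GRID",
    "doc4": "Role-based access control",
}
hits = distributed_search(documents, "grid", n_executors=2)
print(hits)  # ['doc1', 'doc3']
```

On a real grid each partition would be shipped to a separate machine, so every additional executor adds scheduling and data-transfer cost. That is consistent with the observation above that the overhead becomes comparable with the useful work beyond about 5 executors.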

A pilot implementation of a Digital Library framework developed with GRID support
- Goal: implementation of a digital content storage and retrieval system dedicated to educational and scientific activities (courses, projects, etc.)
- Main requirements:
  - A DL adaptable to a given purpose/goal
  - Access controlled and restricted through virtual organizations
  - Ontology-based approach (concepts, relations, semantic search)
  - Advanced search procedures
  - GRID-enabled full-text search services, for better response time
  - Access through Internet browsers
- The result: a distributed digital library application which allows:
  - Management of digital objects (upload, storage, indexing, metadata creation)
  - Management of collections
  - Management of users and virtual organizations

Pilot DL details:
- Management of digital objects
  - Upload of digital documents
  - Annotation and metadata generation according to Dublin Core
  - Distributed storage of data
- Management of collections
  - Define a new collection
  - Attach new documents to an existing collection
  - Associate access rights with a collection
- Management of users and virtual organizations
  - Define new users and new virtual organizations
  - Define roles
  - Associate roles with users and collections
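
Two of the operations listed above lend themselves to a short illustration: generating a Dublin Core metadata record and checking access to a collection through roles. The sketch below assumes the 15 elements of the simple Dublin Core element set; the role names, the actions they grant, the `assignments` table and the sample values are hypothetical and do not describe the pilot DL's actual data model.

```python
# The 15 elements of the simple Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical roles: each role grants a set of actions.
ROLE_ACTIONS = {
    "reader": {"read"},
    "editor": {"read", "upload", "annotate"},
}


def make_dc_record(**fields):
    """Build a metadata record, rejecting names outside the 15 DC elements."""
    unknown = set(fields) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return dict(fields)


def is_allowed(assignments, user, collection, action):
    """A user may act on a collection if the role they hold there grants the action."""
    role = assignments.get((user, collection))
    return role is not None and action in ROLE_ACTIONS[role]


# Invented sample data, for illustration only.
record = make_dc_record(title="Course notes, week 1", creator="A. Author",
                        language="en", format="application/pdf")
assignments = {("alice", "course-notes"): "editor",
               ("bob", "course-notes"): "reader"}

print(record["title"])                                           # Course notes, week 1
print(is_allowed(assignments, "alice", "course-notes", "upload"))  # True
print(is_allowed(assignments, "bob", "course-notes", "upload"))    # False
```

Keying the assignment on the (user, collection) pair mirrors the last item of the slide, where roles are associated with both users and collections.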

Snapshots of the DL application’s interface

Search techniques in DLs
- Through keyword or index search:
  - Database techniques
- Through semantic Information Retrieval:
  - Semantic graph with documents and concepts
- Through non-semantic Information Retrieval:
  - Naive Bayes algorithm
    - Probabilistic approach
    - Based on probabilistic similarity between documents
  - Topic-Based Vector Space Model algorithm
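
To make the probabilistic approach concrete, here is a minimal sketch of a textbook multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is not necessarily the variant used in the pilot DL; the function names, class labels and training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def train(labelled_docs):
    """labelled_docs: list of (text, label). Returns per-class counts and vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        class_docs[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_docs, word_counts, vocabulary


def classify(model, text):
    """Return the class with the highest log posterior under add-one smoothing."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data, for illustration only.
model = train([
    ("grid scheduling of jobs", "computing"),
    ("parallel job execution on clusters", "computing"),
    ("metadata cataloging and indexing", "library"),
    ("digital collections and metadata", "library"),
])
print(classify(model, "indexing of digital metadata"))  # library
print(classify(model, "parallel grid jobs"))            # computing
```

Training is a single counting pass over the documents, and the counts for disjoint document subsets can be summed, so this kind of classifier is a natural fit for the job distribution offered by a grid.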

Conclusions
- DLs are complex content management systems that extend the functionalities of classical libraries:
  - Semantic organization of a wide variety of information formats
  - Multiple search and data retrieval techniques (including full-text and semantic search):
    - Keyword full-text search
    - Semantic search
    - Statistical and probabilistic retrieval and classification
  - Access control to distributed and remote data
- DLs are data exchange and cooperation environments
  - Useful for remote and cooperative work
- DLs must include powerful search and data retrieval engines
- GRID infrastructures may be a feasible support for the implementation of DLs
  - For more efficient parallel search, classification or automatic annotation

Thank you for your attention. Questions?