The Question of Quality
Most of this presentation is based on the work of Marcos Gonçalves, as cited in the references.



Goals for this class
Consider quality in digital libraries
  – How do we define quality?
  – How do we measure quality?
  – How does quality control affect a user?

Understanding Quality in a DL
Quality indicators: proposed descriptions of quantities or observable variables that may be related to quality
  – "Measures" is a stronger term; it requires validation
  – Gonçalves et al. provide an analysis of quality conditions and recommend specific quantities to be used
Dimensions of quality
Proposed indicators
Application to DL concerns

Getting the data
Where does the data come from?
  – Logging
  – Surveys
  – Focus groups
Know what information is needed, then choose the method most likely to provide the data.
  – More about the sources of data after we see what we need to know.

What are we looking for?
Consider that we are concerned about the quality of the following characteristics of a DL:
  – Data objects
  – Metadata
  – Collection
  – Catalog
  – Repository
  – Services
What characteristics do we want each of those to have?

Dimensions of Quality

Digital Object
  – Accessibility
  – Pertinence
  – Preservability
  – Relevance
  – Similarity
  – Significance
  – Timeliness
Metadata Specification
  – Accuracy
  – Completeness
  – Conformance
Collection
  – Completeness
Catalog
  – Completeness
  – Consistency
Repository
  – Completeness
  – Consistency
Services
  – Composability
  – Efficiency
  – Effectiveness
  – Extensibility
  – Reusability
  – Reliability

What information do we need - related to Digital Objects
Accessibility
  – What collection?
  – # of structured streams
  – Rights management metadata
  – Communities to be served
Pertinence
  – Context
  – Information content
  – Information need

Information need - Digital Objects, continued
Preservability
  – Fidelity (lossiness)
  – Migration cost
  – Digital object complexity
  – Stream formats
Relevance
  – Feature frequency
  – Inverse document frequency
  – Document size
  – Document structure
  – Query size
  – Collection size

Information need - Digital Objects, continued
Similarity
  – All the same features as in relevance
  – Also: citation/link patterns
Significance
  – Citation/link patterns
Timeliness
  – Age
  – Time of latest citation
  – Collection freshness

Information need - Metadata Specification
Accuracy
  – Accurate attributes
  – # attributes in the record
Completeness
  – Missing attributes
  – Schema size
Conformance
  – Conformant attributes
  – Schema size
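
The slide lists the raw quantities but not how to combine them. A minimal sketch, assuming each dimension is a simple ratio of the two quantities listed for it (the ratios and the function name `metadata_quality` are illustrative assumptions, not formulas quoted from the slides):

```python
def metadata_quality(accurate, in_record, missing, conformant, schema_size):
    """Combine the listed counts into three scores in [0, 1].

    Assumed ratios (hypothetical, for illustration only):
      accuracy     = accurate attributes / attributes in the record
      completeness = 1 - missing attributes / schema size
      conformance  = conformant attributes / schema size
    """
    if in_record <= 0 or schema_size <= 0:
        raise ValueError("record and schema must contain at least one attribute")
    return {
        "accuracy": accurate / in_record,
        "completeness": 1.0 - missing / schema_size,
        "conformance": conformant / schema_size,
    }


# A record with 10 attributes (8 accurate) against a 15-attribute schema
# with 3 attributes missing and 12 conformant.
scores = metadata_quality(accurate=8, in_record=10, missing=3,
                          conformant=12, schema_size=15)
```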

Information - Collection and Catalog
Completeness of the Collection
  – Collection size
  – Size of an "ideal" collection
Completeness of the Catalog
  – # of digital objects with no metadata (item-level metadata)
  – Size of the collection
Catalog Consistency
  – # of metadata specifications per digital object
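
One plausible way to turn these counts into scores is sketched below. The two ratios are assumptions chosen to match the quantities listed on the slide; the function names are hypothetical.

```python
def collection_completeness(collection_size, ideal_size):
    """Assumed ratio: how much of an 'ideal' collection is actually held."""
    if ideal_size <= 0:
        raise ValueError("ideal collection size must be positive")
    return collection_size / ideal_size


def catalog_completeness(objects_without_metadata, collection_size):
    """Assumed ratio: share of digital objects that have item-level metadata."""
    if collection_size <= 0:
        raise ValueError("collection size must be positive")
    return 1.0 - objects_without_metadata / collection_size


# 50 objects held out of an ideal 200; 10 of 100 objects lack metadata.
held = collection_completeness(50, 200)
described = catalog_completeness(10, 100)
```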

Information about the Repository
Completeness
  – # of collections
Consistency
  – # of collections
  – Catalog/collection match
    How well do the catalogs match the collections?
    Are the catalogs for all the collections at the same level of detail?

Service Information Need
Composability (ability to be combined to form new services)
  – Extensibility
  – Reusability
Efficiency
  – Response time
Effectiveness
  – Precision/recall (of search)
  – Classification

Service Information, continued
Extensibility
  – # extended services
  – # services in the DL
  – # lines of code per service manager
Reusability
  – # reused services
  – # services in the DL
  – # lines of code per service manager
Reliability
  – # service failures
  – # accesses
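
As an illustration of how two of these counts could be combined, the sketch below treats reliability as the fraction of accesses that did not fail. This formula is an assumption; the slide only names the inputs.

```python
def reliability(service_failures, accesses):
    """Assumed measure: 1 - (# service failures / # accesses)."""
    if accesses <= 0:
        raise ValueError("need at least one access to estimate reliability")
    return 1.0 - service_failures / accesses


# 5 failures observed in 100 logged accesses.
search_reliability = reliability(5, 100)
```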

Making it more concrete
Each of the measures listed gives an idea of the information need.
Exactly what do we measure?
How do we combine the numbers obtained to get a usable result?
The following pages describe specific measures and formulas for combining them.

Digital Object Accessibility
Basic requirement
  – If a user cannot access the DO, there is little point in having it in the DL
  – Identified measures: collection, # structured streams, rights management metadata, communities
  – Say it another way:
    Is it present in a collection in the repository?
    Is there a service that can retrieve and display the content?
    Is the rights management open enough for access by this user?

Digital Object Accessibility - formally
Define $do_x$ = a specific digital object and $ac_y$ = an actor
$$\mathrm{Acc}(do_x, ac_y) = \begin{cases} 0, & \text{if there is no collection } C \text{ in the DL repository } R \text{ such that } do_x \in C \\[4pt] \dfrac{\sum_{z \in \mathrm{struct\_streams}(do_x)} r_z(ac_y)}{|\mathrm{struct\_streams}(do_x)|}, & \text{otherwise} \end{cases}$$
where $r_z(ac_y)$ is a rights management rule defined as
  – 1, if $z$ has no access constraints, or $z$ has access constraints and $ac_y \in cm_z$, where $cm_z \in Soc(1)$ is a community that has the right to access $z$
  – 0, otherwise
This does not deal with accessibility related to accessing the streams
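
The definition can be sketched in a few lines of Python. The data representation is an assumption made for illustration: each structured stream is modeled as either `None` (no access constraints) or a set of community names allowed to read it, and the actor is modeled as the set of communities it belongs to. Returning 0 for an object with no streams is also an assumption, since the formula is undefined in that case.

```python
def accessibility(streams, actor_communities, in_collection=True):
    """Fraction of an object's structured streams the actor may access.

    streams: list where each entry is None (unconstrained stream) or a set
             of community names with access rights (hypothetical encoding).
    actor_communities: set of community names the actor belongs to.
    in_collection: False if no collection in the repository holds the object.
    """
    if not in_collection or not streams:
        return 0.0
    permitted = 0
    for allowed in streams:
        # r_z = 1 when the stream is unconstrained or the actor belongs
        # to a community that has the right to access it.
        if allowed is None or actor_communities & allowed:
            permitted += 1
    return permitted / len(streams)


# One open stream and one stream restricted to a "campus" community.
example_streams = [None, {"campus"}]
outside = accessibility(example_streams, {"public"})
inside = accessibility(example_streams, {"campus"})
```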

An illustration
NDLTD is the Networked Digital Library of Theses and Dissertations
  – Some institutions require that all theses and dissertations be stored in this DL
  – The student chooses how visible to make the document.
    Parts of the document may be visible while other parts are not.
    The document, or parts of it, may be visible to a restricted community.

Accessibility case
$etd_x$ is a specific electronic thesis or dissertation of interest
$\mathrm{Acc}(etd_x, ac_y)$ is
  – 0 if it is not in the collection
  – otherwise $\dfrac{\sum_{z \in \mathrm{struct\_streams}(etd_x)} r_z(ac_y)}{|\mathrm{struct\_streams}(etd_x)|}$
where $r_z(ac_y)$
  – = 1 if $etd_x$ is marked "world wide access", or $etd_x$ is marked "local institution only" and $ac_y \in C$, where $C$ is defined as identifiable members of the local institution
  – = 0 otherwise

With the numbers
An example from VT
For authors whose names begin with A (219 entries):
  – Unrestricted ETDs: 164
  – Restricted ETDs: 50
  – Mixed ETDs: 5 (fraction unrestricted: 0.5, 0.5, 0.167, [value missing], 0.6)
Overall measure of accessibility outside VT:
  – (164 × 1.0 + 50 × 0 + sum of the mixed-ETD fractions) / 219
  – ≈ 0.76
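
One of the five mixed-ETD fractions did not survive in this transcript, so the total cannot be recomputed exactly. The sketch below, which assumes unrestricted ETDs score 1 and restricted ETDs score 0 for a reader outside VT, bounds the result by letting the missing fraction range over 0 to 1.

```python
unrestricted, restricted, total = 164, 50, 219
known_mixed = [0.5, 0.5, 0.167, 0.6]  # the fifth fraction is missing


def overall_accessibility(missing_fraction):
    """Average accessibility over all entries for a reader outside VT."""
    score = (unrestricted * 1.0      # fully open: every stream readable
             + restricted * 0.0      # fully restricted: nothing readable
             + sum(known_mixed)
             + missing_fraction)
    return score / total


low = overall_accessibility(0.0)
high = overall_accessibility(1.0)
print(round(low, 2), round(high, 2))
```

Under these assumptions both bounds round to the 0.76 reported on the slide, so the missing value does not change the result at two decimal places.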

Solidifying Pertinence
How do we measure something like pertinence?
Pertinence is the relation between the information content of a digital object and the need of the user.
It depends on the user's situation: background, current context, etc.

Pertinence
$Inf(do_i)$ represents the information content of digital object $i$
$IN(ac_j)$ is the information need of actor (user) $ac_j$
$Context(ac_j, k)$ is the combined effect of social factors that determine the pertinence of $do_i$ to $ac_j$ at time $k$
Two communities of actors
  – Users, whose information needs we try to satisfy
  – External judges, who are responsible for judging the relevance of a document in response to a query
  – Non-overlapping groups

Pertinence formula
$Pertinence(do_i, ac_j, k)$: $Inf(do_i) \times IN(ac_j) \times Context(ac_j, k)$, defined as
  – 1 if $Inf(do_i)$ is judged by $ac_j$ to be informative with regard to $IN(ac_j)$ in context $Context(ac_j, k)$
  – 0 otherwise
A rather complex way to say that the information is relevant if either the user or a qualified independent judge says it is

Preservability
Property of a digital object that describes its state relative to changes in hardware, software, and representation format standards
  – Ex. new recording technologies (replacement of VHS video tapes by DVDs)
  – New versions of software such as Word or Acrobat
  – New image standards such as JPEG 2000

Digital preservation techniques
Migration
  – Transform from one format to another
    Ex. open the document in one format and save it in another, or do an automated transformation
Emulation
  – Reproduce the effect of the environment originally used to display the material
    Keep an old version of the software, or have new software that can read the old format
Wrapping
  – Keep the original format, but add enough human-readable metadata so that it can be decoded in the future
    Note that the material is not directly usable
Refreshing
  – Copy the stream of bits from one location to another
    Particularly suitable for guarding against the physical deterioration of the medium
    Most commonly used

Preservability issues
Obsolescence
  – How out of date is the digital object? Many versions of the software? Old storage media?
  – Difficult to migrate? Appropriate tools? Expertise?
Fidelity
  – How different is the migrated version from the original?
  – Distortion = loss of information
Preservability of a digital object in a digital library is a function of the fidelity of the migration and the obsolescence of the object
$Preservability(do_i, dl) = (fidelity(do_i, format_x, format_y),\ obsolescence(do_i, dl))$
  – Two values to reflect the two dimensions of the concept: fidelity and obsolescence
Miniclip, Internet Archive

Preservability factors
Capital direct costs
  – Software
    Developing software to create new versions of the object, or obtaining licenses for new versions of the original software
  – Hardware
    For processing the migration and for storing the results
Indirect operating costs
  – Monitoring digital objects for migration needs
  – Maintaining up-to-date intellectual property rights
  – Storage
  – Staff training

Calculating Obsolescence
$obsolescence(do_i, dl)$ = cost of converting/migrating the digital object $do_i$ within the context of a specific digital library

Calculating fidelity
Fidelity is the inverse of distortion.
$$fidelity(do_i, format_x, format_y) = \frac{1}{distortion(mp(format_x, format_y)) + 1.0}$$
One common measure of distortion
  – mean squared error (mse)
Let $\{x_n\}$ be a stream of $do_i$ and $\{y_n\}$ be the converted stream
$$mse(\{x_n\}, \{y_n\}) = \frac{1}{N} \sum_{n=1}^{N} (x_n - y_n)^2$$
Use mse for distortion:
$$fidelity(do_i, format_x, format_y) = \frac{1}{mse(\{x_n\}, \{y_n\}) + 1.0} = \frac{1}{\frac{1}{N} \sum_{n=1}^{N} (x_n - y_n)^2 + 1.0}$$
No distortion must yield a fidelity of 1.0
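
A minimal sketch of the fidelity calculation, assuming the original and converted streams are available as equal-length sequences of numeric samples (the function names are illustrative):

```python
def mse(original, converted):
    """Mean squared error between two equal-length sample streams."""
    if len(original) != len(converted) or len(original) == 0:
        raise ValueError("streams must be non-empty and of equal length")
    squared = sum((x - y) ** 2 for x, y in zip(original, converted))
    return squared / len(original)


def fidelity(original, converted):
    """fidelity = 1 / (distortion + 1.0), using mse as the distortion."""
    return 1.0 / (mse(original, converted) + 1.0)


identical = fidelity([1, 2, 3], [1, 2, 3])   # no distortion
degraded = fidelity([0, 0], [2, 2])          # mse = 4
```

Adding 1.0 in the denominator keeps the value defined when the streams are identical and caps fidelity at 1.0.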

A Preservation Scenario
From Gonçalves, adapted from one of his sources
A librarian learns that a special collection of 1,000 digital images, stored in TIFF v5.0, is in danger of obsolescence because the latest version of the display software does not support that version.
The librarian decides to migrate all images to JPEG 2000, now the de facto image preservation standard, recommended by the Research Libraries Group (RLG).
The librarian searches for options and finds a tool costing $500 that converts TIFF 5.0 to JPEG 2000.
About 20 hours are needed to order, install, learn, and apply the software to all images, at an hourly rate of $66.60 per library employee.
To save space, the librarian chooses a compression rate that produces an average mse = 8 per image.
Preservability of each image = preservability(image-TIFF5.0, dl) = (1/(8 + 1), ($500 + $66.60 × 20)/1000) = (0.11, $1.83)
  – First value: fidelity = 1/(distortion + 1)
  – Second value: obsolescence cost = (tool cost + hourly rate × hours) / # images
Higher fidelity is better (1.0 means no distortion); lower obsolescence cost is better.
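
The arithmetic of the scenario can be checked with a short sketch. The function and parameter names are illustrative; the numbers are those given in the scenario.

```python
def preservability(avg_mse, tool_cost, hourly_rate, hours, n_objects):
    """Return the (fidelity, obsolescence cost per object) pair."""
    fidelity = 1.0 / (avg_mse + 1.0)
    obsolescence = (tool_cost + hourly_rate * hours) / n_objects
    return fidelity, obsolescence


fid, cost = preservability(avg_mse=8, tool_cost=500.0,
                           hourly_rate=66.60, hours=20, n_objects=1000)
print(round(fid, 2), round(cost, 2))
```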

Relevance
$Relevance(do_i, q)$
  – = 1 if $do_i$ is judged by an external judge to be relevant to query $q$
  – = 0 otherwise
In practice it is estimated by a measure of the distance between the vector representing the object and the vector representing the query.
The "external judge" requirement makes the measure objective and independent of local contextual issues.
Relevance has a consistency, independent of the momentary information need.
Pertinence is a measure of usefulness within a particular information need.
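
The slide does not say which vector measure is used. As one common choice, the sketch below computes cosine similarity between term-weight vectors for an object and a query; the sparse dictionary representation and the weights are assumptions for illustration.

```python
import math


def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between two sparse term-weight vectors.

    Each vector is a dict mapping a term to its weight (for example a
    tf-idf weight). Returns 0.0 when either vector is all zeros.
    """
    shared = set(doc_vec) & set(query_vec)
    dot = sum(doc_vec[t] * query_vec[t] for t in shared)
    norm_doc = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_query = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_doc == 0.0 or norm_query == 0.0:
        return 0.0
    return dot / (norm_doc * norm_query)


same_direction = cosine_similarity({"quality": 1.0}, {"quality": 2.0})
no_overlap = cosine_similarity({"quality": 1.0}, {"metadata": 1.0})
```

A score like this only estimates relevance; the binary definition above still depends on the external judge's decision.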

Significance
Significance is an expression of the absolute usefulness of a given digital object, independent of particular user needs.
Citation records of objects in digital libraries offer one measure of significance. (This disadvantages the most recently obtained objects, since they have had less time to be cited by others.)
Look at the ACM DL and its citation counts, for example.

Life Cycle and Quality
The quality indicators relate to the core components of a digital library: creation, use, finding, distribution.
Creation
  – Authoring, modifying
  – Describing, organizing, indexing
Use
  – Access, filtering
Finding (seeking)
  – Searching, browsing, recommending
Distribution
  – Storing
  – Archiving
  – Networking

Quality and Lifecycle - 2

Quality and Life Cycle - 3
Note that some elements repeat
  – Timeliness is relevant to the content and to the metadata that describes the content
  – Accessibility affects both usefulness and distribution

References
Gonçalves, M. A., Moreira, B. L., Fox, E. A., and Watson, L. T. "Quality Model for Digital Libraries".