Digital Libraries Prof. Marcos Andre Goncalves Universidade Federal de Minas Gerais
Synchronous Scholarly Communication Same time, Same or different place
Asynchronous, Digital Library Mediated Scholarly Communication Different time and/or place
Digital Libraries Shorten the Chain from Editor Publisher A&I Library Reviewer
DLs Shorten the Chain to Author Reader Digital Library Editor Reviewer Teacher Learner Librarian
DL Overview Why of Global Interest? National projects can preserve antiquities and heritage: cultural, historical, linguistic, scholarly Knowledge and information are essential to economic and technological growth, education DL - a domain for international collaboration –wherein all can contribute and benefit –which leverages investment in networking –which provides useful content on Internet & WWW –which will tie nations and peoples together more strongly and through deeper understanding
Digital Libraries --- Objectives World Lit.: 24hr / 7day / from desktop Integrated “super” information systems: 5S: Table of related areas and their coverage Ubiquitous, Higher Quality, Lower Cost Education, Knowledge Sharing, Discovery Disintermediation -> Collaboration Universities Reclaim Property Interactive Courseware, Student Works Scalable, Sustainable, Usable, Useful
How is a DL different from a database? A traditional SQL database has as its basic element data items in a relation: – select name – from employee, project – where employee.deptnumber = “25” AND – project.number = “100” databases exploit known structures and relations DBMS retrieval is not probabilistic (Frakes, Baeza-Yates, p. 3)
How is a DL different from the WWW? The keyword is managed –The WWW is not managed Some meta searchers (Yahoo, Lycos) attempt to add an organizational framework to their web holdings –However, most are focused on keyword searching (i.e., Google)
How is a DL different from the WWW? Another key difference is who controls the input into the system –most meta searchers hunt down their holdings Lycos is short for Lycosidae lycosa (the wolf spider), which pursues its prey and does not build a web (Mauldin, IEEE Expert, 1/97) –some (Yahoo) have humans in the loop for review and classification To date, DLs are generally more tightly controlled, and have a targeted customer set
DL = Content + Services Why not just use the WWW ? – WWW by itself has low archival & management characteristics Why not use a RDBMS? – In the same way that a card catalog is not a TL, a RDBMS is candidate technology for use in DLs DL is the union of the content and services defined on the content
How is a DL Different from a Traditional Library? TL has as its focus physical objects – even if the card catalog (metadata) is electronic, the purpose is to point you to a physical location – trafficking in physical objects has both obvious and subtle implications object can exist only in 1 place if you have it, I cant have it (zero-sum distribution) I have to go to the object, or wait for it to come to me
TLs vs. DLs DLs clearly better than TLs at: –Dissemination, storing information variety However, TL objects are more survivable –Who will archive the research information? the publishers? the institutions? the authors? –Will the average DL object still be accessible in 10 years? take my digital preservation seminar in the spring! image from:
Digital Library – removing the physical restriction has obvious benefits multiple access, multiple listings, electronic transmission – also complicates many other issues... intellectual property, terms and conditions, etc. Note that a TL offers additional social and educational benefits – Most TLs also offer hybrid services too. How is a DL Different from a Traditional Library?
DL Definitions - 1 “A digital library is an organized and focused collection of digital objects, including text, images, video, and audio, along with methods of access and retrieval, and for selection, creation, organization, maintenance, and sharing of the collection.” Witten & Bainbridge – “How to Build a Digital Library” – Morgan Kaufmann 2003
DL Definitions - 2 “Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities” Waters,D.J. CLIR Issues, July/August
Informal 5S & DL Definitions DLs are complex systems that help satisfy info needs of users (societies) provide info services (scenarios) organize info in usable ways (structures) present info in usable ways (spaces) communicate info with users (streams)
5Ss SsExamplesObjectives Streams Text; video; audio; image Describes properties of the DL content such as encoding and language for textual material or particular forms of multimedia data Structures Collection; catalog; hypertext; document; metadata Specifies organizational aspects of the DL content Spaces Measure; measurable, topological, vector, probabilistic Defines logical and presentational views of several DL components Scenarios Searching, browsing, recommending Details the behavior of DL services Societies Service managers, learners, teachers, etc. Defines managers, responsible for running DL services; actors, that use those services; and relationships among them
5S and DL formal definitions and compositions (April 2004 TOIS)
ETANA-DL Archaeological DL Integrated DL –Heterogeneous data handling Applies and extends the OAI-PMH –Open Archives Initiative Protocol for Metadata Handling Design considerations –Componentized –Extensible –Portable
Map courtesy: Initial ETANA-DL Member Locations Virginia Tech Mississippi State University Vanderbilt University Canadian University College Walla Walla College Andrews University CWRU Willamette University
Lahav Website
Megiddo Opening Screen
Locus Screen: Pictures View all
Area Screen
ETANA-DL Approach Applying and extending Digital Library (DL) techniques to solve key problems: making primary data available, data preservation, and interoperability Modeling archaeological information systems using 5S to better understand the domain and design the system and the supporting services Rapidly prototyping DLs that handle heterogeneous archaeological data using componentized frameworks: –eliciting requirements –refining metamodel and union schema –modeling sites –mapping –harvesting –providing useful services
ETANA-DL Website
Marking – writing notes for a specific user Marking Items
Marked Items Display Sender, Date, Object OAI ID Sender Comments Options: View Record, Add record to Items Of Interest, Re-mark item (Redirect), Unmark item (Remove item from list)
Discussions Page Discussions about an object View/Post messages, create new threads
Recommendations Items recommended on the basis of similar interests
ETANA-DL Searching Service Search
ETANA-DL Multi-dimensional Browsing 3 new sites 2 new types of artifacts
ETANA-DL Visual Browsing Service Visual Browse By site
Visual Browsing Nimrin: Topographical Drawings Full siteNorth west quadrant Square: N40/W20
Visual Browsing Nimrin : Square information Square: N40/W20 Locus: 86 Loci layout
Visual Browsing Nimrin : locus sheet
Visual Browsing Bab edh-Dhra' Cemetery Pottery # 25
Visual Browsing Bab edh-Dhra' Cemetery Pottery # 25
ETANA Societies 1.Historic and pre-historic societies (being studied) 2.Archaeologists (in academic institutes, fieldwork settings, or local and national governmental bodies) 3.Project directors 4.Technical staff (consisting of photographers, technical illustrators, and their assistants) 5.Field staff (responsible for the actual work of excavation) 6.Camp staff (e.g., camp managers, registrars, tool stewards) 7.General public (e.g., educators, learners, citizens)
ETANA Societies Social issues 1.Who owns the finds? 2.Where should they be preserved? 3.What nationality and ethnicity do they represent? 4.Who has publication rights? 5.What interactions took place between those at the site studied, and others? What theories are proposed by whom about this?
ETANA Scenarios 1.Life in the site in former times 2.Digital recording: the planning stage and the excavation stage 3.Planning stage: remote sensing, fieldwalking, field surveys, building surveys, consulting historical and other documentary sources, and managing the sites and monuments 4.Excavation 1.Detailed information is recorded, including for each layer of soil, and for features such as pole holes, pits, and ditches. 2.Data about each artifact is recorded together with information about its exact find spot. 3.Numerous environmental and other samples are taken for laboratory analysis, and the location and purpose of each is carefully recorded. 4.Large numbers of photographs are taken, both general views of the progress of excavation and detailed shots showing the contexts of finds. 5.Organization and storage of material 6.Analysis and hypotheses generation and testing 7.Publications, museum displays 8.Information services for the general public
ETANA Spaces 1.Geographic distribution of found artifacts 2.Temporal dimension (as inferred by archaeologists) 3.Metric or vector spaces 1.used to support retrieval operations, and to calculate distance (and similarity) 2.used to browse / constrain searches spatially 4.3D models of the past, used to reconstruct and visualize archaeological ruins 5.2D interfaces for human-computer interaction
ETANA Structures 1.Site Organization 1.Region, site, partition, sub-partition, locus, … 2.Temporal orderings (ages, periods) 3.Taxonomies 1.for bones, seeds, building materials, … 4.Stratigraphic relationships 1.above, beneath, coexistent
ETANA Streams 1.successive photos and drawings of excavation sites, loci, unearthed artifacts 2.audio and video recordings of excavation activities and discussions 3.textual reports 4.3D models used to reconstruct and visualize archaeological ruins.
Streams Multiple media types and representation –See ch. 4 for IR (except some here for non-text) –Standards for each, and for some combinations Text –Character strings, encoding (Unicode) –Morphology -> Stemming –Syntax, semantics -> stop words –** POS tagging, phrases Images, Audio, Video, Graphics, Animation –Capture, digitization, representation –CBIR for each ** Compression, processing, analysis **Synchronization, rendering, presentation, interchange –RealVideo, SMIL, QoS
Content Based Information Retrieval
Problems Image similarity is subjective –Personal Interpretation Concept x Appearance
By Visual features –Retrieve images with 50 percent of white colour and 50 percent of black colour
Textual information retrieval Query on Google using Sunset and Rio de Janeiro Query result
Image Classification by shape
Work of Torres et al Search in collections of fish images using combination of image properties (CBIR) and textual descriptions
Motivation Query 1: –List all metadata related to fish which were observed in the Amazon River Query 2: –Retrieve images of fishes whose shape is similar to that in the example oQuery 3: List all metadata related to fishes that were observed in the Amazon River and whose shape is similar to that in the example
Motivation Retrieve fish descriptions whose shapes are similar to the one shown below, that belong to the “Notropis” genre, that have large yes” e and that have been observed in the “Tennessee River”
Problem There is no BIodiversity Information System which allow queries involving : –Geographic data –Species metadata –Image Descriptors Existing systems: –Metadada or –Metadada + spatial data –Images are stored as separate files With no possibilty of retrieval by content
WeBioS
Torres: Visualizations Spiral Pattern Concentric Rings Pattern
Structures Digital Objects –Documents, digitization, packaging (METS), interchange, standards, format conversion –Genre: plays, encyclopedia, dictionaries, educational resources: courses (e.g., syllabi) and lessons –Structural organizations (books, chapters, sections), excerpts/spans (mark, superimposed info) Metadata: standards, markup Knowledge Structures & Representations –Databases, Schema, Ontologies, Thesauri, Lexicons, Authority files, Concept maps, Semantic networks Indexes –Inverted files, signature files, R-trees, Quad trees, etc. Clusters & Classification Schemes
Degree of Structure Chaotic OrganizedStructured WebDLsDBs
Digital Objects (DOs) Born digital Digitized version of “real” object –Is the DO version the same, better, or worse? –Decision for ETDs: structured + rendered Surrogate for “real” object –Not covered explicitly in metamodel for a minimal DL –Crucial in metamodel for archaeology DL
Metadata Objects (MDOs) MARC Dublin Core RDF IMS OAI (Open Archives Initiative) Crosswalks, mappings Ontologies Topics maps, concept maps
Complex to Simple MARC ($50)Dublin Core (DC) + thesis
Spaces Retrieval models –Boolean, extended Boolean –Vector, LSI –Probabilistic: classical, belief network, inference network, language models User interfaces and visualization
2D interfaces 3D interfaces GIS Other paradigms
Scenarios Recall OO for streams – now have objects as well as scenarios – ex interface components Information Access –Searching: ad hoc, filtering/routing –Browsing: using an organization, using a visualization, using links (i.e., hypertext, hypermedia) –Workflow: sessions, feedback, etc. Scenario-based Design Usability: goals, tasks, claims NOTE: this is covered in the outline
Societies User communities –Authors, editors, teachers, students, readers –Personal(ization), group(ware), community, global –Accessibility, universal access Librarians: reference, acquisition, operations Research community –Associations, conferences, publications, labs, projects Economics –Copyright, intellectual property rights, digital rights management, authorization, authentication, security, privacy, self-archiving (eprints) –Publishers, catalogers, distributors, sustainability –Open source, commercial, hybrid