Chapter Three Presentation: User interface How to Build a Digital Library Ian H. Witten and David Bainbridge.


1 Chapter Three
Presentation: User interface
How to Build a Digital Library
Ian H. Witten and David Bainbridge

2 Questions
- What form are the documents in?
- What structure do the documents have?
- Which kinds of access do you want to provide?
- What metadata is available?
- How do you want to present the documents?

3 Presenting Documents
- Structured documents (hierarchy)
- Unstructured text documents
- Page images
- Page images and extracted text
- Audio and photographic images
- Video
- Music
- Foreign Language

4 Hierarchically Structured Text
- Table of contents
- Chapter, section, subsection, etc.
- Granularity of document?
- Example: Humanity Development Library

5 Unstructured Text
- Long scroll of plain text
- Structure unknown to the digital library system
- Browsing is less convenient
- Pages of document may not correspond to physical pages of book
- Example: Project Gutenberg Collection

6 Page Images
- Digitized images of the document’s pages
- Document accuracy
  - OCR is error-prone
  - Duplicating layout is difficult
- Space requirements
  - Requires 20 times more storage space than text
  - Increased download time
- Need for text representation for searching
  - Difficult to highlight search terms on an image

7 Page Images and Extracted Text
- Provide page images and extracted text
  - Search on extracted text
  - View image or extracted text
- Example: Maori Newspaper Collection

8 Other Document Types
- Audio and photographic images
  - Example: Oral History Collection
- Video
  - Example: Music Video Collection
- Music
  - Representations: printed notation, MIDI, synthesized performance, human performance
  - Example: Music Digital Library
- Multiple Languages
  - Interface and/or documents
  - Example: Arabic Collection

9 Metadata
- Provides information to facilitate access
- Structured
- Standardized

10 Metadata Examples
- Conventional bibliographic listing
  - Title
  - Author
  - Date
  - Publication
  - Volume Number
  - Issue Number
  - Page Numbers
- MARC
- Dublin Core
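As a rough illustration of how a conventional bibliographic listing relates to Dublin Core, the sketch below maps a few of the listed fields onto Dublin Core element names. The mapping table, the `to_dublin_core` function, and the sample record are all hypothetical and chosen for illustration only; real collections may map these fields differently, and fields with no obvious counterpart in this table are simply returned unmapped.

```python
# Hypothetical mapping from conventional bibliographic fields to
# Dublin Core element names (one possible choice, not a standard crosswalk).
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "date": "date",
    "publication": "source",
}


def to_dublin_core(record):
    """Split a record into Dublin Core elements and unmapped leftovers."""
    dc = {}
    unmapped = {}
    for field, value in record.items():
        element = FIELD_TO_DC.get(field)
        if element is None:
            # No counterpart in the table above: keep it aside untouched.
            unmapped[field] = value
        else:
            # Dublin Core elements are repeatable, so store a list per element.
            dc.setdefault("dc:" + element, []).append(value)
    return dc, unmapped


# Made-up record used only to exercise the function.
SAMPLE = {
    "title": "An Example Article",
    "author": "A. Author",
    "date": "2001",
    "publication": "Journal of Examples",
    "volume": "3",
    "issue": "2",
    "pages": "10-20",
}

dc_record, leftovers = to_dublin_core(SAMPLE)
```

Storing each element as a list reflects that Dublin Core elements may repeat, for example when a document has several creators.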

11 Metadata Aspects
- Historical
  - Describes provenance and preservation history
- Functional
  - Describes usage, condition and audience
- Technical
  - Describes interoperability requirements
- Relational
  - Describes links and citations
- Intellectual
  - Describes content or subject

12 Searching
- Types of query
- Case-folding and stemming
- Phrase searching
- Different query interfaces

13 Types of Queries
- Boolean Queries
  - Combine terms with AND, OR, and NOT
  - Exact match
- Ranked Queries
  - List of terms to find
  - Inexact match
  - Relevance ranking by some heuristic measure
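The difference between the two query types can be sketched over a tiny in-memory collection. Everything below is illustrative: the three sample documents, the function names, and the scoring formula (term frequency weighted by a simple inverse document frequency) are assumptions standing in for "some heuristic measure", not the measure any particular system uses.

```python
import math
from collections import Counter

# Made-up collection: document id -> text.
DOCS = {
    1: "digital library software",
    2: "library of digital music",
    3: "music notation software",
}


def tokenize(text):
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, terms):
    """Exact match: documents containing every term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, terms):
    """Exact match: documents containing at least one term."""
    return set().union(*(index.get(t, set()) for t in terms))


def boolean_not(index, all_ids, term):
    """Exact match: documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())


def ranked(docs, index, terms):
    """Inexact match: score every document that contains any query term."""
    n = len(docs)
    scores = Counter()
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        for t in terms:
            if counts[t]:
                # Rarer terms contribute more to the score.
                idf = math.log(1 + n / len(index[t]))
                scores[doc_id] += counts[t] * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


INDEX = build_index(DOCS)
```

A Boolean query returns an unordered set of documents that satisfy the expression exactly, whereas the ranked query returns every partially matching document in score order.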

14 Case Folding and Stemming
- Case folding
  - Upper case folded to lower case
  - Not relevant to some languages
- Stemming
  - Reducing a word to its root form
  - Morphological reduction
  - Not appropriate for all parts of documents
  - Language dependent
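A minimal sketch of both normalizations follows. The stemmer here is a deliberately naive English suffix stripper written for illustration; it is not a real morphological stemmer such as Porter's, and the suffix list and minimum stem length are assumptions. The `stem` flag illustrates that stemming can be switched off for parts of a document where it is not appropriate, such as names.

```python
# Illustrative English suffixes, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def fold_case(word):
    """Fold upper case to lower case."""
    return word.casefold()


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text, stem=True):
    """Case-fold every word and optionally stem it."""
    words = [fold_case(w) for w in text.split()]
    return [naive_stem(w) if stem else w for w in words]
```

For example, `normalize("Searching searched Searches")` reduces all three words to `search`, while `normalize("Witten Libraries", stem=False)` only folds case.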

15 Phrase Searching
- Searching for a contiguous group of words
- Two types of phrase searching:
  - Postretrieval scan
    - Determine if terms are consecutive by looking inside documents containing query terms
    - Smaller index, slower
    - Proximity searching is more difficult
  - Word-level index
    - Index contains word number and document number
    - Determines if terms are consecutive by comparing indexes
    - Larger index, faster
- Phrases containing punctuation and white space?
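The two approaches can be compared in a short sketch. The sample documents and all function names are hypothetical; tokenization is plain whitespace splitting, so punctuation handling (the open question on the slide) is deliberately left out.

```python
# Made-up collection: document id -> text.
DOCS = {
    1: "how to build a digital library",
    2: "a library of digital music",
    3: "the digital library interface",
}


def tokenize(text):
    return text.lower().split()


def build_document_index(docs):
    """Smaller index: term -> set of document numbers."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(tokenize(text)):
            index.setdefault(term, set()).add(doc_id)
    return index


def build_word_index(docs):
    """Larger index: term -> {document number: [word numbers]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def phrase_postretrieval(docs, doc_index, phrase):
    """Find candidate documents, then scan their text for the phrase."""
    terms = tokenize(phrase)
    if not terms:
        return set()
    candidates = set.intersection(*(doc_index.get(t, set()) for t in terms))
    hits = set()
    n = len(terms)
    for doc_id in candidates:
        words = tokenize(docs[doc_id])
        if any(words[i:i + n] == terms for i in range(len(words) - n + 1)):
            hits.add(doc_id)
    return hits


def phrase_word_level(word_index, phrase):
    """Compare word numbers in the index without reading the documents."""
    terms = tokenize(phrase)
    if not terms:
        return set()
    hits = set()
    for doc_id, starts in word_index.get(terms[0], {}).items():
        for start in starts:
            # The k-th term must occur exactly k words after the first term.
            if all(start + k in word_index.get(t, {}).get(doc_id, ())
                   for k, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


DOC_INDEX = build_document_index(DOCS)
WORD_INDEX = build_word_index(DOCS)
```

Both functions return the same documents; the postretrieval scan has to reopen each candidate document, while the word-level index answers from the stored positions alone at the cost of a larger index.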

16 Different Query Interfaces
- Ranked or Boolean
- Fielded or non-fielded
- Case-folding and/or stemming
- Ranked or natural order result list
- Use search history or not

17 Browsing
- Based on metadata
- Browsing alphabetical lists
  - Chinese is not alphabetic
- Browsing by date
- Browsing structures
  - Hierarchical classification structures
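Browsing an alphabetical list amounts to sorting documents on a metadata value and grouping them by initial letter. The sketch below assumes title metadata in an alphabetic script; the function name is hypothetical, and the sample titles are collection names mentioned elsewhere in these slides. As the slide notes, this scheme does not carry over to non-alphabetic scripts such as Chinese.

```python
from itertools import groupby


def alphabetical_buckets(titles):
    """Group titles under their first letter, ignoring case."""
    ordered = sorted((t for t in titles if t), key=str.casefold)
    return {
        letter: list(group)
        for letter, group in groupby(ordered, key=lambda t: t[0].upper())
    }


TITLES = [
    "Music Digital Library",
    "Arabic Collection",
    "Maori Newspaper Collection",
    "Oral History Collection",
]

BUCKETS = alphabetical_buckets(TITLES)
```

Browsing by date follows the same pattern with a date value as the sort and grouping key.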

18 Phrase Browsing
- Phrase: any sequence of words appearing more than once in the collection
- Automatic phrase extraction
- Key phrases
- Phrase browser
  - Phrase hierarchy
  - Sorted by document and collection frequencies
  - Leaves are documents
- Example: The Complete Works of Shakespeare
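The slide's definition of a phrase can be illustrated with brute-force counting of word sequences. This is only a sketch of the definition, not the extraction algorithm behind the phrase browser: the function name, the `max_len` cap, and the decision to start at two-word sequences are assumptions made to keep the example small.

```python
from collections import Counter


def repeated_phrases(docs, max_len=3):
    """Return word sequences (2..max_len words) seen more than once."""
    counts = Counter()
    for text in docs:
        words = text.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Keep only sequences that occur more than once in the collection.
    return {phrase: c for phrase, c in counts.items() if c > 1}


# Made-up two-document collection.
SAMPLE_DOCS = ["to be or not to be", "not to be trusted"]
PHRASES = repeated_phrases(SAMPLE_DOCS)
```

Here "to be" occurs three times across the collection and "not to be" twice, so "to be" could sit above "not to be" in a phrase hierarchy sorted by collection frequency.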

19 Browsing Using Extracted Metadata
- Acronyms
  - Example: Acronym Extraction Demo
- Language identification
  - Example: Language Extraction Demo
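One simple heuristic for extracting acronyms as metadata is to look for a capitalized token in parentheses whose letters match the initials of the words just before it. The sketch below is an assumption for illustration and is not the method used by the Acronym Extraction Demo; the pattern and function name are hypothetical.

```python
import re

# A run of two or more capital letters inside parentheses, e.g. "(OCR)".
ACRONYM = re.compile(r"\(([A-Z]{2,})\)")


def extract_acronyms(text):
    """Map each acronym to the preceding words whose initials spell it."""
    found = {}
    for match in ACRONYM.finditer(text):
        acronym = match.group(1)
        words = text[: match.start()].split()
        candidate = words[-len(acronym):]
        # Accept only if there is one preceding word per acronym letter
        # and every initial matches.
        if len(candidate) == len(acronym) and all(
            w[0].upper() == letter for w, letter in zip(candidate, acronym)
        ):
            found[acronym] = " ".join(candidate)
    return found
```

Applied to "Text comes from optical character recognition (OCR) software.", the function pairs `OCR` with "optical character recognition"; a parenthesized token whose letters do not match the preceding initials is ignored.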

