Published by Herbert McCoy. Modified over 8 years ago.
1
Presenting Documents
How to Build a Digital Library
Ian H. Witten and David Bainbridge
2
Questions
- What form are the documents in?
- What structure do the documents have?
- Which kinds of access do you want to provide?
- What metadata is available?
- How do you want to present the documents?
3
Presenting Documents
- Structured documents (hierarchy)
- Unstructured text documents
- Page images
- Page images and extracted text
- Audio and photographic images
- Video
- Music
- Foreign Language
4
Hierarchically Structured Text
- Table of contents
- Chapter, section, subsection, etc.
- Granularity of document?
- Example: Humanity Development Library
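As a rough illustration of how a table of contents can be derived from a chapter/section/subsection hierarchy, the sketch below models a document as a tree and walks it to produce numbered entries. The `Section` class, the `table_of_contents` function, and the sample titles are hypothetical names chosen for this example; they are not taken from any particular digital library system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """One node in a document hierarchy (chapter, section, subsection, ...)."""
    title: str
    children: List["Section"] = field(default_factory=list)


def table_of_contents(section, prefix=""):
    """Return indented, numbered TOC lines for the descendants of `section`."""
    lines = []
    for i, child in enumerate(section.children, start=1):
        number = f"{prefix}{i}"
        depth = number.count(".")  # 0 for chapters, 1 for sections, ...
        lines.append(f"{'  ' * depth}{number} {child.title}")
        lines.extend(table_of_contents(child, prefix=number + "."))
    return lines


# Hypothetical sample document.
book = Section("Book", [
    Section("Chapter A", [Section("Section A.1")]),
    Section("Chapter B"),
])
toc = table_of_contents(book)
```

The granularity question on the slide shows up here as a design choice: any node of the tree (whole book, chapter, or section) could be treated as the unit that is indexed and returned as a "document".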
5
Unstructured Text
- Long scroll of plain text
- Structure unknown to the digital library system
- Browsing is less convenient
- Pages of document may not correspond to physical pages of book
- Example: Project Gutenberg Collection
6
Page Images
- Digitized images of the document’s pages
- Document accuracy
  - OCR is error-prone
  - Duplicating layout is difficult
- Space requirements
  - Requires 20 times more storage space than text
  - Increased download time
- Need for text representation for searching
  - Difficult to highlight search terms on an image
7
Page Images and Extracted Text
- Provide page images and extracted text
- Search on extracted text
- View image or extracted text
- Example: Maori Newspaper Collection
8
Other Document Types
- Audio and photographic images
  - Example: Oral History Collection
- Video
  - Example: Music Video Collection
- Music
  - Representations: printed notation, MIDI, synthesized performance, human performance
  - Example: Music Digital Library
- Multiple Languages
  - Interface and/or documents
  - Example: Arabic Collection
9
Metadata
- Provides information to facilitate access
- Structured
- Standardized
10
Metadata Examples
- Conventional bibliographic listing
  - Title
  - Author
  - Date
  - Publication
  - Volume Number
  - Issue Number
  - Page Numbers
- MARC
- Dublin Core
- METS
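To make the relationship between a conventional bibliographic listing and a standardized scheme concrete, here is a minimal sketch that maps a few bibliographic fields onto Dublin Core elements and builds an XML record. The element names (`title`, `creator`, `date`, `source`) and the namespace URI belong to the Dublin Core element set; the particular field-to-element mapping, the `metadata` wrapper element, the function name, and the sample record are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Assumed mapping from bibliographic fields to Dublin Core elements.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "date": "date",
    "publication": "source",
}


def to_dublin_core(record):
    """Build an XML element holding the Dublin Core fields present in `record`."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for field_name, element in FIELD_TO_DC.items():
        value = record.get(field_name)
        if value:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return root


# Hypothetical record, used only to exercise the function.
record = {"title": "Sample Title", "author": "A. Author", "date": "2000"}
xml_text = ET.tostring(to_dublin_core(record), encoding="unicode")
```

Volume, issue, and page numbers are left out of the mapping because the simple element set has no dedicated slot for them; a real collection would need to decide where such details go.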
11
Metadata Aspects
- Historical
  - Describes provenance and preservation history
- Functional
  - Describes usage, condition and audience
- Technical
  - Describes interoperability requirements
- Relational
  - Describes links and citations
- Intellectual
  - Describes content or subject
12
Searching
- Types of query
  - Boolean
  - Ranked
- Case-folding and stemming
- Phrase searching
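The search options listed on this slide can be sketched with a small positional inverted index. This is a toy illustration under stated assumptions, not the implementation used by any particular system: the stemmer is a crude stand-in that only strips a trailing "s", the ranking is a simple TF-IDF sum, and all function names and sample documents are hypothetical.

```python
import math
import re
from collections import defaultdict


def normalize(text):
    """Case-fold, then crudely stem by stripping a trailing 's' (not a real stemmer)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]


def build_index(docs):
    """Map each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(normalize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, query):
    """Boolean query: documents containing every query term."""
    postings = [set(index.get(t, {})) for t in normalize(query)]
    return set.intersection(*postings) if postings else set()


def ranked(index, query, n_docs):
    """Ranked query: score documents by a simple TF-IDF sum, best first."""
    scores = defaultdict(float)
    for term in normalize(query):
        docs_with_term = index.get(term, {})
        if not docs_with_term:
            continue
        idf = math.log(1 + n_docs / len(docs_with_term))
        for doc_id, positions in docs_with_term.items():
            scores[doc_id] += len(positions) * idf
    return sorted(scores, key=lambda d: (-scores[d], d))


def phrase(index, query):
    """Phrase query: documents where the query terms occur consecutively."""
    terms = normalize(query)
    hits = set()
    for doc_id in boolean_and(index, query):
        starts = index[terms[0]][doc_id]
        if any(all(p + k in index[t][doc_id] for k, t in enumerate(terms))
               for p in starts):
            hits.add(doc_id)
    return hits


# Hypothetical three-document collection.
docs = {
    "d1": "Digital library collections",
    "d2": "The library holds a digital collection",
    "d3": "Page images",
}
index = build_index(docs)
```

Because queries and documents pass through the same `normalize` step, "DIGITAL Collections" matches both "Digital library collections" and "a digital collection". Phrase search needs the stored word positions, which is why the index keeps more than a plain set of document ids.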
13
Browsing
- Based on metadata
- Browsing alphabetical lists
  - Chinese is not alphabetic
- Browsing by date
- Browsing structures
  - Hierarchical classification structures
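Metadata-based browsing amounts to grouping records by a metadata value and presenting the groups in order. The sketch below builds an alphabetical (A to Z) list from titles and a date list from years. The record layout (dicts with `title` and `date` keys), the assumption that dates are strings beginning with a four-digit year, and the function names are all hypothetical.

```python
from collections import defaultdict


def browse_by_initial(records):
    """Group titles under their first letter; non-Latin or empty titles go under '#'."""
    buckets = defaultdict(list)
    for rec in records:
        title = rec["title"]
        first = title[:1].upper()
        key = first if first.isascii() and first.isalpha() else "#"
        buckets[key].append(title)
    return {k: sorted(v, key=str.casefold) for k, v in sorted(buckets.items())}


def browse_by_year(records):
    """Group titles by the four-digit year at the start of the date string."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["date"][:4]].append(rec["title"])
    return {y: sorted(t, key=str.casefold) for y, t in sorted(buckets.items())}


# Hypothetical records.
records = [
    {"title": "zebra atlas", "date": "1999-05-01"},
    {"title": "Apple guide", "date": "2001-01-01"},
    {"title": "amber notes", "date": "1999-12-31"},
]
by_letter = browse_by_initial(records)
by_year = browse_by_year(records)
```

The "#" bucket is where the slide's caveat about Chinese becomes visible: titles in a non-alphabetic script have no A to Z position, so a collection of such titles needs a different ordering scheme rather than this one.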
14
Phrase Browsing
- Phrase: any sequence of words appearing more than once in the collection
- Automatic phrase extraction
- Key phrases
- Phrase browser
  - Phrase hierarchy
  - Sorted by document and collection frequencies
  - Leaves are documents
- Example: The Complete Works of Shakespeare
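The slide's definition of a phrase can be turned into a naive extraction sketch: count every word sequence across the collection and keep those that occur more than once, recording both document frequency and collection frequency for sorting. This is brute-force n-gram counting, not necessarily the method behind the example collection. Restricting phrases to two or more words, the `max_len` cap, the function name, and the sample texts are assumptions.

```python
import re
from collections import Counter, defaultdict


def extract_phrases(docs, max_len=3):
    """Return [(phrase, (doc_freq, coll_freq))] for word sequences seen more than once."""
    coll_freq = Counter()          # total occurrences in the collection
    doc_ids = defaultdict(set)     # which documents contain each sequence
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z']+", text.lower())
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                candidate = " ".join(words[i:i + n])
                coll_freq[candidate] += 1
                doc_ids[candidate].add(doc_id)
    repeated = {p: (len(doc_ids[p]), c) for p, c in coll_freq.items() if c > 1}
    # Most widespread phrases first, then most frequent, then alphabetical.
    return sorted(repeated.items(), key=lambda kv: (-kv[1][0], -kv[1][1], kv[0]))


# Hypothetical two-document collection.
docs = {"a": "to be or not to be", "b": "not to be outdone"}
phrases = extract_phrases(docs)
```

In this sample, "not to be" extends "not to", which hints at how a phrase hierarchy can be formed: longer phrases nest under the shorter phrases they contain, with the documents that hold them at the leaves.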
15
Browsing Using Extracted Metadata
- Acronyms
  - Example: Acronym Extraction Demo
- Language identification
  - Example: Language Extraction Demo
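As one possible way to extract acronym metadata from text, the sketch below looks for a parenthesized run of capitals and accepts it when the initials of the immediately preceding words spell it out. This heuristic is an assumption made for illustration and is not claimed to be how the demo works; the function name and sample sentence are hypothetical.

```python
import re


def extract_acronyms(text):
    """Map acronyms written as 'Long Form (LF)' to their expansions."""
    found = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # Words preceding the parenthesis; take as many as the acronym has letters.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(acronym):]
        if len(candidate) == len(acronym) and all(
            w[0].upper() == c for w, c in zip(candidate, acronym)
        ):
            found[acronym] = " ".join(candidate)
    return found


sample = ("Pages pass through Optical Character Recognition (OCR); music may be "
          "stored in the Musical Instrument Digital Interface (MIDI) format.")
acronyms = extract_acronyms(sample)
```

The resulting acronym-to-expansion pairs can then be listed alphabetically like any other browsable metadata. The heuristic misses acronyms that skip small words such as "of" or that are defined without parentheses.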