Union Catalog and Knowledge Engineering for TELDAP
Keh-Jiann Chen
Principal Investigator, Core Platforms for Digital Contents Project, TELDAP
Research Fellow, Research Center for Information Technology Innovation & Institute of Information Science, Academia Sinica
Outline
– Introduction
– Union catalog
– Databases and metadata for digital contents and websites
– Knowledge engineering
– Future perspective
Introduction
The integration and management of digital content have become an important issue as the volume of digital content produced by different projects and institutions increases rapidly. The goal of our project is to optimize the preservation, retrieval, and presentation of digital collections.
What is the union catalog?
– It is a catalog and portal for all digital collections of TELDAP.
– It is an integrated platform for browsing and searching the entire digital content of TELDAP.
– Its metadata provides core descriptions and licensing information for each digital collection.
Figure: Home page of the union catalog (callouts: Browsing by topics, Search by keywords)
Some improved functions for IR
– Keyword suggestion
– Keyword extension
– Recommendation of related collections
Keyword suggestion
Keyword extension
Figure: Item page (callouts: Digital Image, Recommendation of related collections, Hyperlink to database, Metadata, Citation, Social networking service, Licensing Information)
Metadata models for different types of objects
– Archived digital items: union catalog metadata model (Dublin Core+)
– Web sites: DCCAP (Dublin Core Collections Application Profile)
  – Fields for internal use only: Unique Identifier, Format, Evaluation, Cataloging History
– Documents: document metadata (Dublin Core)
Metadata for digital items
Over 4 million digital items and still increasing
Element: Definition
Title: A name given to the resource
Creator: An entity primarily responsible for making the content of the resource
Subject and Keywords: The topic of the content of the resource
Description: An account of the content of the resource
Publisher: An entity responsible for making the resource available
Contributor: An entity responsible for making contributions to the content of the resource
Date: A date associated with an event in the life cycle of the resource
Resource Type: The nature or genre of the content of the resource
Format: The physical or digital manifestation of the resource
Resource Identifier: An unambiguous reference to the resource within a given context
Source: A reference to a resource from which the present resource is derived
Language: A language of the intellectual content of the resource
Relation: A reference to a related resource
Coverage: The extent or scope of the content of the resource
Rights Management: Information about rights held in and over the resource
Metadata for websites
– Over 690 websites and still increasing
– Metadata: DCCAP (Dublin Core Collections Application Profile)
– Combines the standard with our requirements: 19 data fields
The Website Homepage Picture
URL, Project Information
Type, Name, Author, Subject, Description, Language, Item Type, Target
Archived Information: URL, time, authorization
Copyright, Purpose, Other Information
Figure: Social networking service
Uses of Metadata
– Search collections by matching keywords and features
– Provide basic information about each collection
– Dynamic categorization
– Provide information to compute the similarity or relatedness of two collections
– Extract keywords
(1) Chinese Keyword Search
Keyword + (Features) → Keyword Extension (AAT-Taiwan & TELDAP Thesaurus) → Synonyms, hyponyms → Keyword Matching (Keyword Dictionary) → Matched Collections → Ranking and Filtering → Collections + Weights → Display Results
English Keyword Search
English Keyword + (Features) → Keyword Translation & Extension (AAT-Taiwan & TELDAP Thesaurus) → Translations, synonyms, hyponyms → Keyword Matching (Keyword Dictionary) → Matched Collections → Ranking and Filtering → Collections + Weights → Display Results
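The translation and extension steps in the two search flows can be sketched roughly as below. This is a minimal sketch, not the TELDAP implementation: the names `ThesaurusEntry`, `THESAURUS`, `TRANSLATIONS`, `extend_keyword`, and `extend_english_keyword` are hypothetical, and the toy entries are invented for illustration rather than taken from AAT-Taiwan or the TELDAP thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """Synonyms and hyponyms recorded for one keyword (hypothetical layout)."""
    synonyms: list[str] = field(default_factory=list)
    hyponyms: list[str] = field(default_factory=list)


# Toy data, invented for illustration only.
THESAURUS = {
    "陶器": ThesaurusEntry(synonyms=["陶製器皿"], hyponyms=["陶罐", "陶碗"]),
}
TRANSLATIONS = {"pottery": ["陶器"]}


def extend_keyword(keyword, thesaurus):
    """Return the keyword followed by its synonyms and hyponyms, without duplicates."""
    terms = [keyword]
    entry = thesaurus.get(keyword)
    if entry is not None:
        for term in entry.synonyms + entry.hyponyms:
            if term not in terms:
                terms.append(term)
    return terms


def extend_english_keyword(keyword, translations, thesaurus):
    """Translate an English keyword, then extend each translation."""
    terms = []
    for translated in translations.get(keyword.lower(), []):
        for term in extend_keyword(translated, thesaurus):
            if term not in terms:
                terms.append(term)
    return terms
```

The extended term list would then be passed to keyword matching against the keyword dictionary. A keyword with no thesaurus entry is returned unchanged, so the search degrades to plain matching.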
Ranking Algorithm
Rank Value(item) = W1 * Association(keyword, item) + W2 * Quality(item)
– Association(keyword, item) = W1 * Topical Similarity(Topic(keyword), Topic(item)) + W2 * Importance of Relation(keyword, item)
– Quality(item) = W1 * Image Quality(item) + W2 * Qualification of Provider(item) + W3 * Metadata(item)
Topical Similarity(Topic(keyword), Topic(item)) = Ontology Distance(Topic(keyword), Topic(item))
Importance of Relation(keyword, item) = W1 * Keyword-from Value + W2 * Mutual Information(keyword, Topic(item))
Keyword-from Value = 1 if the keyword is contained in title(item); 0.5 if the keyword is contained in description(item)
Mutual Information(keyword, Topic(item)) = P(keyword, Topic(item)) / {P(keyword) * P(Topic(item))}
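The ranking formulas can be sketched in code as follows. This is an illustrative sketch under several assumptions: the slide reuses W1 and W2 in every formula, so the sketch assumes each formula has its own weight set; the weight values, the `Weights` class, the item field names, and the 0.0 fallback when the keyword appears in neither title nor description are all hypothetical. Topical similarity is passed in as a number because the ontology distance function is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Hypothetical weight sets, one tuple per formula on the slide."""
    rank: tuple = (0.7, 0.3)          # association, quality
    association: tuple = (0.5, 0.5)   # topical similarity, importance of relation
    quality: tuple = (0.4, 0.3, 0.3)  # image quality, provider qualification, metadata
    importance: tuple = (0.5, 0.5)    # keyword-from value, mutual information


def keyword_from_value(keyword, title, description):
    """1 if the keyword is in the title, 0.5 if in the description."""
    if keyword in title:
        return 1.0
    if keyword in description:
        return 0.5
    return 0.0  # assumption: the slide does not define this case


def mutual_information(p_joint, p_keyword, p_topic):
    """Ratio P(k, t) / (P(k) * P(t)), as written on the slide."""
    if p_keyword == 0 or p_topic == 0:
        return 0.0
    return p_joint / (p_keyword * p_topic)


def rank_value(keyword, item, topical_similarity, mi, w=Weights()):
    """Combine association and quality into a single rank value."""
    kfv = keyword_from_value(keyword, item["title"], item["description"])
    importance = w.importance[0] * kfv + w.importance[1] * mi
    association = w.association[0] * topical_similarity + w.association[1] * importance
    quality = (w.quality[0] * item["image_quality"]
               + w.quality[1] * item["provider_qualification"]
               + w.quality[2] * item["metadata_score"])
    return w.rank[0] * association + w.rank[1] * quality
```

Matched collections would be sorted by `rank_value` in descending order before filtering and display.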
Algorithm for Recommending Related Collections
i-th Item Vector = {Topic, Institute, Keyword1, Keyword2, …}
Similar(i-th item, j-th item) = W1 * Topic Similar(i-th item, j-th item) + W2 * Institute Similar(i-th item, j-th item) + Weight(Keyword1) * Delta(Keyword1) + Weight(Keyword2) * Delta(Keyword2) + …
where Delta(Keyword1) = 1 if Keyword1 of the i-th item is also a keyword of the j-th item; otherwise 0
Recommendation = Similar(i-th item, j-th item) + Evaluation(j-th item)
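A rough sketch of the recommendation score is given below. The slide does not define Topic Similar or Institute Similar, so this sketch assumes simple exact-match indicators (1 if equal, otherwise 0); the dictionary field names, the weight values, and the function names `similarity` and `recommend` are hypothetical.

```python
def similarity(item_i, item_j, w_topic, w_institute, keyword_weights):
    """Similar(i, j): weighted topic and institute match plus shared-keyword weights."""
    score = 0.0
    # Assumption: topic and institute similarity are exact-match indicators.
    if item_i["topic"] == item_j["topic"]:
        score += w_topic
    if item_i["institute"] == item_j["institute"]:
        score += w_institute
    # Delta(keyword) = 1 when a keyword of item i is also a keyword of item j.
    for keyword in item_i["keywords"]:
        if keyword in item_j["keywords"]:
            score += keyword_weights.get(keyword, 0.0)
    return score


def recommend(item_i, candidates, w_topic, w_institute, keyword_weights):
    """Rank candidate items by Similar(i, j) + Evaluation(j), highest first."""
    scored = []
    for item_j in candidates:
        score = similarity(item_i, item_j, w_topic, w_institute, keyword_weights)
        scored.append((item_j["id"], score + item_j["evaluation"]))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because Evaluation(j-th item) is added after the similarity term, a well-evaluated item can outrank a slightly more similar one, which matches the additive form on the slide.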
(2) Dynamic categorization
– User-oriented categorization: general, elementary school students, high school students, researchers, …etc.
– Topic-based categorization: archaeology, painting, animal, plant, document, …etc.
– Function-based categorization: research, education, business, technology,…
– Categorization based on institutions: Academia Sinica, Taiwan U., Palace museum,…
(3) Multiple purposes of the core IR system and databases
TELDAP
– Whole collections
– Searched by institutes, domains, and media types (documents, images, videos, and web sites)
– Monolingual
Digital Shop
– Whole collections or only fine arts
– General search and search by licensing types
– Relies on a multilingual thesaurus
Taiwan Academy
– Fine arts
– Searched by institutes and domains
– Multilingual
– Relies on a multilingual thesaurus
Figure: Digitalarchives.tw
Digitalarchives.tw
– Purpose: Education
– Target: Elementary school student, Junior high school student, Teacher…
– Purpose: Creative applications
– Purpose: Academic research
– Subject: Animal, Archaeology, Anthropology…
Figure: Taiwan Academy
Taiwan Academy
– Categorization based on institutions
– Topic-based categorization
Plans for building knowledge structures for TELDAP
– Construct metadata models for different objects.
– Establish hyperlinks between contents and objects.
– Develop keyword extraction tools.
– Design automatic tagging tools.
– Construct TELDAP ontology and thesaurus.
  – Art & Architecture Thesaurus by Getty
  – Chinese WordNet
(1) Metadata models for different objects
– Digital collections: union catalog metadata model (Dublin Core+)
– Web sites: DCCAP (Dublin Core Collections Application Profile)
  – Public fields
  – Private fields: Unique Identifier, Format, Evaluation, Cataloging History
– Documents: document metadata (Dublin Core)
(2) Create keyword dictionary
– Extract from metadata
– Collect from Google search terms
– Collect by social tagging
– Collect manually while tagging hyperlinks
Lexical Entry of Keyword Dictionary
– Keyword id
– Keyword
– Synset id
– Hypernym id
– Hyponym id
– Features
– Related Collections + Association Strengths
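One possible in-memory shape for such a lexical entry is sketched below. The field names follow the slide, but the types are assumptions: the sketch assumes a single hypernym, a list of hyponyms, and association strengths stored as floats keyed by collection id. The class name `KeywordEntry` and the helper `top_collections` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KeywordEntry:
    """One entry of the keyword dictionary (hypothetical layout)."""
    keyword_id: str
    keyword: str
    synset_id: Optional[str] = None
    hypernym_id: Optional[str] = None                      # assumption: one hypernym
    hyponym_ids: list[str] = field(default_factory=list)   # assumption: many hyponyms
    features: list[str] = field(default_factory=list)
    # Maps a collection id to its association strength with this keyword.
    related_collections: dict[str, float] = field(default_factory=dict)

    def top_collections(self, n=3):
        """Return the ids of the n most strongly associated collections."""
        ranked = sorted(self.related_collections.items(),
                        key=lambda pair: pair[1], reverse=True)
        return [collection_id for collection_id, _ in ranked[:n]]
```

Keeping the association strengths on the entry lets keyword matching return weighted collections directly, which is the input the ranking step expects.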
(2) Establish hyperlinks between contents and objects
– Identify keywords in contents.
– Tag keywords with related object hyperlinks.
Develop hyperlink tagging tools
Word segmentation tools
– Resolve word segmentation ambiguities and identify keywords.
– CKIP word segmentation system:
Develop hyperlink tagging tools
TELDAP keyword dictionary
– Extract keywords from metadata and establish object-keyword relations.
  – Extract text from the XML data for each object.
  – The text is classified by topics, titles, descriptions, authors, locations, eras, etc.
  – From each class of text file, extract keywords by automatic word segmentation, keyword extraction, and manual post-editing.
– The current dictionary contains more than 120,000 keywords.
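The first step, extracting text from XML and grouping it by field, might look roughly like this. It is a sketch under assumptions: the element names and the sample record are invented, real Dublin Core XML normally uses namespaced tags that would need extra handling, and the function name `extract_text_by_field` is hypothetical. It uses only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample record without namespaces, for illustration only.
SAMPLE_XML = """
<record>
  <title>Jadeite Cabbage</title>
  <description>A jadeite carving in the shape of a cabbage.</description>
  <creator>Unknown</creator>
</record>
"""


def extract_text_by_field(xml_string, fields):
    """Collect the text of the selected elements, grouped by element name."""
    root = ET.fromstring(xml_string)
    texts = defaultdict(list)
    for element in root.iter():
        if element.tag in fields and element.text and element.text.strip():
            texts[element.tag].append(element.text.strip())
    return dict(texts)
```

Each group of text would then go through word segmentation, keyword extraction, and manual post-editing, as the slide describes.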
Prototype system for hyperlink tagger: identify and select keywords from the input text
Prototype system for hyperlink tagger: produce text with hyperlinks
Prototype system for hyperlink tagger: hyperlinks point to the related digital collections
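The tagging behaviour shown in the prototype can be approximated by the sketch below. It is not the prototype's code: it replaces the CKIP word segmentation step with a simple greedy longest-match lookup against the keyword dictionary, which does not resolve segmentation ambiguities. The function name `tag_hyperlinks`, the keyword-to-URL mapping, and the `example.org` URLs are hypothetical.

```python
import html


def tag_hyperlinks(text, keyword_urls):
    """Wrap each dictionary keyword found in the text in an HTML link.

    Uses greedy longest-match scanning as a stand-in for real word segmentation.
    """
    max_len = max((len(keyword) for keyword in keyword_urls), default=0)
    output = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so longer keywords win over their substrings.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in keyword_urls:
                match = candidate
                break
        if match is not None:
            url = html.escape(keyword_urls[match], quote=True)
            output.append(f'<a href="{url}">{html.escape(match)}</a>')
            i += len(match)
        else:
            output.append(html.escape(text[i]))
            i += 1
    return "".join(output)
```

For example, with a dictionary containing both 翠玉白菜 and 白菜, the longer keyword is linked and the shorter one inside it is not linked separately.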
(3) Construct TELDAP ontology and thesaurus
– Establish association links between Chinese keywords and Getty AAT.
– Merge TELDAP keywords with Chinese AAT.
AAT Browsing trees of Taiwan Academy
AAT subject search of Taiwan Academy
Recommendation of related items
Future Perspective
Technology development
– Construct multilingual thesauri (Getty AAT).
– Maintain the TELDAP keyword-and-object relation database.
– Construct name authority files, gazetteers, and universal calendars.
– Design hyperlink taggers and keyword extension tools.
– Design an authoring tool that automatically provides hyperlinks to keyword-related digital content.
– Design a knowledge-based content retrieval system.
Future Perspective
Content enrichment
– Within TELDAP:
  – Standardize the object metadata model and data format.
  – Provide object metadata in controlled vocabulary.
  – Write scripts and stories for different topics with a Wiki-like knowledge structure.
  – Enrich the digital collections.
  – Establish hyperlinks between textbooks and TELDAP collections.
– Extend the knowledge sources, e.g. Wikipedia