1 CS 430: Information Discovery Lecture 15 Library Catalogs 3

2 Course Administration Midterm examination results have been sent by email. Assignment 2 results will be mailed shortly. Assignment 3, due November 10, will be posted soon.

3 Automatic extraction of catalog data

Example: Dublin Core records for web pages

Strategies:
- Manual, by trained cataloguers: high-quality records, but expensive and time-consuming
- Entirely automatic: fast, almost zero cost, but poor quality
- Automatic followed by human editing: cost and quality depend on the amount of editing
- Manual collection-level record, automatic item-level record: moderate quality, moderate cost

4 DC-dot

DC-dot is a Dublin Core metadata editor for web pages, created by Andy Powell at UKOLN. DC-dot has two parts:
(a) A skeleton Dublin Core record is created automatically from clues in the web page.
(b) A user interface is provided for cataloguers to edit the record.

5 [image-only slide]

6 Automatic record for CS 430 home page

DC-dot applied to the CS 430 home page (continued on next slide).

7 Automatic record for CS 430 home page (continued)

DC-dot applied to the CS 430 home page.

8 Observations on DC-dot applied to the CS 430 home page

- DC.Title is a copy of the html <title> field
- DC.Publisher is the owner of the IP address where the page was stored
- DC.Subject is a list of headings and noun phrases presented for editing
- DC.Date is taken from the Last-Modified field in the http header
- DC.Type and DC.Format are taken from the MIME type of the http response
- DC.Identifier was supplied by the user as input
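The mappings above can be sketched in Python. This is a simplified illustration, not DC-dot's actual code: the parser class, the function name, and the sample inputs are invented for the example, and only a few of the Dublin Core fields are filled in.

```python
from html.parser import HTMLParser

class TitleKeywordParser(HTMLParser):
    """Collect the html <title> text and any <meta name="keywords"> content."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and a.get("name", "").lower() == "keywords":
            self.keywords += [k.strip() for k in a.get("content", "").split(",")]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def skeleton_dc_record(url, html_text, http_headers):
    """Build a skeleton Dublin Core record from the same clues DC-dot uses."""
    p = TitleKeywordParser()
    p.feed(html_text)
    return {
        "DC.Identifier": url,                           # supplied by the user
        "DC.Title": p.title.strip(),                    # copy of the html <title> field
        "DC.Subject": p.keywords,                       # candidate terms, presented for editing
        "DC.Date": http_headers.get("Last-Modified"),   # from the http header
        "DC.Format": http_headers.get("Content-Type"),  # MIME type of the http response
    }
```

A cataloguer would then edit the resulting skeleton record, as in DC-dot's second part.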

9 [image-only slide]

10 Automatic record for George W. Bush home page

DC-dot applied to the George W. Bush home page (continued on next slide).

11 Automatic record for George W. Bush home page (continued)

DC-dot applied to the George W. Bush home page.

12 Observations on DC-dot applied to the George W. Bush home page

The home page has several meta tags. [The page has no html <title>.]

<META NAME="KEYWORDS" CONTENT="George W. Bush, Bush, George Bush, President, republican, 2000 election and more

13 Collection-level metadata

Several of the most difficult fields to extract automatically are the same across all pages in a web site. Therefore, create a collection-level record manually and combine it with automatic extraction of the other fields at the item level.

For the CS 430 home page collection-level metadata, see Jenkins and Inman.
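The combination strategy can be sketched as a simple merge of two records. This is a minimal illustration, not code from the lecture; the function name is invented, and the field names follow the slides.

```python
def combine_records(item_record, collection_record):
    """Start from the manually created collection-level record and
    overlay the automatically extracted item-level fields."""
    combined = dict(collection_record)   # shared fields, created once by hand
    for field, value in item_record.items():
        if value:                        # item-level values fill in per-page detail
            combined[field] = value
    return combined

# Example (values taken from the slides that follow):
collection = {
    "publisher": "Corporation for National Research Initiatives",
    "language": "English",
}
item = {"title": "Digital Libraries and the Problem of Purpose"}
# combine_records(item, collection) carries the collection-level
# publisher and language into the item-level record
```

This gives the moderate-cost, moderate-quality strategy from slide 3: one manual record per collection, one automatic record per item.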

14 [image-only slide]

15 Metadata extracted automatically by DC-dot

D.C. Field   Qualifier   Content
title                    Digital Libraries and the Problem of Purpose
subject                  [not included in this slide]
publisher                Corporation for National Research Initiatives
date         W3CDTF
type         DCMIType    Text
format                   text/html
format       bytes
identifier

16 Collection-level record

D.C. Field   Qualifier     Content
publisher                  Corporation for National Research Initiatives
type                       article
type         resource      work
relation     rel-type      InSerial
relation     serial-name   D-Lib Magazine
relation     issn
language                   English
rights                     Permission is hereby given for the material in D-Lib Magazine to be used for...

17 Combined item-level record (DC-dot plus collection-level)

D.C. Field      Qualifier     Content
title                         Digital Libraries and the Problem of Purpose
publisher (*)                 Corporation for National Research Initiatives
date            W3CDTF
type (*)                      article
type (*)        resource      work
type            DCMIType      Text
format                        text/html
format          bytes

(*) indicates collection-level metadata
(continued on next slide)

18 Combined item-level record (DC-dot plus collection-level, continued)

D.C. Field      Qualifier     Content
relation (*)    rel-type      InSerial
relation (*)    serial-name   D-Lib Magazine
relation (*)    issn
language (*)                  English
rights (*)                    Permission is hereby given for the material in D-Lib Magazine to be used for...
identifier

(*) indicates collection-level metadata

19 Manually created record

D.C. Field    Qualifier     Content
title                       Digital Libraries and the Problem of Purpose
creator (+)                 David M. Levy
publisher                   Corporation for National Research Initiatives
date          publication   January 2000
type                        article
type          resource      work

(+) entry that is not in the automatically generated records
(continued on next slide)

20 Manually created record (continued)

D.C. Field     Qualifier     Content
relation       rel-type      InSerial
relation       serial-name   D-Lib Magazine
relation       issn
relation (+)   volume        6
relation (+)   issue         1
identifier     DOI (+)       /january2000-levy
identifier     URL
language                     English
rights (+)                   Copyright (c) David M. Levy

(+) entry that is not in the automatically generated records

21 Collection-level metadata

Compare:
(a) Metadata extracted automatically by DC-dot
(b) Collection-level record
(c) Combined item-level record (DC-dot plus collection-level)
(d) Manually created record

For web pages, information retrieval works better with automatic indexing of the full text than with automatic extraction of metadata followed by indexing of the metadata. However, we will see later an effective example of automatic extraction of metadata from video sequences (Informedia).

22 Metatest

Metatest is a research project led by Liz Liddy at Syracuse, with participation from the Human-Computer Interaction group at Cornell. The aim is to compare the effectiveness, as perceived by the user, of indexing based on:
(a) Manually created Dublin Core
(b) Automatically created Dublin Core (higher quality than DC-dot)
(c) Full-text indexing

Preliminary results suggest remarkably little difference in effectiveness.

23 Midterm Examination Q3

(a) The aggregate term weighting for term j in document i is sometimes written:

    w_ij = tf_ij * idf_j

Explain the purpose of tf_ij and idf_j.

Term frequency assumes that the usefulness of a term for retrieval increases as the number of times the term appears in the document increases. Inverse document frequency assumes that terms that appear in few documents are better discriminators than terms that appear in many.

24 Midterm Examination Q3 (continued)

In class, we recommended the following term weighting for free-text documents:

    tf_ij = f_ij / m_i
    idf_j = log2(n / n_j) + 1

(i) Explain why this form is frequently changed for the weighting of terms in documents, such as catalog records, that are not free text.

These forms of tf and idf were developed for free text. The distributions of terms in free text and in catalog records are different.

(ii) Explain why this form might give difficulties if the documents vary greatly in length.

The scaling factors in tf and idf were developed for collections of records of similar length.
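The recommended weights can be computed directly. A minimal sketch (the function names are mine): f_ij is the frequency of term j in document i, m_i the frequency of the most frequent term in document i, n the number of documents in the collection, and n_j the number of documents containing term j.

```python
import math

def tf(f_ij, m_i):
    """Term frequency, scaled by the most frequent term in the document."""
    return f_ij / m_i

def idf(n, n_j):
    """Inverse document frequency: terms in fewer documents get larger weights."""
    return math.log2(n / n_j) + 1

def weight(f_ij, m_i, n, n_j):
    """Aggregate weight w_ij = tf_ij * idf_j."""
    return tf(f_ij, m_i) * idf(n, n_j)
```

Note that a term appearing in every document gets idf = log2(1) + 1 = 1, the minimum weight.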

25 Midterm Examination Q3 (continued)

(c) Consider the query:

    q: dog cat dog

and the following set of documents:

    d1: bee dog bee cat bee elk elk
    d2: elk dog ant ant dog ant
    d3: cat cat cat cat dog

(i) With no term weighting, what is the similarity between this query and each of the documents?

26 Midterm Examination Q3 (continued)

Term vector matrix (binary weights):

          ant   bee   cat   dog   elk   length
    q      0     0     1     1     0     √2
    d1     0     1     1     1     1     2
    d2     1     0     0     1     1     √3
    d3     0     0     1     1     0     √2

Similarities:

          q      d1      d2      d3
    q     1     1/√2    1/√6     1
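The unweighted similarities on this slide can be checked with a binary-vector cosine. This is a small verification script, not part of the exam solution:

```python
import math

def binary_cosine(query_terms, doc_terms):
    """Cosine similarity with 0/1 term weights: the number of shared
    terms divided by the product of the vector lengths."""
    q, d = set(query_terms), set(doc_terms)
    return len(q & d) / (math.sqrt(len(q)) * math.sqrt(len(d)))

q  = "dog cat dog".split()
d1 = "bee dog bee cat bee elk elk".split()
d2 = "elk dog ant ant dog ant".split()
d3 = "cat cat cat cat dog".split()
# binary_cosine(q, d1) -> 1/√2, binary_cosine(q, d2) -> 1/√6,
# binary_cosine(q, d3) -> 1 (q and d3 contain exactly the same terms)
```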

27 Midterm Examination Q3 (continued) (ii) Weighting both the query and the documents for term frequency, but not weighting for inverse document frequency, what is the similarity between this query and each of the documents?

28 Midterm Examination Q3 (continued)

Term vector matrix (term frequency weights):

          ant   bee   cat   dog   elk   length
    q      0     0     1     2     0     √5
    d1     0     3     1     1     2     √15
    d2     3     0     0     2     1     √14
    d3     0     0     4     1     0     √17

Similarities:

          q      d1       d2       d3
    q     1     3/√75    4/√70    6/√85
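These term-frequency similarities can be checked the same way. Raw counts suffice here: dividing each vector by its own m_i scales it by a constant, and cosine similarity is unchanged by such scaling. Again, a small verification script rather than part of the exam solution:

```python
import math
from collections import Counter

def tf_cosine(query_terms, doc_terms):
    """Cosine similarity with term-frequency weights (raw counts;
    dividing each vector by its m_i would not change the cosine)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(q) * norm(d))

q  = "dog cat dog".split()
d1 = "bee dog bee cat bee elk elk".split()
d2 = "elk dog ant ant dog ant".split()
d3 = "cat cat cat cat dog".split()
# tf_cosine(q, d1) -> 3/√75, tf_cosine(q, d2) -> 4/√70, tf_cosine(q, d3) -> 6/√85
```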