CS 430: Information Discovery

Similar presentations
Information Retrieval (IR) on the Internet. Contents: Definition of IR; Performance Indicators of IR systems; Basics of an IR system; Some IR Techniques.
CS 502: Computing Methods for Digital Libraries Lecture 18 Descriptive Metadata: Metadata Models.
The Knowledge Bank Project at the Ohio State University. Presented at the American Accounting Association Meeting, Chicago, 8/6/07, by Charles J. Popovich, Head.
Search Engines. What Are They? Four Components: A database of references to webpages; An indexing robot that crawls the WWW; An interface; Enables.
IS 257 – Fall 2007. Codes and Rules for Description: History 2. University of California, Berkeley School of Information IS 245: Organization.
CS 430 / INFO 430 Information Retrieval Lecture 27 Classification 2.
CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
CS 502: Computing Methods for Digital Libraries Lecture 16 Web search engines.
Journal Citation Reports on the Web. Introduction: JCR distills citation trend data for 7,600+ journals from more.
CS/Info 430: Information Retrieval
CS 430 / INFO 430 Information Retrieval Lecture 24 Usability 2.
Introducing Symposia: "The digital repository that thinks like a librarian"
CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
CS 502: Computing Methods for Digital Libraries Lecture 17 Descriptive Metadata: Dublin Core.
CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
The RDF meta model: a closer look. Basic ideas of the RDF; Resource instance descriptions in the RDF format; Application-specific RDF schemas; Limitations.
CS 430 / INFO 430 Information Retrieval
CS 430: Information Discovery Lecture 15 Library Catalogs 3.
Web 2.0: Concepts and Applications 2: Publishing Online.
Announcements: Research Paper due today. Research Talks: Nov. 29 (Monday) Kayatana and Lance; Dec. 1 (Wednesday) Mark and Jeremy; Dec. 3 (Friday) Joe and.
© Netskills Quality Internet Training, University of Newcastle. Metadata Explained.
Organization of Information in Collections (8/28/97). Introduction to Description: Dublin Core and History. University of California, Berkeley School of Information.
CS 430: Information Discovery Lecture 17 Library Catalogs 2.
CS 430: Information Discovery Lecture 14 Automatic Extraction of Metadata.
Lecture Four: Steps 3 and 4, INST 250/4. Does one look for facts, or opinions, or both when conducting a literature search? What is the difference.
PUBLISHING ONLINE, Chapter 2. Overview: Blogs and wikis are two Web 2.0 tools that allow users to publish content online. Blogs function as online journals.
Metadata Considerations: Implementing Administrative and Descriptive Metadata for your digital images.
CS 430: Information Discovery Lecture 16 Thesaurus Construction.
CS315 – Link Analysis: Three generations of Search Engines; Anchor text; Link analysis for ranking; Pagerank; HITS.
Meta Tagging / Metadata. Lindsay Berard, assisted by Li Li.
CS 430: Information Discovery Lecture 9 Term Weighting and Ranking.
Semantics and Syntax of Dublin Core Usage in Open Archives Initiative Data Providers of Cultural Heritage Materials. Arwen Hutt, University of Tennessee.
Use & Access, 26 March. Use "Proof of Concept": Model for General Libraries & IS faculty; Test bed for DSpace.
Metadata: information about information. Different objects, different forms, e.g. library catalogue record. Property: Value: Author Ian Beardwell; Publisher.
Discussion Class 4: The Dublin Core Metadata Initiative.
Metadata for the Web. Andy Powell, UKOLN, University of Bath.
CS 430: Information Discovery Lecture 25 Cluster Analysis 2; Thesaurus Construction.
CS 430: Information Discovery Sample Midterm Examination: Notes on the Solutions.
CS 430: Information Discovery Lecture 5 Descriptive Metadata 1: Libraries, Catalogs, Dublin Core.
The RDF meta model. Basic ideas of the RDF; Resource instance descriptions in the RDF format; Application-specific RDF schemas; Limitations of XML compared.
Metadata and Meta tag. What is metadata? What does metadata do? Metadata schemes. What is meta tag? Meta tag example. Table of Content.
A presentation for Professor Turnbull's Information Architecture class by Rhonda Hankins, February 13, 2003.
Metadata Content: Entering Metadata Information. Discovery vs. Access vs. Understanding: cannot search on content if it is not documented; cannot access.
CS 430: Information Discovery Lecture 5 Ranking.
CS 430: Information Discovery Lecture 8 Collection-Level Metadata; Vector Methods.
An Application Profile and Prototype Metadata Management System for Licensed Electronic Resources. Adam Chandler, Information Technology Librarian, Central.
Roger Mills, February. "don't be evil"; "stand on the shoulders of giants".
CS 430: Information Discovery Lecture 24 Cluster Analysis.
Dublin Core Basics Workshop. Lisa Gonzalez, KB/LM Librarian.
Attributes and Values: Describing Entities. Metadata: at the most basic level, metadata is just another term for description, or information about an entity.
CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Automated Information Retrieval
CS 430: Information Discovery
WHAT DOES THE FUTURE HOLD? Ann Ellis Dec. 18, 2000
Text & Web Mining 9/22/2018.
HITS Hypertext Induced Topic Selection
Health On-Line Patient Education Web Site
Presentation transcript:

CS 430: Information Discovery Lecture 17 Ranking 1

Course Administration
• Assignment 2 and the midterm examination were mailed a week ago. A few questions are outstanding.
• Assignment 3 will be posted shortly.

Midterm Examination -- Question 4
4(a) What is the Dublin Core principle of dumbing-down? Are there any fields in this record that do not satisfy the principle?
"The theory behind this principle is that consumers of metadata should be able to strip off qualifiers and return to the base form of a property. ... this principle makes it possible for client applications to ignore qualifiers in the context of more coarse-grained, cross-domain searches." (Lagoze, 2001)

Question 4 (continued)
Dumbing-down failures:
Description.note: Title from home page as viewed on Nov. 1, 2000.
dumbs down to
Description: Title from home page as viewed on Nov. 1, 2000.
which is not a description of the object.
Publisher.place: Nashville, Tenn. :
dumbs down to
Publisher: Nashville, Tenn. :
which is not the publisher of the object.
Correct dumbing-down:
Subject.class.LCC: E840.8.G65
dumbs down to
Subject: E840.8.G65
which is a subject code.
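The stripping operation itself is mechanical, as a short sketch makes clear (the record fragment and function name below are illustrative only, not part of the examination):

    # Sketch: "dumbing down" a qualified Dublin Core record by stripping
    # qualifiers, i.e. truncating each element name at its first dot.
    record = {
        "Description.note": "Title from home page as viewed on Nov. 1, 2000.",
        "Publisher.place": "Nashville, Tenn. :",
        "Subject.class.LCC": "E840.8.G65",
    }

    def dumb_down(record):
        # Keep only the base element name before the first qualifier.
        return {field.split(".")[0]: value for field, value in record.items()}

    print(dumb_down(record))
    # {'Description': 'Title from home page as viewed on Nov. 1, 2000.',
    #  'Publisher': 'Nashville, Tenn. :',
    #  'Subject': 'E840.8.G65'}

As the failures above show, stripping a qualifier is only safe when the remaining value is still a sensible instance of the base element.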

Question 4 (continued)
4(b) The metadata in the fields Publisher and Publisher.place ends in punctuation marks. Can you suggest any reasons for doing so?
This is a historical curiosity. It comes from the concept that the metadata will be printed, so the metadata is stored in a printable format:
Publisher: Gore/Lieberman,
Publisher.place: Nashville, Tenn. :
These are intended to be combined with a date as follows:
Nashville, Tenn. : Gore/Lieberman, 2001

Question 4 (continued) 4(c) This record has no Creator field. It has a Contributor.nameCorporate field with value "Gore/Lieberman, Inc." Do you consider that this is correct use of Dublin Core? What would you put in the Creator and Contributor fields? Why?

Question 4 (continued)
Specification of Dublin Core:
A. All fields are optional. It is not necessary to have a Creator.
B. Definitions of fields:
Creator: The person or organization primarily responsible for the intellectual content of the resource.
Contributor: A person or organization not specified in a creator element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a creator element.
Gore/Lieberman, Inc. is the corporate author of this web site and is therefore the Creator.

Midterm Examination -- Question 2
2(b) You have the collection of documents that contain the following index terms:
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
(i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.

Incidence array

        alpha  bravo  charlie  delta  echo  foxtrot  golf
D1        1      1       1       1      1      1       1
D2        1                      1                     1
D3               1       1              1      1
D4        1                      1             1       1

Number of distinct terms: D1 = 7, D2 = 3, D3 = 4, D4 = 4

Document similarity matrix

        D1     D2     D3     D4
D1      -     0.65   0.76   0.76
D2     0.65    -     0.00   0.87
D3     0.76   0.00    -     0.25
D4     0.76   0.87   0.25    -
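These values can be checked mechanically. A minimal Python sketch (variable names are my own) that builds the binary incidence vectors and computes the pairwise cosine similarities:

    # Sketch: unweighted (binary incidence) cosine similarity.
    docs = {
        "D1": "alpha bravo charlie delta echo foxtrot golf",
        "D2": "golf golf golf delta alpha",
        "D3": "bravo charlie bravo echo foxtrot bravo",
        "D4": "foxtrot alpha alpha golf golf delta",
    }
    terms = sorted({t for text in docs.values() for t in text.split()})

    # Incidence vector: 1 if the term occurs in the document, else 0.
    vec = {d: [1 if t in text.split() else 0 for t in terms]
           for d, text in docs.items()}

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

    for i, d1 in enumerate(docs):
        for d2 in list(docs)[i + 1:]:
            print(d1, d2, round(cosine(vec[d1], vec[d2]), 2))
    # D1 D2 0.65 / D1 D3 0.76 / D1 D4 0.76 / D2 D3 0.0 / D2 D4 0.87 / D3 D4 0.25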

Question 2 (continued) 2b(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with term weights inversely proportional to frequency.

Frequency array

        alpha  bravo  charlie  delta  echo  foxtrot  golf
D1        1      1       1       1      1      1       1
D2        1                      1                     3
D3               3       1              1      1
D4        2                      1             1       2

Inverse Document Frequency Weighting
Principle:
(a) Weight is proportional to the number of times that the term appears in the document.
(b) Weight is inversely proportional to the number of documents that contain the term:
w_ik = f_ik / d_k
where:
w_ik is the weight given to term k in document i
f_ik is the frequency with which term k appears in document i
d_k is the number of documents that contain term k
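For example, golf occurs 3 times in D2 and appears in 3 of the 4 documents (D1, D2, D4), so its weight in D2 is 3/3 = 1.00; bravo occurs 3 times in D3 and appears in 2 documents, so its weight in D3 is 3/2 = 1.50.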

Frequency array with weights

        alpha  bravo  charlie  delta  echo  foxtrot  golf
D1      0.33   0.50    0.50    0.33   0.50   0.33    0.33
D2      0.33                   0.33                  1.00
D3             1.50    0.50           0.50   0.33
D4      0.67                   0.33          0.33    0.67

d_k       3      2       2       3      2      3       3

Length of each weighted vector: |D1| = 1.09, |D2| = 1.11, |D3| = 1.69, |D4| = 1.05

Document similarity matrix

        D1     D2     D3     D4
D1      -     0.46   0.74   0.58
D2     0.46    -     0.00   0.86
D3     0.74   0.00    -     0.06
D4     0.58   0.86   0.06    -
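Extending the earlier sketch with the w_ik = f_ik / d_k weighting reproduces this matrix (again a minimal illustration, not the graded solution):

    # Sketch: weights w_ik = f_ik / d_k, then cosine similarity as before.
    from collections import Counter

    docs = {
        "D1": "alpha bravo charlie delta echo foxtrot golf",
        "D2": "golf golf golf delta alpha",
        "D3": "bravo charlie bravo echo foxtrot bravo",
        "D4": "foxtrot alpha alpha golf golf delta",
    }
    terms = sorted({t for text in docs.values() for t in text.split()})
    freq = {d: Counter(text.split()) for d, text in docs.items()}

    # d_k: the number of documents that contain term k.
    dk = {t: sum(1 for d in docs if freq[d][t]) for t in terms}

    # w_ik = f_ik / d_k
    w = {d: [freq[d][t] / dk[t] for t in terms] for d in docs}

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

    for i, d1 in enumerate(docs):
        for d2 in list(docs)[i + 1:]:
            print(d1, d2, round(cosine(w[d1], w[d2]), 2))
    # D1 D2 0.46 / D1 D3 0.74 / D1 D4 0.58 / D2 D3 0.0 / D2 D4 0.86 / D3 D4 0.06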

Google Ranking Algorithm
Concept: The rank of a page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

Page Ranks (Google)
Rows are cited pages and columns are citing pages; a 1 means the citing page links to the cited page. The Number row counts the links from each citing page.

                    Citing page
            P1   P2   P3   P4   P5   P6
Cited  P1         1    1         1
page   P2              1
       P3    1
       P4    1         1    1         1
       P5                             1
       P6              1         1

Number       2    1    4    1    2    2

Normalize by Number of Links from Page
Each column is divided by the number of links from that citing page, so every column sums to 1:

                    Citing page
            P1    P2    P3    P4    P5    P6
Cited  P1         1     0.25        0.5
page   P2               0.25
       P3   0.5
       P4   0.5         0.25  1           0.5
       P5                                 0.5
       P6               0.25        0.5

= B
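For example, P3 has four outgoing links (Number = 4), so each nonzero entry in the P3 column becomes 1/4 = 0.25; P1 has two, so its entries become 1/2 = 0.5.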

Weighting of Pages
Initially all pages have weight 1:
w1 = (1, 1, 1, 1, 1, 1)
Recalculate weights:
w2 = Bw1 = (1.75, 0.25, 0.50, 2.25, 0.50, 0.75)
Each new weight is the corresponding row sum of B.

Google Ranks
Iterate wk = Bwk-1 until the weights converge.
The limiting w is the principal eigenvector of B (the eigenvector with the largest eigenvalue). It ranks the pages by the links to them, normalized by the number of citations from each page and weighted by the ranking of the citing pages.
Google:
• calculates the ranks for all pages (over one billion)
• lists hits in rank order
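A power-iteration sketch over the matrix B reconstructed above (a toy illustration: production PageRank also adds a damping factor, which is omitted here):

    # Sketch: power iteration w_k = B w_{k-1} on the normalized link matrix.
    # Rows/columns are P1..P6 as in the table above; no damping factor.
    B = [
        [0.0, 1.0, 0.25, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.25, 0.0, 0.0, 0.0],
        [0.5, 0.0, 0.00, 0.0, 0.0, 0.0],
        [0.5, 0.0, 0.25, 1.0, 0.0, 0.5],
        [0.0, 0.0, 0.00, 0.0, 0.0, 0.5],
        [0.0, 0.0, 0.25, 0.0, 0.5, 0.0],
    ]

    w = [1.0] * 6                       # initial weights w1
    for _ in range(50):
        w = [sum(B[i][j] * w[j] for j in range(6)) for i in range(6)]
        total = sum(w)
        w = [x * 6 / total for x in w]  # rescale so the weights stay comparable

    print([round(x, 2) for x in w])     # approximate page ranks

The first pass reproduces w2 above; because this toy graph lets P4 keep all of its own weight, the iteration drifts toward P4, which is one reason real PageRank adds a damping factor.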