Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS 430: Information Discovery

Similar presentations


Presentation on theme: "CS 430: Information Discovery"— Presentation transcript:

1 CS 430: Information Discovery
Lecture 17 Ranking 1

2 Course Administration
• Assignment 2 and Midterm examination were mailed a week ago. A few questions outstanding. • Assignment 3 will be posted shortly.

3 Midterm Examination -- Question 4
4(a) What is the Dublin Core principle of dumbing-down? Are there any fields in this record that do not satisfy the principle? "The theory behind this principle is that consumers of metadata should be able to strip off qualifiers and return to the base form of a property. ... this principle makes it possible for client applications to ignore qualifiers in the context of more coarse-grained, cross-domain searches." Lagoze 2001

4 Question 4 (continued) Dumbing-down failures:
Description.note Title from home page as viewed on Nov. 1, 2000. Description Title from home page as viewed on Nov. 1, 2000. which is not a description of the object Publisher.place Nashville, Tenn. : Publisher Nashville, Tenn. : which is not the publisher of the object Correct dumbing-down: Subject.class.LCC E840.8.G65 Subject E840.8.G65 which is a subject code

5 Question 4 (continued) 4(b) The metadata in the fields Publisher and Publisher place end in punctuation marks. Can you suggest any reasons for doing so? This is a historic curiosity. It comes from the concept that the metadata will be printed, so that the metadata is stored in a printable format. Publisher Gore/Lieberman, Publisher.place Nashville, Tenn. : is intended to be combined with a date as follows: Nashville, Tenn. : Gore/Lieberman, 2001

6 Question 4 (continued) 4(c) This record has no Creator field. It has a Contributor.nameCorporate field with value "Gore/Lieberman, Inc." Do you consider that this is correct use of Dublin Core? What would you put in the Creator and Contributor fields? Why?

7 Question 4 (continued) Specification of Dublin Core:
A. All fields are optional. It is not necessary to have a Creator. B. Definitions of fields Creator The person or organization primarily responsible for the intellectual content of the resource. Contributor A person or organization not specified in a creator element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a creator element. Gore/Lieberman, Inc. is the corporate author of this web site and is therefore the Creator.

8 Midterm Examination -- Question 2
2(b) You have the collection of documents that contain the following index terms: D1: alpha bravo charlie delta echo foxtrot golf D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta (i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.

9 Incidence array D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta 7 3 4 alpha bravo charlie delta echo foxtrot golf D D D D

10 Document similarity matrix
D1 D2 D3 D4 D D D D

11 Question 2 (continued) 2b(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with term weights inversely proportional to frequency.

12 Frequency Array D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta alpha bravo charlie delta echo foxtrot golf D D D D

13 Inverse Document Frequency Weighting
Principle: (a) Weight is proportional to the number of times that the term appears in the document (b) Weight is inversely proportional to the number of documents that contain the term: wik = fik / dk Where: wik is the weight given to term k in document i fik is the frequency with which term k appears in document i dk is the number of documents that contain term k

14 Frequency Array with Weights
D1: alpha bravo charlie delta echo foxtrot golf D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta alpha bravo charlie delta echo foxtrot golf D D D D length 0.94 0.65 1.08 0.76 dk

15 Document similarity matrix
D1 D2 D3 D4 D D D D

16 Google Ranking algorithm
Concept: The rank of a page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

17 Page Ranks (Google) Citing page P1 P2 P3 P4 P5 P6 P1 1 1 1 P2 1 P3 1
Cited page Number

18 Normalize by Number of Links from Page
Citing page P P P P P P6 P P P P P P = B Cited page Number

19 Weighting of Pages Initially all pages have weight 1
Recalculate weights w2 = Bw1 = 1.75 0.25 0.50 2.25 0.75 1

20 Google Ranks Iterate until wk = Bwk-1
This w is the high order eigenvector of B It ranks the pages by links to them, normalized by the number of citations from each page and weighted by the ranking of the cited pages Google: calculates the ranks for all pages (over one billion) lists hits in rank order


Download ppt "CS 430: Information Discovery"

Similar presentations


Ads by Google