Published by Emory Curtis Chandler. Modified over 8 years ago.
1
Introduction to Digital Libraries
Week 7: Ranking
Old Dominion University, Department of Computer Science
ODU CS 751/851 Spring 2011
Michael L. Nelson, mln@cs.odu.edu
02/22/11
2
Content Based Ranking
http://en.wikipedia.org/wiki/Trip_hop
–mentions the artist "DJ Shadow" once
http://en.wikipedia.org/wiki/DJ_Shadow
–mentions "DJ Shadow" 30+ times
Intuitively, the latter is more "about" DJ Shadow than the former, and the frequency of the term(s) reflects this "aboutness"
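A minimal sketch of ranking by raw term frequency, the simplest content-based signal. The two text snippets are hypothetical stand-ins for the Wikipedia pages, and `term_frequency` is an illustrative helper, not something from the lecture.

```python
import re

def term_frequency(text: str, phrase: str) -> int:
    """Count case-insensitive, non-overlapping occurrences of a phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))

# Hypothetical snippets standing in for the two pages (assumption, not real page text).
trip_hop = "Trip hop artists include Massive Attack, Portishead and DJ Shadow."
dj_shadow = "DJ Shadow is a producer. DJ Shadow samples records. Critics praised DJ Shadow."

# Rank pages by how often the query phrase occurs, most frequent first.
ranked = sorted(
    [("Trip_hop", trip_hop), ("DJ_Shadow", dj_shadow)],
    key=lambda item: term_frequency(item[1], "dj shadow"),
    reverse=True,
)
```

The page that repeats the phrase most often sorts first, which is exactly the behavior that a dishonest author can exploit.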
3
Content Based Ranking
How desirable is recall at web scale?
4
Content Based Ranking
Why is djshadow.rpod.ru on page 15 when it has the phrase "dj shadow" 20+ times, while Rhapsody.com appears on page 1 (position #6) when it has the same phrase only 6 times?
5
Is It Really About DJ Shadow?
Fake page about the real DJ Shadow?
Or real page about a fake DJ Shadow?
6
Link-Based Metrics
Content based metrics have an implicit assumption: everyone is telling the truth!
–Lynch, “When Documents Deceive” http://scholar.google.com/scholar?cluster=4682764276311632091
–AIRWeb: Adversarial Information Retrieval on the Web http://airweb.cse.lehigh.edu/
We can mine the collective intelligence of the web community by seeing how they voted with their links
–assumption: when choosing a target for their web page links, (honest) people do a good job of filtering out spam, poor quality, etc.
–result: in search engine rankings, your document is influenced by the content of the documents of others
7
Want to link “to” a review of DJ Shadow’s “The Outsider”?
http://www.google.com/search?q=dj+shadow+the+outsider+review
–where’s the most knowledgeable review ever, on http://f-measure.blogspot.com ???
–class assignment: everyone go home and create 10 pages that link to: http://f-measure.blogspot.com/2009/01/dj-shadow-outsider-lp-review.html
8
But Not All Links Are Equal…
You linking to my LP review is nice, but it's not as nice as it would be if it were linked to by Spin Magazine, Rolling Stone, MTV, etc.
–a page’s “importance” is defined by having other important pages link to it
many links > few links
"important" links > "unimportant" links
9
Random "Surfer" Model
The surfer starts at some random page on the Web
Begins following links from page to page
At each page, there is some probability 1-d that the surfer becomes "bored" and randomly jumps to some other page in the Web
–that is, they type a URL directly, follow a link from email, etc.; they just "teleport" to some other place in the Web
10
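Computing PageRank
Original paper version, sums to N (number of pages in graph):
PR(A) = (1-d) + d * ( PR(T1)/L(T1) + ... + PR(Tn)/L(Tn) )
More common version, sums to 1:
PR(A) = (1-d)/N + d * ( PR(T1)/L(T1) + ... + PR(Tn)/L(Tn) )
T1 ... Tn = the pages that link to A
d = damping factor
L() = out degree of a page
PR() = PageRank of a page (all nodes start with PR() = 1 in the version that sums to N, or 1/N in the version that sums to 1)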
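A minimal sketch of the version that sums to N, assuming a tiny hypothetical three-page graph with no sinks. The names `pagerank_step`, `graph`, and `pr` are illustrative only.

```python
def pagerank_step(graph, pr, d=0.85):
    """One synchronous update: PR(A) = (1-d) + d * sum(PR(T)/L(T)) over pages T linking to A."""
    new_pr = {}
    for page in graph:
        incoming = sum(
            pr[src] / len(targets)
            for src, targets in graph.items()
            if page in targets
        )
        new_pr[page] = (1 - d) + d * incoming
    return new_pr

# Hypothetical graph (assumption): page -> pages it links to. Every page has out-links.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

# In this version every page starts with PR = 1, so the total is N = 3.
pr = {page: 1.0 for page in graph}
for _ in range(50):
    pr = pagerank_step(graph, pr)
```

Because no page here is a sink, the total PageRank stays at N after every update; C ends up ahead of B because it receives B's entire vote plus half of A's.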
11
Calculating PageRank for a Page, One Iteration
PR(A) = (1-0.85) + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
= 0.15 + 0.85 * ( 0.5/4 + 0.7/5 + 0.2/1 )
= 0.15 + 0.85 * ( 0.125 + 0.14 + 0.2 )
= 0.15 + 0.85 * 0.465
= 0.54525
fig 4-3 needs an extra link from C to match the text
damping factor (d) = 0.85 (probability surfer landed on page by following a link)
1-d = 0.15 (probability surfer landed on page at “random”)
since this is the original version where PR sums to N and we've only accounted for ~1.95 of total PR, pages not shown must be holding PR
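The arithmetic above can be re-checked in a few lines. The PageRank values and out-degrees are the ones given on the slide; the variable names are illustrative.

```python
d = 0.85

# (current PageRank, out-degree) of each page that links to A, as given on the slide.
inbound = {"B": (0.5, 4), "C": (0.7, 5), "D": (0.2, 1)}

# Each linking page passes on its PageRank divided evenly among its out-links.
contrib = sum(rank / out_links for rank, out_links in inbound.values())

pr_a = (1 - d) + d * contrib

# Total PageRank held by the four pages shown (A, B, C, D).
accounted = pr_a + sum(rank for rank, _ in inbound.values())
```

Here `accounted` comes to roughly 1.95, which is the figure the slide uses to argue that pages not shown hold the remaining PageRank.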
12
PageRank Sinks
"S" doesn't point to anybody else, so it will acquire PageRank, but not distribute it
–.pdf, .jpeg, .html w/ no links, etc.
Solution: pretend S has links to all other nodes (A, B, C, D, E)
(figure: nodes A, B, C, D, E and sink S)
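A minimal sketch of the sink fix, assuming a made-up link structure for the six nodes (the slide's figure is not recoverable from the text). `patch_sinks` is a hypothetical helper name.

```python
def patch_sinks(graph):
    """Return a copy in which every page with no out-links links to all other pages."""
    pages = list(graph)
    return {
        page: (list(targets) if targets else [p for p in pages if p != page])
        for page, targets in graph.items()
    }

# Hypothetical edges (assumption); only the fact that S has no out-links comes from the slide.
graph = {
    "A": ["B", "S"],
    "B": ["C"],
    "C": ["D", "S"],
    "D": ["E"],
    "E": ["A"],
    "S": [],
}
patched = patch_sinks(graph)
```

After patching, S distributes its PageRank evenly to A through E, so no PageRank is lost from one iteration to the next.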
13
When To Stop?
Stop computing when the changes are small: |PR_{i+1} - PR_i| < ε, for some small threshold ε
PageRank converges in O(log(N)) iterations
–see: "The PageRank Citation Ranking: Bringing Order to the Web" for more information http://ilpubs.stanford.edu:8090/422/
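A minimal sketch of iterating until the change is small, using the version that sums to 1. The threshold value, the use of the summed absolute difference as the measure of change, and the toy graph are all assumptions for illustration; the sketch also assumes a non-empty graph with no sinks.

```python
def pagerank(graph, d=0.85, epsilon=1e-8, max_iter=1000):
    """Iterate PR(A) = (1-d)/N + d * sum(PR(T)/L(T)) until the total change is below epsilon."""
    n = len(graph)
    pr = {page: 1.0 / n for page in graph}
    for iteration in range(1, max_iter + 1):
        new_pr = {page: (1 - d) / n for page in graph}
        for src, targets in graph.items():
            for dst in targets:
                new_pr[dst] += d * pr[src] / len(targets)
        # Total absolute change between consecutive iterations.
        delta = sum(abs(new_pr[p] - pr[p]) for p in graph)
        pr = new_pr
        if delta < epsilon:
            break
    return pr, iteration

# Hypothetical graph (assumption): page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, iterations = pagerank(graph)
```

A looser `epsilon` stops earlier at the cost of precision; for ranking purposes only the relative order of the values usually matters.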
14
PageRank is a Cool Name…
…but recalling your linear algebra, you're really just computing the principal eigenvector of the modified adjacency matrix (each column sums to 1)
The innovation was realizing the Web is a graph and applying eigenvector centrality for "quality"
For more info, see the PageRank paper and:
http://en.wikipedia.org/wiki/PageRank
http://en.wikipedia.org/wiki/Modified_adjacency_matrix
http://en.wikipedia.org/wiki/Eigenvalue,_eigenvector_and_eigenspace
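A minimal pure-Python sketch of the eigenvector view, assuming a hypothetical three-page graph with no sinks. It builds a modified adjacency matrix whose columns each sum to 1, runs power iteration, and checks that the result v satisfies M v = v. The helper names are illustrative.

```python
def google_matrix(graph, d=0.85):
    """Build M with M[i][j] = (1-d)/N + d/L(j) if page j links to page i, else (1-d)/N."""
    pages = sorted(graph)
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    m = [[(1 - d) / n] * n for _ in range(n)]
    for src, targets in graph.items():
        for dst in targets:
            m[index[dst]][index[src]] += d / len(targets)
    return pages, m

def mat_vec(m, v):
    """Multiply matrix m by column vector v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

# Hypothetical graph (assumption): page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages, m = google_matrix(graph)
n = len(pages)

column_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]

# Power iteration: repeatedly applying M converges to the principal eigenvector.
v = [1.0 / n] * n
for _ in range(200):
    v = mat_vec(m, v)

# If v is an eigenvector with eigenvalue 1, applying M once more changes nothing.
residual = max(abs(a - b) for a, b in zip(mat_vec(m, v), v))
```

The entries of `v` are the PageRank values of the pages in `pages`, in the version that sums to 1.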
15
PageRank Visualizer
http://www.mapequation.org/
http://arxiv.org/abs/0906.1405
16
Check Your PageRank…
http://www.prchecker.info/check_page_rank.php
10/10: google.com, cnn.com
7/10: www.cs.odu.edu
6/10: www.cs.odu.edu/~mln/
5/10: djshadow.com
17
PageRank Everywhere
100s (1000s?) of PR variations, optimizations, applications, etc.
If expressible as a graph (or network), then someone is calculating PR.
18
Authorship Networks
Co-Authorship Networks in the Digital Library Research Community
–http://dx.doi.org/10.1016/j.ipm.2005.03.012
–http://arxiv.org/abs/cs.DL/0502056
article1 hasAuthors v1, v2, v3
article2 hasAuthors v1, v3
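A minimal sketch of turning the two example articles into a co-authorship graph. Weighting each author pair by the number of shared articles is an assumption made here for illustration; the cited paper may weight edges differently.

```python
from collections import Counter
from itertools import combinations

# The two example articles from the slide.
articles = {
    "article1": ["v1", "v2", "v3"],
    "article2": ["v1", "v3"],
}

# Undirected edge (a, b) with a < b; weight = number of articles written together (assumption).
coauthor_weight = Counter()
for authors in articles.values():
    for a, b in combinations(sorted(set(authors)), 2):
        coauthor_weight[(a, b)] += 1
```

Once the edges exist, any of the graph metrics in this lecture can be computed over authors instead of web pages.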
19
NFL Rankings
As part of Greg Szalkowski's research
–http://ws-dl.blogspot.com/2009/12/nfl-playoff-outlook.html
–link from loser to winner, edge weight = margin of victory
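A sketch of one way edge weights could enter the computation: each loser splits its PageRank among the teams that beat it, in proportion to margin of victory. This weighting scheme, the team names, and the scores are all assumptions for illustration and are not taken from the cited research.

```python
def weighted_pagerank(edges, d=0.85, iterations=100):
    """edges: list of (loser, winner, margin). Rank flows from loser to winner by margin."""
    teams = sorted({t for loser, winner, _ in edges for t in (loser, winner)})
    n = len(teams)
    out_weight = {t: 0.0 for t in teams}
    for loser, _, margin in edges:
        out_weight[loser] += margin
    pr = {t: 1.0 / n for t in teams}
    for _ in range(iterations):
        # Undefeated teams have no out-links (sinks); spread their rank evenly over all teams.
        sink_mass = sum(pr[t] for t in teams if out_weight[t] == 0)
        new_pr = {t: (1 - d) / n + d * sink_mass / n for t in teams}
        for loser, winner, margin in edges:
            new_pr[winner] += d * pr[loser] * margin / out_weight[loser]
        pr = new_pr
    return pr

# Hypothetical results (assumption): (loser, winner, margin of victory).
games = [("Team B", "Team A", 14), ("Team C", "Team A", 3), ("Team C", "Team B", 7)]
ranks = weighted_pagerank(games)
```

In this toy season the undefeated team ranks first and the winless team last, with a win over a stronger opponent counting for more than a win over a weaker one.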
20
Citation Maps of Science
http://www.eigenfactor.org/map/maps.htm
21
But PageRank & Friends Are Not the Only Method…
Kleinberg introduced the "HITS" algorithm at roughly the same time as PageRank
Constraint:
–rather than build a full-up, web-scale search engine like Google, he built what could be described as a real-time, post-query processor for the content-based search engines of the day (e.g., AltaVista) that exploited the link structure in a manner similar to Google
22
Motivation
ford.com, toyota.com, etc. don't describe themselves as "automobile manufacturers", though a query for those terms arguably should return those companies
harvard.edu is clearly canonical for a query of "Harvard", even though it uses the term less frequently than many other pages
–Many search engines of the day could not "find themselves"
23
Idea: Use Initial Search Results as "Root"
include pages the root links to
include pages that link to the root
now you have a subgraph to work with
24
Empirical Values
Start with t=200 URIs in the root set
Allow each page to bring at most d=50 "back link" (green nodes) pages into S
–if adobe.com is in the root set, you don't want all of its back links to be in S
S tended to contain 1000-5000 pages
possible optimization: exclude intra-domain links to separate "good" links from navigational links
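A minimal sketch of growing the root set into the subgraph S: add every page a root page links to, plus at most d of the pages that link to it. The link tables and page names are hypothetical, and taking the first d back links is an arbitrary choice made for illustration.

```python
def build_base_set(root, out_links, in_links, d=50):
    """Expand the root set with all forward links and at most d back links per root page."""
    base = set(root)
    for page in root:
        base.update(out_links.get(page, []))
        # Cap the back links so one hugely popular root page cannot flood S.
        base.update(in_links.get(page, [])[:d])
    return base

# Hypothetical toy link data (assumption).
out_links = {"r1": ["x", "y"], "r2": ["y"]}
in_links = {"r1": ["p1", "p2", "p3"], "r2": ["p4"]}

base = build_base_set(["r1", "r2"], out_links, in_links, d=2)
```

With `d=2`, the third back link of `r1` is left out of S, which is the effect the adobe.com example describes.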
25
In Degree is Insufficient
Within S, the "good" pages receive more links, and so ordering by in degree would allow authorities to bubble up
But "universally popular" links (e.g., yahoo.com, adobe.com, netscape.com) would still have too many in links, and they're (generally) not relevant to the query
Example (for a "similar page" query for honda.com, but result is comparable):
http://www.honda.com Honda
http://www.ford.com/ Ford Motor Company
http://www.eff.org/blueribbon.html The Blue Ribbon Campaign for Online Free Speech
http://www.mckinley.com/ Welcome to Magellan!
http://www.netscape.com Welcome to Netscape
http://www.linkexchange.com/ LinkExchange — Welcome
http://www.toyota.com/ Welcome to @Toyota
http://www.pointcom.com/ PointCom
http://home.netscape.com/ Welcome to Netscape
http://www.yahoo.com Yahoo!
26
Hubs…
good hubs point to good authorities, and good authorities are pointed to by good hubs…
(figure: hubs on one side, authorities on the other)
Insight: authoritative pages relevant to the query not only have high in degree, but also overlap in the pages that point to them
27
Computing Hubs & Authorities
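The slide's figure did not survive extraction, so here is a minimal sketch of the usual HITS iteration under stated assumptions: a page's authority score is the sum of the hub scores of pages linking to it, its hub score is the sum of the authority scores of pages it links to, and both vectors are normalized to unit length each round. The graph and names are hypothetical.

```python
import math

def hits(graph, iterations=50):
    """graph: page -> pages it links to. Returns (hub, authority) score dicts."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages that point here.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum the (new) authority scores of the pages pointed to.
        new_hub = {p: sum(auth[dst] for dst in graph.get(p, [])) for p in pages}
        # Normalize both vectors so the scores do not grow without bound.
        for scores in (auth, new_hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
        hub = new_hub
    return hub, auth

# Hypothetical subgraph (assumption): three hub-like pages and two authority-like pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
```

In this toy graph `a1` is the top authority because all three hubs point to it, and `h1` and `h2` outrank `h3` as hubs because they point to both authorities.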
28
Passes the "Looks Right" Test
(java) Authorities
.328 http://www.gamelan.com/ Gamelan
.251 http://java.sun.com/ JavaSoft Home Page
.190 http://www.digitalfocus.com/digitalfocus/faq/howdoi.html The Java Developer: How Do I...
.190 http://lightyear.ncsa.uiuc.edu/~srp/java/javabooks.html The Java Book Pages
.183 http://sunsite.unc.edu/javafaq/javafaq.html comp.lang.java FAQ
(censorship) Authorities
.378 http://www.eff.org/ EFFweb - The Electronic Frontier Foundation
.344 http://www.eff.org/blueribbon.html The Blue Ribbon Campaign for Online Free Speech
.238 http://www.cdt.org/ The Center for Democracy and Technology
.235 http://www.vtw.org/ Voters Telecommunications Watch
.218 http://www.aclu.org/ ACLU: American Civil Liberties Union
(“search engines”) Authorities
.346 http://www.yahoo.com/ Yahoo!
.291 http://www.excite.com/ Excite
.239 http://www.mckinley.com/ Welcome to Magellan!
.231 http://www.lycos.com/ Lycos Home Page
.231 http://www.altavista.digital.com/ AltaVista: Main Page
(Gates) Authorities
.643 http://www.roadahead.com/ Bill Gates: The Road Ahead
.458 http://www.microsoft.com/ Welcome to Microsoft
.440 http://www.microsoft.com/corpinfo/bill-g.htm
29
Also Can Be Used to Cluster or Discover Communities
(jaguar*) Authorities: principal eigenvector
.370 http://www2.ecst.csuchico.edu/~jschlich/Jaguar/jaguar.html
.347 http://www-und.ida.liu.se/~t94patsa/jserver.html
.292 http://tangram.informatik.uni-kl.de:8001/~rgehm/jaguar.html
.287 http://www.mcc.ac.uk/ dlms/Consoles/jaguar.html Jaguar Page
(jaguar jaguars) Authorities: 2nd non-principal vector, positive end
.255 http://www.jaguarsnfl.com/ Official Jacksonville Jaguars NFL Website
.137 http://www.nando.net/SportServer/football/nfl/jax.html Jacksonville Jaguars Home Page
.133 http://www.ao.net/~brett/jaguar/index.html Brett’s Jaguar Page
.110 http://www.usatoday.com/sports/football/sfn/sfn30.htm Jacksonville Jaguars
(jaguar jaguars) Authorities: 3rd non-principal vector, positive end
.227 http://www.jaguarvehicles.com/ Jaguar Cars Global Home Page
.227 http://www.collection.co.uk/ The Jaguar Collection - Official Web site
.211 http://www.moran.com/sterling/sterling.html
.211 http://www.coys.co.uk/
Communities found, in the order listed: Atari video game; NFL team; expensive car
30
"Does Authority Mean Quality?"
So your page has a high {PageRank | Authority}, does it mean that "experts" find it to be a quality site?
Experiment: choose 5 popular culture topics and compare authority (web) vs. quality (humans):
–Babylon 5
–Buffy the Vampire Slayer
–The Simpsons
–Tori Amos
–Smashing Pumpkins
31
Data Acquisition
40 students picked 15 "best" sites corresponding to a topic from the Yahoo directory listing (not Yahoo search)
–note: probably pre-filtered many bad/spam sites
For each of the URLs, compute:
–PageRank, Authority & Hub, In- & Out-Degree
–# & size of pages, # of images, # of audio files
3 experts for each topic (4 for The Simpsons) ranked the URIs 1 (worst) to 7 (best)
32
Inter-rater Agreement
from table 2 onward, scores of 5-7 are "good" and 1-4 are "other"
33
Link Based Methods
34
Precision Baseline
35
Precision at 5 and 10
rank by metric, then compute precision based on human assessment of "good" or "other" (simple majority)
(relevance is binary)
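A minimal sketch of precision at k with binary relevance. The site names, metric scores, and the set of "good" sites are hypothetical.

```python
def precision_at_k(ranked_urls, good_urls, k):
    """Fraction of the top-k ranked URLs that the raters judged 'good'."""
    top = ranked_urls[:k]
    return sum(1 for url in top if url in good_urls) / k

# Hypothetical metric scores (e.g., PageRank or authority) for six sites (assumption).
scores = {"site1": 0.9, "site2": 0.7, "site3": 0.4, "site4": 0.2, "site5": 0.1, "site6": 0.05}

# Hypothetical set of sites that a simple majority of raters scored 5-7 (assumption).
good = {"site1", "site3", "site6"}

# Rank by the metric, highest score first, then measure precision in the top 5.
ranked = sorted(scores, key=scores.get, reverse=True)
p_at_5 = precision_at_k(ranked, good, 5)
```

Here two of the top five sites are "good", so precision at 5 is 0.4; the third good site is ranked sixth and does not count.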
36
Upper Bound
for relevance, use: majority = (# of good ratings) / (# of raters)
because of inter-rater disagreements, we cannot achieve 100% on all topics
rank by metric, then compute precision based on majority score (results are nearly the same b/c of the low # of reviewers)
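A minimal sketch of the majority score and the resulting upper bound, assuming ratings of 5-7 count as "good". The ratings and site names are hypothetical, and `graded_precision_at_k` is an illustrative name.

```python
def majority_score(ratings, threshold=5):
    """(# of good ratings) / (# of raters), where 'good' means a rating of 5-7."""
    return sum(1 for r in ratings if r >= threshold) / len(ratings)

def graded_precision_at_k(ranked_urls, ratings_by_url, k):
    """Average majority score over the top-k URLs."""
    return sum(majority_score(ratings_by_url[u]) for u in ranked_urls[:k]) / k

# Hypothetical ratings from three experts on the 1 (worst) to 7 (best) scale (assumption).
ratings = {"site1": [7, 6, 5], "site2": [6, 3, 2], "site3": [2, 1, 4]}

# The best achievable ordering sorts by majority score itself.
ideal = sorted(ratings, key=lambda u: majority_score(ratings[u]), reverse=True)
upper_bound_at_2 = graded_precision_at_k(ideal, ratings, 2)
```

Even the ideal ordering scores only about 0.67 at k=2 here, because the raters disagree about `site2`; that is why 100% is out of reach.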
37
t-test
bold values indicate statistically significant differences (p < 0.05)
diagonal values are avg precision @ 5; others are p-values from t-test
no statistically significant difference between these methods
38
And Now the Opposite Question
Does Quality Mean Authority?
–"Correlation of Expert and Search Engine Rankings" http://arxiv.org/abs/0809.2851
–"Correlation of Music Charts and Search Engine Rankings" http://www.cs.odu.edu/~mln/pubs/jcdl09/jcdl09-se-billboard.pdf
–"Comparing the Performance of US College Football Teams in the Web and on the Field" http://www.cs.odu.edu/~mln/pubs/ht09/ht09-se-football.pdf
Quick answer: not really…
39
Design
Choose Top 10 or Top 25 lists of universities, business schools, companies, songs, movies, m & f tennis players, best places to live
Map real world objects to URIs
–ex.: Harvard Business School --> hbs.edu
Find relative SE ranking of the corresponding URIs
–not looking for absolute values, but rather: rank(URI1) >= rank(URI2) >= rank(URI3) …
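A minimal sketch of comparing two relative orderings of the same URIs. The slide does not name a statistic; Kendall's tau is used here as one common choice, and the URIs and rank positions are hypothetical.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same items, assuming no ties."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # Positive when both rankings order the pair the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(items) * (len(items) - 1) / 2
    return (concordant - discordant) / pairs

# Hypothetical positions (1 = best) in an expert list and in a search engine (assumption).
expert = {"u1.example": 1, "u2.example": 2, "u3.example": 3, "u4.example": 4}
engine = {"u1.example": 2, "u2.example": 1, "u3.example": 3, "u4.example": 4}

tau = kendall_tau(expert, engine)
```

A value of 1 means the two orderings agree on every pair, -1 means they are reversed, and values near 0 mean little relationship; only relative order is used, never absolute scores.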
40
First Study: Experts & SEs
The first study used only 1 URI per real world object. Not a problem for companies & universities, but a problem for popular culture items.
41
Second Study: SEs & Billboard Hot 100
Experiment now uses n URIs for a real world object
Does link structure reflect current popularity?
Or are SEs juicing their results based on external sources?
Or based on search traffic? (i.e., usage and not link or content)
42
College Football
High correlation early, but decayed over the season: link inertia?
more info: http://ws-dl.blogspot.com/2009/07/hypertext-2009.html
43
Summary
PageRank in its original form is long gone
–and HITS was only used on Teoma
SEO, Google dance, spamdexing, infinite ranking variations & optimizations are all beyond the scope of this class
But there is utility in understanding link-based ranking as originally proposed…
Not discussed: usage-based ranking…
44
Simple Counting…
nice, but do you trust it enough to rank with it?
45
Large-Scale, Usage-Based Networks
http://www.mesur.org/services/maps.html