Digital Libraries IS479 Ranking



Content Based Ranking
http://en.wikipedia.org/wiki/Trip_hop mentions the artist "DJ Shadow" once.
http://en.wikipedia.org/wiki/DJ_Shadow mentions "DJ Shadow" 30+ times.
Intuitively, the latter is more "about" DJ Shadow than the former, and the frequency of the term(s) reflects this "aboutness".
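The "aboutness" intuition can be sketched as a simple phrase count. This is only an illustration: the two text snippets are made-up stand-ins for the Wikipedia pages, and `phrase_count` is a hypothetical helper name, not something from the slides.

```python
import re

def phrase_count(text: str, phrase: str) -> int:
    """Count case-insensitive, non-overlapping occurrences of a phrase."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))

# Made-up snippets standing in for the two pages (assumption, not real page text).
pages = {
    "Trip_hop": "Trip hop is a genre. Artists associated with it include DJ Shadow.",
    "DJ_Shadow": "DJ Shadow is a producer. DJ Shadow tours often. Fans follow DJ Shadow.",
}

# Score each page by how often the query phrase occurs, then rank by score.
scores = {name: phrase_count(text, "DJ Shadow") for name, text in pages.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Raw counts like this are exactly what the following slides call into question, since a page can repeat a phrase many times without being the best result.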

Content Based Ranking
How desirable is recall at web scale?
[Figure: screenshot annotated with the values 6, 30+, 16]

Content Based Ranking
Why is djshadow.rpod.ru on page 15 when it contains the phrase "dj shadow" 20+ times, while Rhapsody.com appears on page 1 (position #6) when it contains the same phrase only 6 times?

Is It Really About DJ Shadow?
Fake page about the real DJ Shadow? Or real page about a fake DJ Shadow?

Link-Based Metrics
Content-based metrics rest on an implicit assumption: everyone is telling the truth!
We can mine the collective intelligence of the web community by seeing how people voted with their links.
Assumption: when choosing targets for their web page links, (honest) people do a good job of filtering out spam, poor quality, etc.
Result: in search engine rankings, your document's position is influenced by the content of other people's documents.

But Not All Links Are Equal…
You linking to my LP review is nice, but it's not as nice as it would be if it were linked to by Spin Magazine, Rolling Stone, MTV, etc.
A page's "importance" is defined by having other important pages link to it:
many links > few links
"important" links > "unimportant" links

Random "Surfer" Model
The surfer starts at some random page on the Web.
The surfer begins following links from page to page.
At each page, there is some probability 1-d that the surfer becomes "bored" and randomly jumps to some other page in the Web; otherwise (probability d) they follow one of the page's links.
That is, they type a URL directly, follow a link from an email, etc. -- they just "teleport" to some other place in the Web.
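The random surfer can be simulated directly. The sketch below is a toy under stated assumptions: the three-page graph, the function name `simulate_surfer`, and the step count are all made up for illustration, and a "teleport" here may land on the current page again.

```python
import random

def simulate_surfer(links, d=0.85, steps=100_000, seed=0):
    """Return the fraction of steps a random surfer spends on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)          # start at a random page
    for _ in range(steps):
        visits[current] += 1
        out = links[current]
        if out and rng.random() < d:
            current = rng.choice(out)    # follow a link with probability d
        else:
            current = rng.choice(pages)  # bored: teleport with probability 1-d
    return {p: visits[p] / steps for p in pages}

# Hypothetical toy graph: page -> pages it links to.
toy = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
freq = simulate_surfer(toy)
print(freq)
```

The visit frequencies approximate the "sums to 1" flavor of PageRank: pages that attract more link traffic are visited more often.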

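Computing PageRank
Original paper version, sums to N (number of pages in the graph):
PR(A) = (1-d) + d * ( PR(T1)/L(T1) + ... + PR(Tn)/L(Tn) )
More common version, sums to 1:
PR(A) = (1-d)/N + d * ( PR(T1)/L(T1) + ... + PR(Tn)/L(Tn) )
T1 ... Tn = the pages that link to A
d = damping factor
L() = out degree of a page
PR() = PageRank of a page (all nodes start with PR() = 1 in the first version, or 1/N in the second).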
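A minimal sketch of one update step for both variants, assuming a small made-up graph in which every page has at least one outgoing link. The name `pagerank_step` and the `normalized` flag are illustrative choices, not from the slides.

```python
def pagerank_step(links, pr, d=0.85, normalized=False):
    """One synchronous PageRank update.

    links: page -> list of pages it links to (each page needs >= 1 outlink).
    normalized=False: original version, scores sum to N.
    normalized=True: common version, scores sum to 1.
    """
    n = len(links)
    base = (1 - d) / n if normalized else (1 - d)
    new_pr = {page: base for page in links}
    for page, targets in links.items():
        share = d * pr[page] / len(targets)  # PR(page) / L(page), damped
        for target in targets:
            new_pr[target] += share
    return new_pr

# Hypothetical toy graph.
toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
n = len(toy)
original = pagerank_step(toy, {p: 1.0 for p in toy})
common = pagerank_step(toy, {p: 1.0 / n for p in toy}, normalized=True)
print(sum(original.values()), sum(common.values()))
```

The two variants differ only by a constant factor of N, so they produce the same ordering of pages.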

Calculating PageRank for a Page, One Iteration
Note: fig 4-3 needs an extra link from C to match the text.
PR(A) = (1-0.85) + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
      = 0.15 + 0.85 * ( 0.5/4 + 0.7/5 + 0.2/1 )
      = 0.15 + 0.85 * ( 0.125 + 0.14 + 0.2 )
      = 0.15 + 0.85 * 0.465
      = 0.54525
damping factor (d) = 0.85 (probability the surfer landed on the page by following a link)
1-d = 0.15 (probability the surfer landed on the page at "random")
Since this is the original version where PR sums to N and we've only accounted for ~1.95 of the total PR, pages not shown must be holding PR.
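The arithmetic above can be checked in a few lines. The dictionary layout is just one convenient way to hold the slide's numbers; the variable names are illustrative.

```python
d = 0.85

# Inbound pages of A from the slide: (current PageRank, number of outgoing links).
inbound = {"B": (0.5, 4), "C": (0.7, 5), "D": (0.2, 1)}

# Original (sums-to-N) formula: PR(A) = (1-d) + d * sum(PR(T)/L(T)).
pr_a = (1 - d) + d * sum(pr / links for pr, links in inbound.values())
print(round(pr_a, 5))  # 0.54525

# PageRank held by the four pages shown; the remainder sits on pages not shown.
shown_total = pr_a + sum(pr for pr, _ in inbound.values())
print(round(shown_total, 2))
```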

PageRank Sinks
[Figure: graph with nodes A, B, C, D, E and a sink node S]
"S" doesn't point to anybody else, so it will acquire PageRank but not distribute it.
Examples of sinks: .pdf, .jpeg, .html with no links, etc.
Solution: pretend S has links to all other nodes (A, B, C, D, E).
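One way to sketch the sink fix is to rewrite the link table before running PageRank. The edges below are invented (the figure's actual edges are not in the transcript), and `patch_sinks` is a hypothetical helper name.

```python
def patch_sinks(links):
    """Give every page with no outgoing links a link to all other pages."""
    pages = list(links)
    return {
        page: list(targets) if targets else [p for p in pages if p != page]
        for page, targets in links.items()
    }

# Hypothetical edges; only the node names and the sink S come from the slide.
links = {
    "A": ["B", "S"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": ["E"],
    "E": ["S"],
    "S": [],  # sink: acquires PageRank but never passes it on
}

patched = patch_sinks(links)
print(patched["S"])
```

After patching, S spreads its score evenly over A through E, so no PageRank is trapped.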

When To Stop?
Stop computing when the changes are small: |PR_(i+1) - PR_i| < ε
PageRank converges in O(log(N)) iterations.
See "The PageRank Citation Ranking: Bringing Order to the Web" for more information: http://ilpubs.stanford.edu:8090/422/
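A small sketch of the stopping rule, assuming the sums-to-1 variant, a made-up graph without sinks, and the summed absolute change across pages as the measure of "small". The threshold `eps` and the cap `max_iter` are arbitrary illustrative values.

```python
def pagerank(links, d=0.85, eps=1e-6, max_iter=100):
    """Iterate PageRank until the total change between rounds drops below eps."""
    n = len(links)
    pr = {p: 1.0 / n for p in links}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_pr = {p: (1 - d) / n for p in links}
        for page, targets in links.items():
            for target in targets:
                new_pr[target] += d * pr[page] / len(targets)
        # Summed absolute difference between successive score vectors.
        change = sum(abs(new_pr[p] - pr[p]) for p in links)
        pr = new_pr
        if change < eps:
            break
    return pr, iteration

# Hypothetical toy graph; every page has at least one outgoing link.
toy = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
scores, rounds = pagerank(toy)
print(scores, rounds)
```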

PageRank is a Cool Name…
…but recalling your linear algebra, you're really just computing the principal eigenvector of the modified adjacency matrix (each column sums to 1).
The innovation was realizing the Web is a graph and applying eigenvector centrality for "quality".
For more info, see the PageRank paper and:
http://en.wikipedia.org/wiki/PageRank
http://en.wikipedia.org/wiki/Modified_adjacency_matrix
http://en.wikipedia.org/wiki/Eigenvalue,_eigenvector_and_eigenspace
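The eigenvector view can be checked numerically with plain Python lists. This sketch assumes a made-up three-page graph without sinks; `damped_matrix` and `mat_vec` are hypothetical helper names. It builds the damped, column-normalized link matrix, confirms each column sums to 1, and shows that repeated multiplication settles on a vector v with M v = v.

```python
def damped_matrix(links, d=0.85):
    """Build M where M[i][j] = (1-d)/N + d/L(j) if page j links to page i."""
    pages = sorted(links)
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    m = [[(1 - d) / n] * n for _ in range(n)]
    for page, targets in links.items():
        for target in targets:
            m[index[target]][index[page]] += d / len(targets)
    return pages, m

def mat_vec(m, v):
    """Multiply matrix m by column vector v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

toy = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}  # hypothetical graph
pages, m = damped_matrix(toy)
n = len(pages)
column_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]

# Power iteration: repeated multiplication converges to the principal eigenvector.
v = [1.0 / n] * n
for _ in range(100):
    v = mat_vec(m, v)
residual = max(abs(a - b) for a, b in zip(mat_vec(m, v), v))
print(column_sums, dict(zip(pages, v)), residual)
```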

PageRank Visualizer
http://www.mapequation.org/
http://arxiv.org/abs/0906.1405

Check Your PageRank…
http://www.prchecker.info/check_page_rank.php
10/10: google.com, cnn.com
7/10: www.cs.odu.edu
6/10: www.cs.odu.edu/~mln/
5/10: djshadow.com
4/10: f-measure.blogspot.com

But PageRank & Friends Are Not the Only Method…
Kleinberg introduced the "HITS" algorithm at roughly the same time as PageRank.
Constraint: rather than build a full-up, web-scale search engine like Google, he built what could be described as a real-time, post-query processor for the content-based search engines of the day (e.g., AltaVista), one that exploited the link structure in a manner similar to Google.

Motivation
ford.com, toyota.com, etc. don't describe themselves as "automobile manufacturers", though a query for those terms arguably should return those companies.
harvard.edu is clearly canonical for a query of "Harvard", even though it uses the term less frequently than many other pages.
Many search engines of the day could not "find themselves".

Idea: Use Initial Search Results as "Root"
Include pages that link to the root.
Include pages the root links to.
Now you have a subgraph to work with.

Empirical Values
Start with t=200 URIs in the root set.
Allow each page to bring at most d=50 "back link" pages (green nodes) into S: if adobe.com is in the root set, you don't want all of its back links to be in S.
S tended to contain 1000-5000 pages.
Possible optimization: exclude intra-domain links to separate "good" links from navigational links.
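The root-to-S expansion might be sketched as follows. All page names and link tables are invented, and `build_base_set` is a hypothetical name; a real system would obtain the root set from a search engine and the back links from a link index. The demo uses d=3 instead of 50 to keep the data small.

```python
def build_base_set(root, out_links, in_links, d=50):
    """Expand a root set into S: add all outlinks, plus at most d back links per page."""
    base = set(root)
    for page in root:
        base.update(out_links.get(page, []))     # pages the root page links to
        base.update(in_links.get(page, [])[:d])  # capped number of back links
    return base

# Hypothetical data: one very popular root page and one niche root page.
root = ["popular.example", "niche.example"]
out_links = {"niche.example": ["popular.example", "other.example"]}
in_links = {
    "popular.example": [f"fan{i}.example" for i in range(100)],
    "niche.example": ["hub.example"],
}

s = build_base_set(root, out_links, in_links, d=3)
print(sorted(s))
```

The cap keeps a universally linked page from flooding S with its back links, which is the adobe.com concern raised above.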

In Degree is Insufficient
Within S, the "good" pages receive more links, and so ordering by in degree would allow authorities to bubble up.
But "universally popular" links (e.g., yahoo.com, adobe.com, netscape.com) would still have too many in links, and they're (generally) not relevant to the query.
Example (for a "similar page" query for honda.com, but the result is comparable):
http://www.honda.com Honda
http://www.ford.com/ Ford Motor Company
http://www.eff.org/blueribbon.html The Blue Ribbon Campaign for Online Free Speech
http://www.mckinley.com/ Welcome to Magellan!
http://www.netscape.com Welcome to Netscape
http://www.linkexchange.com/ LinkExchange — Welcome
http://www.toyota.com/ Welcome to @Toyota
http://www.pointcom.com/ PointCom
http://home.netscape.com/ Welcome to Netscape
http://www.yahoo.com Yahoo!

Hubs…
Insight: authoritative pages relevant to the query not only have high in degree, but also overlap in the pages that point to them.
Good hubs point to good authorities, and good authorities point to good hubs…

Computing Hubs & Authorities
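A minimal sketch of the hub and authority updates, assuming a tiny made-up graph and a fixed number of rounds. Scores are normalized so that their squares sum to 1, as in Kleinberg's paper; other presentations normalize by the plain sum instead. The function name `hits` is illustrative.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) scores for a graph given as page -> outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub score: sum of authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: h1 and h2 are hub-like, a1 and a2 are authority-like.
toy = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(toy)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph a1 ends up as the top authority because both hubs point to it, and h1 is the top hub because it points to both authorities.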

HITS Example
[Figure: hub and authority scores on a small example graph over three iterations; each iteration has three panels: Input, Update Scores, Normalize Scores]
Slide 20 from Chapter 10 of “Search Engines: Information Retrieval in Practice” http://www.search-engines-book.com/slides/

Passes the "Looks Right" Test
(java) Authorities
.328 http://www.gamelan.com/ Gamelan
.251 http://java.sun.com/ JavaSoft Home Page
.190 http://www.digitalfocus.com/digitalfocus/faq/howdoi.html The Java Developer: How Do I...
.190 http://lightyear.ncsa.uiuc.edu/∼srp/java/javabooks.html The Java Book Pages
.183 http://sunsite.unc.edu/javafaq/javafaq.html comp.lang.java FAQ
(censorship) Authorities
.378 http://www.eff.org/ EFFweb - The Electronic Frontier Foundation
.344 http://www.eff.org/blueribbon.html The Blue Ribbon Campaign for Online Free Speech
.238 http://www.cdt.org/ The Center for Democracy and Technology
.235 http://www.vtw.org/ Voters Telecommunications Watch
.218 http://www.aclu.org/ ACLU: American Civil Liberties Union
(“search engines”) Authorities
.346 http://www.yahoo.com/ Yahoo!
.291 http://www.excite.com/ Excite
.239 http://www.mckinley.com/ Welcome to Magellan!
.231 http://www.lycos.com/ Lycos Home Page
.231 http://www.altavista.digital.com/ AltaVista: Main Page
(Gates) Authorities
.643 http://www.roadahead.com/ Bill Gates: The Road Ahead
.458 http://www.microsoft.com/ Welcome to Microsoft
.440 http://www.microsoft.com/corpinfo/bill-g.htm

Also Can Be Used to Cluster or Discover Communities
(jaguar*) Authorities: principal eigenvector [Atari video game]
.370 http://www2.ecst.csuchico.edu/∼jschlich/Jaguar/jaguar.html
.347 http://www-und.ida.liu.se/∼t94patsa/jserver.html
.292 http://tangram.informatik.uni-kl.de:8001/∼rgehm/jaguar.html
.287 http://www.mcc.ac.uk/ dlms/Consoles/jaguar.html Jaguar Page
(jaguar jaguars) Authorities: 2nd non-principal vector, positive end [NFL team]
.255 http://www.jaguarsnfl.com/ Official Jacksonville Jaguars NFL Website
.137 http://www.nando.net/SportServer/football/nfl/jax.html Jacksonville Jaguars Home Page
.133 http://www.ao.net/∼brett/jaguar/index.html Brett’s Jaguar Page
.110 http://www.usatoday.com/sports/football/sfn/sfn30.htm Jacksonville Jaguars
(jaguar jaguars) Authorities: 3rd non-principal vector, positive end [Expensive car]
.227 http://www.jaguarvehicles.com/ Jaguar Cars Global Home Page
.227 http://www.collection.co.uk/ The Jaguar Collection - Official Web site
.211 http://www.moran.com/sterling/sterling.html
.211 http://www.coys.co.uk/