Lecture 22 SVD, Eigenvector, and Web Search

Lecture 22 SVD, Eigenvector, and Web Search Shang-Hua Teng

Earlier Search Engines Hotbot, Yahoo, Alta Vista, Northern Light, Excite, Infoseek, Lycos, ... Main technique: the "inverted index". Conceptually, use a matrix to record how many times each term appears on each page: # of columns = # of pages (huge!), # of rows = # of terms (also huge!).

           Page1  Page2  Page3  Page4  ...
'car'        1      0      1      0
'toyota'     0      2      0      1    <- page 2 mentions 'toyota' twice
'honda'      2      1      0      0
...
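To make this concrete, here is a minimal Python sketch of an inverted index (the page names and counts are hypothetical, mirroring the toy table above): rather than storing the full terms-by-pages matrix, store for each term only the pages where it actually appears.

```python
from collections import defaultdict

# Hypothetical per-page term counts, mirroring the toy table above.
pages = {
    "page1": {"car": 1, "honda": 2},
    "page2": {"toyota": 2, "honda": 1},
    "page3": {"car": 1},
    "page4": {"toyota": 1},
}

# Inverted index: term -> {page: count}, keeping only nonzero entries
# instead of the full (# terms x # pages) matrix.
inverted_index = defaultdict(dict)
for page, counts in pages.items():
    for term, count in counts.items():
        inverted_index[term][page] = count

print(inverted_index["toyota"])  # {'page2': 2, 'page4': 1}
```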

Search by Keywords If the query has one keyword, just return all the pages that contain the word. E.g., "toyota" -> all pages containing "toyota": page2, page4, ... There could be very many pages! Solution: return the pages with the highest frequency of the word first.

Multi-keyword Search For each keyword W, find the set of all pages mentioning W, then intersect those sets (assuming an "AND" combination of the keywords). Example: the search "toyota honda" returns all the pages that mention both "toyota" and "honda".
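Continuing that sketch (and again a hypothetical implementation, not a real engine's machinery): single-keyword lookup sorted by frequency, and a multi-keyword AND query as a set intersection of posting lists.

```python
def search_one(index, term):
    # All pages containing the term, highest frequency first.
    postings = index.get(term, {})
    return sorted(postings, key=postings.get, reverse=True)

def search_and(index, terms):
    # Intersect the sets of pages that mention every keyword.
    page_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*page_sets) if page_sets else set()

# Using the inverted_index built in the previous sketch:
print(search_one(inverted_index, "toyota"))             # ['page2', 'page4']
print(search_and(inverted_index, ["toyota", "honda"]))  # {'page2'}
```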

Observations The "matrix" can be huge: the Web now has more than 10 billion pages, and there are many "terms" on the Web, many of them typos. It is also not easy to do the computation efficiently: given a word, find all the pages; then intersect many sets of pages. For these reasons, search engines never store this "matrix" so naively.

Problems Spamming: search engines can be easily "fooled". People want their pages ranked at the very top for a word search (e.g., "toyota"), so they repeat the word many times, even though such pages may be unimportant compared to www.toyota.com, which might mention "toyota" only once (or not at all).

Closer look at the problems We lack a concept of the "importance" of each page on each topic. E.g., a random page may not be as "important" as Yahoo's main page, so a link from Yahoo is most likely more important than a link from that random page. But how do we capture the importance of a page? One guess: # of hits? But where would we get that info? # of inlinks to a page -> Google's main idea.

PageRank Intuition: the importance of each page should be decided by what other pages "say" about this page. One naive implementation: count the # of pages pointing to each page (i.e., the # of inlinks). Problem: we can easily fool this technique by generating many dummy pages that point to a page.

Link Analysis The goal is to rank pages, taking advantage of the link structure to do so. Two main approaches. Static: use the links to calculate a ranking of the pages offline (Google). Dynamic: use the links in the results of a search to dynamically determine a ranking (IBM Clever: Hubs and Authorities).

The Link Graph View documents as graph nodes and the hyperlinks between documents as directed edges. Weights can be given to edges (links) based on position in the document, weight of the anchor term, and number of occurrences of the link. Our "MiniWeb" has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS). [Diagram: the three-node MiniWeb graph.]

Hyperlink analysis Idea: mine the structure of the web graph. Related work: classic IR work on citations (citations = links), a.k.a. "bibliometrics"; socio-metrics. Many Web-related papers use this approach.

Google's approach Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A), so the quality of a page is related to its in-degree. Recursion: the quality of a page is related to its in-degree and to the quality of the pages linking to it. This is PageRank [Brin and Page].

Intuition of PageRank Consider the following infinite random walk (surf): initially the surfer is at a random page; at each step, the surfer proceeds to a randomly chosen web page with probability a, or to a randomly chosen successor of the current page with probability 1-a. The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
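A quick Monte Carlo sketch of this walk on the three-page MiniWeb used later in the lecture (the successor lists below are assumptions based on that example): the visit fractions approximate the PageRank values.

```python
import random

# Assumed MiniWeb successor lists: Ne -> Ne, Am;  Am -> Ne, MS;  MS -> Am.
links = {"Ne": ["Ne", "Am"], "Am": ["Ne", "MS"], "MS": ["Am"]}

def simulate(links, a=0.2, steps=1_000_000):
    pages = list(links)
    visits = dict.fromkeys(pages, 0)
    page = random.choice(pages)          # initially at a random page
    for _ in range(steps):
        if random.random() < a:          # prob. a: jump to a random page
            page = random.choice(pages)
        else:                            # prob. 1-a: follow a random link
            page = random.choice(links[page])
        visits[page] += 1
    return {p: v / steps for p, v in visits.items()}

print(simulate(links))  # fraction of steps the surfer spends at each page
```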

PageRank: Formulation PageRank is the stationary probability distribution of this random process (a Markov chain), i.e., PageRank(p) = a/n + (1 - a) * Σ_{q → p} PageRank(q) / outdegree(q), where n is the total number of nodes in the graph.

PageRank: Matrix Formulation Define the transition matrix M, where M(i, j) = 1/outdegree(j) if page j links to page i, and 0 otherwise. The PageRank vector r is then an eigenvector of the transition matrix: r = M r (for the case a = 0).

Example: MiniWeb (a = 0) Our "MiniWeb" has only three web sites: Netscape, Amazon, and Microsoft; their PageRanks are represented as a vector r = (ne, am, ms). In each iteration, half of the weight of Am goes to Ne and half goes to MS; likewise (from the diagram), Ne splits its weight between itself and Am, and MS sends all of its weight to Am:

r = M r,  where  M = [ 1/2  1/2   0 ]
                     [ 1/2   0    1 ]
                     [  0   1/2   0 ]   (rows and columns ordered Ne, Am, MS)

Iterative computation Starting from the uniform vector and repeatedly multiplying by M, the weights converge. Final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft. Does it capture the intuition?
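A numpy sketch of this iteration, using the transition matrix reconstructed above with a = 0:

```python
import numpy as np

# Column j of M describes how page j splits its weight; order: Ne, Am, MS.
M = np.array([[0.5, 0.5, 0.0],   # Ne receives half of Ne and half of Am
              [0.5, 0.0, 1.0],   # Am receives half of Ne and all of MS
              [0.0, 0.5, 0.0]])  # MS receives half of Am

r = np.full(3, 1 / 3)            # start from the uniform vector
for _ in range(50):
    r = M @ r                    # one step of the random walk
print(r / r.min())               # -> [2. 2. 1.]: Ne = Am = 2 x MS
```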

Observations The matrix is stochastic (the sum of each column is 1). So the iterations converge, and they compute the principal eigenvector of the transition matrix, i.e., the solution of the matrix equation r = M r (an eigenvector with eigenvalue 1).

Problem 1 of the algorithm: dead ends. Suppose MS does not point to anybody, so its column in M is all zeros. Result: weight of the Web "leaks out" (the total weight shrinks at every iteration, and the vector converges to all zeros).

Problem 2 of the algorithm: spider traps. Suppose MS only points to itself. Result: all the weight eventually goes to MS!

Google's Hack: setting a > 0, i.e., "tax each page". Like people paying taxes, each page pays some of its weight into a public pool, which is distributed evenly to all pages. Example: assume a 20% tax rate (a = 0.2) in the "spider trap" example.
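A sketch of the taxed iteration on the spider-trap MiniWeb (MS pointing only to itself), with the slide's assumed 20% tax rate:

```python
import numpy as np

# Spider-trap variant: MS keeps all of its own weight. Order: Ne, Am, MS.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])

a, n = 0.2, 3                        # 20% tax rate, 3 pages
r = np.full(n, 1 / n)
for _ in range(100):
    r = a / n + (1 - a) * (M @ r)    # tax pool + discounted walk step
print(r)  # ~[0.21, 0.15, 0.64]: MS is still biggest, but no longer takes all
```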

Dynamic Ranking, Hubs and Authorities, IBM Clever Goal: to get a ranking for a particular query (instead of the whole web). Assume: we have a (set of) search engine(s) that can return the set of pages P matching a query.

Hubs and Authorities Motivation: find web pages related to a topic. E.g., "find all web sites about automobiles". An "authority" is a page that offers info about the topic, e.g., BMW, Toyota, Ford, ... A "hub" is a page that doesn't provide much info itself, but tells us where to find pages about the topic, e.g., auto sale sites, ebay, www.ConsumerReports.org.

Kleinberg's goal: given a query, find good sources of content (authorities) and good sources of links (hubs).

Two values of a page Each page has a hub value and an authority value (whereas in PageRank each page has a single value, its "weight"). Two vectors: h holds the hub values, a holds the authority values.

HITS algorithm: find hubs and authorities. First step: find pages related to the topic (e.g., "automobile") and construct the corresponding "focused subgraph". Find the set S of pages containing the keyword ("automobile"): the root set. Find all pages these S pages point to, i.e., their forward neighbors. Find all pages that point to S pages, i.e., their backward neighbors. Compute the subgraph induced by all of these pages: the focused subgraph. A sketch of this construction appears below.
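A rough sketch of the focused-subgraph construction, under assumed inputs: hypothetical in-memory maps out_links/in_links and a matches(page, keyword) predicate standing in for a real search engine and crawl index.

```python
def focused_subgraph(out_links, in_links, matches, keyword):
    # Root set S: pages containing the keyword.
    S = {p for p in out_links if matches(p, keyword)}
    # Base set: S plus its forward and backward neighbors.
    base = set(S)
    for p in S:
        base |= set(out_links.get(p, ()))  # pages that S points to
        base |= set(in_links.get(p, ()))   # pages that point to S
    # Induced subgraph: keep only edges between base-set pages.
    return {p: [q for q in out_links.get(p, ()) if q in base] for p in base}
```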

Neighborhood graph: the subgraph associated with each query. The query results form the start set (Result1 ... Resultn); pages pointing to them form the back set (b1 ... bm); pages they point to form the forward set (f1 ... fs). There is an edge for each hyperlink, but no edges within the same host.

Step 2: computing h and a. Initially, set every hub and authority value to 1. In each iteration, the hub score of a page becomes the total authority value of its forward neighbors, and the authority value of each page becomes the total hub value of its backward neighbors (each vector is normalized after every iteration). Iterate until convergence.
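A numpy sketch of this two-step iteration, using the adjacency matrix A defined on the next slide (A[i, j] = 1 iff page i links to page j); the MiniWeb links are the same assumed ones as before.

```python
import numpy as np

def hits(A, iters=50):
    n = A.shape[0]
    h, a = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h              # authority = total hub value of backward neighbors
        h = A @ a                # hub = total authority value of forward neighbors
        a /= np.linalg.norm(a)   # normalize both vectors
        h /= np.linalg.norm(h)
    return h, a

# Assumed MiniWeb adjacency: Ne->Ne,Am; Am->Ne,MS; MS->Am (order Ne, Am, MS).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
h, a = hits(A)
print("hubs:", h, "authorities:", a)
```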

Computing Hubs and Authorities (1) For each page p, we associate a non-negative authority weight a_p and a non-negative hub weight h_p, updated as a_p ← Σ_{q : q→p} h_q (1) and h_p ← Σ_{q : p→q} a_q (2). Number the pages {1, 2, ..., n} and define their adjacency matrix A to be the n×n matrix whose (i, j)-th entry is 1 if page i links to page j, and 0 otherwise. Defining the vectors a = (a_1, a_2, ..., a_n) and h = (h_1, h_2, ..., h_n), the updates become a ← A^T h (3) and h ← A a (4).

Computing Hubs and Authorities (2) Let B = A^T A (5). Substituting (4) into (3) gives a ← A^T A a = B a (6), so after normalization a converges to the principal eigenvector of B (7). In other words, a is an eigenvector of B. B is the co-citation matrix: B(i, j) is the number of sites that jointly point to both i and j. B is symmetric and has n orthogonal unit eigenvectors.

Hubs and Authorities Hub and authority scores are given by the first singular vectors of the adjacency matrix A: the hub vector h is the first left singular vector (the principal eigenvector of A A^T), and the authority vector a is the first right singular vector (the principal eigenvector of A^T A).
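This is where the lecture's SVD theme connects. A small check (same assumed MiniWeb adjacency as above): the first left and right singular vectors of A match the converged hub and authority scores up to sign.

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],   # assumed MiniWeb adjacency, order Ne, Am, MS
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

U, s, Vt = np.linalg.svd(A)
hubs = np.abs(U[:, 0])           # first left singular vector  ~ hub scores
authorities = np.abs(Vt[0])      # first right singular vector ~ authority scores
print("hubs:", hubs, "authorities:", authorities)
```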

Example: MiniWeb [Diagram: one iteration of the hub/authority computation on the Ne/Am/MS graph. Don't forget the normalization step after each iteration!]

Example: MiniWeb (continued) [Diagram: the resulting hub and authority scores for Ne, Am, and MS.]