Lecture 22: SVD, Eigenvectors, and Web Search
Shang-Hua Teng
Earlier Search Engines
Hotbot, Yahoo, Alta Vista, Northern Light, Excite, Infoseek, Lycos, ...
Main technique: the "inverted index."
Conceptually: use a matrix to record how many times each term appears in each page:
- # of columns = # of pages (huge!)
- # of rows = # of terms (also huge!)
For example, the row for 'toyota' has a 2 in the column for page 2 if page 2 mentions 'toyota' twice.
Search by Keywords
If the query has one keyword, just return all the pages that contain the word.
E.g., "toyota": all pages containing "toyota" are returned (page2, page4, ...).
There could be many, many such pages!
Solution: return the pages in which the word appears most frequently first.
Multi-keyword Search
For each keyword W, find the set of pages mentioning W, then intersect all of those sets (assuming an "AND" over the keywords).
Example: the search "toyota honda" returns all the pages that mention both "toyota" and "honda".
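The intersection step above can be sketched in a few lines of Python. This is a minimal illustration, not a real engine; the page names and the tiny inverted index are made up.

```python
# Toy inverted index: term -> set of pages containing it (made up).
inverted_index = {
    "toyota": {"page2", "page4", "page7"},
    "honda":  {"page2", "page5", "page7"},
}

def and_search(query_terms, index):
    """Return the pages containing every query term ("AND" semantics)."""
    sets = [index.get(t, set()) for t in query_terms]
    return sorted(set.intersection(*sets)) if sets else []

print(and_search(["toyota", "honda"], inverted_index))  # ['page2', 'page7']
```

An unknown term yields an empty posting set, so the intersection (and the result) is empty, which matches the "AND" semantics.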
Observations
The "matrix" can be huge: the Web now has more than 10 billion pages, and there are many "terms" on the Web, many of them typos.
It is also not easy to do the computation efficiently:
- Given a word, find all the pages that contain it.
- Intersect many sets of pages.
For these reasons, search engines never store this "matrix" so naively.
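Since the term/page matrix is overwhelmingly sparse, it is stored as postings lists rather than as a dense matrix. A sketch of that idea, with made-up toy documents, and with single-keyword results ordered by frequency as the earlier slide suggests:

```python
# Sparse storage of the term/page matrix: an inverted index mapping
# each term to {page: count}.  The documents are invented examples.
from collections import defaultdict

docs = {
    "page1": "car dealer car",
    "page2": "toyota toyota honda",
    "page3": "honda civic",
}

index = defaultdict(dict)                 # term -> {page: count}
for page, text in docs.items():
    for term in text.split():
        index[term][page] = index[term].get(page, 0) + 1

def search(term):
    """Pages containing the term, highest frequency first."""
    postings = index.get(term, {})
    return sorted(postings, key=postings.get, reverse=True)

print(index["toyota"])   # {'page2': 2}
print(search("car"))     # ['page1']
```

Only nonzero entries are ever stored, which is what makes the huge conceptual matrix tractable in practice.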
Problems
Spamming: search engines can easily be "fooled."
People want their pages to appear at the very top of a word search (e.g., "toyota"), so they repeat the word many, many times. Yet such a page may be far less important than, say, an official page that mentions "toyota" only once (or not at all).
A Closer Look at the Problems
What is lacking is a notion of the "importance" of each page on each topic.
E.g., a random page is probably not as "important" as Yahoo's main page, so a link from Yahoo is most likely more important than a link from that random page.
But how do we capture the importance of a page?
- A guess: # of hits? But where would we get that information?
- # of in-links to a page: Google's main idea.
PageRank
Intuition: the importance of each page should be decided by what other pages "say" about this page.
One naive implementation: count the # of pages pointing to each page (i.e., its # of in-links).
Problem: we can easily fool this technique by generating many dummy pages that point to a page.
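The naive in-link count, and why it is easy to fool, can be shown on a toy link graph (all page names invented):

```python
# Naive ranking by in-link count on a made-up link graph.
from collections import Counter

links = {                # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "dummy1": ["B"],     # dummy pages created just to inflate B
    "dummy2": ["B"],
}

inlinks = Counter(dst for outs in links.values() for dst in outs)
print(inlinks.most_common())  # [('B', 3), ('C', 2), ('A', 1)]
```

"B" wins only because two dummy pages point at it, which is exactly the vulnerability the slide describes; PageRank fixes this by also weighting who is doing the pointing.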
Link Analysis
The goal is to rank pages, and we want to take advantage of the link structure to do this. Two main approaches:
- Static: use the links to compute a ranking of the pages offline (Google's PageRank).
- Dynamic: use the links among the results of a search to determine a ranking at query time (IBM Clever: Hubs and Authorities).
The Link Graph
View documents as graph nodes and the hyperlinks between documents as directed edges.
Edges (links) can be given weights based on:
- position in the document,
- weight of the anchor term,
- number of occurrences of the link.
Our "MiniWeb" has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS).
Hyperlink Analysis
Idea: mine the structure of the web graph.
Related work:
- Classic IR work (citations = links), a.k.a. "bibliometrics."
- Sociometrics.
- Many Web-related papers use this approach.
Google's Approach
Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A), so the quality of a page is related to its in-degree.
Recursion: the quality of a page is related to its in-degree, and to the quality of the pages linking to it.
This is PageRank [Brin and Page].
Intuition of PageRank
Consider the following infinite random walk (surf):
- Initially, the surfer is at a random page.
- At each step, the surfer proceeds with probability a to a randomly chosen web page, and with probability 1-a to a randomly chosen successor of the current page.
The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
PageRank: Formulation
PageRank is the stationary probability distribution of this random process (a Markov chain), i.e., the vector PR satisfying
  PR(p) = a/n + (1 - a) * sum over pages q linking to p of PR(q) / outdegree(q),
where n is the total number of nodes in the graph.
PageRank: Matrix Formulation
Define the transition matrix M by M(i, j) = 1/outdegree(j) if page j links to page i, and 0 otherwise. For a = 0, the PageRank vector r is then an eigenvector of the transition matrix: M r = r.
Example: MiniWeb (a = 0)
Our "MiniWeb" has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS). Their PageRanks are represented as a vector (r_Ne, r_Am, r_MS).
For instance, in each iteration half of the weight of Am goes to Ne, and half goes to MS.
Iterative Computation
Starting from a uniform vector and repeatedly applying the transition matrix converges to the final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft. Does this capture the intuition?
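The iteration can be reproduced in a few lines. The transition matrix below is a sketch consistent with the slides' description and final result (Ne links to itself and Am, Am links to Ne and MS, MS links to Am); the original slide's matrix did not survive extraction.

```python
# Power iteration for the MiniWeb example with a = 0.
# M is column-stochastic: M[i][j] is the fraction of page j's weight
# sent to page i.  Order of rows/columns: Ne, Am, MS.
M = [
    [0.5, 0.5, 0.0],   # Ne receives half of Ne's and half of Am's weight
    [0.5, 0.0, 1.0],   # Am receives half of Ne's and all of MS's weight
    [0.0, 0.5, 0.0],   # MS receives half of Am's weight
]
r = [1/3, 1/3, 1/3]
for _ in range(100):
    r = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]
print([round(x, 3) for x in r])  # [0.4, 0.4, 0.2]
```

The limit (2/5, 2/5, 1/5) matches the slide: Ne and Am tie, each with twice the importance of MS.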
Observations
The matrix is stochastic (the sum of each column is 1), so the iterations converge and compute the principal eigenvector, i.e., the solution of the matrix equation M r = r.
Problem 1 of the Algorithm: Dead Ends
MS does not point to anybody.
Result: the weights of the Web "leak out."
Problem 2 of the Algorithm: Spider Traps
MS only points to itself.
Result: all the weight ends up at MS!
Google's Hack: Setting a > 0 ("tax each page")
Like people paying taxes, each page pays some of its weight into a public pool, which is then distributed evenly to all pages.
Example: assume a 20% tax rate (a = 0.2) in the "spider trap" example.
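The taxed iteration on the spider-trap example can be sketched as follows. The link structure is assumed from the earlier slides, with the spider-trap change that MS links only to itself; the 20% tax means each page keeps 80% of the flow and the remaining 20% is split evenly.

```python
# Taxed PageRank (a = 0.2) on the spider-trap MiniWeb.
# M is column-stochastic; rows/columns ordered Ne, Am, MS.
a = 0.2
M = [
    [0.5, 0.5, 0.0],   # Ne <- half of Ne, half of Am
    [0.5, 0.0, 0.0],   # Am <- half of Ne
    [0.0, 0.5, 1.0],   # MS <- half of Am, plus its own self-loop (the trap)
]
r = [1/3, 1/3, 1/3]
for _ in range(100):
    r = [a / 3 + (1 - a) * sum(M[i][j] * r[j] for j in range(3))
         for i in range(3)]
print([round(x, 3) for x in r])  # [0.212, 0.152, 0.636]
```

MS still gets the largest share because of its self-loop, but the tax stops it from absorbing all the weight: the fixed point is (7/33, 5/33, 21/33) rather than (0, 0, 1).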
Dynamic Ranking: Hubs and Authorities (IBM Clever)
Goal: obtain a ranking for a particular query (instead of the whole Web).
Assumption: we have a (set of) search engine(s) that can return a set of pages P matching a query.
Hubs and Authorities
Motivation: find the web pages relevant to a topic, e.g., "find all web sites about automobiles."
- "Authority": a page that offers information about the topic, e.g., BMW, Toyota, Ford, ...
- "Hub": a page that does not provide much information itself, but tells us where to find pages about the topic, e.g., auto-sale listings, eBay, ...
Kleinberg's Formulation
Goal: given a query, find
- good sources of content (authorities), and
- good sources of links (hubs).
Two Values per Page
Each page has both a hub value and an authority value, whereas in PageRank each page has a single value, its "weight." So we maintain two vectors:
- h: the hub values
- a: the authority values
HITS Algorithm: Finding Hubs and Authorities
First step: find the pages related to the topic (e.g., "automobile") and construct the corresponding "focused subgraph":
- Find the set S of pages containing the keyword ("automobile"): the root set.
- Find all pages that the S pages point to, i.e., their forward neighbors.
- Find all pages that point to the S pages, i.e., their backward neighbors.
- Take the subgraph induced by all of these pages: the focused subgraph.
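The expansion from root set to focused subgraph can be sketched as below. The link graph is a made-up toy; in practice the root set comes from a text search engine, and the backward neighbors from a link index.

```python
# Build the focused subgraph: root set plus forward and backward neighbors.
links = {                 # toy web graph: page -> pages it links to
    "p1": ["p2", "p5"],
    "p2": ["p3"],
    "p3": [],
    "p4": ["p2"],
    "p5": ["p1"],
}

def focused_subgraph(root, links):
    nodes = set(root)
    for page in root:
        nodes.update(links.get(page, []))                           # forward
        nodes.update(p for p, out in links.items() if page in out)  # backward
    return nodes

print(sorted(focused_subgraph({"p2"}, links)))  # ['p1', 'p2', 'p3', 'p4']
```

Starting from the root set {p2}, the subgraph picks up p3 (forward neighbor) and p1, p4 (backward neighbors), but not p5, which is two hops away.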
Neighborhood Graph
The subgraph associated with each query: the query results (Result1, ..., Resultn) form the start set; the pages linking into it (b1, ..., bm) form the back set, and the pages it links to (f1, ..., fs) form the forward set. There is an edge for each hyperlink, but no edges within the same host.
Step 2: Computing h and a
Initially, set every hub and authority value to 1. In each iteration:
- the hub score of a page becomes the total authority value of its forward neighbors (after normalization), and
- the authority value of a page becomes the total hub value of its backward neighbors (after normalization).
Iterate until convergence.
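The iteration above can be sketched directly; the graph passed in is assumed to be the focused subgraph, and the example graph at the bottom is invented (two pure hubs pointing at two pure authorities).

```python
# HITS iteration: alternately update authority and hub scores, then
# normalize each vector to unit length.
def hits(links, iters=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority of p = total hub value of its backward neighbors
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # hub of p = total authority value of its forward neighbors
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

toy = {"hub1": ["a", "b"], "hub2": ["a", "b"], "a": [], "b": []}
hub, auth = hits(toy)
```

On this toy graph, "a" and "b" end up with all the authority and "hub1"/"hub2" with all the hub weight, as the definitions suggest.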
Computing Hubs and Authorities (1)
For each page p, we associate a non-negative authority weight a_p and a non-negative hub weight h_p:
  (1) a_p = sum of h_q over all pages q that link to p
  (2) h_p = sum of a_q over all pages q that p links to
Number the pages 1, 2, ..., n and define their adjacency matrix A to be the n x n matrix whose (i, j)-th entry is equal to 1 if page i links to page j, and 0 otherwise. Defining a = (a_1, a_2, ..., a_n) and h = (h_1, h_2, ..., h_n), the updates become
  (3) a = A^T h
  (4) h = A a
Computing Hubs and Authorities (2)
  (5) Let B = A^T A.
  (6) Then a = A^T h = A^T A a = B a; in other words, a is an eigenvector of B.
  (7) Similarly, h is an eigenvector of A A^T.
B is the co-citation matrix: B(i, j) is the number of pages that jointly point to both i and j. B is symmetric and has n orthogonal unit eigenvectors.
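The claim that a is an eigenvector of B = A^T A can be checked numerically by running the (1)-(4) updates as a power iteration on a tiny made-up adjacency matrix:

```python
# Power iteration for the authority vector: alternate h = A a and
# a = A^T h, normalizing a each round.  A is an invented 3-page web:
# page 0 links to pages 1 and 2, page 1 links to page 2.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
n = 3
At = [[A[j][i] for j in range(n)] for i in range(n)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

a = [1.0, 1.0, 1.0]
for _ in range(50):
    h = matvec(A, a)          # (4) h = A a
    a = matvec(At, h)         # (3) a = A^T h = B a
    norm = sum(x * x for x in a) ** 0.5
    a = [x / norm for x in a]
print([round(x, 3) for x in a])  # [0.0, 0.526, 0.851]
```

Page 2, cited by both other pages, gets the largest authority; page 0, cited by nobody, gets authority 0. Applying B once more only rescales a, confirming it is an eigenvector.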
Hubs and Authorities via SVD
The hub and authority scores are the principal left and right singular vectors, respectively, of the adjacency matrix A.
Example: MiniWeb
[Worked example on the MiniWeb graph: iterating h and a for Ne, Am, and MS, with normalization at each step.]