Published by Kaia Seeds. Modified over 10 years ago.
1
Matrices, Digraphs, Markov Chains & Their Use by Google
Leslie Hogben, Iowa State University and American Institute of Mathematics
Bay Area Mathematical Adventures, February 27, 2008
2
Outline
With material from Becky Atherton
Matrices
Markov Chains
Digraphs
Google’s PageRank
3
Introduction to Matrices
A matrix is a rectangular array of numbers.
Matrices are used to solve systems of equations.
Matrices are easy for computers to work with.
4
Matrix arithmetic
Matrix Addition
Matrix Multiplication
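The formulas on this slide did not survive extraction. As a minimal sketch of the two operations it names, the following pure-Python helpers (`mat_add` and `mat_mul` are illustrative names, not from the talk) add two matrices entry by entry and multiply them row by column.

```python
# Illustrative helpers; the names mat_add and mat_mul are assumptions.

def mat_add(A, B):
    """Entrywise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


def mat_mul(A, B):
    """Product AB: entry (i, j) is row i of A times column j of B."""
    cols_B = list(zip(*B))  # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print(mat_add(A, B))  # [[6, 8], [10, 12]]
print(mat_mul(A, B))  # [[19, 22], [43, 50]]
```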
5
Introduction to Markov Chains
At each time period, every object in the system is in exactly one state, one of $1, \dots, n$.
Objects move according to the transition probabilities: the probability of going from state $j$ to state $i$ is $t_{ij}$.
Transition probabilities do not change over time.
6
The transition matrix of a Markov chain
$T = [t_{ij}]$ is an $n \times n$ matrix.
Each entry $t_{ij}$ is the probability of moving from state $j$ to state $i$.
$0 \le t_{ij} \le 1$
The sum of the entries in each column must equal 1 (stochastic).
7
Example: Customers can choose from three major grocery stores: H-Mart, Freddy’s and Shopper’s Market.
Each year H-Mart retains 80% of its customers, while losing 15% to Freddy’s and 5% to Shopper’s Market.
Freddy’s retains 65% of its customers, loses 20% to H-Mart and 15% to Shopper’s Market.
Shopper’s Market keeps 70% of its customers, loses 20% to H-Mart and 10% to Freddy’s.
8
Example: The transition matrix.
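The matrix itself was an image and is not in the transcript. It can be rebuilt from the percentages in the grocery example, using the convention that column $j$ holds the probabilities of leaving state $j$. The sketch below assumes the state order H-Mart, Freddy’s, Shopper’s Market; the helper name `is_column_stochastic` is illustrative.

```python
# Assumed state order: 0 = H-Mart, 1 = Freddy's, 2 = Shopper's Market.
# Column j lists where the customers of store j shop the following year.
T = [
    [0.80, 0.20, 0.20],  # to H-Mart
    [0.15, 0.65, 0.10],  # to Freddy's
    [0.05, 0.15, 0.70],  # to Shopper's Market
]


def is_column_stochastic(M, tol=1e-12):
    """True if all entries lie in [0, 1] and every column sums to 1."""
    n = len(M)
    entries_ok = all(0.0 <= x <= 1.0 for row in M for x in row)
    sums_ok = all(abs(sum(M[i][j] for i in range(n)) - 1.0) < tol for j in range(n))
    return entries_ok and sums_ok


print(is_column_stochastic(T))
```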
9
Look at the calculation used to determine the probability of starting at H-Mart and shopping there two years later: We can obtain the same result by multiplying row one by column one in the transition matrix:
10
This matrix tells us the probabilities of going from one store to another after 2 years: Compute the probability of shopping at each store 2 years after shopping at Shopper’s Market:
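The two-year matrix is missing from the transcript. As a sketch, it can be recomputed as $T^2$ from the percentages in the grocery example (state order H-Mart, Freddy’s, Shopper’s Market is an assumption). The third column of $T^2$ then answers the question about customers who start at Shopper’s Market.

```python
# Transition matrix rebuilt from the grocery example (column = current store).
T = [
    [0.80, 0.20, 0.20],
    [0.15, 0.65, 0.10],
    [0.05, 0.15, 0.70],
]


def mat_mul(A, B):
    """Product AB: entry (i, j) is row i of A times column j of B."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


T2 = mat_mul(T, T)  # two-year transition probabilities

# H-Mart -> H-Mart in two years: row one of T times column one of T.
p_hh = 0.80 * 0.80 + 0.20 * 0.15 + 0.20 * 0.05  # about 0.68

# Start at Shopper's Market: read off column three of T^2.
from_shoppers = [T2[i][2] for i in range(3)]  # about [0.32, 0.165, 0.515]

print(round(p_hh, 4), [round(p, 4) for p in from_shoppers])
```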
11
If customers are initially evenly distributed among H-Mart, Freddy’s, and Shopper’s Market, compute the distribution after two years:
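The worked answer is not in the transcript. A minimal sketch of the computation, assuming the state order H-Mart, Freddy’s, Shopper’s Market and the matrix rebuilt from the example, applies the transition matrix twice to the even starting distribution.

```python
# Transition matrix rebuilt from the grocery example (column = current store).
T = [
    [0.80, 0.20, 0.20],
    [0.15, 0.65, 0.10],
    [0.05, 0.15, 0.70],
]


def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


q0 = [1 / 3, 1 / 3, 1 / 3]  # even split between the three stores
q1 = mat_vec(T, q0)         # after one year, about [0.40, 0.30, 0.30]
q2 = mat_vec(T, q1)         # after two years, about [0.44, 0.285, 0.275]

print([round(p, 4) for p in q2])
```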
12
To utilize a Markov chain to compute probabilities, we need to know the initial probability vector $q^{(0)}$.
If there are $n$ states, let the initial probability vector be $q^{(0)} = (q_1, \dots, q_n)^T$, where
– $q_i$ is the probability of being in state $i$ initially
– All entries satisfy $0 \le q_i \le 1$
– Column sum = 1
13
Example: What happens after 10 years?
14
Let $q^{(k)}$ be the probability distribution after $k$ steps.
We are iterating $q^{(k+1)} = T q^{(k)}$.
Eventually, for a large enough $k$, $q^{(k+1)} \approx q^{(k)} \approx s$.
In the limit this gives $s = T s$.
$s$ is called a steady state vector.
$s$ is an eigenvector of $T$ for eigenvalue 1.
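The iteration can be sketched in a few lines. The code below uses the grocery matrix rebuilt from the example (state order H-Mart, Freddy’s, Shopper’s Market is an assumption) and stops when two successive distributions agree to within a tolerance; the function name `steady_state` and its parameters are illustrative.

```python
# Transition matrix rebuilt from the grocery example (column = current store).
T = [
    [0.80, 0.20, 0.20],
    [0.15, 0.65, 0.10],
    [0.05, 0.15, 0.70],
]


def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def steady_state(T, q, tol=1e-12, max_steps=10_000):
    """Iterate q <- T q until successive vectors differ by less than tol."""
    for _ in range(max_steps):
        q_next = mat_vec(T, q)
        if max(abs(a - b) for a, b in zip(q_next, q)) < tol:
            return q_next
        q = q_next
    raise RuntimeError("iteration did not converge")


# Distribution after 10 years, starting from an even split.
q10 = [1 / 3, 1 / 3, 1 / 3]
for _ in range(10):
    q10 = mat_vec(T, q10)

# Steady state, starting with every customer at H-Mart.
s = steady_state(T, [1.0, 0.0, 0.0])
print([round(p, 4) for p in q10], [round(p, 4) for p in s])
```

Solving $s = Ts$ by hand for this matrix gives $s = (1/2, 5/18, 2/9)$, which the iteration approaches from either starting vector.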
15
In the grocery example, there was a unique steady state vector $s$, and $T q^{(k)} \to s$. This does not need to be the case:
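The counterexample shown on the slide was lost in extraction. The two matrices below are stand-ins chosen for illustration, not the ones from the talk: a "flip" chain whose distribution oscillates forever, and the identity chain, for which every distribution is a steady state.

```python
def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical example 1: the two states swap at every step, so q^(k) never settles.
flip = [[0.0, 1.0],
        [1.0, 0.0]]
q = [1.0, 0.0]
history = [q]
for _ in range(4):
    q = mat_vec(flip, q)
    history.append(q)
print(history)  # alternates between [1.0, 0.0] and [0.0, 1.0]

# Hypothetical example 2: nothing ever moves, so the steady state is not unique.
identity = [[1.0, 0.0],
            [0.0, 1.0]]
print(mat_vec(identity, [0.3, 0.7]))  # [0.3, 0.7]
```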
16
How can we guarantee convergence to a unique steady state vector regardless of initial conditions?
One way is by having a regular transition matrix.
A nonnegative matrix is regular if some power of the matrix has only nonzero entries.
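Regularity can be tested directly from the definition by examining successive powers. The sketch below relies on Wielandt's bound, which says that if an $n \times n$ nonnegative matrix is regular, then its power $(n-1)^2 + 1$ is already entrywise positive, so no higher powers need checking. The function name `is_regular` and the test matrices are illustrative.

```python
def mat_mul(A, B):
    """Product AB: entry (i, j) is row i of A times column j of B."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


def is_regular(T):
    """True if some power T^k, 1 <= k <= (n-1)^2 + 1, has only positive entries."""
    n = len(T)
    bound = (n - 1) ** 2 + 1  # Wielandt's bound for n x n nonnegative matrices
    P = T
    for _ in range(bound):
        if all(x > 0 for row in P for x in row):
            return True
        P = mat_mul(P, T)
    return False


flip = [[0.0, 1.0], [1.0, 0.0]]     # powers alternate between flip and identity
mixing = [[0.5, 1.0], [0.5, 0.0]]   # has a zero entry, but its square is positive

print(is_regular(flip), is_regular(mixing))
```

For large matrices, repeated products of small probabilities can underflow to zero in floating point; a production check would track only the zero/nonzero pattern.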
17
Digraphs
A directed graph (digraph) is a set of vertices (nodes) and a set of directed edges (arcs) between vertices.
The arcs indicate relationships between nodes.
Digraphs can be used as models, e.g.
cities and airline routes between them
web pages and links
18
How Matrices, Markov Chains and Digraphs are used by Google
19
How does Google work?
Robot web crawlers find web pages
Pages are indexed & cataloged
Pages are assigned PageRank values
PageRank is a program that prioritizes pages
Developed by Larry Page & Sergey Brin in 1998
When pages are identified in response to a query, they are ranked by PageRank value
20
Why is PageRank important?
Only a few years ago users waited much longer for search engines to return results to their queries.
When a search engine finally responded, the returned list had many links to information that was irrelevant, and useless links invariably appeared at or near the top of the list, while useful links were deeply buried.
The Web's information is not structured like information in the organized databases and document collections: it is self-organized.
The enormous size of the Web, currently containing ~10^9 pages, completely overwhelmed traditional information retrieval (IR) techniques.
21
By 1997 it was clear that IR technology of the past wasn't well suited for Web search.
Researchers set out to devise new approaches.
Two big ideas emerged, each capitalizing on the link structure of the Web to differentiate between relevant information and fluff.
One approach, HITS (Hypertext Induced Topic Search), was introduced by Jon Kleinberg.
The other, which changed everything, is Google's PageRank, which was developed by Sergey Brin and Larry Page.
22
How are PageRank values assigned?
The number of links to and from a page gives information about the importance of the page.
The more inlinks, the more important the page.
Inlinks from “good” pages carry more weight than inlinks from “weaker” pages.
If a page points to several pages, its weight is distributed proportionally.
23
Imagine the World Wide Web as a directed graph (digraph)
Each page is a vertex
Each link is an arc
[Figure: a sample 6-page web (6-vertex digraph) with vertices 1 to 6]
24
PageRank defines the rank of page $i$ recursively by
$r_i = \sum_{j \in I_i} \frac{r_j}{|O_j|}$
$r_j$ is the rank of page $j$
$I_i$ is the set of pages that point into page $i$
$O_j$ is the set of pages that page $j$ links out to
25
For example, the rank of page 2 in our sample web: [equation and digraph figure with vertices 1 to 6 not recovered]
26
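Since this is a recursive definition, PageRank assigns an initial ranking equally to all pages:
$r_i^{(0)} = \frac{1}{n}$, where $n$ is the number of pages,
then iterates
$r_i^{(k+1)} = \sum_{j \in I_i} \frac{r_j^{(k)}}{|O_j|}$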
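The link structure of the 6-page sample web is not recoverable from the transcript, so the sketch below runs the same recursion on a hypothetical 3-page web (the `outlinks` dictionary is made up for illustration). Each page starts with rank $1/n$ and repeatedly collects $r_j / |O_j|$ from every page $j$ that links to it.

```python
# Hypothetical 3-page web (NOT the sample web from the slides):
# page 1 links to 2 and 3, page 2 links to 3, page 3 links to 1.
outlinks = {1: [2, 3], 2: [3], 3: [1]}

pages = sorted(outlinks)
n = len(pages)

# I_i: the pages that point into page i.
inlinks = {i: [j for j in pages if i in outlinks[j]] for i in pages}

# Equal initial ranking, then iterate the recursive definition.
r = {i: 1.0 / n for i in pages}
for _ in range(100):
    r = {i: sum(r[j] / len(outlinks[j]) for j in inlinks[i]) for i in pages}

print({i: round(rank, 4) for i, rank in r.items()})
```

For this particular web the ranks settle at $r = (2/5, 1/5, 2/5)$: page 2 receives only half of page 1's weight, while page 3 receives the other half plus all of page 2's.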
27
The process can be written using matrix notation.
Let $q^{(k)}$ be the PageRank vector at the $k$th iteration
Let $T$ be the transition matrix for the web
Then $q^{(k+1)} = T q^{(k)}$
$T$ is the matrix such that $t_{ij}$ is the probability of moving from page $j$ to page $i$ in one time step
Based on the assumption that all outlinks are equally likely to be selected.
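As a sketch of how such a matrix could be assembled, the function below (the name `transition_matrix` is illustrative) sets $t_{ij} = 1/|O_j|$ whenever page $j$ links to page $i$. The web used here is a hypothetical 3-page example, since the sample web's links are not in the transcript.

```python
def transition_matrix(outlinks):
    """Build T with t_ij = 1/|O_j| if page j links to page i, else 0."""
    pages = sorted(outlinks)
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    T = [[0.0] * n for _ in range(n)]
    for j, targets in outlinks.items():
        for i in targets:
            T[index[i]][index[j]] = 1.0 / len(targets)  # outlinks equally likely
    return T


def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical 3-page web (not the sample web from the slides).
outlinks = {1: [2, 3], 2: [3], 3: [1]}
T = transition_matrix(outlinks)

q0 = [1 / 3, 1 / 3, 1 / 3]
q1 = mat_vec(T, q0)  # one step of q^(k+1) = T q^(k)
print(T, q1)
```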
28
Using our 6-node sample web: Transition matrix: [matrix and digraph figure with vertices 1 to 6 not recovered]
29
To eliminate dangling nodes and obtain a stochastic matrix, replace a column of zeros with a column of 1/n’s, where n is the number of web pages.
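A minimal sketch of this fix, assuming the column convention used throughout (column $j$ describes the outlinks of page $j$): any all-zero column is replaced by a column whose entries are all $1/n$. The function name `fix_dangling` and the 3-page matrix are illustrative.

```python
def fix_dangling(T):
    """Replace every all-zero column of T with a column of 1/n's."""
    n = len(T)
    fixed = [row[:] for row in T]  # copy so the input is left unchanged
    for j in range(n):
        if all(T[i][j] == 0 for i in range(n)):  # page j has no outlinks
            for i in range(n):
                fixed[i][j] = 1.0 / n
    return fixed


# Hypothetical 3-page web: page 1 links to 2 and 3, page 3 links to 1,
# and page 2 is a dangling node (its column is all zeros).
T = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
S = fix_dangling(T)
print(S)
```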
30
The Web’s nature is such that $T$ would not be regular
Brin & Page force the transition matrix to be regular by making sure every entry satisfies $0 < t_{ij} < 1$
Create perturbation matrix $E$ having all entries equal to $1/n$
Form “Google Matrix”: $G = \alpha T + (1 - \alpha) E$, with $0 < \alpha < 1$
31
Using $\alpha = 0.85$ for our 6-node sample web:
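The resulting matrix for the sample web is not in the transcript. The sketch below forms the standard combination $\alpha T + (1 - \alpha) E$, with every entry of $E$ equal to $1/n$, for a hypothetical 3-page web; the function name `google_matrix` and the symbol names are assumptions.

```python
def google_matrix(T, alpha=0.85):
    """Return alpha*T + (1 - alpha)*E, where every entry of E is 1/n."""
    n = len(T)
    return [[alpha * T[i][j] + (1.0 - alpha) / n for j in range(n)] for i in range(n)]


# Hypothetical column-stochastic matrix for a 3-page web
# (not the 6-node sample web from the slides).
T = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
G = google_matrix(T, alpha=0.85)

for row in G:
    print([round(x, 4) for x in row])
```

Every entry of the result is strictly between 0 and 1 and each column still sums to 1, so the matrix is both stochastic and regular.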
32
By calculating powers of the transition matrix, we can determine the stationary vector:
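The computed powers are missing from the transcript. As a sketch under the same assumptions as before (a hypothetical 3-page web and $\alpha = 0.85$), repeated squaring gives a high power of the matrix quickly; once the powers have converged, every column is approximately the stationary vector.

```python
def mat_mul(A, B):
    """Product AB: entry (i, j) is row i of A times column j of B."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical 3-page web, damped with alpha = 0.85 (not the slides' sample web).
alpha, n = 0.85, 3
T = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
G = [[alpha * T[i][j] + (1.0 - alpha) / n for j in range(n)] for i in range(n)]

# Seven squarings give G^(2^7) = G^128.
P = G
for _ in range(7):
    P = mat_mul(P, P)

s = [row[0] for row in P]  # any column of the converged power
print([round(x, 4) for x in s])
```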
33
Stationary vector for our 6-node sample web:
34
How does Google use this stationary vector?
Query requests term 1 or term 2
Inverted file storage is accessed
Term 1 → doc 3, doc 2, doc 6
Term 2 → doc 1, doc 3
Relevancy set is {1, 2, 3, 6}
$s_1 = .2066$, $s_2 = .1770$, $s_3 = .1773$, $s_6 = .1309$
Doc 1 deemed most important
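The lookup on this slide can be sketched with two dictionaries. The index contents and PageRank values are taken from the slide; the names `inverted_index`, `pagerank`, and `search_or` are illustrative and not part of any real search API.

```python
# Inverted file from the slide: each term maps to the documents containing it.
inverted_index = {
    "term 1": [3, 2, 6],
    "term 2": [1, 3],
}

# Stationary-vector entries quoted on the slide for the relevant documents.
pagerank = {1: 0.2066, 2: 0.1770, 3: 0.1773, 6: 0.1309}


def search_or(terms):
    """Return documents matching any term, ordered by decreasing PageRank."""
    relevant = set()
    for term in terms:
        relevant.update(inverted_index.get(term, []))
    return sorted(relevant, key=lambda doc: pagerank[doc], reverse=True)


results = search_or(["term 1", "term 2"])
print(results)  # [1, 3, 2, 6]
```

Doc 3 edges out doc 2 only in the fourth decimal place (.1773 versus .1770), which shows how fine-grained the ordering can be.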
35
Adding a perturbation matrix seems reasonable, based on the “random jump” idea: the user types in a URL
This is only the basic idea behind Google, which has many refinements we have ignored
PageRank as originally conceived and described here ignores the “Back” button
PageRank is currently undergoing development
Details of PageRank’s operations and the value of $\alpha$ are a trade secret.
36
Updates to Google matrix done periodically
Google matrix is HUGE
Sophisticated numerical methods are used
37
Thank you!