Published by Leopold Janssen. Modified over 5 years ago.
1
Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine
Rogier Brussee ICI
2
Context
Written in 1997, when Brin and Page were PhD students.
Indexes 24*10^6 pages (<< 10^100).
Describes their efforts to create a web search engine open to academia.
AltaVista, Lycos and Yahoo ruled; the Internet bubble was still growing.
3
Disclaimer
"Google does" = "Google did at the time of writing (1997)".
We can only guess what still applies.
The principles sound right and have probably survived.
Lots of room for tweaking: a dark art.
The data structures, described down to the bit level, have presumably changed.
Scaled up tremendously: index > 10^10 pages? (<< 10^100)
So did the hardware and OS.
The business model changed.
"Ads should not drive search results" is still stated policy.
4
What does Google do
Preprocess:
  Crawl
  Index: words, anchors and links in docs
  Invert index (i.e. sort)
  Value content (PageRank + “looks” weights)
At query time:
  Look up query
  Rank results (PageRank + IR measure)
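The index, invert and look-up steps above can be sketched in a few lines of Python. This is only a toy illustration: the function names, the whitespace tokenisation and the example documents are assumptions made for this sketch, not anything described in the paper.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Map each docID to the list of lowercase words it contains."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}

def invert(forward):
    """Turn docID -> words into word -> sorted list of docIDs."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}

def lookup(inverted, query):
    """Return the docIDs that contain every word of the query."""
    postings = [set(inverted.get(w, ())) for w in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical mini-corpus, for illustration only.
docs = {1: "web search engine", 2: "large scale web crawl", 3: "search the web"}
inverted = invert(build_forward_index(docs))
hits = lookup(inverted, "web search")  # documents 1 and 3 contain both words
```

Ranking the hits (PageRank + IR measure) is a separate step that this sketch leaves out.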
5
Google Architecture (in 1997)
URL server: hands out URLs to crawl.
Crawler: gets and parses content and caches DNS.
Store server: compresses and formats pages.
Repository: big database with compressed content.
Indexer: decompresses pages in the repository and creates
  a hit list index: docID → wordID + metadata "dressing" (capitals, typeface)
  anchors (URLs + texts)
  metadata info about docs (like title, headers, size, content type)
Barrels:
  Row 1: storage systems that store the docID → wordID index, divided up by wordID
  Row 2: storage systems that store the inverted index: wordID → docID
Lexicon: maps wordID → word (+ metadata) and vice versa; relatively small (>~ 200 MB).
Sorter: inverts the barrels.
Anchors: DB of anchor text from links found by the indexer.
Links: DB of what links to what.
URL-resolver: finds and parses URLs, creates docIDs.
Doc-index: DB of URL → docID and vice versa.
PageRank: computes the PageRank.
Searcher: searches the indices and ranks results.
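To make the roles of the Lexicon, Doc-index, Barrels and Sorter more concrete, here is a minimal Python sketch. Everything in it is a stand-in chosen for illustration: the `IdTable` class, the `barrel_width` parameter and the example URLs are hypothetical, and the real system stored compact binary hit lists rather than Python tuples.

```python
from collections import defaultdict

class IdTable:
    """Two-way string <-> integer ID mapping (stand-in for Lexicon or Doc-index)."""
    def __init__(self):
        self.to_id = {}
        self.to_key = []

    def get_id(self, key):
        if key not in self.to_id:
            self.to_id[key] = len(self.to_key)
            self.to_key.append(key)
        return self.to_id[key]

def index_documents(pages, lexicon, doc_index, barrel_width=2):
    """Indexer step: fill forward barrels with (docID, wordID, position) hits.

    Barrels are split by wordID range; barrel_width is a made-up parameter.
    """
    barrels = defaultdict(list)
    for url, text in pages.items():
        doc_id = doc_index.get_id(url)
        for pos, word in enumerate(text.lower().split()):
            word_id = lexicon.get_id(word)
            barrels[word_id // barrel_width].append((doc_id, word_id, pos))
    return barrels

def sort_barrel(hits):
    """Sorter step: reorder one barrel by wordID, giving wordID -> [(docID, position)]."""
    inverted = defaultdict(list)
    for doc_id, word_id, pos in sorted(hits, key=lambda h: (h[1], h[0], h[2])):
        inverted[word_id].append((doc_id, pos))
    return dict(inverted)

# Hypothetical pages, for illustration only.
pages = {"http://a.example/": "web search", "http://b.example/": "search engine"}
lexicon, doc_index = IdTable(), IdTable()
barrels = index_documents(pages, lexicon, doc_index)
inverted = {b: sort_barrel(h) for b, h in barrels.items()}
```

Partitioning by wordID range means the sort can run one barrel at a time, which is one plausible reading of "divided up by wordID".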
6
Ranking I
Google ranks words differently depending on
  capitalisation
  typeface (with respect to average)
  whether they occur in the title or an anchor or ..
For phrases, proximity of the words is also important.
This gives an IR score (the precise formula is not mentioned).
And then there is PageRank!
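Since the precise formula is not given, the following Python sketch only illustrates the kind of computation involved: each hit is weighted by its type, and words that occur close together earn a proximity bonus. The weight values, the hit-type names and the `1 / gap` bonus are all invented for this example and are not taken from the paper.

```python
# Hypothetical weights per hit type; the actual values are not published in the slides.
TYPE_WEIGHTS = {"title": 4.0, "anchor": 3.0, "capital": 1.5,
                "large_font": 1.5, "plain": 1.0}

def word_score(hits):
    """Sum the type weights of one word's hits; hits is a list of (position, hit_type)."""
    return sum(TYPE_WEIGHTS[hit_type] for _, hit_type in hits)

def proximity_bonus(positions_a, positions_b):
    """Reward two query words that occur close together: 1 / smallest gap."""
    gap = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 / max(gap, 1)

def ir_score(hits_by_word):
    """Per-word scores plus a proximity bonus for each pair of consecutive query words."""
    words = list(hits_by_word)
    score = sum(word_score(hits_by_word[w]) for w in words)
    for first, second in zip(words, words[1:]):
        pos_a = [p for p, _ in hits_by_word[first]]
        pos_b = [p for p, _ in hits_by_word[second]]
        score += proximity_bonus(pos_a, pos_b)
    return score

# One document's hits for the query "web search" (made-up data).
doc = {"web": [(0, "title"), (7, "plain")], "search": [(1, "title")]}
score = ir_score(doc)  # 4 + 1 + 4 for the hits, plus 1.0 for adjacency
```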
7
Ranking II
Together, the IR score and PageRank determine the rank you see when googling.
No single factor is dominant.
8
PageRank
Named after Lawrence Page.
Measure of the collectively defined importance of a web page.
Probabilistic model of a user surfing at random before Google gives a recommendation.
PageRank is the probability of finding the user at a page in the model after an infinite number of clicks.
Quantitative version of the effect of information scent.
Really pioneered by ants!
"Go to the ant, thou sluggard; consider her ways, and be wise" (Proverbs 6:6)
9
Ant model for PageRank
[Figure: an ant on a page with k outgoing links; edge labels d, (1-d) and (1-d)/n]
Rules:
  Chance d to follow one of the k links on the page
  Chance (1-d) to jump to a random page out of n pages
Put 1000 ants on every page.
Let the ants follow links according to the rules above.
Wait long enough and we get a stationary distribution.
PageRank of a page = number of ants on that page / total number of ants.
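The ant model can be simulated directly. The sketch below is a rough Monte Carlo illustration in Python. The three-page graph, the value d = 0.85, the number of rounds and the handling of pages without links (the ant always jumps) are assumptions made for the example.

```python
import random

def simulate_ants(links, d=0.85, ants_per_page=1000, rounds=50, seed=0):
    """Move ants over the link graph and return the fraction of ants on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    # Put the same number of ants on every page.
    ants = [page for page in pages for _ in range(ants_per_page)]
    for _ in range(rounds):
        for i, page in enumerate(ants):
            out = links[page]
            if out and rng.random() < d:
                ants[i] = rng.choice(out)    # chance d: follow one of the k links
            else:
                ants[i] = rng.choice(pages)  # chance (1-d): jump to a random page
    counts = {page: 0 for page in pages}
    for page in ants:
        counts[page] += 1
    return {page: count / len(ants) for page, count in counts.items()}

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = simulate_ants(links)
```

With a finite number of ants the result is only an estimate of the stationary distribution; more ants give a smaller sampling error.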
10
Mathematical Explanation
We have an initial ant distribution p = (p_1, ..., p_n) on n pages.
Normalise: sum_i p_i = 1, with p_i >= 0.
We have a Markov chain with transition probabilities
  t_ij = d/k_j + (1-d)/n   if one of the k_j links on page j points to page i
  t_ij = (1-d)/n           otherwise
This gives the transition matrix T = (t_ij), i,j = 1,...,n.
Note: t_ij > 0 and sum_i t_ij = 1.
After one "round" the ant distribution is Tp = (sum_j t_ij p_j)_{i = 1,...,n}.
Note: (Tp)_i > 0 and sum_i (Tp)_i = 1.
After m rounds the distribution is T^m p.
Define p^(0) = lim_{m -> infty} T^m p (the limit exists).
Then T p^(0) = p^(0): p^(0) is the stationary distribution of the Markov chain.
PageRank is the stationary distribution of the Markov chain.

Existence of the fixpoint (Perron-Frobenius theorem) is a direct consequence of Brouwer's fixpoint theorem: the simplex Delta = {x : sum_i x_i = 1, x_i >= 0} is mapped to itself by T, and Delta is topologically a closed (n-1)-dimensional disk.
Connectedness of the Markov graph implies uniqueness. It suffices to see that the fixpoint is isolated, because by linearity there would otherwise be a whole eigenspace of fixpoints. However, on the hyperplane sum_i x_i = 0, which is an invariant complement to the fixpoint p^(0), T is contracting in L_1: there the random-jump term cancels, so T shrinks L_1 norms by at least a factor d.
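The limit T^m p can be approximated by repeated multiplication (power iteration). The Python sketch below builds T from the definition of t_ij above and iterates until the distribution stops changing. It assumes every page has at least one outgoing link and no duplicate links, so that each column of T sums to 1. The example graph, d = 0.85 and the tolerance are illustrative choices.

```python
def transition_matrix(links, d=0.85):
    """Build T with t_ij = d/k_j + (1-d)/n if page j links to page i, else (1-d)/n."""
    pages = sorted(links)
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    T = [[(1 - d) / n] * n for _ in range(n)]
    for page, out in links.items():
        j = index[page]
        for target in out:
            T[index[target]][j] += d / len(out)  # len(out) is k_j
    return pages, T

def step(T, p):
    """One round: (Tp)_i = sum_j t_ij p_j."""
    return [sum(t_ij * p_j for t_ij, p_j in zip(row, p)) for row in T]

def pagerank(links, d=0.85, tol=1e-12, max_rounds=1000):
    """Iterate p -> Tp from the uniform distribution until the L_1 change is below tol."""
    pages, T = transition_matrix(links, d)
    p = [1.0 / len(pages)] * len(pages)
    for _ in range(max_rounds):
        nxt = step(T, p)
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return dict(zip(pages, p))

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
```

Because T contracts L_1 distances on the hyperplane sum_i x_i = 0 by a factor d, the error shrinks geometrically and the loop ends after a modest number of rounds.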