
Search Engines: Indexing and Page Ranking

The WWW
[Figure: the Web as a collection of websites (WebSite1 .. WebSite5), each containing pages (Page 1, Page 2, ...) connected by hyperlinks]

The Web Search Problem
A search engine receives a Query (a set of keywords or a phrase) and returns a Response (a list of documents/pages containing the keywords or phrase).
Important requirements:
– The response must be quick
– The documents must be relevant

Tasks of a Search Engine
– Discover documents around the WWW: web crawlers (spiders, bots, wanderers, etc.), based on graph-searching algorithms (BFS or DFS? — sketched below)
– Search for keywords in documents: for obvious performance reasons, this cannot be done by string searching after every query! Solution: indexing
– Filter/rank documents according to their relevance
These tasks lead to the web search engine architecture on the next slide.
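A loose BFS-style crawler sketch in Java (my own illustration, not from the slides; the in-memory link map stands in for real HTTP fetching and link extraction):

    import java.util.*;

    public class BfsCrawlerSketch {
        public static void main(String[] args) {
            // Stand-in for the real web: page URL -> outgoing links (hypothetical data)
            Map<String, List<String>> web = Map.of(
                "site1/page1", List.of("site1/page2", "site2/page1"),
                "site1/page2", List.of("site2/page1"),
                "site2/page1", List.of("site1/page1"));
            // BFS from a seed URL, visiting each discovered page exactly once
            Deque<String> frontier = new ArrayDeque<>();
            Set<String> discovered = new HashSet<>();
            frontier.add("site1/page1");
            discovered.add("site1/page1");
            while (!frontier.isEmpty()) {
                String page = frontier.poll();
                System.out.println("crawling " + page);   // here a real crawler would fetch, store and index the page
                for (String link : web.getOrDefault(page, List.of()))
                    if (discovered.add(link)) frontier.add(link);
            }
        }
    }

A DFS variant would only replace the queue with a stack; BFS is usually preferred because it discovers the "important", highly linked pages early.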

Web Search Engine Architecture
[Architecture diagram: components include the WebCrawler, the Page Repository, Text Analysis producing the Text Index, Link Analysis producing PageRank, and the Ranker that answers a Query]

Outline
– Data structures and algorithms for indexing the web
– The PageRank algorithm


Indexing the Web
– Once a crawl has collected pages, their text is compressed and stored in a repository
– Each URL is mapped to a unique ID
– A lexicon (sorted list of all words) is created
– A hit list ("inverted index") is created for every word in the lexicon
Terminology:
– Forward index: document -> list of contained words
– Inverted index: word -> list of containing documents

Simple Inverted Indexing Words -> PageIDs

Using Simple Inverted Indexes for Queries
Simple indexes help with searching for keywords or sets of keywords:
– Example: search "cat" => found in pages 1 and 3; search "cat" AND "dog" => found in page 3
Simple indexes cannot help with phrase queries:
– Example: search "cat sat" => found in pages 1 and 3, but actually only page 1 contains the phrase "cat sat"
– Solution: the index also stores the in-page location
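A minimal Java sketch of such a simple inverted index (my own illustration, not from the slides; the page IDs and texts are invented to match the example):

    import java.util.*;

    public class SimpleInvertedIndex {
        // word -> set of page IDs containing that word
        private final Map<String, Set<Integer>> index = new HashMap<>();

        // Add a page: every word of its text is mapped to the page ID
        public void addPage(int pageId, String text) {
            for (String word : text.toLowerCase().split("\\W+"))
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(pageId);
        }

        // AND query: intersect the posting sets of all keywords
        public Set<Integer> search(String... keywords) {
            Set<Integer> result = null;
            for (String k : keywords) {
                Set<Integer> pages = index.getOrDefault(k.toLowerCase(), Collections.emptySet());
                if (result == null) result = new TreeSet<>(pages);
                else result.retainAll(pages);
            }
            return result == null ? Collections.emptySet() : result;
        }

        public static void main(String[] args) {
            SimpleInvertedIndex idx = new SimpleInvertedIndex();
            idx.addPage(1, "the cat sat on the mat");
            idx.addPage(2, "the dog barked");
            idx.addPage(3, "the cat chased the dog");
            System.out.println(idx.search("cat"));          // [1, 3]
            System.out.println(idx.search("cat", "dog"));   // [3]
        }
    }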

Fully Inverted Indexing: Words -> PageIDs + in-page locations

Using Fully Inverted Indexes for Queries
Performing queries for phrases:
– Search "cat sat"
– "cat" found at 1-2 and 3-2; "sat" found at 1-3 and 3-7
– "cat" AND "sat": in page 1, at positions 1-2 and 1-3 => distance 1 between the words; in page 3, at positions 3-2 and 3-7 => distance 5 between the words
– Using the distance between the words, only page 1 matches the search phrase
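A hedged Java sketch (again my own, not from the slides) extending the index with in-page positions so that phrase queries can check word adjacency; the sample pages are invented to mimic the example positions:

    import java.util.*;

    public class PositionalIndex {
        // word -> (page ID -> positions of the word in that page)
        private final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

        public void addPage(int pageId, String text) {
            String[] words = text.toLowerCase().split("\\W+");
            for (int pos = 0; pos < words.length; pos++)
                index.computeIfAbsent(words[pos], w -> new HashMap<>())
                     .computeIfAbsent(pageId, p -> new ArrayList<>())
                     .add(pos + 1);                      // 1-based position, as in the example
        }

        // Phrase query for two consecutive words: the second word must occur
        // exactly one position after the first, within the same page
        public Set<Integer> phrase(String w1, String w2) {
            Set<Integer> result = new TreeSet<>();
            Map<Integer, List<Integer>> p1 = index.getOrDefault(w1.toLowerCase(), Collections.emptyMap());
            Map<Integer, List<Integer>> p2 = index.getOrDefault(w2.toLowerCase(), Collections.emptyMap());
            for (int page : p1.keySet()) {
                if (!p2.containsKey(page)) continue;
                for (int pos : p1.get(page))
                    if (p2.get(page).contains(pos + 1)) result.add(page);
            }
            return result;
        }

        public static void main(String[] args) {
            PositionalIndex idx = new PositionalIndex();
            idx.addPage(1, "the cat sat on the mat");
            idx.addPage(3, "a cat and a dog never sat together");
            System.out.println(idx.phrase("cat", "sat"));   // [1]
        }
    }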

Using Metainformation
If the searched word is part of a title, the document is probably more relevant for the query.

Indexing the Web
– Once a crawl has collected pages, their text is compressed and stored in a repository
– Each URL (document) is mapped to a unique ID
– A lexicon (sorted list of all words) is created
– A hit list ("inverted index") is created for every word in the lexicon: the occurrences of the word in a particular document, including position, font, capitalization, and metainformation (part of titles)

Google's Indexing – Step 1
Each document is parsed and transformed into a collection of "hit lists" that are put into "barrels", sorted by docID.
Hit:
– Hit type: plain or fancy
– Fancy hit: occurs in a URL, title, anchor text, or metatag

Google's Forward Barrels
[Figure: each barrel (barrel i, barrel i+1, ...) holds records grouped by docid; for every docid there is a list of entries of the form (wordid, #hits, hit, hit, ...), i.e. the hits of that word in that document]

Google's Indexing – Step 2
Each barrel is then sorted by wordID to create the inverted index. This sorting also creates the lexicon file.
– Lexicon: for each wordID, the number of documents containing it and a pointer to its posting list (see the figure below)
– The lexicon is mostly cached in memory

Google's Inverted Index
[Figure: the in-memory Lexicon holds (wordid, #docs) entries, sorted by wordid, each pointing into the on-disk Postings ("inverted barrels"); for every word, the posting list holds (docid, #hits, hit, hit, ...) records, spread over barrels i, i+1, ...]
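A rough Java sketch (my own illustration, not Google's actual code) of step 2: grouping forward-barrel records by wordID to obtain, for every word, its posting list of (docID, hit list) pairs:

    import java.util.*;

    public class InvertBarrel {
        // One forward-barrel record: a word occurring in a document, with its hit list
        record ForwardRecord(int docId, int wordId, List<Integer> hits) {}

        // Invert: group the forward records by wordId; the result maps each wordId
        // to its posting list (docId -> hits), i.e. the inverted barrel
        static Map<Integer, Map<Integer, List<Integer>>> invert(List<ForwardRecord> forward) {
            Map<Integer, Map<Integer, List<Integer>>> inverted = new TreeMap<>();  // sorted by wordId
            for (ForwardRecord r : forward)
                inverted.computeIfAbsent(r.wordId(), w -> new TreeMap<>()).put(r.docId(), r.hits());
            return inverted;
        }

        public static void main(String[] args) {
            List<ForwardRecord> forward = List.of(
                new ForwardRecord(1, 42, List.of(2, 9)),   // word 42 occurs in doc 1 at positions 2 and 9
                new ForwardRecord(3, 42, List.of(5)),
                new ForwardRecord(1, 7, List.of(1)));
            System.out.println(invert(forward));           // {7={1=[1]}, 42={1=[2, 9], 3=[5]}}
        }
    }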

Outline
– Data structures and algorithms for indexing the web
– The PageRank algorithm

Motivation
Efficient matching: indexing helps find pages that contain the search phrase, giving priority to pages that contain it in titles or other privileged positions. Still, there can be a huge number of such matches!
Also needed for an effective search: a measure of the importance of the pages that matched the search criteria.
Problem: assessing the importance of web pages without human evaluation of their content.
– First solution: the PageRank algorithm

PageRank History
– Proposed by two PhD students at Stanford, Sergey Brin and Lawrence Page, in 1998
– "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
– "The PageRank Citation Ranking: Bringing Order to the Web", http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
– The ranking algorithm of the first generation of the Google search engine

PageRank Principles
– Measure the importance of a web page based on the link structure alone
– The importance of a page is given by the number of pages linking to it (the number of "votes" received) as well as by their importance (the importance of the voters)
– If a page contains links to l pages, its contribution to the importance of each of them is a fraction 1/l of its own importance (it "splits" its votes)

PageRank Principles – Example
[Figure: P1 has importance 100 and outdegree 2, so it passes 100/2 = 50 to each page it links to; P2 has importance 9 and outdegree 3, so it passes 3 to each; P3, linked from both, receives 50 + 3 = 53, while P4, linked only from P2, receives 3]

Issues with Computing PageRank
The simplified PageRank principles presented above cannot be applied directly:
– Pages without inlinks: what should their PR value be? (It cannot be zero, otherwise nothing gets propagated)
– Cycles in the page graph: we cannot go around a cycle forever, always increasing the scores
The solution can be formulated from either of two viewpoints on PageRank:
– The algebraic point of view
– The probabilistic point of view

PageRank – The Probabilistic Point of View
The Random Surfer Model
Since the importance of a web page P is measured by its popularity (how many incoming links it has), we can view it as the probability that a random surfer, starting to browse the web at an arbitrary page and following hyperlinks, arrives at page P.
If the random surfer is at a page with k outlinks, he moves next to each of the k linked pages with probability 1/k.

The Random Surfer Model – Initial Data
– The page graph contains N pages Pi, i = 1..N
– Bi denotes the set of all pages Pj that have links to Pi
– lj denotes the outdegree of page Pj (the number of its outgoing links)
– Initially, each page Pi has probability 1/N of being chosen as the start page. This is the initial probability (at moment 0) of the page being reached: PR(i, 0) = 1/N

The Random Surfer Model – Updating Probabilities
– At moment t, each page Pi has a probability PR(i, t) of being the surfer's current page
– At the next moment t', the probability PR(i, t') of page Pi is the sum of the probabilities of its incoming pages, each weighted by its outdegree:
PR(i, t') = Σ_{Pj ∈ Bi} PR(j, t) / lj

The Random Surfer Model – Updating Probabilities
[Figure: a page Pj with probability PR(j, t) and outdegree lj passes a contribution PR(j, t)/lj along its link to Pi; these contributions add up to PR(i, t')]

The Random Surfer Model – Convergence
– The values PR(i, t) converge, as t → ∞, to PR(i)
– The fact that PR converges to a unique probability vector (the stationary distribution) can be proved mathematically (see: stochastic matrices, eigenvectors, and the power method for finding an eigenvector)

Example
N = 4; l1 = 3, l2 = 2, l3 = 1, l4 = 2
[Figure: P1 links to P2, P3, P4 (weight 1/3 each); P2 links to P3, P4 (weight 1/2 each); P3 links to P1 (weight 1); P4 links to P1, P3 (weight 1/2 each)]
Initially (t=0):
– PR(1,0) = 1/4
– PR(2,0) = 1/4
– PR(3,0) = 1/4
– PR(4,0) = 1/4

Example (cont.) – t=1
PR(1,1) = 1*PR(3,0) + 1/2*PR(4,0) = 1*0.25 + 1/2*0.25 = 0.37
PR(2,1) = 1/3*PR(1,0) = 1/3*0.25 = 0.08
PR(3,1) = 1/3*PR(1,0) + 1/2*PR(2,0) + 1/2*PR(4,0) = 1/3*0.25 + 1/2*0.25 + 1/2*0.25 = 0.33
PR(4,1) = 1/3*PR(1,0) + 1/2*PR(2,0) = 1/3*0.25 + 1/2*0.25 = 0.20
[Figure: the same graph, annotated with PR(1,0) = PR(2,0) = PR(3,0) = PR(4,0) = 0.25]

Example (cont.) – t=2
PR(1,2) = 1*PR(3,1) + 1/2*PR(4,1) = 1*0.33 + 1/2*0.20 = 0.43
PR(2,2) = 1/3*PR(1,1) = 1/3*0.37 = 0.12
PR(3,2) = 1/3*PR(1,1) + 1/2*PR(2,1) + 1/2*PR(4,1) = 1/3*0.37 + 1/2*0.08 + 1/2*0.20 = 0.27
PR(4,2) = 1/3*PR(1,1) + 1/2*PR(2,1) = 1/3*0.37 + 1/2*0.08 = 0.16
[Figure: the same graph, annotated with PR(1,1) = 0.37, PR(2,1) = 0.08, PR(3,1) = 0.33, PR(4,1) = 0.20]

Example (cont.) – t=3
PR(1,3) = 1*PR(3,2) + 1/2*PR(4,2) = 1*0.27 + 1/2*0.16 = 0.35
PR(2,3) = 1/3*PR(1,2) = 1/3*0.43 = 0.14
PR(3,3) = 1/3*PR(1,2) + 1/2*PR(2,2) + 1/2*PR(4,2) = 1/3*0.43 + 1/2*0.12 + 1/2*0.16 = 0.29
PR(4,3) = 1/3*PR(1,2) + 1/2*PR(2,2) = 1/3*0.43 + 1/2*0.12 = 0.20
[Figure: the same graph, annotated with PR(1,2) = 0.43, PR(2,2) = 0.12, PR(3,2) = 0.27, PR(4,2) = 0.16]

Example (cont.)
The values of PR calculated so far:
t=0: [0.25, 0.25, 0.25, 0.25]
t=1: [0.37, 0.08, 0.33, 0.20]
t=2: [0.43, 0.12, 0.27, 0.16]
t=3: [0.35, 0.14, 0.29, 0.20]
Continuing the iterations, we get:
t=4: [0.39, 0.11, 0.29, 0.19]
t=5: [0.39, 0.13, 0.28, 0.19]
t=6: [0.38, 0.13, 0.29, 0.19]
t=7: [0.38, 0.12, 0.29, 0.19]
t=8: [0.38, 0.12, 0.29, 0.19]
The values converge: PR(1) = 0.38, PR(2) = 0.12, PR(3) = 0.29, PR(4) = 0.19
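The same values can also be approximated empirically by simulating the random surfer. A small Java sketch (my own illustration, not part of the slides) of a million-step walk on the example graph:

    import java.util.*;

    public class RandomSurferSimulation {
        public static void main(String[] args) {
            // Adjacency lists of the 4-page example: index 0..3 stands for P1..P4
            int[][] links = {
                {1, 2, 3},   // P1 -> P2, P3, P4
                {2, 3},      // P2 -> P3, P4
                {0},         // P3 -> P1
                {0, 2}       // P4 -> P1, P3
            };
            long steps = 1_000_000;
            long[] visits = new long[4];
            Random rnd = new Random(42);
            int current = rnd.nextInt(4);                  // start at a random page
            for (long s = 0; s < steps; s++) {
                visits[current]++;
                int[] out = links[current];
                current = out[rnd.nextInt(out.length)];    // follow a random outlink
            }
            for (int i = 0; i < 4; i++)
                System.out.printf("PR(%d) ≈ %.2f%n", i + 1, (double) visits[i] / steps);
            // The printed visit frequencies are close to 0.38, 0.12, 0.29, 0.19
        }
    }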

Dangling Nodes and Disconnected Components
Problems with the initial random surfer model:
– If the random surfer arrives at a page Pj that has no outlinks (a dangling node), he has nowhere to go. The accumulated importance of Pj "gets lost", since it is not transferred further to any other page
– If the web consists of several connected components, the random surfer will never reach pages that are in a different connected component from the one where he started

Example – The Dangling Node Problem
N = 3; l1 = 2, l2 = 2, l3 = 0 (P3 is a dangling node)
[Figure: P1 links to P2 and P3; P2 links to P1 and P3; P3 has no outlinks; each link has weight 1/2]
Initially (t=0):
– PR(1,0) = 1/3
– PR(2,0) = 1/3
– PR(3,0) = 1/3
Update rules:
– PR(1,t') = 1/2*PR(2,t)
– PR(2,t') = 1/2*PR(1,t)
– PR(3,t') = 1/2*PR(1,t) + 1/2*PR(2,t)

Example – The Dangling Node Problem (cont.)
Applying the update rules, we get:
t=0: [1/3, 1/3, 1/3]
t=1: [1/6, 1/6, 1/3]
t=2: [1/12, 1/12, 1/6]
t=3: [1/24, 1/24, 1/12]
...
Result: PR(1) = PR(2) = PR(3) = 0!
This result has no meaning as a ranking -> a solution must be found for dangling nodes

Solution for Dangling Nodes and Disconnected Components
The PageRank random surfer model is updated as follows:
– Most of the time (a fraction d of the moves), the surfer follows links from the current page, as in the model before. If the page has no outlinks, he continues with a random page (a page with no outlinks is considered to have N outlinks, one to every page)
– A smaller but positive fraction of the time (the remaining 1-d), the surfer abandons the current page, chooses an arbitrary page from the web, and "teleports" there

Computing PageRank
The probability of reaching a page Pi:
PR(i, t') = d * ( Σ_{Pj ∈ Bi} PR(j, t)/lj + Σ_{Pj: lj = 0} PR(j, t)/N ) + (1-d)/N
– first sum: the probability of arriving from a page Pj that has a link to Pi
– second sum: the probability of arriving from a page Pj that has no outlinks
– last term: the probability of arriving through teleporting at a random time
– d = the damping factor (a heuristic value)

The Damping Factor
– The damping factor d can take values in [0, 1]
– If d = 0: all the surfer's moves are random jumps (teleports); no links are followed
– If d = 1: the surfer never teleports and only follows links, except at dangling nodes
– The value of d also influences how fast the vector converges to the stationary distribution (the number of iterations needed)
– Usual value (proposed by Brin and Page): d = 0.85; convergence is then reached in fewer than 100 iterations

public Map<Vertex, Double> computePageRank(Digraph g) {
    double d = 0.85;
    int iterations = 100;
    int N = g.getNumberOfNodes();
    List<Vertex> nodes = g.getAllNodes();
    List<Vertex> nodesWithoutOutlinks = g.getNodesWithoutOutlinks();
    Map<Vertex, Double> opr = new HashMap<>();    // old pageranks
    Map<Vertex, Double> npr = new HashMap<>();    // new pageranks
    for (Vertex n : nodes) npr.put(n, 1.0 / N);   // init pageranks with 1/N
    for (Vertex n : nodes) opr.put(n, 1.0 / N);
    while (iterations > 0) {
        double dp = 0;                            // rank mass held by dangling nodes, spread over all pages
        for (Vertex p : nodesWithoutOutlinks)
            dp = dp + opr.get(p) / N;
        for (Vertex p : nodes) {
            double nprp = d * dp + (1 - d) / N;   // dangling contribution + teleport contribution
            for (Vertex ip : g.inboundNeighbors(p))
                nprp = nprp + d * opr.get(ip) / g.outDegree(ip);   // contributions of the inlinks
            npr.put(p, nprp);
        }
        Map<Vertex, Double> temp = opr; opr = npr; npr = temp;     // swap old and new pageranks
        iterations = iterations - 1;
    }
    return opr;                                   // after the final swap, opr holds the newest values
}

PageRank – The Algebraic Point of View
Initial data:
– The page graph contains N pages Pi, i = 1..N
– Bi denotes the set of all pages Pj that have links to Pi
– lj denotes the outdegree of page Pj (the number of its outgoing links)
The hyperlink matrix A: a square matrix whose rows and columns correspond to web pages, where A[i,j] = 1/lj if there is a link from j to i, and A[i,j] = 0 otherwise.

Example – The Hyperlink Matrix
For the 4-page example graph:
A = [ 0     0     1    1/2 ]
    [ 1/3   0     0    0   ]
    [ 1/3   1/2   0    1/2 ]
    [ 1/3   1/2   0    0   ]
(column j holds the contributions of page Pj: e.g. column 1 distributes P1's rank equally among P2, P3, and P4)

Properties of the Hyperlink Matrix
– All entries are nonnegative
– The sum of the entries in column j is 1 if j has outgoing links
– All elements of column j are 0 if j has no outgoing links (j is a dangling node)
If the web has no dangling nodes, the hyperlink matrix is (column) stochastic.

Stochastic Matrices
A column-stochastic matrix (probability matrix, Markov matrix) is a square matrix of nonnegative real numbers, with each column summing to 1.

Stochastic Matrices
The Perron-Frobenius theorem: every positive column-stochastic matrix A has a unique stationary column vector X (an eigenvector with eigenvalue 1): A*X = X
The power method convergence theorem: let A be a positive column-stochastic matrix of size n*n and X its stationary column vector. Then X can be calculated by the following procedure: initialize a column vector Z with all entries equal to 1/n; the sequence Z, A*Z, A^2*Z, ..., A^k*Z converges to the vector X.

The Google Matrix
– A = the transition (hyperlink) matrix
– S = the matrix obtained from A by replacing every all-zero column (a dangling node) with a column whose entries are all 1/N
– G = the Google matrix: G[i,j] = d*S[i,j] + (1-d)/N
– Property: the Google matrix is a stochastic matrix
– The stationary vector of G contains the PageRank values
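A compact Java sketch of this algebraic route (my own illustration; the matrix A below is the hyperlink matrix of the earlier 4-page example): build S from A, form the Google matrix G, and apply the power method Z, G*Z, G^2*Z, ...:

    import java.util.Arrays;

    public class GoogleMatrixPageRank {
        public static void main(String[] args) {
            double d = 0.85;
            // Hyperlink matrix A of the 4-page example (no dangling nodes here)
            double[][] A = {
                {0,     0,   1, 0.5},
                {1.0/3, 0,   0, 0  },
                {1.0/3, 0.5, 0, 0.5},
                {1.0/3, 0.5, 0, 0  }
            };
            int n = A.length;
            // S: replace all-zero columns (dangling nodes) by columns of 1/N
            double[][] S = new double[n][n];
            for (int j = 0; j < n; j++) {
                double colSum = 0;
                for (int i = 0; i < n; i++) colSum += A[i][j];
                for (int i = 0; i < n; i++) S[i][j] = (colSum == 0) ? 1.0 / n : A[i][j];
            }
            // Google matrix: G[i][j] = d*S[i][j] + (1-d)/N
            double[][] G = new double[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    G[i][j] = d * S[i][j] + (1 - d) / n;
            // Power method: Z, G*Z, G^2*Z, ... converges to the stationary vector
            double[] Z = new double[n];
            Arrays.fill(Z, 1.0 / n);
            for (int k = 0; k < 100; k++) {
                double[] next = new double[n];
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++)
                        next[i] += G[i][j] * Z[j];
                Z = next;
            }
            // PageRank values of P1..P4 (slightly different from the earlier iteration, since here d = 0.85)
            System.out.println(Arrays.toString(Z));
        }
    }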

PageRank and the History of Search Engines
PageRank (1998) was the first algorithm to introduce the concept of the "importance of a web page" and to compute it without relying on external information – a crucial factor in Google's ascension.
Drawbacks:
– PageRank can be manipulated
– SEO ("Search Engine Optimisation")

PageRank and the Future of Search Engines
2011: Google Panda
– introduces filters that prevent low-quality sites and/or pages from ranking well in the search results
– identifies low-quality content using human feedback and machine learning algorithms
2012: Google Penguin
– decreases the ranking of sites identified as using "black-hat SEO techniques"
2013: Google Hummingbird
– judges the context of a query – and thereby the intent of the person carrying out the search – to determine what they are trying to find out

Other Uses of PageRank
– Ranking scientific articles according to their citations
– Ranking streets for predicting human movement and street congestion
– Automatic summarization: extracting the most relevant sentences from a text

Tool Project #3 (Optional): Automatic Summarization Tool, based on PageRank
– The text is represented as a graph of sentences
– Edges are given by the "similarity" of two sentences (what can be used as a form of "recommendation" or "vote" between sentences?)
– Apply PageRank (or a modified version able to cope with undirected, possibly weighted graphs) and take the top x% of sentences to form the abstract (a rough sketch follows below)
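A very rough Java sketch of one possible approach (entirely my own illustration, not part of the project specification): sentences are scored with a weighted, undirected PageRank variant where edge weights are word-overlap similarities:

    import java.util.*;

    public class TextRankSketch {
        // One simple choice of "vote": word overlap, normalized by the sentence lengths
        static double similarity(String a, String b) {
            Set<String> wa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
            Set<String> wb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
            int lenA = wa.size(), lenB = wb.size();
            wa.retainAll(wb);
            return (double) wa.size() / (lenA + lenB);
        }

        // Weighted PageRank on the undirected sentence graph
        static double[] rank(String[] sentences) {
            int n = sentences.length;
            double d = 0.85;
            double[][] w = new double[n][n];
            double[] wSum = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (i != j) { w[i][j] = similarity(sentences[i], sentences[j]); wSum[j] += w[i][j]; }
            double[] score = new double[n];
            Arrays.fill(score, 1.0 / n);
            for (int iter = 0; iter < 50; iter++) {
                double[] next = new double[n];
                for (int i = 0; i < n; i++) {
                    next[i] = (1 - d) / n;
                    for (int j = 0; j < n; j++)
                        if (i != j && wSum[j] > 0)
                            next[i] += d * score[j] * w[i][j] / wSum[j];   // sentence j "votes" for i proportionally
                }
                score = next;
            }
            return score;   // take the top x% of sentences by score to form the abstract
        }

        public static void main(String[] args) {
            String[] text = {
                "PageRank measures the importance of web pages.",
                "The importance of a page depends on the pages linking to it.",
                "Cats sleep most of the day."
            };
            System.out.println(Arrays.toString(rank(text)));
        }
    }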

Bibliography
– John MacCormick: Nine Algorithms That Changed the Future, chapters 2 & 3
– Page, L., Brin, S., Motwani, R., Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
– David Austin, How Google Finds Your Needle in the Web's Haystack, AMS Feature Column