Scaling Personalized Web Search. Glen Jeh, Jennifer Widom, Stanford University. Presented by Li-Tal Mashiach, Search Engine Technology course (236620), Technion.

Today’s topics: Overview; Motivation; Personalized PageRank Vector (PPV); Efficient calculation of PPV; Experimental results; Discussion.

PageRank Overview. A method for ranking web pages based on the link structure of the web. Important pages are those linked to by many important pages. The original PageRank has no initial preference for any particular pages.

PageRank Overview. The ranking is based on the probability that a random surfer will visit a certain page at a given time. E(p), the distribution over pages that the surfer jumps to, can be uniform or biased.

Motivation. We would like to give higher importance to user-selected pages. A user may have a set P of preferred pages. Instead of jumping to any random page with probability c, the jump is restricted to pages in P. That way, we increase the probability that the random surfer stays in the neighborhood of the pages in P. Taking P into account creates a personalized view of the importance of pages on the web.

Personalized PageRank Vector (PPV). Restrict preference sets P to subsets of a set of hub pages H, a set of pages with high PageRank. A PPV is a vector of length n, where n is the number of pages on the web. PPV[p] is the importance of page p.

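PPV Equation. The PPV v is the solution of $v = (1 - c)Av + cu$, where: u is the preference vector, with |u| = 1 and u(p) the amount of preference for page p; A is the n x n matrix derived from the link structure of the web graph; c is the probability that the random surfer jumps to a page in P.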
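
A minimal sketch of solving this equation by simple iteration, i.e. repeatedly applying v <- (1 - c) A v + c u. The function name, the three-page graph, and the choice c = 0.15 are illustrative assumptions, not taken from the slides. It also assumes that A spreads each page's score evenly over its out-links and that every page has at least one out-link.

```python
def personalized_pagerank(out_links, u, c=0.15, iterations=100):
    """Iterate v <- (1 - c) * A v + c * u on a small graph.

    out_links maps every page to its list of out-neighbors (assumed
    non-empty, and every neighbor is itself a key of out_links).
    u maps pages to preference weights that sum to 1.
    """
    pages = list(out_links)
    v = {p: u.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        # Teleport part: c * u.
        nxt = {p: c * u.get(p, 0.0) for p in pages}
        # Link-following part: (1 - c) * A v.
        for j in pages:
            share = (1.0 - c) * v[j] / len(out_links[j])
            for i in out_links[j]:
                nxt[i] += share
        v = nxt
    return v


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ppv = personalized_pagerank(graph, {"a": 1.0})
```

With all preference on page a, the resulting scores favor a and the pages reachable from it in few steps.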

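PPV: Problem. It is not practical to compute PPVs at query time. It is also not practical to compute and store them all offline, because the number of possible preference sets is exponential (every subset of the allowed pages is one). How can a PPV be calculated, and how can it be done efficiently?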

Main steps to the solution. Break down preference vectors into common components. Divide the computation between offline (lots of time) and online (focused computation). Eliminate redundant computation.

Linearity Theorem. The solution to a linear combination of preference vectors is the same linear combination of the corresponding PPVs. Let x_i be a unit vector (1 at page i, 0 elsewhere). Let r_i be the PPV corresponding to x_i, called a hub vector.
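
A short worked derivation of why linearity holds, assuming the PPV equation has the form v = (1 - c)Av + cu as in the Jeh and Widom paper. The weights $\alpha_i$ are notation introduced here for illustration (non-negative, summing to 1).

```latex
\begin{align*}
r_i &= (1-c)\,A\,r_i + c\,x_i
  && \text{(each hub vector solves the PPV equation for } x_i\text{)} \\
\sum_i \alpha_i r_i &= (1-c)\,A \Bigl(\sum_i \alpha_i r_i\Bigr) + c \sum_i \alpha_i x_i
  && \text{(multiply by } \alpha_i \text{ and sum)} \\
u = \sum_i \alpha_i x_i \;&\Longrightarrow\; v = \sum_i \alpha_i r_i
\end{align*}
```

So any PPV whose preference set lies inside H can be assembled from the |H| hub vectors alone.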

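Example (figure). The personal preferences of David select the unit vectors x1, x2, and x12, so his PPV is built from the hub vectors r1, r2, and r12 out of the stored hub vectors r1 … rk.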

Good, but not enough… If the hub vector r_i for each page in H can be computed ahead of time and stored, then computing a PPV is easier. The number of precomputed vectors decreases from one per possible preference set to |H|. But each hub vector computation requires multiple scans of the web graph, and time and space grow linearly with |H|. The solution so far is impractical.

Decomposition of Hub Vectors. To compute and store the hub vectors efficiently, we can further break them down into two parts: partial vectors, the unique component of each hub vector, and the hubs skeleton, which encodes the interrelationships among hub vectors. The full hub vector is constructed from these at query time. This saves computation time and storage because components are shared among hub vectors.

Inverse P-distance. The hub vector r_p can be represented as an inverse P-distance vector. In its definition, l(t) is the number of edges in path t and P(t) is the probability of traveling on path t.
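
The formula itself did not survive in the transcript. As a sketch following the definition in the Jeh and Widom paper, with the path notation $t = \langle w_1, \ldots, w_k \rangle$ and the out-neighbor set $O(w)$ taken from there rather than from the slide, it can be written as:

```latex
r_p(q) = \sum_{t:\, p \rightsquigarrow q} P(t)\, c\,(1-c)^{l(t)},
\qquad
P(t) = \prod_{i=1}^{k-1} \frac{1}{|O(w_i)|}
\quad \text{for } t = \langle w_1, \ldots, w_k \rangle,\; l(t) = k - 1
```

The sum runs over all paths from p to q, cycles included. Longer paths are discounted by $(1-c)^{l(t)}$, so pages that are "close" to p through many short, likely paths receive a high score.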

Partial Vectors. Break r_p into two components, $r_p = (r_p - r_p^H) + r_p^H$. The partial vector $r_p - r_p^H$ is computed without using any intermediate nodes from H. The rest, $r_p^H$, covers the paths that go through some page in H. For well-chosen sets H, the partial vector entry $(r_p - r_p^H)(q)$ is zero for many pages p, q.

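Good, but not enough… Precompute and store the partial vector $r_p - r_p^H$, where $r_p^H$ is the part of r_p due to paths through H. The partial vector is cheaper to compute and store than r_p, and its size decreases as |H| increases. Add $r_p^H$ at query time to compute the full hub vector. But computing and storing $r_p^H$ could be as expensive as r_p itself.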

Hubs Skeleton. Breaking down $r_p^H$, the part of r_p due to paths that go through some page in H: the hubs skeleton is the set of distances among hubs, giving the interrelationships among partial vectors. For each p, the skeleton entry r_p(H) has size at most |H|, much smaller than the full hub vector. The formula on the slide expresses $r_p^H$ in terms of the partial vectors and the hubs skeleton, with correction terms that handle the case where p or q is itself in H.
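
The formula is missing from the transcript. The following is a reconstruction of the hubs theorem as stated in the Jeh and Widom paper; treat the exact form as an assumption to verify against the paper. Here $x_p$ is the unit vector for page p and $r_p(h)$ is entry h of the hub vector r_p.

```latex
r_p^H = \frac{1}{c} \sum_{h \in H}
  \underbrace{\bigl(r_p(h) - c\,x_p(h)\bigr)}_{\text{hubs skeleton, corrected for } p = h}
  \;
  \underbrace{\bigl(r_h - r_h^H - c\,x_h\bigr)}_{\text{partial vector of } h \text{, corrected for } q = h}
```

Only the skeleton entries r_p(h) and the partial vectors of the hubs appear on the right-hand side, which is why these two pieces are enough to rebuild a full hub vector at query time.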

Example (figure). A small graph with pages a, b, c, d and the hub set H.

Putting it all together. Given a chosen preference set P: 1. Form a preference vector u. 2. Calculate the hub vector for each page $i_k$ in P. 3. Combine the hub vectors. The partial vectors are precomputed; the computation of the hubs skeleton may be deferred to query time.
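
Step 3 can be sketched as a weighted sum of sparse vectors, which is what the linearity theorem permits. The function name, the dictionary representation, and the numbers below are made up for illustration; in the real system the hub vectors would come from the partial vectors and the hubs skeleton.

```python
def combine_hub_vectors(hub_vectors, weights):
    """Return the PPV sum_i weights[i] * hub_vectors[i].

    hub_vectors maps a hub page to its sparse hub vector (page -> score).
    weights maps each preferred hub page to its weight; weights sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("preference weights must sum to 1")
    ppv = {}
    for hub, alpha in weights.items():
        for page, score in hub_vectors[hub].items():
            ppv[page] = ppv.get(page, 0.0) + alpha * score
    return ppv


# Hypothetical hub vectors for two hub pages h1 and h2.
toy_hub_vectors = {
    "h1": {"h1": 0.5, "x": 0.5},
    "h2": {"h2": 0.6, "x": 0.4},
}
toy_ppv = combine_hub_vectors(toy_hub_vectors, {"h1": 0.25, "h2": 0.75})
```

Because each hub vector sums to 1 and the weights sum to 1, the combined vector also sums to 1.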

Algorithms. Decomposition theorem. Basic dynamic programming algorithm. Partial vectors: selective expansion algorithm. Hubs skeleton: repeated squaring algorithm.

Decomposition theorem. The basis vector r_p is the average of the basis vectors of its out-neighbors, plus a compensation factor. The theorem defines relationships among basis vectors: having computed the basis vectors of p’s out-neighbors to a certain precision, we can use the theorem to compute r_p to greater precision.
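
In symbols, following the statement in the Jeh and Widom paper, with $O_i(p)$ denoting the i-th out-neighbor of p and $x_p$ the unit vector for p (notation assumed from the paper, since the slide's formula is missing from the transcript):

```latex
r_p = \frac{1-c}{|O(p)|} \sum_{i=1}^{|O(p)|} r_{O_i(p)} + c\,x_p
```

The average is scaled by $1 - c$, and the compensation factor $c\,x_p$ accounts for the zero-length path from p to itself.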

Basic dynamic programming algorithm. Using the decomposition theorem, we can build a dynamic programming algorithm that iteratively improves the precision of the calculation. On iteration k, only paths of length ≤ k-1 are considered. The error is reduced by a factor of 1-c on each iteration.
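
A minimal sketch of this iteration on a toy graph, assuming the decomposition theorem has the form r_p = (1 - c) * (average of the out-neighbors' vectors) + c * x_p. The function name, the sparse dictionary representation, and the three-page graph are illustrative assumptions; every page is assumed to have at least one out-link.

```python
def basis_vectors_dp(out_links, c=0.15, iterations=50):
    """Approximate the basis vector r_p of every page p.

    After k iterations, D[p] accounts for paths of length <= k - 1, and
    the mass still missing from each vector is (1 - c) ** k.
    """
    pages = list(out_links)
    D = {p: {} for p in pages}  # D_0[p] is the zero vector
    for _ in range(iterations):
        new_D = {}
        for p in pages:
            vec = {p: c}  # compensation factor c * x_p
            weight = (1.0 - c) / len(out_links[p])
            for neighbor in out_links[p]:
                for q, score in D[neighbor].items():
                    vec[q] = vec.get(q, 0.0) + weight * score
            new_D[p] = vec
        D = new_D
    return D


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
basis = basis_vectors_dp(graph, iterations=200)
```

Here basis["a"] approximates the PPV of a user whose only preferred page is a. The sketch recomputes every vector on every pass, which is only workable for tiny graphs.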

Computing partial vectors. Selective expansion algorithm: tours passing through a hub page in H are never considered. The expansion from p stops when it reaches a page from H.
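
A rough sketch of the selective expansion idea for a single start page, loosely following the description in the Jeh and Widom paper. D accumulates the partial vector, E holds probability mass that has not been expanded yet, and mass that arrives at a hub page is parked rather than expanded. The function name, the data layout, and the toy graph are assumptions made for illustration.

```python
def partial_vector(out_links, hubs, p, c=0.15, iterations=50):
    """Approximate the partial vector of page p with respect to hub set hubs.

    Returns (D, E): D is the partial vector estimate and E is the
    unexpanded mass. sum(D) + sum(E) stays equal to 1 throughout.
    """
    D = {}
    E = {p: 1.0}
    for k in range(iterations):
        new_E = {}
        for q, mass in E.items():
            if k > 0 and q in hubs:
                # Expansion stops at hub pages: the mass stays parked here.
                new_E[q] = new_E.get(q, 0.0) + mass
                continue
            # A tour ends at q with probability c ...
            D[q] = D.get(q, 0.0) + c * mass
            # ... otherwise it continues to a uniformly chosen out-neighbor.
            share = (1.0 - c) * mass / len(out_links[q])
            for neighbor in out_links[q]:
                new_E[neighbor] = new_E.get(neighbor, 0.0) + share
        E = new_E
    return D, E


# Hypothetical toy graph with page c as the only hub.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
partial_a, parked_a = partial_vector(graph, {"c"}, "a")
```

In this toy graph every tour from a reaches the hub c within two steps, so the partial vector has only two nonzero entries (for a and b) and the rest of the mass ends up parked at c. With an empty hub set the same routine converges to the full hub vector.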

Computing hubs skeleton. Repeated squaring algorithm: it uses the intermediate results from the computation of partial vectors. The error is squared on each iteration, which reduces it much faster. Running time and storage depend only on the size of r_p(H). This allows the computation to be deferred to query time.

Experimental results. Experiments were performed on real web data from Stanford’s WebBase, containing 80 million pages after removing leaf pages. They were run using a 1.4 gigahertz CPU on a machine with 3.5 gigabytes of memory.

Experimental results. The partial vector approach is much more effective when H contains high-PageRank pages. H was taken from the top 1,000 to the top 100,000 pages with the highest PageRank.

Experimental results. The hubs skeleton was computed for |H| = 10,000. The average size is 9,021 entries, much less than the dimension of the full hub vectors.

Experimental results. Instead of using the entire set r_p(H), only the highest m entries are used. A hub vector containing 14 million nonzero entries can be constructed from partial vectors in 6 seconds.

Discussion. Are personalized PageRanks even useful? What if the personally chosen pages are not representative enough, or too focused? Even if the overhead scales with the number of pages, do light web users want to accept that overhead? Performance depends on the choice of personal pages.

References. Scaling Personalized Web Search, Glen Jeh and Jennifer Widom, WWW 2003. Personalized PageRank seminar: Link mining, freiburg.de/~ml/teaching/ws04/lm/ _PageRank_ Alcazar.ppt