Adaptive On-Line Page Importance Computation
Serge, Mihai, Gregory
Presented by Liang Tian
7/13/2010
Overview
- What is OPIC?
- Why should we care?
- Advantages vs off-line algorithms
- How does it work?
- Scenario of OPIC
- Challenge
- Mathematical model
- Algorithm
- Pros and cons
What is OPIC?
- OPIC stands for On-line Page Importance Computation.
Why should we care?
- OPIC provides a more resource-efficient way of computing page importance than earlier off-line algorithms.
Advantages vs off-line algorithms
- Works on-line on a large, dynamic graph
- Uses far fewer resources; e.g., it does not require storing the link matrix
- Can focus crawling on the most interesting pages
- Is fully integrated into the crawling process
How does it work?
- It is on-line in that it continuously refines its estimate of page importance while the web graph is visited.
Scenario of OPIC
- Initially, distribute some cash to each page.
- Each page, when it is crawled, distributes its current cash equally to all pages it points to.
- Record the credit history of each page: when a page is crawled, its current cash is sent to its children, and the same amount is added to its credit history, so the history records all the cash the page has distributed so far.
- Importance of a page = (credit history + current cash) / (total history + total current cash)
Challenge
- How do we find the values of current cash and history?
- Intuitively, cash flows from parent nodes to child nodes, in an inductive way.
Mathematical model
- Let G be any directed graph with n vertices. Fix an arbitrary ordering of the vertices.
- G can be represented as a matrix L[i,j] such that L[i,j] >= 0, and L[i,j] > 0 iff there is an edge from i to j.
- The basic idea is to define the importance of a page in an inductive way and then compute it using a fixpoint.
- If the graph contains n nodes, the importance is represented as a vector x in an n-dimensional space.
Mathematical model (cont.)
- Importance is defined inductively by the equation x_{k+1} = L x_k
- Given a linear transformation A, a non-zero vector x is an eigenvector of A if it satisfies the eigenvalue equation Ax = λx for some scalar λ.
Find a fixpoint
- By definition, such a fixpoint is an eigenvector of L with a real positive eigenvalue: Lx = λx
Problems
- Multiple solutions
- Iteration may not converge
Solution
- Google defines L[i,j] = 1/d[i] if there is an edge from i to j (and 0 otherwise), where d[i] is the out-degree of i.
- L'[i,j] = L[i,j] + ε, where ε is a small real.
- This defines a new graph G', which is G plus a small edge for every pair i, j.
- Convergence of the iteration is guaranteed because these small edges make G' strongly connected and aperiodic.
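The perturbed iteration can be sketched in a few lines of Python. This is only an illustration, not the slide's or Google's actual implementation: the function name `power_iteration`, the edge-list input, and the choice to renormalize x after every step are assumptions made here. It also assumes importance flows along edges, so the update is x'[j] = sum over i of x[i] * L'[i,j].

```python
def power_iteration(n, edges, eps=1e-4, iterations=200):
    """Iterate x <- x L' on the perturbed matrix L'[i][j] = L[i][j] + eps."""
    out_degree = [0] * n
    for i, _ in edges:
        out_degree[i] += 1

    # Start from the small edge eps between every pair, then add 1/d[i] per real edge.
    matrix = [[eps] * n for _ in range(n)]
    for i, j in edges:
        matrix[i][j] += 1.0 / out_degree[i]

    x = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(x[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        total = sum(nxt)
        x = [value / total for value in nxt]  # renormalize (an assumption of this sketch)
    return x


# Toy graph (hypothetical): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
importance = power_iteration(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
```

Note that this off-line computation needs the whole link matrix in memory, which is exactly the cost OPIC avoids.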
Algorithm for static graphs
- Each page k has a current cash C[k] and a credit history H[k]; G is the total history over all pages, and the total cash is normalized to 1.
- At each step, an estimate of any page k's importance is (H[k] + C[k]) / (G + 1)
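A minimal sketch of the static algorithm, following the cash and history scenario described earlier. The function name `opic_static`, the dictionary-based graph, the round-robin visiting order, and the toy graph are all assumptions for illustration; the sketch also assumes every page has at least one outgoing link.

```python
def opic_static(graph, steps):
    """Run OPIC on a fixed graph, visiting pages in round-robin order."""
    pages = sorted(graph)
    cash = {p: 1.0 / len(pages) for p in pages}   # C[k]; total cash is 1
    history = {p: 0.0 for p in pages}             # H[k]
    total_history = 0.0                           # G

    for step in range(steps):
        page = pages[step % len(pages)]
        amount = cash[page]
        # Record the cash in the page's history before giving it away.
        history[page] += amount
        total_history += amount
        cash[page] = 0.0
        # Distribute the cash equally among the pages this page points to.
        children = graph[page]
        for child in children:
            cash[child] += amount / len(children)

    # Importance estimate for each page k: (H[k] + C[k]) / (G + 1)
    return {p: (history[p] + cash[p]) / (total_history + 1.0) for p in pages}


# Toy graph (hypothetical): a -> b, a -> c, b -> c, c -> a
TOY_GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
estimate = opic_static(TOY_GRAPH, steps=3000)
```

On this toy graph the estimates approach 0.4, 0.2, and 0.4 for a, b, and c, and they always sum to 1 because cash is never created or destroyed.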
Crawling strategies
- The choice of strategy has an impact on convergence speed. There are two main strategies here:
- Random: we choose the next page to crawl randomly, with equal probability.
- Greedy: we read next the page with the highest cash. This is a greedy way to decrease the value of the error factor.
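The two strategies differ only in how the next page is picked from the current cash values. The helper names `choose_random` and `choose_greedy` and the sample cash values below are hypothetical, chosen just to make the contrast concrete.

```python
import random


def choose_random(cash, rng=random):
    """Random strategy: every page is equally likely to be crawled next."""
    return rng.choice(sorted(cash))


def choose_greedy(cash):
    """Greedy strategy: crawl the page currently holding the most cash."""
    return max(cash, key=cash.get)


# Hypothetical cash values for three pages.
cash = {"a": 0.1, "b": 0.7, "c": 0.2}
next_greedy = choose_greedy(cash)
next_random = choose_random(cash, random.Random(0))
```

Here Greedy picks page b, since it holds the most cash, while Random may pick any of the three pages.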
The Adaptive OPIC algorithm (for changing graphs)
- Based on a time window
- Window policies: Fixed Window, Variable Window, Interpolation
- Two main dimensions:
  - The page selection strategy that is used (e.g., Greedy or Random)
  - The window policy that is considered (e.g., Fixed Window or Interpolation)
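One way to read the Fixed Window policy is that a page's history only counts cash recorded during the last T units of time. The slide gives no formula, so the class below is an assumption-laden sketch: the name `WindowedHistory`, the use of an arbitrary increasing clock, and the expiry rule are all illustrative choices.

```python
from collections import deque


class WindowedHistory:
    """History of one page, limited to visits within the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.visits = deque()  # (time, cash) pairs, oldest first

    def record(self, time, cash):
        """Store the cash the page held when it was crawled at `time`."""
        self.visits.append((time, cash))
        self._expire(time)

    def total(self, now):
        """Sum of the cash recorded within the window ending at `now`."""
        self._expire(now)
        return sum(cash for _, cash in self.visits)

    def _expire(self, now):
        # Drop visits that fall outside the window (time <= now - window).
        while self.visits and self.visits[0][0] <= now - self.window:
            self.visits.popleft()


page_history = WindowedHistory(window=10)
page_history.record(1, 0.5)
page_history.record(5, 0.2)
page_history.record(12, 0.1)
recent = page_history.total(now=12)
```

Because old visits are forgotten, the estimate can follow changes in the graph instead of averaging over its entire past.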
Pros
- It may start even when a (large) part of the matrix is still unknown
- It is integrated in the crawling process
- It works on-line even while the graph is being updated
- It requires less storage than standard algorithms
- It requires less CPU, memory, and disk access than standard algorithms
Cons
- It is strictly tailored to the computational cost model of crawling the Web
- It converges more slowly than other algorithms after reading the same pages
Q&A