1
BrowseRank: Letting Web Users Vote for Page Importance
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li
SIGIR 2008
Summarized & presented by Babar Tareen, IDS Lab., Seoul National University
2009.04.10
Center for E-Business Technology, Seoul National University, Seoul, Korea
2
Copyright 2008 by CEBT
Introduction
Page importance is a key factor for web search
Currently, page importance is measured using the link graph
– HITS
– PageRank
If many important pages link to a page, then the page is also likely to be important
3
PageRank Drawbacks
The link graph is not reliable
– Links can easily be created and deleted on the web
– It can easily be manipulated by web spammers using link farms
PageRank does not consider the length of time a web surfer spends on a web page
4
BrowseRank
Utilizes the user browsing graph
Generated from user behavior data
Behavior data can be recorded by Internet browsers at web clients and collected at web servers
Behavior data includes
– URL
– Time
– Method of visiting (URL input or hyperlink click)
5
BrowseRank (2)
More visits to a page and longer time spent on it indicate that the page is important
Uses a continuous-time Markov process as the model on the user browsing graph
A Markov process is a process in which the likelihood of a given future state, at any given moment, depends only on its present state and not on any past states
[Figure: Past, Present, Future]
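The Markov property described above can be written compactly. This is the standard textbook statement, not a formula taken from the slides; X_s denotes the page visited at time s, and the history values x_u are introduced here only for illustration.

```latex
P\bigl(X_{s+t} = j \mid X_s = i,\ X_u = x_u,\ 0 \le u < s\bigr)
  = P\bigl(X_{s+t} = j \mid X_s = i\bigr)
```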
6
Originality
Proposes the use of the browsing graph for computing page importance
Proposes the use of a continuous-time Markov process to model a random walk on the user browsing graph
7
User Behavior Data
When a user surfs the web, the user can
– Input a URL
– Click on a hyperlink
Behavior data can be stored as triples
8
User Behavior Data (2)
Session segmentation
– Time rule: if the time of the current record is 30 minutes or more after that of the previous record, the current record starts a new session
– Type rule: if the type of the record is ‘INPUT’, the record starts a new session
URL pair construction
– Within each session, the URLs of adjacent records are paired
– Each pair indicates that the user transits from the first page to the second page
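The two segmentation rules and the URL pair construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout `(url, time, kind)`, the function names, and treating a gap of exactly 30 minutes as a session break are all assumptions made here.

```python
from datetime import datetime, timedelta

# Time-rule threshold from the slide; treating ">=" as the cut-off is an assumption.
SESSION_GAP = timedelta(minutes=30)


def segment_sessions(records):
    """Split one user's time-ordered (url, time, kind) records into sessions."""
    sessions = []
    for url, time, kind in records:
        new_session = (
            not sessions                                  # very first record
            or kind == "INPUT"                            # type rule
            or time - sessions[-1][-1][1] >= SESSION_GAP  # time rule
        )
        if new_session:
            sessions.append([])
        sessions[-1].append((url, time, kind))
    return sessions


def url_pairs(session):
    """Pair the URLs of adjacent records: (from_url, to_url)."""
    return [(a[0], b[0]) for a, b in zip(session, session[1:])]
```

A session that contains a single record yields no URL pairs, so it contributes a vertex but no transition.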
9
User Behavior Data (3)
Reset probability estimation
– For sessions segmented by the type rule, the first URL is input by the user
– Assign reset probabilities to those URLs
Staying time extraction
– For each URL pair, use the time difference between the second and the first page as the staying time on the first page
– For the last page of a session, use either a random time (time rule) or the time difference to the next session (type rule)
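A rough sketch of both estimates, under stated assumptions: sessions are lists of `(url, time, kind)` records (a layout chosen here for illustration), reset probabilities are taken as normalized counts of session-starting URLs (the slide only says probabilities are "assigned"), and the last-page heuristics are left out for brevity.

```python
from collections import Counter, defaultdict


def reset_probabilities(sessions):
    """Share of type-rule sessions that start at each URL (assumed normalization)."""
    starts = Counter(s[0][0] for s in sessions if s[0][2] == "INPUT")
    total = sum(starts.values())
    return {url: n / total for url, n in starts.items()}


def staying_times(sessions):
    """Observed staying times in seconds per URL, taken from adjacent records."""
    times = defaultdict(list)
    for session in sessions:
        for (url, t1, _), (_, t2, _) in zip(session, session[1:]):
            # Time on the first page of the pair = arrival at second - arrival at first.
            times[url].append((t2 - t1).total_seconds())
    return dict(times)
```

Pages that only ever appear last in a session get no entry from `staying_times`; handling them would need the random-time or next-session heuristic from the slide.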
10
User Browsing Graph
Vertex: represents a URL
– Metadata: reset probabilities, staying time
Directed edge: represents a transition between pages
– Edge weight: number of transitions
[Figure: example user browsing graph with weighted edges]
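As a small illustration of the structure above, the graph can be held as a set of vertices plus a counter of directed edges. The function name and the `(from_url, to_url)` pair format are assumptions for this sketch.

```python
from collections import Counter


def build_browsing_graph(pairs):
    """Return (vertices, edges) where edges maps (src, dst) to the transition count."""
    edges = Counter(pairs)                          # edge weight = number of transitions
    vertices = {url for pair in edges for url in pair}
    return vertices, edges
```

Per-vertex metadata (reset probability, staying times) can then live in ordinary dictionaries keyed by URL.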
11
Model
Continuous-time, time-homogeneous Markov process model
Assumptions
– Independence of users and sessions
– Markov property
– Time-homogeneity
12
Continuous-time Markov Model
X_s represents the page which the surfer is visiting at time s, s > 0
Continuous-time, time-homogeneous Markov process
P_ij(t) denotes the transition probability from page i to page j for time interval t
The stationary probability distribution Π is unique and independent of t
Computing the matrix P(t) is difficult because it is hard to get information for all time intervals
Algorithm is based on [equation not preserved in this transcript]
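One standard way to get a stationary distribution without estimating transition probabilities for every time interval is to go through the embedded (jump) chain: for a continuous-time Markov process, the stationary probability of a state is proportional to its stationary probability in the jump chain multiplied by its mean staying time. The sketch below follows that textbook route; it is not necessarily the formula that was on the slide. The mixing weight `alpha`, the iteration count, and all function names are hypothetical.

```python
def embedded_chain_stationary(edges, reset, alpha=0.85, iters=100):
    """Power iteration on the jump chain built from transition counts.

    edges: {(src, dst): count}; reset: {url: probability}, summing to 1.
    alpha (hypothetical, not from the slides) is the probability of following
    an observed transition instead of resetting.
    """
    pages = sorted({u for e in edges for u in e} | set(reset))
    out_total = {p: 0 for p in pages}
    for (src, _), n in edges.items():
        out_total[src] += n
    pi = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for (src, dst), n in edges.items():
            nxt[dst] += alpha * pi[src] * n / out_total[src]
        # Mass that resets: (1 - alpha) from pages with out-edges,
        # everything from pages without any observed out-transition.
        leaked = sum(pi[p] * ((1 - alpha) if out_total[p] else 1.0) for p in pages)
        for p in pages:
            nxt[p] += leaked * reset.get(p, 0.0)
        pi = nxt
    return pi


def continuous_time_stationary(pi_embedded, mean_stay):
    """Weight jump-chain probabilities by mean staying time and renormalize."""
    weighted = {p: pi_embedded[p] * mean_stay[p] for p in pi_embedded}
    total = sum(weighted.values())
    return {p: w / total for p, w in weighted.items()}
```

With two pages that are visited equally often, the one with the longer mean staying time ends up with the higher score, which matches the intuition on the earlier slide that time spent on a page signals importance.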
13
Algorithm
14
Experiments
Website-level BrowseRank
– Finding important websites and depressing spam sites
Page-level BrowseRank
– Improving relevance ranking
Dataset
– 3 billion records
– 950 million unique URLs
Website-level graph
– 5.6 million vertices
– 53 million edges
– 40 million websites
15
Top-20 Websites
16
Spam Fighting
2714 websites labeled as spam by human experts
17
Page-Level Testing
Adopted 3 measures to evaluate performance
– Mean Average Precision (MAP)
– Precision (P@n)
– Normalized Discounted Cumulative Gain (NDCG@n)
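For reference, the three measures can be computed roughly as below. Several variants of each exist; this sketch assumes `rels` is the list of relevance labels in ranked order and covers every judged document for the query, and it uses the common `2^rel - 1` gain with a `log2` discount for NDCG. These choices are assumptions here, not details given in the slides.

```python
import math


def precision_at_n(rels, n):
    """Fraction of the top-n results that are relevant (label > 0)."""
    return sum(1 for r in rels[:n] if r > 0) / n


def average_precision(rels):
    """Mean of precision values at the ranks where a relevant result appears."""
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """MAP: average precision averaged over the ranked lists of all queries."""
    return sum(average_precision(rels) for rels in runs) / len(runs)


def dcg_at_n(rels, n):
    """Discounted cumulative gain with gain 2^rel - 1 and log2(rank + 1) discount."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:n]))


def ndcg_at_n(rels, n):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_n(sorted(rels, reverse=True), n)
    return dcg_at_n(rels, n) / ideal if ideal else 0.0
```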
18
Results (1)
19
Results (2)
20
Technical Issues
User behavior data tends to be sparse
User behavior data can lead to reliable importance calculation for the head web pages, but not for the tail web pages
The time-homogeneity assumption is mainly for technical convenience
Content information and metadata were not used in BrowseRank
21
Discussion
A better approach to finding page importance
The paper already highlights its technical issues
Spammers can alter BrowseRank by sending fake user behavior data. This would also be easy, because behavior data is collected from clients.