
1 BrowseRank: letting the web users vote for page importance
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li
SIGIR 2008
Summarized & presented by Babar Tareen, IDS Lab., Seoul National University, 2009.04.10
Center for E-Business Technology, Seoul National University, Seoul, Korea

2 Introduction
Copyright © 2008 by CEBT
- Page importance is a key factor for web search
- Currently, page importance is measured using the link graph
  - HITS
  - PageRank
- If many important pages link to a page, then that page is also likely to be important

3 PageRank Drawbacks
- The link graph is not reliable
  - Links can easily be created and deleted on the web
  - The graph can easily be manipulated by web spammers using link farms
- PageRank does not consider the length of time a web surfer spends on a page

4 BrowseRank
- Utilizes the user browsing graph
  - Generated from user behavior data
  - Behavior data can be recorded by Internet browsers at web clients and collected at web servers
  - Behavior data includes
    - URL
    - Time
    - Method of visiting (URL input or hyperlink click)

5 BrowseRank (2)
- More visits to a page and a longer time spent on it indicate that the page is important
- Uses a continuous-time Markov process as the model on the user browsing graph
- A Markov process is a process in which the likelihood of a given future state, at any given moment, depends only on its present state, and not on any past states
[Figure: Past, Present, Future]
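The Markov property stated above can be written in symbols. The notation below is an assumption for illustration: X_t is the page being visited at time t, and i, j are pages.

```latex
\[
P\bigl(X_{t+s} = j \mid X_t = i,\ X_u = x_u \text{ for } 0 \le u < t\bigr)
  = P\bigl(X_{t+s} = j \mid X_t = i\bigr), \qquad s, t \ge 0 .
\]
```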

6 Originality
- Proposes the use of the browsing graph for computing page importance
- Proposes the use of a continuous-time Markov process to model a random walk on the user browsing graph

7 User Behavior Data
- When surfing the web, a user can
  - Input a URL
  - Click on a hyperlink
- Behavior data can be stored as triples of URL, time, and method of visiting

8 User Behavior Data (2)
- Session segmentation
  - Time rule: if the time of the current record is at least 30 minutes later than that of the previous record, the current record starts a new session
  - Type rule: if the type of the record is 'INPUT', the record starts a new session
- URL pair construction
  - Within each session, the URLs of adjacent records are put together to form a URL pair
  - A URL pair indicates that the user transits from the first page to the second page
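The two segmentation rules and the pair construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout `(url, time, visit_type)`, the `"CLICK"` label, the function names, and the choice of `>=` for the 30-minute threshold are all assumptions; only the `'INPUT'` type and the 30-minute gap come from the slide.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # threshold from the time rule


def segment_sessions(records):
    """Split one user's time-ordered (url, time, visit_type) records into sessions."""
    sessions = []
    for url, time, visit_type in records:
        starts_new = (
            not sessions                                   # very first record
            or visit_type == "INPUT"                       # type rule
            or time - sessions[-1][-1][1] >= SESSION_GAP   # time rule
        )
        if starts_new:
            sessions.append([])
        sessions[-1].append((url, time, visit_type))
    return sessions


def url_pairs(session):
    """Pair the URLs of adjacent records as (from_url, to_url)."""
    urls = [url for url, _, _ in session]
    return list(zip(urls, urls[1:]))


# Hypothetical records for a single user.
t0 = datetime(2008, 1, 1, 9, 0)
records = [
    ("a.com", t0, "INPUT"),
    ("a.com/x", t0 + timedelta(minutes=2), "CLICK"),
    ("b.com", t0 + timedelta(minutes=5), "INPUT"),
    ("b.com/y", t0 + timedelta(minutes=50), "CLICK"),
]
sessions = segment_sessions(records)
pairs = [url_pairs(s) for s in sessions]
```

In this example the third record opens a session through the type rule and the fourth through the time rule, so only the first session yields a URL pair.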

9 User Behavior Data (3)
- Reset probability estimation
  - For sessions segmented by the type rule, the first URL is input by the user
  - Reset probabilities are assigned to those URLs
- Staying time extraction
  - For each URL pair, the time difference between the second and the first page is used as the staying time of the first page
  - For the last page in a session, use either a random time (session segmented by the time rule) or the time difference to the next session (session segmented by the type rule)
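A small sketch of both estimation steps is given below. The slide only says that reset probabilities are "assigned" to user-typed first URLs; using their normalized frequency is an assumption made here, as are the function names and the `(url, time)` session layout.

```python
from collections import Counter
from datetime import datetime


def reset_probabilities(first_urls):
    """Normalize counts of session-initial, user-typed URLs (assumed estimator)."""
    counts = Counter(first_urls)
    total = sum(counts.values())
    return {url: n / total for url, n in counts.items()}


def staying_times(session):
    """Staying time in seconds for every page except the last one in the session."""
    return [
        (url, (next_time - time).total_seconds())
        for (url, time), (_, next_time) in zip(session, session[1:])
    ]


# Hypothetical inputs.
reset = reset_probabilities(["a.com", "b.com", "a.com", "c.com"])
session = [
    ("a.com", datetime(2008, 1, 1, 9, 0)),
    ("a.com/x", datetime(2008, 1, 1, 9, 2)),
    ("a.com/y", datetime(2008, 1, 1, 9, 7)),
]
stays = staying_times(session)
```

The last page of the session is deliberately left out: its staying time needs the fallback described on the slide, which depends on how the session was closed.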

10 User Browsing Graph
- Vertex: represents a URL
  - Metadata: reset probability, staying time
- Directed edge: represents a transition between pages
- Edge weight: number of transitions
[Figure: example user browsing graph with numeric edge weights]
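One possible in-memory layout for such a graph is sketched below. The dictionary-based structure and the name `build_browsing_graph` are illustrative assumptions; the slide only specifies what the vertices, edges, and weights mean.

```python
from collections import Counter, defaultdict


def build_browsing_graph(url_pairs, staying, session_starts):
    """Assemble vertices, weighted edges, and per-vertex metadata.

    url_pairs:      iterable of (from_url, to_url) transitions
    staying:        iterable of (url, seconds) staying-time observations
    session_starts: URLs typed by users at the start of sessions
    """
    edges = Counter(url_pairs)                 # edge weight = number of transitions
    times = defaultdict(list)
    for url, seconds in staying:
        times[url].append(seconds)
    starts = Counter(session_starts)
    total = sum(starts.values())
    reset = {url: n / total for url, n in starts.items()}
    vertices = {u for pair in edges for u in pair} | set(times) | set(reset)
    return {
        "vertices": vertices,
        "edges": dict(edges),
        "staying_times": dict(times),
        "reset": reset,
    }


# Hypothetical toy data.
graph = build_browsing_graph(
    url_pairs=[("a", "b"), ("a", "b"), ("b", "c")],
    staying=[("a", 10.0), ("a", 20.0), ("b", 5.0)],
    session_starts=["a", "a"],
)
```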

11 Model
- Continuous-time, time-homogeneous Markov process model
- Assumptions
  - Independence of users and sessions
  - Markov property
  - Time-homogeneity

12 Continuous-time Markov Model
- X_s represents the page the surfer is visiting at time s, s > 0
- Continuous-time, time-homogeneous Markov process
- P_ij(t) denotes the transition probability from page i to page j over a time interval t
- The stationary probability distribution Π is unique and independent of t
- Computing the matrix P is difficult because it is hard to get information for all time intervals
- Algorithm is based on [remainder not preserved in the transcript]
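Because the slide's formula is not preserved, the sketch below does not claim to reproduce the presented algorithm. It illustrates a standard result for continuous-time Markov chains that avoids computing P(t): the stationary distribution is proportional to the stationary distribution of the embedded jump chain multiplied by the mean staying time in each state. The smoothing factor `alpha`, the handling of pages without outgoing edges, the use of a plain mean staying time, and all function names are assumptions.

```python
def jump_chain_stationary(pages, edge_weight, reset, alpha=0.85, iters=200):
    """Power iteration on the jump chain: follow an edge with weight alpha,
    otherwise restart according to the reset probabilities."""
    out_total = {p: 0 for p in pages}
    for (src, _), w in edge_weight.items():
        out_total[src] += w
    pi = {p: 1 / len(pages) for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for src in pages:
            for dst in pages:
                if out_total[src]:
                    follow = edge_weight.get((src, dst), 0) / out_total[src]
                else:
                    follow = reset.get(dst, 0.0)  # page with no outgoing edges
                step = alpha * follow + (1 - alpha) * reset.get(dst, 0.0)
                nxt[dst] += pi[src] * step
        pi = nxt
    return pi


def importance(jump_pi, mean_stay):
    """Weight the jump-chain distribution by mean staying time and normalize."""
    raw = {p: jump_pi[p] * mean_stay[p] for p in jump_pi}
    total = sum(raw.values())
    return {p: v / total for p, v in raw.items()}


# Hypothetical two-page graph: a <-> b, all sessions start at a.
pages = ["a", "b"]
jump_pi = jump_chain_stationary(
    pages, {("a", "b"): 1, ("b", "a"): 1}, reset={"a": 1.0}
)
scores = importance(jump_pi, mean_stay={"a": 10.0, "b": 30.0})
```

Here page b is visited slightly less often than page a in the jump chain, yet ends up with the higher score because users stay on it three times as long, which is the effect staying time is meant to capture.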

13 Algorithm

14 Experiments
- Website-level BrowseRank
  - Finding important websites and demoting spam sites
- Page-level BrowseRank
  - Improving relevance ranking
- Dataset
  - 3 billion records
  - 950 million unique URLs
  - Website-level graph
    - 5.6 million vertices
    - 53 million edges
    - 40 million websites

15 Top-20 Websites

16 Spam Fighting
- 2,714 websites were labeled as spam by human experts

17 Page Level Testing
- Adopted 3 measures to evaluate performance
  - Mean Average Precision (MAP)
  - Precision (P@n)
  - Normalized Discounted Cumulative Gain (NDCG@n)
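For reference, the three measures can be computed as in the sketch below. The slide does not give formulas, so these are common textbook variants chosen as assumptions: average precision is normalized by the number of relevant results present in the ranked list, and NDCG uses the gain 2^rel - 1 with a log2 position discount.

```python
import math


def precision_at_n(relevant, n):
    """relevant: 0/1 labels in ranked order; fraction of the top n that are relevant."""
    return sum(relevant[:n]) / n


def average_precision(relevant):
    """Mean of the precision values taken at each relevant position."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: average of average_precision over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def ndcg_at_n(gains, n):
    """gains: graded relevance labels in ranked order."""
    def dcg(g):
        return sum(
            (2 ** x - 1) / math.log2(i + 1) for i, x in enumerate(g[:n], start=1)
        )
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


# Hypothetical labels for one query.
p2 = precision_at_n([1, 0, 1, 0], 2)
ap = average_precision([1, 0, 1, 0])
```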

18 Results (1)

19 Results (2)

20 Technical Issues
- User behavior data tends to be sparse
- User behavior data can lead to reliable importance calculation for the head web pages, but not for the tail web pages
- The time-homogeneity assumption is mainly for technical convenience
- Content information and metadata were not used in BrowseRank

21 Discussion
- A better approach to finding page importance
- The authors already highlight the technical issues themselves
- Spammers can alter BrowseRank by sending fake user behavior data. This is also easy to do, because behavior data is collected from the client.

