LinkSelector: Select Hyperlinks for Web Portals. Prof. Olivia Sheng, Xiao Fang. School of Accounting and Information Systems, University of Utah
2 Agenda: Introduction; Problem definition -- Hyperlink Selection; Solution -- LinkSelector; Evaluation; Collaboration
3 Introduction. Size of the WWW: more than 3 billion web pages (Google.com, 2001); 1 million pages added daily (Lawrence and Giles, 1999). How to find information on the Web: using search engines (best coverage 38.3%) (Lawrence and Giles, 1999); clicking through hyperlinks.
4 Introduction. [Figure: finding information by clicking through hyperlinks. Web Page 1 (Product Category List: A, B, C, D, E, F) -> click on A -> Web Page 2 (Product Category A; Product List: A1, A2, A3, A4, A5) -> click on A2 -> Web Page 3 (Product A2; Price: 1000; detailed description).]
5 Introduction. Portal page: a specific web page that serves as the entrance to a website. A portal page is important and consists mainly of hyperlinks.
6 Introduction. A web portal is a personalized entrance to a website (e.g., My Yahoo!). Default web portal/portal page: most My Yahoo! users never customize their default web portals (Manber et al., 2000).
7 Introduction Homepage of a Website/Portal Page
8 Introduction. Not all hyperlinks in a website can be placed in the portal page of the website. Hyperlinks in a portal page are selected from a hyperlink pool, which is a set of hyperlinks pointing to top-level web pages, e.g., the hyperlinks in a site index page.
9 Portal page
10 Hyperlink pool
11 Portal page
12 Hyperlink pool
13 Introduction. Number of hyperlinks in a portal page: one to several dozen (e.g., 14 in My Yahoo!) (Neilson, 1999). Number of hyperlinks in a hyperlink pool: one to several hundred (e.g., 102 in My Yahoo!).
14 Introduction. It is too computationally expensive to do an exhaustive search (e.g., ). Current practice of hyperlink selection – expert selection: based on domain experts’ experience; subjective and slow to adapt.
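To get a feel for the size of the search space, the sketch below counts the ways to choose 14 hyperlinks out of a pool of 102, the My Yahoo! figures quoted on the previous slide. It is only an illustration; the variable names are assumptions and do not come from the slides.

```python
import math

# Figures quoted in the slides for My Yahoo!
pool_size = 102    # hyperlinks in the hyperlink pool
portal_size = 14   # hyperlinks placed in the portal page

# Number of distinct portal pages an exhaustive search would have to score
candidates = math.comb(pool_size, portal_size)
print(f"{candidates:.3e} candidate portal pages")
```

Each candidate would also have to be evaluated against the web log, which is why an exhaustive search is impractical.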
15 Introduction. Our approach is based on: (1) web access patterns extracted from a web log – objective (web surfers’ actual visiting behaviors); (2) web structural patterns extracted from an existing website – objective and dynamically adaptive.
16 Hyperlink Selection. Metrics to measure the quality of a portal page: effectiveness, efficiency, and usage. The quality of a portal page is measured using a web log. A web log can be divided into sessions.
17 Hyperlink Selection. Effectiveness: the percentage of the user-sought top-level web pages that can be easily accessed from a portal page. Efficiency: measures the usefulness of the hyperlinks placed in a portal page. Usage: how often a portal page is visited.
18 Hyperlink Selection. Given: the hyperlink pool of a website, HP, and the number of hyperlinks to be placed in the portal page of the website, N, where N < |HP|. Construct the portal page by selecting N hyperlinks from the hyperlink pool HP. Objective: optimize the effectiveness, efficiency, and usage of the resulting portal page.
19 LinkSelector. LinkSelector is based on relationships between hyperlinks in a hyperlink pool: structure relationships and access relationships.
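As a rough illustration of the effectiveness metric, the sketch below scores a portal page against a list of sessions. It rests on two assumptions that the slides do not state: "easily accessed" is read as "linked directly from the portal page", and each session is represented as the set of top-level pages the user sought. The function name and the toy data are hypothetical.

```python
def effectiveness(portal_links, sessions):
    """Fraction of user-sought top-level pages that the portal links to directly.

    ASSUMPTION: "easily accessed" means "linked directly from the portal page";
    the slides do not give the exact criterion.
    """
    portal = set(portal_links)
    sought = 0
    reachable = 0
    for pages in sessions:
        for page in set(pages):
            sought += 1
            if page in portal:
                reachable += 1
    return reachable / sought if sought else 0.0


# Toy data: three sessions, five sought pages in total
sessions = [{"news", "sports"}, {"news"}, {"mail", "weather"}]
score = effectiveness(["news", "mail"], sessions)
print(score)
```

Efficiency and usage would need their own definitions, which the slides describe only informally.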
20 LinkSelector. Structure Relationship. [Figure: Web page 1 contains L1 and L3; Web page 2 contains L2, L4, L6, and L8; Web page 3 contains L5 and L7.] Structure relationship: L1 -> L2 (L1: initial hyperlink; L2: terminal hyperlink). Other structure relationships: L1 -> L4, L1 -> L6, L1 -> L8, L3 -> L5, L3 -> L7.
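The relationships listed on this slide can be reproduced with a small sketch. It assumes that a structure relationship runs from an initial hyperlink to every hyperlink found on the page it points to, which is consistent with the pairs listed but is not spelled out in the slides. The dictionaries and the function name are hypothetical.

```python
# Hypothetical site layout mirroring the slide's figure.
links_on_page = {
    "page1": ["L1", "L3"],
    "page2": ["L2", "L4", "L6", "L8"],
    "page3": ["L5", "L7"],
}
# Page each hyperlink points to; links with unknown targets are left out.
target_of = {"L1": "page2", "L3": "page3"}


def structure_relationships(links_on_page, target_of):
    """Return (initial, terminal) pairs.

    ASSUMPTION: initial -> terminal holds when the page that `initial`
    points to contains `terminal`.
    """
    pairs = []
    for initial, page in target_of.items():
        for terminal in links_on_page.get(page, []):
            if terminal != initial:
                pairs.append((initial, terminal))
    return pairs


relationships = structure_relationships(links_on_page, target_of)
for initial, terminal in relationships:
    print(f"{initial} -> {terminal}")
```

On a real site the two dictionaries would come from parsing the site's pages rather than being written by hand.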
21 LinkSelector. Access Relationship. A k-HS denotes a hyperlink set with k hyperlinks, e.g., {L1, L2} is a 2-HS. The support of a k-HS is the percentage of sessions in which all hyperlinks in the k-HS are accessed together. Example: if L1 and L2 are accessed together in 20 of 100 sessions, then the support of the 2-HS {L1, L2} is 20%.
22 LinkSelector. Access Relationship. Definition: For a k-HS, where, there exists an access relationship among the hyperlinks in the k-HS if and only if its support is greater than a predefined threshold. Example: if the threshold is 0.15 and the support of the 2-HS {L1, L2} is 0.2, then there exists an access relationship between hyperlinks L1 and L2, and the support of the relationship is 0.2.
23 LinkSelector. Discover structure relationships: parse the existing website. Discover access relationships: data preprocessing (web log cleaning, session identification), then association rule mining (Agrawal and Srikant, 1994).
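The support definition and the threshold test can be sketched in a few lines. The function names and the toy sessions are assumptions made for illustration; the numbers follow the example on the slides (20 of 100 sessions, threshold 0.15).

```python
def support(hyperlink_set, sessions):
    """Fraction of sessions in which every hyperlink of the set was accessed."""
    target = set(hyperlink_set)
    if not sessions:
        return 0.0
    hits = sum(1 for s in sessions if target <= set(s))
    return hits / len(sessions)


def has_access_relationship(hyperlink_set, sessions, threshold):
    """True when the support is strictly greater than the threshold."""
    return support(hyperlink_set, sessions) > threshold


# 100 toy sessions: L1 and L2 are accessed together in 20 of them
sessions = [{"L1", "L2"}] * 20 + [{"L1"}] * 30 + [{"L3"}] * 50

s = support({"L1", "L2"}, sessions)
related = has_access_relationship({"L1", "L2"}, sessions, threshold=0.15)
print(s, related)
```

Scanning every candidate set this way is fine for a toy log; for a large log, an association rule mining algorithm avoids enumerating sets that cannot meet the threshold.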
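Session identification is not detailed in the slides. One common heuristic, used here purely as an assumption, splits each visitor's requests whenever the gap between two consecutive requests exceeds a timeout (30 minutes below). The record layout, the visitor identifiers, and the function name are hypothetical.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; ASSUMPTION, not taken from the slides


def identify_sessions(records, timeout=SESSION_TIMEOUT):
    """Group cleaned log records into sessions.

    records: iterable of (visitor_id, unix_timestamp, url) tuples.
    Returns a list of sessions, each a list of URLs in visit order.
    """
    by_visitor = defaultdict(list)
    for visitor, ts, url in records:
        by_visitor[visitor].append((ts, url))

    sessions = []
    for hits in by_visitor.values():
        hits.sort()  # order each visitor's requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                # Gap too long: close the session and start a new one
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


log = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 60, "/shared/athletics.shtml"),
    ("10.0.0.1", 4000, "/index.html"),
    ("10.0.0.2", 30, "/index.html"),
]
sessions = identify_sessions(log)
print(sessions)
```

The resulting sessions are the input on which the support of each hyperlink set is computed.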
24 LinkSelector
25 Evaluation. Summary of Data. Hyperlink pool: the site-index page of the UA web site, 110 links.
26 Evaluation. Summary of Data. Web log: collected from the UA web server in Sep M records (raw), 4.2 M records (clean). Sessions: 344 K in total; 262 K sessions of training data (23 days); 82 K sessions of testing data (7 days).
27 Evaluation. Average improvement: 12.7%. Improvement decreases from 22.1% to 8.4%. Average number of sessions per day: 11.5 K.
28 Evaluation Group II relationship: 0.2% of the training sessions Group I relationship /shared/sports-entertain.shtml /shared/athletics.shtml
29 Evaluation. Average improvement: 17.0%. Improvement decreases from 30.2% to 9.4%. About 605 more user-sought top-level web pages per day can be easily accessed from the portal page constructed using LinkSelector than from those constructed using the other two approaches.
30 Evaluation. Average improvement: 16.9%. Improvement decreases from 30.2% to 9.3%.