1
LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002
2
2 Agenda Introduction Related work Problem definition -- Hyperlink Selection Solution -- LinkSelector Evaluation Contributions, limitations and future work
3
3 Introduction Size of WWW (Lawrence and Giles, 1999): 800 million web pages; 1 million pages added daily. How to find information on the Web: using search engines (best coverage 38.3%) (Lawrence and Giles, 1999); clicking on hyperlinks.
4
4 Introduction Example of navigating by hyperlinks: Web Page 1 shows a product category list (A, B, C, D, E, F); clicking on A opens Web Page 2, the product list for category A (A1, A2, A3, A4, A5); clicking on A2 opens Web Page 3, which shows product A2 (price: 1000, detailed description).
5
5 Introduction A portal page is the entrance to a website. Examples of portal pages: the homepage of a website; a default web portal (e.g., My Yahoo!). Most My Yahoo! users never customize their default web portals (Manber et al., 2000).
6
6 Introduction Hyperlinks in a portal page are selected from a hyperlink pool. A hyperlink pool is a set of hyperlinks pointing to top-level web pages, e.g., the hyperlinks in a site index page.
7
7 Portal page
8
8 Hyperlink pool
9
9 Portal page
10
10 Hyperlink pool
11
11 Introduction Number of hyperlinks in a portal page: several dozen (e.g., 32 in the Arizona Home page). Number of hyperlinks in a hyperlink pool: several hundred (e.g., 743 in the Arizona Index page).
12
12 Introduction It is too computationally expensive to do an exhaustive search. Current practice of hyperlink selection: expert selection. It only reflects domain experts' perspectives and is subjective.
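To give a sense of scale, the number of candidate portal pages can be counted directly. The sketch below uses the Arizona figures quoted earlier (32 hyperlinks chosen from a pool of 743); treating those two numbers as the inputs of the exhaustive search is an assumption made for illustration, not a figure taken from the slide.

```python
import math

pool_size = 743     # hyperlinks in the Arizona index page (from the earlier slide)
portal_size = 32    # hyperlinks in the Arizona home page (from the earlier slide)

# Number of distinct ways to choose portal_size hyperlinks from the pool,
# i.e. the number of portal pages an exhaustive search would have to score.
candidates = math.comb(pool_size, portal_size)
print(f"{candidates:.3e} candidate portal pages")
```

Each candidate would also have to be evaluated against the whole web log, which is why an exhaustive search is impractical.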
13
13 Introduction Our approach is based on web access patterns extracted from a web log (objective, and reflects web surfers' perspectives) and web structural patterns extracted from an existing website (objective).
14
14 Related work Web mining is the process of applying data mining techniques to extract patterns from the Web. Web Data Content: texts and graphics in web pages Structure: hyperlinks Usage: web logs
15
15 Related work Web content mining is the process of automatically retrieving, parsing, indexing and categorizing web documents (Chakrabarti, 2000). Web structure mining: HITS (Kleinberg, 1998); PageRank (Brin and Page, 1998).
16
16 Related work Web usage mining is the process of applying data mining techniques to extract web access patterns from a web log.
17
17 Related work Web usage mining: general purpose, e.g., Chen et al., 1996; Cooley et al., 1999. Website improvement, e.g., Perkowitz and Etzioni, 2000. Personalization, e.g., Yan et al., 1996.
18
18 Related work Limitations of previous web usage mining research: web structure information is not considered, e.g., Chen et al., 1996; or web structure information is used only to exclude "uninteresting" web visiting patterns, e.g., Yan et al., 1996 and Cooley et al., 1999.
19
19 Hyperlink Selection The quality of a portal page is measured using a web log and a web log can be divided into sessions. Metrics to measure the quality of a portal page Effectiveness Efficiency Usage
20
20 Hyperlink Selection Effectiveness is the percentage of user-sought top-level web pages that can be easily accessed from the portal page. What are the user-sought top-level web pages? How do we define how easily a web page can be found from a portal page?
21
21 Hyperlink Selection User-sought top-level web pages Session j: L1, L10, L11, L2, L13, L14, L5, L9, L7, L12 L1, L2, L5, and L7 are in the hyperlink pool User-sought top-level web pages: L1, L2, L5, L7
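The example on this slide can be reproduced in a few lines. This is a minimal sketch that identifies each visited page by the label of the hyperlink pointing to it, which is an assumption about the notation; the function name `user_sought_pages` is hypothetical, and the pool below lists only the members the slide names.

```python
def user_sought_pages(session, hyperlink_pool):
    """Return the pages visited in a session that are pointed to by
    hyperlinks in the pool, in visit order and without repeats."""
    seen = set()
    sought = []
    for page in session:
        if page in hyperlink_pool and page not in seen:
            seen.add(page)
            sought.append(page)
    return sought

# Session j from the slide.
session_j = ["L1", "L10", "L11", "L2", "L13", "L14", "L5", "L9", "L7", "L12"]
# Only the pool members named on the slide; a real pool would be larger.
pool = {"L1", "L2", "L5", "L7"}

print(user_sought_pages(session_j, pool))  # ['L1', 'L2', 'L5', 'L7']
```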
22
22 Hyperlink Selection Usually, web pages that are 1-2 clicks away from a portal page can be easily found from the portal page.
23
23 Hyperlink Selection Effectiveness measured at session level Effectiveness measured at log level
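The formulas for this slide did not survive in the transcript. The sketch below follows the verbal definition given earlier (the share of user-sought top-level pages that are easily accessible from the portal page). Taking the log-level value as the mean over sessions is an assumption, and so are the function names and the example set of easily accessible pages.

```python
def session_effectiveness(sought, easily_accessible):
    """Share of a session's user-sought pages that are easily accessible
    from the portal page. Returns None when nothing was sought."""
    if not sought:
        return None
    hits = sum(1 for page in sought if page in easily_accessible)
    return hits / len(sought)

def log_effectiveness(sessions, easily_accessible):
    """Assumed aggregation: mean of session-level effectiveness over all
    sessions that sought at least one top-level page."""
    scores = [session_effectiveness(s, easily_accessible) for s in sessions]
    scores = [x for x in scores if x is not None]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical set of pages within 1-2 clicks of the portal page.
reachable = {"L1", "L2", "L5"}

print(session_effectiveness(["L1", "L2", "L5", "L7"], reachable))   # 0.75
print(log_effectiveness([["L1", "L2", "L5", "L7"], ["L7"]], reachable))  # 0.375
```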
24
24 Hyperlink Selection Efficiency measures the usefulness of hyperlinks placed in a portal page. Efficiency measured at session level Efficiency measured at log level
25
25 Hyperlink Selection Usage : how often a portal page is visited.
26
26 Hyperlink Selection Definition : Given a website w, its hyperlink pool HP and the number of hyperlinks to be placed in the portal page of w – N, where, the hyperlink selection problem is to construct the portal page by selecting N hyperlinks from the hyperlink pool HP to maximize the effectiveness, efficiency and usage of the resulting portal page (i.e., all metrics are measured at the web log level).
27
27 LinkSelector LinkSelector is based on relationships between hyperlinks in a hyperlink pool. Structure Relationship Access Relationship
28
28 LinkSelector Structure Relationship
29
29 LinkSelector Structure Relationship (Figure: three web pages containing hyperlinks L1 to L8.) Structure relationships: L1 → L2, L1 → L4, L1 → L6, L1 → L8, L3 → L5, L3 → L7
30
30 LinkSelector Access Relationship A k-HS denotes a hyperlink set with k hyperlinks, e.g., {L1,L2} is a 2-HS. The support of a k-HS is the percentage of sessions in which the web pages pointed to by the hyperlinks in the k-HS are accessed together. E.g., if the web pages pointed to by L1 and L2 are accessed together in 20 sessions out of a total of 100 sessions, then the support of the 2-HS {L1,L2} is 20%.
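The support computation can be sketched as follows. The 100 sessions are synthetic, built only so that L1 and L2 co-occur in 20 of them as in the example on this slide; the function name `support` and the session representation (a set of hyperlink labels per session) are assumptions.

```python
def support(hyperlink_set, sessions):
    """Fraction of sessions in which every page pointed to by the
    hyperlinks in hyperlink_set is accessed."""
    target = set(hyperlink_set)
    hits = sum(1 for s in sessions if target <= set(s))
    return hits / len(sessions)

# Synthetic log: 100 sessions, L1 and L2 accessed together in 20 of them.
sessions = [{"L1", "L2"}] * 20 + [{"L1"}] * 30 + [{"L3"}] * 50

print(support({"L1", "L2"}, sessions))  # 0.2
```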
31
31 LinkSelector Access Relationship Definition : For a k-HS, where, there exists an access relationship among elements in the k-HS if and only if its support is greater than a pre-defined threshold.
32
32 LinkSelector
33
33 LinkSelector Group-I relationship (Figure: three web pages containing hyperlinks L1 to L8.) Structure relationships: L1 → L2, L1 → L4, L1 → L6, L1 → L8, L3 → L5, L3 → L7
34
34 LinkSelector Group-I relationship (Figure: three web pages containing hyperlinks L1 to L8.) Structure relationships: L1 → L2, L1 → L4, L1 → L6, L1 → L8, L3 → L5, L3 → L7. Access relationships (hyperlink set, support): {L1,L2}, 0.1; {L1,L4}, 0.1; {L1,L6}, 0.05; {L1,L8}, 0.05; {L3,L5}, 0.4; {L3,L7}, 0.5
35
35 LinkSelector Group-I relationship provides indicators of preference for individual hyperlinks in hyperlink selection: the number of structure relationships a hyperlink participates in as an initial hyperlink; the quality of these structure relationships.
36
36 LinkSelector Group-II relationship no structure relationship between L9 and L12 an access relationship between L9 and L12
37
37 LinkSelector Group-II relationship provides indicators of hyperlink pair preference in hyperlink selection: hyperlink pairs with Group-II relationships are preferred to hyperlink pairs without Group-II relationships within hyperlink pairs with Group-II relationships, hyperlink pairs with higher support of access relationship are preferred to those with lower support of access relationships
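The grouping of hyperlink pairs can be sketched as below. Reading Group-I as "structure relationship and access relationship both present" is inferred from the Group-I example slides; the table defining the four groups is missing from the transcript, so Groups III and IV are deliberately not distinguished. The support threshold, the {L9,L12} support value, and the {L10,L11} pair are hypothetical.

```python
# Structure relationships (initial hyperlink, terminal hyperlink) from the slides.
structure = {("L1", "L2"), ("L1", "L4"), ("L1", "L6"), ("L1", "L8"),
             ("L3", "L5"), ("L3", "L7")}

# Supports of 2-HS; the last two entries are hypothetical illustrations.
access_support = {
    frozenset(("L1", "L2")): 0.1, frozenset(("L1", "L4")): 0.1,
    frozenset(("L1", "L6")): 0.05, frozenset(("L1", "L8")): 0.05,
    frozenset(("L3", "L5")): 0.4, frozenset(("L3", "L7")): 0.5,
    frozenset(("L9", "L12")): 0.3, frozenset(("L10", "L11")): 0.2,
}

def classify(a, b, threshold):
    """Return 'Group-I', 'Group-II', or None (Groups III/IV not distinguished)."""
    has_structure = (a, b) in structure or (b, a) in structure
    has_access = access_support.get(frozenset((a, b)), 0.0) > threshold
    if has_access and has_structure:
        return "Group-I"
    if has_access:
        return "Group-II"
    return None

def rank_group_ii(pairs, threshold):
    """Order Group-II pairs by support, highest first."""
    group_ii = [p for p in pairs if classify(p[0], p[1], threshold) == "Group-II"]
    return sorted(group_ii, key=lambda p: access_support[frozenset(p)], reverse=True)

THRESHOLD = 0.04  # hypothetical support threshold
print(classify("L1", "L2", THRESHOLD))   # Group-I
print(classify("L9", "L12", THRESHOLD))  # Group-II
print(rank_group_ii([("L10", "L11"), ("L9", "L12"), ("L1", "L2")], THRESHOLD))
```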
38
38 LinkSelector Group-III relationship reveals patterns that are not relevant to hyperlink selection Group-IV relationship does not reveal interesting patterns. L1 Web page 1 L5 Web page 2
39
39 LinkSelector The Sketch of LinkSelector
40
40 LinkSelector Discover Structure Relationships
41
41 LinkSelector Access relationship can be discovered from a web log using association rule mining Data Preprocessing Association rule mining
42
42 LinkSelector Data Preprocessing Web log cleaning: error logs (e.g., status code 404); accessory logs (e.g., .gif). Session identification: modify web server; 30-min time interval.
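The two preprocessing steps can be sketched as follows. The record layout (dicts with `client`, `time`, `url`, `status`), the rule that any status of 400 or above counts as an error, and the reading of the 30-min interval as "start a new session when a client is idle for more than 30 minutes" are all assumptions; the slide gives only the examples and the interval length.

```python
from datetime import datetime, timedelta

ACCESSORY_SUFFIXES = (".gif",)          # slide example; extend as needed
SESSION_GAP = timedelta(minutes=30)     # 30-min time interval from the slide

def clean(records):
    """Drop error records and requests for accessory files."""
    return [r for r in records
            if r["status"] < 400
            and not r["url"].lower().endswith(ACCESSORY_SUFFIXES)]

def sessionize(records):
    """Group each client's requests into sessions, starting a new session
    whenever the gap since the client's previous request exceeds SESSION_GAP."""
    sessions = []
    last = {}  # client -> (time of last request, current session list)
    for r in sorted(records, key=lambda r: (r["client"], r["time"])):
        prev = last.get(r["client"])
        if prev is None or r["time"] - prev[0] > SESSION_GAP:
            current = []
            sessions.append(current)
        else:
            current = prev[1]
        current.append(r["url"])
        last[r["client"]] = (r["time"], current)
    return sessions

# Hypothetical log records for a single client.
t = datetime(2001, 9, 1, 10, 0)
records = [
    {"client": "a", "time": t, "url": "/index.html", "status": 200},
    {"client": "a", "time": t, "url": "/logo.gif", "status": 200},
    {"client": "a", "time": t + timedelta(minutes=5), "url": "/missing", "status": 404},
    {"client": "a", "time": t + timedelta(minutes=10), "url": "/admissions", "status": 200},
    {"client": "a", "time": t + timedelta(minutes=60), "url": "/library", "status": 200},
]
print(sessionize(clean(records)))
# [['/index.html', '/admissions'], ['/library']]
```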
43
43 LinkSelector Association rule mining Given: a transaction database (tid: items) 001: 1,2,3; 002: 1,2,4; 003: 2,3,4; 004: 4,5,6. An itemset is a set of items, e.g., {1,2}. The support of an itemset is the percentage of transactions that contain (e.g., purchase) the itemset. Objective: discover all itemsets with supports larger than a user-defined threshold. Apriori Algorithm (Agrawal and Srikant, 1994)
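A compact, unoptimized Apriori-style search over the transaction database on this slide might look like the sketch below. The threshold of 0.4 is a hypothetical choice, and the code is an illustration of level-wise candidate generation with subset pruning, not the implementation used in the talk.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets whose support is
    strictly larger than min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: v for s, v in level.items() if v > min_support}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = {c: support(c) for c in candidates}
        level = {c: v for c, v in level.items() if v > min_support}
        frequent.update(level)
        k += 1
    return frequent

# Transaction database from the slide (tids 001 to 004).
transactions = [{1, 2, 3}, {1, 2, 4}, {2, 3, 4}, {4, 5, 6}]
frequent = apriori(transactions, min_support=0.4)  # hypothetical threshold

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In LinkSelector's setting, sessions play the role of transactions and hyperlinks play the role of items, so the frequent itemsets correspond to hyperlink sets with an access relationship.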
44
44 LinkSelector Calculate Preferences for Hyperlinks
45
45 LinkSelector Calculate Preferences for Hyperlinks Sets No structure relationships between and and and. Preference for hyperlink set is 0.022
46
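46 LinkSelector Clustering is a data mining algorithm that segments objects into groups based on their similarities. (Figure: five objects, 1 to 5, with their similarity matrix. Recoverable entries: row 1 = 0, 0.2, 0.1, 0.1, 0.05; row 2 = 0.2, 0, 0.1, 0.1, 0.05; the remaining rows are incomplete.)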
47
47 LinkSelector Hyperlink Clustering: hyperlinks → objects; preferences for hyperlinks → weights of objects; preferences for hyperlink sets → similarities among hyperlinks
48
48 LinkSelector Limitations of classical clustering algorithms Weights of objects are not considered. Only considers similarities between two objects
49
49 LinkSelector Our solution Indexes of the proposed similarity matrix are clusters while indexes of the traditional similarity matrix are objects to be clustered. Similarities involving two and more objects are considered in the proposed similarity matrix. Weights of objects are considered
50
50 LinkSelector
51
51 Experiment Data collected from UA website. Hyperlink pool: 110 links. Web log collected in Sep. 2001: 10 M records; 4.2 M records; 344 K sessions in total, of which 262 K sessions are training data (23 days) and 82 K sessions are testing data (7 days).
52
52 Experiment Hyperlinks selected by LinkSelector, Domain experts and access frequency (N=6)
53
53 Experiment Average improvement: 12.7%. Improvement decreases from 22.1% to 8.4%. Average number of sessions per day: 11.5 K
54
54 Experiment Average improvement: 17.0%. Improvement decreases from 30.2% to 9.4%. Absolute number of hyperlinks improved: 15610 to 6509
55
55 Experiment Average improvement: 16.9%. Improvement decreases from 30.2% to 9.3%
56
56 Experiment Hyperlinks selected by LinkSelector, Classical Hierarchical Clustering and Association rule mining (N=6)
57
57 Experiment Average improvement compared with association rule mining: 25.8% Average improvement compared with classical clustering: 102.0%
58
58 Experiment Average improvement compared with association rule mining: 31.7% Average improvement compared with classical clustering: 124.0%
59
59 Experiment Average improvement compared with association rule mining: 31.6% Average improvement compared with classical clustering: 123.0%
60
60 Contributions 1. We proposed and formally defined a new and important research problem: hyperlink selection. 2. We proposed a web mining based hyperlink selection approach and showed that it outperforms other hyperlink selection approaches. 3. We developed a new clustering algorithm for hyperlink selection.
61
61 Limitations and Future work User Study Adaptive LinkSelector Structure of website changes Web visiting patterns change