LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002.

1 LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002

2 2 Agenda: Introduction; Related work; Problem definition (Hyperlink Selection); Solution (LinkSelector); Evaluation; Contributions, limitations and future work

3 3 Introduction Size of the WWW (Lawrence and Giles, 1999): 800 million web pages, with 1 million pages added daily. How to find information on the Web: using search engines (best coverage 38.3%) (Lawrence and Giles, 1999) or clicking on hyperlinks.

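4 4 Introduction [Figure: finding information by clicking on hyperlinks. Web Page 1 shows a product category list (A, B, C, D, E, F). Clicking on A opens Web Page 2, the product list for category A (A1, A2, A3, A4, A5). Clicking on A2 opens Web Page 3, which shows product A2 with its price (1000) and a detailed description. The figure also carries the label B2.]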

5 5 Introduction Portal page: the entrance to a website. Examples of portal pages: the homepage of a website; a default web portal (e.g., My Yahoo!). Most My Yahoo! users never customize their default web portals (Manber et al., 2000).

6 6 Introduction Hyperlinks in a portal page are selected from a hyperlink pool. A hyperlink pool is a set of hyperlinks pointing to top-level web pages, e.g., the hyperlinks in a site index page.

7 7 Portal page

8 8 Hyperlink pool

9 9 Portal page

10 10 Hyperlink pool

11 11 Introduction Number of hyperlinks in a portal page: several dozen (e.g., 32 in the Arizona Home page). Number of hyperlinks in a hyperlink pool: several hundred (e.g., 743 in the Arizona Index page).

12 12 Introduction It is too computationally expensive to do an exhaustive search (e.g., ). Current practice of hyperlink selection: expert selection, which reflects only domain experts' perspectives and is subjective.
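
To give a sense of scale for the exhaustive search mentioned on this slide, the number of candidate portal pages can be counted directly. This is a minimal sketch, assuming the figures quoted earlier for the Arizona site (32 hyperlinks chosen from a pool of 743); the constant names are illustrative and not from the presentation.

```python
import math

# Figures quoted earlier for the Arizona site, used here only as an illustration.
POOL_SIZE = 743    # hyperlinks in the hyperlink pool (the index page)
PORTAL_SIZE = 32   # hyperlinks placed in the portal page

# Number of distinct ways to choose PORTAL_SIZE hyperlinks from the pool.
candidates = math.comb(POOL_SIZE, PORTAL_SIZE)

print(f"{candidates:.3e} candidate portal pages")
```

Each candidate would also have to be scored against the whole web log, which is why enumerating all of them is impractical.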

13 13 Introduction Our approach is based on (1) web access patterns extracted from a web log, which are objective and reflect web surfers' perspectives, and (2) web structural patterns extracted from an existing website, which are objective.

14 14 Related work Web mining is the process of applying data mining techniques to extract patterns from the Web. Web data: content (texts and graphics in web pages), structure (hyperlinks), usage (web logs).

15 15 Related work Web content mining is the process of automatically retrieving, parsing, indexing and categorizing web documents (Chakrabarti, 2000). Web structure mining: HITS (Kleinberg, 1998), PageRank (Brin and Page, 1998).

16 16 Related work Web usage mining is the process of applying data mining techniques to extract web access patterns from a web log.

17 17 Related work Web usage mining: general purpose, e.g., Chen et al., 1996; Cooley et al., 1999. Website improvement, e.g., Perkowitz and Etzioni, 2000. Personalization, e.g., Yan et al., 1996.

18 18 Related work Limitations of previous web usage mining research: web structure information is not considered, e.g., Chen et al., 1996; or web structure information is used only to exclude “uninteresting” web visiting patterns, e.g., Yan et al., 1996 and Cooley et al., 1999.

19 19 Hyperlink Selection The quality of a portal page is measured using a web log, which can be divided into sessions. Metrics to measure the quality of a portal page: effectiveness, efficiency, usage.

20 20 Hyperlink Selection Effectiveness is the percentage of user-sought top-level web pages that can be easily accessed from the portal page. What are the user-sought top-level web pages? How do we define how easy it is to find a web page from a portal page?

21 21 Hyperlink Selection User-sought top-level web pages. Session j: L1, L10, L11, L2, L13, L14, L5, L9, L7, L12. L1, L2, L5, and L7 are in the hyperlink pool. User-sought top-level web pages: L1, L2, L5, L7.
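
The example on this slide amounts to filtering a session by membership in the hyperlink pool. A minimal sketch, assuming each session entry is identified by the label of the hyperlink that points to it, as the slide does; the function name and the extra pool members L3 and L8 are hypothetical.

```python
def user_sought_pages(session, hyperlink_pool):
    """Return the pages of a session that belong to the hyperlink pool,
    in first-visit order and without duplicates."""
    pool = set(hyperlink_pool)
    seen = set()
    sought = []
    for page in session:
        if page in pool and page not in seen:
            seen.add(page)
            sought.append(page)
    return sought


# Session j as listed on the slide.
session_j = ["L1", "L10", "L11", "L2", "L13", "L14", "L5", "L9", "L7", "L12"]

# The slide states that L1, L2, L5 and L7 are in the pool;
# L3 and L8 are hypothetical extra members that session j never visits.
hyperlink_pool = {"L1", "L2", "L3", "L5", "L7", "L8"}

sought = user_sought_pages(session_j, hyperlink_pool)
print(sought)  # ['L1', 'L2', 'L5', 'L7']
```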

22 22 Hyperlink Selection Usually, web pages that are 1-2 clicks away from a portal page can be easily found from the portal page.

23 23 Hyperlink Selection Effectiveness measured at session level Effectiveness measured at log level
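
The formulas for this slide are not present in the transcript. As an illustration of the verbal definition only (the percentage of user-sought top-level pages that are easy to reach from the portal page), the sketch below makes two assumptions that may differ from the original: the set of "easy" pages is supplied by the caller (for example, pages within 1-2 clicks of the portal page), and the log-level value pools the counts over all sessions. All names are hypothetical.

```python
def session_effectiveness(sought, easy_pages):
    """Fraction of one session's user-sought pages that are easy to reach.
    Returns None when the session sought no top-level page."""
    sought = set(sought)
    if not sought:
        return None
    return len(sought & set(easy_pages)) / len(sought)


def log_effectiveness(sessions_sought, easy_pages):
    """Assumed aggregation: pool hits and totals over all sessions in the log."""
    easy = set(easy_pages)
    total = sum(len(set(s)) for s in sessions_sought)
    hits = sum(len(set(s) & easy) for s in sessions_sought)
    return hits / total if total else None


# Hypothetical data: pages assumed reachable within 1-2 clicks of the portal page.
easy_pages = {"L1", "L2", "L7"}
sessions_sought = [["L1", "L2", "L5", "L7"], ["L5"]]

print(session_effectiveness(sessions_sought[0], easy_pages))  # 0.75
print(log_effectiveness(sessions_sought, easy_pages))         # 0.6
```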

24 24 Hyperlink Selection Efficiency measures the usefulness of hyperlinks placed in a portal page. Efficiency measured at session level Efficiency measured at log level

25 25 Hyperlink Selection Usage : how often a portal page is visited.

26 26 Hyperlink Selection Definition : Given a website w, its hyperlink pool HP and the number of hyperlinks to be placed in the portal page of w – N, where, the hyperlink selection problem is to construct the portal page by selecting N hyperlinks from the hyperlink pool HP to maximize the effectiveness, efficiency and usage of the resulting portal page (i.e., all metrics are measured at the web log level).

27 27 LinkSelector LinkSelector is based on relationships between hyperlinks in a hyperlink pool. Structure Relationship Access Relationship

28 28 LinkSelector Structure Relationship

29 29 LinkSelector Structure Relationship [Figure: Web page 1 contains L1 and L3; Web page 2 contains L2, L4, L6, L8; Web page 3 contains L5, L7.] Structure relationships: L1 → L2, L1 → L4, L1 → L6, L1 → L8, L3 → L5, L3 → L7
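
One reading of this slide is that a structure relationship runs from a hyperlink to every hyperlink found on the page it points to. The sketch below encodes that reading, assuming L1 points to Web page 2 and L3 points to Web page 3 as the diagram suggests; the dictionaries and the function name are hypothetical and not the presentation's own procedure.

```python
# Hypothetical site model read off the slide's diagram.
links_on_page = {
    "page1": ["L1", "L3"],
    "page2": ["L2", "L4", "L6", "L8"],
    "page3": ["L5", "L7"],
}
# Target page of each hyperlink (only the targets the diagram shows).
target_of = {"L1": "page2", "L3": "page3"}


def structure_relationships(links_on_page, target_of):
    """Return (initial, terminal) pairs: 'terminal' appears on the page
    that 'initial' points to."""
    pairs = []
    for link, page in target_of.items():
        for inner in links_on_page.get(page, []):
            if inner != link:
                pairs.append((link, inner))
    return pairs


relationships = structure_relationships(links_on_page, target_of)
for initial, terminal in relationships:
    print(f"{initial} -> {terminal}")
```

With this input the function reproduces the six relationships listed on the slide.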

30 30 LinkSelector Access Relationship A k-HS denotes a hyperlink set with k hyperlinks, e.g., {L1,L2} is a 2-HS. The support of a k-HS is the percentage of sessions in which the web pages pointed to by the hyperlinks in the k-HS are accessed together, e.g., if the web pages pointed to by L1 and L2 are accessed together in 20 sessions out of a total of 100 sessions, then the support of the 2-HS {L1,L2} is 20%.

31 31 LinkSelector Access Relationship Definition : For a k-HS, where, there exists an access relationship among elements in the k-HS if and only if its support is greater than a pre-defined threshold.
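
The support computation and the threshold test can be sketched as follows. The session data is constructed to match the 20-out-of-100 example given for {L1,L2}; the function names and the threshold values are hypothetical.

```python
def support(hyperlink_set, sessions):
    """Fraction of sessions in which all pages pointed to by the set are accessed."""
    wanted = set(hyperlink_set)
    if not sessions:
        return 0.0
    hits = sum(1 for session in sessions if wanted <= set(session))
    return hits / len(sessions)


def has_access_relationship(hyperlink_set, sessions, threshold):
    """An access relationship exists iff support is strictly greater than the threshold."""
    return support(hyperlink_set, sessions) > threshold


# 100 illustrative sessions: L1 and L2 are accessed together in 20 of them.
sessions = [["L1", "L2"]] * 20 + [["L1"]] * 30 + [["L3"]] * 50

print(support({"L1", "L2"}, sessions))                       # 0.2
print(has_access_relationship({"L1", "L2"}, sessions, 0.1))  # True
```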

32 32 LinkSelector

33 33 LinkSelector Group-I relationship [Figure: Web page 1 contains L1 and L3; Web page 2 contains L2, L4, L6, L8; Web page 3 contains L5, L7.] Structure relationships: L1 → L2, L1 → L4, L1 → L6, L1 → L8, L3 → L5, L3 → L7

34 34 LinkSelector Group-I relationship [Figure: Web page 1 contains L1 and L3; Web page 2 contains L2, L4, L6, L8; Web page 3 contains L5, L7.] Structure relationships: L1 → L2, L1 → L4, L1 → L6, L1 → L8, L3 → L5, L3 → L7. Access relationships (hyperlink set, support): {L1,L2}, 0.1; {L1,L4}, 0.1; {L1,L6}, 0.05; {L1,L8}, 0.05; {L3,L5}, 0.4; {L3,L7}, 0.5

35 35 LinkSelector Group-I relationship provides indicators of preference for individual hyperlinks in hyperlink selection: the number of structure relationships a hyperlink participates in as an initial hyperlink, and the quality of these structure relationships.

36 36 LinkSelector Group-II relationship: there is no structure relationship between L9 and L12, but there is an access relationship between L9 and L12.

37 37 LinkSelector Group-II relationship provides indicators of hyperlink pair preference in hyperlink selection: (1) hyperlink pairs with Group-II relationships are preferred to hyperlink pairs without Group-II relationships; (2) among hyperlink pairs with Group-II relationships, pairs whose access relationship has higher support are preferred to those with lower support.

38 38 LinkSelector Group-III relationship reveals patterns that are not relevant to hyperlink selection. Group-IV relationship does not reveal interesting patterns. [Figure: L1 on Web page 1; L5 on Web page 2.]

39 39 LinkSelector The Sketch of LinkSelector

40 40 LinkSelector Discover Structure Relationships

41 41 LinkSelector Access relationships can be discovered from a web log using association rule mining, in two steps: data preprocessing and association rule mining.

42 42 LinkSelector Data Preprocessing. Web log cleaning: error logs (e.g., status code 404) and accessory logs (e.g., .gif). Session identification: modify the web server, or use a 30-min time interval.
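
The two preprocessing steps can be sketched as below. This is only an illustration under stated assumptions: log records are already parsed into dictionaries with hypothetical keys (`client`, `time`, `url`, `status`), requests are grouped by client, any status of 400 or above counts as an error, and a gap of more than 30 minutes between consecutive requests starts a new session. Only `.gif` is named on the slide; the other suffixes are assumed.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Only .gif appears on the slide; the other suffixes are assumptions.
ACCESSORY_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
SESSION_GAP = timedelta(minutes=30)


def clean(records):
    """Drop error records (status >= 400) and accessory requests such as images."""
    return [
        r for r in records
        if r["status"] < 400 and not r["url"].lower().endswith(ACCESSORY_SUFFIXES)
    ]


def sessionize(records):
    """Split each client's requests into sessions at gaps longer than SESSION_GAP."""
    by_client = defaultdict(list)
    for r in sorted(records, key=lambda r: r["time"]):
        by_client[r["client"]].append(r)

    sessions = []
    for requests in by_client.values():
        current = [requests[0]["url"]]
        for prev, cur in zip(requests, requests[1:]):
            if cur["time"] - prev["time"] > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(cur["url"])
        sessions.append(current)
    return sessions


# Hypothetical log records.
t0 = datetime(2001, 9, 1, 10, 0)
records = [
    {"client": "a", "time": t0, "url": "/index.html", "status": 200},
    {"client": "a", "time": t0 + timedelta(minutes=1), "url": "/logo.gif", "status": 200},
    {"client": "a", "time": t0 + timedelta(minutes=5), "url": "/missing.html", "status": 404},
    {"client": "a", "time": t0 + timedelta(minutes=10), "url": "/admissions.html", "status": 200},
    {"client": "a", "time": t0 + timedelta(hours=2), "url": "/library.html", "status": 200},
    {"client": "b", "time": t0 + timedelta(minutes=3), "url": "/index.html", "status": 200},
]

cleaned = clean(records)
sessions = sessionize(cleaned)
print(sessions)
```

Client `a` ends up with two sessions because its last request arrives almost two hours after the previous one.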

43 43 LinkSelector Association rule mining. Given: a transaction database
tid   items
001   1,2,3
002   1,2,4
003   2,3,4
004   4,5,6
An itemset is a set of items, e.g., {1,2}. The support of an itemset is the percentage of transactions that contain (e.g., purchase) the itemset. Objective: discover all itemsets with supports larger than a user-defined threshold. Apriori algorithm (Agrawal and Srikant, 1994).
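
A compact level-wise search in the spirit of Apriori, run on the four transactions from this slide. It is a simplified sketch rather than the algorithm exactly as published: candidates are formed by unioning frequent itemsets of the previous level and pruned when any subset is infrequent. The threshold of 0.4 and all names are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, threshold):
    """Return {itemset: support} for all itemsets with support > threshold."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)

    def support(itemset):
        return sum(1 for t in tsets if itemset <= t) / n

    # Level 1: single items.
    items = {item for t in tsets for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s > threshold:
            current[frozenset([item])] = s

    result = {}
    k = 1
    while current:
        result.update(current)
        k += 1
        keys = list(current)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in keys for b in keys if len(a | b) == k}
        nxt = {}
        for c in candidates:
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current for sub in combinations(c, k - 1)):
                s = support(c)
                if s > threshold:
                    nxt[c] = s
        current = nxt
    return result


# Transaction database from the slide.
transactions = [{1, 2, 3}, {1, 2, 4}, {2, 3, 4}, {4, 5, 6}]
result = frequent_itemsets(transactions, threshold=0.4)

for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

In the hyperlink setting, sessions play the role of transactions and hyperlinks in the pool play the role of items.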

44 44 LinkSelector  Calculate Preferences for Hyperlinks

45 45 LinkSelector  Calculate Preferences for Hyperlinks Sets No structure relationships between and and and. Preference for hyperlink set is 0.022

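46 46 LinkSelector Clustering is a data mining algorithm to segment objects into groups based on their similarities. [Figure: dendrogram over objects 1 to 5, with levels 0.2 and 0.12.] Similarity matrix over objects 1 to 5 (entries missing in the transcript are left blank):
      1     2     3     4     5
1     0     0.2   0.1   0.1   0.05
2     0.2   0     0.1   0.1   0.05
3     0.1   0.1
4
5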

47 47 LinkSelector Hyperlink Clustering: hyperlinks → objects; preferences for hyperlinks → weights of objects; preferences for hyperlink sets → similarities among hyperlinks

48 48 LinkSelector Limitations of classical clustering algorithms: weights of objects are not considered; only similarities between two objects are considered.

49 49 LinkSelector Our solution: the indexes of the proposed similarity matrix are clusters, whereas the indexes of the traditional similarity matrix are the objects to be clustered. Similarities involving two or more objects are considered in the proposed similarity matrix. Weights of objects are considered.

50 50 LinkSelector

51 51 Experiment Data collected from the UA website. Hyperlink pool: 110 links. Web log: collected in Sep. 2001; 10 M records → 4.2 M records; 344 K sessions in total, of which 262 K sessions → training data (23 days) and 82 K sessions → testing data (7 days).

52 52 Experiment Hyperlinks selected by LinkSelector, Domain experts and access frequency (N=6)

53 53 Experiment Average improvement: 12.7%. Improvement decreases from 22.1% to 8.4%. Average number of sessions per day: 11.5k

54 54 Experiment Average improvement: 17.0%. Improvement decreases from 30.2% to 9.4%. Absolute number of hyperlinks improved: 15610 to 6509

55 55 Experiment Average improvement: 16.9%. Improvement decreases from 30.2% to 9.3%

56 56 Experiment Hyperlinks selected by LinkSelector, Classical Hierarchical Clustering and Association rule mining (N=6)

57 57 Experiment Average improvement compared with association rule mining: 25.8% Average improvement compared with classical clustering: 102.0%

58 58 Experiment Average improvement compared with association rule mining: 31.7% Average improvement compared with classical clustering: 124.0%

59 59 Experiment Average improvement compared with association rule mining: 31.6% Average improvement compared with classical clustering: 123.0%

60 60 Contributions 1. We proposed and formally defined a new and important research problem: hyperlink selection. 2. We proposed a web-mining-based hyperlink selection approach and showed that it outperforms other hyperlink selection approaches. 3. We developed a new clustering algorithm for hyperlink selection.

61 61 Limitations and Future work User Study Adaptive LinkSelector Structure of website changes Web visiting patterns change

