LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002.

Presentation transcript:

LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002

2 Agenda Introduction Related work Problem definition -- Hyperlink Selection Solution -- LinkSelector Evaluation Contributions, limitations and future work

3 Introduction Size of WWW (Lawrence and Giles,1999) 800 million web pages 1 million pages added daily How to find information on the Web Using search engines (best coverage 38.3%) (Lawrence and Giles,1999) Clicking on hyperlinks

4 Introduction Product Category List A B C D E F Product Category A Product List A1 A2 A3 A4 A5 Product A2 Price: 1000 Detailed description Click on A Click on A2 Web Page 1 Web Page 2 Web Page 3 B2

5 Introduction A portal page is the entrance to a website. Portal page Homepage of a website Default web portal (e.g., My Yahoo!) Most My Yahoo! users never customize their default web portals (Manber et al., 2000).

6 Introduction Hyperlinks in a portal page are selected from a hyperlink pool. A hyperlink pool is a set of hyperlinks pointing to top-level web pages, e.g., the hyperlinks in a site index page.

7 Portal page

8 Hyperlink pool

9 Portal page

10 Hyperlink pool

11 Introduction Number of hyperlinks in a portal page: several dozen (e.g., 32 in the Arizona Home page). Number of hyperlinks in a hyperlink pool: several hundred (e.g., 743 in the Arizona Index page).

12 Introduction It is too computationally expensive to do an exhaustive search (e.g., ). Current practice of hyperlink selection: expert selection Only reflects domain experts’ perspectives Subjective

13 Introduction Our approach is based on web access patterns extracted from a web log, which are objective and reflect web surfers’ perspectives, and on web structural patterns extracted from an existing website, which are objective.

14 Related work Web mining is the process of applying data mining techniques to extract patterns from the Web. Web Data Content: texts and graphics in web pages Structure: hyperlinks Usage: web logs

15 Related work Web content mining is the process of automatically retrieving, parsing, indexing and categorizing web documents (Chakrabarti, 2000). Web structure mining HITS (Kleinberg, 1998) PageRank (Brin and Page, 1998)

16 Related work Web usage mining is the process of applying data mining techniques to extract web access patterns from a web log.

17 Related work Web usage mining General purpose, e.g., Chen et al. 1996; Cooley et al., 1999 Website improvement, e.g., Perkowitz and Etzioni, 2000 Personalization, e.g., Yan 1996

18 Related work Limitations of previous web usage mining research Not considering web structure information, e.g., Chen et al., 1996 Web structure information is used to exclude “uninteresting” web visiting patterns, e.g., Yan et al., 1996 and Cooley et al., 1999

19 Hyperlink Selection The quality of a portal page is measured using a web log and a web log can be divided into sessions. Metrics to measure the quality of a portal page Effectiveness Efficiency Usage

20 Hyperlink Selection Effectiveness is the percentage of user-sought top-level web pages that can be easily accessed from the portal page. What are the user-sought top-level web pages? How should we define how easily a web page can be found from a portal page?

21 Hyperlink Selection User-sought top-level web pages Session j: L1, L10, L11, L2, L13, L14, L5, L9, L7, L12 L1, L2, L5, and L7 are in the hyperlink pool User-sought top-level web pages: L1, L2, L5, L7
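
The example on this slide can be read as a filter: keep the pages of a session that are pointed to by hyperlinks in the hyperlink pool. The sketch below is a minimal illustration under that reading; the function name `user_sought_pages` and the convention of identifying a page by its hyperlink label are assumptions made for the example, not notation from the slides.

```python
def user_sought_pages(session, hyperlink_pool):
    """Return the pages of a session that are in the hyperlink pool.

    Pages are identified by hyperlink labels (an assumption for this sketch).
    Visit order is preserved and repeated visits are reported once.
    """
    seen = set()
    sought = []
    for page in session:
        if page in hyperlink_pool and page not in seen:
            seen.add(page)
            sought.append(page)
    return sought


# Session j and the pool members named on the slide.
session_j = ["L1", "L10", "L11", "L2", "L13", "L14", "L5", "L9", "L7", "L12"]
pool = {"L1", "L2", "L5", "L7"}

print(user_sought_pages(session_j, pool))  # ['L1', 'L2', 'L5', 'L7']
```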

22 Hyperlink Selection Usually, web pages that are 1-2 clicks away from a portal page can be easily found from the portal page.

23 Hyperlink Selection Effectiveness measured at session level Effectiveness measured at log level
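
The formulas for these two slides did not survive in the transcript, so the following is only a sketch built from the verbal definition: effectiveness is the share of user-sought top-level pages that are easy to reach, where "easy" is taken to mean at most two clicks from the portal page. The names `reachable_within`, `session_effectiveness` and `log_effectiveness`, the `site_graph` structure, and the choice to average session scores at the log level are all assumptions.

```python
from collections import deque


def reachable_within(portal_links, site_graph, max_clicks=2):
    """Pages reachable from the portal page in at most max_clicks clicks."""
    reached = set(portal_links)  # one click away
    frontier = deque((page, 1) for page in portal_links)
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_clicks:
            continue
        for nxt in site_graph.get(page, ()):
            if nxt not in reached:
                reached.add(nxt)
                frontier.append((nxt, depth + 1))
    return reached


def session_effectiveness(sought, easy_pages):
    """Fraction of a session's user-sought pages that are easy to reach."""
    if not sought:
        return None  # session sought no top-level page
    return sum(1 for page in sought if page in easy_pages) / len(sought)


def log_effectiveness(sessions_sought, easy_pages):
    """Assumed log-level aggregate: mean of the defined session scores."""
    scores = [session_effectiveness(s, easy_pages) for s in sessions_sought]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical portal with two hyperlinks and a tiny site graph.
portal_links = {"L1", "L5"}
site_graph = {"L1": ["L2"], "L2": ["L7"]}
easy = reachable_within(portal_links, site_graph)  # L7 is three clicks away

print(session_effectiveness(["L1", "L2", "L5", "L7"], easy))  # 0.75
```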

24 Hyperlink Selection Efficiency measures the usefulness of hyperlinks placed in a portal page. Efficiency measured at session level Efficiency measured at log level

25 Hyperlink Selection Usage : how often a portal page is visited.

26 Hyperlink Selection Definition : Given a website w, its hyperlink pool HP and the number of hyperlinks to be placed in the portal page of w – N, where, the hyperlink selection problem is to construct the portal page by selecting N hyperlinks from the hyperlink pool HP to maximize the effectiveness, efficiency and usage of the resulting portal page (i.e., all metrics are measured at the web log level).

27 LinkSelector LinkSelector is based on relationships between hyperlinks in a hyperlink pool. Structure Relationship Access Relationship

28 LinkSelector Structure Relationship

29 LinkSelector Structure Relationship L2 L4 L6 L8 L1 L3 Web page 1 Web page 2 L5 L7 Web page 3 Structure relationships: L1 → L2, L1 → L4, L1 → L6, L1 → L8, L3 → L5, L3 → L7
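
The six relationships listed on this slide are consistent with the following reading: a structure relationship runs from hyperlink A to hyperlink B when the page that A points to contains B. The sketch below derives the relationships under that assumed definition. The assignment of hyperlinks to "Web page 1/2/3" and the names `page_links` and `link_target` are illustrative guesses, since the figure itself is not in the transcript.

```python
def structure_relationships(page_links, link_target):
    """List (source, destination) hyperlink pairs.

    page_links:  page -> hyperlinks that appear on that page
    link_target: hyperlink -> page it points to
    A pair (a, b) is emitted when the page targeted by a contains b.
    """
    relationships = []
    for links in page_links.values():
        for src in links:
            target_page = link_target.get(src)
            for dst in page_links.get(target_page, ()):
                if dst != src:
                    relationships.append((src, dst))
    return relationships


# Assumed layout matching the relationships on the slide.
page_links = {
    "Web page 1": ["L1", "L3"],
    "Web page 2": ["L2", "L4", "L6", "L8"],
    "Web page 3": ["L5", "L7"],
}
link_target = {"L1": "Web page 2", "L3": "Web page 3"}

for src, dst in structure_relationships(page_links, link_target):
    print(src, "->", dst)
```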

30 LinkSelector Access Relationship A k-HS denotes a hyperlink set with k hyperlinks, e.g., {L1,L2} is a 2-HS. The support of a k-HS is the percentage of sessions in which the web pages pointed to by the hyperlinks in the k-HS are accessed together. E.g., if the web pages pointed to by L1 and L2 are accessed together in 20 sessions out of a total of 100 sessions, then the support of the 2-HS {L1,L2} is 20%.

31 LinkSelector Access Relationship Definition : For a k-HS, where, there exists an access relationship among elements in the k-HS if and only if its support is greater than a pre-defined threshold.
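
The support and access-relationship definitions on the two slides above can be sketched in a few lines. The function names `support` and `has_access_relationship` and the threshold value are assumptions for illustration; sessions are represented as collections of the hyperlink labels whose pages were visited.

```python
def support(hyperlink_set, sessions):
    """Fraction of sessions in which all pages of the k-HS are accessed."""
    wanted = set(hyperlink_set)
    hits = sum(1 for session in sessions if wanted <= set(session))
    return hits / len(sessions)


def has_access_relationship(hyperlink_set, sessions, threshold):
    """True if the support is strictly greater than the threshold."""
    return support(hyperlink_set, sessions) > threshold


# The slide's example: L1 and L2 accessed together in 20 of 100 sessions.
sessions = [["L1", "L2"]] * 20 + [["L1"]] * 80

print(support({"L1", "L2"}, sessions))                        # 0.2
print(has_access_relationship({"L1", "L2"}, sessions, 0.05))  # True
```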

32 LinkSelector

33 LinkSelector Group-I relationship L2 L4 L6 L8 L1 L3 Web page 1 Web page 2 L5 L7 Web page 3 Structure relationships: L1 → L2, L1 → L4, L1 → L6, L1 → L8, L3 → L5, L3 → L7

34 LinkSelector Group-I relationship L2 L4 L6 L8 L1 L3 Web page 1 Web page 2 L5 L7 Web page 3 Structure relationships: L1 → L2, L1 → L4, L1 → L6, L1 → L8, L3 → L5, L3 → L7 Access relationships: {L1,L2}, 0.1; {L1,L4}, 0.1; {L1,L6}, 0.05; {L1,L8}, 0.05; {L3,L5}, 0.4; {L3,L7}, 0.5

35 LinkSelector Group-I relationship provides indicators of preference for individual hyperlinks in hyperlink selection: the number of structure relationships a hyperlink participates in as an initial hyperlink, and the quality of these structure relationships

36 LinkSelector Group-II relationship no structure relationship between L9 and L12 an access relationship between L9 and L12

37 LinkSelector Group-II relationship provides indicators of hyperlink pair preference in hyperlink selection: hyperlink pairs with Group-II relationships are preferred to hyperlink pairs without Group-II relationships within hyperlink pairs with Group-II relationships, hyperlink pairs with higher support of access relationship are preferred to those with lower support of access relationships

38 LinkSelector Group-III relationship reveals patterns that are not relevant to hyperlink selection Group-IV relationship does not reveal interesting patterns. L1 Web page 1 L5 Web page 2
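
The four groups can be viewed as the four combinations of "has a structure relationship" and "has an access relationship". The slides state this explicitly only for Group-I (both) and Group-II (access without structure); mapping Group-III to structure without access and Group-IV to neither is an assumption in the sketch below, as are the function name `classify_pair`, the threshold, and the support value for {L9, L12}.

```python
def classify_pair(pair, structure_pairs, pair_support, min_support):
    """Return the relationship group ('I'..'IV') of a hyperlink pair."""
    a, b = pair
    has_structure = (a, b) in structure_pairs or (b, a) in structure_pairs
    has_access = pair_support.get(frozenset(pair), 0.0) > min_support
    if has_structure and has_access:
        return "I"
    if has_access:
        return "II"
    if has_structure:
        return "III"  # assumed: structure relationship only
    return "IV"       # assumed: no relationship of either kind


structure_pairs = {("L1", "L2"), ("L1", "L6"), ("L3", "L7")}
pair_support = {
    frozenset(("L1", "L2")): 0.1,   # value from the Group-I slide
    frozenset(("L3", "L7")): 0.5,   # value from the Group-I slide
    frozenset(("L9", "L12")): 0.2,  # hypothetical value
    frozenset(("L1", "L6")): 0.01,  # hypothetical, below the threshold
}
min_support = 0.02  # hypothetical threshold

print(classify_pair(("L1", "L2"), structure_pairs, pair_support, min_support))   # I
print(classify_pair(("L9", "L12"), structure_pairs, pair_support, min_support))  # II
```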

39 LinkSelector The Sketch of LinkSelector

40 LinkSelector Discover Structure Relationships

41 LinkSelector Access relationship can be discovered from a web log using association rule mining Data Preprocessing Association rule mining

42 LinkSelector Data Preprocessing Web log cleaning Error logs (e.g., status code 404) Accessory logs (e.g., .gif) Session identification Modify web server 30-min time interval
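
The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout `(client, timestamp_seconds, url, status)` is hypothetical, treating every status of 400 or above as an error is an assumption (the slide names only 404), the extra accessory suffixes beyond `.gif` are assumptions, and the 30-minute interval is interpreted here as a maximum gap between consecutive requests from the same client.

```python
ACCESSORY_SUFFIXES = (".gif", ".jpg", ".css")  # only .gif is named on the slide
SESSION_GAP_SECONDS = 30 * 60


def clean(records):
    """Drop error records and requests for accessory files."""
    return [
        r for r in records
        if r[3] < 400 and not r[2].lower().endswith(ACCESSORY_SUFFIXES)
    ]


def sessionize(records, gap=SESSION_GAP_SECONDS):
    """Group requests per client; a gap longer than `gap` starts a new session."""
    sessions = []
    last_seen = {}  # client -> (last timestamp, that client's open session)
    for client, ts, url, _status in sorted(records, key=lambda r: r[1]):
        previous = last_seen.get(client)
        if previous is None or ts - previous[0] > gap:
            current = []
            sessions.append(current)
        else:
            current = previous[1]
        current.append(url)
        last_seen[client] = (ts, current)
    return sessions


# Hypothetical log records.
records = [
    ("10.0.0.1", 0, "/index.html", 200),
    ("10.0.0.1", 5, "/logo.gif", 200),
    ("10.0.0.1", 60, "/missing.html", 404),
    ("10.0.0.1", 120, "/admissions.html", 200),
    ("10.0.0.2", 30, "/index.html", 200),
    ("10.0.0.1", 4000, "/library.html", 200),
]

for session in sessionize(clean(records)):
    print(session)
```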

43 LinkSelector Association rule mining Given: a transaction database tid item 001 1,2, ,2, ,3, ,5,6 An itemset is a set of items, e.g., {1,2}. The support of an itemset is the percentage of transactions that contain (e.g., purchase) the itemset. Objective: discover all itemsets with supports larger than a user-defined threshold. Apriori Algorithm (Agrawal and Srikant, 1994)
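
For readers unfamiliar with Apriori, the level-wise idea (join frequent (k-1)-itemsets into k-item candidates, prune candidates that have an infrequent subset, then count support) can be sketched as below. This is a compact teaching version, not an optimized one. The transaction table on the slide is garbled in the transcript, so the four transactions used here are hypothetical, as is the threshold of 0.4.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for itemsets with support > min_support."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / total

    # Level 1: frequent single items.
    current = {}
    for item in {i for t in transactions for i in t}:
        single = frozenset([item])
        s = support(single)
        if s > min_support:
            current[single] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        next_level = {}
        for c in candidates:
            s = support(c)
            if s > min_support:
                next_level[c] = s
        current = next_level
    return frequent


# Hypothetical transactions (not the table from the slide).
transactions = [{1, 2, 3}, {1, 2, 4}, {2, 3, 4}, {4, 5, 6}]
result = apriori(transactions, min_support=0.4)

for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

In the hyperlink setting, each session plays the role of a transaction and each hyperlink in the pool plays the role of an item, so the frequent itemsets correspond to hyperlink sets with access relationships.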

44 LinkSelector Calculate Preferences for Hyperlinks

45 LinkSelector Calculate Preferences for Hyperlink Sets No structure relationships between and and and. Preference for hyperlink set is 0.022

46 LinkSelector Clustering is a data mining technique that segments objects into groups based on their similarities

47 LinkSelector Hyperlink Clustering: Hyperlinks → Objects; Preferences for hyperlinks → Weights of objects; Preferences for hyperlink sets → Similarities among hyperlinks

48 LinkSelector Limitations of classical clustering algorithms Weights of objects are not considered. Only similarities between two objects are considered.

49 LinkSelector Our solution Indexes of the proposed similarity matrix are clusters while indexes of the traditional similarity matrix are objects to be clustered. Similarities involving two and more objects are considered in the proposed similarity matrix. Weights of objects are considered

50 LinkSelector

51 Experiment Data collected from UA website Hyperlink pool: 110 links Web log: collected in Sep M records → 4.2 M records total 344 K sessions 262 K sessions → Training data (23 days) 82 K sessions → Testing data (7 days)

52 Experiment Hyperlinks selected by LinkSelector, Domain experts and access frequency (N=6)

53 Experiment Average improvement: 12.7% Improvement decreases from 22.1% to 8.4% Average number of sessions per day: 11.5k

54 Experiment Average improvement: 17.0% Improvement decreases from 30.2% to 9.4% Absolute number of hyperlinks improved: to 6509

55 Experiment Average improvement: 16.9% Improvement decreases from 30.2% to 9.3%

56 Experiment Hyperlinks selected by LinkSelector, Classical Hierarchical Clustering and Association rule mining (N=6)

57 Experiment Average improvement compared with association rule mining: 25.8% Average improvement compared with classical clustering: 102.0%

58 Experiment Average improvement compared with association rule mining: 31.7% Average improvement compared with classical clustering: 124.0%

59 Experiment Average improvement compared with association rule mining: 31.6% Average improvement compared with classical clustering: 123.0%

60 Contributions 1. We proposed and formally defined a new and important research problem: hyperlink selection. 2. We proposed a web mining-based hyperlink selection approach and showed that it outperforms other hyperlink selection approaches. 3. We developed a new clustering algorithm for hyperlink selection.

61 Limitations and Future work User Study Adaptive LinkSelector Structure of website changes Web visiting patterns change