Discovery of Significant Usage Patterns from Clickstream Data

Presentation transcript:

Discovery of Significant Usage Patterns from Clickstream Data
Margaret H. Dunham, Lin Lu
CSE Department, Southern Methodist University, Dallas, Texas 75275
mhd@engr.smu.edu
This material is based upon work supported by the National Science Foundation under Grant No. IIS-0208741.
05/04/05, Travelocity

OUTLINE
- Web Usage Mining Overview
- Our Work: Significant Usage Patterns
- Ongoing/Future Research

Web Usage Mining Applications
- Personalization
- Improve structure of a site’s Web pages
- Aid in caching and prediction of future page references
- Improve design of individual pages
- Improve effectiveness of e-commerce (sales and advertising)

Web Usage Mining Activities
- Preprocessing the Web log
  - Cleanse: remove extraneous information
  - Sessionize. Session: the sequence of pages referenced by one user at a sitting.
- Pattern Discovery
  - Count patterns that occur in sessions
  - A pattern is a sequence of pages referenced in a session.
- Pattern Analysis
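The sessionize step can be illustrated with a minimal sketch. The slide defines a session only as the pages one user references "at a sitting"; the inactivity timeout used here (30 minutes) and the record layout `(user_id, timestamp_seconds, page)` are assumptions for illustration, not details from the presentation.

```python
from collections import defaultdict


def sessionize(records, timeout_seconds=1800):
    """Group (user_id, timestamp_seconds, page) records into sessions.

    Assumption: a gap longer than timeout_seconds between two hits of the
    same user starts a new session. The timeout value is hypothetical.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, page in records:
        by_user[user_id].append((timestamp, page))

    sessions = []
    for user_id in sorted(by_user):
        current = []
        last_time = None
        # Process each user's hits in time order.
        for timestamp, page in sorted(by_user[user_id]):
            if last_time is not None and timestamp - last_time > timeout_seconds:
                sessions.append(current)
                current = []
            current.append(page)
            last_time = timestamp
        if current:
            sessions.append(current)
    return sessions


if __name__ == "__main__":
    log = [("u1", 0, "A"), ("u1", 60, "B"), ("u1", 4000, "C"), ("u2", 10, "A")]
    print(sessionize(log))
```

Because exact user identification is not possible from a Web log, `user_id` here stands in for whatever proxy (for example an IP address or cookie) the preprocessing step uses.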

Pattern Types
- Association Rules: none of the properties hold
- Episodes: only ordering holds
- Sequential Patterns: ordered and maximal
- Forward Sequences: ordered, consecutive, and maximal
- Maximal Frequent Sequences: all properties hold
- User Preferred Navigation Trail: not a true pattern, but representative of many

Web Usage Mining Issues
- The exact user cannot be identified.
- The exact sequence of pages referenced by a user cannot be determined because of caching.
- A session is not well defined.
- Security, privacy, and legal issues

CAN’T SEE THE FOREST FOR THE TREES
2003-10-0515:49:20050721435700000026210000000000               0265202652 000000000
2003-10-0516:40:49050832595900000872710001142380               0710707107 000000000
2003-10-0504:55:10050767799900000191300000670518               0000000000 000000000
2003-10-0509:43:10050781766100000603030000000000               0365700469 000000000
2003-10-0514:49:360508182420000007066200000000000811a39        0914207107 000000000
2003-10-0521:23:57050759031600000465050002794335               1199207107 000000000
2003-10-0511:30:16050730512600000465050000195747               1684600597corduroy+coats

The BIG PICTURE
S-P1-P2-P3-P4-P5-P6-C1-C2-E
S-P1-P2-P3-P4-P5-C4-I6-I7-I8-E

Solution: SIGNIFICANT USAGE PATTERNS
- Clustering
- Abstraction
- User Preferred Navigation Trails

[Figure: processing pipeline. Labels: Interests…; Motivations…; Web Server; Web Log; Preprocess Web Data (Cleanse, Sessionize, …); URL Abstraction; Cluster Web Sessions; Markov Model; Markov Model per Cluster; User defined beginning/ending Web pages; Normalized Probability; User Preferred Navigation Trail; Significant Usage Pattern]

Significant Usage Pattern (SUP)
A SUP is a path extracted from a Markov model with user-defined starting and ending states, whose normalized product of probabilities along the path satisfies a given threshold.
Differences from previous research:
- SUPs are extracted from clusters of user sessions
- User sessions are abstracted sessions
- Paths start and end with specific Web pages of user interest
A SUP need not be an exact pattern found in any session; rather, it is representative of the patterns found.
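The definition can be made concrete with a small sketch. The slide does not say how the product of probabilities is normalized, so this example assumes the geometric mean (the n-th root of the product of n transition probabilities), which keeps longer paths comparable with shorter ones. The function names, the `S`/`E` marker states, and the `max_len` guard are likewise hypothetical choices, not the authors' implementation.

```python
from collections import defaultdict


def build_transition_matrix(sessions):
    """Estimate first-order transition probabilities from a list of sessions.

    Each session is wrapped in artificial start ("S") and end ("E") states.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        path = ["S"] + list(session) + ["E"]
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    matrix = {}
    for src, row in counts.items():
        total = sum(row.values())
        matrix[src] = {dst: n / total for dst, n in row.items()}
    return matrix


def normalized_probability(probs):
    """Assumed normalization: geometric mean of the transition probabilities."""
    product = 1.0
    for p in probs:
        product *= p
    return product ** (1.0 / len(probs))


def find_sups(matrix, start, end, threshold, max_len=10):
    """Enumerate cycle-free paths from start to end whose score exceeds threshold."""
    results = []

    def walk(state, path, probs):
        if state == end and probs:
            score = normalized_probability(probs)
            if score > threshold:
                results.append((path, score))
            return
        if len(path) >= max_len:
            return
        for nxt, p in matrix.get(state, {}).items():
            if nxt not in path:  # skip states already on the path
                walk(nxt, path + [nxt], probs + [p])

    walk(start, [start], [])
    return sorted(results, key=lambda item: -item[1])


if __name__ == "__main__":
    sessions = [["A", "B", "D"], ["A", "C", "B", "D"], ["A", "B", "D"]]
    matrix = build_transition_matrix(sessions)
    for path, score in find_sups(matrix, "S", "E", threshold=0.5):
        print("-".join(path), round(score, 3))
```

In the presentation one such model is built per cluster of abstracted sessions, so `build_transition_matrix` would be called once for each cluster rather than on the whole log.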

Model
[Figure: processing model. Labels: Sessionized Web Log; Abstraction Hierarchy; Sub-Abstracted Sessions; Clusters of User Sessions; Similarity Matrix; Concept-based Abstracted Sessions per Cluster; Apply Needleman-Wunsch global alignment algorithm; Apply nearest neighbor clustering algorithm; Concept-based Abstracted URLs; Transition Matrix per Cluster; Sub-abstract URLs; Patterns per Cluster; Pattern Discovery; Build Markov model for each cluster]
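The clustering step named in the model can be sketched as follows. This is one common single-pass form of nearest neighbor clustering, driven by a precomputed session similarity matrix; the threshold value and the function name are assumptions, and the presentation does not state which variant or threshold was used.

```python
def nearest_neighbor_clustering(similarity, threshold):
    """Assign each session to the cluster of its most similar earlier session.

    similarity: square matrix (list of lists), similarity[i][j] in [0, 1].
    threshold: hypothetical minimum similarity required to join a cluster;
               otherwise the session starts a new cluster.
    Returns a list of cluster labels, one per session.
    """
    labels = []
    next_label = 0
    for i in range(len(similarity)):
        best_j, best_sim = None, -1.0
        # Look only at sessions that have already been clustered.
        for j in range(i):
            if similarity[i][j] > best_sim:
                best_j, best_sim = j, similarity[i][j]
        if best_j is not None and best_sim >= threshold:
            labels.append(labels[best_j])
        else:
            labels.append(next_label)
            next_label += 1
    return labels


if __name__ == "__main__":
    sim = [
        [1.0, 0.9, 0.1],
        [0.9, 1.0, 0.2],
        [0.1, 0.2, 1.0],
    ]
    print(nearest_neighbor_clustering(sim, threshold=0.5))
```

The result depends on the order in which sessions are processed, which is a known property of this single-pass scheme.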

Abstract Web session data
[Figure 2: Hierarchy of the JCPenney Web site. Root: JCPenney Homepage; Department level: D1, D2, …, Dn; Category level: C1, …, Cn; Item level: I1, …, In]
Web session example: D0|C875|I D0|C875|I P27593 P27592 P28 -507169015

Alignment of Web Sessions
Compute the similarity between any two Web pages. The higher a level is in the hierarchy, the more important it is in determining the similarity of two Web pages, so it is given more weight.
- step 1: compare the two Web page representation strings from left to right and stop at the first pair of parts where they differ.
- step 2: compute the ratio of the sum of the weights of the matching parts to the sum of the total weights.
Example
Page 1: D0|C875|I weight=6+1+4+1+2=14
Page 2: D0|C875 weight=6+1+4+1=12
Similarity=12/14=0.857
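The two steps and the worked example can be reproduced with a short sketch. The weights 6, 1, 4, 1, 2 come from the example; reading them as (level letter, identifier) weights per hierarchy level, using weight 1 for an item identifier, and dividing by the larger of the two total weights are assumptions. The alignment function is a plain Needleman-Wunsch global alignment score; mapping page similarity from [0, 1] to a match score in [-1, 1] and the gap penalty of -1 are hypothetical scoring choices.

```python
# Assumed (letter weight, identifier weight) for department, category, item.
LEVEL_WEIGHTS = [(6, 1), (4, 1), (2, 1)]


def weighted_parts(page):
    """Split e.g. 'D0|C875|I' into [('D', 6), ('0', 1), ('C', 4), ('875', 1), ('I', 2)]."""
    parts = []
    for level, field in enumerate(page.split("|")):
        letter_weight, id_weight = LEVEL_WEIGHTS[level]
        parts.append((field[:1], letter_weight))
        if len(field) > 1:
            parts.append((field[1:], id_weight))
    return parts


def page_similarity(page_a, page_b):
    """Weight of the matching left-to-right prefix over the larger total weight."""
    parts_a, parts_b = weighted_parts(page_a), weighted_parts(page_b)
    matched = 0
    for (text_a, weight), (text_b, _) in zip(parts_a, parts_b):
        if text_a != text_b:
            break  # step 1: stop at the first differing pair
        matched += weight
    total = max(sum(w for _, w in parts_a), sum(w for _, w in parts_b))
    return matched / total  # step 2: ratio of matched weight to total weight


def alignment_score(session_a, session_b, gap=-1.0):
    """Needleman-Wunsch global alignment score between two sessions."""
    rows, cols = len(session_a) + 1, len(session_b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = 2.0 * page_similarity(session_a[i - 1], session_b[j - 1]) - 1.0
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two pages
                score[i - 1][j] + gap,       # gap in session_b
                score[i][j - 1] + gap,       # gap in session_a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(round(page_similarity("D0|C875|I", "D0|C875"), 3))
```

Alignment scores computed this way for every pair of sessions would populate the similarity matrix that the clustering step consumes, possibly after rescaling.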

Generating Significant Usage Patterns
[Figure: example Markov model with states 1, 2, 3, 4, 5 and end state E; transition probabilities shown include 0.4, 0.2, 0.5, and 0.6]

Examples
- Threshold > 0.4, end state is 4
- Threshold > 0.4, beginning state is 1, end state is 4
SUP: S1234 0.45 1234 0.46 S12354 0.53 12354 0.56 S124 124 0.5 S134 0.43 134 S1354 1354 0.58 S2354 S354

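Experimental Result
Cluster | Cluster No. | No. of Sessions | Average Session Length | No. of States | Threshold | Beginning Web page | SUPs in BNF Notation
Non-Purchase | 1 | 1746 | 9.6 | 98 | 0.3 | S | S-{C}-E
Non-Purchase | 1 | 1746 | 9.6 | 98 | 0.25 | P86806 | P86806-{C}-E
Non-Purchase | 2 | 241 | 6.6 | 38 | 0.37 | S | S-{P}-[C]-E
Non-Purchase | 2 | 241 | 6.6 | 38 | 0.34 | P86806 | P86806-[I]-{P}-E
Non-Purchase | 3 | 13 | 3.0 | 6 | | S | S-<C | I>-{P}-E
Non-Purchase | 3 | 13 | 3.0 | 6 | 0.2 | P86806 | P86806-[{P}-[P86806]]-E
Purchase | | 1858 | 14.9 | 55 | 0.47 | S | S-[C]-[I]-{P}-E
Purchase | | 1858 | 14.9 | 55 | 0.51 | P86806 |
Purchase | | 132 | 39.1 | 100 | 0.457 | S | S-[{{C}|{I}}]-{P}-E
Purchase | | 132 | 39.1 | 100 | 0.434 | P86806 | P86806-[{C}]-{P}-E
Purchase | | 10 | 31.6 | 47 | 0.52 | S | S-{P}-[{I}]-[{P}]-{C}-E
Purchase | | 10 | 31.6 | 47 | 0.43 | P86806 | P86806-[I]-[{P}]-{C}-E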

Future/Ongoing Research
- Scalability
  - Fewer patterns
  - Smaller patterns
  - MM less space than table
- Clusters to identify behaviors
  - Business vs Leisure
  - Cloaked Crawler
- Online Identification of Cluster