Lin Lu, Margaret Dunham, and Yu Meng


Presentation transcript:

Discovery of Significant Usage Patterns from Clusters of Clickstream Data
Lin Lu, Margaret Dunham, and Yu Meng
Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas 75275-0122
llu(mhd,ymeng)@engr.smu.edu
WebKDD’05

Introduction
Significant Usage Patterns (SUP)
 - SUPs are extracted from clusters of abstracted user sessions
 - Use a unique two-phase abstraction technique
 - With desired beginning and/or ending Web pages
 - With normalized probability
Comparison table (cell values are not recoverable from the transcript; surviving fragments in order: N, Y*, -, Y)
 - Columns: Clustering, Abstraction, Beginning/ending Web page(s), Normalized
 - Rows: Sequential Pattern, Maximal Frequent Sequence, Maximal Frequent Forward Sequence, User Preferred Navigational Trail [1,2], Significant Usage Pattern

Model
Sessionized Web Log + Abstraction Hierarchy -> Sub-abstract URLs -> Sub-Abstracted Sessions -> Apply Needleman-Wunsch global alignment algorithm -> Similarity Matrix -> Apply nearest neighbor clustering algorithm -> Clusters of User Sessions
Clusters of User Sessions + Abstraction Hierarchy -> Concept-based Abstracted URLs -> Concept-based Abstracted Sessions per Cluster -> Build Markov model for each cluster -> Transition Matrix per Cluster -> Pattern Discovery -> SUPs per Cluster
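The clustering step of the pipeline can be sketched in Python as below. This is a minimal illustration of a single-pass nearest neighbor clustering over a similarity matrix, not the authors' implementation: the function name `nearest_neighbor_clusters`, the `threshold` parameter, and the toy matrix are all assumptions made for the example, since the slides do not state how the threshold was chosen.

```python
from typing import Dict, List, Sequence


def nearest_neighbor_clusters(
    sim: Sequence[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Assign each session to the cluster of its most similar,
    already-placed session, or open a new cluster when that
    similarity falls below `threshold` (hypothetical parameter)."""
    clusters: List[List[int]] = []
    label: Dict[int, int] = {}  # session index -> cluster index
    for i in range(len(sim)):
        best_j, best_s = None, float("-inf")
        for j in label:  # only sessions placed so far
            if sim[i][j] > best_s:
                best_j, best_s = j, sim[i][j]
        if best_j is not None and best_s >= threshold:
            k = label[best_j]
            clusters[k].append(i)
        else:
            k = len(clusters)
            clusters.append([i])
        label[i] = k
    return clusters


# Toy similarity matrix (made-up values, symmetric).
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
clusters = nearest_neighbor_clusters(sim, threshold=0.5)
print(clusters)
```

The result depends on the order in which sessions are visited, which is a known property of this kind of single-pass scheme.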

Alignment of Web sessions
Create sub-abstracted Web sessions
URL -> {<Concept hierarchy keyword> <Unique ID> <|>}
Fig 1. Hierarchy of J.C. Penney Web site: JCPenney Homepage -> Department level (D1, D2, …, Dn) -> Category level (C1, …, Cn) -> Item level (I1, …, In)
Example: D0|C875|I D0|C875|I P27593 P27592 P28 -507169015

Alignment of Web sessions
Computing the similarity between any two Web pages
The higher a level is in the hierarchy, the more it matters in determining the similarity of two Web pages, so it is given more weight.
Scoring scheme
 - step 1: determine the longer of the two Web page representation strings.
 - step 2: assign a weight to each level in the hierarchy: the lowest level in the longer page representation string gets weight 2, the second-lowest level gets weight 4, and so on. The corresponding ID always gets weight 1.

Alignment of Web sessions
Computing the similarity between any two Web pages (continued)
 - step 3: compare the two Web page representation strings from left to right and stop at the first pair of parts that differ.
 - step 4: compute the ratio of the sum of the weights of the matching parts to the total weight of the longer page representation string.
Example:
Page 1: D0|C875|I Weight = 6+1+4+1+2 = 14
Page 2: D0|C875 Weight = 6+1+4+1 = 12
Similarity = 12/14 = 0.857
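A minimal Python sketch of this page similarity follows. It is one reading of the slide, not the authors' code: the helper names `weighted_parts` and `page_similarity` are hypothetical, "longer" is taken to mean the representation with more parts (keywords plus IDs), and a level with no ID (such as the bare `I` in the example) is assumed to contribute only its level weight.

```python
from typing import List, Tuple


def weighted_parts(page: str) -> List[Tuple[str, int]]:
    """Split e.g. 'D0|C875|I' into (part, weight) pairs:
    level keywords get 2, 4, 6, ... counted from the lowest level up,
    and each ID that is present gets weight 1."""
    levels = [lv for lv in page.split("|") if lv]
    n = len(levels)
    parts: List[Tuple[str, int]] = []
    for k, lv in enumerate(levels):
        parts.append((lv[0], 2 * (n - k)))  # concept hierarchy keyword
        if len(lv) > 1:
            parts.append((lv[1:], 1))       # unique ID, when present
    return parts


def page_similarity(a: str, b: str) -> float:
    """Weight of the matching left-to-right prefix divided by the
    total weight of the longer representation."""
    pa, pb = weighted_parts(a), weighted_parts(b)
    longer, shorter = (pa, pb) if len(pa) >= len(pb) else (pb, pa)
    total = sum(w for _, w in longer)
    if total == 0:
        return 0.0
    matched = 0
    for (x, w), (y, _) in zip(longer, shorter):
        if x != y:
            break  # stop at the first differing part
        matched += w
    return matched / total


similarity = page_similarity("D0|C875|I", "D0|C875")
print(round(similarity, 3))  # 12/14, as in the slide's example
```

Pages outside the department hierarchy, such as `P27593`, share no leading part with a `D…` page and so score 0 under this reading.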


Alignment of Web sessions
Computing the optimal alignment of two sequences X = X1 … Xm and Y = Y1 … Yn using the Needleman-Wunsch algorithm
Boundary cells: $A(0, 0) = 0$, $A(i, 0) = -i\,d$, $A(0, j) = -j\,d$
Recurrence: $A(i, j) = \max\{A(i-1, j-1) + s(X_i, Y_j),\; A(i-1, j) - d,\; A(i, j-1) - d\}$
The alignment score is $A(m, n)$, where s(Xi, Yj) is the similarity score between Xi and Yj, and d is the penalty for aligning Xi (or Yj) with a gap.

Alignment of Web sessions
Apply Needleman-Wunsch global alignment algorithm
Scoring scheme [3]
if (matching) score = 20; //a pair of Web pages with similarity 1
else if (mis-matching) score = –10; //a pair of Web pages with similarity 0
else if (gap) score = –10; //a Web page aligns with a gap
else score = –10 ~ 20; //a pair of Web pages with similarity between 0 and 1
Example:
Session 1: P47104 D0|C0|I D469|C469 D2652|C2652
Session 2: D469|C16758|I D0|C0|I D469|C469
Alignment matrix as it survives in the transcript (cell positions not recoverable): P47104 D0|C0|I D469|C469 D2652|C2652 -10 -20 -30 -40 D469|C16758|I 5.7 -4.3 -14.3 10 17.1 7.1 30 32.1
Thus, session similarity = 32.1/4 = 8.025
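The recurrence and scoring scheme can be sketched in Python as follows. Several points are assumptions rather than statements from the slides: the score for a page pair with similarity between 0 and 1 is linearly interpolated between –10 and 20, the final score is divided by the length of the longer session (one reading of "32.1/4"), and the names `pair_score`, `align_score`, `session_similarity`, and `exact` are hypothetical. The page similarity is passed in as a function so the sketch stays self-contained.

```python
from typing import Callable, Sequence

MATCH, MISMATCH, GAP = 20.0, -10.0, -10.0  # values from the scoring scheme


def pair_score(similarity: float) -> float:
    """Map a page similarity in [0, 1] to a score in [-10, 20].
    Linear interpolation is an assumption; the slide gives only the range."""
    return MISMATCH + (MATCH - MISMATCH) * similarity


def align_score(
    x: Sequence[str], y: Sequence[str], sim: Callable[[str, str], float]
) -> float:
    """Needleman-Wunsch global alignment score A(m, n)."""
    m, n = len(x), len(y)
    A = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = i * GAP  # leading gaps in y
    for j in range(1, n + 1):
        A[0][j] = j * GAP  # leading gaps in x
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = max(
                A[i - 1][j - 1] + pair_score(sim(x[i - 1], y[j - 1])),
                A[i - 1][j] + GAP,
                A[i][j - 1] + GAP,
            )
    return A[m][n]


def session_similarity(
    x: Sequence[str], y: Sequence[str], sim: Callable[[str, str], float]
) -> float:
    """Alignment score divided by the longer session's length (assumed)."""
    longest = max(len(x), len(y))
    return align_score(x, y, sim) / longest if longest else 0.0


def exact(a: str, b: str) -> float:
    """Toy page similarity: 1 for identical pages, 0 otherwise."""
    return 1.0 if a == b else 0.0


score = align_score(["A", "B", "C"], ["A", "C"], exact)
print(score, session_similarity(["A", "B", "C"], ["A", "C"], exact))
```

In the toy run, A and C align with their counterparts (20 each) and B aligns with a gap (–10), giving a score of 30 and a session similarity of 10. Swapping `exact` for a hierarchy-aware page similarity yields the fractional scores the scheme allows.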


Create Concept-based Abstracted Sessions
Represent the abstracted page accesses in a session as a sequence such as: P1 D1 C1 I1 P2 D2 C2 I2 …
Within a session, the same Pi, Di, Ci, or Ii (i = 1, 2, …) always represents the same page. Across sessions, however, the same page may be represented by different elements.
Example:
Original session: D7107|C7121 D7107|C7126|I076bdf3 D7107|C7131|I084fc96 D7107|C7131 P55730 P96 P27 P14 P27592 P28 P33711 -505884861
Abstracted session: C1 I1 I2 C2 P1 P2 P3 P4 P5 P6 P7 -505884861
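A short Python sketch of this abstraction, inferred from the example rather than taken from the authors' code: each page is reduced to the keyword of its lowest level (D, C, I, or P) and numbered by order of first appearance within the session. The function name `abstract_session` is hypothetical, and the trailing number in the example is left out on the assumption that it is a session identifier.

```python
from typing import Dict, List


def abstract_session(pages: List[str]) -> List[str]:
    """Replace each page with <lowest-level keyword><per-session index>."""
    counters: Dict[str, int] = {}  # keyword -> pages of that kind seen so far
    seen: Dict[str, str] = {}      # original page -> abstracted symbol
    abstracted: List[str] = []
    for page in pages:
        if page not in seen:
            kind = page.split("|")[-1][0]  # e.g. 'I' for 'D7107|C7126|I076bdf3'
            counters[kind] = counters.get(kind, 0) + 1
            seen[page] = f"{kind}{counters[kind]}"
        abstracted.append(seen[page])
    return abstracted


original = [
    "D7107|C7121", "D7107|C7126|I076bdf3", "D7107|C7131|I084fc96",
    "D7107|C7131", "P55730", "P96", "P27", "P14", "P27592", "P28", "P33711",
]
result = abstract_session(original)
print(" ".join(result))
```

Because the numbering restarts in every session, a repeated visit to a page maps to the same symbol within that session, while the same page can receive a different symbol in another session.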

Generating Significant Usage Patterns
Use a Markov model to represent the sessions in each cluster
Example sessions:
(1) 1, 2, 3, 5, 4
(2) 2, 4, 3, 5
(3) 3, 2, 4, 5
(4) 1, 3, 4, 3
(5) 4, 2, 3, 4, 5
Markov model figure: start state S, end state E, and states 1 to 5; edge labels surviving in the transcript: 0.4 0.17 0.2 0.5 0.33 0.25 0.75
The probability of a path is normalized (the formula is not preserved in the transcript), where Pti is the transition probability between two adjacent states.
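The transition probabilities of such a model can be estimated from the five example sessions with a few lines of Python. This sketch assumes a first-order Markov chain with added start (S) and end (E) states and maximum-likelihood estimates from transition counts; the function names are hypothetical. It computes only the raw product of transition probabilities along a path and does not attempt the normalization used in the slides.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence

State = Hashable


def transition_probabilities(
    sessions: Sequence[Sequence[State]],
) -> Dict[State, Dict[State, float]]:
    """Estimate P(next | current) from counts, adding S and E states."""
    counts: Dict[State, Counter] = defaultdict(Counter)
    for session in sessions:
        path: List[State] = ["S", *session, "E"]
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for current, nxts in counts.items()
    }


def path_probability(
    P: Dict[State, Dict[State, float]], path: Sequence[State]
) -> float:
    """Raw (unnormalized) product of transition probabilities along a path."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= P.get(current, {}).get(nxt, 0.0)
    return prob


sessions = [
    [1, 2, 3, 5, 4],
    [2, 4, 3, 5],
    [3, 2, 4, 5],
    [1, 3, 4, 3],
    [4, 2, 3, 4, 5],
]
P = transition_probabilities(sessions)
print(round(P["S"][1], 2), round(P[3][2], 2), round(P[5]["E"], 2))
```

The estimates reproduce several of the edge labels that survive from the figure, for example 0.4 for S to 1, 0.17 for 3 to 2, and 0.75 for 5 to E.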

Generating Significant Usage Patterns
Example:
 - Threshold > 0.4, end state is 4
 - Threshold > 0.4, beginning state is 1, end state is 4
SUP listing as it survives in the transcript (the pairing of patterns with probabilities is only partly recoverable): S1234 0.45 1234 0.46 S12354 0.53 12354 0.56 S124 124 0.5 S134 0.43 134 S1354 1354 0.58 S2354 S354

Experimental Result
Chart (not preserved); surviving legend entry: sessions without purchase
On average, purchase sessions are longer than sessions without a purchase, because purchasers:
 - review the information and compare price, quality, etc.
 - fill out the billing and shipping information to commit the purchase

Experimental Result
SUPs in non-purchase clusters
Cluster 1: 1746 sessions; threshold 0.3; average session length 9.6; 98 states
 1. S-C1-C1-C2-C3-C4-C5-C6-C7-E
 2. S-C1-C1-C2-C3-C4-C5-E
 3. S-C1-C1-C2-C3-E
 4. S-C1-C2-C3-C3-C4-C5-C6-C7-E
 5. S-C1-C2-C3-C4-C4-C5-C6-C7-E
 …
Cluster 2: 241 sessions; threshold 0.37; average session length 6.6; 38 states
 1. S-P1-P2-P3-P3-E
 2. S-P1-P2-P3-P4-P4-P5-E
 3. S-P1-P2-P3-P4-P4-E
 4. S-P1-P2-P3-P4-P5-P4-E
 5. S-P1-P2-P3-P4-P5-P5-E
Cluster 3: 13 sessions; average session length 3.0; 6 states (threshold not preserved in the transcript)
 1. S-C1-P1-P2-E
 2. S-C1-P1-E
 3. S-I1-P1-P1-P2-E
 4. S-I1-P1-P1-E
 5. S-I1-P1-E
Observations:
 - Interested in gathering information on products in different categories. Further SUPs shown: S-C1-C1-C2-C3-C4-C5-C5-I1-E, S-C1-C1-I1-C1-C2-C3-C4-C5-E, S-I1-C1-C2-C3-C4-C5-C6-C7-E
 - Interested in reviewing general pages (to gather general information).
 - Not serious visitors (the average session length is 3).

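Experimental Result
SUPs in BNF notation; each entry lists threshold, beginning Web page, and SUP
Non-purchase clusters
Cluster 1 (1746 sessions; average length 9.6; 98 states)
 - 0.3, S: S-{C}-E
 - 0.25, P86806: P86806-{C}-E
Cluster 2 (241 sessions; average length 6.6; 38 states)
 - 0.37, S: S-{P}-[C]-E
 - 0.34, P86806: P86806-[I]-{P}-E
Cluster 3 (13 sessions; average length 3.0; 6 states)
 - threshold not preserved, S: S-<C | I>-{P}-E
 - 0.2, P86806: P86806-[{P}-[P86806]]-E
Purchase clusters (cluster numbers not preserved in the transcript)
1858 sessions; average length 14.9; 55 states
 - 0.47, S: S-[C]-[I]-{P}-E
 - 0.51: SUP not preserved in the transcript
132 sessions; average length 39.1; 100 states
 - 0.457, S: S-[{{C}|{I}}]-{P}-E
 - 0.434, P86806: P86806-[{C}]-{P}-E
10 sessions; average length 31.6; 47 states
 - 0.52, S: S-{P}-[{I}]-[{P}]-{C}-E
 - 0.43, P86806: P86806-[I]-[{P}]-{C}-E
Observations:
 - Purchasers review the information, compare among products, and fill out the payment and shipping information.
 - The average length of SUPs is longer in the purchase clusters than in the non-purchase clusters.
 - SUPs in the purchase clusters have higher probability than those in the non-purchase clusters.
 - Having a purchase in mind vs. random browsing behavior.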

Conclusion and Future Work
Summary
 - Clustering abstracted user sessions makes it more likely to find groups of users with similar motivations for visiting a specific website.
 - Letting users specify the beginning and/or ending Web page(s) gives them more control over generating the patterns they are interested in.
Future work
 - Scalability
 - Clustering to identify different user groups
 - Online assignment of a user to a predefined cluster

References
[1] J. Borges and M. Levene, “Data Mining of User Navigation Patterns”, in Proc. Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), pp. 31-36, San Diego, August 15, 1999.
[2] J. Borges and M. Levene, “An average linear time algorithm for web data mining”, International Journal of Information Technology and Decision Making, 3 (2004), pp. 307-320.
[3] W. Wang and O. R. Zaïane, “Clustering Web Sessions by Sequence Alignment”, Third International Workshop on Management of Information on the Web, in conjunction with the 13th International Conference on Database and Expert Systems Applications (DEXA'2002), pp. 394-398, Aix en Provence, France, September 2-6, 2002.

Thank you
Questions?