Characterizing Visitors to a Website Across Multiple Sessions. Arindam Banerjee and Joydeep Ghosh. NGDM Workshop, Nov 2002.

Presentation transcript:

Slide 1: Characterizing Visitors to a Website Across Multiple Sessions
Arindam Banerjee and Joydeep Ghosh
NGDM Workshop, Nov 2002

Slide 2: Motivation
Why characterize or predict web user behavior?
- Site-centric view: personalization, sticky websites
- User-centric view: personal agents for information acquisition
- Universalist approaches: PageRank, web metrics, …

Slide 3: Clustering Users from Web Logs
Wide variety of web behavior ⇒ segment users based on surfing behavior as a first step to further analysis.
User: set of sessions
Session: sequence of
- (page ID, time spent on that page) tuples
- How to cluster sets of sequences?

Slide 4: The Approach
Cluster sessions
- Session similarity measure
- Session similarity graph (outlier detection)
- Graph partitioning
Create a cluster space
Cluster users in this space

Slide 5: A Similarity Measure for Sessions
1. Overlap between two sessions is represented by the longest common subsequence (LCS)
2. Obtain session similarity using the LCS and time information
session similarity = (time similarity in LCS) x (importance of LCS)
The similarity component:
- Average min-max similarity for each page in the LCS
The importance component:
- Average of the fraction of overall session time spent in the LCS
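
The measure on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the `Session` type, and the handling of empty or zero-time sessions are assumptions. An LCS is not always unique, and the slides do not say how ties are resolved, so the tie-break in the traceback is also an assumption (chosen here so that the deck's worked example yields the LCS b, d, c).

```python
from typing import List, Tuple

# A session is a sequence of (page id, seconds spent on that page) tuples.
Session = List[Tuple[str, float]]


def lcs_index_pairs(pages_a, pages_b):
    """Return index pairs (i, j) for one longest common subsequence."""
    n, m = len(pages_a), len(pages_b)
    # table[i][j] holds the LCS length of pages_a[i:] and pages_b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if pages_a[i] == pages_b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if pages_a[i] == pages_b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] > table[i][j + 1]:
            i += 1
        else:
            j += 1  # on ties, advance the second session (assumed tie-break)
    return pairs


def session_similarity(a: Session, b: Session) -> float:
    """(time similarity in LCS) x (importance of LCS)."""
    pairs = lcs_index_pairs([p for p, _ in a], [p for p, _ in b])
    total_a = sum(t for _, t in a)
    total_b = sum(t for _, t in b)
    if not pairs or total_a == 0 or total_b == 0:
        return 0.0  # assumption: no overlap or no recorded time means 0
    # Similarity component: average min/max time ratio over the LCS pages.
    ratios = []
    for i, j in pairs:
        low, high = sorted((a[i][1], b[j][1]))
        ratios.append(low / high if high > 0 else 1.0)
    time_similarity = sum(ratios) / len(ratios)
    # Importance component: mean fraction of session time spent in the LCS.
    lcs_time_a = sum(a[i][1] for i, _ in pairs)
    lcs_time_b = sum(b[j][1] for _, j in pairs)
    importance = (lcs_time_a / total_a + lcs_time_b / total_b) / 2
    return time_similarity * importance


session_1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session_2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
print(round(session_similarity(session_1, session_2), 2))  # 0.43
```

The two sessions at the end are the ones used in the deck's backup example; a different tie-break could select b, d, a instead and produce a different value.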

Slide 6: Session Clustering
Find the pairwise similarity values between all pairs of sessions; record only similarities greater than a threshold
Incrementally construct the similarity graph G
- the vertices are the sessions, the edge weights are the session similarity values
- no isolated vertices (discard "outliers")
Balanced graph partitioning
- we used Metis [Karypis, Kumar]
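
The graph construction described on this slide might look like the sketch below. The function name, the edge representation, and the toy similarity used in the example are assumptions made for illustration, and the Metis partitioning step is not reproduced here. Every pair is compared, so the cost grows quadratically with the number of sessions.

```python
from itertools import combinations


def build_similarity_graph(items, similarity, threshold):
    """Keep edges whose weight exceeds the threshold; report isolated vertices."""
    edges = {}
    for i, j in combinations(range(len(items)), 2):
        weight = similarity(items[i], items[j])
        if weight > threshold:
            edges[(i, j)] = weight
    # Vertices that appear in at least one retained edge stay in the graph.
    connected = {vertex for edge in edges for vertex in edge}
    # Vertices with no retained edge are treated as outliers and discarded.
    outliers = [v for v in range(len(items)) if v not in connected]
    return edges, sorted(connected), outliers


# Toy example: numbers stand in for sessions, closeness stands in for similarity.
values = [0.0, 0.1, 0.2, 5.0]
closeness = lambda x, y: 1.0 / (1.0 + abs(x - y))
edges, kept, outliers = build_similarity_graph(values, closeness, threshold=0.8)
print(kept, outliers)  # [0, 1, 2] [3]
```

The retained edge list could then be handed to a graph partitioner of the reader's choice.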

Slide 7: The Cluster Space
Given: each session is assigned to one of k clusters (sets) ⇒ sessions of a user are distributed among the k sets
- vector u = [u_1 u_2 … u_k]^T, where u_i = number of sessions of the user belonging to cluster i
Stage II: User Clustering
- find pairwise similarity values using the extended Jaccard measure
- partition the similarity graph
Gives l user clusters and a set of outlier users
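
As a rough sketch of the cluster space, the code below builds the count vector u for one user and compares two users with the extended Jaccard measure, taken here in its usual form x·y / (‖x‖² + ‖y‖² − x·y). The function names, the 0-based cluster labels, and the value returned for two all-zero vectors are assumptions.

```python
def cluster_space_vector(session_labels, k):
    """Count how many of a user's sessions fall into each of the k clusters."""
    u = [0] * k
    for label in session_labels:
        u[label] += 1  # labels are assumed to be integers in 0..k-1
    return u


def extended_jaccard(x, y):
    """Extended Jaccard similarity between two non-negative vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denominator if denominator else 0.0  # assumption for 0/0


# Hypothetical users: session cluster labels with k = 4 session clusters.
user_a = cluster_space_vector([0, 2, 2, 1, 2], k=4)  # [1, 1, 3, 0]
user_b = cluster_space_vector([2, 2, 3], k=4)        # [0, 0, 2, 1]
print(extended_jaccard(user_a, user_b))  # 0.6
```

These pairwise values would then populate the user similarity graph in the same way as for sessions.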

Slide 8: The Dataset: Sulekha.com

Slide 9: Dataset details
- Logs over a one-month period
- Raw log size: 184 MB
- 453,953 files accessed
- 37,753 sessions in all
- 23,310 sessions after some preprocessing/filtering
- 2,493 users

Slide 10: Results: Session Clusters
Cluster 1: interest in coffeehouse, contests
- (/,12)(/movies,6)(/contests,178)
- (/contests,142)
- (/coffeehouse,5)(/contests,183)
- (/contests,172)
- (/,10)(/contests,143)
Cluster 2: glance through home, articles
- (/,22)(/articles,22)
- (/,20)(/articles,20)
- (/,21)(/articles,21)
- (/,19)(/articles,19)
- (/,20)(/articles,19)
Cluster 3: interest in author, articles
- (/,148)(/authors,6)(/articles,77)
- (/authors,290)(/articles,290)
- (/authors,295)(/articles,295)
- (/,33)(/authors,90)(/articles,475)
- (/,32)(/authors,91)(/articles,425)
Cluster 4: read articles
- (/,39)(/articles,98)(/misc,17)(/articles,2649)
- (/,9)(/articles,2666)
- (/authors,26)(/articles,2561)
- (/misc,20)(/articles,77)(/misc,32)(/articles,43)(/authors,16)(/articles,2373.1)

Slide 11: Results: User Clusters
user: [( xxx.xxx)]
- (/authors,3)(/articles,129)
- (/authors,8)(/articles,8)
- (/authors,80)(/articles,2141)
user: [( xxx.xxx)]
- (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)
- (/articles,2627)
user: [( xxx.xxx)]
- (/home,323)(/articles,24)(/authors,45)(/articles,1290)
A user cluster: people who read the articles

Slide 12: Results: User Clusters
user: [( xxx.xxx)]
- (/home,21)(/wo-men,1075)(/philosophy,52)
user: [( xxx.xxx)]
- (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)
- (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
- (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)
A user cluster: people interested in wo-men, philosophy, coffeehouse

Slide 13: Results: User Clusters
user: [( xxx.xxx)]
- (/coffeehouse,12)(/biztech,25)(/books,48)
- (/coffeehouse,13)(/biztech,26)(/books,19)
user: [( xxx.xxx)]
- (/coffeehouse,162)
- (/coffeehouse,40)
user: [( xxx.xxx)]
- (/coffeehouse,12)(/contests,12)
- (/coffeehouse,43)(/contests,44)
A user cluster: people interested in coffeehouse (bookmarked it!)

Slide 14: Result Visualization using CLUSION [Strehl & Ghosh 01]
[Figure panels: Sessions, Users]

Slide 15: Conclusions
- Segmentation: a basic pre-processing step for Web Mining
- Similarity measure + Cluster Space concept: applicable to clustering of sets of any data structure
- For certain websites, time spent on the pages matters
  - not handled by current commercial tools
- Outlier detection before clustering is important
- Results QA-ed by human subjects
  - Results for clusters & outliers at both levels were subjectively good
- No good way to find cluster quality analytically
- Formation of the similarity graph is a slow process

Slide 16: Future Work
Improve the present method by:
- using cluster seeds for cluster growing
- using alternative clustering algorithms for each stage
- studying the effect of thresholds and number of clusters on performance
- studying the importance of the order of page visits
- studying the importance of balanced clustering

Slide 17: Backup

Slide 18: Issues: Choice of Parameters
Number of session clusters, k, should be chosen appropriately
Thresholds for forming session & user similarity graphs:
- threshold value should be chosen after looking at the distribution of edge weights
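
One simple way to look at the distribution of edge weights is to read off a few quantiles before fixing a threshold. The sketch below is only an illustration: the function names, the quantile levels, and the sample weights are hypothetical, and the slides do not prescribe any particular rule for picking the value.

```python
def weight_at_quantile(weights, q):
    """Return the edge weight found at quantile q (0 <= q <= 1) of the sorted weights."""
    if not weights:
        raise ValueError("no edge weights to summarize")
    ordered = sorted(weights)
    # Clamp the index so that q = 1.0 maps to the largest weight.
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


def summarize_weights(weights, quantiles=(0.25, 0.5, 0.75, 0.9)):
    """Map each requested quantile to the corresponding edge weight."""
    return {q: weight_at_quantile(weights, q) for q in quantiles}


# Hypothetical pairwise similarity values.
sample_weights = [0.9, 0.1, 0.3, 0.2, 0.4]
print(summarize_weights(sample_weights, quantiles=(0.5, 1.0)))  # {0.5: 0.3, 1.0: 0.9}
```

A candidate threshold read from such a summary would still need to be checked against the resulting graph, for example by counting how many vertices become isolated.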

Slide 19: Related Work
Research in Web Mining:
- Extraction of navigational patterns: Spiliopoulou, Faulstich
- Ordering relationships: Mannila, Meek
- Surfing prediction: Pitkow, Pirolli
- Clustering web usage sessions: Fu, Sandhu, Shih

Slide 20: Example
Sessions:
- Session 1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]
- Session 2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]
LCS pages = [(b)(d)(c)]
Corresponding index and time sequences:
- Index 1 = [(1)(2)(3)], Time 1 = [(100) (8) (5)]
- Index 2 = [(0)(1)(4)], Time 2 = [(5) (12) (5)]
Similarity over each LCS page: min/max of the two times
- Similarity on page b = 5/100 = 0.05
- Similarity on page d = 8/12 = 0.67
- Similarity on page c = 5/5 = 1.00

Slide 21: Example (contd.)
The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57
The importance component:
- Fraction of time spent in the LCS by Session 1 = 113/149 = 0.76
- Fraction of time spent in the LCS by Session 2 = 22/30 = 0.73
- The mean = (0.76 + 0.73)/2 = 0.75
The overall similarity = 0.57 x 0.75 = 0.43

Slide 22: Issues: Session Resolution
Generate coarse-resolution paths making use of the concept hierarchy of the website
Reduces computations; increases interpretability of results
Example 1
- Original path: (/authors/ramesh_mahadevan.html,3) (/articles/rm_phattas.html,75) (/articles/rm_desidads.html,39)
- Concept-level path: (/authors,3) (/articles,114)
Example 2
- Original path: (/authors/arun_sampath.html,109) (/philosophy/messages/1951.html,102) (/philosophy/messages/1953.html,46) (/,3) (/philosophy/messages/1954.html,69)
- Concept-level path: (/authors,109) (/philosophy,148) (/,3) (/philosophy,69)
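
The coarsening shown on this slide can be approximated with the sketch below. It assumes that the concept of a page is its top-level directory and that consecutive visits to the same concept are merged by summing their times, which reproduces both examples above; the actual concept hierarchy used for the site is not given in the slides, and the function names are hypothetical.

```python
def concept_of(path):
    """Map a page path to its top-level directory (assumed concept level)."""
    parts = [part for part in path.split("/") if part]
    # The site root, or a file sitting directly under it, maps to "/".
    if not parts or (len(parts) == 1 and "." in parts[0]):
        return "/"
    return "/" + parts[0]


def to_concept_path(session):
    """Collapse a session to concept level, merging consecutive repeats."""
    result = []
    for path, seconds in session:
        concept = concept_of(path)
        if result and result[-1][0] == concept:
            # Same concept as the previous visit: add the time to that entry.
            result[-1] = (concept, result[-1][1] + seconds)
        else:
            result.append((concept, seconds))
    return result


original = [
    ("/authors/arun_sampath.html", 109),
    ("/philosophy/messages/1951.html", 102),
    ("/philosophy/messages/1953.html", 46),
    ("/", 3),
    ("/philosophy/messages/1954.html", 69),
]
print(to_concept_path(original))
# [('/authors', 109), ('/philosophy', 148), ('/', 3), ('/philosophy', 69)]
```

Shorter concept-level sessions also reduce the cost of each LCS computation, in line with the slide's point about reduced computation.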

Slide 23: Comments
Results QA-ed by human subjects
- Results for clusters & outliers at both levels were subjectively good
- No good way to find cluster quality analytically
Clustering algorithms for the two stages
- Stage I: Graph partitioning works well for large sparse graphs, so it is desirable in this stage
- Stage II: Since the space is not high-dimensional, any reasonable clustering algorithm should be adequate
Cluster space
- Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem