Published by Damian McCarthy. Modified over 9 years ago.
1
Analysing Clickstream Data: From Anomaly Detection to Visitor Profiling
Peter I. Hofgesang (hpi@few.vu.nl), Wojtek Kowalczyk (wojtek@few.vu.nl)
ECML/PKDD Discovery Challenge, October 7, 2005, Porto, Portugal
2
Web server data
7 internet shops (home electronics)
80.000 visitors (IP-addresses) in 25 days
0.5 million sessions
3 million clicks (records in a log file)
Example record:
11;1076262912;193.170.198.122;eb5cbe50997fcb7f9155c6c194c832a8;/znacka/?c=162&tisk=ano;http://www.google.com./search?hl=cs&q=Sennheiser+HD+650&btnG=Vyhledat+Googlem&lr=lang_cs
Objective: discover interesting patterns!
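The example record can be split into fields with a few lines of Python. This is only a sketch: the slide does not name the six fields, so the names below (shop_id, timestamp, ip, session_id, page, referrer) and the reading of the second field as a Unix timestamp in seconds are assumptions.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Click(NamedTuple):
    # Field names are assumptions; the slide only shows the raw record.
    shop_id: int
    timestamp: datetime  # assumed to be Unix time in seconds
    ip: str
    session_id: str
    page: str
    referrer: str        # may be empty


def parse_click(line: str) -> Click:
    # Split into at most six fields so that any ';' inside the
    # referrer URL stays part of the referrer.
    shop, ts, ip, sid, page, referrer = line.rstrip("\n").split(";", 5)
    when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return Click(int(shop), when, ip, sid, page, referrer)


record = (
    "11;1076262912;193.170.198.122;eb5cbe50997fcb7f9155c6c194c832a8;"
    "/znacka/?c=162&tisk=ano;"
    "http://www.google.com./search?hl=cs&q=Sennheiser+HD+650"
    "&btnG=Vyhledat+Googlem&lr=lang_cs"
)
click = parse_click(record)
print(click.shop_id, click.timestamp.isoformat(), click.page)
```

A record with fewer than six fields raises a ValueError here, which is a convenient way to surface malformed log lines early.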
3
Data Mining Process
4
Anomalies/Strange things I
Multiple IP-addresses per session
– 2 IP-addresses: 3.051 sessions
– 3 IP-addresses: 362 sessions
– 4 IP-addresses: 113 sessions
– …
– 22 IP-addresses: 1 session
– Some sessions involve IPs from different countries
A few sessions (12) refer to multiple shops
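A count like the one above can be obtained by collecting the distinct IP-addresses seen under each session identifier. The sketch below assumes the clicks are already available as (session_id, ip) pairs; the sample data is invented for illustration.

```python
from collections import Counter, defaultdict


def ips_per_session(clicks):
    """clicks: iterable of (session_id, ip) pairs.

    Returns {session_id: number of distinct IP-addresses}.
    """
    ips = defaultdict(set)
    for sid, ip in clicks:
        ips[sid].add(ip)
    return {sid: len(seen) for sid, seen in ips.items()}


def ip_count_histogram(clicks):
    """Maps 'number of distinct IPs' to 'number of sessions'."""
    return Counter(ips_per_session(clicks).values())


# Invented sample: s1 uses two addresses, s2 and s3 one each.
sample = [
    ("s1", "192.0.2.1"),
    ("s1", "192.0.2.2"),
    ("s2", "192.0.2.1"),
    ("s2", "192.0.2.1"),
    ("s3", "192.0.2.3"),
]
hist = ip_count_histogram(sample)
print(sorted(hist.items()))
```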
5
Anomalies/Strange things II
Sessions with long duration
– 476 sessions longer than 24 hours (up to 18 days)
Very intensive sessions
– 2.865 sessions with more than 100 visited pages
– 19 sessions with more than 1.000 visited pages
– 2 sessions with more than 10.000 visited pages
Frequent IP-addresses with short sessions
– E.g.: 29.320 sessions in less than 20 hours from 147.229.205.80
“Parallel sessions”
– Overlapping sequences of clicks from the same IP to the same shop within a short period with multiple SIDs (Opening a new window? Making a transaction?)
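Long and very intensive sessions can be flagged from per-session duration and click counts. The following is a minimal sketch, assuming clicks arrive as (session_id, unix_timestamp) pairs and that every click counts as one visited page; the thresholds mirror the 24 hours and 100 pages quoted above, and the sample data is invented.

```python
from collections import defaultdict

DAY_SECONDS = 24 * 60 * 60


def session_stats(clicks):
    """clicks: iterable of (session_id, unix_timestamp) pairs.

    Returns {session_id: (duration_in_seconds, number_of_clicks)}.
    """
    first, last, count = {}, {}, defaultdict(int)
    for sid, ts in clicks:
        first[sid] = min(ts, first.get(sid, ts))
        last[sid] = max(ts, last.get(sid, ts))
        count[sid] += 1
    return {sid: (last[sid] - first[sid], count[sid]) for sid in count}


def flag_sessions(clicks, max_duration=DAY_SECONDS, max_clicks=100):
    """Returns (long_sessions, intensive_sessions) as sets of session ids."""
    stats = session_stats(clicks)
    long_sessions = {s for s, (d, _) in stats.items() if d > max_duration}
    intensive = {s for s, (_, n) in stats.items() if n > max_clicks}
    return long_sessions, intensive


# Invented sample: "a" spans two days, "b" has 150 clicks, "c" is ordinary.
sample = [("a", 0), ("a", 2 * DAY_SECONDS)]
sample += [("b", t) for t in range(150)]
sample += [("c", 0), ("c", 60)]
long_sessions, intensive = flag_sessions(sample)
print(long_sessions, intensive)
```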
6
Anomalies/Strange things III
Sequences of short sessions that together form a single session
Example: clicks from 62.209.194.163 (31 Jan 04)
09:40:09 /dt/?c=13654;http://www.shop5.cz/
09:41:21 /dt/param.php?id=115;
09:41:21 /;
09:41:37 /ls/?id=20;http://www.shop5.cz/dt/?c=13654
09:41:42 /;
09:42:24 /ls/?&id=20&view=1,2,3,8&pozice=20;http://www.shop5.cz/ls/…
09:42:25 /;
09:42:48 /ls/?&id=20&view=1,2,3,8;http://www.shop5.cz/ls/?&id=20& …
09:42:48 /;
09:42:53 /ls/?&id=20&view=1,2,3,8&pozice=40;http://www.shop5.cz/ls/…
Each of these clicks has a different session identifier!
7
Fixing the data
A new definition of “session”: a chronologically ordered sequence of “clicks” from the same IP-address to the same shop with no gaps longer than 30 minutes
Sessions longer than 50 clicks are ignored (12.000)
The number of sessions dropped from 522.410 to 281.153
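The new session definition translates fairly directly into code. The sketch below is one possible reading of it, not the authors' implementation: clicks are assumed to be (shop_id, ip, unix_timestamp, page) tuples, a gap of exactly 30 minutes is treated as still belonging to the same session, and the sample data is invented.

```python
from collections import defaultdict

GAP_SECONDS = 30 * 60  # "no gaps longer than 30 minutes"
MAX_CLICKS = 50        # longer sessions are ignored


def sessionize(clicks, gap=GAP_SECONDS, max_clicks=MAX_CLICKS):
    """clicks: iterable of (shop_id, ip, unix_timestamp, page) tuples.

    Returns a list of sessions; each session is a chronologically
    ordered list of (unix_timestamp, page) tuples.
    """
    by_visitor = defaultdict(list)
    for shop, ip, ts, page in clicks:
        by_visitor[(shop, ip)].append((ts, page))

    sessions = []
    for key in sorted(by_visitor):
        events = sorted(by_visitor[key])
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            if event[0] - prev[0] > gap:
                # Gap too long: close the running session, start a new one.
                sessions.append(current)
                current = []
            current.append(event)
        sessions.append(current)

    # Drop sessions with more than max_clicks clicks.
    return [s for s in sessions if len(s) <= max_clicks]


# Invented sample: one visitor with a 30-minute gap (kept together) and a
# 30-minute-and-1-second gap (split), plus one visitor with 51 rapid clicks.
sample = [
    (5, "192.0.2.1", 0, "/"),
    (5, "192.0.2.1", 600, "/dt/"),
    (5, "192.0.2.1", 2400, "/ls/"),
    (5, "192.0.2.1", 4201, "/"),
]
sample += [(5, "192.0.2.2", t, "/") for t in range(51)]
sessions = sessionize(sample)
print([len(s) for s in sessions])
```

Grouping by (shop, IP) rather than by the logged session identifier is what merges the one-click "sessions" shown in the earlier example into a single visit.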
8
Old and New Sessions
Session Length   Count Old   Count New
 1               318.523     65.258
 2                24.762     31.821
 3                17.353     18.828
 4                15.351     16.332
 5                15.361     15.509
 6                13.455     13.448
 7                10.958     10.883
 8                 9.045      9.095
 9                 7.939      8.070
10                 7.028      7.091
...
9
Visitor Profiling
Motivation: On the internet each shop is just “one click away”. A user who is not satisfied with the service simply goes to the next shop and will likely never return.
10
Visitor Profiling Scheme
I. Clustering of user sessions
II. Analysis/interpretation of the clusters
III. Assign a cluster label to each session
IV. Analysis of the profile sequences
11
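Clustering
Cadez et al. (2001): predictive profiles from historical transaction data
Model: mixture of multinomials (the slide's formulas for the mixture and for the full data likelihood were not preserved in this transcript)
The unknown parameters (the mixture weights and the multinomial probabilities of each component) are estimated by the expectation-maximization (EM) algorithm.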
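As an illustration of the technique, here is a minimal EM sketch for a generic mixture of multinomials. It is not the slide's exact formulation, which did not survive, and not Cadez et al.'s code. It assumes each session is summarised as a vector of counts per page category; in the standard form of the model a session with counts n_c has probability sum_k alpha_k * prod_c theta_kc^n_c, up to a multinomial coefficient that does not depend on k. The function name, the smoothing constant and the toy data are all assumptions.

```python
import numpy as np


def fit_multinomial_mixture(counts, n_components, n_iter=100,
                            smoothing=1e-2, seed=0):
    """Fit a mixture of multinomials with EM (illustrative sketch).

    counts: (n_sessions, n_categories) array of non-negative counts.
    Returns (weights, theta, resp): mixture weights, per-component
    category probabilities, and responsibilities from the last E-step.
    """
    counts = np.asarray(counts, dtype=float)
    n_categories = counts.shape[1]
    rng = np.random.default_rng(seed)
    weights = np.full(n_components, 1.0 / n_components)
    theta = rng.dirichlet(np.ones(n_categories), size=n_components)

    for _ in range(n_iter):
        # E-step in log space. The multinomial coefficient is the same
        # for every component, so it cancels and is left out.
        log_post = np.log(weights) + counts @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step. The floor keeps log(weights) finite; the additive
        # smoothing keeps every theta entry strictly positive.
        weights = np.maximum(resp.mean(axis=0), 1e-12)
        weights /= weights.sum()
        theta = resp.T @ counts + smoothing
        theta /= theta.sum(axis=1, keepdims=True)

    return weights, theta, resp


# Toy data (invented): 20 sessions on categories 0-1, 20 on categories 2-3.
toy = np.array([[5, 5, 0, 0]] * 20 + [[0, 0, 5, 5]] * 20)
weights, theta, resp = fit_multinomial_mixture(toy, n_components=2)
labels = resp.argmax(axis=1)  # hard cluster label per session
print(weights.round(3), labels)
```

Taking the argmax of the responsibilities gives one cluster label per session, which corresponds to step III of the profiling scheme.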
12
Interpretation of the clusters
Profile 1: General overview of the products
Profile 2: Focused search
Profile 3: Potential buyers
Profile 4: Parameter based search
13
The transitions of profiles
      P1       P2       P3       P4
P1    0.7208   0.1592   0.0621   0.0579
P2    0.5908   0.2828   0.0710   0.0553
P3    0.5022   0.1616   0.2873   0.0489
P4    0.6000   0.1702   0.0685   0.1613
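Each row of the matrix above sums to about 1, which is consistent with rows being the current profile and columns the next one. A matrix of this kind can be estimated by counting consecutive label pairs and normalising each row. The sketch below assumes each visitor is represented by a chronological list of 0-based profile labels, one per session; the sample sequences are invented.

```python
import numpy as np


def transition_matrix(sequences, n_profiles):
    """sequences: iterable of lists of 0-based profile labels,
    one list per visitor, in chronological order.

    Returns an (n_profiles, n_profiles) row-stochastic matrix; rows
    without any outgoing transition are left as all zeros.
    """
    counts = np.zeros((n_profiles, n_profiles))
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current, following] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Divide only where a row has transitions, to avoid 0/0.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)


# Invented sample: three visitors, four profiles (label 3 never occurs).
sample = [[0, 0, 1], [0, 2, 0], [1, 0]]
P = transition_matrix(sample, n_profiles=4)
print(P.round(4))
```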
14
Tree of user profiles
15
Tree of potential buyers
16
Conclusion
We spotted several anomalies
Background information about pre-processing and data preparation is important
Important features were missing (who is a buyer?)
Four clear user profiles