Analysing Clickstream Data: From Anomaly Detection to Visitor Profiling Peter I. Hofgesang Wojtek Kowalczyk ECML/PKDD Discovery.


1 Analysing Clickstream Data: From Anomaly Detection to Visitor Profiling
Peter I. Hofgesang (hpi@few.vu.nl), Wojtek Kowalczyk (wojtek@few.vu.nl)
ECML/PKDD Discovery Challenge, October 7, 2005, Porto, Portugal

2 Web server data
7 internet shops (home electronics)
80,000 visitors (IP addresses) in 25 days
0.5 million sessions
3 million clicks (records in a log file)
Example record:
11;1076262912;193.170.198.122;eb5cbe50997fcb7f9155c6c194c832a8;/znacka/?c=162&tisk=ano;http://www.google.com./search?hl=cs&q=Sennheiser+HD+650&btnG=Vyhledat+Googlem&lr=lang_cs
Objective: discover interesting patterns!
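The example record looks like six semicolon-separated fields. A minimal parsing sketch in Python follows. The field names (shop_id, timestamp, ip, session_id, page, referrer) are assumptions read off the example and are not given on the slide, as is the reading of the second field as a Unix timestamp in seconds.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Click:
    # Field names are illustrative guesses, not taken from the slides.
    shop_id: int
    timestamp: datetime
    ip: str
    session_id: str
    page: str
    referrer: str


def parse_record(line: str) -> Click:
    # maxsplit=5 keeps any ';' inside the trailing referrer URL intact.
    shop, ts, ip, sid, page, referrer = line.strip().split(";", 5)
    # Assumption: the second field is a Unix timestamp in seconds.
    when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return Click(int(shop), when, ip, sid, page, referrer)


# The example record from the slide, with the slide's line breaks removed.
RECORD = (
    "11;1076262912;193.170.198.122;eb5cbe50997fcb7f9155c6c194c832a8;"
    "/znacka/?c=162&tisk=ano;"
    "http://www.google.com./search?hl=cs&q=Sennheiser+HD+650"
    "&btnG=Vyhledat+Googlem&lr=lang_cs"
)

click = parse_record(RECORD)
print(click.shop_id, click.timestamp.isoformat(), click.page)
```

Splitting with a maximum of five cuts is a defensive choice: a page path that itself contains a semicolon would still break this sketch.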

3 Data Mining Process

4 Anomalies/Strange things I
Multiple IP addresses per session
– 2 IP addresses: 3,051 sessions
– 3 IP addresses: 362 sessions
– 4 IP addresses: 113 sessions
– ...
– 22 IP addresses: 1 session
– Some sessions involve IPs from different countries
A few sessions (12) refer to multiple shops
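Counts like the ones above can be obtained by collecting the distinct IP addresses seen under each session identifier. The following is a minimal sketch on toy data; the function name and the (session_id, ip) input shape are hypothetical, and the slides do not say how the authors computed their figures.

```python
from collections import Counter, defaultdict


def ip_count_histogram(clicks):
    """Map 'number of distinct IPs' to 'number of sessions'.

    clicks: iterable of (session_id, ip) pairs (hypothetical input shape).
    """
    ips_by_session = defaultdict(set)
    for session_id, ip in clicks:
        ips_by_session[session_id].add(ip)
    return Counter(len(ips) for ips in ips_by_session.values())


# Toy data, not from the challenge log.
clicks = [
    ("a", "10.0.0.1"),
    ("a", "10.0.0.2"),
    ("b", "10.0.0.1"),
    ("c", "10.0.0.3"),
    ("c", "10.0.0.3"),
]
histogram = ip_count_histogram(clicks)
print(sorted(histogram.items()))
```

The same grouping, keyed on shop identifier instead of IP, would surface sessions that refer to multiple shops.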

5 Anomalies/Strange things II
Sessions with long duration
– 476 sessions longer than 24 hours (up to 18 days)
Very intensive sessions
– 2,865 sessions with more than 100 visited pages
– 19 sessions with more than 1,000 visited pages
– 2 sessions with more than 10,000 visited pages
Frequent IP addresses with short sessions
– E.g.: 29,320 sessions in less than 20 hours from 147.229.205.80
“Parallel sessions”
– Overlapping sequences of clicks from the same IP to the same shop within a short period with multiple SIDs (Opening a new window? Making a transaction?)
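The duration and intensity checks above can be sketched as two simple thresholds on per-session statistics. This is an illustration on toy data: the function names and input shape are hypothetical, and it assumes that one click corresponds to one visited page.

```python
from collections import defaultdict

DAY_SECONDS = 24 * 60 * 60


def session_stats(clicks):
    """Return {session_id: (n_clicks, duration_seconds)}.

    clicks: iterable of (session_id, unix_timestamp) pairs (hypothetical shape).
    """
    times = defaultdict(list)
    for session_id, ts in clicks:
        times[session_id].append(ts)
    return {sid: (len(ts), max(ts) - min(ts)) for sid, ts in times.items()}


def flag_anomalies(stats, max_duration=DAY_SECONDS, max_clicks=100):
    """Return (long_sessions, intensive_sessions) as sets of session ids."""
    long_sessions = {sid for sid, (_, dur) in stats.items() if dur > max_duration}
    intensive = {sid for sid, (n, _) in stats.items() if n > max_clicks}
    return long_sessions, intensive


# Toy data: one session spanning 25 hours, one with 101 clicks, one ordinary.
clicks = [("long", 0), ("long", 25 * 60 * 60)]
clicks += [("busy", t) for t in range(101)]
clicks += [("normal", 0), ("normal", 30), ("normal", 90)]

long_sessions, intensive = flag_anomalies(session_stats(clicks))
print(long_sessions, intensive)
```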

6 Anomalies/Strange things III
Sequences of short sessions that together form a single session
Example: clicks from 62.209.194.163 (31 Jan 04)
09:40:09 /dt/?c=13654;http://www.shop5.cz/
09:41:21 /dt/param.php?id=115;
09:41:21 /;
09:41:37 /ls/?id=20;http://www.shop5.cz/dt/?c=13654
09:41:42 /;
09:42:24 /ls/?&id=20&view=1,2,3,8&pozice=20;http://www.shop5.cz/ls/…
09:42:25 /;
09:42:48 /ls/?&id=20&view=1,2,3,8;http://www.shop5.cz/ls/?&id=20& …
09:42:48 /;
09:42:53 /ls/?&id=20&view=1,2,3,8&pozice=40;http://www.shop5.cz/ls/…
Each one has a different session identifier!

7 Fixing the data
A new definition of “session”: a chronologically ordered sequence of “clicks” from the same IP address to the same shop with no gaps longer than 30 minutes
Sessions longer than 50 clicks ignored (12,000)
Number of sessions dropped: 522,410 → 281,153
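The new session definition can be sketched as follows: group clicks by (IP address, shop), sort them by time, cut wherever the gap exceeds 30 minutes, and discard sessions with more than 50 clicks. The function name and the tuple layout are assumptions made for illustration; the slides do not show the authors' implementation.

```python
from collections import defaultdict

GAP_SECONDS = 30 * 60  # "no gaps longer than 30 minutes"
MAX_CLICKS = 50        # "sessions longer than 50 clicks ignored"


def rebuild_sessions(clicks, gap=GAP_SECONDS, max_clicks=MAX_CLICKS):
    """Rebuild sessions from raw clicks.

    clicks: iterable of (ip, shop_id, unix_timestamp, page) tuples
    (hypothetical layout). Returns a list of ((ip, shop_id), [(ts, page), ...]).
    """
    by_visitor = defaultdict(list)
    for ip, shop_id, ts, page in clicks:
        by_visitor[(ip, shop_id)].append((ts, page))

    sessions = []
    for key, events in by_visitor.items():
        events.sort(key=lambda e: e[0])  # chronological order
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            if event[0] - prev[0] > gap:
                # Gap too long: close the current session, start a new one.
                sessions.append((key, current))
                current = []
            current.append(event)
        sessions.append((key, current))

    # Drop overly long sessions instead of truncating them.
    return [(key, s) for key, s in sessions if len(s) <= max_clicks]


# Toy data: two sessions for one visitor and shop, one for another shop,
# plus a 51-click burst that gets dropped.
clicks = [
    ("1.1.1.1", 5, 0, "/"),
    ("1.1.1.1", 5, 600, "/a"),
    ("1.1.1.1", 5, 2401, "/b"),  # 1801 s after the previous click
    ("1.1.1.1", 7, 100, "/"),
]
clicks += [("2.2.2.2", 5, t, "/x") for t in range(51)]

sessions = rebuild_sessions(clicks)
print(len(sessions))
```

Because the original session identifiers are ignored entirely, this definition merges the fragmented sequences shown earlier and also splits the multi-day sessions.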

8 Old and New Sessions
Session Length   Count Old   Count New
1                318,523     65,258
2                24,762      31,821
3                17,353      18,828
4                15,351      16,332
5                15,361      15,509
6                13,455      13,448
7                10,958      10,883
8                9,045       9,095
9                7,939       8,070
10               7,028       7,091
...

9 Visitor Profiling
Motivation: On the internet each shop is just “one click away”. If a user is not satisfied with the service, he/she just goes to the next one and will likely never return.

10 Visitor Profiling Scheme
I. Clustering of user sessions
II. Analysis/interpretation of the clusters
III. Assign a cluster label to each session
IV. Analysis of the profile sequences

11 Clustering
Cadez et al. (2001): predictive profiles from historical transaction data
Model: mixture of multinomials
Full data likelihood
The unknown parameters (the mixture weights and the per-cluster multinomial probabilities) are estimated by the expectation-maximization (EM) algorithm.
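For reference, a mixture of multinomials and its EM updates can be written in the following standard form. The notation is an assumption and may differ from the slide: $n_{ic}$ is the number of clicks of session $i$ in page category $c$, $\alpha_k$ are the mixture weights, $\theta_{kc}$ are the per-cluster category probabilities, and $\gamma_{ik}$ are the responsibilities.

```latex
% Mixture of K multinomials (multinomial coefficient omitted; it does not
% depend on the parameters)
p(y_i \mid \Theta) = \sum_{k=1}^{K} \alpha_k \prod_{c=1}^{C} \theta_{kc}^{\,n_{ic}},
\qquad \sum_{k} \alpha_k = 1, \quad \sum_{c} \theta_{kc} = 1

% Log-likelihood of all N sessions
\log L(\Theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \alpha_k \prod_{c=1}^{C} \theta_{kc}^{\,n_{ic}}

% E-step: responsibility of cluster k for session i
\gamma_{ik} = \frac{\alpha_k \prod_{c} \theta_{kc}^{\,n_{ic}}}
                   {\sum_{k'} \alpha_{k'} \prod_{c} \theta_{k'c}^{\,n_{ic}}}

% M-step: re-estimate weights and category probabilities
\alpha_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik},
\qquad
\theta_{kc} = \frac{\sum_{i} \gamma_{ik}\, n_{ic}}
                   {\sum_{i} \gamma_{ik} \sum_{c'} n_{ic'}}
```

Assigning each session to the cluster with the largest $\gamma_{ik}$ is one way to produce the cluster labels used in step III of the scheme.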

12 Interpretation of the clusters
Profile 1: General overview of the products
Profile 2: Focused search
Profile 3: Potential buyers
Profile 4: Parameter-based search

13 The transitions of profiles
      P1      P2      P3      P4
P1    0.7208  0.1592  0.0621  0.0579
P2    0.5908  0.2828  0.0710  0.0553
P3    0.5022  0.1616  0.2873  0.0489
P4    0.6000  0.1702  0.0685  0.1613
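A matrix of this kind can be estimated by counting consecutive pairs of profile labels in each visitor's session sequence and normalizing each row. The sketch below assumes that rows are the "from" profile and columns the "to" profile (consistent with each row summing to about 1); the function name and the toy sequences are hypothetical, and the slides do not state how the authors computed their values.

```python
from collections import Counter

PROFILES = ["P1", "P2", "P3", "P4"]


def transition_matrix(sequences, labels=PROFILES):
    """Estimate first-order transition probabilities between profiles.

    sequences: one list of profile labels per visitor, in session order.
    Returns {from_label: {to_label: probability}}.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))  # consecutive (from, to) pairs

    matrix = {}
    for a in labels:
        row_total = sum(counts[(a, b)] for b in labels)
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0)
            for b in labels
        }
    return matrix


# Toy sequences, not from the challenge data.
sequences = [
    ["P1", "P1", "P2", "P1"],
    ["P3", "P1", "P1"],
    ["P2", "P2", "P3"],
]
matrix = transition_matrix(sequences)
for label in PROFILES:
    print(label, [round(matrix[label][b], 4) for b in PROFILES])
```

Profiles that never appear as a starting point get an all-zero row here; a real analysis would need to decide how to handle that case.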

14 Tree of user profiles

15 Tree of potential buyers

16 Conclusion
We spotted several anomalies → background information about pre-processing & data preparation is important
Important features were missing (who is a buyer?)
Four clear user profiles

