Benefits of InterSite Pre-Processing and Clustering Methods in E-Commerce Domain Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse AxIS Research Team INRIA Sophia Antipolis and Rocquencourt
Motivations
- To show, on the clickstream dataset proposed for the ECML/PKDD 2005 Discovery Challenge, the benefits of the InterSite pre-processing method proposed by Tanasa in his PhD thesis (2005)
- To show the benefits of a new crossed clustering method, developed by Lechevallier & Verde (2003, 2004), when applied to Web logs
- 2 main viewpoints: the user and the Web site load
Plan
1. Intersite Data Pre-Processing
- introduction of the user's intersite visit « Group of SessionIDs »
- first statistical Intersite analysis
2. Crossed Clustering Approach
- confusion table with classes of time periods and classes of product types
- analysis on the most used shop: shop 4
3. Conclusions
Data pre-processing
Initial data:
Table 1. Format of page requests
Columns: ShopID, Date, IP address, SessionID, Page, Referrer
Sample rows (as extracted): dad92c4…84208dca / ; ee02ddcff…7655bb9e /ct/?c=148 http://
Table 2. Number of requests per shop
ShopID   Site name (shop)   #Requests
10       www.shop1.cz       509,688
11       www.shop2.cz       400,045
12       www.shop3.cz       645,724
14       www.shop4.cz       1,290,870
15       www.shop5.cz       308,367
16       www.shop6.cz       298,030
17       www.shop7.cz       164,447
Data pre-processing
Tanasa & Trousse (IEEE Intelligent Systems, 2004); Tanasa's thesis (2005)
Data pre-processing
Data fusion, data cleaning
Table 3. Transformed log lines
Columns: Datetime, IP, SessionID, URL, Referrer
Sample rows (as extracted): :01: dad92c4…84208dca http:// ; :01: ee02ddcff…7655bb9e http://
Data structuration
- A SessionID identifies a single visit on each shop.
- Towards the notion of a user's intersite visit: the SessionIDs that belong to a single user (same IP) are grouped into a « Group of SessionIDs ».
- To decide whether two SessionIDs belong together, the Referrer of a request is compared with the URLs previously accessed (in a reasonable time window).
- 522,410 SessionIDs are grouped into 397,629 Groups, equivalent to a 23.88% reduction.
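The grouping rule described on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the 30-minute window, the function name `group_sessions`, the tuple layout of a request, and the exact-match test between Referrer and URL are all hypothetical choices made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def group_sessions(requests, window=timedelta(minutes=30)):
    """Group SessionIDs of the same IP whose Referrer matches a recent URL.

    requests: iterable of (time, ip, session_id, url, referrer), sorted by time.
    window: assumed value for the "reasonable time window" (hypothetical).
    Returns a dict mapping each SessionID to a representative SessionID.
    """
    parent = {}

    def find(sid):
        # Union-find lookup with path halving.
        parent.setdefault(sid, sid)
        while parent[sid] != sid:
            parent[sid] = parent[parent[sid]]
            sid = parent[sid]
        return sid

    recent = defaultdict(list)  # ip -> [(time, session_id, url), ...]
    for time, ip, sid, url, referrer in requests:
        find(sid)  # register the SessionID even if it is never merged
        # Forget requests of this IP that fell out of the time window.
        recent[ip] = [r for r in recent[ip] if time - r[0] <= window]
        for _, other_sid, other_url in recent[ip]:
            if other_sid != sid and referrer and referrer == other_url:
                root_a, root_b = find(other_sid), find(sid)
                if root_a != root_b:
                    parent[root_b] = root_a
        recent[ip].append((time, sid, url))
    return {sid: find(sid) for sid in list(parent)}


# Toy example with made-up requests on the shop host names of Table 2.
t0 = datetime(2004, 1, 19, 10, 0)
toy_requests = [
    (t0, "ip1", "A", "http://www.shop1.cz/", ""),
    (t0 + timedelta(minutes=1), "ip1", "B",
     "http://www.shop4.cz/ct/?c=148", "http://www.shop1.cz/"),
    (t0 + timedelta(minutes=2), "ip2", "C",
     "http://www.shop4.cz/", "http://www.shop1.cz/"),
    (t0 + timedelta(hours=2), "ip1", "D",
     "http://www.shop2.cz/", "http://www.shop1.cz/"),
]
toy_groups = group_sessions(toy_requests)
```

In the toy data, A and B end up in the same Group (same IP, matching Referrer, within the window), while C (other IP) and D (outside the window) stay alone.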
Relational DB model Data summarisation
Data pre-processing
Fig. 1. Visits per day and hour: (a) globally, (b) multi-shop
- Low number of new visits on Saturdays and Sundays during lunch time
- High number of new visits on Tuesdays and Wednesdays
- The same results are observed in (a) and (b)
Crossed Clustering Approach for Time Periods/Product Analysis
Data: selection of the ls pages in shop 4 (the most used shop)
Method developed by Yves Lechevallier & Rosanna Verde (2003, 2004)
Crossed Clustering Approach for Time Periods/Product Analysis
Relational DB model: a crossed table is easily added
- Row: an individual (weekday, one hour); 7 days x 24 hours = 168 individuals
- Column: a multi-categorical variable giving the number of requests made by users for each product type during that time slice
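As an illustration of the crossed table described here, the following sketch builds the 168 (weekday, hour) rows and counts product requests per row. It assumes that each request has already been reduced to a (timestamp, product type) pair; the function name `crossed_table`, that input layout, and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def crossed_table(requests):
    """Count product requests per (weekday, hour) individual.

    requests: iterable of (datetime, product_type) pairs (assumed layout).
    Returns a dict such as {"Monday_0": Counter({product_type: count})}
    with one entry for each of the 7 x 24 = 168 individuals.
    """
    table = {f"{day}_{hour}": Counter()
             for day in WEEKDAYS for hour in range(24)}
    for when, product in requests:
        # datetime.weekday() returns 0 for Monday and 6 for Sunday.
        key = f"{WEEKDAYS[when.weekday()]}_{when.hour}"
        table[key][product] += 1
    return table


# Toy example with made-up timestamps (19 January 2004 was a Monday).
toy_product_requests = [
    (datetime(2004, 1, 19, 0, 15), "Built-in electric hobs"),
    (datetime(2004, 1, 19, 0, 40), "Built-in electric hobs"),
    (datetime(2004, 1, 19, 1, 5), "Corner single sinks"),
    (datetime(2004, 1, 25, 23, 30), "Garbage disposers"),
]
toy_table = crossed_table(toy_product_requests)
```

Rows with no request keep an empty counter, so the table always has 168 individuals regardless of the data.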
Crossed Clustering Approach for Time Periods/Product Analysis
Table 4. Quantity of products requested by weekday x hour and registered on shop 4
Weekday x Hour   Product (number of requests)
Monday_0         Built-in electric hobs (10), Built-in dish washers 60cm (64), Corner single sinks (50), ...
Monday_1         Free standing combi refrigerators (44), Corner single sinks (50), Built-in hoods (60), ...
…                …
Sunday_22        Built-in microwave ovens (27), Built-in dish washers 45cm (38), Built-in dish washers 60cm (85), ...
Sunday_23        Built-in freezers (56), Kitchen taps with shower (45), Garbage disposers (32), ...
Crossed Clustering Approach for Time Period/Product Analysis
Table 5. Confusion table: seven classes of time periods (rows) by five classes of products (columns Product_1 to Product_5), with row and column totals
Crossed Clustering Approach for Time Period/Product Analysis
Example of one surprising result: the class Product_5 is defined by a single product type, « Free standing combi refrigerators », consulted predominantly on Fridays from 17:00 to 20:00 (class Period_6): 57.7% of the requests for this product type fall in this period.
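To make the reading of such a percentage concrete, the sketch below aggregates a crossed table into a confusion table once every time slice and every product type has been assigned to a class, then computes the share of one product class that falls into one period class. It does not perform the crossed clustering itself; the class assignments are taken as given. The function names and all toy counts are made up for the example and are not values from the dataset.

```python
from collections import Counter


def confusion_table(table, period_class, product_class):
    """Sum request counts per (period class, product class) pair.

    table: {time_slice: {product_type: count}} (crossed table).
    period_class / product_class: assumed class assignments, e.g. the
    output of a clustering step that is not shown here.
    """
    confusion = Counter()
    for time_slice, counts in table.items():
        for product, n in counts.items():
            confusion[(period_class[time_slice], product_class[product])] += n
    return confusion


def column_share(confusion, period, product):
    """Fraction of a product class's requests made in a period class."""
    total = sum(n for (_, q), n in confusion.items() if q == product)
    return confusion[(period, product)] / total if total else 0.0


# Toy crossed table with made-up counts.
toy_counts = {
    "Friday_17": {"Free standing combi refrigerators": 30},
    "Friday_18": {"Free standing combi refrigerators": 20},
    "Monday_0": {"Free standing combi refrigerators": 50,
                 "Built-in electric hobs": 10},
}
toy_periods = {"Friday_17": "Period_6", "Friday_18": "Period_6",
               "Monday_0": "Period_1"}
toy_products = {"Free standing combi refrigerators": "Product_5",
                "Built-in electric hobs": "Product_1"}
toy_confusion = confusion_table(toy_counts, toy_periods, toy_products)
toy_share = column_share(toy_confusion, "Period_6", "Product_5")
```

With these toy counts, half of the Product_5 requests fall into Period_6; the 57.7% reported on the slide is the same kind of column share, computed on the real table.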
Conclusions
1. Intersite Data Pre-Processing
- structuring into users' intersite visits « Group of SessionIDs »
- first statistical Intersite analysis
- anomalies and recommendations for the dataset
2. Crossed Clustering Approach
- first application of such a method to time periods of Web logs and to the e-commerce domain
- promising results
Data pre-processing
Inconsistency problems:
- table kategorie: repeated entries, and different entries with the same ID
- for some page types (dt, df), the given parameter actually represents a specific product, not the given product description (from the products table)
- extra parameters equivalent to the given ones for some page types: e.g. for the ct page type, id is equivalent to the given c parameter
- missing values (descriptions) in tables: 3 values in the product table and 64 in the category table
- multiple-site SessionIDs: 13 cross-server visits had the same SessionID on the visited sites (up to 4 sites), whereas the SessionID should change on each new site
- multiple-IP SessionIDs: 3690 visits (SessionIDs) were made from more than one IP (anonymization proxies?)
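The last two checks in this list (SessionIDs seen on several sites, and SessionIDs seen from several IPs) can be sketched as below. This is a minimal illustration and not the procedure used by the authors; the function name `find_session_anomalies`, the reduced (ShopID, IP, SessionID) input layout, and the toy values are assumptions.

```python
from collections import defaultdict


def find_session_anomalies(requests):
    """Flag SessionIDs used on several shops or from several IPs.

    requests: iterable of (shop_id, ip, session_id) tuples (assumed layout).
    Returns (multi_shop, multi_ip): two sets of SessionIDs.
    """
    shops_by_session = defaultdict(set)
    ips_by_session = defaultdict(set)
    for shop_id, ip, session_id in requests:
        shops_by_session[session_id].add(shop_id)
        ips_by_session[session_id].add(ip)
    multi_shop = {s for s, shops in shops_by_session.items() if len(shops) > 1}
    multi_ip = {s for s, ips in ips_by_session.items() if len(ips) > 1}
    return multi_shop, multi_ip


# Toy example with made-up SessionIDs and IPs.
toy_log = [
    (10, "ip1", "s1"),
    (14, "ip1", "s1"),  # same SessionID on a second shop
    (14, "ip2", "s2"),
    (14, "ip3", "s2"),  # same SessionID from a second IP
    (15, "ip4", "s3"),
]
toy_multi_shop, toy_multi_ip = find_session_anomalies(toy_log)
```

Here `toy_multi_shop` contains only s1 and `toy_multi_ip` contains only s2; on the real data, the sizes of these two sets would correspond to the counts reported in the list.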