Benefits of InterSite Pre-Processing and Clustering Methods in E-Commerce Domain Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte.

Presentation transcript:

Benefits of InterSite Pre-Processing and Clustering Methods in E-Commerce Domain Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse AxIS Research Team INRIA Sophia Antipolis and Rocquencourt

Motivations
- To show, on the clickstream dataset proposed for the ECML/PKDD 2005 Discovery Challenge, the benefits of our InterSite pre-processing method, proposed by Tanasa in his PhD thesis (2005)
- To show the benefits of a new crossed clustering method, developed by Lechevallier & Verde (2003, 2004), on Web logs
- 2 main viewpoints: user and web site load

Plan
1. Intersite Data Pre-Processing
- introduction of the user's intersite visit (« Group of SessionIDs »)
- first statistical intersite analysis
2. Crossed Clustering Approach
- confusion table with classes of time periods and classes of product types
- analysis of the most used shop: shop 4
3. Conclusions

Data pre-processing
Initial data:

Table 1. Format of page requests
Columns: ShopID, Date, IP address, SessionID, Page, Referrer
Sample rows (partially preserved):
  SessionID dad92c4…84208dca, Page /
  SessionID ee02ddcff…7655bb9e, Page /ct/?c=148, Referrer http://

Table 2. Number of requests per shop
ShopID  Site name (shop)  #Requests
10      www.shop1.cz      509,688
11      www.shop2.cz      400,045
12      www.shop3.cz      645,724
14      www.shop4.cz      1,290,870
15      www.shop5.cz      308,367
16      www.shop6.cz      298,030
17      www.shop7.cz      164,447
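As a quick check on Table 2, the following sketch computes the total number of requests and each shop's share of it. The counts are copied from Table 2; the names `requests_per_shop` and `request_shares` are illustrative and do not come from the presentation.

```python
# Request counts per shop, copied from Table 2.
requests_per_shop = {
    "www.shop1.cz": 509_688,
    "www.shop2.cz": 400_045,
    "www.shop3.cz": 645_724,
    "www.shop4.cz": 1_290_870,
    "www.shop5.cz": 308_367,
    "www.shop6.cz": 298_030,
    "www.shop7.cz": 164_447,
}


def request_shares(counts):
    """Return each site's fraction of the total number of requests."""
    total = sum(counts.values())
    return {site: n / total for site, n in counts.items()}


total_requests = sum(requests_per_shop.values())
shares = request_shares(requests_per_shop)
# The shop with the largest number of requests.
busiest = max(requests_per_shop, key=requests_per_shop.get)
print(total_requests, busiest, f"{shares[busiest]:.1%}")
```

The largest share belongs to shop 4, which is consistent with its choice as "the most used" shop for the clustering analysis.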

Data pre-processing
Tanasa & Trousse (IEEE Intelligent Systems, 2004); Tanasa's thesis (2005)

Data pre-processing: data structuring
Data fusion, data cleaning

Table 3. Transformed log lines
Columns: Datetime, IP, SessionID, URL, Referrer
Sample rows (partially preserved):
  SessionID dad92c4…84208dca, URL http://
  SessionID ee02ddcff…7655bb9e, URL http://

- A SessionID identifies a single visit on one shop.
- Towards the notion of a user's intersite visit: SessionIDs that belong to a single user (same IP) are grouped into a « Group of SessionIDs ».
- We compare the Referrer with the URLs previously accessed (in a reasonable time window).
- 522,410 SessionIDs are grouped into 397,629 Groups, equivalent to a 23.88% reduction.
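The grouping rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `group_session_ids`, the tuple layout of a request, the exact-match comparison between Referrer and URL, and the 30-minute window are all hypothetical choices (the slides only say "a reasonable time window").

```python
from datetime import datetime, timedelta


def group_session_ids(requests, window=timedelta(minutes=30)):
    """Map each SessionID to a group label (hypothetical helper).

    requests: iterable of (datetime, ip, session_id, url, referrer) tuples.
    Two SessionIDs from the same IP are merged when the referrer of a request
    equals a URL accessed by the other session within `window` (assumed value).
    """
    parent = {}  # union-find forest over SessionIDs

    def find(sid):
        while parent[sid] != sid:
            parent[sid] = parent[parent[sid]]  # path halving
            sid = parent[sid]
        return sid

    recent = {}  # ip -> list of (time, url, session_id) still inside the window
    for when, ip, sid, url, referrer in sorted(requests, key=lambda r: r[0]):
        parent.setdefault(sid, sid)
        history = recent.setdefault(ip, [])
        # Forget accesses that fell out of the time window.
        history[:] = [h for h in history if when - h[0] <= window]
        for _, seen_url, other in history:
            if other != sid and referrer and referrer == seen_url:
                parent[find(sid)] = find(other)  # same intersite visit
        history.append((when, url, sid))
    return {sid: find(sid) for sid in parent}
```

Each distinct value in the returned mapping corresponds to one « Group of SessionIDs »; counting those values before and after grouping gives the kind of reduction reported on the slide.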

Relational DB model Data summarisation

Data pre-processing
Fig. 1. Visits per day and hour: (a) globally, (b) multi-shop
- Low number of new visits on Saturdays and Sundays during lunch time
- High number of new visits on Tuesdays and Wednesdays
- Same results for (a) and (b)

Crossed Clustering Approach for Time Periods/Product Analysis
Data: selection of ls pages in shop 4 (the most used)
Method developed by Yves Lechevallier & Rosanna Verde (2003, 2004)

Crossed Clustering Approach for Time Periods/Product Analysis
Relational DB model: a crossed table is easily added
- Line: an individual (weekday, one hour); 7 days × 24 hours = 168 individuals
- Column: a multi-categorical variable representing the number of products requested by users in the specific time slice
Method developed by Yves Lechevallier & Rosanna Verde (2003, 2004)
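A minimal sketch of how such a crossed table could be built from timestamped requests is given below. It assumes each request has already been reduced to a (datetime, product category) pair; the function name `build_crossed_table` and the in-memory dictionary layout are illustrative, whereas the presentation derives the table from a relational database.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def build_crossed_table(requests):
    """Count requested product categories per (weekday, hour) individual.

    requests: iterable of (datetime, product_category) pairs (assumed layout).
    Returns a dict with 7 x 24 = 168 rows such as "Monday_0", each a Counter
    mapping product category to number of requests.
    """
    table = {f"{day}_{hour}": Counter()
             for day in WEEKDAYS for hour in range(24)}
    for when, category in requests:
        # datetime.weekday() returns 0 for Monday, matching WEEKDAYS.
        row = f"{WEEKDAYS[when.weekday()]}_{when.hour}"
        table[row][category] += 1
    return table
```

Rows with no requests stay as empty counters, so the table always has 168 individuals regardless of how sparse the log is.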

Crossed Clustering Approach for Time Periods/Product Analysis
Table 4. Quantity of products requested by weekday x hour and registered on shop 4
Weekday x Hour  Product (number of requests)
Monday_0        Built-in electric hobs (10), Built-in dish washers 60cm (64), Corner single sinks (50), ...
Monday_1        Free standing combi refrigerators (44), Corner single sinks (50), Built-in hoods (60), ...
…               …
Sunday_22       Built-in microwave ovens (27), Built-in dish washers 45cm (38), Built-in dish washers 60cm (85), ...
Sunday_23       Built-in freezers (56), Kitchen taps with shower (45), Garbage disposers (32), ...

Crossed Clustering Approach for Time Period/Product Analysis
Table 5. Confusion table
Columns: Product_1, Product_2, Product_3, Product_4, Product_5, Total
Rows: seven period classes (Period_…) and Total
[cell values not preserved in this transcript]

Crossed Clustering Approach for Time Period/Product Analysis
Example of one surprising result: the class Product 5 is defined by a single product type, « Free standing combi refrigerators », consulted predominantly on Fridays from 17:00 to 20:00 (class Period 6). 57.7% of the requests for this product type fall in this period.
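The percentage quoted above appears to be a column share of the confusion table: of all requests in one product class, the fraction that falls in one period class. The sketch below shows that calculation under this reading. The function name `column_shares` and the nested-dictionary layout are assumptions, and the counts in the usage lines are made-up values, not figures from Table 5.

```python
def column_shares(confusion, product_class):
    """Fraction of one product class's requests falling in each period class.

    confusion: dict mapping period class -> dict of product class -> count
    (assumed layout). Returns an empty dict if the product class has no requests.
    """
    total = sum(row.get(product_class, 0) for row in confusion.values())
    if total == 0:
        return {}
    return {period: row.get(product_class, 0) / total
            for period, row in confusion.items()}


# Made-up counts for illustration only.
toy = {
    "Period_1": {"Product_5": 10},
    "Period_6": {"Product_5": 30},
}
toy_shares = column_shares(toy, "Product_5")
```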

Conclusions
1. Intersite Data Pre-Processing
- structuring into users' intersite visits (« Group of SessionIDs »)
- first statistical intersite analysis
- anomalies and recommendations for the dataset
2. Crossed Clustering Approach
- first application of such a method to time periods of Web logs and in the e-commerce domain
- promising results

Data pre-processing
Inconsistency problems:
- table kategorie: repeated entries, and different entries with the same ID
- for some page types (dt, df), the given parameter actually represented a specific product, not the given product description (from the products table)
- extra parameters equivalent to the given ones for some page types: e.g. for the ct page type, id is equivalent to the given c parameter
- missing values (descriptions) in tables: 3 values in the product table and 64 in the category table
- multiple-site SessionIDs: 13 cross-server visits had the same SessionID on the visited sites (up to 4 sites); the SessionID should change on each new site
- multiple-IP SessionIDs: 3690 visits (SessionIDs) were made from more than one IP (anonymization proxies?)
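The last two anomalies in the list can be detected with a single pass over the log. The following is a sketch only, assuming each request is reduced to a (ShopID, IP, SessionID) triple; the function name `find_session_anomalies` is hypothetical.

```python
from collections import defaultdict


def find_session_anomalies(requests):
    """Find SessionIDs seen from several IPs or on several shops.

    requests: iterable of (shop_id, ip, session_id) triples (assumed layout).
    Returns (multi_ip, multi_shop), two sets of SessionIDs.
    """
    ips = defaultdict(set)    # session_id -> IP addresses seen
    shops = defaultdict(set)  # session_id -> shop IDs seen
    for shop_id, ip, session_id in requests:
        ips[session_id].add(ip)
        shops[session_id].add(shop_id)
    multi_ip = {sid for sid, seen in ips.items() if len(seen) > 1}
    multi_shop = {sid for sid, seen in shops.items() if len(seen) > 1}
    return multi_ip, multi_shop
```

On the challenge dataset, the sizes of the two returned sets would correspond to the multiple-IP and multiple-site counts listed above, assuming the same definition of a visit.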