September, 13th gR2002, Vienna PAOLO GIUDICI Faculty of Economics, University of Pavia Research carried out within the laboratory: Statistical.

2 giudici@unipv.it September, 13th gR2002, Vienna PAOLO GIUDICI Faculty of Economics, University of Pavia Research carried out within the laboratory: Statistical models for data mining (SMDM)

3 A small sample of web clickstream data (from a logfile)

4 Analysis of web clickstream data
1. In data matrix form (Giudici and Castelo, 2001; Blanc and Giudici, 2001):
- Association measures
- Association models (graphical association models)
2. In transactional data form (in this talk):
- Association and sequence rules
- Statistical models for sequences

5 Association measures and models
Based on data arranged in contingency table form. For instance:
- Odds ratios
- Graphical loglinear models
- Recursive logistic regression models
For a review, see Giudici, Applied Data Mining, Wiley, 2003.

6 Association and sequence rules
Implemented in the main data mining software packages and based on transactional databases. Such databases arise, for instance, in:
- Market basket analysis (order does not matter)
- Web clickstream analysis (order matters)
Aim: search for itemsets (groups of events) that occur simultaneously with high frequency.

7 $A_1, \ldots, A_p$: $p$ binary random variables.
Itemset: a logical expression such as $A = (A_{j_1} = 1, \ldots, A_{j_k} = 1)$, $k < p$.
Association rule: a logical relationship between two itemsets, e.g. if A, then B. Example: A = (Milk, Coffee), B = (Bread, Biscuits).
Sequence rule: the relationship is determined by a temporal order. Example: A = (Home, Register), B = (P_info).
Formally:

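8 Interestingness of a rule
Support: $\text{support}(A \to B) = N_{A \to B} / N$, the proportion of the $N$ transactions in which the rule holds.
Confidence: $\text{confidence}(A \to B) = N_{A \to B} / N_A = \text{support}(A \to B) / \text{support}(A)$
Lift: $\text{lift}(A \to B) = \text{confidence}(A \to B) / \text{support}(B)$
Apriori search algorithm (Agrawal et al., 1995): based on the support.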
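The three measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions are made up (they are not the talk's data), the function names are hypothetical, and the definitions used are the usual ones (support as a relative frequency, confidence as support of the rule over support of the antecedent, lift as confidence over support of the consequent).

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy transactions, invented for illustration only.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"milk", "coffee", "bread", "biscuits"}),
    frozenset({"milk", "coffee", "bread"}),
    frozenset({"milk", "bread", "biscuits"}),
    frozenset({"coffee", "biscuits"}),
    frozenset({"milk", "coffee", "bread", "biscuits"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Proportion of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(a: Iterable[str], b: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support(A -> B) / support(A); raises if A never occurs."""
    a, b = frozenset(a), frozenset(b)
    supp_a = support(a, transactions)
    if supp_a == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return support(a | b, transactions) / supp_a


def lift(a: Iterable[str], b: Iterable[str],
         transactions: List[FrozenSet[str]]) -> float:
    """confidence(A -> B) / support(B)."""
    return confidence(a, b, transactions) / support(b, transactions)


A = {"milk", "coffee"}
B = {"bread", "biscuits"}
print(support(A | B, TRANSACTIONS))   # support of the rule A -> B
print(confidence(A, B, TRANSACTIONS))
print(lift(A, B, TRANSACTIONS))
```

A lift above 1 suggests that B occurs more often together with A than it does overall; a lift close to 1 suggests the rule adds little.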

9 Application to real data
Data set from a logfile of an e-commerce site, kindly supplied by SAS. It contains the user id (C_VALUE), the time of connection (C_TIME) and the page viewed (C_CALLER). Number of clicks: 21889; number of visitors (sessions): 1240.

10 Exploratory step (data selected from a cluster of visitors, N. 3)
Cluster means of each variable, with the overall means in the last row:
Cluster   N. obs   CLICKS   LENGTH   start   %PURCH
1         8802     8        6 min    h. 18   0.034
2         2859     22       17 min   h. 15   0.241
3         1240     18       59 min   h. 13   0.194
4         9251     8        6 min    h. 10   0.039
Overall            10       10 min   h. 14   0.072

11 Remark
Data could have been transformed from transactional to data matrix format. Doing so, information on the order of the visited pages would have been lost.
Data matrix format for the considered data:
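The original figure is not reproduced in this transcript. As a toy illustration of the remark (with invented sessions and hypothetical names, not the talk's data), the sketch below turns ordered click sequences into a binary visitor-by-page matrix and shows that two visitors who saw the same pages in a different order become indistinguishable.

```python
from typing import Dict, List

# Hypothetical sessions: visitor id -> ordered list of visited pages.
SESSIONS: Dict[str, List[str]] = {
    "u1": ["home", "register", "p_info"],
    "u2": ["p_info", "register", "home"],  # same pages as u1, reversed order
    "u3": ["home", "product"],
}


def to_data_matrix(sessions: Dict[str, List[str]]) -> Dict[str, Dict[str, int]]:
    """One row per visitor, one binary column per page (1 = page visited)."""
    pages = sorted({page for clicks in sessions.values() for page in clicks})
    return {
        visitor: {page: int(page in clicks) for page in pages}
        for visitor, clicks in sessions.items()
    }


matrix = to_data_matrix(SESSIONS)
for visitor, row in matrix.items():
    print(visitor, row)

# u1 and u2 have identical rows although their click order differs.
print(matrix["u1"] == matrix["u2"])
```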

12 Application of the apriori algorithm: most frequent indirect sequences of order 2

13 Most frequent indirect sequences of any order

14 Proposal: direct sequences
Only “subsequent” visits are considered. We have inserted two fictitious (deterministic) pages: start_session and end_session.
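A minimal sketch of direct sequence rules of order 2, assuming invented sessions and hypothetical function names. It is not the software used in the talk. Here a direct pair (A, B) means that B was the very next page after A. Support is taken as the share of sessions that contain the pair, and confidence as that count divided by the number of sessions that contain A; other conventions are possible.

```python
from collections import Counter
from typing import Dict, List, Tuple

START, END = "start_session", "end_session"

# Hypothetical sessions, each an ordered list of visited pages.
SESSIONS: List[List[str]] = [
    ["home", "program", "product", "addcart"],
    ["home", "program", "product"],
    ["home", "p_info"],
    ["program", "product"],
]


def direct_rules(sessions: List[List[str]]) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Map each direct pair (A, B) to (support, confidence)."""
    n = len(sessions)
    pair_sessions: Counter = Counter()  # sessions containing the adjacent pair
    page_sessions: Counter = Counter()  # sessions containing the page
    for session in sessions:
        path = [START, *session, END]   # add the two fictitious pages
        pair_sessions.update(set(zip(path, path[1:])))
        page_sessions.update(set(path))
    return {
        pair: (count / n, count / page_sessions[pair[0]])
        for pair, count in pair_sessions.items()
    }


rules = direct_rules(SESSIONS)
for (a, b), (supp, conf) in sorted(rules.items(), key=lambda kv: -kv[1][0]):
    print(f"{a} -> {b}: support={supp:.2f}, confidence={conf:.2f}")
```

In this toy data, home and product never appear next to each other, so no direct rule links them, although an indirect rule would.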

15 Most frequent direct sequences of order 2

16 Towards a global model: graphical representation of direct association rules

17 Link analysis representation

18 Global models for web mining
Sequence rules are an instance of a local model (or pattern, see Hand et al., 2001) of data mining. A local model draws statistical conclusions on parts of the dataset, rather than on the whole. Link analysis is an example of a global descriptive model. We have considered two global inferential models:
- probabilistic expert systems
- Markov chains

19 Probabilistic expert systems
Graphical models that describe (recursive) dependencies between (binary) random variables. They can be described by a directed conditional independence graph, which specifies the factorisation of the joint probability distribution. They ARE NOT directly comparable with sequence rules, which are local indexes for studying dependencies between events (itemsets). They are built from contingency table data, and thus DO NOT model the order of visits to pages.
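The factorisation implied by a directed acyclic graph can be written out as follows. The general formula is the standard one for directed graphical models; the three-variable example uses hypothetical binary page indicators (Home, Product, Addcart) and a hypothetical graph Home → Product → Addcart, not the structure learned in the talk.

```latex
% General factorisation: pa(i) denotes the parents of node i in the graph.
p(x_1, \ldots, x_p) = \prod_{i=1}^{p} p\left(x_i \mid x_{\mathrm{pa}(i)}\right)

% Hypothetical example with the graph Home -> Product -> Addcart:
p(x_{\mathrm{home}}, x_{\mathrm{product}}, x_{\mathrm{addcart}})
  = p(x_{\mathrm{home}})\,
    p(x_{\mathrm{product}} \mid x_{\mathrm{home}})\,
    p(x_{\mathrm{addcart}} \mid x_{\mathrm{product}})
```

In the example, the graph encodes that Addcart is conditionally independent of Home given Product.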

20 Probabilistic expert systems: structural learning

21 Probabilistic expert systems: quantitative learning

22 Markov Chains for web mining
Ideal to model dependencies between events. The order of the chain parallels the order of a sequence rule. Data have been structured in the following form:

23 Results from Markov chains (entrance to the site: start session)

24 Exit from the site (end session)

25 Most likely paths
[Figure: most likely paths through the pages Start_session, Home, Program, Product and P_info, with edge labels 45.81%, 17.80%, 70.18% and 26.73%]
Markov chains ARE DIRECTLY comparable with direct sequence rules. E.g. for the most likely path: from start_session, the highest confidence is with home (45.81%), then program (20.39%), product (78.09%) and addcart (28.79%). There are small differences, because the apriori algorithm considers only rules with support higher than a fixed threshold (e.g. 5%).
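The comparison can be illustrated with a first-order Markov chain estimated from click sequences. This is a sketch under stated assumptions: the sessions are invented, the function names are hypothetical, transition probabilities are plain relative frequencies, and the "most likely path" is built greedily by following the highest-probability transition at each step (which need not be the jointly most probable path).

```python
from collections import Counter, defaultdict
from typing import Dict, List

START, END = "start_session", "end_session"

# Hypothetical sessions, each an ordered list of visited pages.
SESSIONS: List[List[str]] = [
    ["home", "program", "product", "addcart"],
    ["home", "program", "product"],
    ["home", "p_info"],
    ["program", "product"],
]


def transition_probs(sessions: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next page | current page) by relative frequencies."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for session in sessions:
        path = [START, *session, END]
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for current, following in counts.items()
    }


def greedy_path(probs: Dict[str, Dict[str, float]],
                start: str = START, max_steps: int = 10) -> List[str]:
    """Follow the most probable transition at each step, up to max_steps."""
    path = [start]
    while path[-1] in probs and len(path) <= max_steps:
        following = probs[path[-1]]
        path.append(max(following, key=following.get))
    return path


probs = transition_probs(SESSIONS)
print(probs[START])
print(greedy_path(probs))
```

No support threshold is applied here, so every observed transition contributes to the estimates. A rule-based search that discards low-support pairs can therefore report slightly different values.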

26 Essential references
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A.I. (1995) Fast discovery of association rules, in: Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge.
Giudici, P. (2003) Applied Data Mining. Wiley, London.
Giudici, P. and Castelo, R. (2001) Association models for web mining. Journal of Knowledge discovery and data mining, 5, pp. 183-196.
Hand, D.J., Mannila, H. and Smyth, P. (2001) Principles of Data Mining. MIT Press, New York.
Hastie, T., Tibshirani, R. and Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.

27 THANKS FOR THE ATTENTION! Comments to: giudici@unipv.it www.baystat.it/giudici/index.htm
