Advanced databases – Inferring implicit/new knowledge from data(bases): Web mining, esp. Web usage mining. Bettina Berendt, Katholieke Universiteit Leuven.


1 Advanced databases – Inferring implicit/new knowledge from data(bases): Web mining, esp. Web usage mining
Bettina Berendt, Katholieke Universiteit Leuven, Department of Computer Science
Last update: 29 November 2007

2 Semi-structured and unstructured data
- Unstructured data: has "no" structure (esp. not a relational one)
- Common sources of unstructured data include:
  - Documents: Word documents, PowerPoint presentations, newsletters, source code, hard-copy documents
  - Images and graphics
- Semi-structured data: has "some" structure (partly structured, partly unstructured)
- Common sources of semi-structured data include:
  - TCP/IP packets
  - XML data
  - Images and graphics
  - Documents (all listed previously)
→ Web and text as two particularly interesting representatives

3 Agenda
- Intro: Web Mining, specifically Web Usage Mining
- Data Acquisition, Understanding, and Preparation
- Forms of analysis; mining techniques
- Case study 1: A multi-channel retailer → method: Association-rule discovery
- Case study 2: Search in an educational portal → method: Sequence mining / generalized-sequence discovery
- Case study 3: Search in a community portal

4 What Web pages answer my information need?

5 What Web pages are "good" (better than others)?

6 What should I buy?

7 CRM questions example: Why go to a shop if everything is available on the Internet?

8 How do people search?

9 Web Mining
- Knowledge discovery (aka data mining): "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." [1]
- Web Mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources.
- Web mining areas:
  - Web content mining
  - Web structure mining
  - Web usage mining
[1] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press.

10 Web Usage Mining: Basics and data sources
Definition of Web usage mining:
- discovery of meaningful patterns from data generated by client-server transactions on one or more Web servers
Typical sources of data:
- automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies
- e-commerce and product-oriented user events (e.g., shopping cart changes, ad or product click-throughs, purchases)
- user profiles and/or user ratings
- meta-data, page attributes, page content, site structure
This is a slide from

11 Web usage is more than "browsing": Interactions on the Web
Social viewpoint
- User – server
  - Search engine
  - Online store
  - Digital library
  - ...
- User – user
  - "Web 2.0" (and all its precursors)
Technical viewpoint
- Access content ("read")
- Create content ("write")
- Navigate

12 Structure of the rest (as always...)



15 Data collection (diagram): Web server, Proxy, Client (Browser)

16 What's in a typical Web server log …
[01/Jun/1999:03:09: ] "GET /Calls/OWOM.html HTTP/1.0" " &maxhits=20&cat=dir" "Mozilla/4.5 [en] (Win98; I)"
[01/Jun/1999:03:09: ] "GET /Calls/Images/earthani.gif HTTP/1.0" " "Mozilla/4.5 [en] (Win98; I)"
[01/Jun/1999:03:09: ] "GET /Calls/Images/line.gif HTTP/1.0" " "Mozilla/4.5 [en] (Win98; I)"
[01/Jun/1999:03:12: ] "GET / HTTP/1.0" "" "Mozilla/4.06 [en] (Win95; I)"
[01/Jun/1999:03:12: ] "GET /Images/line.gif HTTP/1.0" " "Mozilla/4.06 [en] (Win95; I)"
[01/Jun/1999:03:12: ] "GET /Images/red.gif HTTP/1.0" " "Mozilla/4.06 [en] (Win95; I)"
[01/Jun/1999:03:12: ] "GET /Images/earthani.gif HTTP/1.0" " "Mozilla/4.06 [en] (Win95; I)"
[01/Jun/1999:03:13: ] "GET /CP.html HTTP/1.0" " "Mozilla/4.06 [en] (Win95; I)"
[01/Jun/1999:03:13: ] "GET /Calls/AWAC.html HTTP/1.0" " "Mozilla/4.5 [en] (Win98; I)"
(Requests to

17 … and what does it mean? (The slide repeats the log excerpt from the previous slide.)

18 Sources and destinations
Logs may extend beyond visits to the site and show where a visitor was before (referrer):
[01/Jun/1999:03:09: ] "GET /Calls/OWOM.html HTTP/1.0" " bin/pursuit?query=advertising+psychology-&maxhits=20&cat=dir" "Mozilla/4.5 [en] (Win98; I)"
... and where s/he went next (URL rewriting):
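To make the fields of such a log entry concrete, here is a minimal sketch of a parser for the widely used Combined Log Format (host, identity, user, timestamp, request line, status, size, referrer, user agent). The excerpts in this transcript have lost several fields, so the example line is hypothetical: the host address, seconds, time zone, status code, size, and referrer host are invented for illustration.

```python
import re
from datetime import datetime

# Combined Log Format:
# host ident user [time] "method url protocol" status size "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

# Hypothetical entry (invented host, seconds, zone, status, size, referrer host).
example = (
    '192.0.2.10 - - [01/Jun/1999:03:09:21 -0600] '
    '"GET /Calls/OWOM.html HTTP/1.0" 200 3942 '
    '"http://search.example/pursuit?query=advertising+psychology-&maxhits=20&cat=dir" '
    '"Mozilla/4.5 [en] (Win98; I)"'
)
record = parse_line(example)
```

The referrer field is what tells the analyst where the visitor was before the request, here a search-engine result page whose query string also reveals the search terms.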

19 Preprocessing of Web Usage Data (diagram). Elements shown: Raw Usage Data, Data Cleaning, User/Session Identification, Page View Identification, Path Completion, Episode Identification, Server Session File, Episode File, Site Structure and Content, Usage Statistics.

20 Preprocessing of Web Usage Data (same diagram as on the previous slide, with some steps marked "not always necessary and/or done").

21 Data Preprocessing (1)
Data cleaning
- remove irrelevant references and fields in server logs
- remove references due to spider navigation
- remove erroneous references
- add missing references due to caching (done after sessionization)
Data integration
- synchronize data from multiple server logs
- integrate semantics, e.g.,
  - meta-data (e.g., content labels)
  - e-commerce and application server data
- integrate demographic / registration data
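The first three cleaning steps can be sketched as a simple filter over parsed log records. This is a minimal illustration, not the procedure of any particular tool: the record layout (dicts with `url`, `agent`, and `status` keys), the file-extension list, and the robot keywords are all assumptions made for the example.

```python
import re

# Assumed list of embedded-object extensions that are irrelevant for usage analysis.
IRRELEVANT = re.compile(r"\.(gif|jpe?g|png|css|js|ico)$", re.IGNORECASE)
# Assumed keywords that hint at a spider/robot in the user-agent string.
ROBOT_HINTS = ("bot", "crawler", "spider")

def is_relevant(record):
    """Keep only successful page requests made by (presumably) human visitors."""
    path = record["url"].split("?", 1)[0]
    if IRRELEVANT.search(path):
        return False                      # irrelevant reference (image, style sheet, ...)
    if path == "/robots.txt":
        return False                      # only robots request this file
    if any(hint in record["agent"].lower() for hint in ROBOT_HINTS):
        return False                      # reference due to spider navigation
    if record["status"] >= 400:
        return False                      # erroneous reference
    return True

def clean(records):
    return [r for r in records if is_relevant(r)]

sample = [
    {"url": "/Calls/OWOM.html", "agent": "Mozilla/4.5 [en] (Win98; I)", "status": 200},
    {"url": "/Calls/Images/line.gif", "agent": "Mozilla/4.5 [en] (Win98; I)", "status": 200},
    {"url": "/CP.html", "agent": "ExampleBot/1.0", "status": 200},
    {"url": "/missing.html", "agent": "Mozilla/4.06 [en] (Win95; I)", "status": 404},
]
cleaned = clean(sample)
```

Keyword matching on the user agent only catches robots that identify themselves; stealthier robots need behavioural heuristics.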

22 Data Preprocessing (2)
Data transformation
- user identification
- sessionization / episode identification
- pageview identification
  - a pageview is a set of page files and associated objects that contribute to a single display in a Web browser
Data reduction
- sampling and dimensionality reduction (ignoring certain pageviews / items)
Identifying user transactions (i.e., sets or sequences of pageviews, possibly with associated weights)

23 Why sessionize?
- Quality of the patterns discovered in KDD depends on the quality of the data on which mining is applied.
- In Web usage analysis, these data are the sessions of the site visitors: the activities performed by a user from the moment she enters the site until the moment she leaves it.
- Difficult to obtain reliable usage data due to proxy servers and anonymizers, dynamic IP addresses, missing references due to caching, and the inability of servers to distinguish among different visits.
- Cookies and embedded session IDs produce the most faithful approximation of users and their visits, but are not used in every site, and not accepted by every user.
- Therefore, heuristics are needed that can sessionize the available access data.

24 Mechanisms for User Identification
Examples: page tags (using JavaScript), some browser plugins

25 Examples of "software agents"
Page tagging with JavaScript: see also

26 Sessionization strategies: Sessionization heuristics These heuristics are quite accurate! (see Spiliopoulou et al., 2003)
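The table of heuristics from this slide is not preserved in the transcript. As an illustration, here is a minimal sketch of two common time-oriented heuristics: a session is closed when the gap between two consecutive requests exceeds a page-stay limit, or when the session's total duration exceeds an upper bound. The thresholds (10 and 30 minutes), the function name, and the input layout are assumptions for the example.

```python
from collections import defaultdict

def sessionize(requests, max_gap=600, max_duration=1800):
    """Split each user's requests into sessions.

    requests: iterable of (user_key, timestamp_in_seconds, url); the user key
    could be a cookie ID or an IP address + user agent pair (an assumption).
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()                       # order each user's requests by time
        current = []
        for ts, url in hits:
            too_long_gap = current and ts - current[-1][0] > max_gap
            too_long_total = current and ts - current[0][0] > max_duration
            if too_long_gap or too_long_total:
                sessions.append((user, [u for _, u in current]))
                current = []
            current.append((ts, url))
        if current:
            sessions.append((user, [u for _, u in current]))
    return sessions

log = [
    ("u1", 0, "/"), ("u1", 120, "/a"), ("u1", 2000, "/b"),
    ("u2", 50, "/"), ("u1", 2100, "/c"),
]
sessions = sessionize(log)
```

A referrer-based heuristic, which starts a new session when a request's referrer was not seen earlier in the current session, would be a natural addition when the log records referrers.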

27 Path Completion
- Refers to the problem of inferring missing user references due to caching.
- Effective path completion requires extensive knowledge of the link structure within the site.
- Referrer information in server logs can also be used in disambiguating the inferred paths.
- The problem gets much more complicated in frame-based sites.
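A minimal sketch of the idea, assuming the site's link structure is available as a dictionary and each request carries its referrer: when a request cannot be explained by the page viewed just before it, the visitor presumably pressed "back" through cached pages, so the sketch walks back through the session history until it finds a page that is the referrer or links to the requested page, and inserts the backtracked pages. The function name, data layout, and page names are hypothetical.

```python
def complete_path(session, links):
    """Insert presumed cached 'back' steps into a session.

    session: list of (url, referrer) pairs in request order.
    links:   dict mapping a url to the set of urls it links to.
    """
    path = []
    for url, referrer in session:
        explained = (
            not path
            or path[-1] == referrer
            or url in links.get(path[-1], set())
        )
        if not explained:
            backtrack = []
            # Walk backwards through earlier pages of the session.
            for previous in reversed(path[:-1]):
                backtrack.append(previous)
                if previous == referrer or url in links.get(previous, set()):
                    path.extend(backtrack)   # pages revisited from the cache
                    break
        path.append(url)
    return path

site_links = {"A": {"B", "C"}, "B": {"D"}}
observed = [("A", ""), ("B", "A"), ("D", "B"), ("C", "A")]
completed = complete_path(observed, site_links)
```

In the example, page C is requested from A although the last logged page was D, so the visits to B and A that the cache hid from the server are restored.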

28 Why integrate semantics?
Basic idea: associate each requested page with one or more domain concepts, to better understand the process of navigation / Web usage.
Example: a shopping site
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03: ] "GET /search.html?l=ostsee%20strand&syn=023785&ord=asc HTTP/1.0"
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05: ] "GET /search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc HTTP/1.0"
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06: ] "GET /mlesen.html?Item=3456&syn= HTTP/1.0"
From ... to ... (concept labels in the figure): Search by category; Search by category + title; Refine search; Choose item; Look at individual product

29 From URLs to topics / concepts: Basics of semantic session modelling
- 1 request → 1 concept or n concepts
- Concepts can concern content or service
- Concepts can be part of an ontology (simple case: concept hierarchy)
- Session = set / sequence / tree / graph of requests
→ also possible: n requests → 1 concept
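The mapping from a request to one or more concepts can be sketched with a small rule table. The URL shapes follow the shopping-site example from the earlier slide, but the rules and the concept names (`search`, `refine_search`, `view_product`, `other`) are hypothetical; a real mapping would be derived from inspecting the site.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Each rule: (pattern on the URL path, function from query parameters to a concept).
# Hypothetical rules for illustration only.
RULES = [
    (re.compile(r"^/search\.html$"),
     lambda query: "refine_search" if "p" in query else "search"),
    (re.compile(r"^/mlesen\.html$"),
     lambda query: "view_product"),
]

def concepts_for(url):
    """Return the list of concepts a requested URL is mapped to."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    matched = [label(query) for pattern, label in RULES if pattern.match(parts.path)]
    return matched or ["other"]

requests = [
    "/search.html?l=ostsee%20strand&syn=023785&ord=asc",
    "/search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc",
    "/mlesen.html?Item=3456&syn=",
]
session_concepts = [concepts_for(u) for u in requests]
```

Returning a list keeps the "1 request → n concepts" case open: several rules may match the same request.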

30 Ontology-based behaviour modelling – basic ideas (1) The request for a Web page signals interest in the concept(s) and relations dealt with in this page – interest in the obtained content as well as in the requested service. Formally: a request as a (multi)set, or as a vector, of concepts/relations.

31 Resulting format: if the request is the instance
Usually a flat file (format like a Web server log) or a database

32 Resulting format: if a session is the instance
- What features can a session have?
- Refer again to the example:
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03: ] "GET /search.html?l=ostsee%20strand&syn=023785&ord=asc HTTP/1.0"
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05: ] "GET /search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc HTTP/1.0"
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06: ] "GET /mlesen.html?Item=3456&syn= HTTP/1.0"
Concept labels in the figure: Search by category; Search by category + title; Refine search; Choose item; Look at individual product

33 Web Usage and E-Business Analytics: Basic Framework for E-Commerce Data Analysis (diagram). Elements shown: Web/Application Server Logs; Data Cleaning / Sessionization Module; Site Map; Site Dictionary; Content Analysis Module; Site Content; Operational Database (customers, orders, products); Integration Module; Integrated Sessionized Data; E-Commerce Data Mart; Data Cube; Data Mining Engine; OLAP Tools; Session Analysis / Static Aggregation; Pattern Analysis; OLAP Analysis.


35 Web Usage and E-Business Analytics
Different levels of analysis:
- Session analysis
- Static aggregation and statistics
- OLAP
- Data mining

36 Session Analysis
Simplest form of analysis: examine individual or groups of server sessions and e-commerce data.
Advantages:
- Gain insight into typical customer behaviors.
- Trace specific problems with the site.
Drawbacks:
- LOTS of data.
- Difficult to generalize.

37 Static Aggregation (Reports)
Most common form of analysis. Data aggregated by predetermined units such as days or sessions. Generally gives most "bang for the buck."
Advantages:
- Gives quick overview of how a site is being used.
- Minimal disk space or processing power required.
Drawbacks:
- No ability to "dig deeper" into the data.

38 Online Analytical Processing (OLAP)
Allows changes to aggregation level for multiple dimensions. Generally associated with a Data Warehouse.
Advantages & drawbacks:
- Very flexible
- Requires significantly more resources than static reporting.

39 Data Mining: Going deeper
- Sequence mining, Markov chains: prediction of next event
- Association rules: discovery of associated events or application objects
- Clustering: discovery of visitor groups with common properties and interests
- Session clustering: discovery of visitor groups with common behaviour
- Classification: characterization of visitors with respect to a set of predefined classes; card fraud detection
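As one concrete instance of "prediction of next event", here is a minimal sketch of a first-order Markov chain estimated from sessions: transition counts between consecutive pages are collected, and the most frequent successor of the current page is predicted. The page names and function names are invented for the example.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Count how often each page is followed directly by each other page."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, page):
    """Return (most likely next page, estimated probability), or None if unseen."""
    if page not in counts:
        return None
    following, n = counts[page].most_common(1)[0]
    return following, n / sum(counts[page].values())

sessions = [
    ["home", "catalog", "product"],
    ["home", "catalog", "cart"],
    ["home", "catalog", "product"],
    ["home", "search"],
]
model = fit_transitions(sessions)
prediction = predict_next(model, "home")
```

A first-order chain only looks at the current page; higher-order models or sequence mining take longer histories into account at the cost of sparser counts.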

40 KDD Techniques for Web Applications: Examples (1)
Calibration of a Web server:
- Prediction of the next page invocation over a group of concurrent Web users under certain constraints
  - Sequence mining, Markov chains
Cross-selling of products:
- Mapping of Web pages/objects to products
- Discovery of associated products
  - Association rules, sequence mining
- Placement of associated products on the same page

41 KDD Techniques for Web Applications: Examples (2)
Sophisticated cross-selling and up-selling of products:
- Mapping of pages/objects to products of different price groups
- Identification of customer groups
  - Clustering, classification
- Discovery of associated products of the same/different price categories
  - Association rules, sequence mining
- Formulation of recommendations to the end-user
  - Suggestions on associated products
  - Suggestions based on the preferences of similar users


43 CRM questions example: Why go to a shop if everything is available on the Internet?

44 A multi-channel retailer, its business goals, and analysis questions
- General goals: "standard e-tailer goals": attract users/shoppers and convert them into customers
- Specific goals: assess the success of the Web site in relation to other distribution channels
→ Questions of the evaluation:
  - What business metrics can be calculated from Web usage data, transaction and demographic data for determining online success?
  - Are there cross-channel effects between a company's e-shop and its physical stores?
Background: Internet market shares [BCG 2002]

45 The site

46 Outline of the KDD process
- Business understanding: customer buying process
- Data: Web server sessions, transaction info
- Data understanding, main step: modelling the semantics of the site in terms of a hierarchy of service concepts
- Data preparation:
  - Session IDs; usual data cleaning steps
  - Linking of sessions & transaction information (anonymized)
- Modelling / pattern discovery:
  - Web metrics, cluster analysis, association rules, sequence mining + correlation analysis, questionnaire study, qualitative market analysis
- Evaluation: interesting patterns

47 Agenda – Case Study
- Business understanding
- Data understanding and preparation
- Pattern discovery + evaluation: Success metrics
- Pattern discovery + evaluation: Behavioural patterns
- Pattern discovery + evaluation: User types
- Pattern discovery + evaluation: Behaviour & demographics


49 Description of the site and its services
- The retailer operates an e-shop and more than 5000 retail shops in over 10 European countries
- It sells a wide range of consumer electronics
- Online customers can pay, pick-up/deliver and return both online and offline
- Web pages provide for all tasks in the customer buying process

50-55 Purchase Phases (Page Concepts) at Large MC Retailers
1. Acquisition (home): all Web pages that are semantically related to the initial acquisition of a visitor.
2. Catalogue information: pages providing an overview of product categories.
3. Information product (infprod): pages displaying information about a specific product.
4. Offline information (offinfo): all pages related to any offline information: store locator (pages for finding physical stores in one's neighbourhood), information about offline services, offline referrers etc.
5. Transaction (transact): steps before an actual purchase, starting with a customer entering the order process: check-out, input of customer data, payment and delivery preferences (online or offline), etc.
6. Purchase: indicates if a visitor completed the transaction process and bought a product, e.g. invocation of an order confirmation page.
Figure labels, added one per slide: Home (Acquisition), Product Impression, Product Click-Through, Offlineinfo, Transaction, Purchase.

56 Agenda – Case Study
- Business understanding
- Data understanding and preparation
- Pattern discovery + evaluation: Behavioural patterns

57 Data and data preparation
Data sources and sample:
- 92,467 sessions from the company's Web logs from 21 days in 2002
- anonymized transaction information of 13,653 customers who bought online over a period of 8 months in 2001/02
- 621 transaction records (21 days) were linked to Web-usage records
Data preparation:
- Sessions were determined by session IDs
- Robot visits eliminated, usual data cleaning steps
- Each URL request mapped to a service concept from {c_1, ..., c_n}
- Session representation: s = [w_1, ..., w_n], with w_i = weight of c_i, indicating whether or not the concept was visited (1/0), or how often it was visited
- Customer record: feature vector incl. session and transaction data
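The session representation s = [w_1, ..., w_n] can be sketched as follows, with one function per weighting scheme: visit counts, and a 1/0 indicator. The concept list uses the column names from the case study's session table (`purch.` is spelled out as `purchase`); the example session itself is invented.

```python
from collections import Counter

# Service concepts c_1 ... c_n; order fixes the vector positions.
CONCEPTS = ["home", "infcat", "infprod", "service", "transact", "purchase", "offinfo"]

def weighted_vector(session):
    """w_i = how often concept c_i was visited in the session."""
    counts = Counter(session)
    return [counts[c] for c in CONCEPTS]

def dichotomized_vector(session):
    """w_i = 1 if concept c_i was visited at least once, else 0."""
    return [1 if w > 0 else 0 for w in weighted_vector(session)]

# Hypothetical session, already mapped from URLs to concepts.
session = ["home", "infcat", "infprod", "infprod", "transact"]
weighted = weighted_vector(session)
dichotomized = dichotomized_vector(session)
```

Vectors of this kind can be fed directly into clustering or classification, while the underlying concept sequence is kept for sequence mining.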

58 Site semantics: A service concept hierarchy
760,535 page requests were mapped onto the concepts from this hierarchy (figure). Node labels as they appear in the figure: Any; Information; Transaction; Services; Information Product; Fulfillment/Service; Customer Data; Shopping Cart; Payment; Company Infos; Registration; Other; Acquisition; Offline Referrer; Advertiser; Other; Store Locator; Information Catalog; Home; Game; Offline Service and Support. Legend: highlighted nodes = Multi-Channel Concept.

59 Types of patterns
- Conversion rates (~ confidence of content-specified sequential association rules) for assessing business success
- Association rule and sequence analysis for understanding online/offline preferences and their temporal development
- Cluster analysis for customer segmentation
- Correlation analysis for investigating the relationship between demographic indicators and online/offline preferences
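A conversion rate in this sense can be sketched as the confidence of a sequential rule "start concept, later followed by goal concept": the share of sessions that reach the goal after visiting the start concept, among all sessions that visit the start concept. The function name and the sample sessions are invented for illustration; this is one plausible reading of the metric, not the study's exact definition.

```python
def conversion_rate(sessions, start, goal):
    """Fraction of sessions visiting `start` that later also visit `goal`."""
    reached_start = 0
    converted = 0
    for session in sessions:
        if start not in session:
            continue
        reached_start += 1
        first = session.index(start)          # first visit to the start concept
        if goal in session[first + 1:]:
            converted += 1
    return converted / reached_start if reached_start else 0.0

# Hypothetical sessions as sequences of service concepts.
sessions = [
    ["home", "infprod", "transact", "purchase"],
    ["home", "infprod"],
    ["infprod", "transact"],
    ["home", "offinfo"],
]
product_to_purchase = conversion_rate(sessions, "infprod", "purchase")
checkout_to_purchase = conversion_rate(sessions, "transact", "purchase")
```

Comparing rates along the purchase phases shows where visitors drop out of the buying process.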

60 Session representation
- Each session is represented as a feature vector on the multi-channel concepts.
- Two methods are used for the definition of new conversion metrics:
  → weighted-concept method (number of visits to a concept)
  → dichotomized-concept method (whether or not the concept was visited)
(Two example tables with rows for sessions A and B and columns home, infcat, infprod, service, transact, purch., offinfo.)


62 "Internal consistency" of preferences: payment and delivery preferences
- Online payment → Direct delivery (s=0.27, c=0.97): < 1/3 traditional online users!
- Online payment → In-store pickup (s=0.02, c=0.03)
- Cash on delivery → Direct delivery (s=0.02, c=0.03)
- In-store payment → In-store pickup (s=0.69, c=0.94)
→ The site is primarily used to collect information.
s: support, c: confidence of the sequence

63 "Internal consistency" of preferences: return preferences
- Return → In-store (s=0.06, c=0.87)
- Return → Mail-in (s=0.04, c=0.13)
→ Customers may wish personal assistance (a result supported by the service mix analysis of different multi-channel retailers and by questionnaire results).
s: support, c: confidence of the association rule

64 Development of preferences over time
- Direct delivery → In-store pickup in ≥ 1 following transaction (s=0.001, c=0.15)
- Direct delivery → Direct delivery in all following transactions (s=0.003, c=0.85)
- In-store pickup → Direct delivery in ≥ 1 following transaction (s=0.001, c=0.10) (*)
- In-store pickup → In-store pickup in all following transactions (s=0.004, c=0.90)
Results for payment migration are similar.
→ 90% of repeat customers did not change transaction preferences at all.
→ Rule (*) as an indicator of the development of trust?!
s: support, c: confidence of the sequence


66 Association-rule mining
Coenen, F. (2003). Association rule mining and its wider context. AI2003 Association Rule Mining Tutorial, Cambridge, December
pp. 5 – 20, covering
→ What is an association rule?
→ What are interestingness measures for association rules?
  → support, confidence, lift (there are also further measures)
  → cf. the "performance measures" recall, precision, etc. for classifiers
→ How is association-rule mining performed?
  → the basic apriori algorithm
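The three interestingness measures named above can be computed directly from their standard definitions: support is the share of transactions containing both sides of the rule, confidence is that count divided by the number of transactions containing the antecedent, and lift is confidence divided by the share of transactions containing the consequent. The sketch below uses invented transactions whose item names echo the case study; it does not reproduce the study's data.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_c = sum(1 for t in transactions if c <= set(t))
    n_ac = sum(1 for t in transactions if (a | c) <= set(t))
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Hypothetical transactions (payment and delivery choice per order).
transactions = [
    {"online_payment", "direct_delivery"},
    {"online_payment", "direct_delivery"},
    {"instore_payment", "instore_pickup"},
    {"instore_payment", "instore_pickup"},
    {"online_payment", "instore_pickup"},
]
support, confidence, lift = rule_measures(
    transactions, {"online_payment"}, {"direct_delivery"}
)
```

A lift above 1 means the consequent occurs more often together with the antecedent than its overall frequency would suggest. Apriori adds the search strategy on top of these measures: it enumerates itemsets level by level and prunes every candidate that has an infrequent subset.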


68 The site
Business understanding / problem definition:
* How do users search in this online catalog?
* Which search criteria are popular?
* Which are efficient?

69 The concept hierarchies / site ontology (excerpt)
- SEITE1-...LI (1st page of a list) or SEITEn-...LI (further page)
- LA ("Land": country or state), SA ("Schulart": school type), SU ("Suche": search)

70 Sequence mining: one result pattern, a successful search for a school in Germany
Figure labels: a refinement, a repetition, a continuation
One example pattern (Berendt & Spiliopoulou, VLDB J. 2000):
    select t
    from node a b, template a * b as t
    where a.url startswith "SEITE1-" and a.occurrence = 1
      and b.url contains "1SCHULE" and b.occurrence = 1
      and (b.support / a.support) >= 0.2
Example request URL:
    /liste.html?offset=920&zeilen=20&anzahl=1323&sprache=de&sw_kategorie=de&erscheint=&suchfeld=&suchwert=&staat=de&region=by&schultyp=
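The intent of the query's template `a * b` with a confidence threshold can be approximated in a few lines: count the sessions in which a page matching `a` occurs, count those in which a page matching `b` follows after any number of intermediate pages, and compare the ratio with 0.2. This is a simplified sketch over plain session lists, not the WUM algorithm itself; the function name and the concrete page names are hypothetical.

```python
def template_confidence(sessions, is_a, is_b):
    """Confidence of the generalized sequence 'a, then anything, then b'.

    Returns (confidence, sessions containing a, sessions containing a ... b).
    """
    n_a = 0
    n_ab = 0
    for session in sessions:
        # Position of the first page matching a, if any.
        first_a = next((i for i, url in enumerate(session) if is_a(url)), None)
        if first_a is None:
            continue
        n_a += 1
        if any(is_b(url) for url in session[first_a + 1:]):
            n_ab += 1
    return (n_ab / n_a if n_a else 0.0), n_a, n_ab

# Hypothetical sessions; page names follow the SEITE1-...LI naming scheme.
sessions = [
    ["SEITE1-LALI", "SEITEn-LALI", "1SCHULE"],
    ["SEITE1-SULI"],
    ["SEITE1-LALI", "1SCHULE"],
    ["HOME", "1SCHULE"],
]
confidence, n_a, n_ab = template_confidence(
    sessions,
    is_a=lambda url: url.startswith("SEITE1-"),
    is_b=lambda url: "1SCHULE" in url,
)
pattern_is_frequent_enough = confidence >= 0.2
```

The wildcard in the template is what makes the sequence "generalized": intermediate list pages between the first result page and the school page are allowed but not required.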


72 An overview of the WUM formalism and algorithm Berendt, B. (2007). Web Usage Mining - Modelling: frequent- pattern mining I (sequence mining with WUM, classification and clustering). pp


74 The site

75 Understanding the semantics of requests
Step 1: Domain ontology
- Community portal: ka2portal.aifb.uni-karlsruhe.de
- Ontology-based: knowledge base in F-Logic
- Static pages: annotations
- Dynamic pages: generated from queries
- Queries are also in F-Logic
- Logs contain these queries
(Figure label: affiliation)

76 Agenda
- Intro: Web Mining, specifically Web Usage Mining
- Data Acquisition, Understanding, and Preparation
- Forms of analysis; mining techniques
- Case study 1: A multi-channel retailer → method: Association-rule discovery
- Case study 2: Search in an educational portal → method: Sequence mining / generalized-sequence discovery
- Case study 3: Search in a community portal → method

77 You decide!

78 In the preparation of a log file (recommendations for open-source tools are shown in green)
1. Use qualitative methods for application understanding (read!)
2. Inspect the site and the URLs for data understanding
   1. Generate Analog reports for getting base statistics of usage
   2. Build concept system / hierarchy and mapping: URLs → concepts (notation: WUMprep regex)
3. Use WUMprep for data preparation
   1. Remove unwanted entries (pictures etc.)
   2. Sessionize
   3. Remove robots
   4. Replace URLs by concepts
   5. (Build a database)
4. Use WEKA for modelling
   1. [ Transform log file into ARFF (WUMprep4WEKA) ]
   2. Cluster, classify, find association rules, ...
5. Use WUM for modelling
6. Select patterns based on objective interestingness measures (support, confidence, lift, ...) and on subjective interestingness measures (unexpected? application-relevant?)
7. Present results in tabular, textual and graphical form (use Excel, ...)
8. Interpret the results
9. Make recommendations for site improvement etc.

79 In the case study:
1. Use qualitative methods for application understanding (read!)
2. Inspect the site and the URLs for data understanding
   1. Generate Analog reports for getting base statistics of usage
   2. Build concept system / hierarchy and mapping: URLs → concepts (notation: WUMprep regex)
3. Use WUMprep for data preparation
   1. Remove unwanted entries (pictures etc.)
   2. Sessionize
   3. Remove robots
   4. Replace URLs by concepts
   5. (Build a database)
4. Use WEKA for modelling
   1. [ Transform log file into ARFF (WUMprep4WEKA) ]
   2. Cluster, classify, find association rules, ...
5. Use WUM for modelling
6. Select patterns based on objective interestingness measures (support, confidence, lift, ...) and on subjective interestingness measures (unexpected? application-relevant?)
7. Present results in tabular, textual and graphical form (use Excel, ...)
8. Interpret the results
9. Make recommendations for site improvement etc.
done
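The objective interestingness measures named in step 6 can be computed directly from sessionized data. The following Python sketch uses the standard definitions of support, confidence, and lift for a rule A → B; the function name and the example sessions are hypothetical and are not output of WEKA or WUM.

```python
def rule_measures(sessions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent.

    sessions: list of sessions, each an iterable of concept names.
    antecedent, consequent: iterables of concept names.
    """
    n = len(sessions)
    if n == 0:
        raise ValueError("need at least one session")
    a, c = set(antecedent), set(consequent)
    sets = [set(s) for s in sessions]
    n_a = sum(a <= s for s in sets)          # sessions containing A
    n_c = sum(c <= s for s in sets)          # sessions containing B
    n_ac = sum((a | c) <= s for s in sets)   # sessions containing A and B
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical sessions over hypothetical concepts.
example = [
    ["SEARCH", "PERSON"],
    ["SEARCH", "PERSON", "PROJECT"],
    ["SEARCH"],
    ["PROJECT"],
]
support, confidence, lift = rule_measures(example, ["SEARCH"], ["PERSON"])
print(support, confidence, lift)
```

A lift above 1 indicates that the consequent occurs more often in sessions containing the antecedent than in sessions overall; whether such a rule is also subjectively interesting remains a judgment about the application.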

80 URLs of the tools Analog: WUMprep: WEKA: WUM:

81 Short introductions to WUMprep Lüderitz, S. (2006). Pre-processing of webserver logs for data mining. (pp ) Dettmar, G. (2003). Logfile-Preprocessing using WUMprep.

82 Materials for your case study
- Original log
- A transformed log (to simplify your work of sessionizing)
- Some explanation: ure/OtherSlides//explaining-the-ka2portal-logs.html
  - (original log and transformed log are hyperlinked there)
- The ontology
  - You can browse this ontology (it is the default ontology, see Wizard) for example with the Ontomat tool:
- Unfortunately, the site itself is not running any more! Use to inspect earlier versions

83 To structure your case study: More details in CRISP-DM 1.0. Step-by-step data mining guide.

84 Next lecture
[Diagram labels: Inputs, Data preparation, Outputs, Multirelational data mining, Evaluation, Algorithm]
What if the input isn't in a table (or even multiple tables)? Mining semi-structured / unstructured data II (text)

85 References / background reading (1)
- Data preparation
  - Cooley, R., Mobasher, B., & Srivastava, J. Data preparation for mining world wide web browsing patterns. J. of Knowledge and Inform. Systems 1 5–
  - Spiliopoulou, M., Mobasher, B., Berendt, B., & Nakagawa, M. (2003). A framework for the evaluation of session reconstruction heuristics in Web-usage analysis. INFORMS Journal on Computing, 15,
- Web mining
  - Baldi, P., Frasconi, P., & Smyth, P. (2003). Modeling the Internet and the Web. Probabilistic Methods and Algorithms. Chichester, UK: John Wiley & Sons.
  - Liu, B. (2006). Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer.
- A general overview of Web usage mining
  - Srivastava, J., Desikan, P., & Kumar, V. (2004). Web Mining - Concepts, Applications and Research Directions. In H. Kargupta, A. Joshi, K. Sivakumar, & Y. Yesha (Eds.), Data Mining: Next Generation Challenges and Future Directions (pp ). Menlo Park, CA: AAAI/MIT Press. (earlier, longer version:

86 References / background reading (2)
- Case study 1
  - Teltzrow, M., & Berendt, B. (2003). Web-Usage-Based Success Metrics for Multi-Channel Businesses. In Proceedings of the WebKDD 2003 Workshop - Webmining as a Premise to Effective and Intelligent Web Applications. August 27th, 2003, Washington DC, USA. Held in conjunction with The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  - Teltzrow, M., Berendt, B., & Günther, O. (2003). Consumer behaviour at multi-channel retailers. In Proceedings of the 4th IBM eBusiness Conference, School of Management, University of Surrey, 9th December
- Case study 2
  - Berendt, B. & Spiliopoulou, M. (2000). Analysis of navigation behaviour in web sites integrating multiple information systems. The VLDB Journal, 9,