Web Usage Mining: A case study of the GoMercer.com website
Martin Zhao
Mar 16, 2007



Topics
- What is data mining?
- The data mining process
- Web usage mining: basic concepts
- The robust fuzzy relational clustering algorithm
- An application to the GoMercer.com web logs
- Q & A

What is Data Mining? – definition
- A concise definition: finding hidden information in large datasets
- A slightly longer version: data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules
- Differences from accessing information in a database:
  - The query is not well formed or precisely stated
  - The data needs to be pre-processed before mining
  - The output is new knowledge, which may not be a subset of the database

What is Data Mining? – a historical perspective
- Data mining is a relatively new field of study. The 1st International Conference on Knowledge Discovery and Data Mining (KDD) was held in 1995
- But its roots can be traced back to five areas:
  - Statistics: Bayes' theorem (1700s), regression (1900s), classification (1960s), k-means clustering (1970s)
  - Artificial intelligence: neural networks (1940s), genetic algorithms (1970s), decision tree algorithms (1980s)
  - Algorithms
  - Information retrieval: similarity measures (1960s), clustering (1960s), SMART IR systems (1970s)
  - Databases: batch reports (1960s), relational data models (1970s), data warehousing & OLAP (1990s)

Why Data Mining?
- The growth of data is the most important factor propelling the growth of data mining
  - In 2003, Wal-Mart captured 20 million transactions per day in a 10-terabyte database (1 TB = 10^6 MB)
  - In 1950, the largest companies had only several dozen megabytes
  - The total amount of data produced in 2002 was estimated at 5 exabytes (1 EB = 10^6 TB); 40% of this was produced in the US
- When we have more data, we expect more sophisticated information from it

Business Intelligence – from data to knowledge
- Data
  - Factual information
  - May be incomplete
  - Stored in huge amounts
- Information
  - Relevant data
  - Well formatted
  - For a targeted audience
- Knowledge
  - Models, patterns, and rules
  - Can be used in prediction
- Intelligence
  - Using knowledge in decision making

Basic Data Mining Tasks
- Classification (maps data into predefined groups)
- Regression (maps a data item to a real-valued prediction variable)
- Prediction (similar to classification, but deals with a future state)
- Clustering (similar to classification, but the groups are defined by the data)
- Association rules (identify associations among data)
- Sequence discovery (determines sequential patterns in data)

The Data Mining Process – the steps
1. Develop an understanding of the purpose
2. Obtain the dataset to be used
3. Explore, clean, and preprocess the data
4. Reduce the data, if necessary
5. Determine the data mining tasks
6. Choose the data mining techniques to be used
7. Use algorithms to perform the tasks
8. Interpret the results
9. Deploy the model

Phases in the DM Process – CRISP-DM

Web Data Mining
- Web mining: the use of data mining techniques to automatically discover and extract useful and novel information from web documents and services
- Web mining can be categorized as:
  - Content mining: extracts models from web contents, such as text, images, video, and semi-structured (HTML or XML) or structured documents (digital libraries)
  - Structure mining: aims at finding the underlying topology and organization of web resources
  - Usage mining: discovers usage patterns from web server log files, user queries, and registration data

User Clustering and Profiling – goals
Major application areas for web usage mining:
- Personalization
- System improvement
- Site modification
- Business intelligence
- Usage characterization

User Clustering and Profiling – process
- Data cleaning: omitting entries about individual objects on a page (such as .gif or .jpg image files)
- (User and) session identification: including identifying pages, IPs, and agents
  - A session is a sequence of page views accessed through a certain IP using a certain agent within a certain amount of time (set as 45 minutes)
- Clustering and profiling:
  - Define similarity between page views
  - Categorize user sessions into clusters based on similarity of the pages visited
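The data cleaning step can be illustrated with a minimal sketch. The function names and the extension list are assumptions made for illustration; the slides name only .gif and .jpg as examples of objects to omit.

```python
from pathlib import PurePosixPath

# Assumed set of embedded-object extensions. The slides mention only
# .gif and .jpg; the remaining entries are illustrative guesses.
OBJECT_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js"}


def is_page_view(uri_stem: str) -> bool:
    """Return True when the URI stem does not end in an embedded-object extension."""
    suffix = PurePosixPath(uri_stem.lower()).suffix
    return suffix not in OBJECT_EXTENSIONS


def clean(uri_stems):
    """Keep only the URI stems that look like page views, preserving order."""
    return [uri for uri in uri_stems if is_page_view(uri)]
```

Matching is case-insensitive here, so a request for `campus.JPG` is dropped along with `logo.gif`.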

Web Log File Entries
Web log files keep track of the following data:
- Date and time (e.g., )
- Client IP address (e.g., )
- Server IP address (e.g., or )
- URI stem (web page or a specific file requested, e.g., /choose-mercer/apply-online.aspx)
- User agent (browser used by the user, e.g., Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR ))
- Referrer (the previous page visited)
- Cookie
- Etc.

Data Model
[Diagram: entities User, Session, Web Page, Web Browser, IP Address, and User Cluster, linked by one-to-many (1 to *) relationships; the page views of a session fall within 45 minutes]

Session Identification
1. Use original web server log files as input
2. Parse log entries to omit individual objects (such as images), and
   a. Keep track of unique client IPs, URIs of interest, and user agents
   b. Keep track of date/time and identifiers for IP, URI, and agent for each entry of interest
3. For each entry of interest, either
   a. add it to an existing session with the same {IP, URI, agent} identifiers and within 45 minutes, or
   b. otherwise, create a new session with it
4. Persist the session information to a file (or DB)
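Step 3 can be sketched as follows. This is only an illustration under stated assumptions: the `Session` class and `identify_sessions` function are hypothetical names, entries are assumed to be already cleaned and sorted by time, and sessions are keyed on (IP, agent) because the session definition on the process slide describes a sequence of page views from one IP and one agent. The 45-minute window is measured from the start of the session, which is one possible reading of "within 45 minutes".

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(minutes=45)  # window length taken from the slides


@dataclass
class Session:
    ip: str
    agent: str
    start: datetime
    pages: list = field(default_factory=list)


def identify_sessions(entries):
    """Group (timestamp, ip, agent, uri) tuples into sessions.

    Assumes entries are sorted by timestamp and already cleaned of
    image requests and other embedded objects.
    """
    latest = {}    # (ip, agent) -> most recent Session for that client
    sessions = []
    for timestamp, ip, agent, uri in entries:
        key = (ip, agent)
        current = latest.get(key)
        # Start a new session if none exists or the window has elapsed.
        if current is None or timestamp - current.start > SESSION_WINDOW:
            current = Session(ip, agent, timestamp)
            latest[key] = current
            sessions.append(current)
        current.pages.append(uri)
    return sessions
```

Measuring the window from the last page view instead of the session start would be an equally plausible choice; only the comparison inside the loop would change.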

Sample Session Information

Clustering – a one-dimensional example
[Figure: points on a line, annotated with the intra-cluster distance and the inter-cluster distance (gap used here)]
- Classification: map data into predefined groups
- Clustering: just specify the number of groups, which are defined by the data
- Maximize the inter-cluster distance and minimize the intra-cluster distance
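A one-dimensional example of this idea can be sketched with a simple gap-based heuristic: sort the points and cut at the widest gaps, so that the gaps between groups are as large as possible. This heuristic and the name `cluster_1d` are assumptions chosen for illustration; they are not the clustering algorithm used later in the talk.

```python
def cluster_1d(values, k):
    """Split values into k groups by cutting at the k-1 widest gaps."""
    points = sorted(values)
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Pair each gap width with the index of the point on its left.
    gaps = [(points[i + 1] - points[i], i) for i in range(len(points) - 1)]
    # Indices after which to cut, taken from the k-1 widest gaps.
    cut_after = sorted(i for _, i in sorted(gaps, reverse=True)[: k - 1])
    clusters, start = [], 0
    for i in cut_after:
        clusters.append(points[start : i + 1])
        start = i + 1
    clusters.append(points[start:])
    return clusters
```

Only the number of groups is supplied; the group boundaries come from the data, which is the contrast with classification drawn above.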

Page and Session (Dis-)Similarity
- The "syntactic" similarity between (the URLs of) the i-th and j-th pages is defined as the smaller of 1 and the ratio of the overlap of the two URLs to the larger of the two lengths:
  $S_u(i, j) = \min\bigl(1, |p_i \cap p_j| / \max(1, \max(|p_i|, |p_j|))\bigr)$
- For instance, the similarity score for /mercer-411/contact.aspx and /mercer-411/ask-a-student.aspx is 1/2, whereas the score for /mercer-411/contact.aspx and assets/flash/location.xml is 0
- Dissimilarity is defined as $(1 - S_u(i, j))^2$
- Dissimilarity between two clusters is then calculated by summing up pair-wise dissimilarity scores
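The similarity score can be sketched in a few lines. Two points are assumptions here: a URL's length is taken to be its number of path segments, and the overlap is taken to be the number of leading segments the two URLs share. Both readings reproduce the two example scores above; the function names are hypothetical.

```python
def path_segments(url: str):
    """Split a URL path into its non-empty segments."""
    return [segment for segment in url.split("/") if segment]


def url_similarity(url_i: str, url_j: str) -> float:
    """Syntactic similarity: shared leading segments over the longer path length."""
    p_i, p_j = path_segments(url_i), path_segments(url_j)
    overlap = 0
    for a, b in zip(p_i, p_j):
        if a != b:
            break
        overlap += 1
    return min(1.0, overlap / max(1, max(len(p_i), len(p_j))))


def url_dissimilarity(url_i: str, url_j: str) -> float:
    """Squared complement of the similarity, as defined on the slide."""
    return (1.0 - url_similarity(url_i, url_j)) ** 2
```

With these definitions, two pages in the same top-level section score 1/2, so their dissimilarity is 1/4, while pages in unrelated sections have dissimilarity 1.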

Medoid and Membership
- Each cluster is represented by a medoid, which is a centrally located session in the cluster
- The affiliation of a session to a cluster is represented as a membership score, or the similarity to the corresponding medoid
- A session is not considered to belong exclusively to a single cluster
- The affiliation is determined by the highest membership score in a given iteration

Relational Clustering Algorithm
1. Use identified sessions as input
2. Specify the number of clusters, C, and the maximum number of iterations, M, to be used
3. Choose an initial medoid for each cluster i in [1, C]
4. Compute the membership u_ij for each session j in [1, N] with regard to each cluster i (using the similarity measure)
5. Store the old medoids
6. Compute the new medoids to minimize overall intra-cluster distances
7. Repeat steps 4 through 6 until the medoids do not change or the maximum number of iterations M is reached
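The loop above can be sketched as a plain fuzzy c-medoids iteration over a precomputed dissimilarity matrix. This is a simplified illustration, not the robust variant named in the topics: the membership formula (normalized inverse dissimilarity with fuzzifier `m`), the default choice of initial medoids, and all function names are assumptions.

```python
def membership_row(dissim, j, medoids, m):
    """Membership of session j in each cluster, from its dissimilarity to each medoid."""
    d = [dissim[v][j] for v in medoids]
    if any(x == 0 for x in d):
        # Session coincides with at least one medoid: full membership there.
        hits = [1.0 if x == 0 else 0.0 for x in d]
        return [h / sum(hits) for h in hits]
    weights = [(1.0 / x) ** (1.0 / (m - 1.0)) for x in d]
    total = sum(weights)
    return [w / total for w in weights]


def fuzzy_medoids(dissim, c, max_iter=100, m=2.0, initial_medoids=None):
    """Cluster N sessions given an N x N dissimilarity matrix (list of lists).

    Returns (medoid indices, N x c membership matrix).
    """
    n = len(dissim)
    # Step 3: initial medoids (default: the first c sessions, a naive assumption).
    medoids = list(initial_medoids) if initial_medoids is not None else list(range(c))
    for _ in range(max_iter):
        # Step 4: memberships with respect to the current medoids.
        u = [membership_row(dissim, j, medoids, m) for j in range(n)]
        old = medoids[:]  # Step 5
        # Step 6: each new medoid minimizes the membership-weighted dissimilarity.
        medoids = [
            min(range(n), key=lambda k: sum(u[j][i] ** m * dissim[k][j] for j in range(n)))
            for i in range(c)
        ]
        if medoids == old:  # Step 7: stop when the medoids no longer change.
            break
    u = [membership_row(dissim, j, medoids, m) for j in range(n)]
    return medoids, u
```

A hard assignment, if needed, is the cluster with the highest membership for each session. The result depends on the initial medoids, so a poor starting choice can leave two medoids inside the same natural group.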

Application to GoMercer.com
- Meeting w/ Rob Saxon
- Obtain & read Web log files
- Preliminary study using CSC data
- Parsing data for sessions
- Clustering w/ FCMdd
- Data analysis & visualization
- Ongoing

Results – summary of log files
- 148 files (one per day from 09/29/06 to 02/23/07), totaling about 2.5 GB
- File sizes for Oct 2006 and Feb 2007 as shown
- Session counts in the same periods present similar patterns

Results – frequencies by URI type
- User client programs (or browsers used)
- Main page
- ASP scripts: breakdown for /accepted, /choose-mercer, and /mercer-411
- Flash videos
  - Individual videos
  - Combined by topic: /accepted, /choose-mercer, /mercer-411

Results – user cluster and profiles

Questions and Discussions

References
- Data mining for business intelligence, by Shmueli et al., Wiley-Interscience, 2007
- Data mining, by Dunham, Prentice Hall, 2003
- Web mining: applications and techniques, Scime (ed.), Idea Group, 2005
- What is data mining? by Squier ( ncr.org/Library/ Laura%20Squier.ppt )
- Automatic web user profiling and personalization using robust fuzzy relational clustering, by Nasraoui et al., 1999
- Web usage mining: discovery and application of interesting patterns from web data, by Cooley, PhD thesis, Univ. of Minnesota, 2000