Web Usage Mining A case study of the GoMercer.com website Martin Zhao Mar 16, 2007.

Topics What is data mining? The data mining process Web usage mining: basic concepts The robust fuzzy relational clustering algorithm An application to the GoMercer.com web logs Q & A

What is Data Mining? – definition A concise definition Finding hidden information from large datasets A slightly longer version Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules Differences from accessing info in a database The query is not well formed or precisely stated The data needs to be pre-processed before mining The output is new knowledge, which may not be a subset of the database

What is Data Mining? – a historical perspective Data mining is a relatively new field of study. The 1st International Conference on Knowledge Discovery and Data Mining (KDD) was held in 1995. But its roots can be traced back to five areas:
Statistics: Bayes theorem (1700s), regression (1900s), classification (1960s), k-means clustering (1970s)
Artificial Intelligence: neural networks (1940s), genetic algorithms (1970s), decision tree algorithms (1980s)
Algorithms
Information Retrieval: similarity measures (1960s), clustering (1960s), SMART IR systems (1970s)
Databases: batch reports (1960s), relational data models (1970s), data warehousing & OLAP (1990s)

Why Data Mining? The growth of data is the most important factor propelling the growth of data mining. In 2003, Wal-Mart captured 20 million transactions per day in a 10-terabyte database (1 TB = 10^6 MB). In 1950, the largest companies had only several dozen megabytes. The total amount of data produced in 2002 was estimated at 5 exabytes (1 EB = 10^6 TB); 40% of this was produced in the US. When we have more data, we expect more sophisticated information from it.

Business Intelligence – from data to knowledge
Data: factual information; may be incomplete; stored in huge amounts
Information: relevant data; well formatted; for a targeted audience
Knowledge: models, patterns, and rules; can be used in prediction
Intelligence: using knowledge in decision making

Basic Data Mining Tasks Classification (maps data into predefined groups) Regression (maps a data item to a real-valued prediction variable) Prediction (similar to classification, but deals with a future state) Clustering (similar to classification, but the groups are defined by the data) Association rules (identify associations among data) Sequence discovery (determines sequential patterns in data)

The Data Mining Process – the steps Develop an understanding of the purpose Obtain the dataset to be used Explore, clean, and preprocess the data Reduce the data, if necessary Determine the data mining tasks Choose the data mining techniques to be used Use algorithm to perform the task Interpret the results Deploy the model

Phases in the DM Process – CRISP-DM

Web Data Mining Web mining: the use of data mining techniques to automatically discover and extract useful and novel information from web documents and services Web mining can be categorized as Content mining: extracts models from web contents, such as text, images, video, and semi-structured (HTML or XML) or structured documents (digital libraries) Structure mining: aims at finding the underlying topology and organization of web resources Usage mining: discovers usage patterns from web server log files, user queries, and registration data

User Clustering and Profiling – goals Major application areas for web usage mining Personalization System improvement Site modification Business intelligence Usage characterization

User Clustering and Profiling – process Data cleaning: omitting entries about individual objects on a page (such as .gif or .jpg image files) (User and) session identification: including identifying pages, IPs, and agents; a session is a sequence of page views accessed through a certain IP using a certain agent within a certain amount of time (set as 45 minutes) Clustering and profiling: define similarity between page views; categorize user sessions into clusters based on similarity of the pages visited

Web Log File Entries Web log files keep track of the following data Date and time (e.g., 2006-10-01@00:01:01) Client IP address (e.g., 70.168.242.49) Server IP address (e.g., www.GoMercer.com, or 192.168.1.52) URI stem (web page or a specific file requested, e.g., /choose-mercer/apply-online.aspx) User agent (browser used by the user, e.g., Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)) Referrer (the previous page visited) Cookie Etc.
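As a rough illustration of the fields listed above, here is a minimal Python sketch that parses one log line. The space-delimited layout and the field order (date, time, client IP, server IP, URI stem, user agent) are assumptions; the slides do not show the actual log format, and the names `LogEntry` and `parse_log_line` are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: datetime
    client_ip: str
    server_ip: str
    uri_stem: str
    user_agent: str


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Parse one space-delimited log line (assumed field order, see lead-in).

    Returns None for blank lines, '#' comment lines, and lines with too few fields.
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 6:
        return None
    date, time, client_ip, server_ip, uri_stem, user_agent = fields[:6]
    timestamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return LogEntry(timestamp, client_ip, server_ip, uri_stem, user_agent)


# Sample line assembled from the example values on the slide.
sample = (
    "2006-10-01 00:01:01 70.168.242.49 192.168.1.52 "
    "/choose-mercer/apply-online.aspx "
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)"
)
entry = parse_log_line(sample)
```

Because the user agent is logged with `+` in place of spaces, a plain whitespace split keeps it as a single field.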

Data Model [Diagram: entities User Session, Web Page, Web Browser, IP Address, and User Cluster, linked by one-to-many (1 to *) relationships; the page views of a session fall within 45 minutes]

Session Identification
1. Use original web server log files as input
2. Parse log entries to omit individual objects (such as images), and
   a. Keep track of unique client IPs, URIs of interest, and user agents
   b. Keep track of date/time and identifiers for IP, URI, and agent for each entry of interest
3. For each entry of interest
   a. Add it to an existing session with the same {IP, URI, agent} identifiers and within 45 minutes, or
   b. Otherwise, create a new session with it
4. Persist the session information to a file (or DB)
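The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: sessions are keyed by (IP, agent), following the session definition given earlier in the deck, and the 45-minute window is measured from the first entry of the session (the slides do not say whether it is measured from the first or the last entry). The function name and the sample URIs `/assets/logo.gif` and `/accepted/default.aspx` are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(minutes=45)
SKIPPED_EXTENSIONS = (".gif", ".jpg")  # individual page objects dropped in cleaning


def identify_sessions(entries):
    """Group (timestamp, ip, agent, uri) tuples into sessions.

    Assumption: a session is keyed by (ip, agent) and closes 45 minutes
    after its first entry. Returns a list of dicts with keys
    'ip', 'agent', 'start', and 'pages'.
    """
    sessions = []
    open_sessions = {}  # (ip, agent) -> most recent session for that pair
    for timestamp, ip, agent, uri in sorted(entries):
        if uri.lower().endswith(SKIPPED_EXTENSIONS):
            continue  # step 2: omit images and similar objects
        key = (ip, agent)
        current = open_sessions.get(key)
        if current is None or timestamp - current["start"] > SESSION_WINDOW:
            # step 3b: no matching session within the window, start a new one
            current = {"ip": ip, "agent": agent, "start": timestamp, "pages": []}
            open_sessions[key] = current
            sessions.append(current)
        current["pages"].append(uri)  # step 3a: extend the matching session
    return sessions


t0 = datetime(2006, 10, 1, 0, 1, 1)
entries = [
    (t0, "70.168.242.49", "MSIE", "/choose-mercer/apply-online.aspx"),
    (t0 + timedelta(minutes=1), "70.168.242.49", "MSIE", "/assets/logo.gif"),
    (t0 + timedelta(minutes=5), "70.168.242.49", "MSIE", "/mercer-411/contact.aspx"),
    (t0 + timedelta(minutes=60), "70.168.242.49", "MSIE", "/accepted/default.aspx"),
]
sessions = identify_sessions(entries)
```

With this input the image request is dropped, the first two page views share one session, and the request 60 minutes later opens a second session. Step 4 (persisting to a file or DB) is left out of the sketch.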

Sample Session Information

Clustering – a one-dimensional example Classification: map data into predefined groups. Clustering: just specify the number of groups, which are defined by the data. Maximize the inter-cluster distance and minimize the intra-cluster distance. [Figure: one-dimensional example marking the inter-cluster distance (the gap is used here) and the intra-cluster distance]
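To make the two quantities concrete, here is a small Python sketch on made-up one-dimensional data (the points are hypothetical, not the values from the slide's figure). It assumes the inter-cluster distance is the gap between the closest points of two clusters, as the slide indicates, and takes the intra-cluster distance to be the mean pairwise distance within a cluster, which is one common choice and an assumption here.

```python
from itertools import combinations


def inter_cluster_gap(a, b):
    """Smallest distance between a point in cluster a and a point in cluster b."""
    return min(abs(x - y) for x in a for y in b)


def intra_cluster_distance(cluster):
    """Mean pairwise distance within one cluster (0.0 for fewer than two points)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


# Hypothetical 1-D data split into two groups.
cluster_a = [1, 2, 4]
cluster_b = [10, 12, 13]

gap = inter_cluster_gap(cluster_a, cluster_b)   # closest pair is 4 and 10
spread_a = intra_cluster_distance(cluster_a)    # pairs: |1-2|, |1-4|, |2-4|
spread_b = intra_cluster_distance(cluster_b)    # pairs: |10-12|, |10-13|, |12-13|
```

A good grouping makes `gap` large relative to `spread_a` and `spread_b`.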

Page and Session (Dis-)Similarity The “syntactic” similarity between (the URLs of) the $i$th and $j$th pages is defined as the smaller of 1 and the ratio of the overlap of the two URLs to the larger of the two lengths:
$S_u(i, j) = \min\left(1, \frac{|p_i \cap p_j|}{\max(1, \max(|p_i|, |p_j|))}\right)$
For instance, the similarity score for /mercer-411/contact.aspx and /mercer-411/ask-a-student.aspx is 1/2, whereas the score for /mercer-411/contact.aspx and assets/flash/location.xml is 0.
Dissimilarity is defined as $(1 - S_u(i, j))^2$.
Dissimilarity between two clusters is then calculated by summing up pair-wise dissimilarity scores.
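The page-level measure can be sketched in Python as below. One assumption is needed: the overlap between two URLs is taken to be the number of leading path components they share, and the length of a URL is its number of path components. That reading reproduces both scores quoted on the slide, but the slide itself does not spell it out. The function names are hypothetical.

```python
def path_tokens(url):
    """Split a URL path into its non-empty components."""
    return [part for part in url.split("/") if part]


def url_similarity(url_i, url_j):
    """S_u(i, j) = min(1, overlap / max(1, max(len_i, len_j))).

    Assumption: overlap is the count of shared leading path components.
    """
    p_i, p_j = path_tokens(url_i), path_tokens(url_j)
    overlap = 0
    for a, b in zip(p_i, p_j):
        if a != b:
            break
        overlap += 1
    return min(1.0, overlap / max(1, max(len(p_i), len(p_j))))


def url_dissimilarity(url_i, url_j):
    """Squared complement of the similarity: (1 - S_u(i, j)) ** 2."""
    return (1.0 - url_similarity(url_i, url_j)) ** 2


same_section = url_similarity("/mercer-411/contact.aspx",
                              "/mercer-411/ask-a-student.aspx")
different = url_similarity("/mercer-411/contact.aspx",
                           "assets/flash/location.xml")
```

Here `same_section` is 0.5 and `different` is 0.0, matching the two examples on the slide; the corresponding dissimilarities are 0.25 and 1.0.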

Medoid and Membership Each cluster is represented by a medoid, which is a centrally located session in the cluster The affiliation of a session to a cluster is represented as a membership score, or the similarity to the corresponding medoid A session is not considered to exclusively belong to a single cluster The affiliation is determined by the highest membership score in a given iteration

Relational Clustering Algorithm
1. Use identified sessions as input
2. Specify the number of clusters, C, and the maximum number of iterations, M, to be used
3. Choose an initial medoid for each cluster i in [1, C]
4. Compute the membership $u_{ij}$ for each session j in [1, N] with regard to each cluster i (using the similarity measure)
5. Store the old medoids
6. Compute the new medoids to minimize overall intra-cluster distances
7. Repeat steps 4 through 6 until the medoids do not change or the maximum number of iterations M is reached
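The loop above can be sketched in Python as a plain fuzzy c-medoids (FCMdd) iteration over a precomputed dissimilarity matrix. This is a sketch under assumptions, not necessarily the exact variant used in the study: the slides do not give the membership formula, so the usual FCMdd form with fuzzifier m = 2 is assumed, and the robust variant's handling of outliers is not shown. The function names and the toy data are hypothetical.

```python
def compute_memberships(dissim, medoids, m=2.0):
    """Step 4: u[i][j] for cluster i and object j (assumed standard FCMdd form)."""
    n = len(dissim)
    u = [[0.0] * n for _ in medoids]
    for j in range(n):
        dists = [dissim[j][v] for v in medoids]
        if 0.0 in dists:
            # Object j coincides with a medoid: full membership in that cluster.
            u[dists.index(0.0)][j] = 1.0
        else:
            weights = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in dists]
            total = sum(weights)
            for i, w in enumerate(weights):
                u[i][j] = w / total
    return u


def fuzzy_c_medoids(dissim, initial_medoids, max_iter=100, m=2.0):
    """Cluster N objects given an N x N dissimilarity matrix (list of lists).

    Returns (medoids, memberships); medoids are indices into the matrix.
    """
    n = len(dissim)
    medoids = list(initial_medoids)          # step 3
    for _ in range(max_iter):                # step 7: at most M iterations
        u = compute_memberships(dissim, medoids, m)
        old_medoids = list(medoids)          # step 5
        for i in range(len(medoids)):        # step 6
            # Candidate k minimizing the membership-weighted dissimilarity.
            costs = [
                sum(u[i][j] ** m * dissim[k][j] for j in range(n))
                for k in range(n)
            ]
            medoids[i] = costs.index(min(costs))
        if medoids == old_medoids:           # step 7: medoids unchanged
            break
    return medoids, compute_memberships(dissim, medoids, m)


# Toy data: six 1-D points with squared distance as the dissimilarity.
points = [0, 1, 2, 10, 11, 12]
dissim = [[float((x - y) ** 2) for y in points] for x in points]
medoids, memberships = fuzzy_c_medoids(dissim, initial_medoids=[0, 3])
# Hard assignment: each object goes to the cluster with the highest membership.
labels = [max(range(len(medoids)), key=lambda i: memberships[i][j])
          for j in range(len(points))]
```

On the toy data the medoids settle on the middle point of each group. In the web usage setting, `dissim` would hold session-to-session dissimilarities built from the page-level measure.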

Application to GoMercer.com Meeting w/ Rob Saxon Obtain & read Web log files Preliminary study using CSC data Parsing data for sessions Clustering w/ FCMdd Data analysis & visualization Ongoing

Results – summary of log files 148 files (one per day from 09/29/06 to 02/23/07), totaling about 2.5 GB File sizes for Oct 2006 and Feb 2007 as shown Session counts in the same periods present similar patterns

Results – frequencies by URI type User client programs (or browsers used) Main page ASP scripts Breakdown for /accepted, /choose-mercer, and /mercer-411 Flash videos Individual videos Combined by topic /accepted /choose-mercer /mercer-411

Results – user clusters and profiles [Figure: chart of user clusters and profiles; value labels: 279, 128, 156, 278, 267, 145, 305, 399, 190, 320, 268, 279, 158, 251, 225, 162, 147, 166, 263, 112, 150, 345, 206, 233, 281, 291, 151, 186, 229]

Questions and Discussions

References
Data mining for business intelligence, by Shmueli et al., Wiley-Interscience, 2007
Data mining, by Dunham, Prentice Hall, 2003
Web mining: applications and techniques, Scime (ed.), Idea Group, 2005
What is data mining?, by Squier (www.dama-ncr.org/Library/2001.11.14-Laura%20Squier.ppt)
Automatic web user profiling and personalization using robust fuzzy relational clustering, by Nasraoui et al., 1999
Web usage mining: discovery and application of interesting patterns from web data, by Cooley, PhD thesis, Univ. of Minnesota, 2000
