SpeedTracer: A Web usage mining and analysis tool


SpeedTracer: A Web usage mining and analysis tool
Sungshin Women's University, Department of Computer Science, DE Lab
Young-Ran Choi

Abstract
To understand user surfing behavior, SpeedTracer first identifies user sessions.
It then applies data mining algorithms to discover:
the top N most frequently traversed user paths
the top N groups of pages most frequently visited together
The results are presented as user reports, path reports, and group reports.
2019-01-18, DE Lab, Young-Ran Choi

Introduction
Web server log file fields:
IP / time stamp / method / URL / HTTP version / return code / bytes transferred / referrer page URL / agent

Peo-ill-21.ix.netcom.com - - [24/Feb/1997:00:00:21 +0000] "GET /images/nudge.gif HTTP/1.0" 200 37 "http://www.internet.ibm.com/" "Mozilla/2.0 (compatible; MSIE 3.01; Windows NT)"
Table 1. Sample log entry from an NCSA HTTPd server
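The fields listed above can be pulled out of a log line with a regular expression. The following is a minimal sketch, assuming the line follows the layout of the sample entry in Table 1; the names `LOG_PATTERN` and `parse_log_line` are illustrative and do not come from SpeedTracer.

```python
import re

# Pattern for one log line laid out like the Table 1 sample (an assumption).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = (
    'Peo-ill-21.ix.netcom.com - - [24/Feb/1997:00:00:21 +0000] '
    '"GET /images/nudge.gif HTTP/1.0" 200 37 '
    '"http://www.internet.ibm.com/" '
    '"Mozilla/2.0 (compatible; MSIE 3.01; Windows NT)"'
)
entry = parse_log_line(sample)
```

All values are returned as strings; converting the time stamp or byte count to richer types is left out to keep the sketch short.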

Difficulties of user identification
Proxy servers and firewalls mask the users' IP addresses.
Cached Web pages produce no entries in the server log.
So "cookies" and log-ins are needed for reliable identification, but users prefer not to register or to accept cookies.
SpeedTracer:
Uses the referrer page and the requested URL as a traversal step.
Maps each identified user session into a transaction.

Session identification
[Flow diagram] Access log, referrer log, and agent log → merge log records → session identification → mining of top frequent traversal paths and groups of pages
Outputs: user reports, path reports, and group reports
Figure 1. Flow diagram of SpeedTracer implementation

Comparison
Sequential patterns: unlike frequent traversal paths, the items do not need to be consecutive.
Frequent traversal paths: a collection of consecutive URL pages in a Web presentation.
Frequently visited groups of pages: similar to association rules, with no ordering among the pages.

User session identification
Five key pieces of information are used from each log record: IP, Timestamp, URL, Referral, and Agent.
The Referral and URL fields together form a hyperlink access pair (Referral → URL), representing a step in a user traversal path.
Pages served from a cache have no corresponding log entries, so these missing access pairs need to be added back during session identification.

Session identification
A session is a sequence of access pairs $S: (x_1 \to y_1), (x_2 \to y_2), \dots, (x_n \to y_n)$, where $x_{i+1} = y_i$ for $1 \le i < n$.
A new pair $(x_j \to y_j)$ can be appended to $S$ if $x_j = y_k$ for some $1 \le k \le n$, or if $x_1 = x_j$.
Unless $x_j = y_n$, the backward access path $(y_n \to x_n), \dots, (y_{k+1} \to x_{k+1})$ must first be added back to $S$.
For example, if (b → d) is to be appended to $S_i$: (a → b), (b → c), the new $S_i$ is (a → b), (b → c), (c → b), (b → d).
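The append rule can be sketched in Python as follows. This is an illustration under stated assumptions, not SpeedTracer's implementation: the function name `append_pair` is hypothetical, and the sketch keeps a separate stack of the current forward path (an assumption) so that the rule can be applied repeatedly after backward steps have already been inserted.

```python
def append_pair(session, forward, pair):
    """Append (x -> y) to `session`, first adding back any backward steps.

    `session` is the full traversal path (a list of pairs); `forward` is the
    stack of forward pairs leading to the current page. Both are modified in
    place. Returns False if the pair cannot be attached to this session.
    """
    x, y = pair
    if not forward:                      # first pair of a new session
        forward.append(pair)
        session.append(pair)
        return True
    keep = None
    # Look for the most recent forward pair whose target y_k equals x.
    for idx in range(len(forward) - 1, -1, -1):
        if forward[idx][1] == x:
            keep = idx + 1
            break
    if keep is None and forward[0][0] == x:   # the x_1 = x_j case
        keep = 0
    if keep is None:
        return False
    # Add back the backward steps (y_n -> x_n), ..., (y_{k+1} -> x_{k+1}).
    while len(forward) > keep:
        src, dst = forward.pop()
        session.append((dst, src))
    forward.append(pair)
    session.append(pair)
    return True

# The example from the slide: append (b -> d) to (a -> b), (b -> c).
session = [("a", "b"), ("b", "c")]
forward = list(session)
append_pair(session, forward, ("b", "d"))
# session is now (a -> b), (b -> c), (c -> b), (b -> d)
```

Keeping the forward stack separate from the session is a design choice of this sketch; it avoids re-walking backward pairs that were inserted by earlier calls.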

Table 2. Key information in log used for session identification

Record  IP                     Timestamp  URL  Referral  Agent
1       indigo.sungshin.ac.kr  08:30:00   a    -         Mozilla/2.0; AIX 4.1.4
2       indigo.sungshin.ac.kr  08:30:01   b    e         Mozilla/2.0; AIX 4.1.4
3       indigo.sungshin.ac.kr  08:30:01   c    b         Mozilla/2.0; AIX 4.1.4
4       indigo.sungshin.ac.kr  08:30:01   b    -         Mozilla/2.0; Win95
5       indigo.sungshin.ac.kr  08:30:02   c    b         Mozilla/2.0; Win95
6       indigo.sungshin.ac.kr  08:30:03   f    -         Mozilla/2.0; Win95
7       indigo.sungshin.ac.kr  08:30:04   b    a         Mozilla/2.0; AIX 4.1.4
8       indigo.sungshin.ac.kr  08:30:05   g    b         Mozilla/2.0; AIX 4.1.4

Sessions identified from Table 2:
S1: (- → a), (a → b), (b → g)   from records 1, 7, 8
S2: (e → b), (b → c)            from records 2, 3
S3: (- → b), (b → c)            from records 4, 5
S4: (- → f)                     from record 6
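One way to reproduce the four sessions from the Table 2 records is sketched below. The matching heuristic is an assumption made for illustration and is not stated in the slides: a record joins a session with the same IP and agent that already contains its referral page, preferring the most recently extended session, and a record with no referral starts a new session. The names `Record` and `identify_sessions` are hypothetical, and backward steps are not re-inserted here.

```python
from collections import namedtuple

Record = namedtuple("Record", "ip time url referral agent")

def identify_sessions(records):
    """Group log records into sessions of (referral, url) access pairs."""
    sessions = []
    for n, rec in enumerate(records):
        key = (rec.ip, rec.agent)
        candidates = [s for s in sessions
                      if s["key"] == key and rec.referral in s["pages"]]
        if rec.referral == "-" or not candidates:
            current = {"key": key, "pairs": [], "pages": set(), "last": n}
            sessions.append(current)
        else:
            # Assumed tie-breaker: the most recently extended session wins.
            current = max(candidates, key=lambda s: s["last"])
        current["pairs"].append((rec.referral, rec.url))
        current["pages"].add(rec.url)
        if rec.referral != "-":
            current["pages"].add(rec.referral)
        current["last"] = n
    return [s["pairs"] for s in sessions]

HOST = "indigo.sungshin.ac.kr"
AIX, WIN = "Mozilla/2.0; AIX 4.1.4", "Mozilla/2.0; Win95"
table2 = [
    Record(HOST, "08:30:00", "a", "-", AIX),
    Record(HOST, "08:30:01", "b", "e", AIX),
    Record(HOST, "08:30:01", "c", "b", AIX),
    Record(HOST, "08:30:01", "b", "-", WIN),
    Record(HOST, "08:30:02", "c", "b", WIN),
    Record(HOST, "08:30:03", "f", "-", WIN),
    Record(HOST, "08:30:04", "b", "a", AIX),
    Record(HOST, "08:30:05", "g", "b", AIX),
]
sessions = identify_sessions(table2)
```

With this heuristic, record 8 (g, referred from b) joins S1 rather than S2 because S1 was extended last, by record 7.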

[Figure: graph of pages a to f with an EXIT node]
Traversal steps from log files:
(- → a), (a → b), (b → c), (c → d), (b → e), (a → f)
Traversal path representing a user session:
(- → a), (a → b), (b → c), (c → d), (d → c), (c → b), (b → e), (e → b), (b → a), (a → f)
Figure 2. An example of a user traversal path in a surfing session

In our session identification:
Requests for image files ("gif" or "jpg") are eliminated.
On a reload, the repeated access pair is discarded.
A page reached through a bookmark or by typing the URL directly is treated as the beginning of a new session.
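The first two clean-up rules can be sketched as a small filter. This is a hedged illustration: the function name `clean_pairs` is hypothetical, image requests are recognized by file extension only, and a reload is assumed to show up as an immediately repeated access pair.

```python
def clean_pairs(pairs):
    """Drop image requests and immediately repeated (reloaded) access pairs."""
    cleaned = []
    for referral, url in pairs:
        if url.lower().endswith((".gif", ".jpg")):
            continue                      # image file: eliminated
        if cleaned and cleaned[-1] == (referral, url):
            continue                      # reload: repeated pair discarded
        cleaned.append((referral, url))
    return cleaned

raw = [
    ("-", "/a.html"),
    ("/a.html", "/logo.gif"),
    ("/a.html", "/b.html"),
    ("/a.html", "/b.html"),
]
cleaned = clean_pairs(raw)
```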

Mining frequent traversal paths
A frequent traversal path is a collection of consecutive URL pages.
Only forward traversal subpaths are of interest.
1. Maximum forward path: a sequence of maximum connected pages, where no page is previously visited.
2. Large traversal path: a sequence of consecutive pages that appears in the maximum forward paths of a sufficient number of sessions.

Example of finding maximum forward paths: { a, b, c, d, c, b, e, b, a, f }

Step  Xi  Subpath {y1, …, yj-1}  Flag  Maximum forward path
1     a   {a}                    YES
2     b   {a, b}                 YES
3     c   {a, b, c}              YES
4     d   {a, b, c, d}           YES
5     c   {a, b, c, d}           NO    {a, b, c, d}
6     b   {a, b}                 NO
7     e   {a, b, e}              YES
8     b   {a, b}                 NO    {a, b, e}
9     a   {a}                    NO
10    f   {a, f}                 YES
11        {a, f}                 YES   {a, f}
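The walk through the table can be sketched as a single pass over the visited pages. This is a minimal sketch; the function name `maximum_forward_paths` and the variable names are illustrative, and the flag corresponds to the Flag column (YES while moving forward, NO after a backward step).

```python
def maximum_forward_paths(pages):
    """Split a traversal sequence into its maximum forward paths."""
    path = []          # current forward subpath
    forward = False    # True while the last move was a forward move
    result = []
    for page in pages:
        if page in path:
            # Backward move: emit the path if we were moving forward,
            # then cut the subpath back to the revisited page.
            if forward:
                result.append(list(path))
            del path[path.index(page) + 1:]
            forward = False
        else:
            path.append(page)
            forward = True
    if forward and path:
        result.append(list(path))   # path still open at the end of the session
    return result

paths = maximum_forward_paths(list("abcdcbebaf"))
# paths == [['a', 'b', 'c', 'd'], ['a', 'b', 'e'], ['a', 'f']]
```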

for each F_i {
    for each {x_1, x_2, …, x_m} in F_i {
        if (m ≥ k) {
            for (j = 1; j ≤ m - k + 1; j++) {
                if ({x_j, …, x_{j+k-1}} is already in LP_k)
                    increase its corresponding count
                else if ((support of {x_j, …, x_{j+k-2}} ≥ S_{k-1}) and
                         (support of {x_{j+1}, …, x_{j+k-1}} ≥ S_{k-1}))
                    insert {x_j, …, x_{j+k-1}} into LP_k
            }
        }
    }
}
Figure 4. Algorithm for discovering the large traversal path set LP_k

For example, assume session S1 contains two maximum forward paths: {A, B, C, D, E} and {G, H}.
For CP3, test {A, B, C}, {B, C, D}, and {C, D, E}.
If {A, B} and {B, C} are both in LP2, then {A, B, C} is added to CP3.
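The candidate test in this example can be sketched as a counting pass over the maximum forward paths. This is a sketch under assumptions: each session is taken to be a list of its maximum forward paths, a subpath is counted at most once per session, and `prev_large` holds the (k-1)-page paths already found to be large. The name `count_candidate_paths` is hypothetical.

```python
from collections import Counter

def count_candidate_paths(sessions, prev_large, k):
    """Count, per session, the k-page subpaths whose two (k-1)-page
    subpaths are both in prev_large."""
    counts = Counter()
    for forward_paths in sessions:
        seen = set()
        for path in forward_paths:
            for j in range(len(path) - k + 1):
                sub = tuple(path[j:j + k])
                if k > 1 and (sub[:-1] not in prev_large
                              or sub[1:] not in prev_large):
                    continue
                seen.add(sub)
        counts.update(seen)      # each session contributes at most 1
    return counts

s1 = [["A", "B", "C", "D", "E"], ["G", "H"]]
lp2 = {("A", "B"), ("B", "C"), ("C", "D")}   # illustrative LP2, not from the slides
cp3 = count_candidate_paths([s1], lp2, 3)
```

Here {C, D, E} is rejected because (D, E) is not in the illustrative LP2, and the path {G, H} is too short to contribute any 3-page subpath.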

Mining groups of pages most frequently visited together
First, any duplicate pages are eliminated within each session.

Sort the groups in LG_{k-1} in lexicographical order;
for each group {x_1, …, x_{k-1}} in LG_{k-1} {
    for each group {y_1, …, y_{k-1}} in LG_{k-1} such that x_2 = y_1, …, x_{k-1} = y_{k-2} {
        construct a new group G = {x_1, …, x_{k-1}, y_{k-1}};
        test all other combinations of subgroups of G with size (k - 1);
        if (all such subgroups are among the top M groups in LG_{k-1})
            add G into CG_k;
    }
}
Figure 5. Algorithm for generating candidate groups CG_k
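The candidate generation step in Figure 5 can be sketched as follows. This is an illustration, not the tool's code: groups are represented as sorted tuples, `large_groups` is assumed to already contain only the top M groups of size k-1, and the name `candidate_groups` is hypothetical.

```python
from itertools import combinations

def candidate_groups(large_groups):
    """Join (k-1)-page groups that overlap in k-2 pages and keep a joined
    group only if every (k-1)-page subgroup is itself in large_groups."""
    large = {tuple(sorted(g)) for g in large_groups}
    candidates = set()
    for x in sorted(large):
        for y in sorted(large):
            # Join condition: x_2 = y_1, ..., x_{k-1} = y_{k-2}.
            if x[1:] != y[:-1] or y[-1] <= x[-1]:
                continue
            group = x + (y[-1],)
            if all(sub in large for sub in combinations(group, len(x))):
                candidates.add(group)
    return sorted(candidates)

lg2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"}]
cg3 = candidate_groups(lg2)
# cg3 == [('a', 'b', 'c')]
```

The group {a, b, d} is not generated because its subgroup {a, d} is missing from the input, which is the pruning check described in the figure.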