Published by Glenna Darmadi. Modified over 5 years ago.
1
SpeedTracer: A Web usage mining and analysis tool
Sungshin Women's University, Department of Computer Science, DE Lab, Choi Young-Ran
2
Abstract
To understand user surfing behavior, SpeedTracer:
  first identifies user sessions
  then applies data mining algorithms to discover
    the top N most frequented user traversal paths
    the top N groups of pages most frequently visited together
  produces user reports, path reports, and group reports
3
Introduction
Web server log file
  IP / time stamp / method / URL / HTTP version / return code / bytes transferred / referrer page URL / agent
  Peo-ill-21.ix.netcom.com - - [24/Feb/1997:00:00: ] “GET /images/nudge.gif HTTP/1.0” “ “Mozilla/2.0 (compatible; MSIE 3.01; Windows NT)”
Table 1. Sample log entry from an NCSA HTTPd
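As an illustration, an entry in this layout can be split into its fields with a regular expression. This is a minimal sketch, not SpeedTracer's own parser: the pattern, the field names, and the sample line (its host, timestamp, return code, byte count, and referrer) are assumptions made up for the example.

```python
import re

# Assumed layout: host ident user [timestamp] "method url version" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical log line, invented for this example only.
SAMPLE_LINE = (
    'host.example.com - - [24/Feb/1997:00:00:05 -0800] '
    '"GET /images/nudge.gif HTTP/1.0" 200 1024 '
    '"http://www.example.com/index.html" '
    '"Mozilla/2.0 (compatible; MSIE 3.01; Windows NT)"'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


fields = parse_log_line(SAMPLE_LINE)
```

Here `fields["referrer"]` and `fields["url"]` together give one traversal step, which is the unit the later slides work with.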
4
Difficulties of user identification
Proxy servers and firewalls mask individual users
Cached Web pages leave no entries in the server log
So “cookies” and log-ins are needed, but users do not want to be asked to register or to use cookies
SpeedTracer
  uses the referrer page together with the requested URL as a traversal step
  maps each identified user session into a transaction
5
SESSION IDENTIFICATION
ACCESS LOG, REFERRER LOG, AGENT LOG
-> MERGE LOG RECORDS
-> SESSION IDENTIFICATION
-> MINING: TOP FREQUENT TRAVERSAL PATHS, GROUP OF PAGES
-> PATH REPORTS, USER REPORTS, GROUP REPORTS
Figure 1. Flow diagram of SpeedTracer implementation
6
Comparison
sequential pattern
  pages don't need to be consecutive
frequent traversal paths
  collection of consecutive URL pages in a Web presentation
frequently visited group of pages
  similar to association rules
  no ordering
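The three notions can be contrasted with three small membership tests. This is only a sketch for illustration; the function names are hypothetical and do not come from SpeedTracer.

```python
def is_consecutive_subpath(pattern, path):
    """Frequent traversal path: pages must appear next to each other, in order."""
    k = len(pattern)
    return any(path[i:i + k] == pattern for i in range(len(path) - k + 1))


def is_subsequence(pattern, path):
    """Sequential pattern: pages must appear in order, gaps are allowed."""
    remaining = iter(path)
    return all(page in remaining for page in pattern)


def is_group(pages, path):
    """Group of pages: all pages were visited, order is ignored."""
    return set(pages) <= set(path)


session = ["a", "b", "c", "d"]
print(is_consecutive_subpath(["a", "c"], session))  # False: b lies between a and c
print(is_subsequence(["a", "c"], session))          # True
print(is_group(["c", "a"], session))                # True
```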
7
User session identification
Uses five key pieces of information: IP, Timestamp, URL, Referral, and Agent
Referral and URL form a hyperlink access pair, representing a step in a user traversal path
Cached pages have no corresponding log entries, so these missing access pairs need to be added back during session identification
8
Session identification
S: (x_1 y_1), (x_2 y_2), …, (x_n y_n), where x_{i+1} = y_i, 1 ≤ i < n
A new pair (x_j y_j) can be appended to S if x_j = y_k for some 1 ≤ k ≤ n, or if x_1 = x_j
Unless x_j = y_n, the backward access path (y_n x_n), …, (y_{k+1} x_{k+1}) must first be added back to S
For example, if (b d) is to be appended to S_i: (a b), (b c)
  new S_i: (a b), (b c), (c b), (b d)
9
Table 2. Key information in log used for session identification
Record  IP                     Timestamp  URL  Referral  Agent
1       indigo.sungshin.ac.kr  08:30:     a    -         Mozilla/2.0; AIX 4.1.4
2       indigo.sungshin.ac.kr  08:30:     b    e         Mozilla/2.0; AIX 4.1.4
3       indigo.sungshin.ac.kr  08:30:     c    b         Mozilla/2.0; AIX 4.1.4
4       indigo.sungshin.ac.kr  08:30:     b    -         Mozilla/2.0; Win95
5       indigo.sungshin.ac.kr  08:30:     c    b         Mozilla/2.0; Win95
6       indigo.sungshin.ac.kr  08:30:     f    -         Mozilla/2.0; Win95
7       indigo.sungshin.ac.kr  08:30:     b    a         Mozilla/2.0; AIX 4.1.4
8       indigo.sungshin.ac.kr  08:30:     g    b         Mozilla/2.0; AIX 4.1.4

Identified sessions
S1: (- a), (a b), (b g)  from log records 1, 7, 8
S2: (e b), (b c)         from log records 2, 3
S3: (- b), (b c)         from log records 4, 5
S4: (- f)                from log record 6
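One way to arrive at the four sessions listed for Table 2 is sketched below. It is an assumption-laden illustration, not SpeedTracer's actual procedure: records are keyed by (IP, Agent), a record joins the most recently updated session of the same key that already contains its referrer page, and a record with no referrer ("-") or no matching session opens a new one. The "most recently updated" tie-break, the function name, and the record layout are assumptions. Timestamps are ignored.

```python
NO_REFERRER = "-"


def identify_sessions(records):
    """records: iterable of (ip, url, referrer, agent) tuples in log order.
    Returns a list of sessions, each a list of (referrer, url) access pairs."""
    sessions = []  # sessions in creation order
    recency = []   # session indices, least to most recently updated
    for ip, url, referrer, agent in records:
        match = None
        if referrer != NO_REFERRER:
            # Assumption: prefer the most recently updated matching session.
            for idx in reversed(recency):
                candidate = sessions[idx]
                if candidate["key"] == (ip, agent) and referrer in candidate["pages"]:
                    match = idx
                    break
        if match is None:
            sessions.append({"key": (ip, agent), "pairs": [], "pages": set()})
            match = len(sessions) - 1
        else:
            recency.remove(match)
        recency.append(match)
        session = sessions[match]
        session["pairs"].append((referrer, url))
        session["pages"].update(p for p in (referrer, url) if p != NO_REFERRER)
    return [s["pairs"] for s in sessions]


IP = "indigo.sungshin.ac.kr"
AIX, WIN = "Mozilla/2.0; AIX 4.1.4", "Mozilla/2.0; Win95"
TABLE_2 = [
    (IP, "a", "-", AIX), (IP, "b", "e", AIX), (IP, "c", "b", AIX),
    (IP, "b", "-", WIN), (IP, "c", "b", WIN), (IP, "f", "-", WIN),
    (IP, "b", "a", AIX), (IP, "g", "b", AIX),
]
for number, pairs in enumerate(identify_sessions(TABLE_2), start=1):
    print(f"S{number}:", pairs)
```

On the Table 2 records this yields the same four sessions as listed on the slide.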
10
[Figure: graph of pages a, b, c, d, e, f, and EXIT]
Traversal steps from log files
  (- a), (a b), (b c), (c d), (b e), (a f)
Traversal path representing a user session
  (- a), (a b), (b c), (c d), (d c), (c b), (b e), (e b), (b a), (a f)
Figure 2. An example of user traversal path in a surfing session
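The step from the logged pairs to the full traversal path in Figure 2 can be sketched with a stack that holds the current forward path. This is a stack-based variant written for illustration, not necessarily how SpeedTracer implements it; `rebuild_session` is a hypothetical name. When a logged pair starts from a page further up the stack, the backward steps that the browser served from its cache are added back before the pair is appended.

```python
NO_REFERRER = "-"


def rebuild_session(logged_steps):
    """logged_steps: list of (referrer, url) pairs from the log, in order.
    Returns the traversal path with the missing backward steps added back."""
    session = []
    stack = []  # pages on the current forward path, most recent last
    for referrer, url in logged_steps:
        if not stack:
            if referrer != NO_REFERRER:
                stack.append(referrer)
        elif referrer not in stack:
            raise ValueError(f"({referrer} {url}) does not belong to this session")
        else:
            # Walk back through cached pages until the referrer is on top.
            while stack[-1] != referrer:
                page = stack.pop()
                session.append((page, stack[-1]))
        session.append((referrer, url))
        stack.append(url)
    return session


figure_2_steps = [("-", "a"), ("a", "b"), ("b", "c"), ("c", "d"), ("b", "e"), ("a", "f")]
print(rebuild_session(figure_2_steps))
```

The same routine also reproduces the earlier small example: the logged steps (a b), (b c), (b d) become (a b), (b c), (c b), (b d).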
11
In our session identification,
Requests for “gif” or “jpg” files are eliminated
On a reload, the repeated access pair is discarded
A page reached through a bookmark or by directly typing in the URL is treated as the beginning of a new session
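The first two rules can be sketched as a small filter over access pairs. This is a minimal illustration under stated assumptions: image requests are recognized by file suffix only, and a reload is taken to be a pair identical to the one just kept. The function name and the sample paths are hypothetical.

```python
IMAGE_SUFFIXES = (".gif", ".jpg")


def clean_steps(steps):
    """steps: list of (referrer, url) pairs in log order.
    Drops image requests and immediately repeated pairs (assumed reloads)."""
    cleaned = []
    for referrer, url in steps:
        if url.lower().endswith(IMAGE_SUFFIXES):
            continue  # embedded image, not a page the user navigated to
        if cleaned and cleaned[-1] == (referrer, url):
            continue  # same pair again: treated as a reload
        cleaned.append((referrer, url))
    return cleaned


raw = [
    ("-", "/a.html"),
    ("/a.html", "/images/nudge.gif"),
    ("/a.html", "/b.html"),
    ("/a.html", "/b.html"),
]
print(clean_steps(raw))  # [('-', '/a.html'), ('/a.html', '/b.html')]
```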
12
Mining frequent traversal paths
A frequent traversal path is a collection of consecutive URL pages
Only forward traversal subpaths are of interest
1. Maximum forward path
   a sequence of maximum connected pages, where no page is previously visited
2. Large traversal path
   a sequence of consecutive pages that appeared in the maximal forward paths of a sufficient number of sessions
13
Example of finding maximum forward paths: { a, b, c, d, c, b, e, b, a, f }
Step  Xi  Subpath {y1, …, yj-1}  Flag  Maximum forward path
1     a   {a}                    YES
2     b   {a, b}                 YES
3     c   {a, b, c}              YES
4     d   {a, b, c, d}           YES
5     c   {a, b, c}              NO    {a, b, c, d}
6     b   {a, b}                 NO
7     e   {a, b, e}              YES
8     b   {a, b}                 NO    {a, b, e}
9     a   {a}                    NO
10    f   {a, f}                 YES
11        {a, f}                 YES   {a, f}
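The walk-through above can be reproduced with a short routine. This is a sketch of the idea rather than SpeedTracer's code; the function name is hypothetical. It keeps the current subpath and a flag that records whether the last move was forward, and it emits the subpath whenever a backward move follows a forward one, plus once more at the end of the session.

```python
def maximal_forward_paths(pages):
    """pages: the pages of one session in visiting order.
    Returns the list of maximal forward paths."""
    paths = []
    current = []      # current subpath
    forward = False   # flag: was the previous move a forward move?
    for page in pages:
        if page in current:
            # Backward move: emit the path if we were moving forward,
            # then cut the subpath back to the revisited page.
            if forward:
                paths.append(list(current))
            del current[current.index(page) + 1:]
            forward = False
        else:
            current.append(page)
            forward = True
    if forward and current:
        paths.append(list(current))
    return paths


print(maximal_forward_paths(list("abcdcbebaf")))
# [['a', 'b', 'c', 'd'], ['a', 'b', 'e'], ['a', 'f']]
```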
14
Algorithm for discovering large traversal paths
for each Fi {
  for each {x1, x2, …, xm} in Fi {
    if (m ≥ k) {
      for (j = 1; j ≤ m - k + 1; j++) {
        if ({xj, …, xj+k-1} is already in LPk)
          increase its corresponding count
        else if ((support of {xj, …, xj+k-2} ≥ Sk-1) and
                 (support of {xj+1, …, xj+k-1} ≥ Sk-1))
          insert {xj, …, xj+k-1} into LPk
      }
    }
  }
}
Figure 4. Algorithm for discovering large traversal path set LPk
15
For example, assume session S1 contains two maximum forward paths: {A, B, C, D, E}, {G, H}
For CP3, test {A, B, C}, {B, C, D}, {C, D, E}
If {A, B} and {B, C} are in LP2, {A, B, C} is in CP3
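The level-wise search of Figure 4 and the CP3 example can be sketched as follows. This is an illustrative reading, not the original implementation: support is counted as the number of sessions containing a subpath, a single threshold is used for every length, and the names `large_paths`, `min_support`, and `prev_large` are assumptions. The two extra sessions in the usage data are invented so that something reaches the threshold.

```python
from collections import Counter


def large_paths(sessions, k, min_support, prev_large=None):
    """sessions: list of sessions, each a list of maximal forward paths.
    prev_large: set of large paths of length k-1 (as tuples), None for k == 1.
    Returns the set of length-k consecutive subpaths found in at least
    min_support sessions."""
    counts = Counter()
    for forward_paths in sessions:
        found = set()  # count each subpath at most once per session
        for path in forward_paths:
            for j in range(len(path) - k + 1):
                sub = tuple(path[j:j + k])
                # Candidate test: both (k-1)-subpaths must already be large.
                if prev_large is not None and (
                    sub[:-1] not in prev_large or sub[1:] not in prev_large
                ):
                    continue
                found.add(sub)
        counts.update(found)
    return {path for path, count in counts.items() if count >= min_support}


sessions = [
    [list("ABCDE"), list("GH")],  # S1 from the example above
    [list("ABC")],                # hypothetical extra session
    [list("BCD")],                # hypothetical extra session
]
lp1 = large_paths(sessions, 1, min_support=2)
lp2 = large_paths(sessions, 2, min_support=2, prev_large=lp1)
lp3 = large_paths(sessions, 3, min_support=2, prev_large=lp2)
print(sorted(lp3))  # [('A', 'B', 'C'), ('B', 'C', 'D')]
```

In this data {C, D, E} is never counted, because {D, E} is not in LP2.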
16
Mining groups of pages most frequently visited
First, any duplication of pages is eliminated in each session

Sort the groups in LGk-1 in lexicographical order;
for each group {x1, …, xk-1} in LGk-1 {
  for each group {y1, …, yk-1} in LGk-1 such that x2 = y1, …, xk-1 = yk-2 {
    construct a new group G = {x1, …, xk-1, yk-1};
    test all other combinations of subgroups of G with size (k - 1);
    if (all such subgroups are among the top M groups in LGk-1)
      add G into CGk;
  }
}
Figure 5. Algorithm for generating candidate groups CGk
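The candidate generation of Figure 5 can be sketched in a few lines. This is a hedged illustration: groups are represented as sorted tuples, the input is assumed to already be the top M groups of size k-1, and `candidate_groups` is a hypothetical name.

```python
from itertools import combinations


def candidate_groups(large_groups):
    """large_groups: iterable of sorted tuples, all of size k-1
    (assumed to be the top M groups LGk-1).
    Returns the candidate groups CGk as a sorted list of tuples."""
    ordered = sorted(large_groups)  # lexicographical order
    known = set(ordered)
    candidates = []
    for x in ordered:
        for y in ordered:
            # Join condition: x2 = y1, ..., x(k-1) = y(k-2).
            # The extra comparison keeps G sorted and free of duplicates.
            if x[1:] == y[:-1] and x[-1] < y[-1]:
                group = x + (y[-1],)
                # Every subgroup of size k-1 must itself be a large group.
                if all(sub in known for sub in combinations(group, len(x))):
                    candidates.append(group)
    return candidates


lg2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(candidate_groups(lg2))  # [('a', 'b', 'c')]
```

Here ('a', 'b', 'd') is rejected because its subgroup ('a', 'd') is not among the large groups.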