SpeedTracer: A Web usage mining and analysis tool
Sungshin Women's University, Department of Computer Science, DE Lab, 최영란
Abstract
To understand user surfing behavior, SpeedTracer:
  first identifies user sessions,
  then applies data mining algorithms to discover
    the top N most frequented user traversal paths
    the top N groups of pages most frequently visited together
  and produces user reports, path reports, and group reports.
2019-01-18 DE Lab 최 영 란
Introduction
Web server log file fields:
  IP / time stamp / method / URL / HTTP version / return code / bytes transferred / referrer page URL / agent
Peo-ill-21.ix.netcom.com - - [24/Feb/1997:00:00:21 +0000] "GET /images/nudge.gif HTTP/1.0" 200 37 "http://www.internet.ibm.com/" "Mozilla/2.0 (compatible; MSIE 3.01; Windows NT)"
Table 1. Sample log entry from an NCSA HTTPd
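The fields listed above can be pulled out of a log entry with a regular expression. The following is a minimal sketch, not SpeedTracer's own parser: the pattern, the group names, and the `parse_entry` helper are assumptions made for illustration, and the pattern only covers entries shaped like the sample in Table 1.

```python
import re

# Hypothetical pattern for entries shaped like the Table 1 sample:
# host ident user [time] "method url version" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

SAMPLE = (
    'Peo-ill-21.ix.netcom.com - - [24/Feb/1997:00:00:21 +0000] '
    '"GET /images/nudge.gif HTTP/1.0" 200 37 '
    '"http://www.internet.ibm.com/" '
    '"Mozilla/2.0 (compatible; MSIE 3.01; Windows NT)"'
)


def parse_entry(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the bytes column means no body was transferred.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


entry = parse_entry(SAMPLE)
print(entry["host"], entry["url"], entry["referrer"])
```

Of these fields, session identification later relies on the host (IP), time stamp, URL, referrer, and agent.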
Difficulties of user identification
Proxy servers and firewalls mask the client's IP address.
Accesses to cached Web pages do not reach the server log.
So "cookies" and log-ins are needed, but users prefer not to register or to accept cookies.
SpeedTracer
  uses the referrer page and the URL together as a traversal step
  maps each identified user session into a transaction
Figure 1. Flow diagram of SpeedTracer implementation
  Inputs: access log, referrer log, agent log
  Stages: merge log records; session identification; mining (top frequent traversal paths, groups of pages)
  Outputs: user reports, path reports, group reports
Comparison
Sequential patterns: the items in a pattern do not need to be consecutive.
Frequent traversal paths: a collection of consecutive URL pages in a Web presentation.
Frequently visited groups of pages: similar to association rules; no ordering among the pages.
User session identification
Uses five key pieces of information: IP, Timestamp, URL, Referral, and Agent.
The Referral and URL together form a hyperlink access pair, representing a step in a user traversal path.
Cached pages leave no corresponding log entries, so these missing access pairs need to be added back during session identification.
Session identification
S: (x_1 y_1), (x_2 y_2), …, (x_n y_n), where x_{i+1} = y_i, 1 ≤ i < n
A new pair (x_j y_j) can be appended to S if x_j = y_k for some 1 ≤ k ≤ n, or x_1 = x_j.
Unless x_j = y_n, the backward access path (y_n x_n), …, (y_{k+1} x_{k+1}) must first be added back to S.
For example, if (b d) is to be appended to S_i: (a b), (b c)
  new S_i: (a b), (b c), (c b), (b d)
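The append rule above can be sketched in a few lines. This is an illustrative reading, not the tool's implementation: the `Session` class and `build_session` helper are hypothetical names, and tracking the current forward path on a stack (so that backward pairs are emitted while popping) is an assumption about how the backward access path is recovered.

```python
class Session:
    """A traversal path that restores backward steps missing from the log."""

    def __init__(self, first_pair):
        x, y = first_pair
        self.pairs = [first_pair]   # full traversal path, including added steps
        self.stack = [x, y]         # assumption: current forward path, root first

    def append(self, pair):
        x, y = pair
        if x not in self.stack:
            raise ValueError(f"{pair} does not continue this session")
        # Walk back along the forward path until we stand on page x,
        # adding each backward access pair that the cache hid from the log.
        while self.stack[-1] != x:
            child = self.stack.pop()
            self.pairs.append((child, self.stack[-1]))
        self.stack.append(y)
        self.pairs.append(pair)


def build_session(logged_pairs):
    """Replay logged (referrer, url) pairs and return the completed path."""
    session = Session(logged_pairs[0])
    for pair in logged_pairs[1:]:
        session.append(pair)
    return session.pairs


# The example from the slide: append (b d) to (a b), (b c).
print(build_session([("a", "b"), ("b", "c"), ("b", "d")]))

# The traversal steps of Figure 2, as logged.
steps = [("-", "a"), ("a", "b"), ("b", "c"), ("c", "d"), ("b", "e"), ("a", "f")]
print(build_session(steps))
```

Replaying the logged steps of Figure 2 this way yields the ten-pair traversal path shown in that figure.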
Table 2. Key information in log used for session identification
Record  IP                     Timestamp  URL  Referral  Agent
1       indigo.sungshin.ac.kr  08:30:00   a    -         Mozilla/2.0; AIX 4.1.4
2       indigo.sungshin.ac.kr  08:30:01   b    e         Mozilla/2.0; AIX 4.1.4
3       indigo.sungshin.ac.kr  08:30:01   c    b         Mozilla/2.0; AIX 4.1.4
4       indigo.sungshin.ac.kr  08:30:01   b    -         Mozilla/2.0; Win95
5       indigo.sungshin.ac.kr  08:30:02   c    b         Mozilla/2.0; Win95
6       indigo.sungshin.ac.kr  08:30:03   f    -         Mozilla/2.0; Win95
7       indigo.sungshin.ac.kr  08:30:04   b    a         Mozilla/2.0; AIX 4.1.4
8       indigo.sungshin.ac.kr  08:30:05   g    b         Mozilla/2.0; AIX 4.1.4
Identified sessions:
S1: (- a), (a b), (b g)  from log 1, 7, 8
S2: (e b), (b c)         from log 2, 3
S3: (- b), (b c)         from log 4, 5
S4: (- f)                from log 6
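How the records of Table 2 fall into four sessions can be reproduced with a short sketch. Several things here are assumptions rather than facts from the slides: the `identify_sessions` name, treating (IP, Agent) as the user key, and breaking ties by choosing the most recently updated matching session. Backward-path insertion and any time-out rule are left out to keep the example small.

```python
AIX = "Mozilla/2.0; AIX 4.1.4"
WIN = "Mozilla/2.0; Win95"
HOST = "indigo.sungshin.ac.kr"

# (record, ip, timestamp, url, referral, agent), copied from Table 2.
RECORDS = [
    (1, HOST, "08:30:00", "a", "-", AIX),
    (2, HOST, "08:30:01", "b", "e", AIX),
    (3, HOST, "08:30:01", "c", "b", AIX),
    (4, HOST, "08:30:01", "b", "-", WIN),
    (5, HOST, "08:30:02", "c", "b", WIN),
    (6, HOST, "08:30:03", "f", "-", WIN),
    (7, HOST, "08:30:04", "b", "a", AIX),
    (8, HOST, "08:30:05", "g", "b", AIX),
]


def identify_sessions(records):
    """Group log records into sessions of (referral, url) access pairs."""
    sessions = []  # kept in creation order
    for recno, ip, _time, url, referral, agent in records:
        user = (ip, agent)
        target = None
        if referral != "-":
            # Assumption: prefer the most recently updated session of this user
            # in which the referral page already appears (as some y_k or as x_1).
            for s in sorted(sessions, key=lambda s: s["last"], reverse=True):
                known = s["pairs"][0][0] == referral or any(
                    y == referral for _x, y in s["pairs"]
                )
                if s["user"] == user and known:
                    target = s
                    break
        if target is None:
            # No referral, or an unknown referral: begin a new session.
            target = {"user": user, "pairs": [], "records": [], "last": recno}
            sessions.append(target)
        target["pairs"].append((referral, url))
        target["records"].append(recno)
        target["last"] = recno
    return sessions


for i, s in enumerate(identify_sessions(RECORDS), start=1):
    print(f"S{i}:", s["pairs"], "from log", s["records"])
```

Under these assumptions record 8 joins S1 rather than S2, because S1 was updated more recently (by record 7) and already contains page b.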
Traversal steps from log files:
  (- a), (a b), (b c), (c d), (b e), (a f)
Traversal path representing a user session:
  (- a), (a b), (b c), (c d), (d c), (c b), (b e), (e b), (b a), (a f)
Figure 2. An example of user traversal path in a surfing session (pages a, b, c, d, e, f; the session ends at EXIT)
In our session identification:
Requests for "gif" or "jpg" files are eliminated.
On a reload, the repeated access pair is discarded.
An access made through a bookmark or by directly typing in the URL is treated as the beginning of a new session.
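The first two rules can be sketched as a small filter over (referral, url) pairs. The `clean_pairs` name is hypothetical, and detecting a reload as an access pair identical to the one immediately before it is an assumption; the slides do not say how reloads are recognized.

```python
IMAGE_SUFFIXES = (".gif", ".jpg")


def clean_pairs(pairs):
    """Drop image requests and immediately repeated access pairs."""
    cleaned = []
    for referral, url in pairs:
        if url.lower().endswith(IMAGE_SUFFIXES):
            continue  # embedded image, not a page the user chose to visit
        if cleaned and cleaned[-1] == (referral, url):
            continue  # assumption: an identical consecutive pair is a reload
        cleaned.append((referral, url))
    return cleaned


raw = [
    ("-", "/a.html"),
    ("/a.html", "/images/nudge.gif"),
    ("/a.html", "/b.html"),
    ("/a.html", "/b.html"),
]
print(clean_pairs(raw))
```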
Mining frequent traversal paths
A frequent traversal path is a collection of consecutive URL pages.
Only forward traversal subpaths are of interest.
1. Maximum forward path
   a sequence of maximum connected pages in which no page has been previously visited
2. Large traversal path
   a sequence of consecutive pages that appears in the maximum forward paths of a sufficient number of sessions
Example of finding maximum forward paths: { a, b, c, d, c, b, e, b, a, f }
Step  x_i  Subpath {y_1, …, y_{j-1}}  Flag  Maximum forward path
1     a    {a}                        YES
2     b    {a, b}                     YES
3     c    {a, b, c}                  YES
4     d    {a, b, c, d}               YES
5     c    {a, b, c, d}               NO    {a, b, c, d}
6     b    {a, b}                     NO
7     e    {a, b, e}                  YES
8     b    {a, b}                     NO    {a, b, e}
9     a    {a}                        NO
10    f    {a, f}                     YES
11         {a, f}                     YES   {a, f}
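The walk through the table can be expressed as a short routine. This is a sketch under the assumption that a backward move is any visit to a page already on the current subpath; the function name `maximum_forward_paths` is chosen here for illustration.

```python
def maximum_forward_paths(pages):
    """Split a visited-page sequence into its maximum forward paths."""
    paths = []
    subpath = []
    moving_forward = True  # corresponds to the YES/NO flag in the table
    for page in pages:
        if page in subpath:
            # Backward move: emit the subpath once, at the turning point.
            if moving_forward:
                paths.append(list(subpath))
            del subpath[subpath.index(page) + 1:]
            moving_forward = False
        else:
            subpath.append(page)
            moving_forward = True
    if moving_forward and subpath:
        paths.append(list(subpath))  # end of session closes the last path
    return paths


print(maximum_forward_paths(list("abcdcbebaf")))
```

For the sequence in the example this produces the three paths listed in the last column: {a, b, c, d}, {a, b, e}, and {a, f}.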
for each F_i {
    for each {x_1, x_2, …, x_m} in F_i {
        if (m ≥ k) {
            for (j = 1; j ≤ m - k + 1; j++) {
                if ({x_j, …, x_{j+k-1}} is already in LP_k)
                    increase its corresponding count
                else if ((support of {x_j, …, x_{j+k-2}} ≥ S_{k-1}) and
                         (support of {x_{j+1}, …, x_{j+k-1}} ≥ S_{k-1}))
                    insert {x_j, …, x_{j+k-1}} into LP_k
            }
        }
    }
}
Figure 4. Algorithm for discovering large traversal path set LP_k
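A level-wise version of the algorithm in Figure 4 might look like the sketch below. It is not a transcription of the figure: the name `mine_large_paths`, the single `min_support` threshold used for every length (the figure uses per-level thresholds S_{k-1}), and counting a subpath at most once per session are all assumptions made here.

```python
from collections import Counter


def mine_large_paths(sessions, min_support, max_len):
    """Return {k: {subpath: support}} for large traversal paths of length k.

    sessions: one entry per session, each a list of maximum forward paths.
    """
    levels = {}
    prev = {}
    for k in range(1, max_len + 1):
        counts = Counter()
        for forward_paths in sessions:
            seen = set()  # assumption: count each subpath once per session
            for path in forward_paths:
                for j in range(len(path) - k + 1):
                    sub = tuple(path[j:j + k])
                    # Both (k-1)-length subpaths must already be large.
                    if k > 1 and (sub[:-1] not in prev or sub[1:] not in prev):
                        continue
                    seen.add(sub)
            counts.update(seen)
        prev = {p: c for p, c in counts.items() if c >= min_support}
        if not prev:
            break
        levels[k] = prev
    return levels


sessions = [
    [("A", "B", "C", "D", "E"), ("G", "H")],
    [("A", "B", "C")],
    [("B", "C", "D")],
]
for k, paths in mine_large_paths(sessions, min_support=2, max_len=5).items():
    print(k, paths)
```

The pruning test is the same one the figure applies: a length-k window is considered only if its leading and trailing (k-1)-length subpaths both have sufficient support.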
For example, assume session S1 contains two maximum forward paths: {A, B, C, D, E} and {G, H}.
For CP_3, test {A, B, C}, {B, C, D}, and {C, D, E}.
If {A, B} and {B, C} are both in LP_2, then {A, B, C} is in CP_3.
Mining groups of pages most frequently visited
First, any duplication of pages within each session is eliminated.
Sort the groups in LG_{k-1} in lexicographical order;
for each group {x_1, …, x_{k-1}} in LG_{k-1} {
    for each group {y_1, …, y_{k-1}} in LG_{k-1} such that x_2 = y_1, …, x_{k-1} = y_{k-2} {
        construct a new group G = {x_1, …, x_{k-1}, y_{k-1}};
        test all other combinations of subgroups of G with size (k - 1);
        if (all such subgroups are among the top M groups in LG_{k-1})
            add G into CG_k;
    }
}
Figure 5. Algorithm for generating candidate groups CG_k
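The join-and-test step of Figure 5 can be sketched as follows. The function name `candidate_groups` is hypothetical, groups are represented as sorted tuples, and the input is assumed to already be the top M groups of size k - 1; selecting those top M groups is outside this sketch.

```python
from itertools import combinations


def candidate_groups(large_prev):
    """Build candidate groups of size k from large groups of size k - 1."""
    prev = sorted(set(large_prev))  # lexicographical order
    if not prev:
        return []
    prev_set = set(prev)
    k = len(prev[0]) + 1
    candidates = []
    for x in prev:
        for y in prev:
            # Join condition: x_2 = y_1, ..., x_{k-1} = y_{k-2}.
            # The extra x[0] < y[-1] check keeps size-2 candidates sorted
            # and free of duplicates when k - 1 = 1.
            if x[1:] == y[:-1] and x[0] < y[-1]:
                group = x + (y[-1],)
                # Every (k-1)-subgroup must itself be a large group.
                if all(sub in prev_set for sub in combinations(group, k - 1)):
                    candidates.append(group)
    return candidates


lg2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(candidate_groups(lg2))
```

In this usage example {a, b, d} is rejected because its subgroup {a, d} is not among the large groups of size 2, while {a, b, c} passes the test.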