SpeedTracer: A Web Usage Mining and Analysis Tool
Sungshin Women's University, Department of Computer Science, DE Lab
최영란
Abstract
To understand user surfing behavior, SpeedTracer
  first identifies user sessions
  then applies data mining algorithms to discover
    the top N most frequently traversed user paths
    the top N groups of pages most frequently visited together
  and produces user reports, path reports, and group reports
2019-01-18 DE Lab 최 영 란
Introduction
Web server log file
  IP / time stamp / method / URL / HTTP version / return code / bytes transferred / referrer page URL / agent
  Peo-ill-21.ix.netcom.com - - [24/Feb/1997:00:00:21 +0000] "GET /images/nudge.gif HTTP/1.0" 200 37 "http://www.internet.ibm.com/" "Mozilla/2.0 (compatible; MSIE 3.01; Windows NT)"
Table 1. Sample log entry from an NCSA HTTPd
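The field layout above can be illustrated with a small parser. This is a minimal sketch, assuming the entry uses plain ASCII quotes as in a real log file; the names `LOG_PATTERN` and `parse_log_entry` are hypothetical and are not part of SpeedTracer.

```python
import re

# One named group per field listed on the slide. The two fields after the
# host (shown as "- -" in the sample) are skipped.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_entry(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = (
    'Peo-ill-21.ix.netcom.com - - [24/Feb/1997:00:00:21 +0000] '
    '"GET /images/nudge.gif HTTP/1.0" 200 37 '
    '"http://www.internet.ibm.com/" '
    '"Mozilla/2.0 (compatible; MSIE 3.01; Windows NT)"'
)
entry = parse_log_entry(sample)
```

All values are returned as strings; converting the time stamp or byte count is left to the caller.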
Difficulties of user identification
  Proxy servers and firewalls mask client IP addresses
  Cached Web pages leave no entry in the server log
  So "cookies" and log-ins are needed, but users do not want to be asked to register or to use cookies
SpeedTracer
  Uses the referrer page and the requested URL as a traversal step
  Maps each identified user session into a transaction
Comparison
  Sequential patterns
    unlike frequent traversal paths, the items do not need to be consecutive
  Frequent traversal paths
    collections of consecutive URL pages in a Web presentation
  Frequently visited groups of pages
    similar to association rules
    no ordering
User session identification
  Uses five key pieces of information: IP, Timestamp, URL, Referral, and Agent
  Referral and URL form a hyperlink access pair, representing a step in a user traversal path
  Cached pages have no corresponding log entries, so these missing access pairs need to be added back during session identification
Session identification
  S: (x_1 y_1), (x_2 y_2), ..., (x_n y_n), where x_{i+1} = y_i for 1 ≤ i < n
  A new pair (x_j y_j) can be appended to S if x_j = y_k for some 1 ≤ k ≤ n, or if x_1 = x_j
  Unless x_j = y_n, the backward access path (y_n x_n), ..., (y_{k+1} x_{k+1}) must first be added back to S
  For example, if (b d) is to be appended to S_i: (a b), (b c)
    new S_i: (a b), (b c), (c b), (b d)
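The append rule can be sketched as follows. The slide does not say how k is chosen once backward steps have already been re-added, so this sketch assumes a stack holding the current forward path and pops it until the referrer is on top; the class name `Session` and its attributes are hypothetical.

```python
class Session:
    """Traversal path that re-adds backward steps for cached pages."""

    def __init__(self):
        self.pairs = []      # full path, including re-added backward steps
        self._forward = []   # assumption: stack of forward steps to the current page

    def append(self, x, y):
        """Append access pair (x y); return False if x is not reachable."""
        if self._forward:
            on_path = any(dest == x for _, dest in self._forward)
            if not on_path and x != self._forward[0][0]:
                return False
            # Walk back until the current page is x, logging each
            # backward step (dest src) that the cache hid from the server.
            while self._forward and self._forward[-1][1] != x:
                src, dest = self._forward.pop()
                self.pairs.append((dest, src))
        self._forward.append((x, y))
        self.pairs.append((x, y))
        return True

# The example from the slide: append (b d) to (a b), (b c).
s = Session()
for pair in [("a", "b"), ("b", "c"), ("b", "d")]:
    s.append(*pair)
```

With this input, `s.pairs` holds (a b), (b c), (c b), (b d), matching the slide. Reload handling and session start detection are deliberately left out of this sketch.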
Record  IP                     Timestamp  URL  Referral  Agent
1       indigo.sungshin.ac.kr  08:30:00   a    -         Mozilla/2.0; AIX 4.1.4
2       indigo.sungshin.ac.kr  08:30:01   b    e         Mozilla/2.0; AIX 4.1.4
3       indigo.sungshin.ac.kr  08:30:01   c    b         Mozilla/2.0; AIX 4.1.4
4       indigo.sungshin.ac.kr  08:30:01   b    -         Mozilla/2.0; Win95
5       indigo.sungshin.ac.kr  08:30:02   c    b         Mozilla/2.0; Win95
6       indigo.sungshin.ac.kr  08:30:03   f    -         Mozilla/2.0; Win95
7       indigo.sungshin.ac.kr  08:30:04   b    a         Mozilla/2.0; AIX 4.1.4
8       indigo.sungshin.ac.kr  08:30:05   g    b         Mozilla/2.0; AIX 4.1.4
Table 2. Key information in log used for session identification

Identified sessions
  S1: (- a), (a b), (b g)   from records 1, 7, 8
  S2: (e b), (b c)          from records 2, 3
  S3: (- b), (b c)          from records 4, 5
  S4: (- f)                 from record 6
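The grouping of Table 2 into sessions can be reproduced with a short sketch. It rests on several assumptions that the slides do not state: records are grouped by (IP, Agent), a "-" referral always opens a new session, and when several sessions could accept a pair, the most recently updated one wins. Backward-path reinsertion is omitted, and `identify_sessions` is a hypothetical name.

```python
IP = "indigo.sungshin.ac.kr"
AIX = "Mozilla/2.0; AIX 4.1.4"
WIN = "Mozilla/2.0; Win95"

# (IP, timestamp, URL, referral, agent), in log order as in Table 2.
RECORDS = [
    (IP, "08:30:00", "a", "-", AIX),
    (IP, "08:30:01", "b", "e", AIX),
    (IP, "08:30:01", "c", "b", AIX),
    (IP, "08:30:01", "b", "-", WIN),
    (IP, "08:30:02", "c", "b", WIN),
    (IP, "08:30:03", "f", "-", WIN),
    (IP, "08:30:04", "b", "a", AIX),
    (IP, "08:30:05", "g", "b", AIX),
]

def accepts(session, referral):
    """True if the referral is a page already visited in this session."""
    pairs = session["pairs"]
    return referral == pairs[0][0] or any(url == referral for _, url in pairs)

def identify_sessions(records):
    sessions = []
    for step, (ip, _time, url, referral, agent) in enumerate(records):
        candidates = [
            s for s in sessions
            if s["client"] == (ip, agent) and referral != "-" and accepts(s, referral)
        ]
        if candidates:
            # Assumed tie-break: the most recently updated session.
            session = max(candidates, key=lambda s: s["last_step"])
        else:
            session = {"client": (ip, agent), "pairs": [], "last_step": step}
            sessions.append(session)
        session["pairs"].append((referral, url))
        session["last_step"] = step
    return [s["pairs"] for s in sessions]

sessions = identify_sessions(RECORDS)
```

Record order is used instead of the time stamp for the tie-break because three records share the time 08:30:01.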
[Figure: link graph over pages a, b, c, d, e, f, ending at EXIT]
Traversal steps from log files
  (- a), (a b), (b c), (c d), (b e), (a f)
Traversal path representing a user session
  (- a), (a b), (b c), (c d), (d c), (c b), (b e), (e b), (b a), (a f)
Figure 2. An example of user traversal path in a surfing session
In our session identification
  Requests for "gif" or "jpg" files are eliminated
  On a reload, the repeated access pair is discarded
  A page reached through a bookmark or by directly typing in the URL is treated as the beginning of a new session
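These cleaning rules might look like the sketch below. It assumes that a reload shows up as the same access pair twice in a row and that a bookmark or typed URL appears with "-" as the referral; both function names are hypothetical.

```python
IMAGE_SUFFIXES = (".gif", ".jpg")

def clean_pairs(pairs):
    """Drop image requests and immediately repeated (reloaded) access pairs."""
    cleaned = []
    for referral, url in pairs:
        if url.lower().endswith(IMAGE_SUFFIXES):
            continue  # embedded image, not a page the user chose to visit
        if cleaned and cleaned[-1] == (referral, url):
            continue  # assumed reload: same pair as the previous one
        cleaned.append((referral, url))
    return cleaned

def starts_new_session(referral):
    """Assumption: no referral means a bookmark or a directly typed URL."""
    return referral == "-"

raw = [
    ("-", "/a.html"),
    ("/a.html", "/images/nudge.gif"),
    ("/a.html", "/b.html"),
    ("/a.html", "/b.html"),
]
cleaned = clean_pairs(raw)
```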
Mining frequent traversal paths
  A frequent traversal path is a collection of consecutive URL pages
  Only forward traversal subpaths are of interest
  1. Maximal forward path
     a maximal sequence of connected pages in which no page has been previously visited
  2. Large traversal path
     a sequence of consecutive pages that appears in the maximal forward paths of a sufficient number of sessions
Example of finding maximal forward paths: { a, b, c, d, c, b, e, b, a, f }

Step  Xi  Subpath {y1, ..., yj-1}  Flag  Maximal forward path
1     a   {a}                      YES
2     b   {a, b}                   YES
3     c   {a, b, c}                YES
4     d   {a, b, c, d}             YES
5     c   {a, b, c, d}             NO    {a, b, c, d}
6     b   {a, b}                   NO
7     e   {a, b, e}                YES
8     b   {a, b}                   NO    {a, b, e}
9     a   {a}                      NO
10    f   {a, f}                   YES
11        {a, f}                   YES   {a, f}
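The walk through the table can be sketched in a few lines. This is an illustrative version under the assumption that a path is emitted whenever a forward run is interrupted by a revisit or by the end of the session; the function name and the `moving_forward` flag (playing the role of the Flag column) are hypothetical.

```python
def maximal_forward_paths(pages):
    """Split a page sequence into its maximal forward paths."""
    paths = []
    current = []            # forward path from the entry page to the current page
    moving_forward = False  # True while the last move visited a new page
    for page in pages:
        if page in current:
            # Backward move: the path just before it is maximal,
            # but only if we were still moving forward.
            if moving_forward:
                paths.append(list(current))
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    if moving_forward and current:
        paths.append(list(current))  # the session ended on a forward move
    return paths

paths = maximal_forward_paths(list("abcdcbebaf"))
```

For the sequence in the example, `paths` contains {a, b, c, d}, {a, b, e}, and {a, f}, in that order.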
for each F_i {
    for each {x_1, x_2, ..., x_m} in F_i {
        if (m ≥ k) {
            for (j = 1; j ≤ m - k + 1; j++) {
                if ({x_j, ..., x_{j+k-1}} is already in LP_k)
                    increase its corresponding count
                else if ((support of {x_j, ..., x_{j+k-2}} ≥ S_{k-1}) and
                         (support of {x_{j+1}, ..., x_{j+k-1}} ≥ S_{k-1}))
                    insert {x_j, ..., x_{j+k-1}} into LP_k
            }
        }
    }
}
Figure 4. Algorithm for discovering large traversal path set LP_k
For example, assume session S1 contains two maximal forward paths: {A, B, C, D, E} and {G, H}
  For CP3, test {A, B, C}, {B, C, D}, {C, D, E}
  If {A, B} and {B, C} are both in LP2, then {A, B, C} is in CP3
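A level-by-level version of this search can be sketched as follows. It makes two simplifying assumptions not taken from the slides: a single support threshold `min_support` is used at every level, and a subpath is counted at most once per session. The function names are hypothetical.

```python
from collections import Counter

def windows(path, k):
    """All subpaths of k consecutive pages."""
    return [tuple(path[j:j + k]) for j in range(len(path) - k + 1)]

def large_traversal_paths(sessions, min_support, max_k):
    """sessions: one list of maximal forward paths per session.
    Returns {k: {subpath: number of supporting sessions}}."""
    result = {}
    previous = None  # large paths of length k - 1
    for k in range(1, max_k + 1):
        counts = Counter()
        for forward_paths in sessions:
            seen = set()
            for path in forward_paths:
                for w in windows(path, k):
                    # Candidate test: both (k - 1)-subpaths must be large.
                    if previous is not None and (
                        w[:-1] not in previous or w[1:] not in previous
                    ):
                        continue
                    seen.add(w)
            counts.update(seen)  # assumed: one vote per session
        previous = {w: c for w, c in counts.items() if c >= min_support}
        if not previous:
            break
        result[k] = previous
    return result

sessions = [
    [list("ABCDE"), list("GH")],
    [list("ABCD")],
    [list("BCDE")],
]
large = large_traversal_paths(sessions, min_support=2, max_k=3)
```

In this made-up data set, {G, H} is never tested at length 2 because G and H each appear in only one session.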
Mining groups of pages most frequently visited
  First, any duplication of pages is eliminated in each session

Sort the groups in LG_{k-1} in lexicographical order;
for each group {x_1, ..., x_{k-1}} in LG_{k-1} {
    for each group {y_1, ..., y_{k-1}} in LG_{k-1} such that x_2 = y_1, ..., x_{k-1} = y_{k-2} {
        construct a new group G = {x_1, ..., x_{k-1}, y_{k-1}};
        test all other combinations of subgroups of G with size (k - 1);
        if (all such subgroups are among the top M groups in LG_{k-1})
            add G into CG_k;
    }
}
Figure 5. Algorithm for generating candidate groups CG_k
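The candidate generation step can be sketched as below. It assumes each group is stored as a sorted tuple of page names and that the input already holds only the top M groups of the previous level; `candidate_groups` is a hypothetical name.

```python
from itertools import combinations

def candidate_groups(large_groups):
    """Join (k - 1)-groups that overlap on k - 2 pages into k-groups."""
    groups = sorted({tuple(g) for g in large_groups})  # lexicographical order
    known = set(groups)
    candidates = []
    for x in groups:
        for y in groups:
            # Join condition: x_2 = y_1, ..., x_{k-1} = y_{k-2}.
            # The extra comparison keeps pages distinct and sorted.
            if x[1:] != y[:-1] or y[-1] <= x[-1]:
                continue
            group = x + (y[-1],)
            # Every subgroup of size k - 1 must itself be a large group.
            subgroups = combinations(group, len(group) - 1)
            if all(sub in known for sub in subgroups):
                candidates.append(group)
    return candidates

pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
triples = candidate_groups(pairs)
```

Here {a, b, c} is kept because all three of its pairs are large, while {a, b, d} is rejected because {a, d} is missing.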