Published by Dorothy Price. Modified over 9 years ago.
1
Advanced Topics in Data Mining: Web Mining
2
Web Mining
3
Applications are being ported to the Web at a rapid pace
On-line services, such as America Online (AOL) and CompuServe (merged into AOL), are anxious to know user access patterns, not just "search" on the Web
How does Amazon do it?
Understanding Web user behavior is important
– It can improve Web page organization
– It can increase Web server performance
– It can exploit Web advertising
– It can increase business opportunity
4
Amazon Web Page Association Rules
5
More Information Desired
Collecting statistical information (page hits) only is insufficient because:
– The hit frequency of a page depends not only on its content but also on its location
– The number of users accessing a page is not available
– Information on which pages are accessed together is not available
Data mining in the Web (Web Mining)
– Web Access Pattern Collection
– Web User Pattern Mining
6
Web Access Pattern Collection
Server-Based Data Collection
– Who is visiting a given Web site and what are they doing?
Agent-Based Data Collection
– Which Web sites has a particular user visited?
7
Server-Based Data Collection
Examine the logs collected by HTTPd
– Access Log (IP, Time, Access Data), Referred Log (A → B), Error Log, …
– We can combine some of them for our use if necessary
Problems
– The use of proxy servers
– The effect of caching
8
Server-Based Data Collection
9
Access Log
IP/Domain Name    Time    Access Data
10
Referred Log
The caching problem is not considered here
11
Server-Based Data Collection
Has to be done in accordance with technology advances
– The use of Active Server Pages (Session ID available)
  The use of proxy servers
  The effect of caching
– HTTPd 1.1
Limitation
– Can only capture user behavior while users are within this site
12
Agent-Based Data Collection
Understanding individual Web behavior needs client-based data collection
Results are useful
– Better Personalized Service
– Improved Web Page Organization
– Better Pricing Policies
Methods
– Applets can only read/write files on their source servers, which is a big security constraint
– Using Active Components (ActiveX Control) and PlugIns
  APCS (Access Pattern Collection Server)
13
APCS
18
Agent-Based Data Collection
Very difficult to do for non-registered users in the current Web environment
– It has to be conducted with users' consent
Very dependent upon available Web technologies
19
Web User Pattern Mining
Web user pattern mining discovers user access patterns in Web servers
Pattern discovery and analysis tools
– Some existing Web tools provide mechanisms for reporting user activity in the servers
– Web Trends (http://www.webtrends.com.tw/)
– Open Market (http://www.openmarket.com/)
– Net.Genesis (http://www.netgen.com/)
20
Path Traversal Patterns Mining
Mining path traversal patterns in a distributed information providing environment (WWW), where documents or objects are linked together (via hyperlinks) to facilitate interactive access
Solution procedure consists of three steps:
– Convert the original sequence of log data into a set of maximal forward references (MF)
  Filter out the effect of some backward references
  – Backward references are mainly made for ease of traveling; filtering them out lets us concentrate on mining meaningful user access sequences
  – Some objects are visited because of their locations rather than their content
– Determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained
– Determine the maximal reference sequences from the large reference sequences (trivial)
21
Step1: MF References
Suppose the traversal log contains the following traversal path for a user:
– A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V
The set of maximal forward references is {ABCD, ABEGH, ABEGW, AOU, AOV}
When a backward reference occurs, the current forward reference path terminates.
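The conversion in Step 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming single-character page names and one user's path as input; the function name `maximal_forward_references` is made up for this example and is not from the slides.

```python
def maximal_forward_references(path):
    """Split one user's traversal path into maximal forward references."""
    current = []            # forward path built so far
    result = []
    moving_forward = False  # True while the last move was a forward reference
    for page in path:
        if page in current:
            # Backward reference: the forward path ends here (emit it once).
            if moving_forward:
                result.append("".join(current))
            # Back up to the revisited page and continue from there.
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    if moving_forward:
        result.append("".join(current))
    return result


log = list("ABCDCBEGHGWAOUOV")   # the traversal path from the slide
mf = maximal_forward_references(log)
print(mf)
```

Run on the path above, the sketch reproduces the five maximal forward references listed on the slide.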
22
Step1: Another Example
23
Step1: Arrange Database Encoding
24
Step1: Database Reduction
25
Step2: Find Frequent Reference Sequences
Two algorithms for finding Frequent Traversal Patterns (Frequent Reference Sequences, Frequent Consecutive Subsequences)
– Full-Scan (FS) Algorithm
  FS utilizes key ideas of the DHP algorithm
– Selective-Scan (SS) Algorithm
  SS reduces the number of database scans
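To make "frequent consecutive subsequences" concrete, here is a plain level-wise sketch in Python. It is not the FS or SS algorithm itself: it omits the DHP-style hashing and database reduction and only shows the counting and join steps. The sample references reuse the MF set from Step 1, and the threshold of 2 is an assumption made for illustration.

```python
from collections import Counter


def frequent_reference_sequences(references, min_support):
    """Return {sequence: support} for all frequent consecutive subsequences."""
    frequent = {}
    candidates = None   # None means "count every length-1 sequence"
    k = 1
    while True:
        counts = Counter()
        for ref in references:
            # Each reference supports a given subsequence at most once.
            seen = {ref[i:i + k] for i in range(len(ref) - k + 1)}
            if candidates is not None:
                seen &= candidates
            counts.update(seen)
        large = {s: c for s, c in counts.items() if c >= min_support}
        if not large:
            return frequent
        frequent.update(large)
        # Join step: a and b combine when a without its first page
        # equals b without its last page.
        candidates = {a + b[-1] for a in large for b in large if a[1:] == b[:-1]}
        k += 1


references = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
large_sequences = frequent_reference_sequences(references, min_support=2)
print(sorted(large_sequences.items()))
```

Because the subsequences must be consecutive, the join only pairs sequences that overlap in all but one page, which keeps the candidate set small.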
26
Full-Scan (FS) Algorithm
Scan DB-1: Generate L1 & Hash Table
27
Scan DB-1
h(x,y) = [ (order of x) * 23 + (order of y) ] mod 17
28
Generate C2
29
Generate L2 & Reduce DB Scan DB-2
30
Generate L2 & Reduce DB Scan DB-2
31
Generate C3, L3 & Reduce DB Scan DB-3
32
Generate C4, L4 & Reduce DB Scan DB-4
33
Selective-Scan (SS) Algorithm Scan DB-3
34
Step 3: Generate Frequent Traversal Patterns
Maximal Reference Sequences
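Step 3 keeps only the large reference sequences that are not contained in another large reference sequence. A minimal Python sketch, assuming sequences are strings of single-character page names; the input set is an illustrative example, not data from the slides.

```python
def maximal_reference_sequences(large_sequences):
    """Keep sequences that are not a consecutive part of a longer one."""
    seqs = set(large_sequences)
    # For strings, "s in t" tests for a consecutive subsequence.
    return sorted(s for s in seqs if not any(s != t and s in t for t in seqs))


large = ["A", "B", "E", "G", "O", "AB", "BE", "EG", "AO", "ABE", "BEG", "ABEG"]
maximal = maximal_reference_sequences(large)
print(maximal)
```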
35
WAP-Mine Algorithm
The key consideration is how to facilitate the tedious support counting and candidate generating operations in the mining procedure
Given a Web Access Sequence database WAS and a support threshold ξ, mine the complete set of ξ-patterns of WAS
WAS
User ID  Web Access Sequence
100      abdac
200      eaebcac
300      babfaec
400      afbacfc
36
WAP-Mine Algorithm
(1) Scan WAS once, find all frequent-1 events
(2) Scan WAS again, construct a WAP-tree
(3) Recursively mine the WAP-tree using conditional search
Access patterns
37
Find All Frequent-1 Events
User ID  Web Access Sequence
100      abdac
200      eaebcac
300      babfaec
400      afbacfc
Item  Support (Frequency)
a     4
b     4
c     4
d     1
e     2
f     2
Min_Sup = 75%
User ID  Web Access Sequence  Frequent Subsequence
100      abdac                abac
200      eaebcac              abcac
300      babfaec              babac
400      afbacfc              abacc
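The first scan can be sketched as follows in Python. The data and the 75% threshold come from the slide; the function names are invented for this example, and the threshold is rounded up to a whole number of sequences.

```python
import math
from collections import Counter

was = {100: "abdac", 200: "eaebcac", 300: "babfaec", 400: "afbacfc"}


def frequent_events(sequences, min_sup_ratio):
    """Count each event once per sequence and keep the frequent ones."""
    support = Counter()
    for seq in sequences:
        support.update(set(seq))
    threshold = math.ceil(min_sup_ratio * len(sequences))
    return {e for e, c in support.items() if c >= threshold}, support


def frequent_subsequence(seq, frequent):
    """Drop non-frequent events, keeping the original order."""
    return "".join(e for e in seq if e in frequent)


frequent, support = frequent_events(list(was.values()), 0.75)
filtered = {uid: frequent_subsequence(seq, frequent) for uid, seq in was.items()}
print(sorted(frequent), filtered)
```

With four sequences, 75% means an event must occur in at least three of them, so only a, b and c survive.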
38
WAP-Tree Construction
Using frequent events to register all count information for further mining
User ID  Frequent Subsequence
100      abac
200      abcac
300      babac
400      abacc
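A WAP-tree is essentially a prefix tree over the frequent subsequences, with a count on every node. The Python sketch below builds only that counted prefix tree; the links that chain together nodes with the same label, which the mining step uses, are left out to keep the example short. Class and function names are illustrative.

```python
class Node:
    def __init__(self, event):
        self.event = event
        self.count = 0
        self.children = {}


def build_wap_tree(sequences):
    """Insert each frequent subsequence, incrementing counts along its path."""
    root = Node(None)
    for seq in sequences:
        node = root
        for event in seq:
            node = node.children.setdefault(event, Node(event))
            node.count += 1
    return root


def count_at(root, path):
    """Return the count stored at the node reached by following path."""
    node = root
    for event in path:
        node = node.children[event]
    return node.count


tree = build_wap_tree(["abac", "abcac", "babac", "abacc"])
print(count_at(tree, "ab"), count_at(tree, "abac"), count_at(tree, "b"))
```

Sequences that share a prefix share the same branch, so the three sequences starting with "ab" are stored once with a count of 3.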
39
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on c
Sequence  Count
aba       2
ab        1
abca      1
ab        -1
baba      1
abac      1
aba       -1
After combining the counts:
Sequence  Count
aba       1
abca      1
baba      1
abac      1
Item  Sup (Frequency)
a     4
b     4
c     2
Generate Web Access Patterns: ac, bc
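The conditional sequences for c can be reproduced with a short Python sketch. For brevity it works directly on the frequent subsequences rather than walking the WAP-tree: every occurrence of c contributes its prefix with count +1, and when an earlier c lies inside that prefix, the prefix of that earlier c gets -1 so the same sequence is not counted twice. Function names are assumptions of this example.

```python
from collections import Counter


def conditional_sequences(sequences, event):
    """Prefixes of each occurrence of event, with double counting removed."""
    base = Counter()
    for seq in sequences:
        previous = None   # index of the previous occurrence of event
        for i, e in enumerate(seq):
            if e != event:
                continue
            base[seq[:i]] += 1
            if previous is not None:
                base[seq[:previous]] -= 1
            previous = i
    # Drop empty prefixes and prefixes whose counts cancel out.
    return {p: c for p, c in base.items() if p and c != 0}


def conditional_support(base):
    """Support of each event inside the conditional sequences."""
    support = Counter()
    for prefix, count in base.items():
        for e in set(prefix):
            support[e] += count
    return support


base_c = conditional_sequences(["abac", "abcac", "babac", "abacc"], "c")
support_c = conditional_support(base_c)
print(base_c, dict(support_c))
```

With a threshold of 3 sequences, a and b are frequent in the conditional sequences of c, which gives the patterns ac and bc; c itself (support 2) is not.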
40
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on ac
Sequence  Count
ab        3
b         1
bab       1
b         -1
After combining the counts:
Sequence  Count
ab        3
bab       1
Item  Sup (Frequency)
a     4
b     4
Generate Web Access Patterns: aac, bac
41
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on bac
Sequence  Count
a         3
ba        1
Item  Sup (Frequency)
a     4
b     1
Generate Web Access Patterns: abac
42
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on abac
Sequence  Count
a         4
No Web Access Patterns are Generated
43
Mining for Web Transactions
To capture Web customer buying behavior
– It is not just a market basket transaction for the set of items bought by a customer in a single purchase (Association Rules)
– It is not just Web user travel patterns (Path Traversal Patterns)
– It is an extension of path traversal patterns
  Exploring the relationship between traveling and buying
44
Mining for Web Transactions
Web Transaction
→ Algorithm WR (Web-transaction-Record)
→ Web Transaction Records
→ Algorithm WTM, MTS PJ, MTS PC
→ Frequent Transaction Patterns
→ Web Transaction Association Rules
45
Mining for Web Transactions
Web-transaction-Record (WR) Algorithm
– Extracts meaningful Web transaction records from the given Web transaction
WTM (Web Transaction Mining) Algorithm
– Mines Web transaction patterns
MTS (Maximal Transaction Segment) Algorithms are improved versions of WTM
46
Mining for Web Transactions
48
WTM Algorithm
It joins the purchased itemsets to generate candidate transaction patterns
WTM employs a two-level hash tree, called the Web transaction tree, to store candidate transaction patterns
– WTM hashes not only each item but also each purchase in the path
49
WTM Algorithm
Web Transaction Database
WT_ID  Path   Purchase
100    ABCE   B{i1}, C{i2}, E{i4}
       ABFGH  B{i1}, H{i6}
       ASJL   S{i7}, L{i9}
200    ABCE   B{i1}, C{i2}, E{i4}
       ASJLQ  S{i7}, Q{i10}
300    ABCE   B{i1}, E{i4}
       ABFG   B{i1}, G{i5}
       ASJL   S{i7}, J{i8}, L{i9}
400    ABD    D{i3}
       ABFG   G{i5}
       ASJLQ  S{i7}, J{i8}, Q{i10}
50
Support Count
WT_ID  Path   Purchase
100    ABCE   B{i1}, C{i2}, E{i4}
       ABFGH  B{i1}, H{i6}
       ASJL   S{i7}, L{i9}
200    ABCE   B{i1}, C{i2}, E{i4}
       ASJLQ  S{i7}, Q{i10}
Path  Purchase  Support Count
AB    B{i1}     2
ABC   C{i2}     2
51
WTM Algorithm
C1
Path   Purchase  Sup.
AB     B{i1}     3
ABC    C{i2}     2
ABD    D{i3}     1
ABCE   E{i4}     3
ABFG   G{i5}     2
ABFGH  H{i6}     1
AS     S{i7}     4
ASJ    J{i8}     2
ASJL   L{i9}     2
ASJLQ  Q{i10}    2
Support Count >= 2
T1
Path   Purchase  Sup.
AB     B{i1}     3
ABC    C{i2}     2
ABCE   E{i4}     3
ABFG   G{i5}     2
AS     S{i7}     4
ASJ    J{i8}     2
ASJL   L{i9}     2
ASJLQ  Q{i10}    2
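The C1 and T1 tables can be recomputed from the Web transaction database with a short Python sketch. The data layout (a list of path records per transaction, each with a node-to-item purchase map) and the function name are assumptions made for this example; the pattern path is taken to be the prefix of the record path up to the purchase node, and each transaction counts at most once per pattern.

```python
from collections import Counter

# Web transaction database from the slides: WT_ID -> [(path, {node: item})]
database = {
    100: [("ABCE", {"B": "i1", "C": "i2", "E": "i4"}),
          ("ABFGH", {"B": "i1", "H": "i6"}),
          ("ASJL", {"S": "i7", "L": "i9"})],
    200: [("ABCE", {"B": "i1", "C": "i2", "E": "i4"}),
          ("ASJLQ", {"S": "i7", "Q": "i10"})],
    300: [("ABCE", {"B": "i1", "E": "i4"}),
          ("ABFG", {"B": "i1", "G": "i5"}),
          ("ASJL", {"S": "i7", "J": "i8", "L": "i9"})],
    400: [("ABD", {"D": "i3"}),
          ("ABFG", {"G": "i5"}),
          ("ASJLQ", {"S": "i7", "J": "i8", "Q": "i10"})],
}


def count_single_purchases(db):
    """Support of every (path prefix, node, item) single-purchase pattern."""
    c1 = Counter()
    for records in db.values():
        seen = set()   # a transaction supports a pattern at most once
        for path, purchases in records:
            for node, item in purchases.items():
                prefix = path[: path.index(node) + 1]
                seen.add((prefix, node, item))
        c1.update(seen)
    return c1


c1 = count_single_purchases(database)
t1 = {pattern: sup for pattern, sup in c1.items() if sup >= 2}
for (path, node, item), sup in sorted(t1.items()):
    print(path, f"{node}{{{item}}}", sup)
```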
52
WTM Algorithm
C2
Path   Purchase      Sup.
ABC    B{i1} C{i2}   2
ABCE   B{i1} E{i4}   3
AS     B{i1} S{i7}   0
ASJ    B{i1} J{i8}   0
ASJLQ  J{i8} Q{i10}  1
ASJLQ  L{i9} Q{i10}  0
Support Count >= 2
T2
Path   Purchase      Sup.
ABC    B{i1} C{i2}   2
ABCE   B{i1} E{i4}   3
ABCE   C{i2} E{i4}   2
ASJ    S{i7} J{i8}   2
ASJL   S{i7} L{i9}   2
ASJLQ  S{i7} Q{i10}  2
53
WTM Algorithm
C3
Path  Purchase           Sup.
ABCE  B{i1} C{i2} E{i4}  2
Support Count >= 2
T3
Path  Purchase           Sup.
ABCE  B{i1} C{i2} E{i4}  2
54
WTM Disadvantages
Because it does not use the paths of the frequent transaction patterns when joining, WTM may generate many unqualified candidate transaction patterns
This degrades performance
55
MTS PJ Algorithm
Algorithm MTS PJ uses maximal transaction segments, each consisting of the frequent transaction patterns along a maximal path, to solve the unqualified candidate transaction pattern problem
MTS PJ generates candidate transaction patterns only when a leaf node of the Web transaction tree is reached
56
MTS PJ Algorithm
Web Transaction Database
WT_ID  Path   Purchase
100    ABCE   B{i1}, C{i2}, E{i4}
       ABFGH  B{i1}, H{i6}
       ASJL   S{i7}, L{i9}
200    ABCE   B{i1}, C{i2}, E{i4}
       ASJLQ  S{i7}, Q{i10}
300    ABCE   B{i1}, E{i4}
       ABFG   B{i1}, G{i5}
       ASJL   S{i7}, J{i8}, L{i9}
400    ABD    D{i3}
       ABFG   G{i5}
       ASJLQ  S{i7}, J{i8}, Q{i10}
[Figure: traversal tree over nodes A, B, C, D, E, F, G, H, S, J, L, Q]
57
MTS PJ Algorithm
C1
Path   Purchase  Sup.
AB     B{i1}     3
ABC    C{i2}     2
ABD    D{i3}     1
ABCE   E{i4}     3
ABFG   G{i5}     2
ABFGH  H{i6}     1
AS     S{i7}     4
ASJ    J{i8}     2
ASJL   L{i9}     2
ASJLQ  Q{i10}    2
Support Count >= 2
T1
Path   Purchase  Sup.
AB     B{i1}     3
ABC    C{i2}     2
ABCE   E{i4}     3
ABFG   G{i5}     2
AS     S{i7}     4
ASJ    J{i8}     2
ASJL   L{i9}     2
ASJLQ  Q{i10}    2
[Figure: traversal tree over nodes A, B, C, E, F, G, S, J, L, Q]
58
MTS PJ Algorithm
[Figure: traversal tree over nodes A, B, C, E, F, G, S, J, L, Q]
Maximal Transaction Segment: path ABCE, purchases B{i1} C{i2} E{i4}
C2
Path   Purchase      Sup.
ABC    B{i1} C{i2}   2
ABCE   B{i1} E{i4}   3
ABCE   C{i2} E{i4}   2
Maximal Transaction Segment: path ABFG, purchases B{i1} G{i5}
C2
Path   Purchase      Sup.
ABFG   B{i1} G{i5}   1
Maximal Transaction Segment: path ASJLQ, purchases S{i7} J{i8} L{i9} Q{i10}
C2
Path   Purchase      Sup.
ASJ    S{i7} J{i8}   2
ASJL   S{i7} L{i9}   2
ASJL   J{i8} L{i9}   1
ASJLQ  S{i7} Q{i10}  2
ASJLQ  J{i8} Q{i10}  1
ASJLQ  L{i9} Q{i10}  0
59
MTS PJ Algorithm
C2
Path   Purchase      Sup.
ABC    B{i1} C{i2}   2
ABCE   B{i1} E{i4}   3
ABCE   C{i2} E{i4}   2
ABFG   B{i1} G{i5}   1
ASJ    S{i7} J{i8}   2
ASJL   S{i7} L{i9}   2
ASJL   J{i8} L{i9}   1
ASJLQ  S{i7} Q{i10}  2
ASJLQ  J{i8} Q{i10}  1
ASJLQ  L{i9} Q{i10}  0
T2
Path   Purchase      Sup.
ABC    B{i1} C{i2}   2
ABCE   B{i1} E{i4}   3
ABCE   C{i2} E{i4}   2
ASJ    S{i7} J{i8}   2
ASJL   S{i7} L{i9}   2
ASJLQ  S{i7} Q{i10}  2
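The step from maximal transaction segments to C2 and T2 can be sketched in Python. The segments are taken as given from the slides rather than derived, and the support test is a simplification assumed for this example: a transaction supports a candidate if one of its path records starts with the candidate path and contains all of its purchases.

```python
from itertools import combinations

# Web transaction database from the slides: WT_ID -> [(path, {node: item})]
database = {
    100: [("ABCE", {"B": "i1", "C": "i2", "E": "i4"}),
          ("ABFGH", {"B": "i1", "H": "i6"}),
          ("ASJL", {"S": "i7", "L": "i9"})],
    200: [("ABCE", {"B": "i1", "C": "i2", "E": "i4"}),
          ("ASJLQ", {"S": "i7", "Q": "i10"})],
    300: [("ABCE", {"B": "i1", "E": "i4"}),
          ("ABFG", {"B": "i1", "G": "i5"}),
          ("ASJL", {"S": "i7", "J": "i8", "L": "i9"})],
    400: [("ABD", {"D": "i3"}),
          ("ABFG", {"G": "i5"}),
          ("ASJLQ", {"S": "i7", "J": "i8", "Q": "i10"})],
}

# Maximal transaction segments: (maximal path, purchases in path order).
segments = [
    ("ABCE", [("B", "i1"), ("C", "i2"), ("E", "i4")]),
    ("ABFG", [("B", "i1"), ("G", "i5")]),
    ("ASJLQ", [("S", "i7"), ("J", "i8"), ("L", "i9"), ("Q", "i10")]),
]


def candidates_from_segment(max_path, purchases, k):
    """Yield (path, purchases) for every k-subset of a segment's purchases."""
    for combo in combinations(purchases, k):
        last_node = combo[-1][0]   # deepest purchase node on the path
        yield max_path[: max_path.index(last_node) + 1], combo


def supports(records, path, purchases):
    """True if one record follows path and contains all the purchases."""
    return any(
        rec_path.startswith(path)
        and all(bought.get(node) == item for node, item in purchases)
        for rec_path, bought in records
    )


c2 = {}
for max_path, purchases in segments:
    for path, combo in candidates_from_segment(max_path, purchases, 2):
        c2[(path, combo)] = sum(supports(r, path, combo) for r in database.values())
t2 = {pattern: sup for pattern, sup in c2.items() if sup >= 2}
print(len(c2), len(t2))
```

Because pairs are formed only inside a segment, cross-path candidates such as B{i1} with S{i7} are never generated.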
60
MTS PJ Algorithm
[Figure: traversal tree over nodes A, B, C, E, S, J, L, Q]
Maximal Transaction Segment: path ABCE, purchases B{i1} C{i2} E{i4}
C3
Path  Purchase           Sup.
ABCE  B{i1} C{i2} E{i4}  2
61
MTS PC Algorithm
C1
Path   Purchase  Sup.
AB     B{i1}     3
ABC    C{i2}     2
ABD    D{i3}     1
ABCE   E{i4}     3
ABFG   G{i5}     2
ABFGH  H{i6}     1
AS     S{i7}     4
ASJ    J{i8}     2
ASJL   L{i9}     2
ASJLQ  Q{i10}    2
Support Count >= 2
T1
Path   Purchase  Sup.
AB     B{i1}     3
ABC    C{i2}     2
ABCE   E{i4}     3
ABFG   G{i5}     2
AS     S{i7}     4
ASJ    J{i8}     2
ASJL   L{i9}     2
ASJLQ  Q{i10}    2
[Figure: traversal tree over nodes A, B, C, E, F, G, S, J, L, Q]
MTS PC uses the LC (Large Count) to filter candidates
62
MTS PC Algorithm
[Figure: traversal tree over nodes A, B, C, E, F, G, S, J, L, Q]
K=1
Maximal Transaction Segment
Maximal Path  Item    LC
ABCE          B{i1}   1
              C{i2}   1
              E{i4}   1
|I| = 3 > 1 (K-1)
C2
Path   Purchase      Sup.
ABC    B{i1} C{i2}   2
ABCE   C{i2} E{i4}   2
ABCE   B{i1} E{i4}   3
Maximal Transaction Segment
Maximal Path  Item    LC
ASJLQ         S{i7}   1
              J{i8}   1
              L{i9}   1
              Q{i10}  1
|I| = 4 > 1
C2
Path   Purchase      Sup.
ASJ    S{i7} J{i8}   2
ASJL   S{i7} L{i9}   2
ASJL   J{i8} L{i9}   1
ASJLQ  S{i7} Q{i10}  2
ASJLQ  J{i8} Q{i10}  1
ASJLQ  L{i9} Q{i10}  0
Maximal Transaction Segment
Maximal Path  Item    LC
ABFG          B{i1}   1
              G{i5}   1
|I| = 2 > 1
C2
Path   Purchase      Sup.
ABFG   B{i1} G{i5}   1
63
MTS PC Algorithm
C2
Path   Purchase      Sup.
ABC    B{i1} C{i2}   2
ABCE   B{i1} E{i4}   3
ABCE   C{i2} E{i4}   2
ABFG   B{i1} G{i5}   1
ASJ    S{i7} J{i8}   2
ASJL   S{i7} L{i9}   2
ASJL   J{i8} L{i9}   1
ASJLQ  S{i7} Q{i10}  2
ASJLQ  J{i8} Q{i10}  1
ASJLQ  L{i9} Q{i10}  0
T2
Path   Purchase      Sup.
ABC    B{i1} C{i2}   2
ABCE   B{i1} E{i4}   3
ABCE   C{i2} E{i4}   2
ASJ    S{i7} J{i8}   2
ASJL   S{i7} L{i9}   2
ASJLQ  S{i7} Q{i10}  2
64
MTS PC Algorithm
[Figure: traversal tree over nodes A, B, C, E, S, J, L, Q]
K=2
T2
Path   Purchase      Sup.
ABC    B{i1} C{i2}   2
ABCE   B{i1} E{i4}   3
ABCE   C{i2} E{i4}   2
ASJ    S{i7} J{i8}   2
ASJL   S{i7} L{i9}   2
ASJLQ  S{i7} Q{i10}  2
Maximal Transaction Segment
Maximal Path  Item    LC
ABCE          B{i1}   2
              C{i2}   2
              E{i4}   2
|I| = 3 > 2
C3
Path  Purchase
ABCE  B{i1} C{i2} E{i4}
Maximal Transaction Segment
Maximal Path  Item    LC
ASJLQ         S{i7}   3
              J{i8}   1
              L{i9}   1
              Q{i10}  1
|I| = 1 < 2
No Generations
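One reading of the LC filter, consistent with the numbers on the slides, is sketched below in Python. It assumes that the LC of an item is the number of frequent K-patterns of the segment that contain it, that an item is kept when its LC is at least K, and that candidates of size K+1 are generated only when more than K items are kept. Node letters stand in for the purchases, since each node sells a single item in this example; the function names are invented.

```python
from collections import Counter

# Frequent 2-purchase patterns (T2), written with node letters only.
t2 = [("B", "C"), ("B", "E"), ("C", "E"), ("S", "J"), ("S", "L"), ("S", "Q")]
segments = {"ABCE": ["B", "C", "E"], "ASJLQ": ["S", "J", "L", "Q"]}


def large_counts(frequent_patterns, segment_items):
    """LC: number of frequent patterns of this segment containing each item."""
    lc = Counter({item: 0 for item in segment_items})
    for pattern in frequent_patterns:
        if set(pattern) <= set(segment_items):
            lc.update(pattern)
    return lc


def items_worth_joining(lc, k):
    # Assumption: an item of a frequent (k+1)-pattern must appear in at
    # least k frequent k-patterns.
    return [item for item, count in lc.items() if count >= k]


k = 2
decisions = {}
for path, items in segments.items():
    lc = large_counts(t2, items)
    kept = items_worth_joining(lc, k)
    decisions[path] = (dict(lc), kept, len(kept) > k)
    print(path, dict(lc), kept, "generate" if len(kept) > k else "no generation")
```

For ABCE all three items are kept, so the single candidate B{i1} C{i2} E{i4} is generated; for ASJLQ only S{i7} passes, so nothing is generated and no support counting is needed for that segment.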
65
Mining for Web Transactions = 2 = 3 We can derive E{4}> –support_count( E{4}>) = 2 –confidence( E{4}>) =
66
Summary
Data mining in the Web is an area of growing importance
– In particular, with the emergence of electronic commerce (EC)
– More and more applications will benefit from the knowledge gained through data mining
Web Mining = Web Data Collection + Traditional Data Mining?
Important Issues
– Incremental Web Mining