1
Data e Web Mining 825368 Paolo Gobbo
Smart Miner: A New Framework for Mining Large Scale Web Usage Data Bayir – Toroslu – Cosar - Fidan
2
Data Mining on Web
Web Mining: discover and retrieve useful and interesting patterns from large web datasets
web content mining: the real data in web pages (text and multimedia documents)
web structure mining: data describing the organization of the content (hyperlink structure)
web usage mining: data describing the pattern of usage of web pages (web log records)
3
Session Identification
PreProcessing
Input: Site File, Access Log, Referrer Log, Agent Log, Registration Data
Preprocessing steps shown in the diagram: Data Cleaning, Path Completion, Session Identification, User Identification, Transaction Identification
Other components: Site Crawler, Site Topology, SQL Query
Output: User Session File, Transaction File
4
Session Identification
Partitioning each user's activities, taken from the web request logs, into sequences (sessions) of entries
time-oriented heuristics: temporal boundaries (session length, page-stay)
navigation-oriented heuristics: links between web pages
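The time-oriented heuristics can be sketched as follows. This is a minimal illustration, not the presented method: the function name `split_sessions`, the minute-based timestamps, and both default thresholds are assumptions made for the example.

```python
from typing import List, Tuple

Request = Tuple[str, float]  # (page, timestamp in minutes); hypothetical layout


def split_sessions(
    requests: List[Request],
    max_session_length: float = 30.0,  # assumed threshold, in minutes
    max_page_stay: float = 10.0,       # assumed threshold, in minutes
) -> List[List[Request]]:
    """Cut one user's requests into sessions using two temporal boundaries."""
    sessions: List[List[Request]] = []
    current: List[Request] = []
    for page, ts in sorted(requests, key=lambda r: r[1]):
        if current:
            too_long = ts - current[0][1] > max_session_length  # session length rule
            idle = ts - current[-1][1] > max_page_stay          # page-stay rule
            if too_long or idle:
                sessions.append(current)
                current = []
        current.append((page, ts))
    if current:
        sessions.append(current)
    return sessions
```

A navigation-oriented heuristic would instead cut the sequence where the next page is not linked from a page already in the session.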
5
Sequential Mining
Association mining that takes the order of transactions into account
Given a set of data sequences, find all sequences with a user-specified minimum support
itemset (element): a non-empty set of items
sequence: an ordered list of itemsets
sequence size: number of itemsets (elements)
sequence length: number of items
subsequence: a sequence whose itemsets are each contained, in the same order, in itemsets of another sequence
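The subsequence and support notions can be illustrated with a short sketch. The names `is_subsequence`, `support`, and `seq`, and the sample database, are illustrative assumptions; support is taken here as the fraction of data sequences that contain the pattern.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[int]


def is_subsequence(pattern: Sequence[Itemset], sequence: Sequence[Itemset]) -> bool:
    """True if each itemset of pattern is contained, in order, in an itemset of sequence."""
    matched = 0
    for element in sequence:
        # Greedy matching: take the earliest element that contains the next itemset.
        if matched < len(pattern) and pattern[matched] <= element:
            matched += 1
    return matched == len(pattern)


def support(pattern: Sequence[Itemset], database: Sequence[Sequence[Itemset]]) -> float:
    """Fraction of data sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)


def seq(*elements: set) -> List[Itemset]:
    """Small helper to write sequences compactly."""
    return [frozenset(e) for e in elements]


# Illustrative data sequences (one per customer).
database = [
    seq({30}, {90}),
    seq({10, 20}, {30}, {40, 60, 70}),
    seq({30, 50, 70}),
    seq({30}, {40, 70}, {90}),
    seq({90}),
]
print(support(seq({30}, {90}), database))
```

A pattern is reported as frequent when this value reaches the user-specified minimum support.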
6
Sequential Mining algorithms
Sort Phase: transforms customer transactions into customer sequences
LargeItemSet Phase: generates the set of large itemsets
Transformation Phase: represents customer sequences based on the large itemsets
Sequence Phase: derives large k-sequences from large (k-1)-sequences
Maximal Phase: prunes non-maximal sequences
Algorithms: APrioriAll, GSP, APrioriSome
7
Smart-SRA session
timestamp ordering rule (time oriented): session
topology rule (navigation oriented): path in the web site
maximality rule: path in the web site
8
Smart Miner
Data stream -> Smart-SRA session construction (candidate sessions, then smart sessions) -> sequential mining (Sequential AprioriAll) -> frequent access patterns
9
Smart Miner: First Phase Smart SRA
Candidate session construction
time-oriented heuristics: session length, page-stay
no backward movement
Web Site Graph (pages): P1, P13, P20, P23, P34, P49
Candidate Session
Page: P1 P20 P13 P49 P34 P23
TimeStamp: 6 9 12 14 15
Page: P13 P20 P23 P49
TimeStamp: 5 9 10
10
Smart Miner: Second Phase Smart SRA
Smart session construction
time-oriented heuristics: session length inherited, page-stay re-checked
no backward movement
maximality
topology rule
Web Site Graph (pages): P1, P13, P20, P23, P34, P49
Candidate Session
Page: P1 P20 P13 P49 P34 P23
TimeStamp: 6 9 12 14 15
Smart Sessions: [P1, P13, P34, P23] [P1, P13, P49, P23] [P1, P20, P23]
11
Smart Miner: Second Phase Smart SRA
SMART SESSION RECONSTRUCTION
foreach CandSession in CandSessionSet
    NewSessionSet = {}
    while CandSession ≠ Ø
        TSessionSet = {}; TPageSet = {}
        // pages with no incoming link
        foreach Pagei in CandSession
            StartPageFlag = TRUE
            foreach Pagej in CandSession with j < i
                if (Link[Pagej, Pagei] and TimeDiff(Pagei, Pagej) ≤ σ) then
                    StartPageFlag = FALSE
            endfor
            if StartPageFlag then TPageSet = TPageSet U {Pagei}
        endfor
        CandSession = CandSession - TPageSet
        if NewSessionSet = {} then
            // session set construction
            foreach Pagei in TPageSet
                TSessionSet = TSessionSet U {[Pagei]}
        else
            // session set extension
            foreach Pagei in TPageSet
                foreach Sessionj in NewSessionSet
                    if (Link[Last(Sessionj), Pagei] and TimeDiff(Last(Sessionj), Pagei) ≤ σ) then
                        TSession = Sessionj
                        TSession.mark = UNEXTENDED
                        TSession = TSession • Pagei
                        TSessionSet = TSessionSet U {TSession}
                        Sessionj.mark = EXTENDED
                    endif
            // session set extension with the sessions that were not extended
            foreach Sessionj in NewSessionSet
                if Sessionj.mark ≠ EXTENDED then
                    TSessionSet = TSessionSet U {Sessionj}
            endfor
        endif
        NewSessionSet = TSessionSet
    endwhile
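The reconstruction loop can be sketched in Python as below. This is an illustrative reading of the pseudocode, not the authors' code: the function name `smart_sra_phase2`, the `(page, timestamp)` layout, the link set (inferred from the sessions in the deck's example), the leading timestamp 0, and σ = 10 are all assumptions.

```python
from typing import Dict, List, Set, Tuple

Visit = Tuple[str, float]  # (page, timestamp); hypothetical layout


def smart_sra_phase2(
    candidate: List[Visit], links: Dict[str, Set[str]], sigma: float
) -> List[List[str]]:
    """Split one candidate session into maximal link-consistent paths."""

    def linked(a: Visit, b: Visit) -> bool:
        # Topology rule plus page-stay re-check between two visits.
        return b[0] in links.get(a[0], set()) and abs(b[1] - a[1]) <= sigma

    remaining = list(candidate)
    sessions: List[List[Visit]] = []
    while remaining:
        # Pages that no earlier remaining page refers to.
        start_pages = [
            v for i, v in enumerate(remaining)
            if not any(linked(u, v) for u in remaining[:i])
        ]
        remaining = [v for v in remaining if v not in start_pages]
        if not sessions:
            new_sessions = [[v] for v in start_pages]  # session set construction
        else:
            new_sessions = []
            extended = set()
            for v in start_pages:  # session set extension
                for idx, s in enumerate(sessions):
                    if linked(s[-1], v):
                        new_sessions.append(s + [v])
                        extended.add(idx)
            # Sessions that could not be extended are carried over unchanged.
            new_sessions += [s for idx, s in enumerate(sessions) if idx not in extended]
        sessions = new_sessions
    return [[page for page, _ in s] for s in sessions]


# Assumed link set and timestamps for the six-page example.
links = {
    "P1": {"P13", "P20"}, "P13": {"P34", "P49"},
    "P20": {"P23"}, "P34": {"P23"}, "P49": {"P23"},
}
candidate = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]
for session in smart_sra_phase2(candidate, links, sigma=10):
    print(session)
```

Under these assumptions the sketch yields the three paths [P1, P13, P49, P23], [P1, P13, P34, P23] and [P1, P20, P23].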
12
Session Construction Example
Iteration  CandidateSession                 TPageSet     NewSessionSet
1          [P1, P20, P13, P49, P34, P23]    {P1}         [P1]
2          [P20, P13, P49, P34, P23]        {P20, P13}   [P1, P20] [P1, P13]
3          [P49, P34, P23]                  {P49, P34}   [P1, P13, P34] [P1, P13, P49] [P1, P20]
4          [P23]                            {P23}        [P1, P13, P34, P23] [P1, P13, P49, P23] [P1, P20, P23]
Web Site Graph (pages): P1, P13, P20, P23, P34, P49
13
Sequential APrioriAll
Pruning happens during candidate sequence generation, before the support of the candidates is calculated
topological constraint: for every consecutive pair of pages in a sequence, the former must have a hyperlink to the latter
string matching constraint: session S supports a pattern P if and only if P occurs in S as a contiguous run of pages
<1,2,3> supports <1,2>
<1,2,3> does not support <1,3>
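Both constraints are easy to state in code. A minimal sketch, where the function names `satisfies_topology` and `supports` and the adjacency-dictionary form of the link structure are assumptions for illustration:

```python
from typing import Dict, Sequence, Set


def satisfies_topology(pattern: Sequence[int], links: Dict[int, Set[int]]) -> bool:
    """Topological constraint: each page must link to the page that follows it."""
    return all(b in links.get(a, set()) for a, b in zip(pattern, pattern[1:]))


def supports(session: Sequence[int], pattern: Sequence[int]) -> bool:
    """String matching constraint: pattern must occur contiguously in session."""
    m = len(pattern)
    return any(
        list(session[i:i + m]) == list(pattern)
        for i in range(len(session) - m + 1)
    )


print(supports([1, 2, 3], [1, 2]))
print(supports([1, 2, 3], [1, 3]))
```

Checking the topological constraint first avoids counting support for candidates that cannot be a path in the site.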
14
Support
Support(I, S) = |{ Si in S : I is a substring of Si }| / |S|
I: pattern
S: user reconstructed sessions
Computed in one scan through the transaction database by keeping the candidate sessions in a hashmap
15
Sequential Apriori Algorithm
INPUT:
    minimum support frequency: δ
    reconstructed sessions: S
    topology information: Link
    set of all web pages: P
OUTPUT:
    set of maximal frequent patterns: Max

// length-1 candidate pattern generation
L1 = {}
for i = 1 to |P| do
    if Support([Pi], S) > δ then L1 = L1 U {[Pi]}
for k = 1 to N-1 do
    if Lk = Ø then
        Halt                                  // no further generation
    else
        // length-(k+1) candidate pattern generation
        Lk+1 = {}
        foreach Ii in Lk
            foreach Pj in P
                if Link[Last(Ii), Pj] then    // pruning step: topological rule
                    T = Ii • Pj               // joining step: append page
                    if Support(T, S) > δ then // pruning step: support rule
                        // maximality rule
                        T.maximal = true
                        Ii.maximal = false
                        V = [T2, T3, …, T|T|]
                        if V in Lk then V.maximal = false
                        Lk+1 = Lk+1 U {T}
                    endif
        endfor
// union of the sets of maximal patterns
Max = {}
for k = 1 to N-1 do
    Max = Max U {I | I in Lk and I.maximal = true}
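A compact Python sketch of this level-wise search follows. It is an illustration under stated assumptions rather than the authors' implementation: `sequential_apriori`, `contains`, and `support` are hypothetical names, support is taken as the fraction of sessions containing the pattern contiguously, and support is recomputed by rescanning the sessions instead of using a hashmap.

```python
from typing import Dict, List, Sequence, Set, Tuple

Pattern = Tuple[str, ...]


def contains(session: Sequence[str], pattern: Pattern) -> bool:
    """True if pattern occurs in session as a contiguous run of pages."""
    m = len(pattern)
    return any(tuple(session[i:i + m]) == pattern for i in range(len(session) - m + 1))


def support(pattern: Pattern, sessions: List[Sequence[str]]) -> float:
    """Fraction of sessions that contain the pattern (sessions must be non-empty)."""
    return sum(contains(s, pattern) for s in sessions) / len(sessions)


def sequential_apriori(
    sessions: List[Sequence[str]],
    links: Dict[str, Set[str]],
    pages: List[str],
    delta: float,
) -> Set[Pattern]:
    maximal: Dict[Pattern, bool] = {}
    # Length-1 patterns.
    level = [(p,) for p in pages if support((p,), sessions) > delta]
    for pat in level:
        maximal[pat] = True
    while level:
        next_level: List[Pattern] = []
        for pat in level:
            # Topological rule: only pages linked from the last page are appended.
            for page in sorted(links.get(pat[-1], set())):
                cand = pat + (page,)
                if support(cand, sessions) > delta:  # support rule
                    maximal[cand] = True
                    maximal[pat] = False             # prefix is no longer maximal
                    if cand[1:] in maximal:
                        maximal[cand[1:]] = False    # neither is the suffix
                    next_level.append(cand)
        level = next_level
    return {p for p, is_max in maximal.items() if is_max}
```

For example, with the sessions [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23], the links between their consecutive pages, and δ = 0.5, the sketch returns the patterns (P1, P13) and (P23).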
16
Accuracy Metric
Compared sets: the frequent maximal patterns of the agent simulator and the frequent maximal patterns of the heuristic
Metrics: recall, precision, accuracy
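As a rough illustration, recall and precision between the two pattern sets can be computed as below, assuming the standard set-based definitions (which may differ in detail from the slide's formulas). The function name `recall_precision` and the sample patterns are hypothetical; the accuracy formula is not covered by this sketch.

```python
from typing import Set, Tuple

Pattern = Tuple[str, ...]


def recall_precision(reference: Set[Pattern], mined: Set[Pattern]) -> Tuple[float, float]:
    """Standard set-based recall and precision of mined patterns against a reference set."""
    common = reference & mined
    recall = len(common) / len(reference) if reference else 0.0
    precision = len(common) / len(mined) if mined else 0.0
    return recall, precision


# Hypothetical pattern sets: simulator (reference) versus heuristic (mined).
simulator = {("P1", "P13"), ("P23",), ("P1", "P20")}
heuristic = {("P1", "P13"), ("P23",), ("P13", "P49"), ("P34",)}
print(recall_precision(simulator, heuristic))
```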
17
Agent Simulator
Agent Simulator Parameters
STP (Session Termination Probability): probability of terminating the session
LPP (Link from Previous page Probability): probability of referring the next page from one of the previously accessed pages, except the most recently accessed one
LPC (Link from Current page Probability): probability of referring the next page from the most recently visited page
NIP (New Initial page Probability): probability of selecting one of the starting pages of the web site during the navigation
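One way to picture how the four parameters interact is the toy simulator below. It is a simplification made for illustration, not the simulator used in the study: it assumes a single random draw per step in which STP, NIP and LPP are tried in turn and LPC is the remaining probability, and the names `simulate_session` and `max_steps` are hypothetical.

```python
import random
from typing import Dict, List, Set


def simulate_session(
    links: Dict[str, Set[str]],
    start_pages: List[str],
    stp: float,
    lpp: float,
    nip: float,
    rng: random.Random,
    max_steps: int = 1000,  # safety bound, an assumption of this sketch
) -> List[str]:
    """Return the pages requested by one simulated agent."""
    visited = [rng.choice(start_pages)]
    for _ in range(max_steps):
        r = rng.random()
        if r < stp:                       # STP: terminate the session
            break
        if r < stp + nip:                 # NIP: jump to a starting page
            visited.append(rng.choice(start_pages))
            continue
        if r < stp + nip + lpp and len(visited) > 1:
            source = rng.choice(visited[:-1])  # LPP: link from a previous page
        else:
            source = visited[-1]               # LPC: link from the current page
        targets = sorted(links.get(source, set()))
        if not targets:                   # dead end: treated as termination here
            break
        visited.append(rng.choice(targets))
    return visited
```

Passing a seeded `random.Random` makes runs repeatable, which helps when comparing heuristics across several random runs.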
18
Simulated Data
Web topology: number of web pages from 10 to 1000
Number of users: from 1000 to 10000
Agent simulator parameters: 49 different cases
NIP/STP: …, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0
LPC/LPP: …, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0
Support parameter values: …, 0.0075, 0.01
Runs of agent simulator: 10 different random runs
19
Results on Simulated Data
Plot axes: NIP (New Initial Page Probability), STP (Session Termination Probability)
Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA
20
Results on Simulated Data
Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA
21
Real Data
AGMLAB's company web site
4 months of user activity
3801 users
30-minute session time-out
10 web pages, link graph densely connected
User activity: action tracking program, cookies; cookie information recorded to a server log file
22
Results on Real Data
Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA
23
Scalability
Plots: performance on 100 GB of data; performance with 50 nodes
Map/Reduce paradigm: each node processes a block of the session database, computing the local frequency of each candidate pattern
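The per-node counting and the final merge can be sketched on a single machine as below. This only mimics the Map/Reduce idea with plain Python; `map_block`, `reduce_counts`, and the block layout are hypothetical names, and no real cluster framework is involved.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def contains(session: Sequence[str], pattern: Pattern) -> bool:
    """True if pattern occurs in session as a contiguous run of pages."""
    m = len(pattern)
    return any(tuple(session[i:i + m]) == pattern for i in range(len(session) - m + 1))


def map_block(block: List[Sequence[str]], candidates: List[Pattern]) -> Counter:
    """Map step: local frequency of every candidate in one block of sessions."""
    counts: Counter = Counter()
    for session in block:
        for pattern in candidates:
            if contains(session, pattern):
                counts[pattern] += 1
    return counts


def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce step: sum the local frequencies into global ones."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two blocks standing in for the parts of the session database held by two nodes.
blocks = [
    [["P1", "P13", "P34", "P23"], ["P1", "P13", "P49", "P23"]],
    [["P1", "P20", "P23"]],
]
candidates = [("P1", "P13"), ("P23",), ("P20", "P23")]
print(reduce_counts(map_block(b, candidates) for b in blocks))
```

Because the local counts are simply added, the blocks can be processed in any order and on any number of nodes.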
24
Web References/Bibliography
M.A. Bayir, I.H. Toroslu, A. Cosar, G. Fidan, Smart Miner: A New Framework for Mining Large Scale Web Usage Data
R. Cooley, B. Mobasher, J. Srivastava, Data Preparation for Mining World Wide Web
J. Srivastava, R. Cooley, M. Deshpande, P.N. Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data
M.G. Da Costa Jr, Z. Gong, Web Structure Mining: An Introduction
J.J. Jung, Semantic PreProcessing of Web Request Streams for Web Usage Mining
R. Agrawal, R. Srikant, Mining Sequential Patterns, 1995
25
GSP
GSP: GENERALIZED SEQUENTIAL PATTERN
C1 = Init_Pass
L1 = {<{f}> | f in C1, with minimum support}
for (k = 2; Lk-1 ≠ Ø; k++) do begin
    Ck = Candidate-gen-SPM(Lk-1)
    foreach sequence s in the database D do
        foreach candidate c in Ck
            if (c in s) then update candidate c
    Lk = candidates c in Ck with minimum support
end
result = Uk(Lk)

CANDIDATE-GEN-SPM
(join step)
foreach p in Lk-1
    foreach q in Lk-1
        if ( … ) then Ck = Ck U {p1,…,pk-1,qk-1}
(prune step: a candidate is removed when one of its (k-1)-subsequences is not in Lk-1)
foreach s in Ck
    if exists(r | … ˄ … ) then Ck = Ck - s
26
GSP Example
L3-sequences: <{1,2},{4}>, <{1,2},{5}>, <{1},{4,5}>, <{1,4},{6}>, <{2},{4,5}>, <{2},{4},{6}>
Candidate 4-sequences (join step): <{1,2},{4,5}>, <{1,2},{4},{6}>
Candidate 4-sequences (prune step): <{1,2},{4,5}>
<{1,2},{4},{6}> is pruned because its subsequence <{1},{4},{6}> is not in L3
27
APrioriAll
L1 = {large 1-sequences}
for (k = 2; Lk-1 ≠ Ø; k++) do begin
    Ck = Apriori-generate(Lk-1)
    foreach sequence c in the database D do
        update candidates in Ck that are contained in c
    Lk = candidates in Ck with minimum support
end
result = maximal sequences in Uk(Lk)

APRIORI-GENERATE
(join step)
foreach p in Lk-1
    foreach q in Lk-1
        if (p.x1=q.x1) ˄ (p.x2=q.x2) ˄ … ˄ (p.xk-2=q.xk-2) then
            Ck = Ck U {<p.x1,…,p.xk-1,q.xk-1>}
(prune step: a candidate is removed when one of its (k-1)-subsequences is not in Lk-1)
foreach s in Ck
    if exists(r | … ˄ … ) then Ck = Ck - s
28
APrioriAll Example
L3-sequences: <1,2,3>, <1,2,4>, <1,3,4>, <1,3,5>, <2,3,4>
Candidate 4-sequences (join step): <1,2,3,4>, <1,2,4,3>, <1,3,4,5>, <1,3,5,4>
Candidate 4-sequences (prune step): <1,2,3,4>
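The join and prune steps of candidate generation can be reproduced with a short sketch. The function name `apriori_generate` and the tuple representation are assumptions; joining a sequence with itself is excluded here, which matches the candidates listed in the example.

```python
from typing import List, Tuple

Seq = Tuple[int, ...]


def apriori_generate(prev: List[Seq]) -> Tuple[List[Seq], List[Seq]]:
    """Return (joined, pruned) candidate k-sequences built from large (k-1)-sequences."""
    prev_set = set(prev)
    # Join step: p and q agree on their first k-2 items; append q's last item to p.
    joined = [
        p + (q[-1],)
        for p in prev
        for q in prev
        if p != q and p[:-1] == q[:-1]
    ]
    # Prune step: keep a candidate only if every (k-1)-subsequence is large.
    pruned = [
        c for c in joined
        if all(c[:i] + c[i + 1:] in prev_set for i in range(len(c)))
    ]
    return joined, pruned


l3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined, pruned = apriori_generate(l3)
print(joined)
print(pruned)
```

Only <1,2,3,4> survives because, for instance, <1,2,4,3> contains <2,4,3>, which is not a large 3-sequence.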
29
APrioriSome
// Forward Phase
L1 = {large 1-sequences}; C1 = L1; last = 1
for (k = 2; Ck-1 ≠ Ø; k++) do begin
    if (Lk-1 known) then
        Ck = Apriori-generate(Lk-1)
    else
        Ck = Apriori-generate(Ck-1)
    if (k = next(last)) then
        foreach sequence c in the database D do
            update candidates in Ck that are contained in c
        Lk = candidates in Ck with minimum support
        last = k
end
// Backward Phase
for (k--; k >= 1; k--) do begin
    if (Lk not found) then
        delete all sequences in Ck contained in some Li, i > k
        Lk = candidates in Ck with minimum support
    else
        delete all sequences in Lk contained in some Li, i > k
end
result = maximal sequences in Uk(Lk)
30
Sequential Mining Algorithm
Customer ID   Transaction Time   Items
1             June 25 '93        30
1             June 25 '93        90
2             June 10 '93        10, 20
2             June 15 '93        30
2             June 20 '93        40, 60, 70
3                                30, 50, 70
4                                30
4             June 30 '93        40, 70
4             July 25 '93        90
5             June 12 '93        90

Large itemset   Mapped to
(30)            1
(40)            2
(70)            3
(40 70)         4
(90)            5

Customer ID   Customer Sequence
1             <(30) (90)>
2             <(10 20) (30) (40 60 70)>
3             <(30 50 70)>
4             <(30) (40 70) (90)>
5             <(90)>

Customer ID   Customer Sequence (mapped)
1             <{1} {5}>
2             <{1} {2, 3, 4}>
3             <{1, 3}>
4             <{1} {2, 3, 4} {5}>
5             <{5}>