Published by Silvia Sharp. Modified over 9 years ago.
1
Design and Implementation of a Web Log Preprocessing System Supporting Path Completion
Batchimeg, AI lab., 2005.04.19
2
Outline
- Introduction
- Background
- Related work
- Proposed System
- Experiment and Result
- Conclusion and Future work
3
Introduction
Web Log Mining Process
[Figure: visitors use the Web site (viewing news, e-mail, download, shopping, auction); the Web server saves the Web log data; the log passes through Preprocessing, Pattern Discovery and Pattern Analysis; data analysis uses a DB, visualization tools, knowledge queries and intelligent agents.]
Logged data: IP, OS, Agent, Time, URL, Refer page, Date, Cookie, Method, Status, UserID, bytes, …
My research area: Web log preprocessing
4
Background (1/4)
Log format:
– Client IP - 210.126.19.93
– Date - 23/Jan/2005
– Accessed time - 13:37:12
– Method - GET (request a page), POST (send data to the server), HEAD (request headers only)
– Protocol - HTTP/1.1
– Status code - 200 (success), 301 (redirect), 401, 500 (errors)
– Size of file - 2705
– Agent type - Mozilla/4.0
– Operating system - Windows NT
Example record from the log (… 285014 records in total):
210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] "GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" 200 2705 "http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
Navigation recorded by this line:
http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225 → http://www.olloo.mn/modules.php?name=News&file=friend&op=FriendSend&sid=8225
After viewing the news article, the visitor (210.126.19.93) sent it to a friend.
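The fields listed above can be pulled out of a log line with a regular expression. The following is a minimal sketch, assuming the log uses the common "combined" layout shown in the example record. The names LOG_PATTERN, parse_line and SAMPLE are illustrative and do not come from the presented system.

```python
import re
from datetime import datetime

# Pattern for one record in the layout shown on the slide:
# IP, two unused fields, [time], "request", status, size, "referrer", "agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# The example record from the slide, with plain double quotes.
SAMPLE = (
    '210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] '
    '"GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" '
    '200 2705 '
    '"http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)

record = parse_line(SAMPLE)
print(record["ip"], record["method"], record["status"], record["size"])
```

The referrer and url fields together give one hyperlink of the site, which is what the later steps build on.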
5
Background (2/4) - User Identification, Session Identification
Preprocessing steps: Cleaning Log → User Identification → Session Identification → Path Completion → Formatting
User identification is identifying each user accessing the Web site.
User IP + Browser (UserID + IP + OS, or cookie) => identify the users
Session identification is to find each user's access pattern and frequent paths.

User Identification (by IP and Browser):
IP | Browser | Visited pages
202.131.3.100 | Mozilla/5.0 (Windows NT) | A,B,C,D,F,A,L
202.131.3.100 | Mozilla/4.0 (Win2000) | A,B,G,L
210.126.19.93 | Mozilla/4.0 (Windows NT) | N,O

Session Identification:
IP | Browser | Session
202.131.3.100 | Mozilla/5.0 (Windows NT) | A,B,C,D,F
202.131.3.100 | Mozilla/5.0 (Windows NT) | A,L
202.131.3.100 | Mozilla/4.0 (Win2000) | A,B,G,L
210.126.19.93 | Mozilla/4.0 (Windows NT) | N,O
6
Background (3/4) - Missed Page Views at Server
Server Log and Caching
If the client had to request every Web page from the server, browsing would be slower. The solution to this problem is caching: clients and proxy servers save local copies of pages, and "back" and "forward" navigation is served from those copies.
[Figure: client, cache and server exchanging requests for pages P3, P4, P5 and P6; a repeated request for P3 is answered from the cache and is never logged by the server.]
7
Background (4/4) - Path completion
Preprocessing steps: Cleaning Log → User Identification → Session Identification → Path Completion → Formatting
Not all requested pages are recorded in the Web log, due to the caching problem.
[Figure: topological structure of the site, pages A.html to Q.html]
Before | After
A,B,C,D,F | A,B,C,D,C,B,F
A,L |
A,B,G,I | A,B,A,G,I
N,O |
8
Related work
Related works | Using Topological Structure | Removing images | Removing robot text | User/Session Identification | Path completion
R. Cooley [12] | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | O | X | IP, Agent | X
Pitkow [7] | O | O | X | Session ID | O
Shahabi [2] | X | O | X | Session ID | O
Chen, Park [3] | X | O | X | Login, IP | X
O - used, X - not used
9
Proposed System (1/7) (preprocessing)
Preprocessing steps:
- Data cleaning (eliminate irrelevant information)
- Web site's topological structure (find the hyperlink relations between Web pages), constructed from the Web log data on the server
- User identification and session identification (identify each user, find each user's access pattern)
- Path completion
- User grouping
Result (after session identification and path completion): user grouping
Why preprocessing?
Preprocessing can take up 60-80% of the time spent analyzing the data. Incomplete preprocessing can easily result in invalid patterns and wrong conclusions.
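The data cleaning step can be illustrated with a small filter. This is only a sketch, assuming that "irrelevant information" means image requests and robots.txt requests, the two kinds of removal named in the related-work comparison. The extension list and the function names are assumptions, not part of the presented system.

```python
from urllib.parse import urlsplit

# Assumed list of image extensions; the slides do not give the exact list.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico")

def is_relevant(url):
    """Keep page requests; drop image files and robots.txt requests."""
    path = urlsplit(url).path.lower()  # ignore the query string
    if path.endswith(IMAGE_EXTENSIONS):
        return False
    if path.endswith("/robots.txt"):
        return False
    return True

def clean_log(urls):
    """Return only the requested URLs that are worth analyzing."""
    return [url for url in urls if is_relevant(url)]

requests = [
    "/modules.php?name=News&file=article&catid=25&sid=8225",
    "/images/logo.gif",
    "/robots.txt",
    "/index.html",
]
cleaned = clean_log(requests)
print(cleaned)
```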
10
Proposed System (2/7)
Make the site topological structure
Helps solve data preprocessing and analysis tasks:
- user identification
- path completion
Goal of the proposed system
Discover similar user groups, relevant page groups and frequently accessed paths.
11
Proposed System (3/7) - Make Topological Structure
Algorithm of Topological Structure (flow chart):
1. Begin.
2. While not at the end of the log file: enter the URL into URL_Queue.
3. While URL_Queue is not empty: get the head and define its depth, find the "http" data, and add the link to Topo_Str_DB.
4. Is there another record? Yes: repeat. No: end.
12
Proposed System (4/7) - Make the topological structure
Topological Structure
- input: URL path and link
- output: complete sitemap (tree): link, path, depth and referrers
Queue (number = depth):
0. Index.html (A)
  1. L.html (referrer)
    2. Sport/Team/football.html
    2. Sport/News/Mongolia.html
  1. Sport.html
    2. Sport/Team/
      3. Sport/Team/football.html
    2. Sport/Advice/...
[Figure: sitemap tree over Index.html (A), L.html, Sport.html, Sport/News/Mongolia.html, Sport/Team/, Sport/Team/football.html and Sport/Advice, with a depth axis from 0 to 3]
URL paths taken from the log:
olloo.mn/L.html
olloo.mn/L.html  Sport/Team/football.html
olloo.mn/L.html  Sport/News/Mongolia.html
olloo.mn/Sport.html
olloo.mn/Sport.html  /Team/football.html
olloo.mn/Sport.html  /Advice/
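One way to picture this step is as a breadth-first walk over (referrer, page) pairs taken from the log. The sketch below is an illustration under that assumption, not the presented algorithm: build_topology and its arguments are hypothetical names, and each page is given the smallest depth at which it can be reached, so football.html ends up at depth 2 (via L.html) rather than 3.

```python
from collections import defaultdict, deque

def build_topology(link_pairs, root):
    """Build a link map and page depths from (referrer, page) pairs."""
    links = defaultdict(set)
    for referrer, page in link_pairs:
        if referrer != page:          # ignore self-links such as reloads
            links[referrer].add(page)

    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()        # get the head of the queue
        for child in sorted(links.get(page, ())):
            if child not in depth:    # first visit gives the smallest depth
                depth[child] = depth[page] + 1
                queue.append(child)
    return dict(links), depth

# (referrer, page) pairs corresponding to the example on the slide.
pairs = [
    ("Index.html", "L.html"),
    ("Index.html", "Sport.html"),
    ("L.html", "Sport/Team/football.html"),
    ("L.html", "Sport/News/Mongolia.html"),
    ("Sport.html", "Sport/Team/"),
    ("Sport/Team/", "Sport/Team/football.html"),
    ("Sport.html", "Sport/Advice/"),
]
links, depth = build_topology(pairs, "Index.html")
for page, level in sorted(depth.items(), key=lambda item: (item[1], item[0])):
    print(level, page)
```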
13
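Proposed System (5/7) - User Identification (for similar user groups)
Flow chart of the User Identification algorithm:
1. Begin.
2. While not at the end of the log DB: read the next record.
3. If the IP is not in IPSet: save the IP, Agent and OS, assign the record to the User Set and increase the user counter.
4. If the IP is already in IPSet: check whether the current IP's Agent and OS are the same as the saved ones. If they differ, save the IP, Agent and OS as a new user and increase the user counter.
5. Is there another record? Yes: repeat. No: end.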
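The flow chart can be approximated by keying each user on the combination of IP, agent and operating system. This is a minimal sketch under that reading; identify_users and the record layout are assumptions made for illustration.

```python
def identify_users(records):
    """Assign a user id to each (ip, agent, os) record.

    Returns the table of known users and the user id of every record.
    """
    users = {}        # (ip, agent, os) -> user id
    assignment = []   # user id for each input record, in order
    for ip, agent, os_name in records:
        key = (ip, agent, os_name)
        if key not in users:
            users[key] = len(users) + 1   # new user: increase the counter
        assignment.append(users[key])
    return users, assignment

# Example taken from the user identification table in the background slides.
records = [
    ("202.131.3.100", "Mozilla/5.0", "Windows NT"),
    ("202.131.3.100", "Mozilla/4.0", "Win2000"),
    ("210.126.19.93", "Mozilla/4.0", "Windows NT"),
    ("202.131.3.100", "Mozilla/5.0", "Windows NT"),
]
users, assignment = identify_users(records)
print(len(users), assignment)
```

Two records from the same IP with different browsers count as two users, which matches the 202.131.3.100 rows of the example.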
14
Proposed System (6/7) - Session identification
Flow chart of the Session Identification algorithm:
1. Begin.
2. While not at the end of the log DB: read the next record.
3. If the IP is not in the User Set, or the refer page is empty, or the time taken > 25.5: start a new session.
4. Otherwise: append the page to the current session.
5. Is there another record? Yes: repeat. No: go to path completion, then end.
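The three session-splitting conditions can be sketched as follows. The slide gives the threshold only as 25.5, so treating it as minutes is an assumption here, as are the function name, the record layout (user, timestamp, page, referrer) and the use of "-" for an empty referrer.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=25.5)  # assumed unit: minutes

def identify_sessions(records):
    """Split time-ordered (user, time, page, referrer) records into sessions."""
    sessions = []    # list of (user, [pages])
    current = {}     # user -> index of that user's open session
    last_seen = {}   # user -> time of that user's previous request
    for user, time, page, referrer in records:
        start_new = (
            user not in current                  # first request of this user
            or referrer in ("", "-")             # empty refer page
            or time - last_seen[user] > TIMEOUT  # too long since last request
        )
        if start_new:
            sessions.append((user, [page]))
            current[user] = len(sessions) - 1
        else:
            sessions[current[user]][1].append(page)
        last_seen[user] = time
    return sessions

def at(hour, minute):
    return datetime(2005, 1, 23, hour, minute)

log = [
    ("u1", at(13, 0), "A", "-"),
    ("u1", at(13, 1), "B", "A"),
    ("u1", at(13, 2), "C", "B"),
    ("u1", at(14, 0), "A", "C"),   # 58 minutes later: new session
    ("u1", at(14, 1), "L", "A"),
]
sessions = identify_sessions(log)
print(sessions)
```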
15
Proposed System (7/7) - Path completion
Flow chart of the Path Completion algorithm:
1. Begin.
2. While not at the end of the session set: take the next session.
3. Check whether a page in the session contains (links to) the next page in that session.
4. Yes: check the next page.
5. No: search for that page in the site map and complete the path.
6. End.
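A common way to realize this step is to assume that a missing link means the user pressed "back" to an earlier page that does link to the requested one. The sketch below follows that assumption; complete_path and the link map are illustrative, with the link map chosen to be consistent with the before/after example in the background slides.

```python
def complete_path(session, links):
    """Insert the cached "back" page views that the server never logged."""
    if not session:
        return []
    completed = [session[0]]
    stack = [session[0]]   # pages the user can return to with "back"
    for page in session[1:]:
        # Walk back until an earlier page links to the requested page.
        back = len(stack) - 1
        while back >= 0 and page not in links.get(stack[back], set()):
            back -= 1
        if back >= 0:
            # Re-add every page passed on the way back.
            for i in range(len(stack) - 2, back - 1, -1):
                completed.append(stack[i])
            del stack[back + 1:]
        # If no earlier page links here, the path is left unchanged.
        completed.append(page)
        stack.append(page)
    return completed

# Assumed part of the site map: A links to B, G, L; B to C, F; and so on.
links = {
    "A": {"B", "G", "L"},
    "B": {"C", "F"},
    "C": {"D"},
    "G": {"I"},
    "N": {"O"},
}
print(complete_path(["A", "B", "C", "D", "F"], links))
print(complete_path(["A", "B", "G", "I"], links))
```

With this link map, A,B,C,D,F becomes A,B,C,D,C,B,F and A,B,G,I becomes A,B,A,G,I, as in the background example.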
16
Experiment (1/4) - Raw log data
URLs in the Web server log: www.olloo.mn
17
Experiment (2/4) - Topological Structure
18
Experiment (3/4) - Data cleaning
19
Experiment (4/4)
20
Result
This result can help discover similar user groups, relevant page groups and frequently accessed paths in Web usage mining (WUM).
[Figures: User group; Path completion]
21
Interface of Path Completion Preprocessing System (PCPS)
Start the new project.
22
Interface of Path Completion Preprocessing System (PCPS)
Give the project name and folder.
23
Interface of Path Completion Preprocessing System (PCPS)
Add the log file to the project.
24
Interface of Path Completion Preprocessing System (PCPS)
Choose the log file to add.
25
Interface of Path Completion Preprocessing System (PCPS)
Asks whether to remove the image files.
(files) to analyze …
(files) to clean …
26
Interface of Path Completion Preprocessing System (PCPS)
Cleaned log and information
The pages and files to be analyzed
27
Interface of Path Completion Preprocessing System (PCPS)
Topological Structure
28
Interface of Path Completion Preprocessing System (PCPS)
Browser
29
Interface of Path Completion Preprocessing System (PCPS)
System
30
Comparing other preprocessing approaches to the Proposed System
Related works | Creation of Topol. Structure | Using Topological Structure | Removing images | Removing robot text | User/Session Identification | Path completion
R. Cooley [12] | X | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | X | O | X | IP, Agent | X
Pitkow [7] | X | O | O | X | Session ID | O
Shahabi [2] | X | X | O | X | Session ID | O
Chen, Park [3] | X | X | O | X | Login, IP | X
Proposed System | O | O | O | O | IP, Agent, Grouping | O
O - used, X - not used
31
Conclusion
Approach | Identified number of accesses | Identified number of users | Identified number of sessions
Without path completion | 18019 | 2812 | 10407
Proposed System | 18019 | 3061 | 11019
My work focuses on the preprocessing stage of Web log mining and on enhancing pattern discovery.
3061 - 2812 = 249 users are neglected when path completion is not used.
This paper presented a new approach and practicable algorithms.
This approach can achieve better precision than some existing approaches.
32
Reference
[1] R. Cooley, B. Mobasher, and J. Srivastava. Web mining: Information and Pattern Discovery on the World Wide Web. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA, 1998.
[2] C. Shahabi and F.B. Kashani. A Framework for Efficient and Anonymous Web Usage Mining Based on Client-Side Tracking. 2001.
[3] M.S. Chen, J.S. Park, P.S. Yu. Data mining for path traversal patterns in a Web environment. 1996.
[4] H. Mannila, H. Toivonen. Discovering generalized episodes using minimal occurrence. 1996.
[5] T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal. From user access patterns to dynamic hypertext linking. 1996.
[6] J. Pitkow. In search of reliable usage data on the WWW. 1997.
[7] J. Pitkow, P. Pirolli and R. Rao. Silk. Extracting usable structures from the Web. 1996.
[8] S. Elo-Dean and M. Viveros. Data mining the IBM official 1996 Olympics Web site.
[9] Open Market Inc. Open Market Web reporter. http://www.openmarket.com, 1996.
[10] net.Genesis. net.analysis desktop. http://www.netgen.com, 1996.
[11] Doru Tanasa, Brigitte Trousse. Advanced data preprocessing for intersites Web Usage Mining. 2004.
[12] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, Dept. of Computer Science, Univ. of Minnesota, 2000.