Design and Implementation of a Web Log Preprocessing System Supporting Path Completion Batchimeg AI lab
Outline
- Introduction
- Background
- Related work
- Proposed System
- Experiment and Result
- Conclusion and Future work
Introduction: Web Log Mining Process
- Visitors use the Web site: viewing news, download, shopping, auction.
- The Web server saves the logged data: IP, OS, Agent, Time, URL, Refer page, Date, Cookie, Method, Status, UserID, bytes, …
- Process: saved Web log data in the Web server → Preprocessing → Pattern Discovery → Pattern Analysis → Knowledge
- Analysis tools: DB query, visualization tools, intelligent agents, data analysis
- My research area: Web log preprocessing
Background (1/4): Log format
- Client IP
- Date - 23/Jan/2005
- Accessed time - 13:37:12
- Method - GET (request a page), POST (send data to the server), HEAD
- Protocol - HTTP/1.1
- Status code - success, or 401, 301, 500 (redirect or error)
- Size of file
- Agent type - Mozilla/4.0
- Operating system - Windows NT
Example: a visitor views a news item and then sends it to a friend:
… [23/Jan/2005:13:37:…] "GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" … "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
… lines recorded
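The fields listed on this slide could be extracted from one log line roughly as follows. This is a minimal sketch, assuming the server writes the widely used "combined" log format. The pattern, the name `parse_line`, and the sample line (its IP address, seconds, time zone, status, size and referrer) are illustrative assumptions, not data from the olloo.mn log.

```python
import re
from datetime import datetime

# Assumed layout: "combined" log format (IP, identity, user, time,
# request line, status, size, referrer, agent).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    rec = match.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

# Hypothetical sample line; only the request and agent come from the slide.
SAMPLE = (
    '192.0.2.10 - - [23/Jan/2005:13:37:12 +0800] '
    '"GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" '
    '200 5120 "http://www.example.org/index.html" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)
record = parse_line(SAMPLE)
```

Lines that do not match the assumed layout return `None`, so they can be counted or skipped during cleaning instead of stopping the run.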
Background (2/4): User identification, session identification
- Preprocessing steps: Cleaning Log → User Identification → Session Identification → Path Completion → Formatting
- User identification determines each user who accesses the Web site.
- User IP + Browser (UserID + IP + OS, or cookie) => identify the users
- Session identification finds each user's access patterns and frequently followed paths.
- Example, user identification by IP and browser: Mozilla/5.0 (Windows NT), Mozilla/4.0 (Win2000), Mozilla/4.0 (Windows NT); visited pages: A,B,C,D,F,A,L; A,B,G,L; N,O
- Example, session identification: the visited pages are split into the sessions A,B,C,D,F; A,L; A,B,G,L; N,O
Background (3/4): Server log and caching, missed page views at the server
- If a client had to request every Web page from the server, browsing would be slower.
- The solution to this problem is caching: clients and proxy servers save local copies of pages.
- Pages revisited with "back" and "forward" are served from the cache and are never logged by the server.
- Figure: client, cache and server exchanging requests and responses for pages P3, P4, P5 and P6; a repeated request for P3 is answered from the cache.
Background (4/4): Path completion
- Preprocessing steps: Cleaning Log → User Identification → Session Identification → Path Completion → Formatting
- Not all requested pages are recorded in the Web log, because of caching.
- Figure: topological structure of the site with the pages A.html to Q.html.
- Before → after path completion:
  A,B,C,D,F → A,B,C,D,C,B,F
  A,B,G,I → A,B,A,G,I
- The paths A,L and N,O are listed without a completed form.
Related work
Related works | Using topological structure | Removing images | Removing robot text | User/session identification | Path completion
R. Cooley [12] | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | O | X | IP, Agent | X
Pitkow [7] | O | O | X | Session ID | O
Shahabi [2] | X | O | X | Session ID | O
Chen, Park [3] | X | O | X | Login, IP | X
O - used, X - not used
Proposed System (1/7): Preprocessing
Steps:
- Data cleaning (eliminate irrelevant information)
- Web site's topological structure (find the hyperlink relations between Web pages), constructed from the Web log data on the server
- User identification and session identification (identify each user, find each user's access pattern)
- Path completion
- User grouping
- Result (after session identification and path completion)
Why preprocessing?
- Preprocessing can take up to 60-80% of the time spent analyzing the data.
- Incomplete preprocessing can easily lead to invalid patterns and wrong conclusions.
Proposed System (2/7): Make the site topological structure
- The topological structure helps with data preprocessing and analysis:
  - user identification
  - path completion
- Goal of the proposed system: discover similar user groups, relevant page groups and frequently accessed paths.
Proposed System (3/7): Make the topological structure
Flow chart of the topological structure algorithm:
- Begin
- While not at the end of the log file: enter the URL into URL_Queue
- While the URL queue is not empty: get the head and define its depth
- Find the "http" data
- Add the link to Topo_Str_DB
- Is there another record? Yes: continue; No: end
- End
Proposed System (4/7): Make the topological structure
- Input: URL path and link
- Output: complete sitemap (tree) with link, path, depth and referrers
- Queue with depth:
  0. Index.html (A)
    1. L.html (referrer)
      2. Sport/Team/football.html
      2. Sport/News/Mongolia.html
    1. Sport.html
      2. Sport/Team/
        3. Sport/Team/football.html
      2. Sport/Advice/...
- Referrer and link pairs in the figure: olloo.mn/L.html with Sport/Team/football.html and Sport/News/Mongolia.html; olloo.mn/Sport.html with /Team/ (and football.html below it) and /Advice/
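The queue-and-depth idea on this slide can be sketched as a breadth-first walk over (referrer, URL) pairs taken from the log. This is a minimal sketch under stated assumptions, not the system's implementation: the function name `build_topology`, the pair list and the choice of `Index.html` as root are illustrative, with page names borrowed from the slide.

```python
from collections import defaultdict, deque

def build_topology(pairs, root):
    """Build link sets and page depths from (referrer, url) pairs.

    Returns (links, depth): links maps a page to the pages it links to,
    depth maps each page reachable from root to its shortest distance.
    """
    links = defaultdict(set)
    for referrer, url in pairs:
        if referrer and referrer != url:
            links[referrer].add(url)

    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()              # get the head of the URL queue
        for child in sorted(links.get(page, ())):
            if child not in depth:          # first visit gives the depth
                depth[child] = depth[page] + 1
                queue.append(child)
    return dict(links), depth

# Hypothetical pairs modelled on the page names of the slide.
PAIRS = [
    ("Index.html", "L.html"),
    ("Index.html", "Sport.html"),
    ("L.html", "Sport/Team/football.html"),
    ("L.html", "Sport/News/Mongolia.html"),
    ("Sport.html", "Sport/Team/"),
    ("Sport/Team/", "Sport/Team/football.html"),
    ("Sport.html", "Sport/Advice/"),
]
links, depth = build_topology(PAIRS, "Index.html")
```

In this sketch a page reachable by several routes keeps its smallest depth, so `Sport/Team/football.html` gets depth 2 (through `L.html`) rather than 3 (through `Sport/Team/`).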
Proposed System (5/7): User identification, for similar user groups
Flow chart of the user identification algorithm:
- Begin
- Loop while not at the end of the log DB
- Decision: is the IP not in IPSet?
- Decision: are the Agent and OS of the current IP the same?
- Action: save the IP, Agent and OS
- Action: assign to the User Set and increase the user counter
- Decision: is there another record?
- End
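A minimal sketch of this step, assuming each cleaned record is a dict with `ip` and `agent` keys and that the agent string already carries the operating system. The function name `identify_users`, the record layout and the sample IP addresses are assumptions made for illustration.

```python
def identify_users(records):
    """Label each record with a user number keyed on (IP, agent).

    A new (IP, agent) combination is saved and increases the user
    counter; a known combination reuses its existing number.
    """
    user_ids = {}
    labelled = []
    for rec in records:
        key = (rec["ip"], rec["agent"])
        if key not in user_ids:
            user_ids[key] = len(user_ids) + 1
        labelled.append({**rec, "user": user_ids[key]})
    return labelled, len(user_ids)

# Hypothetical records; agent strings follow the background slide.
RECORDS = [
    {"ip": "192.0.2.1", "agent": "Mozilla/5.0 (Windows NT)", "url": "A"},
    {"ip": "192.0.2.1", "agent": "Mozilla/4.0 (Win2000)", "url": "A"},
    {"ip": "192.0.2.1", "agent": "Mozilla/5.0 (Windows NT)", "url": "B"},
    {"ip": "192.0.2.2", "agent": "Mozilla/4.0 (Windows NT)", "url": "N"},
]
labelled, user_count = identify_users(RECORDS)
```

Two browsers behind the same IP address count as two users here, which is the point of combining IP with Agent and OS.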
Proposed System (6/7): Session identification
Flow chart of the session identification algorithm:
- Begin
- Loop while not at the end of the log DB
- Decision: is the refer page empty?
- Decision: is the IP not in the User Set?
- Action: start a new session
- Action: append the page to the session
- Decision: time taken > 25.5?
- Decision: is there another record? If not, go to path completion
- End
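The decisions in this flow chart could be combined as in the sketch below. It assumes the threshold 25.5 is in minutes (the slide gives no unit) and that records are sorted by time and already carry a `user` label. The name `identify_sessions` and the record layout are illustrative assumptions.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=25.5)   # assumed unit: minutes

def identify_sessions(records, timeout=TIMEOUT):
    """Split time-ordered records into per-user sessions.

    A new session starts for an unseen user, for an empty referrer,
    or when the gap since the user's last request exceeds the timeout.
    """
    sessions = []
    open_session = {}   # user -> index of that user's open session
    last_seen = {}      # user -> time of that user's last request
    for rec in records:
        user = rec["user"]
        starts_new = (
            user not in open_session
            or not rec["referrer"]
            or rec["time"] - last_seen[user] > timeout
        )
        if starts_new:
            sessions.append({"user": user, "pages": []})
            open_session[user] = len(sessions) - 1
        sessions[open_session[user]]["pages"].append(rec["url"])
        last_seen[user] = rec["time"]
    return sessions

# Hypothetical requests from a single user.
T0 = datetime(2005, 1, 23, 13, 0, 0)
REQUESTS = [
    {"user": 1, "time": T0, "url": "A", "referrer": ""},
    {"user": 1, "time": T0 + timedelta(minutes=1), "url": "B", "referrer": "A"},
    {"user": 1, "time": T0 + timedelta(minutes=40), "url": "A", "referrer": "B"},
    {"user": 1, "time": T0 + timedelta(minutes=41), "url": "L", "referrer": "A"},
]
sessions = identify_sessions(REQUESTS)
```

The third request arrives 39 minutes after the second, so it opens a second session even though its referrer is not empty.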
Proposed System (7/7): Path completion
Flow chart of the path completion algorithm:
- Begin
- Loop while not at the end of the session set
- Decision: does a page in the session contain a link to the next page in that session?
  - Yes: check the next page
  - No: search for that page in the site map and complete the path
- End
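One way to realize this step is to walk back through the pages already visited, as a "back" button would, until a page that links to the next request is found. The sketch below takes that approach as an assumption. The name `complete_path` and the link table `LINKS` are hypothetical, chosen so that the before and after paths of the background slide on path completion are reproduced.

```python
def complete_path(session, links):
    """Insert the cached backtracking steps missing from a session.

    links maps a page to the set of pages it links to (the site map).
    """
    if not session:
        return []
    completed = [session[0]]
    history = [session[0]]            # pages reachable with "back"
    for page in session[1:]:
        trail = list(history)
        inserted = []
        # Step back until the page on top of the trail links to `page`.
        while trail and page not in links.get(trail[-1], ()):
            trail.pop()
            if trail:
                inserted.append(trail[-1])
        if trail:                     # a linking page was found
            completed.extend(inserted)
            history = trail
        completed.append(page)
        history.append(page)
    return completed

# Hypothetical site map consistent with the before/after example.
LINKS = {
    "A": {"B", "G", "L"},
    "B": {"C", "F"},
    "C": {"D"},
    "G": {"I"},
    "N": {"O"},
}
path_1 = complete_path(["A", "B", "C", "D", "F"], LINKS)
path_2 = complete_path(["A", "B", "G", "I"], LINKS)
path_3 = complete_path(["N", "O"], LINKS)
```

If no earlier page in the session links to the next request, for example after a typed URL, the sketch leaves that step unchanged instead of guessing a route.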
Experiment (1/4): Raw log data
- URLs in the Web server log of www.olloo.mn
Experiment (2/4): Topological structure
Experiment (3/4): Data cleaning
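Data cleaning, as compared in the related-work table (removing images, removing robot text), might look like the sketch below. The suffix list, the treatment of `robots.txt` requests and the names `is_irrelevant` and `clean` are assumptions made for illustration. The system's actual rules are not given on the slides.

```python
from urllib.parse import urlsplit

# Assumed set of image suffixes to drop.
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico")

def is_irrelevant(url):
    """True for image files and robots.txt requests."""
    path = urlsplit(url).path.lower()   # ignore the query string
    return path.endswith(IMAGE_SUFFIXES) or path.endswith("/robots.txt")

def clean(records):
    """Keep only the records whose URL is worth analyzing."""
    return [rec for rec in records if not is_irrelevant(rec["url"])]

# Hypothetical records.
RAW = [
    {"url": "/modules.php?name=News&file=friend&op=FriendSend&sid=8225"},
    {"url": "/images/logo.GIF?v=2"},
    {"url": "/robots.txt"},
    {"url": "/index.html"},
]
cleaned = clean(RAW)
```

Matching on the URL path rather than the full URL keeps dynamic pages whose query string happens to end in an image name.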
Experiment (4/4)
Result
- These results help to discover similar user groups, relevant page groups and frequently accessed paths in Web usage mining (WUM).
- Shown: user group, path completion
Interface of the Path Completion Preprocessing System (PCPS)
- Start a new project.
- Give the project name and folder.
- Add the log file to the project.
- Choose the log file to add.
- Prompt to remove the image files: files to analyze, files to clean.
- Cleaned log and information: the pages and files selected for analysis.
- Topological structure
- Browser
- System
Comparing other preprocessing approaches to the Proposed System
Related works | Creation of topological structure | Using topological structure | Removing images | Removing robot text | User/session identification | Path completion
R. Cooley [12] | X | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | X | O | X | IP, Agent | X
Pitkow [7] | X | O | O | X | Session ID | O
Shahabi [2] | X | X | O | X | Session ID | O
Chen, Park [3] | X | X | O | X | Login, IP | X
Proposed System | O | O | O | O | IP, Agent, Grouping | O
O - used, X - not used
Conclusion
- Table columns: Approach | Identified number of accesses | Identified number of users | Identified number of sessions; rows: Path completion not used, Proposed System
- … - 2812 = 249 users neglected.
- My work focuses on the preprocessing stage of Web log mining and improves pattern discovery.
- This work presented a new approach and practical algorithms.
- The approach can achieve better precision than some existing approaches.
References
[1] R. Cooley, B. Mobasher, and J. Srivastava, "Web mining: Information and Pattern Discovery on the World Wide Web," Dept. of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA, 1998.
[2] C. Shahabi and F.B. Kashani, "A Framework for Efficient and Anonymous Web Usage Mining Based on Client-Side Tracking," 2001.
[3] M.S. Chen, J.S. Park, and P.S. Yu, "Data mining for path traversal patterns in a Web environment."
[4] H. Mannila and H. Toivonen, "Discovering generalized episodes using minimal occurrence."
[5] T. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal, "From user access patterns to dynamic hypertext linking."
[6] J. Pitkow, "In search of reliable usage data on the WWW."
[7] J. Pitkow, P. Pirolli, and R. Rao, "Silk. Extracting usable structures from the Web."
[8] S. Elo-Dean and M. Viveros, "Data mining the IBM official 1996 Olympics Web site."
[9] Open Market Inc., "Open Market Web reporter."
[10] net.Genesis, "net.analysis desktop."
[11] D. Tanasa and B. Trousse, "Advanced data preprocessing for intersites Web Usage Mining," 2004.
[12] R. Cooley, "Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data," PhD thesis, Dept. of Computer Science, Univ. of Minnesota, 2000.