Design and Implementation of a Web Log Preprocessing System Supporting Path Completion Batchimeg AI lab. 2005.04.19.



Outline
• Introduction
• Background
• Related work
• Proposed System
• Experiment and Result
• Conclusion and Future work

Introduction
• Web Log Mining Process
[Figure: visitors use the Web site (viewing news, download, shopping, auction); the Web server saves the Web log data; data analysis proceeds through Preprocessing, Pattern Discovery and Pattern Analysis; outputs include a DB, visualization tools, knowledge queries and intelligent agents.]
• Logged data: IP, OS, Agent, Time, URL, Refer page, Date, Cookie, Method, Status, UserID, bytes, …
• My research area: Web log preprocessing

Background (1/4)
• Log format:
  - Client IP
  - Date: 23/Jan/2005
  - Access time: 13:37:12
  - Method: GET (requests a page), POST (sends data to the server), HEAD
  - Protocol: HTTP/1.1
  - Status code: a success code, or 401, 500 (errors), 301 (redirect)
  - Size of file
  - Agent type: Mozilla/4.0
  - Operating system: Windows NT
• Example: after viewing a news item, a visitor ( ) sends it to a friend:
[23/Jan/2005:13:37: ] "GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
• … lines of records
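The fields listed on this slide correspond to the widely used "combined" access-log layout. The following is a minimal sketch of extracting them with a regular expression. The sample line is hypothetical: its IP address (192.0.2.1), time zone, status (200), size (5120) and referrer ("-") are placeholders, because the slide's own example is incomplete. `LOG_PATTERN` and `parse_line` are illustrative names, not part of the presented system.

```python
import re
from datetime import datetime

# Combined log format: IP, identity, user, [time], "request", status, size, "referrer", "agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample line (placeholder IP, time zone, status, size, referrer).
SAMPLE = (
    '192.0.2.1 - - [23/Jan/2005:13:37:12 +0900] '
    '"GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)

record = parse_line(SAMPLE)
print(record["ip"], record["method"], record["url"], record["status"])
```

Lines that do not fit the pattern return `None`, so a caller can count or set aside malformed records before the cleaning step.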

Background (2/4): User Identification, Session Identification
Pipeline: Cleaning Log → User Identification → Session Identification → Path Completion → Formatting
• User identification means identifying each user who accesses the Web site.
• IP + Browser (UserID + IP + OS, or cookie) => identifies the users.
• Session identification finds each user's access pattern and frequent paths.

User Identification (by IP, Browser)
Browser | Visited pages
Mozilla/5.0 (Windows NT) | A,B,C,D,F,A,L
Mozilla/4.0 (Win2000) | A,B,G,L
Mozilla/4.0 (Windows NT) | N,O

Session Identification
Browser | Session
Mozilla/5.0 (Windows NT) | A,B,C,D,F
Mozilla/5.0 (Windows NT) | A,L
Mozilla/4.0 (Win2000) | A,B,G,L
Mozilla/4.0 (Windows NT) | N,O

Background (3/4): Server Log and Caching
• If a client had to request every Web page from the server, browsing would be slower.
• The solution to this problem is caching: clients and proxy servers save local copies of pages, which are reused for actions such as "back" and "forward".
• Missed page views at the server: requests answered from a cache are never logged by the server.
[Figure: request/send exchanges between Client, Cache and Server for pages P3, P4, P5 and P6; a request for P3 that is answered from the cache is never logged by the server.]

Background (4/4): Path completion
Pipeline: Cleaning Log → User Identification → Session Identification → Path Completion → Formatting
• Not all requested pages are recorded in the Web log, because of caching.
• Topological structure (pages in the figure): A.html, B.html, G.html, L.html, C.html, F.html, N.html, D.html, E.html, H.html, I.html, K.html, O.html, M.html, P.html, J.html, Q.html
• Path completion:
Before | After
A,B,C,D,F | A,B,C,D,C,B,F
A,L | (unchanged)
A,B,G,I | A,B,A,G,I
N,O | (unchanged)
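To see why the logged path differs from the path the visitor actually followed, here is a small sketch that replays a browsing sequence through a client-side cache. The model is a simplifying assumption: every page is cached after its first request and never expires. `browse` is an illustrative name.

```python
def browse(requests):
    """Replay page requests through a client cache; return what the server logs."""
    cache = set()
    server_log = []
    for page in requests:
        if page in cache:
            # Answered from the local copy (e.g. "back" button): the server never sees it.
            continue
        server_log.append(page)
        cache.add(page)
    return server_log

# The visitor views A, B, C, D, goes back to C and B, then follows a link to F.
actual_path = ["A", "B", "C", "D", "C", "B", "F"]
logged_path = browse(actual_path)
print(logged_path)
```

Under this model the server records only A, B, C, D, F, which is the "Before" form of the first row in the table; path completion tries to recover the "After" form.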

Related work
Related work | Using topological structure | Removing images | Removing robot text | User/session identification | Path completion
R. Cooley [12] | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | O | X | IP, Agent | X
Pitkow [7] | O | O | X | Session ID | O
Shahabi [2] | X | O | X | Session ID | O
Chen, Park [3] | X | O | X | Login, IP | X
O: used, X: not used

Proposed System (1/7): Preprocessing
• Why preprocessing?
  - Preprocessing can take up 60-80% of the time spent analyzing the data.
  - An incomplete preprocessing task can easily lead to invalid patterns and wrong conclusions.
• Steps:
  - Data cleaning (eliminate irrelevant information)
  - Web site's topological structure (find the hyperlink relations between Web pages): construct the site's topological structure from the Web log data on the server
  - User identification and session identification (identify each user, find each user's access pattern)
  - Path completion
  - User grouping (after session identification and path completion)
  - Result
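The data cleaning step can be sketched as a filter over requested URLs. This is a minimal illustration, not the presented system's code: the list of image extensions and the choice to drop requests for `/robots.txt` are assumptions, and `is_relevant` and `clean` are illustrative names.

```python
from urllib.parse import urlsplit

# Assumed set of extensions treated as embedded files rather than page views.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".bmp")

def is_relevant(url):
    """Keep a request unless it targets an image file or robots.txt."""
    path = urlsplit(url).path.lower()  # ignore the query string and letter case
    if path.endswith(IMAGE_EXTENSIONS):
        return False
    if path == "/robots.txt":
        return False
    return True

def clean(urls):
    """Return only the requests that should be analyzed."""
    return [url for url in urls if is_relevant(url)]

requests = [
    "/index.html",
    "/images/logo.gif",
    "/robots.txt",
    "/modules.php?name=News",
    "/photo.JPG?size=2",
]
cleaned = clean(requests)
print(cleaned)
```

Matching on the URL path rather than the full URL keeps pages such as `/modules.php?name=News` while still removing images requested with a query string.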

Proposed System (2/7)
• Make the site's topological structure
  - It helps with data preprocessing and analysis: user identification and path completion.
• Goal of the proposed system
  - Discover similar user groups, relevant page groups and frequently accessed paths.

Proposed System (3/7): Make Topological Structure
• Algorithm of Topological Structure (flow chart nodes):
  - Begin
  - Loop: not end of log file
  - Enter URL into URL_Queue
  - Loop: URL queue not empty
  - Get head, define depth
  - Find "http" data
  - Add link to Topo_Str_DB
  - Decision: is there another record? (Yes / No)
  - End

Proposed System (4/7): Make the topological structure
• Topological structure
  - Input: URL → path and link
  - Output: complete sitemap (tree) with link, path, depth and referrers → queue
• Example queue (depth. page):
  0. Index.html (A)
  1. L.html (referrer)
  2. Sport/Team/football.html
  2. Sport/News/Mongolia.html
  1. Sport.html
  2. Sport/Team/
  3. Sport/Team/football.html
  2. Sport/Advice/
  ...
• Example links:
  olloo.mn/L.html → Sport/Team/football.html
  olloo.mn/L.html → Sport/News/Mongolia.html
  olloo.mn/Sport.html → /Team/ → football.html
  olloo.mn/Sport.html → /Advice/
[Figure: tree arranged by depth with nodes Index.html (A), Sport.html, L.html, Sport/News/Mongolia.html, Sport/Team/, Sport/Team/football.html and Sport/Advice.]
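One way to read this slide is that each log record contributes a (referrer, URL) link, and depths are assigned by walking the links from the home page with a queue. The sketch below follows that reading. It is an assumption about the algorithm, not the presented implementation, and `build_topology` is an illustrative name. The link pairs are modeled on the slide's example, with the two links out of Index.html assumed from the depth-1 entries.

```python
from collections import defaultdict, deque

def build_topology(pairs, root):
    """Build a link map and breadth-first depths from (referrer, url) pairs."""
    links = defaultdict(set)
    for referrer, url in pairs:
        if referrer and referrer != url:
            links[referrer].add(url)

    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()  # get the head of the queue
        for child in sorted(links.get(page, ())):
            if child not in depth:  # keep the first (shallowest) depth found
                depth[child] = depth[page] + 1
                queue.append(child)
    return dict(links), depth

# Link pairs modeled on the slide's example (Index.html links are assumed).
pairs = [
    ("Index.html", "L.html"),
    ("Index.html", "Sport.html"),
    ("L.html", "Sport/Team/football.html"),
    ("L.html", "Sport/News/Mongolia.html"),
    ("Sport.html", "Sport/Team/"),
    ("Sport/Team/", "Sport/Team/football.html"),
    ("Sport.html", "Sport/Advice/"),
]
links, depth = build_topology(pairs, "Index.html")
for page, level in sorted(depth.items(), key=lambda item: (item[1], item[0])):
    print(level, page)
```

A page reachable by several routes, such as Sport/Team/football.html, appears at depth 2 and depth 3 on the slide; this sketch stores only the shallowest depth while the link map keeps both incoming links.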

Proposed System (5/7): User Identification (for similar user groups)
• Flow chart of the User Identification algorithm (nodes):
  - Begin
  - Loop: not end of log DB
  - Decision: IP not in IPSet? (Yes / No)
  - Save the IP, Agent and OS
  - Decision: are the current IP's Agent and OS the same? (Yes / No)
  - Assign to the User Set, increase the user counter
  - Decision: are there other records? (Yes / No)
  - End
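A compact way to express the idea behind this flow chart is to treat each distinct combination of IP, Agent and OS as one user. The sketch below makes that simplifying assumption; the record fields, the placeholder IP addresses and the name `identify_users` are hypothetical.

```python
def identify_users(records):
    """Assign a user number to each distinct (IP, agent, OS) combination."""
    user_ids = {}      # (ip, agent, os) -> user number
    assignments = []   # user number for each record, in input order
    for record in records:
        key = (record["ip"], record["agent"], record["os"])
        if key not in user_ids:
            # New combination: add it to the user set and increase the counter.
            user_ids[key] = len(user_ids) + 1
        assignments.append(user_ids[key])
    return user_ids, assignments

# Placeholder records: two browsers behind one IP count as two users.
records = [
    {"ip": "192.0.2.10", "agent": "Mozilla/5.0", "os": "Windows NT"},
    {"ip": "192.0.2.10", "agent": "Mozilla/4.0", "os": "Win2000"},
    {"ip": "192.0.2.10", "agent": "Mozilla/5.0", "os": "Windows NT"},
    {"ip": "192.0.2.20", "agent": "Mozilla/4.0", "os": "Windows NT"},
]
user_ids, assignments = identify_users(records)
print(len(user_ids), assignments)
```

Keying on the full combination separates users who share an IP address (for example behind a proxy) as long as their browser or operating system differs.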

Proposed System (6/7): Session Identification
• Flow chart of the Session Identification algorithm (nodes):
  - Begin
  - Loop: not end of log DB
  - Decision: refer page empty? (Yes / No)
  - Decision: IP not in User Set? (Yes / No)
  - Decision: time taken > 25.5? (Yes / No)
  - Start a new session
  - Append the page to the session
  - Decision: are there other records? (Yes / No)
  - Go to path completion
  - End
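The three decisions in this flow chart can be read as the conditions under which a record opens a new session. The sketch below follows that reading and is not the presented implementation. It assumes the 25.5 threshold is in minutes, that each record already carries a user number, and that a gap is measured from the user's previous request; `identify_sessions` and `make_record` are illustrative names.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=25.5)  # assumption: the slide's 25.5 is in minutes

def identify_sessions(records):
    """Group records into sessions; returns a list of (user, [urls])."""
    sessions = []
    open_session = {}  # user -> (index into sessions, time of last request)
    for record in sorted(records, key=lambda r: r["time"]):
        user = record["user"]
        state = open_session.get(user)
        starts_new = (
            state is None                            # user not seen before
            or not record["referrer"]                # empty refer page
            or record["time"] - state[1] > TIMEOUT   # gap longer than the timeout
        )
        if starts_new:
            sessions.append((user, [record["url"]]))
            index = len(sessions) - 1
        else:
            index = state[0]
            sessions[index][1].append(record["url"])
        open_session[user] = (index, record["time"])
    return sessions

START = datetime(2005, 1, 23, 13, 37, 12)

def make_record(user, minute, url, referrer):
    """Helper for building hypothetical test records."""
    return {"user": user, "time": START + timedelta(minutes=minute),
            "url": url, "referrer": referrer}

log = [
    make_record(1, 0, "A", ""), make_record(1, 1, "B", "A"),
    make_record(1, 2, "C", "B"), make_record(1, 3, "D", "C"),
    make_record(1, 4, "F", "B"),
    make_record(2, 5, "N", ""), make_record(2, 6, "O", "N"),
    make_record(1, 40, "A", "F"),  # 36-minute gap: exceeds the timeout
    make_record(1, 41, "L", "A"),
]
sessions = identify_sessions(log)
for user, pages in sessions:
    print(user, ",".join(pages))
```

With these hypothetical timestamps, user 1's visits split into A,B,C,D,F and A,L, and user 2 gets the single session N,O.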

Proposed System (7/7): Path Completion
• Flow chart of the Path Completion algorithm (nodes):
  - Begin
  - Loop: not end of session set
  - Decision: does a page in a session contain (link to) the next page in that session? (Yes / No)
  - Yes: check the next page
  - No: search for that page in the site map
  - Complete the path
  - End
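A common way to implement this check is to assume that, when the next page is not linked from the current one, the visitor pressed "back" until reaching a page that does link to it. The sketch below implements that backtracking interpretation. It is an assumption rather than the presented algorithm, and the link map is hypothetical, chosen to be consistent with the before/after example in Background (4/4). `complete_path` is an illustrative name.

```python
def complete_path(session, links):
    """Insert the cached 'back' steps that the server log is missing."""
    completed = []
    history = []  # pages on the visitor's current navigation path
    for page in session:
        if history and page not in links.get(history[-1], set()):
            # Search earlier pages, most recent first, for one that links to `page`.
            target = None
            for i in range(len(history) - 2, -1, -1):
                if page in links.get(history[i], set()):
                    target = i
                    break
            if target is not None:
                # Re-add the pages revisited while going back.
                for j in range(len(history) - 2, target - 1, -1):
                    completed.append(history[j])
                history = history[:target + 1]
        history.append(page)
        completed.append(page)
    return completed

# Hypothetical site map (page -> pages it links to).
links = {
    "A": {"B", "G", "L"},
    "B": {"C", "F"},
    "C": {"D"},
    "G": {"I"},
    "N": {"O"},
}

for session in (["A", "B", "C", "D", "F"], ["A", "L"], ["A", "B", "G", "I"], ["N", "O"]):
    print(",".join(session), "->", ",".join(complete_path(session, links)))
```

If no earlier page links to the requested page, the sketch leaves the path as logged, since the site map offers no evidence for how the visitor got there.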

Experiment (1/4)
• URLs in Web server log: www.olloo.mn
• Raw log data

Experiment (2/4)
• Topological Structure

Experiment (3/4)
• Data cleaning

Experiment (4/4)

Result
• This result helps to discover similar user groups, relevant page groups and frequently accessed paths in Web usage mining (WUM).
[Screenshots: User group; Path completion]

Interface of Path Completion Preprocessing System (PCPS)
• Start the new project.
• Give the project name and folder.
• Add the log file to the project.
• Choose the log file to add.
• Asking whether to remove the image files: (files) to analyze …, (files) to clean …
• Cleaned log and information: the pages and files to be analyzed.
• Topological Structure
• Browser
• System

Comparison of other preprocessing approaches with the Proposed System
Related work | Creation of topological structure | Using topological structure | Removing images | Removing robot text | User/session identification | Path completion
R. Cooley [12] | X | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | X | O | X | IP, Agent | X
Pitkow [7] | X | O | O | X | Session ID | O
Shahabi [2] | X | X | O | X | Session ID | O
Chen, Park [3] | X | X | O | X | Login, IP | X
Proposed System | O | O | O | O | IP, Agent, Grouping | O
O: used, X: not used

Conclusion
• My work focuses on the preprocessing step of Web log mining and improves pattern discovery.
• Comparison table (columns: Approach, Identified number of accesses, Identified number of users, Identified number of sessions; rows: Without path completion, Proposed System)
• – 2812 = 249 users neglected.
• This paper presented a new approach and practicable algorithms.
• This approach can achieve better precision than some existing approaches.

Reference
[1] R. Cooley, B. Mobasher, and J. Srivastava (Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA). "Web mining: Information and Pattern Discovery on the World Wide Web." 1998.
[2] C. Shahabi and F.B. Kashani. "A Framework for Efficient and Anonymous Web Usage Mining Based on Client-Side Tracking." 2001.
[3] M.S. Chen, J.S. Park, P.S. Yu. "Data mining for path traversal patterns in a Web environment."
[4] H. Mannila, H. Toivonen. "Discovering generalized episodes using minimal occurrence."
[5] T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal. "From user access patterns to dynamic hypertext linking."
[6] J. Pitkow. "In search of reliable usage data on the WWW."
[7] J. Pitkow, P. Pirolli and R. Rao. "Silk. Extracting usable structures from the Web."
[8] S. Elo-Dean and M. Viveros. "Data mining the IBM official 1996 Olympics Web site."
[9] Open Market Inc. "Open Market Web reporter."
[10] net.Genesis. "net.analysis desktop."
[11] Doru Tanasa, Brigitte Trousse. "Advanced data preprocessing for intersites Web Usage Mining." 2004.
[12] R. Cooley. "Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data." PhD thesis, Dept. of Computer Science, Univ. of Minnesota, 2000.