CS 401 Paper Presentation Praveen Inuganti

Presentation transcript:

Data Preparation for Mining World Wide Web Browsing Patterns
Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava
CS 401 Paper Presentation, Praveen Inuganti

Overview
-Introduction
-Architecture of WEBMINER system
-Browsing behavior models
-Preprocessing: data cleaning, user identification, session identification, path completion
-Advantages and disadvantages
-Conclusion

Introduction
The WWW continues to grow at an astounding rate, which increases the complexity of tasks such as web site design, web server design, and simply navigating through a web site.
An important input to these design tasks is an analysis of how a web site is used. Usage information can be used to restructure a web site to better serve the needs of its users.
Web usage mining is the application of data mining techniques to large web data repositories in order to produce results that can be used in these design tasks.
Some of the data mining algorithms commonly used in web usage mining are:
i) Association rule generation: Association rule mining techniques discover unordered correlations between items found in a database of transactions. e.g. 45% of the visitors who accessed the CS home page also accessed Sanjay Madria’s home page.
ii) Sequential pattern generation: This is concerned with finding inter-transaction patterns such that the presence of a set of items is followed by another item in the time-stamp ordered transaction set.

Introduction
Sequential pattern example: 25% of the site visitors accessed the sports main page followed by the news main page.
iii) Clustering: Clustering analysis allows one to group together users or data items that have similar characteristics.
The input for the web usage mining process is a file, referred to as a user session file, that gives an exact accounting of who accessed the web site, what pages were requested and in what order, and how long each page was viewed.
A web server log does not reliably represent a user session file. Hence, several preprocessing tasks must be performed before data mining algorithms are applied to the data collected from server logs.
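To make the two percentage examples concrete, here is a minimal Python sketch of how such figures could be computed from a user session file. The session data, page paths, and function names are hypothetical and chosen only for illustration; the slides do not specify how WEBMINER computes these measures.

```python
def association_confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing `antecedent` that also contain
    `consequent`. Order of access is ignored (unordered correlation)."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for s in with_antecedent if consequent in s)
    return hits / len(with_antecedent)


def sequential_support(sessions, first, then):
    """Fraction of all sessions in which `first` is requested and `then`
    is requested at some later point in the same session."""
    if not sessions:
        return 0.0

    def ordered(session):
        return first in session and then in session[session.index(first) + 1:]

    return sum(1 for s in sessions if ordered(s)) / len(sessions)


# Hypothetical sessions: each is the ordered list of pages one user requested.
sessions = [
    ["/cs", "/cs/madria", "/news"],
    ["/cs", "/sports", "/news"],
    ["/sports", "/news"],
    ["/news", "/sports"],
]

rule_confidence = association_confidence(sessions, "/cs", "/cs/madria")
pattern_support = sequential_support(sessions, "/sports", "/news")
```

The association measure ignores ordering, while the sequential measure counts a session only when the second page follows the first, which is why the last session does not count toward the sports-then-news pattern.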

Architecture of WEBMINER System

Browsing Behavior Models
In some respects, web usage mining is the process of reconciling the web site developer’s view of how the site should be used with the way users actually browse the site.
Therefore, the two inputs required for the web usage mining process are an encoding of the site developer’s view of browsing behavior and an encoding of the actual browsing behaviors.
i) Developer’s model: The web site developer’s view of how the site should be used is inherent in the structure of the site.
* Each link between pages exists because the developer believes that the pages are related in some way.
* The content of the pages themselves provides information about how the developer expects the site to be used.
Hence, an integral step of the preprocessing phase is classifying the site pages and extracting the site topology from the HTML files that make up the web site.

Browsing Behavior Models
The WEBMINER system recognizes five main types of pages:
-Head Page: a page whose purpose is to be the first page that users visit
-Content Page: a page that contains a portion of the information content that the web site is providing
-Navigation Page: a page whose purpose is to provide links to guide users on to content pages
-Look-up Page: a page used to provide a definition or acronym expansion
-Personal Page: a page used to present information of a biographical nature
Each of these types of pages is expected to exhibit certain physical characteristics.
ii) Users’ model: Analogous to the common physical characteristics of the different page types, there are expected to be common usage characteristics among different users.

Browsing Behavior Models
For the purposes of association rule discovery, it is really the content page references that are of interest. The other pages only facilitate browsing while a user searches for information, and are referred to as auxiliary pages.
Transactions can be defined in two ways using the concept of auxiliary and content page references.
An auxiliary-content transaction consists of all the auxiliary references up to and including each content reference for a given user. Mining these would give the common traversal paths through the web site to a given content page.
A content-only transaction consists of all the content references for a given user. Mining these would give associations between the content pages of a site, without any information as to the path taken between them.
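The two transaction definitions can be sketched in Python as follows. This is an illustrative sketch only: it assumes each page reference has already been labeled as "auxiliary" or "content" (the slides do not say how that labeling is done), and the function names and sample references are hypothetical.

```python
def auxiliary_content_transactions(references):
    """Split one user's ordered references into transactions, each holding
    the auxiliary references up to and including one content reference.

    `references` is a list of (page, kind) pairs, kind being "auxiliary"
    or "content". Trailing auxiliary references that never reach a content
    page are dropped in this sketch.
    """
    transactions, current = [], []
    for page, kind in references:
        current.append(page)
        if kind == "content":
            transactions.append(current)
            current = []
    return transactions


def content_only_transaction(references):
    """Keep only the content references, discarding the paths between them."""
    return [page for page, kind in references if kind == "content"]


# Hypothetical, pre-labeled references for a single user.
references = [
    ("A", "auxiliary"),
    ("B", "auxiliary"),
    ("F", "content"),
    ("A", "auxiliary"),
    ("D", "content"),
    ("C", "auxiliary"),
]

paths_to_content = auxiliary_content_transactions(references)
content_pages = content_only_transaction(references)
```

The first form preserves how the user reached each content page; the second keeps only which content pages co-occur.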

Preprocessing
Two of the biggest impediments to collecting reliable usage data are local caching and proxy servers.
In order to improve performance and minimize network traffic, most web browsers cache the pages that have been requested. As a result, when a user hits the ‘back’ button, the cached page is displayed and the web server is not aware of the repeat page access.
Proxy servers provide an intermediate level of caching and create even more problems with identifying site usage. In a web server log, all requests that arrive through a proxy server carry the same identifier, even though they potentially represent more than one user. Also, due to proxy-server-level caching, a single request to the server could actually be viewed by multiple users over an extended period of time.
Hence, to feed reliable and more accurate data to the mining algorithms, the following preprocessing tasks are performed on the web server log data:
-Data cleaning
-User identification
-Session identification
-Path completion

Data Cleaning
Techniques to clean a server log and eliminate irrelevant items are important for any type of web log analysis. The discovered associations or reported statistics are only useful if the data in the server log gives an accurate picture of user accesses to the web site.
Problem: The HTTP protocol requires a separate connection for every file that is requested from the web server. Therefore, a user’s request to view a particular page often results in several log entries, since graphics and scripts are downloaded in addition to the HTML file. In most cases, only the log entry of the HTML file request is relevant and should be kept for the user session file.
Solution: Items deemed irrelevant can reasonably be eliminated by checking the suffix of the URL name. All log entries with filename suffixes such as gif, jpeg, GIF, JPEG, JPG, jpg and map can be removed. However, the list can be modified depending on the site being analyzed.
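The suffix check described above could look roughly like the following Python sketch. The log-entry layout (a dict with a "url" key), the function names, and the sample entries are assumptions made for illustration; the suffix list is the one given on the slide, compared case-insensitively.

```python
import posixpath
from urllib.parse import urlsplit

# Suffixes from the slide; lower-casing the suffix covers GIF, JPEG and JPG.
# The set is meant to be adjusted for the site being analyzed.
IRRELEVANT_SUFFIXES = {"gif", "jpeg", "jpg", "map"}


def is_relevant(url, irrelevant=IRRELEVANT_SUFFIXES):
    """Return True unless the URL path ends in one of the irrelevant suffixes."""
    path = urlsplit(url).path  # drops any query string or fragment
    suffix = posixpath.splitext(path)[1].lstrip(".").lower()
    return suffix not in irrelevant


def clean_log(entries):
    """Keep only the log entries whose requested URL is considered relevant."""
    return [entry for entry in entries if is_relevant(entry["url"])]


# Hypothetical parsed log entries.
entries = [
    {"url": "/index.html"},
    {"url": "/img/logo.GIF"},
    {"url": "/pics/a.jpg?x=1"},
    {"url": "/nav.map"},
    {"url": "/cgi/search"},
]

cleaned = clean_log(entries)
```

A site whose main content is images would pass a different suffix set, in line with the slide's note that the list depends on the site.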

User Identification
User identification is the process of associating page references with individual users, even when several users share the same IP address.
Problem: This task is greatly complicated by the existence of local caches, corporate firewalls and proxy servers.
Solution: Even if the IP address is the same, if the agent log shows a change in browser software or operating system, a reasonable assumption is that each different agent type for an IP address represents a different user.
Solution: If a requested page is not directly reachable by a hyperlink from any of the pages already visited by the user, it is assumed that there is another user with the same IP address.
For the sample log, three unique users are identified, with browsing paths of A-B-F-O-G-A-D, A-B-C-J, and L-R, respectively.
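The two heuristics can be combined in a short Python sketch. It is only an illustration under stated assumptions: the entry fields ("ip", "agent", "url"), the `links` topology map, and the toy log below are hypothetical, and the toy log is not the sample log referred to on the slide.

```python
def identify_users(entries, links):
    """Group time-ordered log entries into presumed users.

    Heuristic 1: a different agent string for the same IP means a different user.
    Heuristic 2: a page that is neither already visited by, nor linked from any
    page visited by, a matching user is attributed to a new user.

    `links` maps each page to the set of pages it links to (site topology).
    """
    users = []  # each user: {"ip": ..., "agent": ..., "path": [...]}
    for entry in entries:
        chosen = None
        for user in users:
            if user["ip"] != entry["ip"] or user["agent"] != entry["agent"]:
                continue
            visited = user["path"]
            reachable = any(entry["url"] in links.get(p, set()) for p in visited)
            if entry["url"] in visited or reachable:
                chosen = user
                break
        if chosen is None:
            chosen = {"ip": entry["ip"], "agent": entry["agent"], "path": []}
            users.append(chosen)
        chosen["path"].append(entry["url"])
    return users


# Hypothetical topology and log (all requests share one IP address).
links = {"A": {"B", "D"}, "B": {"C"}, "L": {"R"}}
log = [
    {"ip": "10.0.0.1", "agent": "X", "url": "A"},
    {"ip": "10.0.0.1", "agent": "X", "url": "B"},
    {"ip": "10.0.0.1", "agent": "Y", "url": "A"},
    {"ip": "10.0.0.1", "agent": "X", "url": "L"},
    {"ip": "10.0.0.1", "agent": "X", "url": "C"},
    {"ip": "10.0.0.1", "agent": "X", "url": "R"},
    {"ip": "10.0.0.1", "agent": "Y", "url": "D"},
]

paths = [user["path"] for user in identify_users(log, links)]
```

In this toy log the agent change separates the second user, and the request for L (not reachable from A or B) starts a third user even though the IP and agent match the first.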

Session Identification & Path Completion
Session identification takes all of the page references for a given user in a log and breaks them up into user sessions.
Problem: For logs that span long periods of time, it is very likely that users will visit the web site more than once. The goal of session identification is to divide the page accesses of each user into individual sessions.
Solution: The simplest method of achieving this is a timeout: if the time between page requests exceeds a certain limit (a timeout of 25.5 minutes was established based on empirical data), it is assumed that the user is starting a new session.
Path completion fills in page references that are missing due to browser and proxy server caching.
Problem: Important accesses that are not recorded in the access log need to be identified.
Solution: If a page request is made that is not directly linked to the last page a user requested, the referrer log can be checked to see what page the request came from. If the page is in the user’s recent history, the assumption is that the user backtracked with the ‘back’ button, calling up cached versions of the pages until a new page was requested. If the referrer log is not clear, the site topology can be used to the same effect.
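Both steps can be sketched in Python. This is a simplified illustration, not the paper's implementation: `split_sessions` applies the 25.5-minute timeout to one user's requests, and `complete_path` uses only the site-topology fallback (no referrer log) to re-insert pages presumed to have been served from cache during backtracking. The function names, the `links` map, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=25.5)  # timeout value quoted on the slide


def split_sessions(requests, timeout=SESSION_TIMEOUT):
    """Split one user's time-ordered (timestamp, page) requests into sessions.
    A gap longer than `timeout` between consecutive requests starts a new one."""
    sessions = []
    previous = None
    for timestamp, page in requests:
        if previous is None or timestamp - previous > timeout:
            sessions.append([])
        sessions[-1].append(page)
        previous = timestamp
    return sessions


def complete_path(pages, links):
    """Insert presumed 'back' button views using the site topology only.

    If a page is not linked from the previous page, walk back through the
    history to the most recent page that does link to it and add the pages
    passed on the way back. If none links to it (for example a typed URL),
    nothing is inserted.
    """
    completed = []
    for page in pages:
        if completed and page not in links.get(completed[-1], set()):
            history = list(completed)
            for i in range(len(history) - 2, -1, -1):
                if page in links.get(history[i], set()):
                    completed.extend(reversed(history[i:-1]))
                    break
        completed.append(page)
    return completed


# Hypothetical data for a single user.
start = datetime(2000, 1, 1, 10, 0)
requests = [
    (start, "A"),
    (start + timedelta(minutes=10), "B"),
    (start + timedelta(minutes=40), "C"),  # 30-minute gap: new session
]
links = {"A": {"B", "D"}, "B": {"C"}}

sessions = split_sessions(requests)
full_path = complete_path(["A", "B", "C", "D"], links)
```

In the path example, D is linked only from A, so the sketch assumes the user went back from C through B to A before requesting D.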

Advantages
The preprocessing tasks described in this paper have several advantages over current methods of collecting information, such as the use of cookies and cache busting.
Cache busting is the practice of preventing browsers from using stored local versions of a page, forcing a new download of the page from the server every time it is viewed.
Cache busting defeats the speed advantage that caching was created to provide.
Cookies can be deleted or disabled by the user.
These problems can be overcome by applying the preprocessing tasks described here to the web server log.

Disadvantages
Two users with the same IP address who use the same browser on the same type of machine can easily be mistaken for a single user if they are looking at the same set of pages.
A single user who has two different browsers running, or who types in URLs directly without using the site’s link structure, can be mistaken for multiple users.
While computing missing page references, we can be misled by the fact that the user might have known the URL for a page and typed it in directly. (It is assumed that this does not occur often enough to affect the mining algorithms.)

Conclusion
This paper presents several data preparation techniques that can be used to convert raw web server logs into user session files in order to perform web usage mining.
The specific contributions include:
i) development of models to encode both the web site developer’s and the users’ view of how a web site should be used
ii) discussion of heuristics that can be used to identify web site users, user sessions, and page accesses that are missing from a web server log
According to the authors, future work includes tests to verify the browsing behavior model discussed.