WIRED - Web Analytics Week

Presentation transcript:

WIRED - Web Analytics Week
WIRED System Evaluations due now
Web Logs overview
Web Analytics
  - Understanding Queries
  - Tracking Users
Web Log Reliability
Web Log Data Mining & KDD

Web Analytics
Evaluation of Web Information Retrieval (& Web Information Seeking)
What can we learn?
  - IR systems use
  - Web server administration
Who are the users?
  - Types of users
  - User situations
How does it affect or help IR?

Web Server Overview
Any application that can serve files using the HTTP protocol
  - Text, HTML, XHTML, XML…
  - Graphics
  - CGI, applets, servlets
  - Other media & MIME types
Apache or MS IIS, which primarily serve Web pages
Servers create ASCII text log files showing:
  - Date, time, bytes transferred, (cache status)
  - Status/error codes, user IP address, (domain name)
  - Server method, URI, misc comments

Web Log Overview
Access Log
  - Logs information such as page served or time served
Referer Log
  - Logs name of the server and page that links to current served page
  - Not always
  - Can be from any Web site
Agent Log
  - Logs browser type and operating system
  - Mozilla
  - Windows

What can we learn from Web logs?
Every time a Web browser requests a file, it gets logged
  - Where the user came from
  - What kind of browser was used to access the server
  - Referring URL
Every time a page gets served, it gets logged
  - Request time, serve time, bytes transferred, URI, status code

Web Log Analysis in Action
UT Web log reports
(Figures in parentheses refer to the 7 days to 28-Mar :00).
Successful requests: 39,826,634 (39,596,364)
Average successful requests per day: 5,690,083 (5,656,623)
Successful requests for pages: 4,189,081 (4,154,717)
Average successful requests for pages per day: 598,499 (593,530)
Failed requests: 442,129 (439,467)
Redirected requests: 1,101,849 (1,093,606)
Distinct files requested: 479,022 (473,341)
Corrupt logfile lines: 427
Data transferred: Gbytes ( Gbytes)
Average data transferred per day: Gbytes ( Gbytes)
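A report like the one above is built by tallying logged status codes. The sketch below is a rough illustration only. The grouping (2xx successful, 3xx redirected, 4xx/5xx failed) is an assumption, and report tools differ on details such as how 304 Not Modified is counted. The status list is invented sample data, not the UT logs.

```python
from collections import Counter

def classify(status):
    """Group an HTTP status code into a report category (assumed grouping)."""
    if 200 <= status < 300:
        return "successful"
    if 300 <= status < 400:
        return "redirected"
    if 400 <= status < 600:
        return "failed"
    return "other"

# Invented status codes standing in for one reporting period of log lines.
statuses = [200, 200, 304, 404, 200, 301, 500, 200]
days = 7  # length of the reporting period, as in a weekly report

summary = Counter(classify(s) for s in statuses)
avg_success_per_day = summary["successful"] / days

for category, count in summary.items():
    print(f"{category} requests: {count}")
print(f"Average successful requests per day: {avg_success_per_day:.2f}")
```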

Problems with Web Servers
Actual user or intent not known
Paths difficult to determine
Infrequent access challenging to uncover
No State Information
Server Hits not Representative
  - Counters inaccurate
DOS, Floods, Bandwidth can Stop “intended” usage
Robots, etc.
ISP Proxy servers
“5.3 Unsound inferences from data that is logged” Haigh & Megarity, 1998.

Web Server Configuration
Unique file & directory names = “at a glance analysis”
Hierarchical directory structure
Redirect CGI to find referrer
Use a database
  - store web content
  - record usage data with context of content logged
Create state information with programming
  - Servlets, ActiveX, Javascript
  - Custom server or log format
Log rollover, report frequency, special case testing

Log File Format
Extended Log File Format
  - W3C Working Draft WD-logfile
[24/Jul/1998:00:00: ] "GET /10/3/a3-160-e.html HTTP/1.0" " bnc.ca/wbin/resanet/itemdisp/l=0/d=1/r=1/e=0/h=10/i= " "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"
Every server generates slightly different logs
  - Versions & operating system issues
  - Admin tweaks to log formats
Extended Log Format most common
  - WWW Consortium Standards (= apache)
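A log line can be split into fields with a regular expression. This is a minimal sketch that assumes the common NCSA "combined" layout (host, ident, user, timestamp, request, status, bytes, referer, user agent). Real servers may use a different or customized format, so the pattern would need adjusting. The sample line is invented for illustration (the address and URLs are placeholders), and `parse_line` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status bytes "referer" "agent"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None for a line that does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None  # would be counted as a corrupt logfile line
    rec = match.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    parts = rec["request"].split()
    rec["method"], rec["uri"] = (parts[0], parts[1]) if len(parts) >= 2 else (None, None)
    return rec

# Invented sample line (placeholder host and referer).
sample = ('192.0.2.10 - - [24/Jul/1998:00:00:01 -0400] '
          '"GET /index.html HTTP/1.0" 200 1043 '
          '"http://example.com/start.html" '
          '"Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"')

record = parse_line(sample)
print(record["method"], record["uri"], record["status"], record["size"])
```

Lines that fail to match return `None`, which is one way a report could arrive at a "corrupt logfile lines" count.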

Let’s Look at some logs
monthly.html
weekly.html

Log Analysis Tools
Analog
Webalizer
Sawmill
WebTrends
AWStats
WWWStat
GetStats
Perl Scripts
Data Mining & Business Intelligence tools

WebTrends
A whole industry of analytics
Most popular commercial application

Measuring Web Site Usage
Now that the Web is a primary source, understanding its use is critical
Few external cues that the Web site is being used
What - pages and their content/subject
How - browsers
Who - userid or IP
When - trends, daily, weekly, yearly
Where - the user is and what page they came from
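The what/how/who/when/where questions above map onto simple counts over parsed log records. The sketch below uses invented records with assumed field names (`ip`, `uri`, `agent`, `referer`, `time`). Note that an IP address is only a stand-in for a user, since proxies and shared machines blur the "who".

```python
from collections import Counter
from datetime import datetime

# Invented example hits; a real analysis would read these from parsed log lines.
hits = [
    {"ip": "192.0.2.1", "uri": "/index.html", "agent": "MSIE 3.01",
     "referer": "http://example.org/links.html", "time": datetime(2004, 3, 22, 9, 15)},
    {"ip": "192.0.2.1", "uri": "/courses.html", "agent": "MSIE 3.01",
     "referer": "/index.html", "time": datetime(2004, 3, 22, 9, 17)},
    {"ip": "192.0.2.7", "uri": "/index.html", "agent": "Mozilla/4.0",
     "referer": "-", "time": datetime(2004, 3, 22, 14, 2)},
    {"ip": "192.0.2.9", "uri": "/index.html", "agent": "Mozilla/4.0",
     "referer": "http://example.org/links.html", "time": datetime(2004, 3, 23, 9, 40)},
]

what = Counter(h["uri"] for h in hits)            # pages requested
how = Counter(h["agent"] for h in hits)           # browsers
who = Counter(h["ip"] for h in hits)              # hosts (not people)
when = Counter(h["time"].hour for h in hits)      # requests per hour of day
where = Counter(h["referer"] for h in hits if h["referer"] != "-")  # referring pages

print("Top page:", what.most_common(1))
print("Distinct hosts:", len(who))
print("Busiest hour:", when.most_common(1))
```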

What can’t you measure?
Who the user is
  - Always
  - If the user’s needs have changed
If they’re using the information
  - Browsing vs. Reading vs. Acting on the information
Changes to site and how they affect each user
Pages not used at all - and why

Analysis of a Very Large Search Log
What kinds of patterns can we find?
Request = query and results page
280 GB – Six Weeks of Web Queries
  - Almost 1 Billion Search Requests, 850K valid, 575K queries
  - Million User Sessions (cookie issues)
  - Large volume, less trendy
  - Why are unique queries important?
Web Users:
  - Use Short Queries in short sessions % one request
  - Mostly Look at the First Ten Results only
  - Seldom Modify Queries
Traditional IR Isn’t Accurately Describing Web Search
Phrase Searching Could Be Augmented
Silverstein, Henzinger, Marais, Moricz (1998)

Analysis of a Very Large Search Log
2.35 Average Terms Per Query
  - 0 = 20.6% (?)
  - 1 = 25.8%
  - 2 = 26.0%
  - 0 to 2 combined = 72.4%
Operators Per Query
  - 0 = 79.6%
Terms Predictable
First Set of Results Viewed Only = 85%
Some (Single Term Phrase) Query Correlation
  - Augmentation
  - Taxonomy Input
  - Robots vs. Humans
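Figures such as average terms per query and the share of queries at each length come from a straightforward count. This is a minimal sketch on a handful of invented queries. It splits terms on whitespace, which is an assumption; the cited study may have tokenized differently. `term_stats` is a hypothetical helper name.

```python
from collections import Counter

def term_stats(queries):
    """Return (mean terms per query, {term count: share of queries})."""
    counts = [len(q.split()) for q in queries]  # whitespace tokenization (assumed)
    total = len(counts)
    mean = sum(counts) / total
    share = {n: c / total for n, c in sorted(Counter(counts).items())}
    return mean, share

# Invented sample queries, not data from the study.
queries = [
    "weather",
    "austin weather",
    "cheap flights austin",
    "maps",
    "ut library hours",
]

mean_terms, share_by_length = term_stats(queries)
print(f"Average terms per query: {mean_terms:.2f}")
for n, share in share_by_length.items():
    print(f"{n} = {share:.1%}")
```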

Web Analytics and IR?
Knowing access patterns of users
Lists of search terms
  - Numbers of words
  - Words, concepts to add (synonyms)
  - Types of queries
Success of searching a site
  - Was a result link clicked on?
  - How many pages per user after a search?
Is a new or better search interface needed?

Real Life Information Retrieval
51K Queries from Excite (1997)
Search Terms = 2.21
Number of Terms
  - 1 = 31%
  - 2 = 31%
  - 3 = 18%
  - (80% Combined)
Logic & Modifiers (by User)
  - Infrequent
  - AND, “+”, “-”
Logic & Modifiers (by Query)
  - 6% of Users
  - Less Than 10% of Queries
  - Lots of Mistakes
Uniqueness of Queries
  - 35% successive
  - 22% modified
  - 43% identical
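The share of queries using logic or modifiers can be estimated by scanning each query for operator tokens. The sketch below is illustrative only. It looks for AND, “+” and “-” as named above, and additionally treats OR, NOT and quoted phrases as operators, which is an assumption. The queries are invented and `uses_operator` is a hypothetical helper name.

```python
BOOLEAN_WORDS = {"AND", "OR", "NOT"}  # OR and NOT are assumed additions

def uses_operator(query):
    """True if any token looks like a Boolean word, a +/- modifier, or a quoted phrase."""
    for token in query.split():
        if token in BOOLEAN_WORDS:
            return True
        if len(token) > 1 and token[0] in "+-":
            return True
        if '"' in token:
            return True
    return False

# Invented sample queries.
queries = [
    "cheap flights",
    "jaguar -car",
    "austin AND weather",
    '"web mining"',
    "maps",
]

with_operators = [q for q in queries if uses_operator(q)]
operator_share = len(with_operators) / len(queries)
print(f"Queries with operators: {operator_share:.0%}")
```

A lowercase "and" is deliberately not counted here, since many engines of that period required uppercase operators; that choice is also an assumption.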

Real Life Information Retrieval
Queries per user 2.8
Sessions
  - Flawed Analysis (User ID)
  - Some Revisits to Query (Result Page Revisits)
Page Views
  - Accurate, but not by User
Use of Relevance Feedback (more like this)
  - Not Used Much (~11%)
Terms Used Typical & frequent
Mistakes
  - Typos
  - Misspellings
  - Bad (Advanced) Query Formulation
Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)

KDD for Extracting Knowledge
Knowledge extraction, information discovery, information extraction, data archeology, data pattern processing, OLAP, HV statistical analysis
Sounds as if “knowledge” is there to be found.
User and usage context help find the knowledge
Hypothesis before analysis
Why KDD, why now?
  - Data storage, analysis costs
  - Visualization

KDD Process
Database for structured data and queries
  - How structured, algorithms for queries
  - How results can be understood and visualized
  - Iterative & Interactive, hypothesis driven & hypothesis generating

KDD Efforts
Data Cleaning
Formulating the Questions
“Finding useful features to represent the data” p30
Models:
  - Classification to fit data into pre-defined classes
  - Regressions to fit predictions & values
  - Clustering to class sets found in data
  - Summarization to briefly describe data
  - Dependency discovery of variable relationships
  - Sequence analysis for time or interaction patterns

Data Prep for Mining the WWW
Processing the data before mining
WEBMINER system
  - site topology
  - Cleaning
  - User identification
  - Session identification (episodes)
  - Path completion
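The cleaning, user identification and session identification steps listed above can be sketched with simple heuristics. This is not the WEBMINER implementation. It assumes that image/style requests and agents containing "bot" are noise, that a user is an (IP, agent) pair, and that a gap of more than 30 minutes starts a new session. All three rules and the sample records are illustrative assumptions. Path completion is not shown.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)                  # assumed threshold
SKIP_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed non-page files

def clean(records):
    """Drop embedded-file requests and obvious robots (heuristic)."""
    return [r for r in records
            if not r["uri"].lower().endswith(SKIP_SUFFIXES)
            and "bot" not in r["agent"].lower()]

def sessionize(records):
    """Group records per (ip, agent) user, splitting on long time gaps."""
    sessions = {}
    for r in sorted(records, key=lambda rec: rec["time"]):
        user = (r["ip"], r["agent"])                     # heuristic user identity
        user_sessions = sessions.setdefault(user, [])
        last = user_sessions[-1][-1] if user_sessions else None
        if last is not None and r["time"] - last["time"] <= SESSION_TIMEOUT:
            user_sessions[-1].append(r)
        else:
            user_sessions.append([r])
    return sessions

t0 = datetime(2004, 3, 28, 10, 0)
records = [  # invented sample data
    {"ip": "192.0.2.1", "agent": "Mozilla/4.0", "uri": "/index.html", "time": t0},
    {"ip": "192.0.2.1", "agent": "Mozilla/4.0", "uri": "/logo.gif",
     "time": t0 + timedelta(seconds=1)},
    {"ip": "192.0.2.1", "agent": "Mozilla/4.0", "uri": "/courses.html",
     "time": t0 + timedelta(minutes=5)},
    {"ip": "192.0.2.1", "agent": "Mozilla/4.0", "uri": "/index.html",
     "time": t0 + timedelta(hours=2)},
    {"ip": "192.0.2.5", "agent": "ExampleBot/1.0", "uri": "/index.html",
     "time": t0 + timedelta(minutes=1)},
    {"ip": "192.0.2.8", "agent": "Mozilla/4.0", "uri": "/about.html",
     "time": t0 + timedelta(minutes=2)},
]

cleaned = clean(records)
sessions = sessionize(cleaned)
for user, user_sessions in sessions.items():
    print(user[0], [[r["uri"] for r in s] for s in user_sessions])
```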

Web Usage Mining
VL Verification
Data Mining to Discover Patterns of Use
  - Pre-Processing
  - Pattern Discovery
  - Pattern Analysis
Site Analysis, Not User Analysis
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.N

Web Usage Discovery
  - Content: Text, Graphics, Features
  - Structure: Content Organization, Templates and Tags
  - Usage: Patterns, Page References, Dates and Times
  - User Profile: Demographics, Customer Information

Web Usage Collection
Types of Data
  - Web Servers
  - Proxies
  - Web Clients
Data Abstractions
  - Sessions
  - Episodes
  - Clickstreams
  - Page Views
The Tools for Web Use Verification

Web Usage Preprocessing
Usage Preprocessing
  - Understanding the Web Use Activities of the Site
  - Extract from Logs
Content Preprocessing
  - Converting Content Into Formats for Processing
  - Understanding Content (Working with Dev Team)
Structure Preprocessing
  - Mining Links and Navigation from Site
  - Understanding Page Content and Link Structures

Web Usage Pattern Discovery
Clustering for Similarities
  - Pages
  - Users
  - Links
Classification
  - Mapping Data to Pre-defined Classes
  - Rule Discovery
  - Rule Rules
  - Computation Intensive
  - Many Paths to Similar Answers
Pattern Detection
  - Ordering By Time
  - Predicting Use With Time
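As one very simple instance of time-ordered pattern detection, the sketch below counts page-to-page transitions within sessions and uses them to guess the most likely next page. This is a plain first-order transition count chosen for illustration, not the method of the cited authors. The sessions are invented, and `transition_counts` and `predict_next` are hypothetical helper names.

```python
from collections import Counter, defaultdict

def transition_counts(sessions):
    """Count how often each page is immediately followed by each other page."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, page):
    """Return (most frequent next page, observed share), or None if unseen."""
    followers = counts.get(page)
    if not followers:
        return None
    following, n = followers.most_common(1)[0]
    return following, n / sum(followers.values())

# Invented sessions: each is the time-ordered list of pages one visitor viewed.
sessions = [
    ["/index.html", "/courses.html", "/syllabus.html"],
    ["/index.html", "/courses.html"],
    ["/index.html", "/about.html"],
]

counts = transition_counts(sessions)
prediction = predict_next(counts, "/index.html")
print(prediction)
```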

Web Usage Mining as Evaluation?
Mining Goals
  - Improved Design
  - Improved Delivery
  - Improved Content
Personalization (XMod Data)
System Improvement (Tech Data)
Site Modification (IA Data)
Business Intelligence (Market Data)
Usage Characterization (User Behavior Data)

Web Analytics Wrap-up
What can we learn about users?
What can we learn about services?
How can we help users improve their use?
How can IR models benefit from this analysis?
What kind of improvements in Web IR systems and their interfaces can be taken from this?