Web Servers & Log Analysis


1 Web Servers & Log Analysis What can we learn from looking at Web server logs? - What server resources were requested - When the files were requested - Who requested them (where IP address = who) - How they requested them (browser types & OS) Some assumptions - A request for a resource means the user did receive it - A resource is viewable & understandable to each user - Users are identified within a loose set of parameters How does knowing request patterns affect or help IA?

2 Types of Web Server Logs Proxy-based - Web access servers to control access or cache popular files Client-based - Local cache files - Browser History file(s) Network-based - Routers, firewalls & access points Server-based - Web servers to serve content

3 Using Web Servers The Apache Software Foundation Microsoft Internet Information Server (Services) These applications “Serve” - Text - HTML, XML, plain text - Graphics - jpeg, gif, png - CGI, servlets, XMLHttpRequest & other logic - other MIME types such as movies & sound Most servers can log these files - Daily, weekly or monthly - Can not always log CGI or related logic (specifically or “out of the box”)

4 How Servers Work Hypertext Transfer Protocol - http 1. A file is requested from the browser 2. The request is transferred via the network 3. The server receives the request (& logs it) 4. The server provides the file (& logs it) 5. The browser displays the file Almost all Web servers work this way

5 Types of Server Logs Access Log - Logs information such as page served or time served Referer Log - Logs name of the server and page that links to current served page - Not always - Can be from any Web site Agent Log - Logs browser type and operating system Mozilla Windows

6 Log File Format Extended Log File Format - W3C Working Draft WD-logfile-960323 key advantage: - computer storage cost decreases while paper cost rises every server generates slightly different logs

7 Extended Log File Formats WWW Consortium Standards Will automatically record much of what is programmatically done now. - faster - more accurate - standard baselines for comparison - graphics standards

8 What is a log file? A delimited text file with information about what the server is doing - IP Address or Domain name - Date/Time - Method used & Page Requested - Protocol, Response Code & Bytes Returned - Referring Page (sometimes) - UserAgent & Operating System Sample entry:
p0016c74ea.us.kpmg.com - - [01/Sep/2004:08:17:21 -0500] "GET /images/sanchez.jpg HTTP/1.1" 200 - "http://www.ischool.utexas.edu/research/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"
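The fields listed on this slide can be pulled out of the sample entry with a regular expression. The following is a minimal sketch, assuming the Apache "combined" layout that the sample line appears to follow; the names `LOG_PATTERN` and `parse_line` are hypothetical, and a "-" in the bytes field is treated as zero.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "method path protocol" status bytes
# optionally followed by "referrer" "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # would be counted as a corrupt logfile line
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # "-" means no byte count was logged; treat it as 0 here.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = (
    'p0016c74ea.us.kpmg.com - - [01/Sep/2004:08:17:21 -0500] '
    '"GET /images/sanchez.jpg HTTP/1.1" 200 - '
    '"http://www.ischool.utexas.edu/research/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"'
)
record = parse_line(sample)
print(record["host"], record["path"], record["status"])
```

Lines that return None here correspond roughly to what analysis tools report as corrupt logfile lines.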

9 In search of Reliable Data Not as Foolproof as Paper - You can see when someone is reading a page - You can know the page is turned - You can know the book is checked out No State Information - The same person or another person could be reading pages 1 then page 2 - You really can’t tell how many users you have Server Hits not perfectly Representative - Counters inaccurate - Caching & Robots can influence + & - Floods/Bandwidth can Stop “intended” usage

10 What is a “hit”? Technically, a hit is simply any file requested from the server - That is logged - That represents (usually) part of a request to “see” a whole Web page Hits combine to represent a “page view” Page views combine to represent an “episode” or “session” - Episode is one activity or question a user performs or requests on a Web site - Session is a series of episodes that embodies all the interactions a user undertakes using a Web site (per time, based on averages around 30 min.)
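The step from hits to page views to sessions can be sketched in a few lines. This is an illustration only, assuming that a page view is any request whose path ends in .html, .htm or "/", that a host name stands in for a user, and that a gap of more than 30 minutes starts a new session; `is_page_view` and `sessionize` are hypothetical names, and the hits are made-up sample data.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)   # assumed cutoff, per the slide's 30 min. average
PAGE_SUFFIXES = (".html", ".htm", "/")    # assumption: what counts as a "page"

def is_page_view(path):
    """A hit counts as a page view only if it looks like a page, not an image."""
    return path.endswith(PAGE_SUFFIXES)

def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group page views into sessions per host.

    hits: iterable of (host, datetime, path) tuples.
    Returns {host: [[path, ...], ...]}, one inner list per session.
    """
    sessions = {}
    last_seen = {}
    for host, when, path in sorted(hits, key=lambda h: (h[0], h[1])):
        if not is_page_view(path):
            continue
        if host not in last_seen or when - last_seen[host] > timeout:
            sessions.setdefault(host, []).append([])  # start a new session
        sessions[host][-1].append(path)
        last_seen[host] = when
    return sessions

hits = [
    ("10.0.0.1", datetime(2004, 9, 1, 8, 0, 0), "/index.html"),
    ("10.0.0.1", datetime(2004, 9, 1, 8, 0, 1), "/images/logo.gif"),
    ("10.0.0.1", datetime(2004, 9, 1, 8, 5, 0), "/research/"),
    ("10.0.0.1", datetime(2004, 9, 1, 9, 0, 0), "/index.html"),
    ("10.0.0.2", datetime(2004, 9, 1, 8, 10, 0), "/people.html"),
]
page_views = [h for h in hits if is_page_view(h[2])]
sessions = sessionize(hits)
print(len(hits), "hits,", len(page_views), "page views")
```

Because logs carry no state information, the host-as-user assumption is the weak point: several people behind one proxy would be merged into a single "user".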

11 Making Servers More Reliable Keep system setups simple - unique file and directory names - clear, consistent structure Configure CMS for logging/serving Use an FTP server for file transfer - frees up logs and server! Judicious use of links Wise MIME types - some hard/impossible to log

12 Clever Web Server Setup Redirect CGI to find referrer Use a database - store web content - record usage data create state information with programming - NSAPI - ActiveX Have contact information Have purpose statements

13 Managing Log Files Backup Store Results or Logs? Beginning New Logs Posting Results

14 Log Analysis Tools Analog Webalizer Sawmill WebTrends AWStats WWWStat GetStats Perl Scripts Data Mining & Business Intelligence tools

15 WebTrends A whole industry of analytics Most popular commercial application

16 Log Analysis Cumulative Sample Program started at Tue-03-Dec-2005 01:20 local time. Analysed requests from Thu-28-Jul-2004 20:31 to Mon-02-Dec-1996 23:59 (858.1 days). Total successful requests: 4 282 156 (88 952) Average successful requests per day: 4 990 (12 707) Total successful requests for pages: 1 058 526 (17 492) Total failed requests: 88 633 (1 649) Total redirected requests: 14 457 (197) Number of distinct files requested: 9 638 (2 268) Number of distinct hosts served: 311 878 (11 284) Number of new hosts served in last 7 days: 7 020 Corrupt logfile lines: 262 Unwanted logfile entries: 976 Total data transferred: 23 953 Mbytes (510 619 kbytes) Average data transferred per day: 28 582 kbytes (72 946 kbytes)
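Totals like the ones in this report can be derived from response codes. The sketch below is not how any particular tool computes them; it assumes that 2xx and 304 responses count as successful, other 3xx as redirected, and 4xx/5xx as failed. The names `classify` and `summarize` and the records are hypothetical.

```python
from collections import Counter

def classify(status):
    """Map an HTTP status code to a report category (assumed grouping)."""
    if 200 <= status < 300 or status == 304:
        return "successful"
    if 300 <= status < 400:
        return "redirected"
    if 400 <= status < 600:
        return "failed"
    return "other"

def summarize(records):
    """records: iterable of (path, status, bytes_sent) tuples."""
    records = list(records)
    counts = Counter(classify(status) for _, status, _ in records)
    successful_paths = {
        path for path, status, _ in records if classify(status) == "successful"
    }
    return {
        "successful": counts["successful"],
        "redirected": counts["redirected"],
        "failed": counts["failed"],
        "distinct_files": len(successful_paths),
        "bytes": sum(size for _, _, size in records),
    }

# Made-up records for illustration only.
records = [
    ("/index.html", 200, 5120),
    ("/images/logo.gif", 200, 2048),
    ("/index.html", 304, 0),
    ("/old.html", 301, 0),
    ("/missing.html", 404, 0),
]
summary = summarize(records)
print(summary)
```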

17 How about the iSchool Web site? Our server files are collected constantly - Daily - Weekly - Monthly - Even yearly What does a quick look tell us? - How well is the server working? Uptime, server errors, logging errors - How popular is our site? Number of hits, popular files - Who is visiting the site? Countries, types of companies - What searches led people here?
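The last question, what searches led people here, can be approached through the referrer field when it carries a search engine's results URL. A minimal sketch, assuming the search terms sit in a query parameter named q, p or query (an assumption that varies by engine); `search_terms` is a hypothetical helper and the referrers are invented examples.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Assumed parameter names; real search engines differ and change over time.
SEARCH_PARAMS = ("q", "p", "query")

def search_terms(referrer):
    """Return the search phrase embedded in a referrer URL, or None."""
    query = parse_qs(urlparse(referrer).query)
    for name in SEARCH_PARAMS:
        if name in query:
            return query[name][0]
    return None

referrers = [
    "http://www.google.com/search?q=web+log+analysis&hl=en",
    "http://search.yahoo.com/search?p=ischool+texas",
    "http://www.ischool.utexas.edu/research/",
    "http://www.google.com/search?q=web+log+analysis",
    "-",
]
phrases = Counter(t for t in map(search_terms, referrers) if t is not None)
print(phrases.most_common())
```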

18 UT & its Web server logs UT Web log reports (Figures in parentheses refer to the 7 days to 28-Mar-2004 03:00). Successful requests: 39,826,634 (39,596,364) Average successful requests per day: 5,690,083 (5,656,623) Successful requests for pages: 4,189,081 (4,154,717) Average successful requests for pages per day: 598,499 (593,530) Failed requests: 442,129 (439,467) Redirected requests: 1,101,849 (1,093,606) Distinct files requested: 479,022 (473,341) Corrupt logfile lines: 427 Data transferred: 278.504 Gbytes (276.650 Gbytes) Average data transferred per day: 39.790 Gbytes (39.521 Gbytes)

19 Neat Analysis Tricks use a search engine to find references - “link:www.ischool.utexas.edu/~donturn” key to using unique names - use many engines update times different blocking mechanisms are different use Web searches (or Yahoo, Bloglines…) - look for references - look for IP addresses of users

20 Neat Tricks, cont. Walking up the Links - follow URLs upward Reverse Sort - look for relations Use your own robot to index - Test

21 Web Surveys, an alternative Surveys actually ask users what they did, what they sought & if it helped GVU, Nielsen and GNN - Qualitative questions phone web forms - Self-selected sample problems random selection oversample

22 Analysis of a Very Large Search Log What kinds of patterns can we find? Request = query and results page 280 GB – Six Weeks of Web Queries - Almost 1 Billion Search Requests, 850K valid, 575K queries - 285 Million User Sessions (cookie issues) - Large volume, less trendy - Why are unique queries important? Web Users: - Use Short Queries in short sessions - 63.7% one request - Mostly Look at the First Ten Results only - Seldom Modify Queries Traditional IR Isn’t Accurately Describing Web Search Phrase Searching Could Be Augmented Silverstein, Henzinger, Marais, Moricz (1998)

23 Analysis of a Very Large Search Log 2.35 Average Terms Per Query - 0 = 20.6% (?) - 1 = 25.8% - 2 = 26.0% (72.4% Combined) Operators Per Query - 0 = 79.6% Terms Predictable First Set of Results Viewed Only = 85% Some (Single Term Phrase) Query Correlation - Augmentation - Taxonomy Input - Robots vs. Humans
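Figures such as average terms per query and the share of 0-, 1- and 2-term queries come from a simple count over the query log. The sketch below uses invented queries, not data from the study, and assumes terms are separated by whitespace; `terms_per_query` is a hypothetical name.

```python
from collections import Counter

def terms_per_query(queries):
    """Return (mean term count, {term count: share of queries})."""
    lengths = [len(q.split()) for q in queries]
    total = len(lengths)
    distribution = {n: c / total for n, c in sorted(Counter(lengths).items())}
    return sum(lengths) / total, distribution

# Made-up queries; an empty string stands for a request with no terms.
queries = ["", "weather", "austin texas", "web server logs", "ut"]
mean_terms, distribution = terms_per_query(queries)
print(round(mean_terms, 2), distribution)
```

Empty queries are kept in the count here, which is one plausible reading of the 0-term row on the slide.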

24 Real Life Information Retrieval 51K Queries from Excite (1997) Search Terms = 2.21 Number of Terms - 1 = 31% 2 = 31% 3 = 18% (80% Combined) Logic & Modifiers (by User) - Infrequent - AND, “+”, “-” Logic & Modifiers (by Query) - 6% of Users - Less Than 10% of Queries - Lots of Mistakes Uniqueness of Queries - 35% successive - 22% modified - 43% identical

25 Real Life Information Retrieval Queries per user 2.8 Sessions - Flawed Analysis (User ID) - Some Revisits to Query (Result Page Revisits) Page Views - Accurate, but not by User Use of Relevance Feedback (more like this) - Not Used Much (~11%) Terms Used Typical & frequent Mistakes - Typos - Misspellings - Bad (Advanced) Query Formulation Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)

26 Downie & Web Usage Server logs are like library usage User-based analyses - who - where - what File-based analyses - amount Request analyses - conform (loosely) to Zipf’s Law Byte-based analyses
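One way to check whether request counts loosely follow Zipf's Law is to rank files by popularity and fit a line to log(frequency) against log(rank); a slope near -1 is the classic Zipf shape. The sketch below uses invented counts chosen to fit exactly, and `zipf_slope` is a hypothetical helper using an ordinary least-squares fit.

```python
import math

def zipf_slope(counts):
    """Least-squares slope of log(frequency) vs. log(rank).

    counts: {file path: number of requests}; needs at least two files.
    """
    ranked = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Made-up request counts that follow frequency = 1200 / rank exactly.
request_counts = {
    "/index.html": 1200,
    "/research/": 600,
    "/people.html": 400,
    "/courses/": 300,
    "/contact.html": 240,
}
slope = zipf_slope(request_counts)
print(round(slope, 3))
```

Real logs will only conform loosely, so the slope is best read as a rough indicator rather than a test.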

27 Web use analysis & IA? Another tool to begin to understand how people use your Web provided resources With a small amount of setup, you can learn a large amount Server use can be integrated into site usage for users - Lists of popular pages & more interlinking pages - Adding search terms that found the page to related pages - Adjust metadata to reflect searches that find pages - Add pages to the site index or site map First-cut usability information - Pages 1 & 2 were accessed, but not 3 - Why? - Navigation usage, link ordering and design understanding - Knowing what browsers & OS helps tailor design and media types

28 BREAK! No Presentation this week - Next week: Asset management, content management & version control Break up media development work Examine current pages, style sheets & designs Set up next set of pair & individual deliverables

29 Media Development work We need to find & create graphics for the new site Content about: - Austin - UT - iSchool - People at the iSchool - Students at work in the iSchool (classes, labs) Screen grab from videos Search the Web for copyright free images Take our own pictures

30 Current Pages & Designs First version of main iSchool page template and CSS complete Secondary page template & CSS complete - Some secondary pages already built Index page template set Site map page initially set - Big Map - Main pages map

31 Next steps In class - Test & evaluate current CSS and templates - Improvise secondary home page based on initial design - Examine new Alumni section - Examine new Course Listing page For homework - Complete secondary page migration to new design - Rotate design work Alumni Site Map Home page design ideas - Picture/Media creation work

