Web Log, Text, and Other Data Mining. Wayne Kao.

Presentation transcript:

Web Log, Text, and Other Data Mining Wayne Kao

What is Data Mining? “Automated extraction of hidden predictive information from large databases” -Kurt Thearling “Quickly and thoroughly explore mountains of data, isolating the valuable, usable information -- the business intelligence” -SPSS site

Possible Questions (Chi) Usage –How has info been accessed? How frequently? What’s popular? –How do people enter the site? Where do people spend time? How long do they spend there? –How do people travel within a site? What are the [un]popular paths? –Who are the people accessing the site? From what geographical location? From what domains?

Possible Questions (cont) Structural –What information has been added? Modified? Remained the same but moved? Usage + Structural –How is new info accessed? When does it become popular? –How does introducing new information change navigation patterns? Can people still navigate there to the desired info? –Do people look for deleted information?

Usability Testing Common usability testing techniques: Interviews Ethnographic and/or lab-style observations Surveys Focus groups Good qualitative data Problems with these techniques: Time and effort are costly Small sample sizes – quantitative results? (Spool) How can we get usability testing more involved in the design cycles, so we can find problems and potential problems earlier? [Figure: iterative cycle of Design, Prototype, Evaluate]

Remote Usability (Waterson) Analyze clickstreams in the context of the task and user intentions Human observers not present Want methods that are –Easy to deploy on any website –Compatible with range of OS and browsers Mobile computing adds further usability challenges –Small screen sizes –Limited and/or new interaction techniques –Devices are used in environments beyond the desktop

Apache Web Log
[29/Mar/2002:03:58: ] "GET /~sophal/whole5.gif HTTP/1.0" " "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
[29/Mar/2002:03:59: ] "GET /~alexlam/resume.html HTTP/1.0" "-" "Mozilla/5.0 (Slurp/cat;
[29/Mar/2002:03:00: ] "GET /~tahir/indextop.html HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
[29/Mar/2002:03:00: ] "GET /~tahir/animate.js HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
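As a rough illustration of how entries like these can be turned into usage counts, here is a minimal Python sketch that parses Apache's "combined" log format. The sample lines, hosts, and paths in SAMPLE are hypothetical stand-ins (the entries on the slide are incomplete), and the regular expression assumes a well-formed combined-format line.

```python
import re
from collections import Counter

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical sample entries (not the ones from the slide).
SAMPLE = [
    '192.0.2.10 - - [29/Mar/2002:03:58:06 -0800] "GET /~alice/photo.gif HTTP/1.0" '
    '200 4321 "http://www.example.com/~alice/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"',
    '192.0.2.11 - - [29/Mar/2002:03:59:10 -0800] "GET /~bob/resume.html HTTP/1.0" '
    '200 2048 "-" "Mozilla/5.0 (compatible)"',
    '192.0.2.12 - - [29/Mar/2002:04:00:02 -0800] "GET /~alice/photo.gif HTTP/1.1" '
    '200 4321 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

entries = [e for e in map(parse_line, SAMPLE) if e is not None]

# "What's popular?": count requests per path.
hits = Counter(entry["path"] for entry in entries)
```

Here hits.most_common() lists paths from most to least requested; the same parsed fields (host, referer, time) are what the usage questions earlier in the talk would draw on.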

Analog - One traditional tool Reports number of requests, info about client machines, entry/exit points, charts (Chi et al.) Generated on a daily basis Typical stats Prettier stats

Readings “Visualizing the Evolution of Web Ecologies” Chi et al., Xerox PARC, 1998 “Visualizing Association Rules for Text Mining” Wong, Whitney, & Thomas, Pacific Northwest, 1999 “VISVIP: 3D Visualization of Paths through Web Sites” Cugini & Scholtz, National Institute of Standards and Technology, 1999 “Case Study: E-Commerce Clickstream Visualization” Brainerd & Becker, Blue Martini Software, 2001 “What Did They Do? Understanding Clickstreams with the WebQuilt Visualization System” Waterson et al., UC Berkeley, 2002

Next reading: “Visualizing the Evolution of Web Ecologies” Chi et al., Xerox PARC, 1998

Evolution of Web Ecologies Rather than hits, focus intermediate representation on (C)ontent, (U)sage, and (T)opology, sorted by URL. –URL1: {day1: …} {day2: …} –URL2: {day1: …} Visualize an entire web site in a small amount of space Show temporal changes

Disk Tree Visualization Breadth first traversal Each ring represents a tree level All leaf nodes guaranteed some angular space (360 / # leaves) Tree links → line mark in X and Y Page access frequency → line size/brightness Lifecycle stage → color: new, continued, deleted
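The layout rules on this slide can be sketched in a few lines of Python. This is one plausible reading of the description, not the authors' implementation: the function name disk_tree_layout, the ring_spacing parameter, and the example site are all assumptions. Each leaf gets an equal angular slice (360 / # leaves), each tree level sits on its own ring, and a parent is centered over the wedge of its descendants.

```python
import math
from collections import deque


def disk_tree_layout(children, root, ring_spacing=1.0):
    """Return {node: (x, y)} for a tree given as {parent: [child, ...]}."""
    leaves = {}

    def count_leaves(node):
        kids = children.get(node, [])
        leaves[node] = 1 if not kids else sum(count_leaves(k) for k in kids)
        return leaves[node]

    total_leaves = count_leaves(root)
    unit = 2 * math.pi / total_leaves  # angular space guaranteed to each leaf

    positions = {}
    # Breadth-first traversal: (node, depth, start angle of the node's wedge).
    queue = deque([(root, 0, 0.0)])
    while queue:
        node, depth, start = queue.popleft()
        span = leaves[node] * unit
        angle = start + span / 2          # center the node within its wedge
        radius = depth * ring_spacing     # one ring per tree level
        positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
        child_start = start
        for kid in children.get(node, []):
            queue.append((kid, depth + 1, child_start))
            child_start += leaves[kid] * unit
    return positions


# Hypothetical site structure.
site = {"/": ["/a", "/b"], "/a": ["/a/1", "/a/2"]}
layout = disk_tree_layout(site, "/")
```

Access frequency and lifecycle stage would then be mapped to line brightness and color when drawing the links between these positions.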

Disk Tree Visualization (cont) Pros –No occlusion problems since it’s 2D plane –Can use the 3rd dimension for other info (e.g. time) –Aesthetically pleasing to the eye (?) Cons –Difficult to see any page-level detail –Confusing color choices

Time Tube Visualization Put Disk Trees along spatial axis Rotated so that each slice gets equal screen area Focus+context Animation: Can fly through tube, mapping time onto time

Interaction Model Can rotate slices with a button click Can focus a slice by clicking on it Flicking gestures move slices around Right-clicking zooms to an area Mouseovers display more information about a node in a side window Can bring up pages in the browser Animation of slices

Real-world Analyses Deadwood: Shows pages becoming [un]popular Shows effects of a redesign

Real-world Analyses (cont) Added items are being used Deleted items aren’t negatively impacting the rest of the site

Comments Gives only a broad view of the data with no real way to get at the specifics Interaction seems very advanced Not sure how intuitive the whole idea of a circular tree is – seems kind of gratuitous

Next reading: “Visualizing Association Rules for Text Mining” Wong, Whitney, & Thomas, Pacific Northwest, 1999

Association Rule? Quantitative rule that describes associations between sets of items –Not qualitative because no domain knowledge necessary for text mining Implication X → Y where –X: set of antecedent items –Y: consequent item Example: 80% of people who buy diapers and baby powder also buy baby oil.

Association Rule? (cont) Support/prevalence/joint probability –Percentage of articles in the total set that contain every item in the antecedent and the consequent item Confidence/predictability/conditional probability –Percentage of articles containing the antecedent that also contain the consequent item
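A small worked example may help. The sketch below uses the standard definitions: support is the fraction of all records that contain every item of the rule, and confidence is the fraction of records containing the antecedent that also contain the consequent. The basket data is made up to reproduce the 80% figure from the diapers example, and the function names are illustrative.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for record in records if items <= set(record)) / len(records)


def confidence(records, antecedent, consequent):
    """Fraction of records containing the antecedent that also contain the consequent."""
    antecedent_support = support(records, antecedent)
    if antecedent_support == 0:
        return 0.0
    return support(records, set(antecedent) | {consequent}) / antecedent_support


# Hypothetical data: 10 baskets, 5 contain diapers and baby powder,
# and 4 of those 5 also contain baby oil.
baskets = (
    [{"diapers", "baby powder", "baby oil"}] * 4
    + [{"diapers", "baby powder"}]
    + [{"milk"}] * 5
)

rule_support = support(baskets, {"diapers", "baby powder", "baby oil"})
rule_confidence = confidence(baskets, {"diapers", "baby powder"}, "baby oil")
```

With this data the rule {diapers, baby powder} → baby oil has support 4/10 and confidence 4/5. For text mining, the records would be articles and the items would be topics or keywords.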

Association Rule Visualization Must visualize –Antecedent items & consequent items –Associations between antecedent and consequent –Rules' support –Confidence Traditional ways of visualizing it –2D matrix –Directed graph

2D Matrix (figure 1) Antecedent and consequent items on axes Metadata icons in the cells that connect the antecedent to consequent contain support and confidence values Association rule: B → C

2D Matrix (cont) Pros: one-to-one binary relationships Cons: –Hard to see association rules in many-to-one relationships (A+B → C or A → C and B → C) –Grouping antecedents adds complexity –Object occlusion

Directed graph nodes = items edges = associations Cons: –Dozen or more items → tangled display –Selecting edges to display multiple rules requires significant human interaction

Confusing?

“Novel” Technique Matrix: rule-to-item –rows = topics –columns = item associations –blue/red = antecedent and consequent Bar graph = confidence/support Can use queries to filter Mouse zooming to support context/focus
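To make the rule-to-item idea concrete, here is a minimal Python sketch that builds such a matrix as plain data: one row per item (topic), one column per rule, with each cell marking whether the item is an antecedent ("A"), the consequent ("C"), or absent ("."). The function name, the cell markers, and the example rules are assumptions for illustration; the paper renders the roles as blue and red blocks rather than letters.

```python
def rule_to_item_matrix(rules):
    """rules: list of (antecedent_items, consequent_item) pairs.

    Returns {item: [role per rule]} where role is "A", "C", or ".".
    """
    items = sorted({item for ante, cons in rules for item in (*ante, cons)})
    matrix = {item: [] for item in items}
    for ante, cons in rules:
        for item in items:
            if item == cons:
                role = "C"
            elif item in ante:
                role = "A"
            else:
                role = "."
            matrix[item].append(role)
    return matrix


# Hypothetical rules: A+B -> C and B -> D.
rules = [(("A", "B"), "C"), (("B",), "D")]
matrix = rule_to_item_matrix(rules)
for item, roles in matrix.items():
    print(item, " ".join(roles))
```

Because each rule is its own column, a many-to-one rule such as A+B → C needs no antecedent grouping: it is simply a column with two "A" cells and one "C" cell.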

“Novel” Technique Advantages Handles hundreds of multiple antecedent association rules View topics and associations simultaneously Individual items clearly shown No antecedent groups Few occlusions because metadata is plotted at the far end and bar graph is scaled No screen swapping, animation, or serious interaction required

“Novel” Technique Demo Demo shows scalability ~9 MB news article corpus of 100,000+ documents Use word and concept-based text engines Words evaluated on whether they’re interesting depending on their position in documents Suffixes removed and common prepositions, pronouns, adjectives, gerunds ignored Build a table of antecedents, consequents, confidences, and supports -> feed into viz

Conclusions Rule-to-item association Very clear visualization if limited to a few dozen rules Most web log visualizations jump to using a graph; this paper forces you to think twice.

Next reading: “VISVIP: 3D Visualization of Paths through Web Sites” Cugini & Scholtz, National Institute of Standards and Technology, 1999

VISVIP Captures individual movement between pages rather than aggregates Shows paths - sequence of URLs

Topology Directed graph Force-directed algorithm –Spring-like force –Nodes repel each other with force inversely proportional to the distance between them (i.e. closer nodes means closer pages) –Final force pulls nodes toward center
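The three forces listed here can be sketched as a single update step in Python. This is a generic force-directed iteration under assumed constants, not VISVIP's actual algorithm: the function name layout_step and the parameters spring, rest, repel, gravity, and step are all hypothetical.

```python
import math


def layout_step(pos, edges, spring=0.1, rest=1.0, repel=0.05, gravity=0.01, step=1.0):
    """Apply one force-directed update and return new {node: (x, y)} positions."""
    nodes = list(pos)
    force = {n: [0.0, 0.0] for n in nodes}

    # Repulsion between every pair, magnitude inversely proportional to distance.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            magnitude = repel / dist
            force[a][0] += magnitude * dx / dist
            force[a][1] += magnitude * dy / dist
            force[b][0] -= magnitude * dx / dist
            force[b][1] -= magnitude * dy / dist

    # Spring-like attraction along links, pulling toward the rest length.
    for a, b in edges:
        dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
        dist = math.hypot(dx, dy) or 1e-9
        magnitude = spring * (dist - rest)
        force[a][0] += magnitude * dx / dist
        force[a][1] += magnitude * dy / dist
        force[b][0] -= magnitude * dx / dist
        force[b][1] -= magnitude * dy / dist

    # Final force pulls every node toward the center (the origin).
    for n in nodes:
        force[n][0] -= gravity * pos[n][0]
        force[n][1] -= gravity * pos[n][1]

    return {n: (pos[n][0] + step * force[n][0], pos[n][1] + step * force[n][1])
            for n in nodes}


# Two linked pages placed far apart drift together after one step.
linked = layout_step({"a": (-2.0, 0.0), "b": (2.0, 0.0)}, [("a", "b")])
# Two unlinked pages placed very close together are pushed apart.
unlinked = layout_step({"a": (-0.05, 0.0), "b": (0.05, 0.0)}, [])
```

In practice the step is repeated until the positions settle, so linked pages end up near each other while unrelated pages spread out.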

Content URLs abbreviated – bd.gif → ge/abd Color-coded by content type Mouseover reveals all the abbreviated information

Simplification Common problems –Noise nodes not significant to paths - image and mailto nodes –Over-connectivity - link back to home page or company logo Solutions –Delete all edges connected to a node –Make one node the graph root –Focus on a subset of the graph

Path Sequence Showing subject paths as straight lines didn't work –Hard to follow single jagged path –Multiple paths overlapped Spline representation –Each path is a smooth curve overlaid on the graph –Colors for groups of subjects (e.g. novices)

Path Sequence (cont) User path-oriented layouts –Simpler structure than when path is laid over a graph of the entire site

Path Timing Vertical bar with base on node, its height proportional to time spent on page Animation runs through pages at times real-time Select a node to get detailed stats

Comments Capturing individual movements pretty innovative Curved user paths and reorienting the layout based on user paths Overall graph viz not too clear Good tips for creating a web log mining viz

Next reading: “Case Study: E-Commerce Clickstream Visualization” Brainerd & Becker, Blue Martini Software, 2001

Clickstream Visualizer Aggregate nodes using an icon (e.g. all the checkout pages) Edges represent transitions –Wider means more transitions
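The aggregation described here amounts to counting transitions between groups of pages. A minimal Python sketch follows, assuming sessions are available as ordered lists of page paths; the grouping rule (first path segment) and the example sessions are hypothetical, and transitions that stay inside one group are dropped so that each aggregate node has no self-loop.

```python
from collections import Counter


def first_segment(page):
    """Hypothetical grouping rule: group pages by their first path segment."""
    return page.strip("/").split("/")[0] or "home"


def transition_counts(sessions, group_of=first_segment):
    """Count transitions between page groups across all sessions."""
    counts = Counter()
    for pages in sessions:
        groups = [group_of(page) for page in pages]
        for src, dst in zip(groups, groups[1:]):
            if src != dst:  # skip moves within the same aggregate node
                counts[(src, dst)] += 1
    return counts


# Hypothetical clickstreams.
sessions = [
    ["/", "/products/shoes", "/checkout/cart", "/checkout/pay"],
    ["/", "/products/hats", "/checkout/cart"],
]
edges = transition_counts(sessions)
```

Each count would set the width of the corresponding edge, so heavily used transitions stand out.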

Customer Segments Collect –Clickstream –Purchase history –Demographic data Associates customer data with their clickstream (scary...) Different color for each customer segment

Filtering Using the mouse or table control, can filter by –Edge weight –Node selection Example: select checkout nodes and see if users are exiting from nodes

Layout Using third party Tom Sawyer package 1. Hierarchical from higher-out degree to higher-in degree –Mirrors actual flow of site users –The default 2. Circular –Puts related nodes into circles –Shows relationships between groups of pages

Layout (cont) Aggregation based on file system path (good idea?)

Initial Findings Gender shopping differences (intriguing...)

Initial Findings (cont) Checkout process analysis Newsletter hurting sales

Comments Visualizing clickstreams with demographic data Grouping pages by type Best use of color Icons an interesting way of reducing complexity

Next reading: “What Did They Do? Understanding Clickstreams with the WebQuilt Visualization System” Waterson et al., UC Berkeley, 2002

System Design Log data with proxy Infer actions Aggregate data Layout graph Display interactive visualization

Capturing Interaction Typical HTTP request… [Figure: Client Browser ↔ Web Server]

Capturing Interaction (cont) WebQuilt captures interaction with a proxy –Proxies have typically been used for caching and firewalls [Figure: Client Browser ↔ Proxy ↔ Web Server, with the proxy writing to the WebQuilt Log]

Capturing Interaction (cont) If a page says: Change it to:
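The before and after snippets for this slide did not survive in the transcript. As a general illustration of how a logging proxy can keep a user's clicks flowing through itself, the Python sketch below rewrites each link in a page so that it points back at the proxy, with the original target passed as a parameter. The proxy address and the url parameter are hypothetical and are not WebQuilt's actual URL scheme; a real proxy would also use an HTML parser rather than a regular expression.

```python
import re
from urllib.parse import quote, urljoin

PROXY = "http://proxy.example/log?url="  # hypothetical proxy endpoint


def rewrite_links(html, page_url, proxy=PROXY):
    """Point every double-quoted href at the proxy, keeping the real target as a parameter."""
    def replace(match):
        absolute = urljoin(page_url, match.group(2))  # resolve relative links
        return f'{match.group(1)}{proxy}{quote(absolute, safe="")}{match.group(3)}'

    return re.sub(r'(href=")([^"]*)(")', replace, html)


original = '<a href="about.html">About</a>'
rewritten = rewrite_links(original, "http://www.example.com/index.html")
```

Because every link now leads to the proxy, it can log the request, fetch the real page, rewrite that page's links in turn, and return it to the browser.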

Capturing Interaction (cont) Pros: –Don’t need access to servers –Can analyze sites without permission from the server –Can gather clickstreams from a variety of devices including PDAs, phones, desktop computers Cons: –No direct access to the client

Visualization Interactive, zoomable directed graph Nodes = web pages Edges = aggregate traffic between pages Java-based SATIN toolkit for gesturing & zooming interaction Image rendering of web pages: JacoZoom Java callable wrappers around an ActiveX component MSIE window

Directed graph Nodes: visited pages –Color marks entry and exit nodes Arrows: traversed links –Thicker: more heavily traversed –Color Red/yellow: Time spent before clicking Blue: optimal path chosen by designer

Controls Slider: Zoom in and out Checkboxes: Filter paths to display

Pages Zooming in shows page thumbnails Arrows –Originate from actual links or the Back button –Translucent & don’t cover details

Layout Layout system flexible… 1. Edge-weighted depth-first traversal –Most visited path along top –Recursively place less followed paths below 2. Grid positioning –Organizes distance between nodes –Avoid overlapping nodes
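The two layout steps can be sketched together in Python. This is a simplified reading of the slide, not WebQuilt's code: the function name, the graph encoding ({page: {next_page: traffic}}), and the example traffic counts are assumptions. The heaviest outgoing edge continues along the current row, each lighter branch starts a new row below, and positions are integer grid cells so nodes cannot overlap.

```python
def weighted_dfs_layout(graph, root):
    """Return {node: (column, row)} with the most visited path along row 0."""
    positions = {}
    next_free_row = [1]  # row 0 is reserved for the most visited path

    def place(node, column, row):
        positions[node] = (column, row)
        # Visit outgoing edges from heaviest to lightest traffic.
        children = sorted(graph.get(node, {}).items(), key=lambda kv: -kv[1])
        first = True
        for child, _traffic in children:
            if child in positions:
                continue  # already placed via a more heavily used path
            if first:
                place(child, column + 1, row)  # heaviest branch stays on this row
                first = False
            else:
                branch_row = next_free_row[0]
                next_free_row[0] += 1
                place(child, column + 1, branch_row)  # lighter branches go below

    place(root, 0, 0)
    return positions


# Hypothetical aggregate traffic between pages.
traffic = {
    "home": {"search": 8, "browse": 3},
    "search": {"results": 7, "help": 1},
    "results": {"detail": 6},
    "browse": {"detail": 2},
}
grid = weighted_dfs_layout(traffic, "home")
```

In this example home, search, results, and detail land on the top row, while help and browse are placed on rows below.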

Interaction Selecting nodes Zooming in and out Navigational gestures

Inferring & Aggregating Take log files and infer actions, such as when the back button is pressed –Can infer back button pressed, but not combinations of back and forward –Extensible framework to add other inferred actions Aggregate information, preserving individual paths
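One way to infer back-button presses can be sketched in Python, assuming each logged transaction records both the page the click came from and the page requested. That log shape, the function name, and the heuristic are assumptions for illustration, not WebQuilt's actual inference code. If a click originates from a page earlier in the history than the one most recently loaded, the user must have gone back to it first.

```python
def infer_actions(transactions):
    """transactions: time-ordered list of (from_page, to_page) pairs.

    Returns a list of ("link", page) and inferred ("back", page) actions.
    """
    actions = []
    history = []  # stack of pages the user has moved forward through
    for source, target in transactions:
        if not history:
            history.append(source)
        # The click came from an earlier page: unwind the history with
        # inferred back presses until that page is on top again.
        while history[-1] != source and source in history:
            history.pop()
            actions.append(("back", history[-1]))
        actions.append(("link", target))
        history.append(target)
    return actions


# Hypothetical log: A -> B, B -> C, then a click that originates from A.
log = [("A", "B"), ("B", "C"), ("A", "D")]
actions = infer_actions(log)
```

For this log the sketch infers two back presses (to B, then to A) before the link to D. It cannot distinguish sequences that mix back and forward, which matches the limitation noted on the slide.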

Running a WebQuilt Remote Usability Test Recruit users Design and distribute tasks (via ) Auto-collect! Watch and wait as users perform tasks and proxy logs data Visualize, analyze Use the results to change design

Pilot Usability Study Edmunds.com PDA web site Handspring Visor equipped with an OmniSky wireless modem 10 users asked to find… –Anti-lock brake information on the latest Nissan Sentra model –The Nissan dealer closest to them.

In the Lab vs. Out in the Wild Comparing in-lab usability testing with WebQuilt remote usability testing 5 users were tested in the lab 5 were given the device and asked to perform the task at their convenience All task directions, demographic data, and follow up questionnaire data was presented and collected in web forms as part of the WebQuilt testing framework.

Classifying Usability Issues Lab: Tester observations, participant comments and questionnaire data Remote: WebQuilt visualization and questionnaire data Four categories of issues Browser Device Test design Site design Six severity levels 0 indicates comment 1-5 where 1 is a very minor issue and 5 is a critical issue

Findings

WebQuilt methodology is promising for uncovering site design related issues. 1/3 of the issues were device or browser related. Browser and device issues cannot be captured automatically with WebQuilt unless they cause an interaction with the server, but they can be revealed via the questionnaire data.

Testing Concerns What to do when problems with running the test occur? Understanding user motivation is still ambiguous: Curiosity vs. confusion? Gathering qualitative feedback on mobile devices is difficult –PDA input difficult –Phones have potential for audio

Comments Zooming/filtering great for showing overview and page-level details –Can put screenshots directly into the viz Layout in relation to intended path Study compares remote usability tests to traditional tests - promising Proxy logging very cool

Future Work Expanded mobile device interaction capture, specifically net-enabled cell phones Improve filtering capabilities, integrating questionnaire and demographic data Clever algorithms to simplify graph layout Improved quantitative reporting Improved controls/interaction More rigorous evaluation with designers and usability experts

Concluding Comments Many incremental improvements in web log/data mining viz (using a graph, using demographic data, etc.) Would be really good to see a study of usability engineers and web developers comparing the tools themselves