1 Web Research - Large-Scale Web Data Analysis Amanda Spink Queensland University of Technology Jim Jansen The Pennsylvania State University.

Presentation transcript:

1 Web Research - Large-Scale Web Data Analysis Amanda Spink Queensland University of Technology Jim Jansen The Pennsylvania State University

2 Web Data Analysis Track Web search trends and characteristics Web query transaction logs collected in 1997, 1999, 2001, 2003, 2004, 2005 and Combined dataset of 20 million+ Web searches

3 Web Search Studies Web search engines: -Alta Vista -Ask Jeeves -Excite -AlltheWeb -Vivisimo -Dogpile Transaction log analysis studies Focus on user search analysis for competitive advantage

4 Web Data Sample
Query                  UID       Cookie  Date/Time  Browser  Location  Vertical  Organic_Clicks  Sponsored_Clicks
jamie pressly          NIJYMA4   TB385Y  :00:00     msie6.0  usa       Images    0               1
maytag parts           UGJ23KA4  T2TCMV  :00:00     msie6.0  usa       Web       1               0
free gay porno videos  KSKNPKA4  TB22ER  :00:00     msie6.0  usa       Web       0               1

5 Data Collection Methods Various combinations of methods and approaches Transaction log analysis Videotaping and Audio-taping Think aloud protocols Usability – HCI techniques Focus groups Interviews Survey Experiments Diaries

6 Data Analysis Methods Quantitative and statistical analysis Qualitative analysis – grounded theory Combination of both methods

7 Key Issues – Search Studies What is the goal of the project? – Insights, understanding and theory development – User modeling – Trends analysis – Interface/systems design – User training

8 Key Issues – Search Studies What variables to measure? How much data is enough? Methods used – single or multiple? HCI approach – test interface/system features

9 Transaction Log Analysis (TLA) File or log of communications between user and system File recorded on a server (server-side recording) Log or file formats vary but there are fields common to most (e.g., IP address, cookie, time stamp, query, vertical, click-through)
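The common fields listed on this slide can be pictured as one record per interaction. The following is a minimal sketch only: the class name `LogRecord`, the tab-separated layout, the field order, and the sample values (including the date) are assumptions made for illustration, not the format of any log used in these studies.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogRecord:
    """One interaction in a transaction log (field names are illustrative)."""
    user_id: str         # IP address or other user identifier
    cookie: str          # browser cookie used to group a user's requests
    timestamp: datetime  # when the server recorded the interaction
    query: str           # query text as submitted
    vertical: str        # collection searched, e.g. "Web" or "Images"
    clicks: int          # number of results clicked (click-through)


def parse_line(line: str) -> LogRecord:
    """Parse one tab-separated line in the hypothetical field order above."""
    user_id, cookie, stamp, query, vertical, clicks = line.rstrip("\n").split("\t")
    return LogRecord(
        user_id=user_id,
        cookie=cookie,
        timestamp=datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        query=query,
        vertical=vertical,
        clicks=int(clicks),
    )


# Made-up sample line; only the query text echoes the data sample slide.
sample = "10.0.0.1\tc-001\t2005-05-06 00:00:00\tmaytag parts\tWeb\t1"
record = parse_line(sample)
print(record.query, record.vertical, record.clicks)
```

Real logs vary in delimiter and field order, so a parser like this would need to be adapted to the format actually recorded.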

10 Why Collect and Analyze Log Data? Gain understanding of user interaction with system and interface Goal: to improve system and interface design and user training Transaction log analysis is extensively used in academia and industry

11 TLA Process Goals and objectives Data collection Log preparation Data analysis Making sense

12 Data Collected Process of collecting the interaction data for a given period in a transaction log Collect data on the search episode User identification Date Time Search session content Resources accessed (e.g., URLs)

13 Logging Software Custom and commercial applications (the Wrapper) WinWhatWhere spy software Morae 1.1 software Camtasia Studio

14 Data Preparation Process of cleaning and preparing the log data for analysis Log data into a relational database Cleaning the log – corrupted data Parsing the log (e.g., removing Web sessions identified as agents) Normalizing the log
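The preparation steps on this slide (loading into a relational database, cleaning corrupted records, removing agent sessions, normalizing) might look roughly like the sketch below. It uses Python's built-in `sqlite3` module. The table layout, the sample rows, and the rule "a user with more than `max_queries_per_user` queries is treated as an agent" are assumptions for illustration; the slides do not state how agents were identified.

```python
import sqlite3


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so equivalent queries compare equal."""
    return " ".join(query.lower().split())


def prepare(rows, max_queries_per_user=100):
    """Load (user_id, timestamp, query) rows into SQLite after cleaning."""
    # Cleaning: drop rows with any missing or empty field (corrupted data).
    clean = [r for r in rows if all(r)]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (user_id TEXT, ts TEXT, query TEXT)")
    # Normalizing: store queries in one canonical form.
    conn.executemany(
        "INSERT INTO log VALUES (?, ?, ?)",
        [(u, ts, normalize(q)) for u, ts, q in clean],
    )
    # Parsing: remove users whose query volume suggests an automated agent
    # (the threshold is an assumed heuristic, not taken from the slides).
    conn.execute(
        "DELETE FROM log WHERE user_id IN ("
        "  SELECT user_id FROM log GROUP BY user_id HAVING COUNT(*) > ?)",
        (max_queries_per_user,),
    )
    conn.commit()
    return conn


# Made-up rows: one corrupted record and one high-volume "bot" user.
raw_rows = [
    ("u1", "2005-05-06 00:00:00", "  Maytag   Parts "),
    ("u1", "2005-05-06 00:01:10", "maytag parts"),
    ("u2", None, "jamie pressly"),
] + [("bot", f"2005-05-06 00:00:0{i}", f"probe {i}") for i in range(4)]

conn = prepare(raw_rows, max_queries_per_user=3)
rows = conn.execute("SELECT user_id, query FROM log ORDER BY ts").fetchall()
print(rows)
```

With the low demonstration threshold of 3, the four "bot" rows and the corrupted "u2" row are dropped and only the two normalized "u1" queries remain.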

15 Log Analysis – Three Levels Term Query Session

16 Term Level Analysis Term occurrence Total terms High and low usage terms Term distribution Co-occurring terms
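Term occurrence, total terms, and co-occurring terms can be counted directly from the query strings. This is a small sketch under the assumption that terms are whitespace-separated and case-insensitive; the function name `term_stats` and the sample queries are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def term_stats(queries):
    """Return (term counts, co-occurring term pair counts) for a query list."""
    term_counts = Counter()
    pair_counts = Counter()
    for q in queries:
        terms = q.lower().split()
        term_counts.update(terms)
        # Count each unordered pair of distinct terms once per query.
        pair_counts.update(combinations(sorted(set(terms)), 2))
    return term_counts, pair_counts


queries = ["maytag parts", "maytag dryer parts", "jamie pressly", "dryer parts"]
term_counts, pair_counts = term_stats(queries)
total_terms = sum(term_counts.values())

print(total_terms)                 # total terms across all queries
print(term_counts.most_common(3))  # high-usage terms
print(pair_counts.most_common(2))  # most frequent co-occurring pairs
```

Low-usage terms and the overall term distribution follow from the same `term_counts` object, for example by sorting its items by count.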

17 Query Level Analysis Initial query Subsequent queries Modified queries and query reformulation Identical queries Query complexity Boolean use Spelling Types of queries Query topics
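Several of the query-level measures above can be derived by comparing each query in a session with the ones before it. The sketch below is a simplification: it labels a query "identical" if the same normalized text appeared earlier in the session and "modified" otherwise, and it detects Boolean use only through the uppercase operators AND, OR and NOT. Both rules, and the function names, are assumptions; the sketch does not separate a reformulation from a switch to a new topic.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def classify_queries(session_queries):
    """Label each query in one session as initial, identical, or modified."""
    labels = []
    seen = set()
    for i, q in enumerate(session_queries):
        key = " ".join(q.lower().split())  # normalized form for comparison
        if i == 0:
            labels.append("initial")
        elif key in seen:
            labels.append("identical")
        else:
            labels.append("modified")
        seen.add(key)
    return labels


def uses_boolean(query):
    """True if the query contains an uppercase Boolean operator token."""
    return any(tok in BOOLEAN_OPERATORS for tok in query.split())


session = ["maytag parts", "Maytag parts", "maytag dryer parts"]
labels = classify_queries(session)
print(labels)
print(uses_boolean("washer AND dryer"), uses_boolean("washer and dryer"))
```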

18 Query Subjects – Alta Vista 2002 & Vivisimo

Alta Vista 2002
1. People/Places 49.2%
2. Commerce, etc. 12.5%
3. Computers, etc. 12.4%
4. Health/sciences 7.4%
5. Education/Humanities 5%
6. Entertainment, etc. 4.5%
7. Sex/Pornography 3.2%
8. Society/Culture, etc. 3.1%
9. Government 1.5%
10. Performing/Fine Arts 0.6%

Vivisimo
1. Commerce, etc. 21%
2. Indiscernible 19%
3. People/Places, etc. 15%
4. Computers/Internet 13%
5. Social/Culture 9%
6. Health/Sciences 6%
7. Education/Humanities 5%
8. Sex/Pornography 4%
9. Performing/Fine Arts 3%
10. Government 3%
11. Entertainment, etc. 2%

19 Web Search Session Level Analysis Search duration Search patterns Successive and multitasking sessions Page or resource viewing

20 Web Session Duration (Minutes) 56% of sessions less than 1 minute 72% of sessions less than 5 minutes 81% of sessions less than 15 minutes Mean: approx. 58 minutes and 2 seconds (see Jansen, B. J., Spink, A., and Koshman, S. Web searcher interactions with the Dogpile.com meta-search engine. Journal of the American Society for Information Science and Technology, 58(5).)
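Shares of sessions under a given duration, like those on this slide, can be computed from time stamps once a session is defined. The sketch below assumes a session is everything one user identifier did in the log, measured from first to last recorded interaction. That definition, the function names, and the sample records are illustrative assumptions and not the method of the cited study.

```python
from datetime import datetime


def session_durations(records):
    """Return {user_id: seconds between first and last interaction}."""
    first, last = {}, {}
    for user_id, stamp in records:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        first[user_id] = min(first.get(user_id, t), t)
        last[user_id] = max(last.get(user_id, t), t)
    return {u: (last[u] - first[u]).total_seconds() for u in first}


def share_under(durations, minutes):
    """Fraction of sessions shorter than the given number of minutes."""
    limit = minutes * 60
    return sum(d < limit for d in durations.values()) / len(durations)


# Made-up records: (user_id, time stamp).
records = [
    ("u1", "2005-05-06 00:00:00"), ("u1", "2005-05-06 00:00:30"),
    ("u2", "2005-05-06 00:00:00"), ("u2", "2005-05-06 00:04:00"),
    ("u3", "2005-05-06 00:00:00"), ("u3", "2005-05-06 00:20:00"),
    ("u4", "2005-05-06 00:00:00"),  # single interaction, duration 0
]

durations = session_durations(records)
for minutes in (1, 5, 15):
    print(minutes, share_under(durations, minutes))
mean_seconds = sum(durations.values()) / len(durations)
```

A mean computed this way can be far larger than the typical session, because a few very long sessions dominate the average.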

21 Transaction Log Analysis (TLA) Methods Quantitative and statistical analysis – requires software and expertise Qualitative analysis – requires training Creativity factor Combination of quantitative and qualitative methods

22 TLA Strengths Data from a large user base Reasonable and non-intrusive Less time than other methods Can be relatively inexpensive

23 TLA Limitations Transaction logs do not include user demographic and other data Lacks data on search reasons and motivations Incomplete data due to corrupted logging

24 Conclusions Search analysis is a complex process with many choices TLA is a powerful tool Requires planning, training and expertise Can be combined with other data collection and analysis techniques

25 Further Reading
Spink, A., & Jansen, B. J. (2004). Web Search: Public Searching of the Web. Springer.
Jansen, B. J. (2006). Search log analysis: What is it; what's been done; how to do it. Library and Information Science Research, 28(3).
Jansen, B. J., Spink, A., & Taksa, I. (forthcoming). Handbook of Web Log Analysis. Idea Group Publishing.

26 QUESTIONS? Thank You