1 Web Research - Large-Scale Web Data Analysis
- Amanda Spink, Queensland University of Technology
- Jim Jansen, The Pennsylvania State University

2 Web Data Analysis 1997-2007
- Track Web search trends and characteristics
- Web query transaction logs collected in 1997, 1999, 2001, 2003, 2004, 2005 and 2006
- Combined dataset of 20 million+ Web searches

3 Web Search Studies
- Web search engines:
  - Alta Vista
  - Ask Jeeves
  - Excite
  - AlltheWeb
  - Vivisimo
  - Dogpile
- Transaction log analysis studies
- Focus on user search analysis for competitive advantage

4 Web Data Sample
Query                  UID             Cookie           Date/Time            Browser  Location  Vertical  Organic_Clicks  Sponsored_Clicks
jamie pressly          66.215.238.179  29NIJYMA4TB385Y  2006-05-15 00:00:00  msie6.0  usa       Images    0               1
maytag parts           206.192.197.53  2UGJ23KA4T2TCMV  2006-05-15 00:00:00  msie6.0  usa       Web       1               0
free gay porno videos  65.23.175.149   KSKNPKA4TB22ER   2006-05-15 00:00:00  msie6.0  usa       Web       0               1
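
As an illustration of how a record like those in the sample could be read into a program, here is a minimal Python sketch. It assumes one record per line, tab-separated, with fields in the order of the column headings above. The slides do not state the actual file format, so the delimiter, the `LogRecord` class and the `parse_line` function are all hypothetical. The sample line reuses the "maytag parts" row, with the wrapped IP address and cookie values joined.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed layout: tab-separated fields in the order of the slide's column
# headings. The real log format is not given in the slides.
FIELD_COUNT = 9


@dataclass
class LogRecord:
    query: str
    uid: str  # an IP address in the sample rows
    cookie: str
    timestamp: datetime
    browser: str
    location: str
    vertical: str
    organic_clicks: int
    sponsored_clicks: int


def parse_line(line: str) -> LogRecord:
    """Parse one tab-separated log line; raise ValueError if it is malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(fields)}")
    query, uid, cookie, ts, browser, location, vertical, organic, sponsored = fields
    return LogRecord(
        query=query,
        uid=uid,
        cookie=cookie,
        timestamp=datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        browser=browser,
        location=location,
        vertical=vertical,
        organic_clicks=int(organic),
        sponsored_clicks=int(sponsored),
    )


sample = (
    "maytag parts\t206.192.197.53\t2UGJ23KA4T2TCMV\t"
    "2006-05-15 00:00:00\tmsie6.0\tusa\tWeb\t1\t0"
)
record = parse_line(sample)
print(record.query, record.vertical, record.organic_clicks)
```

Raising on malformed lines, rather than skipping them silently, keeps corrupted records visible so they can be counted during log cleaning.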

5 Data Collection Methods
- Various combinations of methods and approaches:
  - Transaction log analysis
  - Videotaping and audio-taping
  - Think-aloud protocols
  - Usability - HCI techniques
  - Focus groups
  - Interviews
  - Survey
  - Experiments
  - Diaries

6 Data Analysis Methods
- Quantitative and statistical analysis
- Qualitative analysis - grounded theory
- Combination of both methods

7 Key Issues - Search Studies
- What is the goal of the project?
  - Insights, understanding and theory development
  - User modeling
  - Trends analysis
  - Interface/systems design
  - User training

8 Key Issues - Search Studies
- What variables to measure?
- How much data is enough?
- Methods used - single or multiple?
- HCI approach - test interface/system features

9 Transaction Log Analysis (TLA)
- File or log of communications between user and system
- File recorded on a server (server-side recording)
- Log or file formats vary, but some fields are common to most (e.g., IP address, cookie, time stamp, query, vertical, click-through)

10 Why Collect and Analyze Log Data?
- Gain understanding of user interaction with system and interface
- Goal: improve system and interface design, and improve user training
- Transaction log analysis is extensively used in academia and industry

11 TLA Process
- Goals and objectives
- Data collection
- Log preparation
- Data analysis
- Making sense

12 Data Collected
- Process of collecting the interaction data for a given period in a transaction log
- Collect data on the search episode:
  - User identification
  - Date
  - Time
  - Search session content
  - Resources accessed (e.g., URLs)

13 Logging Software
- Custom and commercial applications (the Wrapper - http://ist.psu.edu/faculty_pages/jjansen/academic/wrapper.htm)
- WinWhatWhere spy software
- Morae 1.1 software
- Camtasia Studio

14 Data Preparation
- Process of cleaning and preparing the log data for analysis
- Loading the log data into a relational database
- Cleaning the log: removing corrupted data
- Parsing the log (e.g., removing Web sessions identified as agents)
- Normalizing the log
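
The cleaning, agent-removal and normalization steps listed on this slide could be sketched roughly as follows in Python. Everything specific here is an assumption made for illustration: the slides do not say how agents are identified, so the sketch uses a simple cutoff on queries per user (`AGENT_QUERY_THRESHOLD`, an illustrative value), treats records with an empty user ID or query as corrupted, and normalizes queries by lowercasing and collapsing whitespace. The function names are hypothetical.

```python
from collections import Counter

# Assumed rule: a user ID submitting more queries than this is treated as an
# automated agent. The cutoff is illustrative, not taken from the slides.
AGENT_QUERY_THRESHOLD = 100


def normalize_query(query: str) -> str:
    """Lowercase the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def prepare(rows):
    """Clean (user_id, query) pairs: drop corrupted rows, normalize queries,
    then drop every row belonging to a likely agent."""
    cleaned = [
        (user_id.strip(), normalize_query(query))
        for user_id, query in rows
        if user_id and query and user_id.strip() and query.strip()
    ]
    per_user = Counter(user_id for user_id, _ in cleaned)
    return [
        (user_id, query)
        for user_id, query in cleaned
        if per_user[user_id] <= AGENT_QUERY_THRESHOLD
    ]


rows = [("a", "  Maytag   PARTS "), ("", "orphan query"), ("b", "   ")]
rows += [("bot", f"query {i}") for i in range(150)]
print(prepare(rows))
```

In practice the same steps are often carried out as SQL statements once the log is in a relational database. The in-memory version above only shows the logic.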

15 Log Analysis - Three Levels
- Term
- Query
- Session

16 Term Level Analysis
- Term occurrence
- Total terms
- High and low usage terms
- Term distribution
- Co-occurring terms
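
A minimal Python sketch of the term-level measures listed on this slide: term occurrence, total terms and co-occurring terms. It assumes terms are whitespace-separated tokens and that a co-occurrence is any unordered pair of distinct terms within a single query. Both are simplifying assumptions, and `term_stats` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def term_stats(queries):
    """Return (term_counts, pair_counts) for a list of query strings."""
    term_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        terms = query.lower().split()
        term_counts.update(terms)
        # Count each unordered pair of distinct terms once per query.
        pair_counts.update(combinations(sorted(set(terms)), 2))
    return term_counts, pair_counts


queries = ["maytag parts", "maytag dryer parts", "jamie pressly"]
term_counts, pair_counts = term_stats(queries)
total_terms = sum(term_counts.values())

print(total_terms)
print(term_counts.most_common(2))   # high-usage terms
print(pair_counts.most_common(1))   # most frequent co-occurring pair
```

High and low usage terms and the term distribution can both be read from `term_counts`, for example via `most_common()`.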

17 Query Level Analysis
- Initial query
- Subsequent queries
- Modified queries and query reformulation
- Identical queries
- Query complexity
- Boolean use
- Spelling
- Types of queries
- Query topics
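
One way the initial, modified and identical query categories could be assigned within a single session is sketched below in Python. The rules are assumptions chosen for illustration, not the authors' coding scheme. The first query is "initial". A query whose normalized text has already appeared in the session is "identical". A query sharing at least one term with the previous query is "modified". Anything else is labeled "new". `classify_queries` is a hypothetical name.

```python
def classify_queries(session_queries):
    """Label each query in one session, in submission order."""
    labels = []
    seen = set()          # normalized queries already submitted in the session
    previous_terms = None
    for query in session_queries:
        normalized = " ".join(query.lower().split())
        terms = set(normalized.split())
        if previous_terms is None:
            label = "initial"
        elif normalized in seen:
            label = "identical"
        elif terms & previous_terms:
            label = "modified"   # shares at least one term with the previous query
        else:
            label = "new"
        labels.append(label)
        seen.add(normalized)
        previous_terms = terms
    return labels


session = ["maytag parts", "maytag dryer parts", "Maytag  parts", "jamie pressly"]
print(classify_queries(session))
```

Term overlap is a crude proxy for reformulation. A spelling correction, for example, shares no terms with the query it corrects and would be labeled "new" under these rules.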

18 Query Subjects - Alta Vista 2002 & Vivisimo 2004
Alta Vista 2002:
1. People/Places 49.2%
2. Commerce, etc. 12.5%
3. Computers, etc. 12.4%
4. Health/sciences 7.4%
5. Education/Humanities 5%
6. Entertainment, etc. 4.5%
7. Sex/Pornography 3.2%
8. Society/Culture, etc. 3.1%
9. Government 1.5%
10. Performing/Fine Arts 0.6%
Vivisimo 2004:
1. Commerce, etc. 21%
2. Indiscernible 19%
3. People/Places, etc. 15%
4. Computers/Internet 13%
5. Social/Culture 9%
6. Health/Sciences 6%
7. Education/Humanities 5%
8. Sex/Pornography 4%
9. Performing/Fine Arts 3%
10. Government 3%
11. Entertainment, etc. 2%

19 Web Search Session Level Analysis
- Search duration
- Search patterns
- Successive and multitasking sessions
- Page or resource viewing

20 Web Session Duration (Minutes)
- 56% of sessions less than 1 minute
- 72% of sessions less than 5 minutes
- 81% of sessions less than 15 minutes
- Mean: approx. 58 minutes and 2 seconds
- See Jansen, B. J., Spink, A., and Koshman, S. (2007). Web searcher interactions with the Dogpile.com meta-search engine. Journal of the American Society for Information Science and Technology, 58(5), 744-755.
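
Figures of the kind shown on this slide (share of sessions under a given duration, mean duration) could be computed along the following lines. This Python sketch assumes a session is all records sharing one user ID and that its duration is the time between that user's first and last logged record. The study cited above may define sessions differently. The function names and the sample data are hypothetical.

```python
from datetime import datetime


def session_durations(records):
    """Map each user ID to its session duration in seconds.

    `records` is an iterable of (user_id, timestamp) pairs in any order.
    """
    first, last = {}, {}
    for user_id, ts in records:
        if user_id not in first or ts < first[user_id]:
            first[user_id] = ts
        if user_id not in last or ts > last[user_id]:
            last[user_id] = ts
    return {u: (last[u] - first[u]).total_seconds() for u in first}


def share_under(durations, minutes):
    """Fraction of sessions strictly shorter than `minutes`."""
    if not durations:
        return 0.0
    limit = minutes * 60
    return sum(1 for d in durations.values() if d < limit) / len(durations)


# Made-up timestamps, for illustration only.
records = [
    ("a", datetime(2006, 5, 15, 0, 0, 0)), ("a", datetime(2006, 5, 15, 0, 0, 30)),
    ("b", datetime(2006, 5, 15, 0, 0, 0)), ("b", datetime(2006, 5, 15, 0, 2, 0)),
    ("c", datetime(2006, 5, 15, 0, 0, 0)), ("c", datetime(2006, 5, 15, 0, 10, 0)),
    ("d", datetime(2006, 5, 15, 0, 0, 0)), ("d", datetime(2006, 5, 15, 1, 0, 0)),
]
durations = session_durations(records)
mean_seconds = sum(durations.values()) / len(durations)
for minutes in (1, 5, 15):
    print(minutes, share_under(durations, minutes))
```

A few very long sessions can pull the mean far above the typical session, which is why a mean of close to an hour can coexist with most sessions lasting under a minute.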

21 Transaction Log Analysis (TLA) Methods
- Quantitative and statistical analysis - requires software and expertise
- Qualitative analysis - requires training
- Creativity factor
- Combination of quantitative and qualitative methods

22 TLA Strengths
- Data from a large user base
- Reasonable and non-intrusive
- Less time than other methods
- Can be relatively inexpensive

23 TLA Limitations
- Transaction logs do not include user demographic and other data
- Logs lack data on search reasons and motivations
- Data may be incomplete due to corrupted logging

24 Conclusions
- Search analysis is a complex process with many choices
- TLA is a powerful tool
- Requires planning, training and expertise
- Can be combined with other data collection and analysis techniques

25 Further Reading
- Spink, A., & Jansen, B. J. (2004). Web Search: Public Searching of the Web. Springer.
- Jansen, B. J. (2006). Search log analysis: What is it; what's been done; how to do it. Library and Information Science Research, 28(3), 407-432.
- Jansen, B. J., Spink, A., & Taksa, I. (forthcoming). Handbook of Web Log Analysis. Idea Group Publishing.

26 Questions?
- Thank you

