
1 Ryen White Microsoft Research ryenw@microsoft.com research.microsoft.com/~ryenw/talks/ppt/WhiteIMT542E.ppt

2 Overview Short, selfish bit about me; User evaluation in IR; Case study combining two approaches (user study, log-based); Introduction to Exploratory Search Systems (focus on evaluation); Short group activity; Wrap-up

3 Me, Me, Me Interested in understanding and supporting people's search behaviors, in particular on the Web. Ph.D. in Interactive Information Retrieval from University of Glasgow, Scotland (2001–2004). Post-doc at University of Maryland Human-Computer Interaction Lab (2004–2006). Instructor for course on Human-Computer Interaction at UMD College of Library and Information Studies. Researcher in Text Mining, Search, and Navigation group at Microsoft Research, Redmond (2006–present).

4 Overview Short, selfish bit about me; User evaluation in IR; Case study combining two approaches (user study, log-based); Introduction to Exploratory Search Systems (focus on evaluation); Short group activity; Wrap-up

5 Search Interfaces There are lots of different search interfaces, for lots of different situations. Big question: how do we evaluate these interfaces?

6 Some Approaches Laboratory experiments; naturalistic studies; longitudinal studies; formative (during) and summative (after) evaluations; traditional usability studies (is an interface usable? generally not comparative); case studies (often designer-, not user-, driven)

7 Research Questions Research questions are the questions you hope your study will answer (a formal statement of your goal). Hypotheses are specific predictions about relationships among variables. Questions should be meaningful, answerable, concise, open-ended, and value-free.

8 Research Questions: Example 1 For a study of advanced query syntax (e.g., +, -, "", site:), the research questions were: Is there a relationship between the use of advanced syntax and other characteristics of a search? Is there a relationship between the use of advanced syntax and post-query navigation behaviors? Is there a relationship between the use of advanced syntax and measures of search success?

9 Research Questions: Example 2 For a study of an interface gadget that points users to popular destinations (i.e., pages that many people visit): Are popular destinations preferable and more effective than query refinement suggestions and unaided Web search for: Searches that are well-defined ("known-item" tasks)? Searches that are ill-defined ("exploratory" tasks)? Should popular destinations be taken from the end of query trails or the end of session trails? More on these research questions in the case study later!

10 Variables Independent Variable (IV): the "cause"; this is often (but not always) controlled or manipulated by the investigator. Dependent Variable (DV): the "effect"; this is what is proposed to change as a result of different values of the independent variable. Other variables: Intervening variable: explains the link between the IV and the DV. Moderating variable: affects the direction/strength of the IV-to-DV relationship. Confounding variable: not controlled for, affects the DV.

11 Hypotheses Alternative Hypothesis: a statement describing the relationship between two or more variables, e.g., search engine users that use advanced query syntax find more relevant Web pages. Null Hypothesis: a statement declaring that there is no relationship among variables; you may have heard of "rejecting the null hypothesis" or "failing to reject the null hypothesis". E.g., search engine users that use advanced query syntax find Web pages that are no more or less relevant than those found by other users.

12 Experimental Design Within and/or between subjects. Within-subjects: all subjects use all systems. Between-subjects: subjects use only one system; different blocks of subjects use each system. Control: a system with no modifications (in within-subjects designs), or a group of subjects that do not use the experimental system but instead use a baseline (in between-subjects designs). Factorial designs: more than one variable (factor), e.g., system × task type.

13 Tasks Task or topic? Task is the activity the user is asked to perform; topic is the subject matter of the task. Artificial tasks: subjects given task or even queries; relevance pre-determined. Simulated work tasks (Borlund, 2000): subjects given task; compose queries; determine relevance. Natural tasks (Kelly & Belkin, 2004): subjects construct own tasks as part of real needs.

14 System & Task Rotation Rotation & counterbalancing to counteract learning effects. Latin Square rotation: an n × n table filled with n different symbols so that each symbol occurs exactly once in each row and exactly once in each column. Factorial rotation: all possible combinations. Factorial has twice as many subjects, so it is twice as expensive to perform.
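
A hedged illustration of how such a rotation could be generated in code (this is not from the slides); the system names below are placeholders for the conditions being rotated, and the construction is the commonly used balanced Latin square for an even number of conditions.

```python
def balanced_latin_square(conditions):
    """Build orderings in which each condition appears in each presentation
    position exactly once (a Latin square); for an even number of conditions
    this construction also balances first-order carry-over effects."""
    n = len(conditions)
    rows = []
    for row in range(n):
        order, j, k = [], 0, 0
        for i in range(n):
            if i % 2 == 0:
                idx = (row + j) % n
                j += 1
            else:
                k += 1
                idx = (row + n - k) % n
            order.append(conditions[idx])
        rows.append(order)
    return rows

# Placeholder condition names matching the four systems in the case study
systems = ["Baseline", "QuerySuggestion", "QueryDestination", "SessionDestination"]
for subject, order in enumerate(balanced_latin_square(systems), start=1):
    print(f"Subject {subject}: {order}")
```

With four systems this yields four orderings; subjects are assigned to rows in turn, so after every multiple of four subjects each system has been seen equally often in each position.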

15 Data Collection Questionnaires Diaries Interviews Focus groups Observation Think-aloud Logging (system, proxy & server, client)

16 Data Analysis: Quantitative Descriptive Statistics: describe the characteristics of a sample or the relationship among variables; present summary information about the sample, e.g., mean, correlation coefficient. Inferential Statistics: used for hypothesis testing; demonstrate cause/effect relationships, e.g., t-value (from t-test), F-value (from ANOVA).
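
A minimal sketch of how the inferential tests named above might be run with SciPy; the completion-time numbers are invented placeholders, not study data.

```python
from scipy import stats

# Invented task-completion times (seconds) for three systems
baseline = [348, 360, 412, 298, 375, 401]
query_destination = [232, 251, 270, 220, 244, 260]
session_destination = [359, 340, 372, 301, 365, 380]

# Descriptive statistics: summarise each sample
print("mean (baseline):", sum(baseline) / len(baseline))
print("mean (query destination):", sum(query_destination) / len(query_destination))

# Inferential statistics: independent-samples t-test for a between-subjects comparison
t, p = stats.ttest_ind(baseline, query_destination)
print(f"t = {t:.2f}, p = {p:.4f}")

# One-way ANOVA when comparing more than two systems at once
f, p = stats.f_oneway(baseline, query_destination, session_destination)
print(f"F = {f:.2f}, p = {p:.4f}")
```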

17 Data Analysis: Qualitative Coding – open-ended questions, transcribed think-aloud, … Classifying or categorizing individual pieces of data. Open coding: codes are suggested by the investigator's examination and questioning of the data; an iterative process. Closed coding: codes are identified before the data is collected. Each passage can have more than one code; not every passage has to have a code. Code, code, and code some more!

18 Overview Short, selfish bit about me; User evaluation in IR; Case study combining two approaches (user study, log-based); Introduction to Exploratory Search Systems (focus on evaluation); Short group activity; Wrap-up

19 Case Study Leveraging popular destinations to enhance Web search interaction White, R.W., Bilenko, M., Cucerzan, S. (2007). Studying the use of popular destinations to enhance web search interaction. In Proceedings of the 30th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 159-166.

20 Motivation Query suggestion is a popular approach to help users better define their information needs. Incremental: may be inappropriate for exploratory needs. In exploratory searches users rely a lot on browsing. Can we use the places others go rather than what they say? (Screenshot: query suggestions for the query [hubble telescope].)

21 Search Trails: from user logs Trails are initiated with a query to a top-5 search engine. Query trails: Query → next Query. Session trails: Query → terminating event (session timeout, visit to homepage, typed URL, checking Web-based email or logging on to an online service). (Diagram: example query and session trails for digital camera queries, with visits to pages such as dpreview.com, pmai.org, canon.com, amazon.com, howstuffworks.com and digitalcamera-hq.com; the query trail ends at the next query, the session trail at the terminating event.)
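
A rough sketch of how one user's timestamped log events might be segmented into query trails and session trails under the rules above; the event format and the 30-minute timeout are assumptions for illustration, not details taken from the study.

```python
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed value; the slide only says "session timeout"

def segment_trails(events):
    """Split one user's chronologically ordered events into trails.

    Each event is assumed to be a dict like
    {"time": datetime, "type": "query" | "visit" | "homepage" | "typed_url" | "webmail",
     "query": str or None, "url": str or None}.
    A query trail runs from a query to the next query; a session trail runs
    from a query to a session-terminating event.
    """
    query_trails, session_trails = [], []
    current_query, current_session = [], []
    last_time = None
    for event in events:
        timed_out = last_time is not None and event["time"] - last_time > SESSION_TIMEOUT
        terminating = timed_out or event["type"] in {"homepage", "typed_url", "webmail"}
        if event["type"] == "query" or terminating:
            if current_query:                     # close the current query trail
                query_trails.append(current_query)
                current_query = []
        if terminating and current_session:       # close the current session trail
            session_trails.append(current_session)
            current_session = []
        if event["type"] == "query" or current_query:
            current_query.append(event)
        if event["type"] == "query" or current_session:
            current_session.append(event)
        last_time = event["time"]
    if current_query:
        query_trails.append(current_query)
    if current_session:
        session_trails.append(current_session)
    return query_trails, session_trails
```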

22 Popular Destinations Pages at which other users end up frequently after submitting the same or similar queries, and then browsing away from initially clicked search results. Popular destinations lie at the end of many users' trails: they may not be among the top-ranked results, may not contain the queried terms, and may not even be indexed by the search engine.
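
A hedged sketch of how popular destinations for a query might be tallied from a corpus of trails; the data layout, example URLs, and the raw-count ranking are assumptions for illustration rather than the paper's exact method (which also groups similar queries).

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def popular_destinations(trails, top_k=5):
    """Aggregate, per query, the domains at which users' trails most often end.

    `trails` is assumed to be a list of (query, visited_urls) pairs, where the
    last URL in visited_urls is the page the trail ended on.
    """
    counts = defaultdict(Counter)
    for query, urls in trails:
        if urls:
            counts[query.lower()][urlparse(urls[-1]).netloc] += 1
    return {q: c.most_common(top_k) for q, c in counts.items()}

# Tiny invented example
trails = [
    ("digital cameras", ["http://engine.example/results", "http://dpreview.com/reviews"]),
    ("digital cameras", ["http://engine.example/results", "http://dpreview.com/buying-guide"]),
    ("digital cameras", ["http://engine.example/results", "http://canon.com/cameras"]),
]
print(popular_destinations(trails))
# {'digital cameras': [('dpreview.com', 2), ('canon.com', 1)]}
```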

23 Suggesting Destinations Can we exploit a corpus of trails to support Web search?

24 Research Questions RQ1: Are destination suggestions preferable and more effective than query refinement suggestions and unaided Web search for: Searches that are well-defined (“known-item” tasks) Searches that are ill-defined (“exploratory” tasks) RQ2: Should destination suggestions be taken from the end of the query trails or the end of the session trails?

25 User Study Conducted a user study to answer these questions. 36 subjects drawn from a subject pool within our organization; 4 systems; 2 task types ("known-item" and "exploratory"). Within-subject experimental design with a Graeco-Latin square rotation. Subjects attempted 2 known-item and 2 exploratory tasks, one on each system.

26 Systems: Unaided Web Search Live Search backend; no direct support for query refinement. (Screenshot: results for the query [hubble telescope].)

27 Systems: Query Suggestion Suggests queries based on popular extensions of the current query typed by the user. (Screenshot: query suggestions for the query [hubble telescope].)

28 Systems: Destination Suggestion QueryDestination (unaided + page support): suggests pages many users visit before the next query. SessionDestination (unaided + page support): same as above, but before the session end rather than the next query. (Screenshot: destination suggestions for the query [hubble telescope].)

29 Tasks Tasks taken and adapted from the TREC Interactive Track and QA communities (e.g., Live QnA, Yahoo! Answers). Six of each task type; subjects chose without replacement. Two task types: known-item and exploratory. Known-item task: Identify three tropical storms (hurricanes and typhoons) that have caused property damage and/or loss of life. Exploratory task: You are considering purchasing a Voice Over Internet Protocol (VoIP) telephone. You want to learn more about VoIP technology and providers that offer the service, and select the provider and telephone that best suits you.

30 Methodology Subjects: chose two known-item and two exploratory tasks from six; completed a demographic and experience questionnaire. For each of the four interfaces, subjects were: given an explanation of interface functionality (2 min.); asked to attempt the task on the assigned system (10 min.); asked to complete a post-search questionnaire after each task. After using all four systems, subjects answered an exit questionnaire.

31 Findings: System Ranking Subjects were asked to rank the systems in preference order. Subjects preferred QuerySuggestion and QueryDestination; differences were not statistically significant. The overall ranking merges performance on different types of search task to produce one ranking.
Relative ranking of systems (lower = better):
System    Baseline  QuerySuggestion  QueryDestination  SessionDestination
Ranking   2.47      2.14             1.92              2.31

32 Findings: Subject Comments Responses to open-ended questions Baseline: + familiarity of the system (e.g., “was familiar and I didn’t end up using suggestions” (S36)) − lack of support for query formulation (“Can be difficult if you don’t pick good search terms” (S20)) − difficulty locating relevant documents (e.g., “Difficult to find what I was looking for” (S13))

33 Findings: Subject Comments Query Suggestion: + rapid support for query formulation (e.g., “was useful in saving typing and coming up with new ideas for query expansion” (S12); “helps me better phrase the search term” (S24); “made my next query easier” (S21)) − suggestion quality (e.g., “Not relevant” (S11); “Popular queries weren’t what I was looking for” (S18)) − quality of results they led to (e.g., “Results (after clicking on suggestions) were of low quality” (S35); “Ultimately unhelpful” (S1))

34 Findings: Subject Comments QueryDestination: + support for accessing new information sources (e.g., “provided potentially helpful and new areas / domains to look at” (S27)) + bypassing the need to browse to these pages (“Useful to try to ‘cut to the chase’ and go where others may have found answers to the topic” (S3)) − lack of specificity in the suggested domains (“Should just link to site-specific query, not site itself” (S16); “Sites were not very specific” (S24); “Too general/vague” (S28)) − quality of the suggestions (“Not relevant” (S11); “Irrelevant” (S6))

35 Findings: Subject Comments SessionDestination: + utility of the suggested domains (“suggestions make an awful lot of sense in providing search assistance, and seemed to help very nicely” (S5)) − irrelevance of the suggestions (e.g., “did not seem reliable, not much help” (S30); “irrelevant, not my style” (S21)) − need to include explanations about why the suggestions were offered (e.g., “low-quality results, not enough information presented” (S35))

36 Findings: Task Completion Subjects felt that they were more successful for known-item searches on QuerySuggestion and more successful for exploratory searches on QueryDestination.
Perceptions of task success (lower = better, scale = 1-5):
Task type     Baseline  QSuggestion  QDestination  SDestination
Known-item    2.0       1.3          1.4           –
Exploratory   2.8       2.3          1.4           2.6

37 Findings: Task Completion Time QuerySuggestion and QueryDestination sped up known-item performance; exploratory tasks took longer.
(Bar chart: mean task completion time in seconds, by task category and system)
Task type     Baseline  QSuggest  QDestination  SDestination
Known-item    348.8     272.3     232.3         359.8
Exploratory   513.7     467.8     474.2         472.2

38 Findings: Interaction On known-item tasks subjects used query suggestions most heavily; on exploratory tasks subjects benefited most from destination suggestions. Subjects submitted fewer queries and clicked fewer search results on QueryDestination.
Suggestion uptake (values are percentages):
Task type     QSuggestion  QDestination  SDestination
Known-item    35.7         33.5          23.4
Exploratory   30.0         35.2          25.3

39 Log Analysis These findings are all from the laboratory. Logs from consenting users of the Windows Live Toolbar allowed us to determine the external validity of our experimental findings. Do the behaviors observed in the study mimic those of real users in the "wild"? Extracted search sessions from the logs that started with the same initial queries as our user study subjects.

40 Log Analysis: Search Trails Trails are initiated with a query to a top-5 search engine. Query trails: Query → next Query. Session trails: Query → terminating event (session timeout, visit to homepage, typed URL, checking Web-based email or logging on to an online service). (Diagram: same example trails as slide 21.)

41 Log Analysis: Trails We extracted 2,038 trails from the logs that began with the same query as a user study session: 700 from known-item and 1,338 from exploratory tasks. In vitro group: user study subjects. Ex vitro group: remote subjects. Compared: # query iterations, # unique query terms, # result clicks, and # of unique domains visited.
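
A sketch of how the four comparison features might be computed for each extracted trail, reusing the assumed event format from the trail-segmentation sketch earlier; the field names (including "from_serp" for search-result clicks) are hypothetical.

```python
from urllib.parse import urlparse

def trail_features(trail):
    """Compute the four comparison features for one trail.

    Each event is assumed to be a dict with "type" in {"query", "visit"},
    "query" text for query events, "url" for visits, and a boolean
    "from_serp" marking clicks on search-result pages.
    """
    queries = [e["query"] for e in trail if e["type"] == "query"]
    unique_terms = {t for q in queries for t in q.lower().split()}
    result_clicks = [e for e in trail if e["type"] == "visit" and e.get("from_serp")]
    domains = {urlparse(e["url"]).netloc for e in trail if e["type"] == "visit"}
    return {
        "query_iterations": len(queries),
        "unique_query_terms": len(unique_terms),
        "result_clicks": len(result_clicks),
        "unique_domains": len(domains),
    }
```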

42 Log Analysis: Results Generally the same, apart from the number of unique query terms submitted. Subjects may be taking terms from the textual task descriptions provided to them.
Known-item tasks (In vitro / Ex vitro 10 min / Ex vitro All):
Query iterations: 1.9 / 2.3 / 2.6
Unique query terms: 5.2 / 2.8 / 3.2
Result clicks: 2.6 / 1.8 / 2.5
Unique domains: 1.3 / 1.4 / 1.7
Exploratory tasks (In vitro / Ex vitro 10 min / Ex vitro All):
Query iterations: 3.1 / 3.0 / 3.8
Unique query terms: 7.4 / 4.4 / 4.9
Result clicks: 3.3 / 2.8 / 3.1
Unique domains: 2.1 / 1.8 / 2.1
The in vitro unique query term counts are high!

43 Log Analysis: Results Known-item tasks: 72% overlap between queries issued and terms appearing in the task description. Exploratory tasks: 79% overlap between queries issued and terms appearing in the task description. This could confound the experiment if we are interested in query formulation behavior – need to address!
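
One plausible way such an overlap figure might be computed (the slide does not give the exact definition, so this is an assumption): the fraction of unique query terms that also appear in the task description text.

```python
def query_task_overlap(queries, task_description):
    """Fraction of unique query terms that also appear in the task description."""
    task_terms = set(task_description.lower().split())
    query_terms = {t for q in queries for t in q.lower().split()}
    if not query_terms:
        return 0.0
    return len(query_terms & task_terms) / len(query_terms)

# Invented example queries against the known-item task text from slide 29
queries = ["tropical storms property damage", "storms loss of life"]
task = ("Identify three tropical storms (hurricanes and typhoons) that have "
        "caused property damage and/or loss of life.")
print(f"{query_task_overlap(queries, task):.0%}")
```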

44 Conclusions The user study compared popular destinations with traditional query refinement and unaided Web search. Results revealed that: RQ1a: query suggestion preferred for known-item tasks; RQ1b: destination suggestion preferred for exploratory tasks; RQ2: destinations from query trails rather than session trails. Differences in the number of unique query terms suggest that textual task descriptions may introduce some degree of experimental bias.

45 Case Study What did we learn? Showed how a user evaluation can be conducted. Showed how analysis of different sources – questionnaire responses and interaction logs (both local and remote) – can be combined to answer our research questions. Showed that the findings of a user study can be generalized in some respects to the "real" world (i.e., the study has some external validity). Anything else?

46 Overview Short, selfish bit about me; User evaluation in IR; Case study combining two approaches (user study, log-based); Introduction to Exploratory Search Systems (focus on evaluation); Short group activity; Wrap-up

47 Exploratory Search "Exploratory search" describes: (1) an information-seeking problem context that is open-ended, persistent, and multi-faceted, commonly used in scientific discovery, learning, and decision making contexts (the user's search problem); and (2) information-seeking processes that are opportunistic, iterative, and multi-tactical; exploratory tactics are used in all manner of information seeking and reflect seeker preferences and experience as much as the goal (the user's search strategies).

48 Marchionini’s definition:

49 Exploratory Search Systems Support both querying and browsing activities Search engines generally just support querying Help users explore complex information spaces Help users learn about new topics: go beyond finding Can consider user context E.g., Task constraints, user emotion, changing needs

50 Overview Short, selfish bit about me; User evaluation in IR; Case study combining two approaches (user study, log-based); Introduction to Exploratory Search Systems (focus on evaluation); Short group activity; Wrap-up

51 Group Activity Divide into two groups of 3-4 people Each group designs an evaluation of an exploratory search system Two systems: mSpace: faceted spatial browser for classical music PhotoMesa: photo browser with flexible filtering, grouping, and zooming tools You pick the evaluation criteria, comparator systems, approach, metrics, etc.

52 mSpace (mspace.fm)

53 PhotoMesa (photomesa.com)

54 Some questions to think about What are the independent/dependent variables? Which experimental design? What task types? What tasks? What topics? Any comparator systems? What subjects? How many? How will you recruit? Which instruments? (e.g., questionnaires) Which data analysis methods (qualitative/quantitative)? Most importantly: Which metrics? How do you determine user and system performance?

55 Overview Short, selfish bit about me; User evaluation in IR; Case study combining two approaches (user study, log-based); Introduction to Exploratory Search Systems (focus on evaluation); Short group activity; Wrap-up

56 Evaluating Exploratory Search SIGIR 2006 workshop on Evaluating Exploratory Search Systems Brought together around 40 experts to discuss issues in the evaluation of exploratory search systems http://research.microsoft.com/~ryenw/eess What metrics did they come up with? How do they compare to yours?

57 Metrics from workshop Engagement and enjoyment: e.g., task focus, happiness with system responses, the number of actionable events (e.g., purchases, forms filled) Information novelty: e.g., the amount of new information encountered Task success: e.g., reach target document? encountered sufficient information en route? Task time: to assess efficiency Learning and cognition: e.g., cognitive loads, attainment of learning outcomes, richness/completeness of post-exploration perspective, amount of topic space covered, number of insights

58 Activity Wrap-up [insert summary of comments from group activity]

59 Conclusion We have: Described aspects of user experimentation in IR Walked through a case study Introduced exploratory search Planned evaluation of exploratory search systems Related our proposed metrics to those of others interested in evaluating exploratory search systems

60 Acknowledgements Although modified, a few of the earlier slides in this lecture were based on an excellent SIGIR 2006 tutorial given by Diane Kelly and David Harper – Thank you Diane and David!

61 Referenced Reading Borlund, P. (2000). Experimental components for the evaluation of interactive information retrieval systems. Journal of Documentation, 56(1): 71-90. Kelly, D. and Belkin, N.J. (2004). Display time as implicit feedback: Understanding task effects. In Proceedings of the 27th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 377-384.

