Human versus Machine in the Topic Distillation Task
Mingfang Wu, Alistair McLean, Ross Wilkinson, Gheorghe Muresan, Muh-Chyun Tang, Yuelin Li, Hyuk-Jin Lee, Nicholas J. Belkin

Outline  Introduction to the topic distillation task:  Automatic vs. interactive  Our research questions:  Human vs. machine performance  Does the machine help the user?  Linear vs. hierarchical layout  Experimental design  Experimental results  Summary

Topic Distillation Task  Topic distillation:  Given a broad query, find a list of key resources.  To be a “key resource”, a page should be a good entry point to a website which:  Provides credible information on the topic;  Is principally devoted to the topic (not too broad); and  Is not part of a larger site also principally devoted to the topic (not too narrow).
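
These three conditions can be checked mechanically once a topical relevance score and the site's link hierarchy are available. The following is a minimal sketch, not the engine used in the study; the score threshold, the dict-based site map, and the 0.5 majority rule are all assumptions made for illustration:

```python
def is_key_resource(url, relevance, children, parent, threshold=0.5):
    """Heuristic check of the three 'key resource' conditions.

    relevance: dict url -> topical relevance score in [0, 1] (assumed given)
    children:  dict url -> list of child-page urls within the same site
    parent:    dict url -> parent url, or None for a site root
    """
    # Condition 1: the page itself provides credible information on the topic.
    if relevance.get(url, 0.0) < threshold:
        return False

    # Condition 2 (not too broad): the page should be principally devoted to
    # the topic, so most of its child pages should also be on-topic.
    kids = children.get(url, [])
    if kids:
        on_topic = sum(relevance.get(k, 0.0) >= threshold for k in kids)
        if on_topic / len(kids) < 0.5:
            return False

    # Condition 3 (not too narrow): the page should not sit inside a larger
    # section that is itself principally devoted to the topic; in that case
    # the parent would be the better entry point.
    p = parent.get(url)
    if p is not None and relevance.get(p, 0.0) >= threshold:
        return False

    return True
```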

Task Example: Find government information on adoption procedures (screenshot of an example page that is too broad)

Task Example: Find government information on adoption procedures (screenshot of an example page at the right level)

Task Example: Find government information on adoption procedures (screenshot of an example page that is too narrow)

Topic Distillation Task. Task Example: Find government information on adoption procedures. (The chosen website should meet the three conditions above.)

Automatic Topic Distillation
Task: Number: TD adoption procedures
Description: Information regarding how one adopts children
Answer:
 Administration for Children and Families (USHHS)
 California Courts Self-Help Center
 Mesa County Dept. of Human Services
 Texas Department of Health, Central Adoption

Automatic Topic Distillation
Task: Number: TD7 cotton industry
Description: Where can I find information about growing, harvesting cotton and turning it into cloth?
Answer:
 Cotton Pathology Research Unit (cpru.usda.gov/)
 FAS Cotton Group (ffas.usda.gov.cots/cotton.html)
 U.S. Cotton Data Sets (wizard.arsusda.gov/cotton/ars2.html)
 USDA Cotton Program
 Safety and Health Topics: Textiles
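
For context, an automatic run like the two above is just a per-topic ranked list of entry-point documents; in TREC evaluations such runs are typically serialised in the standard six-column run-file format (topic id, the literal Q0, document id, rank, score, run tag). A minimal sketch, with made-up topic and document identifiers:

```python
def write_trec_run(results, run_tag, path):
    """results: dict mapping topic_id -> list of (doc_id, score), best first."""
    with open(path, "w") as out:
        for topic_id, ranked in results.items():
            for rank, (doc_id, score) in enumerate(ranked, start=1):
                out.write(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}\n")

# Identifiers below are illustrative only, not actual .GOV document ids.
write_trec_run({"TD7": [("G00-00-0000001", 12.3), ("G00-00-0000002", 9.8)]},
               run_tag="example_run", path="run.txt")
```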

Why an interactive task?  Information tasks are a partnership between people and their information systems;  To understand task effectiveness, it is necessary to study not only system effectiveness, but also the effectiveness of the partnership.  Which parts are best done by people? (e.g. text understanding)  Which parts are best done by the system? (e.g. matching a query)  What are the trade-offs in time and efficiency?  How to make the partnership work best?

Interactive topic distillation task
 Example topic:
   Topic: cotton industry
   Search task: You are to construct a resource list for high school students who are interested in the cotton industry, from growing and harvesting cotton to turning it into cloth. Your final resource list should cover all the major aspects of this topic.
 Task analysis: To assess whether a web page is a key resource page, a searcher needs to make the following judgments about the page:
   Is it relevant? (topical relevance)
   Is it too broad or too narrow? (depth)
   Does it cover a new aspect? (coverage)

How to measure “performance”? Four criteria for the quality of a resource list:
 Accuracy
   Relevance: The page is relevant to the topic (1=Agree strongly, 3=Neutral, 5=Disagree strongly)
   Depth: Is the page too broad, too narrow, or at the right level of detail for the topic? (1=Too broad, 3=Right level, 5=Too narrow)
 Coverage
   Coverage: The set of saved entry points covers all the different aspects of the topic (1=Agree strongly, 3=Neutral, 5=Disagree strongly)
   Repetition: How much repetition/overlap is there within the set of saved entry points? (1=None, 3=Some, 5=Way too much)
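
A minimal sketch of how ratings on these 1–5 scales could be averaged per condition and criterion; the condition labels and the example ratings are invented for illustration, not taken from the study:

```python
from collections import defaultdict
from statistics import mean

# Each judgement: (condition, criterion, rating on the 1-5 scale above).
judgements = [
    ("BestMatch-Auto", "relevance", 2), ("BestMatch-Auto", "depth", 4),
    ("TD-Linear", "relevance", 1), ("TD-Linear", "depth", 3),
    ("TD-Linear", "coverage", 2), ("TD-Linear", "repetition", 1),
]

by_cell = defaultdict(list)
for condition, criterion, rating in judgements:
    by_cell[(condition, criterion)].append(rating)

for (condition, criterion), ratings in sorted(by_cell.items()):
    print(f"{condition:15s} {criterion:10s} mean={mean(ratings):.2f} (n={len(ratings)})")
```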

How to measure performance?  More objective measures:  No. of query iterations  No. of document surrogates seen  No. of documents read  No. of documents saved  Actual time used
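
These counts fall out of a time-stamped interaction log; below is a minimal sketch under assumed event names (`query`, `surrogate_shown`, `doc_opened`, `doc_saved`), which are not the instrumentation actually used in the experiment:

```python
from collections import Counter

def objective_measures(events):
    """events: list of (timestamp_in_seconds, event_type) for one search session."""
    counts = Counter(kind for _, kind in events)
    times = [t for t, _ in events]
    return {
        "query_iterations": counts["query"],
        "surrogates_seen": counts["surrogate_shown"],
        "documents_read": counts["doc_opened"],
        "documents_saved": counts["doc_saved"],
        "time_used_seconds": max(times) - min(times) if times else 0,
    }

print(objective_measures([(0, "query"), (5, "surrogate_shown"),
                          (9, "doc_opened"), (30, "doc_saved")]))
```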

How to measure performance?  Subjective measures:  searchers’ satisfaction with the interaction  searchers’ self-perception of their task completeness  searchers’ preference for a search system/interface

Research Questions  RQ1: Could a human searcher outperform a topic distillation engine?  RQ2: Would a human user perform better with a topic distillation engine than with a generic content-oriented search engine?  RQ3: Could a user’s performance be improved by organizing the search results in a form appropriate for the topic distillation task?

Experimental Design (two engines × three conditions):
 Best match engine (Rutgers): Auto (no interaction), Linear + Human, Hierarchical + Human
 Topic distillation engine (CSIRO): Auto (no interaction), Linear + Human, Hierarchical + Human

Experimental design: topics are divided into blocks. The system and topic-block combination for the first four searchers:
 S1: System I: B1a; System II: B2a
 S2: System I: B2a; System II: B1a
 S3: System II: B1a; System I: B2a
 S4: System II: B2a; System I: B1a
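
The rotation for the remaining searchers follows the same pattern; here is a minimal sketch that reproduces the combinations listed for S1–S4, assuming the sub-block letter (a–d) advances every four searchers, which is an assumption rather than the documented procedure:

```python
SYSTEMS = ["System I", "System II"]

def assignment(searcher_index):
    """(system, topic-block) pairs in presentation order for one searcher.

    Counterbalances which system is seen first and which block is searched
    on which system; the topic sub-block (a-d) rotates every four searchers.
    """
    sub = "abcd"[(searcher_index // 4) % 4]
    first_sys = SYSTEMS[(searcher_index // 2) % 2]
    second_sys = SYSTEMS[1 - (searcher_index // 2) % 2]
    first_block, second_block = f"B1{sub}", f"B2{sub}"
    if searcher_index % 2 == 1:            # swap blocks for every second searcher
        first_block, second_block = second_block, first_block
    return [(first_sys, first_block), (second_sys, second_block)]

for s in range(4):   # prints the combinations shown above for S1-S4
    print(f"S{s + 1}:", assignment(s))
```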

Experimental Procedure (in time order):
 Entry questionnaire
 Tutorial and demo
 Hands-on practice
 For each system, for each topic:
   Pre-search questionnaire
   Search on the topic
   After-search questionnaire
 After-system questionnaire (once per system)
 Exit questionnaire

Research Question I – Human vs. Machine
Hypothesis: There would be no difference in the quality of resource pages between a manually constructed and an automatically generated list.
(Design diagram: the Auto and Human cells under both the Best Match engine (Rutgers) and the Topic Distillation engine (CSIRO).)

Interfaces from Rutgers – Ranked List

Interfaces from CSIRO - Ranked List

Research Question I – Human vs. Machine
Result (ratings table): Relevance and Depth for the Auto and Human conditions, under Best Match (L) and Topic Distillation (L).
Relevance: The page is relevant to the topic (1=Agree strongly, 3=Neutral, 5=Disagree strongly).
Depth: Is the page too broad, too narrow, or at the right level of detail for the topic? (1=Too broad, 3=Right level, 5=Too narrow).

Research Question I – Human vs. Machine
Result: human + machine >> machine.
(Ratings table: Relevance and Depth, Auto vs. Human, under Best Match (L) and Topic Distillation (L); rating scales as above.)

Research Question I – Human vs. Machine
Result: A system without human interaction, optimized for topic distillation, performs as well as a non-optimized system with human interaction.
(Ratings table: Relevance and Depth, Auto vs. Human, under Best Match (L) and Topic Distillation (L); rating scales as above.)

Research Question II – Machine helps human
Hypothesis: The performance gain from a topic distillation engine over a generic content-oriented search engine would be transferred to the human performance with each engine.
(Design diagram: the Auto and Human cells under both the Best Match engine (Rutgers) and the Topic Distillation engine (CSIRO).)

Research Question II – Machine helps human
Result: Topic distillation is better than automatic best match.
(Ratings table: Relevance (1) and Depth (3), Auto vs. Human, under Best Match (L) and Topic Distillation (L); rating scales as above.)

Research Question II – Machine helps human
Result: Performance improvement at the system level transfers to interactive retrieval (p < 0.01; p < 0.004).
(Ratings table: Relevance (1) and Depth (3), Auto vs. Human, under Best Match (L) and Topic Distillation (L); rating scales as above.)
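
The slides do not name the statistical test behind these p-values. One reasonable way to obtain such a comparison is a paired test over per-topic (or per-searcher) mean ratings, for example a Wilcoxon signed-rank test; the numbers below are illustrative only and are not the study's data:

```python
from scipy.stats import wilcoxon

# Illustrative per-topic mean relevance ratings (lower is better on the 1-5 scale).
best_match_human = [2.4, 2.1, 2.8, 2.6, 2.3, 2.7, 2.5, 2.2]
topic_dist_human = [1.9, 1.7, 2.2, 2.0, 1.8, 2.1, 2.0, 1.6]

statistic, p_value = wilcoxon(best_match_human, topic_dist_human)
print(f"Wilcoxon signed-rank: statistic={statistic:.2f}, p={p_value:.4f}")
```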

Research Question III – Linear vs. Hierarchical
Hypothesis: The interface for constructing a resource page would be more useful to the potential user if it reflects the structure of the problem domain, rather than simply presenting a list of unrelated documents.
(Design diagram: the Linear Human and Hierarchical Human cells under both the Best Match engine (Rutgers) and the Topic Distillation engine (CSIRO).)

Research Question III – Linear vs. Hierarchical  Methods  Categorise/group search results according to their organisational/business-unit structure. (CSIRO + Rutgers)  Summarise the information from each site with the titles of the top three relevant pages. (CSIRO)
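
A minimal sketch of that grouping step: cluster a ranked result list by site (hostname) and label each site with the titles of its top three pages. The result fields and example URLs are assumptions for illustration, not the CSIRO or Rutgers implementation:

```python
from collections import OrderedDict
from urllib.parse import urlparse

def group_by_site(ranked_results, titles_per_site=3):
    """ranked_results: list of {'url': ..., 'title': ...} dicts, best first.

    Returns site -> {'entry': highest-ranked url on the site,
                     'summary': titles of its top pages}, in rank order.
    """
    sites = OrderedDict()
    for result in ranked_results:
        host = urlparse(result["url"]).netloc
        group = sites.setdefault(host, {"entry": result["url"], "summary": []})
        if len(group["summary"]) < titles_per_site:
            group["summary"].append(result["title"])
    return sites

results = [  # illustrative results for the "cotton industry" topic
    {"url": "http://www.usda.gov/cotton/overview.html", "title": "Cotton Program Overview"},
    {"url": "http://www.usda.gov/cotton/harvest.html", "title": "Harvesting Cotton"},
    {"url": "http://www.osha.gov/textiles/index.html", "title": "Safety and Health Topics: Textiles"},
]
for site, info in group_by_site(results).items():
    print(f"{site}: " + "; ".join(info["summary"]))
```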

Interfaces from Rutgers – Hierarchical Interface

Interfaces from CSIRO – Site Summary

Interfaces from CSIRO – Sitemap

Hypothesis III – Linear vs. Hierarchical (1)
Result: no significant differences.
(Ratings table: Relevance, Depth, Coverage and Repetition, for Linear vs. Hierarchical, under Best Match and Topic Distillation.)

Hypothesis III – Linear vs. Hierarchical (2)
Result: the hierarchical interface is conducive to less interaction.
(Interaction measures, Linear vs. Hierarchical under Best Match and Topic Distillation: iterations/queries, time (seconds), number of surrogates seen, number viewed, number selected, number ever saved, number finally saved.)

Hypothesis III – Linear vs. Hierarchical (3)
Result: no significant difference in subjective self-assessment of task success.
(Questionnaire measures, Linear vs. Hierarchical under Best Match and Topic Distillation: perceived usefulness, perceived focus, perceived coverage, enough time. All variables are on a 1–7 Likert scale, least to most.)

Hypothesis III – Linear vs. Hierarchical (4)
Result: a consistent and sometimes significant preference for the hierarchical interface.
(Preference counts, with columns Linear / Hierarchical / No Difference under Best Match and Topic Distillation, for: easy to use, task support, overall preference.)

Summary  The automatic topic distillation engine did as well as human users using a non-tuned search system.  Engaging the user’s effort has a significant effect on task performance.  Users performed significantly better with a search engine tuned to the topic distillation task.  Comparing hierarchical delivery with the ranked list:  There is no significant difference between the ranked list interfaces and the hierarchical interfaces in terms of the four performance criteria.  The hierarchical interface is conducive to less interaction.  Subjects preferred the hierarchical interfaces.

Conclusions  Tuning the system to the task is a good idea.  The structured display is a good idea:  For the same level of performance, users made less effort.  Users preferred a display of results that is tuned to the task.

Thank you …