1 SIMS 247: Information Visualization and Presentation Marti Hearst Nov 2 and Nov 7, 2005.

Presentation transcript:

1 SIMS 247: Information Visualization and Presentation Marti Hearst Nov 2 and Nov 7, 2005

2 Outline Why Text is Tough Single-document Visualization Visualizing Concept Spaces –Clusters –Category Hierarchies Visualizing Query Specifications Visualizing Retrieval Results Usability Study Meta-Analysis

3 Why Visualize Text? To help with Information Retrieval –give an overview of a collection –show user what aspects of their interests are present in a collection –help user understand why documents were retrieved as a result of a query Text Data Mining –Mainly clustering & nodes-and-links Software Engineering –not really text, but has some similar properties

4 Why Text is Tough Text is not pre-attentive Text consists of abstract concepts –which are difficult to visualize Text represents similar concepts in many different ways –space ship, flying saucer, UFO, figment of imagination Text has very high dimensionality –Tens or hundreds of thousands of features –Many subsets can be combined together

5 Why Text is Tough The Dog.

6 Why Text is Tough The Dog. The dog cavorts. The dog cavorted.

7 Why Text is Tough The man. The man walks.

8 Why Text is Tough The man walks the cavorting dog. So far, we can sort of show this in pictures.

9 Why Text is Tough As the man walks the cavorting dog, thoughts arrive unbidden of the previous spring, so unlike this one, in which walking was marching and dogs were baleful sentinels outside unjust halls. How do we visualize this?

10 Why Text is Tough Abstract concepts are difficult to visualize Combinations of abstract concepts are even more difficult to visualize –time –shades of meaning –social and psychological concepts –causal relationships

11 Why Text is Tough Language only hints at meaning Most meaning of text lies within our minds and common understanding –“How much is that doggy in the window?” how much: social system of barter and trade (not the size of the dog) “doggy” implies childlike, plaintive, probably cannot do the purchasing on their own “in the window” implies behind a store window, not really inside a window, requires notion of window shopping

12 Why Text is Tough General categories have no standard ordering (nominal data) Categorization of documents by single topics misses important distinctions Consider an article about –NAFTA –The effects of NAFTA on truck manufacture –The effects of NAFTA on productivity of truck manufacture in the neighboring cities of El Paso and Juarez

13 Why Text is Tough Other issues about language –ambiguous (many different meanings for the same words and phrases) –different combinations imply different meanings

14 Why Text is Tough I saw Pathfinder on Mars with a telescope. Pathfinder photographed Mars. The Pathfinder photograph mars our perception of a lifeless planet. The Pathfinder photograph from Ford has arrived. The Pathfinder forded the river without marring its paint job.

15 Why Text is Easy Text is highly redundant –When you have lots of it –Pretty much any simple technique can pull out phrases that seem to characterize a document Instant summary: –Extract the most frequent words from a text –Remove the most common English words
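The "instant summary" recipe on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the stopword list is a tiny hypothetical stand-in for a real list of common English words, and the tokenizer is a simple regular expression.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that",
             "was", "he", "his", "for", "with", "on", "as", "at", "by", "be"}

def instant_summary(text, n=5):
    """Return the n most frequent non-stopword words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

sample = "The dog cavorts. The man walks the dog. The dog cavorted in the park."
top_words = instant_summary(sample, n=2)
```

Run over a long text, a list like this is what produces the word counts shown on the "Guess the Text" slide.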

16 Guess the Text 478 said 233 god 201 father 187 land 181 jacob 160 son 157 joseph 134 abraham 121 earth 119 man 118 behold 113 years 104 wife 101 name 94 pharaoh

17 Visualizing Individual Documents Early approach: SuperBook Showing term occurrences: TextArc

18 Superbook

19 TextArc

20 SeeSoft: Showing Text Content using a linear representation and brushing and linking (Eick & Wills 95)

21 Virtual Shakespeare (Small ‘96)

22 Text Collection Overviews How can we show an overview of the contents of a text collection? –Show info external to the docs e.g., date, author, source, number of inlinks does not show what they are about –Show the meanings or topics in the docs a list of titles results of clustering words or documents organize according to categories (next time)

23 The Need to Group Interviews with lay users often reveal a desire for better organization of retrieval results Useful for suggesting where to look next –People prefer links over generating search terms –But only when the links are for what they want Three main approaches for text and images: –Group items according to pre-defined categories –Group items into automatically-created clusters –Group items according to common keywords Ojakaar and Spool, Users Continue After Category Links, UIETips Newsletter, http://world.std.com/~uieweb/Articles/

24 Categories Human-created –But often automatically assigned to items Arranged in hierarchy, network, or facets –Can assign multiple categories to items –Or place items within categories Usually restricted to a fixed set –So help reduce the space of concepts Intended to be readily understandable –To those who know the underlying domain –Provide a novice with a conceptual structure There are many already made up! However, until recently, their use in interfaces has been –Under-investigated –Not met their promise

25 Clustering “The art of finding groups in data” –Kaufman and Rousseeuw Groups are formed according to associations and commonalities among the data’s features. –There are dozens of algorithms, more all the time –Most need a way of determining similarity or difference between a pair of items –In text clustering, documents usually represented as a vector of weighted features which are some transformation on the words –Similarity between documents is a weighted measure of feature overlap
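As a rough illustration of "documents as vectors of weighted features" and "similarity as weighted feature overlap", here is a minimal Python sketch. It assumes raw term counts as the weights and cosine similarity as the overlap measure; both are common choices but are assumptions here, not something the slide specifies.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent a document as a sparse vector of raw term counts."""
    return Counter(text.lower().split())

def cosine_similarity(vec_a, vec_b):
    """Weighted feature overlap, normalized by the two vector lengths."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = tf_vector("space ship lands on mars")
doc_b = tf_vector("space ship photographs mars")
doc_c = tf_vector("the dog cavorts")
sim_ab = cosine_similarity(doc_a, doc_b)  # shares three terms
sim_ac = cosine_similarity(doc_a, doc_c)  # shares none
```

Note that this measure sees "space ship" and "flying saucer" as unrelated, which is exactly the synonymy problem raised earlier in the lecture.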

26 Clustering Potential benefits: –Find the main themes in a set of documents Potentially useful if the user wants a summary of the main themes in the subcollection Potentially harmful if the user is interested in less dominant themes –More flexible than pre-defined categories There may be important themes that have not been anticipated –Disambiguate ambiguous terms (e.g., ACL) –Clustering retrieved documents tends to group those relevant to a complex query together Hearst, Pedersen, Reexamining the Cluster Hypothesis, SIGIR’96

27 Scatter/Gather Clustering Developed at PARC in the late 80’s/early 90’s Top-down approach –Start with k seeds (documents) to represent k clusters –Each document assigned to the cluster with the most similar seed To choose the seeds: –Cluster in a bottom-up manner –Hierarchical agglomerative clustering Start with n documents, compare all by pairwise similarity, combine the two most similar documents to make a cluster Now compare both clusters and individual documents to find the most similar pair to combine Continue until k clusters remain Use the centroid of each of these as seeds –Centroid: average of the weighted vectors Can recluster a cluster to produce a hierarchy of clusters Pedersen, Cutting, Karger, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, SIGIR 1992
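The two phases described on this slide (bottom-up agglomeration to find k seeds, then top-down assignment to the nearest seed) can be sketched as follows. This is a simplified sketch, not the published algorithm: it agglomerates over all documents rather than a sample, compares clusters by the cosine similarity of their centroids (an assumed linkage choice), and expects 1 <= k <= number of documents. All function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average of the weighted vectors in one cluster."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def agglomerate(vectors, k):
    """Bottom-up: merge the two most similar clusters until k remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def scatter(vectors, k):
    """Top-down: use cluster centroids as seeds, assign each doc to its nearest seed."""
    seeds = [centroid(c) for c in agglomerate(vectors, k)]
    labels = []
    for vec in vectors:
        sims = [cosine(vec, seed) for seed in seeds]
        labels.append(sims.index(max(sims)))
    return labels

# Hypothetical weighted document vectors, invented for illustration.
docs = [
    {"dog": 2.0, "walk": 1.0},
    {"dog": 1.0, "leash": 1.0},
    {"mars": 2.0, "rover": 1.0},
    {"mars": 1.0, "telescope": 1.0},
]
labels = scatter(docs, k=2)
```

Reclustering the documents that share one label would give the next level of the cluster hierarchy mentioned on the slide.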

28 Scatter/Gather

29 Northern Light Web Search: Started out with clustering. Then integrated with categories. Then did not do web search and used only categories.

30

31

32 Visualizing Clustering Results Use clustering to map the entire huge multidimensional document space into a huge number of small clusters. Use dimension reduction and then project these onto a 2D/3D graphical representation

33 Clustering Multi-Dimensional Document Space (image from Wise et al 95)

34 Clustering Multi-Dimensional Document Space (image from Wise et al 95)

35 Kohonen Feature Maps on Text (from Chen et al., JASIS 49(7))

36 Is it useful? 4 Clustering Visualization Usability Studies

37 Clustering for Search Study 1 This study compared –a system with 2D graphical clusters –a system with 3D graphical clusters –a system that shows textual clusters Novice users Only textual clusters were helpful (and they were difficult to use well) Kleiboemer, Lazear, and Pedersen. Tailoring a retrieval system for naive users. SDAIR’96

38 Clustering Study 2: Kohonen Feature Maps Comparison: Kohonen Map and Yahoo Task: –“Window shop” for interesting home page –Repeat with other interface Results: –Starting with map could repeat in Yahoo (8/11) –Starting with Yahoo unable to repeat in map (2/14) Chen, Houston, Sewell, Schatz, Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques. JASIS 49(7): (1998)

39 Kohonen Feature Maps (Lin 92, Chen et al. 97)

40 Study 2 (cont.) Participants liked: –Correspondence of region size to # documents –Overview (but also wanted zoom) –Ease of jumping from one topic to another –Multiple routes to topics –Use of category and subcategory labels Chen, Houston, Sewell, Schatz, Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques. JASIS 49(7): (1998)

41 Study 2 (cont.) Participants wanted: –hierarchical organization –other ordering of concepts (alphabetical) –integration of browsing and search –correspondence of color to meaning –more meaningful labels –labels at same level of abstraction –fit more labels in the given space –combined keyword and category search –multiple category assignment (sports+entertain) (These can all be addressed with faceted hierarchical categories) Chen, Houston, Sewell, Schatz, Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques. JASIS 49(7): (1998)

42 Clustering Study 3: NIRVE Each rectangle is a cluster. Larger clusters closer to the “pole”. Similar clusters near one another. Opening a cluster causes a projection that shows the titles.

43 Study 3 This study compared : –3D graphical clusters –2D graphical clusters –textual clusters 15 participants, between-subject design Tasks –Locate a particular document –Locate and mark a particular document –Locate a previously marked document –Locate all clusters that discuss some topic –List more frequently represented topics Visualization of search results: a comparative evaluation of text, 2D, and 3D interfaces Sebrechts, Cugini, Laskowski, Vasilakis and Miller, SIGIR ‘99.

44 Study 3 Results (time to locate targets) –Text clusters fastest –2D next –3D last –With practice (6 sessions) 2D neared text results; 3D still slower –Computer experts were just as fast with 3D Certain tasks equally fast with 2D & text –Find particular cluster –Find an already-marked document But anything involving text (e.g., find title) much faster with text. –Spatial location rotated, so users lost context Helpful viz features –Color coding (helped text too) –Relative vertical locations Visualization of search results: a comparative evaluation of text, 2D, and 3D interfaces Sebrechts, Cugini, Laskowski, Vasilakis and Miller, SIGIR ‘99.

45 Clustering Study 4 Compared several factors Findings: –Topic effects dominate (this is a common finding) –Strong difference in results based on spatial ability –No difference between librarians and other people –No evidence of usefulness for the cluster visualization Aspect windows, 3-D visualizations, and indirect comparisons of information retrieval systems, Swan & Allan, SIGIR 1998.

46 Summary: Visualizing for Search Using Clusters Huge 2D maps may be inappropriate focus for information retrieval –cannot see what the documents are about –space is difficult to browse for IR purposes –(tough to visualize abstract concepts) Perhaps more suited for pattern discovery and gist-like overviews

47 Category Combinations Let’s show categories instead of clusters

48 DynaCat (Pratt, Hearst, & Fagan 99)

49 DynaCat (Pratt 97) Decide on important question types in advance –What are the adverse effects of drug D? –What is the prognosis for treatment T? Make use of MeSH categories Retain only those types of categories known to be useful for this type of query.

50 DynaCat Study Design –Three queries –24 cancer patients –Compared three interfaces ranked list, clusters, categories Results –Participants strongly preferred categories –Participants found more answers using categories –Participants took same amount of time with all three interfaces

51 MultiTrees (Furnas & Zacks ’94)

52 Cat-a-Cone: Multiple Simultaneous Categories Key Ideas: –Separate documents from category labels –Show both simultaneously Link the two for iterative feedback Distinguish between: –Searching for Documents vs. –Searching for Categories

53 Cat-a-Cone Interface

54 Cat-a-Cone Catacomb: (definition 2b, online Websters) “A complex set of interrelated things” Makes use of earlier PARC work on 3D+animation: Rooms Henderson and Card 86 IV: Cone Tree Robertson, Card, Mackinlay 93 Web Book Card, Robertson, York 96

55 Collection Retrieved Documents search Category Hierarchy browse query terms

56 ConeTree for Category Labels Browse/explore category hierarchy –by search on label names –by growing/shrinking subtrees –by spinning subtrees Affordances –learn meaning via ancestors, siblings –disambiguate meanings –all cats simultaneously viewable

57 Virtual Book for Result Sets –Categories on Page (Retrieved Document) linked to Categories in Tree –Flipping through Book Pages causes some Subtrees to Expand and Contract –Most Subtrees remain unchanged –Book can be Stored for later Re-Use

58 Improvements over Standard Category Interfaces Integrate category selection with viewing of categories Show all categories + context Show relationship of retrieved documents to the category structure But … do users understand and like the 3D?

59 The FLAMENCO Project Basic idea similar to Cat-a-Cone But use familiar HTML interaction to achieve similar goals Usability results are very strong for users who care about the collection.

60 Co-Citation Analysis Has been around since the 50’s. (Small, Garfield, White & McCain) Used to identify core sets of –authors, journals, articles for particular fields –Not for general search Main Idea: –Find pairs of papers that are cited together by third papers –Look for commonalities A nice demonstration by Eugene Garfield at: –
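A minimal Python sketch of the counting step, using the usual definition of co-citation: two papers are co-cited each time a third paper cites both of them. The citation data and the function name are hypothetical, invented purely for illustration.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citations):
    """Count how often each pair of papers is cited together.

    citations maps a citing paper to the list of papers it cites.
    """
    counts = Counter()
    for cited in citations.values():
        # Sort so that each pair has one canonical (a, b) form.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts

# Hypothetical citation data.
citations = {
    "paper_x": ["A", "B", "C"],
    "paper_y": ["A", "B"],
    "paper_z": ["B", "C"],
}
pairs = cocitation_counts(citations)
```

Pairs with high counts are candidates for the "core sets" of a field; the same counting works at the level of authors or journals by substituting those for paper identifiers.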

61 Co-citation analysis (From Garfield 98)

62 Co-citation analysis (From Garfield 98)

63 Co-citation analysis (From Garfield 98)

64 Query Specification

65 Command-Based Query Specification command attribute value connector … –find pa shneiderman and tw user# What are the attribute names? What are the command names? What are allowable values?
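To make the "command attribute value connector …" grammar concrete, here is a small hypothetical parser in Python. It assumes the connectors are `and`, `or`, and `not`, that every clause is one attribute code followed by its value, and that the query is non-empty; it does not model any real catalog system's full syntax.

```python
# Assumed connector words; the slide only shows "and".
CONNECTORS = {"and", "or", "not"}

def parse_command(query):
    """Split 'command attribute value connector ...' into structured clauses."""
    tokens = query.split()
    command, rest = tokens[0], tokens[1:]
    clauses = []
    connector = None
    i = 0
    while i < len(rest):
        if rest[i].lower() in CONNECTORS:
            connector = rest[i].lower()
            i += 1
            continue
        attribute = rest[i]
        i += 1
        value_tokens = []
        # Everything up to the next connector belongs to this attribute's value.
        while i < len(rest) and rest[i].lower() not in CONNECTORS:
            value_tokens.append(rest[i])
            i += 1
        clauses.append({"connector": connector,
                        "attribute": attribute,
                        "value": " ".join(value_tokens)})
    return {"command": command, "clauses": clauses}

parsed = parse_command("find pa shneiderman and tw user#")
```

Even with a parser this small, the user still has to know the command names, attribute codes, and allowable values in advance, which is the usability problem the slide raises.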

66 Form-Based Query Specification (Altavista)

67 Form-Based Query Specification (Melvyl)

68 Form-based Query Specification (Infoseek)

69 Direct Manipulation Spec. VQUERY (Jones 98)

70 Menu-based Query Specification (Young & Shneiderman 93)

71 Context

72 Putting Results in Context Visualizations of Query Term Distribution –KWIC, TileBars, SeeSoft Visualizing Shared Subsets of Query Terms –InfoCrystal, VIBE, Lattice Views Table of Contents as Context –Superbook, Cha-Cha, DynaCat Organizing Results with Tables –Envision, SenseMaker Using Hyperlinks –WebCutter

73 Putting Results in Context Interfaces should –give hints about the roles terms play in the collection –give hints about what will happen if various terms are combined –show explicitly why documents are retrieved in response to the query –summarize compactly the subset of interest

74 KWIC (Keyword in Context) An old standard, ignored until recently by internet search engines –used in some intranet engines, e.g., Cha-Cha
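A keyword-in-context display can be sketched in Python as below. The window size, the bracket markup around the hit, and the function name are illustrative assumptions.

```python
import re

def kwic(text, keyword, width=3):
    """Return each occurrence of keyword with `width` words of context on each side."""
    words = text.split()
    target = keyword.lower()
    lines = []
    for i, word in enumerate(words):
        # Compare with punctuation stripped so "dog." still matches "dog".
        if re.sub(r"\W", "", word).lower() == target:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines

text = "The man walks the cavorting dog. The dog cavorts near the river."
hits = kwic(text, "dog", width=2)
```

Each returned line shows the query term in its surrounding words, which is what lets a searcher judge at a glance how the term is being used.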

75 Highlighting Keywords in Context

76

77 Superbook (Remde et al. 89) Hyper-media software manual Functions: –Word Lookup: –Table of Contents: Dynamic fisheye view of the hierarchical topics list –Page of Text: show selected page and highlighted search terms Hypertext features linking through search words rather than page links

78 Display of Retrieval Results Goal: minimize time/effort for deciding which documents to examine in detail Idea: show the roles of the query terms in the retrieved documents, making use of document structure

79 TileBars Graphical Representation of Term Distribution and Overlap Simultaneously Indicate: –relative document length –query term frequencies –query term distributions –query term overlap
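The data behind a TileBar can be sketched as a small grid: one row per query term set, one column per document segment, and a count in each cell. The Python sketch below assumes fixed-size word windows as segments and text characters as stand-ins for gray levels; both are simplifications made here for illustration.

```python
def tilebar(text, term_sets, segment_size=5):
    """Return one row per term set; each cell counts hits in one segment."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    segments = [words[i:i + segment_size]
                for i in range(0, len(words), segment_size)]
    return [[sum(1 for w in seg if w in terms) for seg in segments]
            for terms in term_sets]

def render(grid, shades=" .:#"):
    """Map counts to darker characters, mimicking gray-scale tiles."""
    top = len(shades) - 1
    return ["".join(shades[min(count, top)] for count in row) for row in grid]

text = ("Osteoporosis causes bone loss in adults. Research on prevention "
        "continues. A new study examines prevention of bone loss.")
grid = tilebar(text, [{"osteoporosis", "bone"}, {"prevention", "cure"}],
               segment_size=6)
rows = render(grid)
```

The number of columns reflects relative document length, the darkness of a cell reflects term frequency, the pattern along a row shows the distribution, and columns that are dark in more than one row show where the query terms overlap.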

80

81

82 Exploiting Visual Properties Variation in gray scale saturation imposes a universal, perceptual order (Bertin et al. ‘83) Varying shades of gray show varying quantities better than color (Tufte ‘83) Differences in shading should align with the values being presented (Kosslyn et al. ‘83)

83 Key Aspect: Faceted Queries Conjunct of disjuncts Each disjunct is a concept –osteoporosis, bone loss –prevention, cure –research, Mayo clinic, study User does not have to specify which are main topics, which are subtopics Ranking algorithm gives higher weight to overlap of topics –This kind of query works better at high-precision queries than similarity search (Hearst 95)
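A faceted query (a conjunct of disjuncts) can be represented as a list of term sets, one set per concept. The Python sketch below ranks documents first by how many facets they cover and then by total hits. That scoring rule is an assumption chosen to illustrate "higher weight to overlap of topics", not the actual ranking algorithm, and multi-word terms such as "bone loss" are reduced to single words for simplicity.

```python
def facet_hits(doc_words, facets):
    """For each facet (a set of alternative terms), count matching words."""
    return [sum(1 for w in doc_words if w in facet) for facet in facets]

def score(doc_text, facets):
    """Rank key: number of facets covered first, total hits second."""
    words = [w.strip(".,;:!?").lower() for w in doc_text.split()]
    hits = facet_hits(words, facets)
    covered = sum(1 for h in hits if h > 0)
    return (covered, sum(hits))

# One set per concept; terms inside a set are alternatives (a disjunct).
facets = [{"osteoporosis", "bone"}, {"prevention", "cure"}, {"research", "study"}]

# Hypothetical documents, invented for illustration.
docs = {
    "d1": "Osteoporosis osteoporosis osteoporosis bone bone bone",
    "d2": "A study of osteoporosis prevention",
    "d3": "Research on bone loss",
}
ranking = sorted(docs, key=lambda d: score(docs[d], facets), reverse=True)
```

Document d1 repeats one concept many times but still ranks last, because d2 and d3 touch more of the concepts in the query.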

84 TileBars Summary Preliminary User Studies –users understand them –find them helpful in some situations, but probably slower than just reading titles –sometimes terms need to be disambiguated

85 More Recent Attempts Analyzing retrieval results –KartOO –Grokker

86

87

88

89

90 Query Term Subsets Show which subsets of query terms occur in which subsets of retrieved documents
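The grouping that such displays rely on can be sketched in Python: for each retrieved document, find which query terms it contains, then bucket documents by that exact subset. The function name and sample documents are hypothetical.

```python
from collections import defaultdict

def group_by_term_subset(docs, query_terms):
    """Map each subset of query terms to the documents containing exactly that subset."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        words = {w.strip(".,;:!?").lower() for w in text.split()}
        present = frozenset(t for t in query_terms if t in words)
        groups[present].append(doc_id)
    return dict(groups)

# Hypothetical retrieved documents.
docs = {
    "d1": "The dog cavorts on Mars.",
    "d2": "The man walks the dog.",
    "d3": "Pathfinder photographed Mars.",
}
groups = group_by_term_subset(docs, ["dog", "mars", "man"])
```

With n query terms there are up to 2^n buckets, which is one reason displays of this kind become hard to read beyond a handful of terms.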

91 Term Occurrences in Results Sets Show how often each query term occurs in retrieved documents –VIBE (Korfhage ‘91) –InfoCrystal (Spoerri ‘94) –Problems: can’t see overlap of terms within docs quantities not represented graphically more than 4 terms hard to handle no help in selecting terms to begin with

92 InfoCrystal (Spoerri 94)

93 VIBE (Olson et al. 93, Korfhage 93)

94 Term Occurrences in Results Sets –Problems: can’t see overlap of terms within docs quantities not represented graphically more than 4 terms hard to handle no help in selecting terms to begin with

95 DLITE (Cousins 97) Supporting the Information Seeking Process –UI to a digital library Direct manipulation interface Workcenter approach –experts create workcenters –lots of tools for one task –contents persistent

96 Slide by Shankar Raman DLITE (Cousins 97) Drag and Drop interface Reify queries, sources, retrieval results Animation to keep track of activity

97 IR Infovis Meta-Analysis (Chen & Yu ’00) Goal –Find invariant underlying relations suggested collectively by empirical findings from many different studies Procedure –Examine the literature of empirical infoviz studies 35 studies between 1991 and focused on information retrieval tasks But due to wide differences in the conduct of the studies and the reporting of statistics, could use only 6 studies

98 IR Infovis Meta-Analysis (Chen & Yu ’00) Conclusions: –IR Infoviz studies not reported in a standard format –Individual cognitive differences had the largest effect Especially on accuracy Somewhat on efficiency –Holding cognitive abilities constant, users did better with simpler visual-spatial interfaces –The combined effect of visualization is not statistically significant –Misc Tilebars and Scatter/Gather are well-known enough to not require citations!!

99 Summary: Search and Doc Viz Visualization still has yet to prove its usefulness for search and documents Needs to integrate with more accurate dialogue systems