Marti Hearst, SIMS 247 Lecture 19: Visualizing Text and Text Collections, March 31, 1998

Presentation transcript:

Today and Next Time
  Purposes of Text Visualization
  Why Text is Tough
  Visualizing Concept Spaces
    – For Collection Overviews
  Visualizing Query Specifications
    – Selecting Term Subsets
    – Viewing Metadata
  Visualizing Retrieval Results
    – Term Hit Distribution
    – Grouping of Retrieved Documents

Why Visualize Text?
  To help with Information Access
    – give an overview of a collection
    – show users which aspects of their interests are present in a collection
    – help users understand why documents were retrieved as a result of a query
  Text Data Mining
    – not much has been done in this yet
  Software Engineering
    – not technically text, but has some similar properties

Why Text is Tough
  Text is not pre-attentive
  Text consists of abstract concepts
    – which are difficult to visualize
  Text represents similar concepts in many different ways
    – space ship, flying saucer, UFO, figment of imagination
  Text has very high dimensionality
    – Tens or hundreds of thousands of features
    – Many subsets can be combined together
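The last two points can be made concrete with a minimal sketch. It assumes a bag-of-words representation (one feature per distinct word), which the slide does not specify; `bag_of_words` is a hypothetical helper introduced only for illustration.

```python
def bag_of_words(text):
    """Represent a text as its set of distinct lowercase words (an assumed, simplistic model)."""
    return set(text.lower().split())

# Three of the phrasings from the slide for roughly the same concept.
phrases = ["space ship", "flying saucer", "UFO"]
bags = [bag_of_words(p) for p in phrases]

# Every distinct word is its own feature, so the feature count grows with the vocabulary.
vocabulary = set().union(*bags)

# Words shared by all three phrasings: none, although the concept is similar.
shared = bags[0] & bags[1] & bags[2]

print(len(vocabulary), shared)
```

Three two-word phrases already need five features and share none of them, which is the mismatch the slide describes at a much smaller scale.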

Text Meaning is NOT pre-attentive
  SUBJECT PUNCHED QUICKLY OXIDIZED   TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
  CERTAIN QUICKLY PUNCHED METHODS    NIATREC YLKCIUQ DEHCNUP SDOHTEM
  SCIENCE ENGLISH RECORDS COLUMNS    ECNEICS HSILGNE SDROCER SNMULOC
  GOVERNS PRECISE EXAMPLE MERCURY    SNREVOG ESICERP ELPMAXE YRUCREM
  CERTAIN QUICKLY PUNCHED METHODS    NIATREC YLKCIUQ DEHCNUP SDOHTEM
  GOVERNS PRECISE EXAMPLE MERCURY    SNREVOG ESICERP ELPMAXE YRUCREM
  SCIENCE ENGLISH RECORDS COLUMNS    ECNEICS HSILGNE SDROCER SNMULOC
  SUBJECT PUNCHED QUICKLY OXIDIZED   TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
  CERTAIN QUICKLY PUNCHED METHODS    NIATREC YLKCIUQ DEHCNUP SDOHTEM
  SCIENCE ENGLISH RECORDS COLUMNS    ECNEICS HSILGNE SDROCER SNMULOC

Why Text is Tough
  Abstract concepts are difficult to visualize
  Combinations of abstract concepts are even more difficult to visualize
    – time
    – shades of meaning
    – social and psychological concepts
    – causal relationships

Why Text is Tough
  The Dog.

Why Text is Tough
  The Dog.
  The dog cavorts.
  The dog cavorted.

Why Text is Tough
  The man.
  The man walks.

Why Text is Tough
  The man walks the cavorting dog.
  So far, we can sort of show this in pictures.

Why Text is Tough
  As the man walks the cavorting dog, thoughts arrive unbidden of the previous spring, so unlike this one, in which walking was marching and dogs were baleful sentinels outside unjust halls.
  How do we visualize this?

Why Text is Tough
  Language only hints at meaning
  Most meaning of text lies within our minds and common understanding
    – “How much is that doggy in the window?”
        how much: social system of barter and trade (not the size of the dog)
        “doggy” implies childlike, plaintive, probably cannot do the purchasing on their own
        “in the window” implies behind a store window, not really inside a window, requires notion of window shopping

Why Text is Tough
  General categories have no standard ordering (nominal data)
  Categorization of documents by single topics misses important distinctions
  Consider an article about
    – NAFTA
    – The effects of NAFTA on truck manufacture
    – The effects of NAFTA on productivity of truck manufacture in the neighboring cities of El Paso and Juarez

Why Text is Tough
  Other issues about language
    – ambiguous (many different meanings for the same words and phrases)
    – different combinations imply different meanings

Why Text is Tough
  I saw Pathfinder on Mars with a telescope.
  Pathfinder photographed Mars.
  The Pathfinder photograph mars our perception of a lifeless planet.
  The Pathfinder photograph from Ford has arrived.
  The Pathfinder forded the river without marring its paint job.

Why Text is Easy
  Text is easier when you have a lot of it
    – Highly redundant
    – Because people are good at finding associations, just about any simple algorithm can get “good” results for coarse tasks
        Pull out “important” phrases
        Find “meaningfully” related words
        Create “summary” from document
    – Major problem: Evaluation
  People usually search on relatively coarse meanings

Why Text is Easy
  Pretty much any simple technique can pull out phrases that seem to characterize a document
  Most frequent words from a lecture last fall:
    109 slide       69 to          37 view        37 version     37 graphic
    37 first        37 back        36 previous    36 next        32 of
    31 the          30 recall      28 relevant    27 precision   25 retrieved
    25 documents    21 and         18 evaluate    15 a           13 what
    13 vs           13 how         12 trec        12 is          12 high
    12 for          10 relevance   10 queries     10 on          9 information
    8 x             8 why          8 as           8 answer       7 search
    7 maron         7 document     7 blair        6 top          6 results
    6 measure       6 length       6 in           6 evaluation   6 curves

Why Text is Easy
  Same text, removing most frequent words in language and most frequent in this text:
    30 recall       28 relevant    27 precision   25 retrieved   25 documents
    18 evaluate     13 vs          12 trec        12 high        10 relevance
    10 queries      9 information  8 x            8 answer       7 search
    7 maron         7 document     7 blair        6 top          6 results
    6 measure       6 length       6 evaluation   6 curves
  These words can act as a simple summary of the document
    – people are good at inferring the relations
    – redundancy in the word meanings
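The filtering described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny hypothetical stop list (real lists hold a few hundred words), `extra_noise` stands in for the words that are frequent in this particular text (for example slide-navigation words), and the sample string is made up rather than taken from the lecture.

```python
import re
from collections import Counter

# Hypothetical, deliberately small list of words that are frequent in the language.
STOP_WORDS = {"the", "of", "to", "and", "a", "is", "for", "in", "on", "as",
              "what", "how", "why"}

def top_terms(text, n=5, extra_noise=frozenset()):
    """Return the n most frequent words, skipping stop words and
    document-specific noise words supplied by the caller."""
    words = re.findall(r"[a-z]+", text.lower())
    skip = STOP_WORDS | set(extra_noise)
    counts = Counter(w for w in words if w not in skip)
    return counts.most_common(n)

# Made-up sample text in the spirit of the lecture transcript on the slide.
sample = ("slide next slide previous recall and precision of the retrieved "
          "documents slide recall is high what is precision recall")

summary = top_terms(sample, n=2, extra_noise={"slide", "next", "previous"})
print(summary)
```

Without the two filters, `slide` would dominate the list, just as it does in the unfiltered counts on the previous slide.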

Text Collection Overviews
  How can we show an overview of the contents of a text collection?
    – show info external to the docs
        e.g., date, author, source, number of inlinks
        does not show what they are about
    – show the meanings or topics in the docs
        show a list of titles
        show results of clustering words or documents
        organize according to categories
    – how to show arbitrary subsets?

Showing Collection Overviews
  Showing the Documents
    – External Metadata
        e.g., author, date, hyperlink connectivity
        Does not show what the documents are about
    – Visualizations of Document Clusters
        Mapping document clusters into nearby points
        Networks with Force-Directed Placement
        Kohonen Feature Maps
    – Zoomable “Landscapes”

Showing Collection Overviews
  Distinguish between
    – showing the documents
    – showing the words/concepts
  Distinguish between
    – a general overview
    – a query-centered view

Clustering for Collection Overviews
  Two main steps
    – cluster the documents according to the words they have in common
    – map the cluster representation onto an (interactive) 2D or 3D representation
  Since text has tens of thousands of features
    – the mapping to 2D loses a tremendous amount of information
    – only very coarse themes are detected
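The first step can be sketched as follows. The slide does not name a clustering algorithm, so this uses a swapped-in, deliberately simple one: greedy single-pass clustering over raw term counts with cosine similarity. The function names, the `threshold` value, and the four toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    """Count the words of a document (assumed whitespace tokenization)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: a document joins the first cluster
    whose summed term vector is similar enough, else it starts a new one."""
    clusters = []  # list of (summed term vector, member indices)
    for i, text in enumerate(docs):
        vec = term_vector(text)
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)   # fold the document into the cluster
                members.append(i)
                break
        else:
            clusters.append((Counter(vec), [i]))
    return [members for _, members in clusters]

docs = [
    "recall precision evaluation",
    "dog walks man dog",
    "precision recall curves",
    "man walks dog",
]
groups = cluster(docs)
print(groups)
```

Each cluster here lives in a space with one dimension per distinct word; the second step, squeezing that into 2D or 3D, is where the information loss noted on the slide occurs.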

Clustering for Collection Overviews
    – Scatter/Gather
        show main themes as groups of text summaries
    – Scatter Plots
        show docs as points; closeness indicates nearness in cluster space
        show main themes of docs as visual clumps or mountains
    – Kohonen Feature maps
        show main themes as adjacent polygons
    – BEAD
        show main themes as links within a force-directed placement network
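As a rough illustration of force-directed placement, the sketch below treats every pair of documents as connected by a spring whose rest length is one minus their similarity, then nudges points until the springs relax. This is a generic spring model under assumed parameters (`steps`, `rate`, a hand-written similarity matrix), not the algorithm used by BEAD or any other system named above.

```python
import math

def force_layout(similarity, steps=200, rate=0.05):
    """Place n items in 2D so that similar items end up close together.
    similarity is a symmetric n x n matrix with values in [0, 1]."""
    n = len(similarity)
    # Deterministic start: items evenly spaced on the unit circle.
    pos = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
           for i in range(n)]
    for _ in range(steps):
        new_pos = []
        for i, (xi, yi) in enumerate(pos):
            fx = fy = 0.0
            for j, (xj, yj) in enumerate(pos):
                if i == j:
                    continue
                dx, dy = xj - xi, yj - yi
                dist = math.hypot(dx, dy) or 1e-9
                target = 1.0 - similarity[i][j]   # spring rest length
                pull = (dist - target) / dist     # > 0 attracts, < 0 repels
                fx += pull * dx
                fy += pull * dy
            new_pos.append((xi + rate * fx, yi + rate * fy))
        pos = new_pos  # update all points at once
    return pos

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical similarities: docs 0 and 1 share a theme, as do docs 2 and 3.
similarity = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]
layout = force_layout(similarity)
print(distance(layout[0], layout[1]) < distance(layout[0], layout[2]))
```

The two themes show up as two visual clumps. A layout like this can settle in a poor arrangement when the starting positions are unlucky, which is one reason real systems use more careful initialization.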

Scatter/Gather

Scatter Plot of Clusters (Chen et al. 97)

BEAD (Chalmers 97)

BEAD (Chalmers 96)

Example: Themescapes (Wise et al. 95)

Kohonen Feature Maps (Lin 92, Chen et al. 97) (594 docs)

Galaxy of News (Rennison 95)

Visualizing Concept Overviews
  Huge 2D maps may be an inappropriate focus for information retrieval
    – cannot see what the documents are about
    – documents are forced into one position in semantic space
    – space is difficult to browse for IR purposes
  Perhaps more suited for pattern discovery
    – problem: often only one view on the space

How Useful are Graphical Clusters?
  A study (Kleiboemer et al. 96) compared
    – a system with 2D graphical clusters
    – a system with 3D graphical clusters
    – a system that shows textual clusters
  Novice users
  Only textual clusters were helpful (and they were difficult to use well)

Next Time
  Visualizing Query Term Specification
    – available words
    – available metadata
  Visualizing Retrieval Results