Marti Hearst SIMS 247
SIMS 247 Lecture 19
Visualizing Text and Text Collections
March 31, 1998

Today and Next Time
  Purposes of Text Visualization
  Why Text is Tough
  Visualizing Concept Spaces
    – For Collection Overviews
  Visualizing Query Specifications
    – Selecting Term Subsets
    – Viewing Metadata
  Visualizing Retrieval Results
    – Term Hit Distribution
    – Grouping of Retrieved Documents

Why Visualize Text?
  To help with Information Access
    – give an overview of a collection
    – show user what aspects of their interests are present in a collection
    – help user understand why documents were retrieved as a result of a query
  Text Data Mining
    – not much has been done in this yet
  Software Engineering
    – not technically text, but has some similar properties

Why Text is Tough
  Text is not pre-attentive
  Text consists of abstract concepts
    – which are difficult to visualize
  Text represents similar concepts in many different ways
    – space ship, flying saucer, UFO, figment of imagination
  Text has very high dimensionality
    – Tens or hundreds of thousands of features
    – Many subsets can be combined together

Text Meaning is NOT pre-attentive
  SUBJECT PUNCHED QUICKLY OXIDIZED
  TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
  CERTAIN QUICKLY PUNCHED METHODS
  NIATREC YLKCIUQ DEHCNUP SDOHTEM
  SCIENCE ENGLISH RECORDS COLUMNS
  ECNEICS HSILGNE SDROCER SNMULOC
  GOVERNS PRECISE EXAMPLE MERCURY
  SNREVOG ESICERP ELPMAXE YRUCREM
  CERTAIN QUICKLY PUNCHED METHODS
  NIATREC YLKCIUQ DEHCNUP SDOHTEM
  GOVERNS PRECISE EXAMPLE MERCURY
  SNREVOG ESICERP ELPMAXE YRUCREM
  SCIENCE ENGLISH RECORDS COLUMNS
  ECNEICS HSILGNE SDROCER SNMULOC
  SUBJECT PUNCHED QUICKLY OXIDIZED
  TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
  CERTAIN QUICKLY PUNCHED METHODS
  NIATREC YLKCIUQ DEHCNUP SDOHTEM
  SCIENCE ENGLISH RECORDS COLUMNS
  ECNEICS HSILGNE SDROCER SNMULOC

Why Text is Tough
  Abstract concepts are difficult to visualize
  Combinations of abstract concepts are even more difficult to visualize
    – time
    – shades of meaning
    – social and psychological concepts
    – causal relationships

Why Text is Tough
  The Dog.

Why Text is Tough
  The Dog.
  The dog cavorts.
  The dog cavorted.

Why Text is Tough
  The man.
  The man walks.

Why Text is Tough
  The man walks the cavorting dog.
  So far, we can sort of show this in pictures.

Why Text is Tough
  As the man walks the cavorting dog, thoughts arrive unbidden of the previous spring, so unlike this one, in which walking was marching and dogs were baleful sentinels outside unjust halls.
  How do we visualize this?

Why Text is Tough
  Language only hints at meaning
  Most meaning of text lies within our minds and common understanding
    – “How much is that doggy in the window?”
      “how much”: social system of barter and trade (not the size of the dog)
      “doggy” implies childlike, plaintive, probably cannot do the purchasing on their own
      “in the window” implies behind a store window, not really inside a window, requires notion of window shopping

Why Text is Tough
  General categories have no standard ordering (nominal data)
  Categorization of documents by single topics misses important distinctions
  Consider an article about
    – NAFTA
    – The effects of NAFTA on truck manufacture
    – The effects of NAFTA on productivity of truck manufacture in the neighboring cities of El Paso and Juarez

Why Text is Tough
  Other issues about language
    – ambiguous (many different meanings for the same words and phrases)
    – different combinations imply different meanings

Why Text is Tough
  I saw Pathfinder on Mars with a telescope.
  Pathfinder photographed Mars.
  The Pathfinder photograph mars our perception of a lifeless planet.
  The Pathfinder photograph from Ford has arrived.
  The Pathfinder forded the river without marring its paint job.

Why Text is Easy
  Text is easier when you have a lot of it
    – Highly redundant
    – Because people are good at finding associations, just about any simple algorithm can get “good” results for coarse tasks
      Pull out “important” phrases
      Find “meaningfully” related words
      Create “summary” from document
    – Major problem: Evaluation
  People usually search on relatively coarse meanings

Why Text is Easy
  Pretty much any simple technique can pull out phrases that seem to characterize a document
  Most frequent words from a lecture last fall:
    109 slide      69 to          37 view        37 version     37 graphic
     37 first      37 back        36 previous    36 next        32 of
     31 the        30 recall      28 relevant    27 precision   25 retrieved
     25 documents  21 and         18 evaluate    15 a           13 what
     13 vs         13 how         12 trec        12 is          12 high
     12 for        10 relevance   10 queries     10 on           9 information
      8 x           8 why          8 as           8 answer       7 search
      7 maron       7 document     7 blair        6 top          6 results
      6 measure     6 length       6 in           6 evaluation   6 curves

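A frequency list like the one on this slide takes only a few lines of code. The following is a minimal sketch, not the procedure actually used for the slide: the function name top_words, the tokenization rule (runs of letters, lowercased), and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent lowercase words with their counts."""
    # Assumption: a "word" is any run of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)

# Hypothetical sample text, not the lecture the slide's counts came from.
sample = (
    "Recall and precision evaluate retrieved documents. "
    "High recall means most relevant documents are retrieved. "
    "High precision means most retrieved documents are relevant."
)

for word, count in top_words(sample, 5):
    print(count, word)
```

Even on this tiny sample, topical words (retrieved, documents, recall) rise to the top alongside function words, which is the effect the slide illustrates.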
Why Text is Easy
  Same text, removing most frequent words in language and most frequent in this text:
    30 recall      28 relevant    27 precision   25 retrieved   25 documents   18 evaluate
    13 vs          12 trec        12 high        10 relevance   10 queries      9 information
     8 x            8 answer       7 search       7 maron        7 document     7 blair
     6 top          6 results      6 measure      6 length       6 evaluation   6 curves
  These words can act as a simple summary of the document
    – people are good at inferring the relations
    – redundancy in the word meanings

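The filtering step described on this slide can be sketched as a set difference over the counted words. The counts below are a subset of the pairs shown on the preceding slide; the two exclusion lists and the function name filter_counts are hand-made assumptions, since the slides do not say which word lists were used.

```python
# A subset of the (word, count) pairs from the preceding slide.
counts = {
    "slide": 109, "to": 69, "view": 37, "next": 36, "of": 32, "the": 31,
    "recall": 30, "relevant": 28, "precision": 27, "retrieved": 25,
    "documents": 25, "and": 21, "evaluate": 18, "a": 15, "what": 13,
}

# Assumption: a tiny hand-made list of very common English words.
STOPWORDS = {"to", "of", "the", "and", "a", "what"}
# Assumption: words frequent only in this particular text
# (here, slide-navigation boilerplate), chosen by hand.
TEXT_SPECIFIC = {"slide", "view", "next"}

def filter_counts(counts, *exclude_sets):
    """Drop every word that appears in any of the exclusion sets."""
    excluded = set().union(*exclude_sets)
    return {w: c for w, c in counts.items() if w not in excluded}

summary = filter_counts(counts, STOPWORDS, TEXT_SPECIFIC)
for word, count in sorted(summary.items(), key=lambda wc: -wc[1]):
    print(count, word)
```

What remains (recall, relevant, precision, retrieved, documents, evaluate) matches the head of the filtered list on the slide.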
Text Collection Overviews
  How can we show an overview of the contents of a text collection?
    – show info external to the docs
      e.g., date, author, source, number of inlinks
      does not show what they are about
    – show the meanings or topics in the docs
      show a list of titles
      show results of clustering words or documents
      organize according to categories
    – how to show arbitrary subsets?

Showing Collection Overviews
  Showing the Documents
    – External Metadata
      e.g., author, date, hyperlink connectivity
      Does not show what the documents are about
    – Visualizations of Document Clusters
      Mapping document clusters into nearby points
      Networks with Force-Directed Placement
      Kohonen Feature Maps
    – Zoomable “Landscapes”

Showing Collection Overviews
  Distinguish between
    – showing the documents
    – showing the words/concepts
  Distinguish between
    – a general overview
    – a query-centered view

Clustering for Collection Overviews
  Two main steps
    – cluster the documents according to the words they have in common
    – map the cluster representation onto an (interactive) 2D or 3D representation
  Since text has tens of thousands of features
    – the mapping to 2D loses a tremendous amount of information
    – only very coarse themes are detected

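The first of the two steps, clustering documents by the words they share, can be sketched with word-count vectors and cosine similarity. This is an illustrative assumption rather than the method of any system covered in the lecture: the greedy single-pass strategy, the 0.3 threshold, the function names, and the toy documents are all hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words vector: word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose seed (first) document
    is at least `threshold` similar; otherwise it starts a new cluster.
    """
    vectors = [term_vector(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical toy documents.
docs = [
    "trade agreement tariffs trade policy",
    "truck manufacture and trade tariffs",
    "dog walks in the park",
    "the man walks the dog",
]
print(cluster(docs))
```

Each distinct word is one dimension of the vector, so a real collection quickly reaches the tens of thousands of features mentioned on the slide.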
Clustering for Collection Overviews
    – Scatter/Gather
      show main themes as groups of text summaries
    – Scatter Plots
      show docs as points; closeness indicates nearness in cluster space
      show main themes of docs as visual clumps or mountains
    – Kohonen Feature maps
      show main themes as adjacent polygons
    – BEAD
      show main themes as links within a force-directed placement network

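To make the idea of force-directed placement concrete, here is a minimal sketch of a generic spring model: items that should be similar are pulled together and dissimilar items are pushed apart until the 2D distances roughly match the target distances. This is not the BEAD algorithm itself; the function names, step count, learning rate, and the toy distance matrix are assumptions for illustration.

```python
import math
import random

def spring_layout(dist, steps=500, rate=0.05, seed=0):
    """Place n items in 2D so layout distances approximate dist[i][j]."""
    rng = random.Random(seed)
    n = len(dist)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                d = math.hypot(dx, dy) or 1e-9
                # Spring force: positive pulls i toward j (too far apart),
                # negative pushes i away from j (too close together).
                force = d - dist[i][j]
                fx += force * dx / d
                fy += force * dy / d
            pos[i][0] += rate * fx
            pos[i][1] += rate * fy
    return pos

def layout_distance(pos, i, j):
    """Euclidean distance between items i and j in the layout."""
    return math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1])

# Hypothetical target distances: items 0,1 form one theme, 2,3 another.
dist = [
    [0.0, 0.2, 1.0, 1.0],
    [0.2, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
pos = spring_layout(dist)
print(layout_distance(pos, 0, 1), layout_distance(pos, 0, 2))
```

These four target distances cannot all be satisfied exactly in the plane, so the layout is a compromise. With tens of thousands of dimensions squeezed into two, the same information loss is far more severe.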
Scatter/Gather

Scatter Plot of Clusters (Chen et al. 97)

BEAD (Chalmers 97)

BEAD (Chalmers 96)

Example: Themescapes (Wise et al. 95)

Kohonen Feature Maps (Lin 92, Chen et al. 97) (594 docs)

Galaxy of News (Rennison 95)

Visualizing Concept Overviews
  Huge 2D maps may be an inappropriate focus for information retrieval
    – cannot see what the documents are about
    – documents are forced into one position in semantic space
    – space is difficult to browse for IR purposes
  Perhaps more suited for pattern discovery
    – problem: often only one view on the space

How Useful are Graphical Clusters?
  A study (Kleiboemer et al. 96) compared
    – a system with 2D graphical clusters
    – a system with 3D graphical clusters
    – a system that shows textual clusters
  Novice users
  Only textual clusters were helpful (and they were difficult to use well)

Next Time
  Visualizing Query Term Specification
    – available words
    – available metadata
  Visualizing Retrieval Results