SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000
Today l Review Basic Human-Computer Interaction Principles l Starting Points for Search
UI and Viz in IA: Chapter Contents
Slide by James Landay Human-Computer Interaction (HCI) l Human –the end-user of a program –the others in the organization l Computer –the machine the program runs on l Interaction –the user tells the computer what they want –the computer communicates results
Slide by James Landay What is HCI? [Diagram labels: Humans, Technology, Task, Design, Organizational & Social Issues]
Shneiderman on HCI l Well-designed interactive computer systems promote: –Positive feelings of success, competence, and mastery. –Allow users to concentrate on their work, rather than on the system.
Slide by James Landay Usability Design Goals l Ease of learning –faster the second time and so on... l Recall –remember how from one session to the next l Productivity –perform tasks quickly and efficiently l Minimal error rates –if they occur, good feedback so user can recover l High user satisfaction –confident of success
Adapted from slide by James Landay Usability Slogans (from Nielsen’s Usability Engineering) l Your best guess is not good enough l The user is always right l The user is not always right l Users are not designers l Designers are not users l Less is more l Details matter
Adapted from slide by James Landay Design Guidelines l Set of design rules to follow l Apply at multiple levels of design l Are neither complete nor orthogonal l Have psychological underpinnings (ideally)
Slide by James Landay Who builds UIs? l A team of specialists (ideally) –graphic designers –interaction / interface designers –technical writers –marketers –test engineers –software engineers
Adapted from slide by James Landay How to Design and Build UIs l Task analysis l Rapid prototyping l Evaluation l Implementation Design Prototype Evaluate Iterate at every stage!
Slide by James Landay Task Analysis l Observe existing work practices l Create examples and scenarios of actual use l Try out new ideas before building software
Slide by James Landay Rapid Prototyping l Build a mock-up of design l Low fidelity techniques –paper sketches –cut, copy, paste –video segments l Interactive prototyping tools –Visual Basic, HyperCard, Director, etc. l UI builders –NeXT, etc.
Slide by James Landay Evaluation l Test with real users (participants) l Build models l Low-cost techniques –expert evaluation –walkthroughs
Information Seeking Behavior l Two parts of a process: »search and retrieval »analysis and synthesis of search results l This is a fuzzy area; we will look at several different working theories.
Standard Model l Assumptions: –Maximizing precision and recall simultaneously –The information need remains static –The value is in the resulting document set
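The two measures named in the first assumption can be sketched in a few lines. This is a minimal illustration, assuming retrieved and relevant documents are given as sets of IDs; the function name `precision_recall` and the sample IDs are hypothetical, not from the lecture.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs for a single query.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d7"}
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving more documents tends to raise recall and lower precision, which is why the standard model treats maximizing both at once as the goal.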
Problem with Standard Model: l Users learn during the search process: –Scanning titles of retrieved documents –Reading retrieved documents –Viewing lists of related topics/thesaurus terms –Navigating hyperlinks l Some users don’t like long disorganized lists of documents
“Berry-Picking” as an Information Seeking Strategy (Bates 90) l Standard IR model –assumes the information need remains the same throughout the search process l Berry-picking model –interesting information is scattered like berries among bushes –the query is continually shifting
A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89) Q0 Q1 Q2 Q3 Q4 Q5
Implications l Interfaces should make it easy to store intermediate results l Interfaces should make it easy to follow trails with unanticipated results l Makes evaluation more difficult.
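One way to picture "storing intermediate results" and "following trails" is a small session record. This is only a sketch under assumptions: the class names `TrailStep` and `SearchTrail`, their methods, and the sample queries and document IDs are all hypothetical and do not come from Bates or from any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class TrailStep:
    """One query in the evolving search, plus the documents kept from it."""
    query: str
    kept: list = field(default_factory=list)


@dataclass
class SearchTrail:
    """Ordered record of queries (Q0, Q1, ...) and the 'berries' picked."""
    steps: list = field(default_factory=list)

    def new_query(self, query):
        # Each reformulation starts a new step on the trail.
        self.steps.append(TrailStep(query))
        return self.steps[-1]

    def keep(self, doc_id):
        # Save an intermediate result under the current query.
        if not self.steps:
            raise ValueError("issue a query before keeping results")
        self.steps[-1].kept.append(doc_id)

    def basket(self):
        # All kept documents across the session, first occurrence only.
        seen = []
        for step in self.steps:
            for doc_id in step.kept:
                if doc_id not in seen:
                    seen.append(doc_id)
        return seen


trail = SearchTrail()
trail.new_query("information seeking")
trail.keep("d3")
trail.new_query("berrypicking model")
trail.keep("d3")
trail.keep("d9")
print([step.query for step in trail.steps], trail.basket())
```

The value of the session lies in the accumulated basket and the path taken, not in any single result set, which is also why evaluation becomes harder.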
Search Tactics and Strategies l Search Tactics –Bates 79 l Search Strategies –Bates 89 –O’Day and Jeffries 93
Tactics vs. Strategies l Tactic: short term goals and maneuvers –operators, actions l Strategy: overall planning –link a sequence of operators together to achieve some end
Information Search Tactics (after Bates 79) l Source-level tactics –navigate to and within sources l Term and Search Formulation tactics –designing search formulation –selection and revision of specific terms within search formulation l Monitoring tactics –keep search on track –(should really be called a strategy)
Term Tactics l Move around a thesaurus –(more on this in 2nd half of class)
Source-level Tactics l “Bibble”: – look for a pre-defined result set – e.g., a good link page on web l Survey: –look ahead, review available options –e.g., don’t simply use the first term or first source that comes to mind l Cut: –eliminate large proportion of search domain –e.g., search on rarest term first
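The Cut example ("search on rarest term first") has a direct analogue in how a conjunctive query can be evaluated. The sketch below is one reading of the tactic in code, assuming a toy inverted index that maps each term to a set of document IDs; the name `and_query` and the index contents are hypothetical.

```python
def and_query(index, terms):
    """Intersect postings sets, starting with the rarest term.

    Starting from the shortest postings set cuts the candidate
    pool as much as possible before the commoner terms are checked.
    """
    # Order query terms by document frequency, rarest first.
    ordered = sorted(terms, key=lambda t: len(index.get(t, ())))
    if not ordered:
        return set()
    candidates = set(index.get(ordered[0], ()))
    for term in ordered[1:]:
        if not candidates:
            break  # nothing left to narrow down
        candidates &= set(index.get(term, ()))
    return candidates


# Hypothetical inverted index: term -> set of document IDs.
toy_index = {
    "information": {1, 2, 3, 4, 5, 6},
    "retrieval": {2, 3, 5},
    "berrypicking": {3},
}
result = and_query(toy_index, ["information", "retrieval", "berrypicking"])
print(result)
```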
Source-level Tactics (cont.) l Stretch –use source in unintended way –e.g., use patents to find addresses l Scaffold –take an indirect route to goal –e.g., when looking for references to obscure poet, look up contemporaries
Monitoring Tactics (strategy-level) l Check –compare original goal with current state l Weigh –make a cost/benefit analysis of current or anticipated actions l Pattern –recognize common strategies l Correct Errors l Record –keep track of (incomplete) paths
Additional Considerations (Bates 79) l Need a Sort tactic l More detail is needed about short-term cost/benefit decision rule strategies l When to stop? –How to judge when enough information has been gathered? –How to decide when to give up an unsuccessful search? –When to stop searching in one source and move to another?
After the Search l How to synthesize information is part of the information use process l One “theory” is called sensemaking –Russell et al. paper –Dan Russell is speaking today at 4pm! Room 110. Different topic.
Post-Search Analysis Types (O’Day & Jeffries 93) l Trends l Comparisons l Aggregation and Scaling l Identifying a Critical Subset l Assessing l Interpreting l The rest: »cross-reference »summarize »find evocative visualizations »miscellaneous
SenseMaking (Russell et al. 93) l The process of encoding retrieved information to answer task-specific questions l Combine –internal cognitive resources –external retrieved resources l Create a good representation –an iterative process –contend with a cost/benefit tradeoff
The SenseMaking Loop, From Russell et al., 93
Observed Activities of Business Analysts Working From Russell et al., 93
The SenseMaking Process, From Russell et al., InterCHI 93.
Sensemaking (Russell et al. 93) l An anytime activity –At any point a workable solution is available –Usually more time -> better solution –Usually more properties -> better solution
Sensemaking (Russell et al. 93) l A good strategy –Maximizes long term rate of gain –Example: »new technology brings more info faster »this causes a uniform increase in useful and useless information »best strategy: throw out bad stuff faster
Sensemaking (Russell et al. 93) l Most of the effort is in the synthesis of a good representation –covers the data –increase usability –decrease cost-of-use
UI and Viz in IA: Chapter Contents
Starting Points for Search l Types: –Lists –Overviews »Categories »Clusters »Links/Hyperlinks –Examples, Wizards, Guided Tours
Starting Points for Search l Faced with a prompt or an empty entry form … how to start? –Lists of sources –Overviews »Clusters »Category Hierarchies/Subject Codes »Co-citation links –Examples, Wizards, and Guided Tours –Automatic source selection
List of Sources l Have to guess based on the name l Requires prior exposure/experience
Dialog box for choosing sources in old Lexis-Nexis interface
Overviews in the User Interface l Supervised (Manual) Category Overviews –Yahoo! –HiBrowse –MeSHBrowse l Unsupervised (Automated) Groupings –Clustering –Kohonen Feature Maps
Incorporating Categories into the Interface l Yahoo is the standard method l Problems: –Hard to search, meant to be navigated. –Only one category per document (usually)
Evidence l Web search engines are heavily using –Link analysis –Page popularity –Interwoven categories l These all find dominant home pages
More Complex Example: MeSH and MedLine l MeSH Category Hierarchy –Medical Subject Headings –~18,000 labels –manually assigned –~8 labels/article on average –avg depth: 4.5, max depth 9 l Top Level Categories: anatomy, diagnosis, related disc, animals, psych, technology, disease, biology, humanities, drugs, physics
Category Labels l Advantages: –Interpretable –Capture summary information –Describe multiple facets of content –Domain dependent, and so descriptive l Disadvantages –Do not scale well (for organizing documents) –Domain dependent, so costly to acquire –May mis-match users’ interests
MeshBrowse (Korn & Shneiderman 95) Grow the category structure gradually and in response to semantic similarity
HiBrowse (Pollitt 97) Show combinations of categories given that some categories already seen
Large Category Sets l Problems for User Interfaces » Too many categories to browse » Too many docs per category » Docs belong to multiple categories » Need to integrate search » Need to show the documents
Text Clustering l Finds overall similarities among groups of documents l Finds overall similarities among groups of tokens l Picks out some themes, ignores others
Scatter/Gather Cutting, Pedersen, Tukey & Karger 92, 93, Hearst & Pedersen 95 l How it works –Cluster sets of documents into general “themes”, like a table of contents –Display the contents of the clusters by showing topical terms and typical titles –User chooses subsets of the clusters and re-clusters the documents within –Resulting new groups have different “themes” l Originally used to give collection overview l Evidence suggests more appropriate for displaying retrieval results in context
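The scatter, summarize, gather, and re-scatter cycle can be sketched on a handful of toy documents. This is a minimal illustration under stated assumptions: the one-pass assignment to hand-picked seed documents is a stand-in, not the clustering algorithm from the cited papers, and the function names (`scatter`, `summarize`, `gather`) and the document texts are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def scatter(docs, seeds):
    """Assign every document to its most similar seed document."""
    clusters = {seed: [] for seed in seeds}
    for doc_id, text in docs.items():
        vec = vectorize(text)
        best = max(seeds, key=lambda s: cosine(vec, vectorize(docs[s])))
        clusters[best].append(doc_id)
    return clusters


def summarize(docs, members, n_terms=3):
    """Topical terms: the most frequent terms across a cluster."""
    counts = Counter()
    for doc_id in members:
        counts.update(vectorize(docs[doc_id]))
    return [term for term, _ in counts.most_common(n_terms)]


def gather(docs, clusters, chosen):
    """Pool the documents of the clusters the user selected."""
    return {d: docs[d] for c in chosen for d in clusters[c]}


# Hypothetical documents, keyed by ID.
docs = {
    "a": "star film actor award",
    "b": "tv series film director",
    "c": "star galaxy telescope astronomy",
    "d": "galaxy star cluster astronomy",
    "e": "tv film director award",
}

clusters = scatter(docs, seeds=["a", "c"])           # scatter
for seed, members in clusters.items():
    print(seed, members, summarize(docs, members))   # display themes

subset = gather(docs, clusters, chosen=["a"])        # user picks a cluster
reclustered = scatter(subset, seeds=["a", "b"])      # re-scatter the subset
print(reclustered)
```

Re-clustering only the gathered subset is what lets the new groups show different, finer-grained themes than the first pass.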
S/G Example: query on “star” Encyclopedia text 14 sports; 8 symbols; 47 film, tv; 68 film, tv (p); 7 music; 97 astrophysics; 67 astronomy (p); 12 stellar phenomena; 10 flora/fauna; 49 galaxies, stars; 29 constellations; 7 miscellaneous Clustering and re-clustering is entirely automated
Using Clustering in Document Ranking l Cluster entire collection l Find cluster centroid that best matches the query l This has been explored extensively –it is expensive –it doesn’t work well
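The mechanism described here, matching the query against cluster centroids, can be sketched as follows. The slide notes that this approach is expensive and does not work well, so the code only illustrates the idea. It assumes the clusters already exist and uses raw term counts with cosine similarity; the cluster names, documents, and function names are hypothetical.

```python
import math
from collections import Counter


def centroid(vectors):
    """Mean term-count vector of a cluster's documents."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical pre-computed clusters of bag-of-words documents.
clusters = {
    "autos": [Counter("car engine fuel".split()),
              Counter("car safety airbag".split())],
    "space": [Counter("star galaxy telescope".split()),
              Counter("star planet orbit".split())],
}

query = Counter("car safety".split())
centroids = {name: centroid(vecs) for name, vecs in clusters.items()}

# The best-matching cluster is the one whose centroid is closest to the query.
best = max(centroids, key=lambda name: cosine(query, centroids[name]))
print(best)
```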
Two Queries: Two Clusterings AUTO, CAR, ELECTRIC vs. AUTO, CAR, SAFETY The main differences are the clusters that are central to the query 8 control drive accident … 25 battery california technology … 48 import j. rate honda toyota … 16 export international unit japan 3 service employee automatic … 6 control inventory integrate … 10 investigation washington … 12 study fuel death bag air … 61 sale domestic truck import … 11 japan export defect unite …
Another use of clustering l Use clustering to map the entire huge multidimensional document space into a huge number of small clusters. l “Project” these onto a 2D graphical representation –Group by doc: SPIRE/Kohonen maps –Group by words: Galaxy of News/HotSauce/Semio
Clustering Multi-Dimensional Document Space (image from Wise et al. 95)
Kohonen Feature Maps on Text (from Chen et al., JASIS 49(7))
UWMS Data Mining Workshop Study of Kohonen Feature Maps l H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS 49(7) l Comparison: Kohonen Map and Yahoo l Task: –“Window shop” for interesting home page –Repeat with other interface l Results: –Starting with map could repeat in Yahoo (8/11) –Starting with Yahoo unable to repeat in map (2/14)
Study (cont.) l Participants liked: –Correspondence of region size to # documents –Overview (but also wanted zoom) –Ease of jumping from one topic to another –Multiple routes to topics –Use of category and subcategory labels
Study (cont.) l Participants wanted: –hierarchical organization –other ordering of concepts (alphabetical) –integration of browsing and search –correspondence of color to meaning –more meaningful labels –labels at same level of abstraction –fit more labels in the given space –combined keyword and category search –multiple category assignment (sports+entertain)
Visualization of Clusters –Huge 2D maps may be inappropriate focus for information retrieval »Can’t see what documents are about »Documents forced into one position in semantic space »Space is difficult to use for IR purposes »Hard to view titles –Perhaps more suited for pattern discovery »problem: often only one view on the space
Summary: Clustering l Advantages: –Get an overview of main themes –Domain independent l Disadvantages: –Many of the ways documents could group together are not shown –Not always easy to understand what they mean –Different levels of granularity
Next Time Interfaces for Query Specification