INFM 700: Session 7 Unstructured Information (Part II) Jimmy Lin The iSchool University of Maryland Monday, March 10, 2008 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

The IR Black Box [Diagram: a query goes into the search "black box" and a ranked list comes out]

The Role of Interfaces [Diagram of the information access process: source selection → query formulation → search → ranked list → selection → examination → documents → delivery, with loops back for source reselection and for system, vocabulary, concept, and document discovery] Interfaces help users decide where to start, help users formulate queries, and help users make sense of results and navigate the information space.

Today's Topics Source selection: What should I search? Query formulation: What should my query be? Result presentation: What are the search results? Browsing support: How do I make sense of all these results? Navigation support: Where am I?

Source Selection: Google

Source Selection: Ask

Source Reselection

The Search Box

Advanced Search: Facets

Filter/Flow Degi Young and Ben Shneiderman. (1993) A Graphical Filter/Flow Representation of Boolean Queries: A Prototype Implementation and Evaluation. JASIS, 44(6).

Direct Manipulation Queries Steve Jones. (1998) Graphical Query Specification and Dynamic Result Previews for a Digital Library. Proceedings of UIST 1998.

Result Presentation How should the system present search results to the user? The interface should: provide hints about the roles terms play within the result set and within the collection; provide hints about the relationships between terms; show explicitly why documents were retrieved in response to the query; and compactly summarize the result set.

Alternative Designs One-dimensional lists. Content: title, source, date, summary, ratings, ... Order: retrieval score, date, alphabetic, ... Size: scrolling, specified number, score threshold. More sophisticated multi-dimensional displays.

Binoculars

TileBars Graphical representation of term distribution and overlap in search results. Simultaneously indicates: relative document length, query term frequencies, query term distributions, and query term overlap. Marti Hearst. (1995) TileBars: A Visualization of Term Distribution Information in Full Text Information Access. Proceedings of SIGCHI 1995.

Technique [Diagram: one TileBar per document; bar length shows the relative length of the document, with one row per search term (search term 1, search term 2)] Blocks indicate "chunks" of text, such as paragraphs. Blocks are darkened according to the frequency of the term in the document.
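To make the darkening rule concrete, here is a minimal sketch (not Hearst's implementation) of computing the counts behind one TileBar; the chunk_size and the toy document are illustrative assumptions:

```python
# Sketch: split a document into fixed-size "chunks" (standing in for
# paragraphs) and count each query term per chunk; in a TileBar, each
# block is darkened in proportion to its count, and the number of
# blocks reflects the relative document length.
def tilebar_counts(tokens, query_terms, chunk_size=50):
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    return {term: [chunk.count(term) for chunk in chunks]
            for term in query_terms}

# Toy example using the DBMS/reliability query from the next slide.
doc = ("the dbms stores data " * 10 + "reliability of the dbms matters " * 5).split()
print(tilebar_counts(doc, ["dbms", "reliability"], chunk_size=20))
```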

Example Topic: reliability of DBMS (database systems). Query terms: DBMS, reliability. [Four example TileBars:] a document mainly about both DBMS and reliability; one mainly about DBMS that discusses reliability; one mainly about, say, banking, with a subtopic discussion on DBMS/reliability; one mainly about high-tech layoffs.

TileBars Screenshot

TileBars Summary Compact, graphical representation of term distribution in search results. Simultaneously displays term frequency, distribution, overlap, and document length. However, it does not provide the context in which query terms are used. Do they help? Users intuitively understand them, but the lack of context sometimes causes problems with disambiguation.

Scrollbar-Tilebar (from UMass)

Cat-a-Cone Key ideas: separate documents from category labels; show both simultaneously; link the two for iterative feedback; integrate searching and browsing. Distinguish between searching for documents and searching for categories. Marti A. Hearst and Chandu Karadi. (1997) Cat-a-Cone: An Interactive Interface for Specifying Searches and Viewing Retrieval Results using a Large Category Hierarchy. Proceedings of SIGIR 1997.

Cat-a-Cone Interface

Cat-a-Cone Architecture [Diagram: query terms drive a search over the collection, producing retrieved documents; a category hierarchy supports browsing and is linked to the retrieval results]

Clustering Search Results

Vector Space Model Assumption: documents that are "close together" in vector space "talk about" the same things. [Diagram: document vectors d1–d5 plotted against term axes t1, t2, t3, with angles θ and φ between vectors]

Similarity Metric How about |d1 − d2|? Instead of Euclidean distance, use the "angle" between the vectors. It all boils down to the inner product (dot product) of the vectors.

Components of Similarity The "inner product" (aka dot product) is the key to the similarity function; the denominator handles document length normalization.
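The formula itself was an image on the original slide; reconstructed in standard notation, with w_{i,k} denoting the weight of term k in document d_i:

```latex
\mathrm{sim}(d_i, d_j)
  = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}
  = \frac{\sum_{k=1}^{t} w_{i,k}\, w_{j,k}}
         {\sqrt{\sum_{k=1}^{t} w_{i,k}^{2}}\;\sqrt{\sum_{k=1}^{t} w_{j,k}^{2}}}
```

The numerator is the inner product; dividing by the vector lengths normalizes for document length, so the score depends only on the angle between the vectors.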

Text Clustering What? Automatically partition documents into clusters based on content: documents within each cluster should be similar; documents in different clusters should be different. Why? To discover categories and topics in an unsupervised manner (no sample category labels provided by humans), and to help users make sense of the information space.

The Cluster Hypothesis "Closely associated documents tend to be relevant to the same requests." (van Rijsbergen, 1979) "… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents." (van Rijsbergen, 1979)

Visualizing Clusters [Diagram: clusters of document points with their centroids marked]

Two Strategies Agglomerative (bottom-up) methods: start with each document in its own cluster; iteratively combine smaller clusters to form larger clusters. Divisive (partitional, top-down) methods: directly separate documents into clusters.

HAC HAC = Hierarchical Agglomerative Clustering. Start with each document in its own cluster. Until there is only one cluster: among the current clusters, determine the two clusters c_i and c_j that are most similar; replace c_i and c_j with a single cluster c_i ∪ c_j. The history of merging forms the hierarchy.
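A minimal sketch of this loop in Python, assuming documents are given as vectors and using centroid cosine similarity as the merge criterion (the next slides discuss other choices of cluster similarity):

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def hac(doc_vectors):
    # Start with each document in its own (singleton) cluster.
    clusters = [[i] for i in range(len(doc_vectors))]
    history = []  # the merge history forms the hierarchy
    while len(clusters) > 1:
        # Determine the two most similar clusters c_i and c_j.
        best_sim, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = np.mean([doc_vectors[i] for i in clusters[a]], axis=0)
                cb = np.mean([doc_vectors[i] for i in clusters[b]], axis=0)
                s = cosine(ca, cb)
                if s > best_sim:
                    best_sim, best_pair = s, (a, b)
        a, b = best_pair
        history.append((clusters[a], clusters[b]))
        # Replace c_i and c_j with a single cluster c_i ∪ c_j.
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return history
```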

HAC [Dendrogram: documents A–H merged pairwise, bottom-up, into a single hierarchy]

What's going on geometrically?

Cluster Similarity Assume a similarity function that determines the similarity of two instances: sim(x, y). What's appropriate for documents? And what's the similarity between two clusters? Single link: similarity of the two most similar members. Complete link: similarity of the two least similar members. Group average: average similarity between members.

Different Similarity Functions Single link uses the maximum similarity of pairs; it can result in "straggly" (long and thin) clusters due to a chaining effect. Complete link uses the minimum similarity of pairs; it makes tighter, more spherical clusters.
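The definitions were images on the original slide; in standard notation (a reconstruction), they are:

```latex
\mathrm{sim}_{\text{single}}(c_i, c_j)   = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)
\qquad
\mathrm{sim}_{\text{complete}}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)
\qquad
\mathrm{sim}_{\text{group}}(c_i, c_j)    = \frac{1}{|c_i|\,|c_j|} \sum_{x \in c_i} \sum_{y \in c_j} \mathrm{sim}(x, y)
```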

Non-Hierarchical Clustering Typically, the number of desired clusters, k, must be provided. Randomly choose k instances as seeds, one per cluster. Form initial clusters based on these seeds. Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering. Stop when the clustering converges or after a fixed number of iterations.

K-Means Clusters are determined by the centroids (centers of gravity) of the documents in a cluster: μ(c) = (1/|c|) Σ_{x ∈ c} x. Reassignment of documents to clusters is based on distance to the current cluster centroids.

K-Means Algorithm Let d be the distance measure between documents. Select k random instances {s_1, s_2, … s_k} as seeds. Until the clustering converges or another stopping criterion is met: assign each instance x_i to the cluster c_j such that d(x_i, s_j) is minimal; then update the seeds to the centroid of each cluster, i.e., for each cluster c_j, set s_j = μ(c_j).
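A minimal sketch of the algorithm in Python, assuming Euclidean distance as d (cosine distance is also common for documents):

```python
import numpy as np

def kmeans(doc_vectors, k, max_iters=100, seed=0):
    X = np.asarray(doc_vectors, dtype=float)
    rng = np.random.default_rng(seed)
    # Select k random instances as the initial seeds.
    seeds = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignments = None
    for _ in range(max_iters):
        # Assign each instance x_i to the cluster c_j with minimal d(x_i, s_j).
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # converged: no instance changed clusters
        assignments = new_assignments
        # Update each seed to the centroid of its cluster: s_j = mu(c_j).
        for j in range(k):
            members = X[assignments == j]
            if len(members):
                seeds[j] = members.mean(axis=0)
    return assignments, seeds
```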

K-Means Clustering Example [Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]

K-Means: Discussion How do you select k? Issues: results can vary based on random seed selection; possible consequences include a poor convergence rate or convergence to sub-optimal clusters.

Why cluster for IR? Cluster the collection: retrieve clusters instead of documents. Cluster the results: provide support for browsing. Recall the cluster hypothesis: "Closely associated documents tend to be relevant to the same requests." "… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents."

From Clusters to Centroids [Diagram: the clusters from the earlier slide, with their centroids marked]

Clustering the Collection Basic idea: cluster the document collection, find the centroid of each cluster, then search only on the centroids but retrieve whole clusters. If the cluster hypothesis is true, this should perform better. Why would you want to do this? And why doesn't it work?
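A minimal sketch of the idea, assuming clusters from one of the earlier algorithms and cosine similarity; it scores each centroid against the query, then returns the best-matching cluster's documents wholesale:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cluster_retrieve(query_vec, clusters, doc_vectors, top_n=1):
    # Search only on the centroids...
    scored = []
    for cluster in clusters:
        centroid = np.mean([doc_vectors[i] for i in cluster], axis=0)
        scored.append((cosine(query_vec, centroid), cluster))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # ...but retrieve entire clusters, not individual documents.
    return [doc for _, cluster in scored[:top_n] for doc in cluster]
```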

Clustering the Results Commercial example: Clusty Research example: Scatter/Gather

Scatter/Gather How it works: the system clusters documents into general "themes"; it displays the contents of the clusters by showing topical terms and typical titles; the user chooses a subset of the clusters; the system automatically re-clusters documents within the selected clusters; the new clusters have more refined "themes". Originally used to give a collection overview; evidence suggests it is more appropriate for displaying retrieval results in context. Marti A. Hearst and Jan O. Pedersen. (1996) Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proceedings of SIGIR 1996.
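A minimal sketch of the Scatter/Gather loop, reusing the kmeans() sketch above; show_theme() is a hypothetical stand-in for the real system's topical terms and typical titles:

```python
def show_theme(cluster, titles, n=3):
    # Hypothetical stand-in: the real system extracts topical terms;
    # here we just show a few typical titles from the cluster.
    return ", ".join(titles[i] for i in cluster[:n])

def scatter_gather(doc_ids, doc_vectors, titles, k=5, rounds=3):
    working = list(doc_ids)
    for _ in range(rounds):
        vecs = [doc_vectors[i] for i in working]
        assignments, _ = kmeans(vecs, k)
        clusters = [[working[i] for i, a in enumerate(assignments) if a == j]
                    for j in range(k)]
        for j, c in enumerate(clusters):
            print(f"cluster {j} ({len(c)} docs): {show_theme(c, titles)}")
        # The user chooses a subset of clusters; the system gathers their
        # documents and re-clusters them into more refined themes.
        chosen = [int(s) for s in input("gather which clusters? ").split()]
        working = [d for j in chosen for d in clusters[j]]
```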

Scatter/Gather Example Query = "star" on encyclopedic text. One clustering pass produces themes such as: symbols (8 docs), film/tv (68 docs), astrophysics (97 docs), astronomy (67 docs), flora/fauna (10 docs). Re-clustering a gathered subset produces more refined themes: sports (14 docs), film/tv (47 docs), music (7 docs), stellar phenomena (12 docs), galaxies/stars (49 docs), constellations (29 docs), miscellaneous (7 docs). Clustering and re-clustering is entirely automated.

Clustering Result Sets Advantages: topically coherent sets of documents are presented to the user together; the user gets a sense of the topics in the result set; supports exploration and browsing of retrieved hits. Disadvantages: clusters might not "make sense"; it may be difficult to understand the topic of a cluster from its summary terms; summary terms might not describe the cluster; additional computational processing is required.

Navigation Support The "back" button isn't enough! Its behavior is counterintuitive to many users. [Diagram: a browsing tree over pages A, B, C, and D] You hit "back" twice from page D. Where do you end up?

PadPrints Tree-based history of recently visited Web pages. History map placed to the left of the browser window. Node = title + thumbnail. Visually shows navigation history. Zoomable: ability to grow and shrink sub-trees. Ron R. Hightower et al. (1998) PadPrints: Graphical Multiscale Web Histories. Proceedings of UIST 1998.

PadPrints Screenshot

PadPrints Thumbnails

Zoomable History

Does it work? The study involved the CHI database and the National Park Service website. In tasks requiring a return to prior pages, PadPrints yielded a 40% savings in time. Users were also more satisfied with PadPrints.

Today's Topics Source selection: What should I search? Query formulation: What should my query be? Result presentation: What are the search results? Browsing support: How do I make sense of all these results? Navigation support: Where am I?