WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 14

Today's Topics
- Latent Semantic Indexing / dimension reduction
- Interactive information retrieval / user interfaces
- Evaluation of interactive retrieval

How LSI Is Used for Text Search
- LSI is a technique for dimension reduction, similar to Principal Component Analysis (PCA).
- Addresses (near-)synonymy: car/automobile.
- Attempts to enable concept-based retrieval.
- Pre-process docs using a technique from linear algebra called Singular Value Decomposition (SVD).
- Reduce dimensionality:
  - Fewer dimensions: more "collapsing of axes", better recall, worse precision.
  - More dimensions: less collapsing, worse recall, better precision.
- Queries are handled in this new (reduced) vector space.

Input: Term-Document Matrix
- A is an m x n matrix of terms (rows) by documents (columns).
- w_ij = (normalized) weighted count of term t_i in document d_j.
- Key idea: factorize this matrix.
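A minimal sketch of building such a matrix from a toy corpus, using raw term counts as the weights (a real system would apply tf-idf or another normalization); the documents and vocabulary are made up for illustration.

    import numpy as np

    # Toy corpus; documents and terms are hypothetical.
    docs = ["car repair shop", "automobile engine repair", "fruit market"]
    vocab = sorted({t for d in docs for t in d.split()})

    # A[i, j] = raw count of term i in document j (stand-in for a weighted count).
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in d.split():
            A[vocab.index(t), j] += 1

    print(vocab)
    print(A)   # m x n term-document matrix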

Matrix Factorization
- A (m x n) = W (m x k) x H (k x n): W is the basis, H is the representation.
- h_j is the representation of d_j in terms of the basis W.
- If rank(W) >= rank(A), then we can always find H so that A = WH.
- Notice the duality of the problem.
- More "semantic" dimensions -> LSI (latent semantic indexing).

Minimization Problem
- Minimize ||A - W S V^T||, i.e. minimize the information loss.
- Given:
  - a norm (for SVD, the 2-norm)
  - constraints on W, S, V (for SVD, W and V are orthonormal and S is diagonal)

Matrix Factorizations: SVD
- A (m x n) = W (m x k) x S (k x k) x V^T (k x n)
- W: basis; S: singular values (diagonal); V^T: representation.
- Restrictions on the factorization: W, V orthonormal; S diagonal.

Dimension Reduction
- For some s << rank(A), zero out all but the s biggest singular values in S. Denote this new version of S by S_s.
- Typically s is in the hundreds, while r = rank(A) could be in the (tens of) thousands.
- Before: A = W S V^t. Let A_s = W S_s V^t = W_s S_s V_s^t.
- A_s is a good approximation to A: the best rank-s approximation according to the 2-norm.
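The "best rank-s approximation" claim is the Eckart-Young theorem; in the notation of these slides, with sigma_{s+1} the largest singular value that was zeroed out:

    A_s \;=\; \arg\min_{\mathrm{rank}(B)\,\le\, s} \| A - B \|_2 ,
    \qquad
    \| A - A_s \|_2 \;=\; \sigma_{s+1}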

Dimension Reduction (figure)
- A_s (m x n) = W (m x k) x S_s (k x k, only the top s singular values nonzero) x V^T (k x n).
- The columns of A_s represent the docs, but in s << m dimensions.
- Best rank-s approximation according to the 2-norm.
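A minimal numpy sketch of the truncation step, assuming a term-document matrix A like the toy one above; np.linalg.svd returns singular values in descending order, so keeping the first s of them is exactly "zero out all but the s biggest".

    import numpy as np

    def lsi_truncate(A, s):
        # Thin SVD: A = W @ diag(sv) @ Vt, singular values sorted descending.
        W, sv, Vt = np.linalg.svd(A, full_matrices=False)
        # Keep only the s largest singular values and the matching columns/rows.
        W_s, S_s, Vt_s = W[:, :s], np.diag(sv[:s]), Vt[:s, :]
        A_s = W_s @ S_s @ Vt_s          # rank-s approximation of A
        return W_s, S_s, Vt_s, A_s

    # Example (A from the toy matrix above, s = 2):
    # W_s, S_s, Vt_s, A_s = lsi_truncate(A, 2)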

More on W and V
- Recall the m x n matrix of terms x docs, A.
- Define the term-term correlation matrix T = A A^t (A^t denotes the matrix transpose of A). T is a square, symmetric m x m matrix.
- Define the doc-doc correlation matrix D = A^t A. D is a square, symmetric n x n matrix.
- Why?

Eigenvectors
- Denote by W the m x r matrix of eigenvectors of T.
- Denote by V the n x r matrix of eigenvectors of D.
- Denote by S the diagonal matrix with the square roots of the eigenvalues of T = A A^t (the singular values), in sorted order.
- It turns out that A = W S V^t is the SVD of A.
- Semi-precise intuition: the new dimensions are the principal components of term correlation space.
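A quick numerical check of this relationship, assuming A is any small term-document matrix: the leading eigenvalues of T = A A^t are the squared singular values of A, and the corresponding eigenvectors match the columns of W up to sign.

    import numpy as np

    A = np.random.rand(5, 3)                 # any small term-document matrix
    W, sv, Vt = np.linalg.svd(A, full_matrices=False)

    T = A @ A.T                              # term-term correlation matrix
    eigvals, eigvecs = np.linalg.eigh(T)     # returned in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    print(np.allclose(eigvals[:3], sv**2))               # eigenvalues = squared singular values
    print(np.allclose(np.abs(eigvecs[:, :3]), np.abs(W)))  # eigenvectors = columns of W, up to sign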

Query processing Exercise: How do you map the query into the reduced space?
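One standard answer, offered here as a hedged sketch rather than the slide's intended solution: treat the query as a pseudo-document q (an m-vector of term weights), fold it in with q_s = S_s^{-1} W_s^T q, and rank documents by cosine similarity against their reduced representations (the columns of V_s^T).

    import numpy as np

    def fold_in_query(q, W_s, S_s):
        # q: m-vector of query term weights; returns its s-dimensional representation.
        return np.linalg.inv(S_s) @ W_s.T @ q

    def rank_docs(q_s, Vt_s):
        # Columns of Vt_s are the documents in the reduced space.
        docs = Vt_s.T                                    # n x s
        sims = docs @ q_s / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_s) + 1e-12)
        return np.argsort(-sims)                         # best match first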

Take Away
- LSI is optimal: the optimal solution for a given dimensionality. Caveat: mathematically optimal is not necessarily "semantically" optimal.
- LSI is unique, except for signs and for singular values with the same value.
- Key benefits of LSI: enhances recall, addresses the synonymy problem; but it can decrease precision.
- Maintenance challenges: changing collections; recompute at intervals?
- Performance challenges.
- There are cheaper alternatives for recall enhancement, e.g. pseudo-feedback.
- Use of LSI in deployed systems: why (or why not)?

Resources: LSI
- Random projection theorem
- Faster random projection
- Latent semantic indexing
- Books: FSNLP 15.4, MG 4.6, MIR

Interactive Information Retrieval / User Interfaces

The User in Information Access
[Flow diagram: the user has an information need -> finds a starting point -> formulates/reformulates a query -> sends it to the system -> receives results -> explores the results -> done? If yes, stop; if no, reformulate.]

Main Focus of Information Retrieval
[Same flow diagram; the "send query to system -> receive results" step is highlighted as the focus of most IR!]

Information Access in Context
[Flow diagram placing information access in context: the user's high-level goal -> analyze -> information access -> synthesize -> done? If yes, stop; if no, continue.]

The User in Information Access
[Flow diagram repeated: information need -> find starting point -> formulate/reformulate query -> send to system -> receive results -> explore results -> done?]

Queries on the Web Most Frequent on 2002/10/26

Queries on the Web (2000) Why only 9% sex?

Intranet Queries (Aug 2000)
3351 bearfacts
3349 telebears
1909 extension
1874 schedule+of+classes
1780 bearlink
1737 bear+facts
1468 decal
1443 infobears
1227 calendar
989 career+center
974 campus+map
920 academic+calendar
840 map
773 bookstore
741 class+pass
738 housing
721 tele-bears
716 directory
667 schedule
627 recipes
602 transcripts
582 tuition
577 seti
563 registrar
550 info+bears
543 class+schedule
470 financial+aid
Source: Ray Larson

Intranet Queries
Summary of sample data from 3 weeks of UCB queries:
13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
6.7% Schedule of classes or final exams (6222)
5.4% Summer Session (5041)
3.2% Extension (2932)
3.1% Academic Calendar (2846)
2.4% Directories (2202)
1.7% Career Center (1588)
1.7% Housing (1583)
1.5% Map (1393)
Source: Ray Larson

Types of Information Needs
- Need an answer to a question (who won the Super Bowl?)
- Re-find a particular document
- Find a good recipe for tonight's dinner
- Exploration of a new area (browse sites about Mexico City)
- Authoritative summary of information (HIV review)
In most cases, only one interface! (Cell phone / PDA / camera / MP3 player analogy.)

The User in Information Access
[Flow diagram repeated: information need -> find starting point -> formulate/reformulate query -> send to system -> receive results -> explore results -> done?]

Find a Starting Point by Browsing
[Diagram: a network of pages; from an entry point the user follows links to reach a starting point for search (or the answer itself?).]

Hierarchical Browsing
[Diagram: a category tree browsed level by level, from Level 0 down through Level 1 and Level 2.]

Visual Browsing: Hyperbolic Tree

Visual Browsing: Themescape

Scatter/Gather
Scatter/Gather allows the user to find a set of documents of interest through browsing. It iterates two steps (see the sketch below):
- Scatter: take the collection and scatter it into n clusters.
- Gather: pick the clusters of interest and merge them.
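A rough sketch of one Scatter/Gather iteration using k-means over tf-idf vectors; the clustering algorithm and the vectorization are stand-ins (the original system used its own fast clustering), and scikit-learn is assumed to be available.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def scatter(docs, n_clusters):
        # Scatter: cluster the current document set into n_clusters groups.
        X = TfidfVectorizer().fit_transform(docs)
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
        return [[d for d, l in zip(docs, labels) if l == c] for c in range(n_clusters)]

    def gather(clusters, chosen):
        # Gather: merge the clusters the user picked into a new, smaller collection.
        return [d for c in chosen for d in clusters[c]]

    # One iteration: scatter the collection, let the user pick clusters 0 and 2, re-scatter.
    # clusters = scatter(collection, 5)
    # subset = gather(clusters, [0, 2])
    # clusters = scatter(subset, 5)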

Scatter/Gather

Browsing vs. Searching
- Browsing and searching are often interleaved.
- Information need dependent:
  - Open-ended (find information about Mexico City) -> browsing
  - Specific (who won the Super Bowl?) -> searching
- User dependent: some users prefer searching, others browsing (confirmed in many studies: some hate to type).
- Advantage of browsing: you don't need to know the vocabulary of the collection.
- Compare to the physical world: browsing vs. searching in a grocery store.

Browsers vs. Searchers
- 1/3 of users do not search at all.
- 1/3 rarely search, or type URLs only.
- Only 1/3 understand the concept of search.
(ISP data from 2000.) Why?

Starting Points
Methods for finding a starting point:
- Select collections from a list (Highwire Press)
- Google!
- Hierarchical browsing, directories
- Visual browsing: hyperbolic tree; Themescape, Kohonen maps
- Browsing vs. searching

The User in Information Access
[Flow diagram repeated: information need -> find starting point -> formulate/reformulate query -> send to system -> receive results -> explore results -> done?]

Form-based Query Specification (Infoseek) Credit: Marti Hearst

Boolean Queries
- Boolean logic is difficult for the average user.
- Some interfaces for average users support formulation of Boolean queries.
- The current view is that non-expert users are best served with non-Boolean or simple +/- Boolean queries (pioneered by AltaVista).
- But Boolean queries are the standard for certain groups of expert users (e.g., lawyers).

Direct Manipulation Spec. VQUERY (Jones 98) Credit: Marti Hearst

One Problem with Boolean Queries: Feast or Famine
- Specifying a well-targeted query is hard; this is a bigger problem for Boolean queries.
- Google example: 1860 hits for "standard user dlink 650"; 0 hits after adding "no card found".
- How general is the query?

Boolean Queries: Summary
- Complex Boolean queries are difficult for the average user.
- Feast-or-famine problem.
- Prior to Google, many IR researchers thought Boolean queries were a bad idea.
- Google queries are strict conjunctions. Why does this work well?

Parametric Search Example
Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.

We can add text search.

Parametric Search
- Each document has, in addition to text, some "metadata", e.g., Make, Model, City, Color.
- A parametric search interface allows the user to combine a full-text query with selections on these parameters (see the toy sketch below).
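A toy sketch of how a parametric query might combine metadata selections with a full-text condition; the field names (Make, Model, City, Color) come from the slide, and the records, helper name, and matching rule are illustrative only.

    cars = [
        {"Make": "Honda", "Model": "Civic", "City": "Berkeley", "Color": "red",
         "text": "well maintained, one owner, new tires"},
        {"Make": "Ford", "Model": "Focus", "City": "Oakland", "Color": "blue",
         "text": "minor scratches, recent engine repair"},
    ]

    def parametric_search(records, params, text_query):
        # Keep records matching every metadata selection AND containing the query text.
        return [r for r in records
                if all(r.get(k) == v for k, v in params.items())
                and text_query.lower() in r["text"].lower()]

    print(parametric_search(cars, {"Color": "red", "City": "Berkeley"}, "tires"))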

Interfaces for term browsing

Re/Formulate Query
- Single text box (Google, Stanford intranet)
- Command-based (Socrates)
- Boolean queries
- Parametric search
- Term browsing
- Other methods: relevance feedback, query expansion, spelling correction, natural language / question answering

The User in Information Access
[Flow diagram repeated: information need -> find starting point -> formulate/reformulate query -> send to system -> receive results -> explore results -> done?]

Category Labels to Support Exploration
Example: ODP categories on Google.
Advantages:
- Interpretable
- Capture summary information
- Describe multiple facets of content
- Domain dependent, and so descriptive
Disadvantages:
- Domain dependent, so costly to acquire
- May mismatch users' interests
Credit: Marti Hearst

Evaluate Results Context in Hierarchy: Cat-a-Cone

Summarization to Support Exploration
- Query-dependent summarization: KWIC (keyword in context) lines (a la Google).
- Query-independent summarization: a summary written by the author (if available), or an automatically generated summary.
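A minimal sketch of query-dependent KWIC lines: for each query term, show a fixed-width window of the document around the first match (real snippet generation selects and merges the best windows, scores sentences, etc.).

    def kwic(doc, query_terms, width=30):
        lines, low = [], doc.lower()
        for term in query_terms:
            i = low.find(term.lower())
            if i >= 0:
                start, end = max(0, i - width), i + len(term) + width
                lines.append("..." + doc[start:end] + "...")
        return lines

    print(kwic("The jaguar is a large cat native to the Americas.", ["jaguar", "cat"]))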

Visualize Document Structure for Exploration

Result Exploration
User goal: do these results answer my question?
Methods:
- Category labels
- Summarization
- Visualization of document structure
Other methods:
- Metadata: URL, date, file size, author
- Hypertext navigation: can I find the answer by following a link?
- Browsing in general
- Clustering of results (jaguar example)

Exercise
- Current information retrieval user interfaces are designed for typical computer screens. How would you design a user interface for a wall-sized screen?
- Observe your own information seeking behavior. Examples: WWW, university library, grocery store.
  - Are you a searcher or a browser?
  - How do you reformulate your query? Read bad hits, then minus terms; read good hits, then plus terms; try a completely different query; …

Take Away
[The flow diagram once more: most IR focuses on the "send query to system -> receive results" step, but the full information access loop also involves finding starting points, formulating and reformulating queries, and exploring results until done.]

Evaluation of Interactive Retrieval

Recap: Relevance Feedback
- User sends a query.
- Search system returns results.
- User marks some results as relevant and resubmits the query plus the relevant results.
- The search system now has a better description of the information need and returns more relevant results.
- One method: the Rocchio algorithm (sketched below).
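A hedged sketch of the Rocchio update in the vector space model: the new query vector moves toward the centroid of documents marked relevant and (optionally) away from the non-relevant ones. The weights alpha, beta, gamma below are conventional defaults, not values given in these slides.

    import numpy as np

    def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        # q: original query vector; relevant / nonrelevant: lists of document vectors.
        q_new = alpha * q
        if relevant:
            q_new += beta * np.mean(relevant, axis=0)
        if nonrelevant:
            q_new -= gamma * np.mean(nonrelevant, axis=0)
        return np.maximum(q_new, 0)   # common choice: clip negative term weights to zero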

Why Evaluate Relevance Feedback? Simulated interactive retrieval consistently outperforms non-interactive retrieval (70% here).

Relevance Feedback Evaluation: Case Study
- An example of evaluating interactive information retrieval: Koenemann & Belkin 1996.
- Goal of the study: show that relevance feedback improves retrieval effectiveness.

Details on the User Study
- 64 novice searchers: 43 female, 21 male, native English speakers.
- TREC test bed: Wall Street Journal subset.
- Two search topics: Automobile Recalls; Tobacco Advertising and the Young.
- Relevance judgments from TREC and the experimenter.
- System was INQUERY (vector space with some bells and whistles).
- Subjects had a tutorial session to learn the system.
- Their goal was to keep modifying the query until they had developed one that gets high precision.
- Reweighting of terms was similar to, but different from, Rocchio.

Credit: Marti Hearst

Evaluation Criterion: Precision at 30 Documents
- Compare precision at 30 for users with relevance feedback vs. users without relevance feedback.
- Goal: show that users with relevance feedback do better.
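Precision at 30 is just the fraction of the top 30 returned documents that are relevant; a minimal sketch, with hypothetical inputs:

    def precision_at_k(ranked_doc_ids, relevant_ids, k=30):
        # Fraction of the top k results that appear in the relevant set.
        top_k = ranked_doc_ids[:k]
        return sum(1 for d in top_k if d in relevant_ids) / k

    # Example: if 12 of the first 30 results are relevant, P@30 = 0.4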

Precision vs. RF condition (from Koenemann & Belkin 96) Credit: Marti Hearst

Result Subjects with relevance feedback had, on average, 17-34% better performance than subjects without relevance feedback. Does this show conclusively that relevance feedback is better?

But … Difference in precision numbers not statistically significant. Search times approximately equal.

Take Away
- Evaluating interactive systems is harder than evaluating algorithms.
- Experiments involving humans have many confounding variables: age, level of education, prior experience with search, search style (browsing vs. searching), Mac vs. Linux vs. Windows user, mood, level of alertness, chemistry with the experimenter, etc.
- Showing statistical significance becomes harder as the number of confounding variables increases.
- Also: human subject studies are resource-intensive.
- It's hard to "scientifically prove" the superiority of relevance feedback.

Other Evaluation Issues
- Query variability: always compare methods on a query-by-query basis; methods with the same average performance can differ a lot in user friendliness.
- Inter-judge variability: in general, judges disagree often; this has a big impact on the relevance assessment of a single document but little impact on the ranking of systems.
- Redundancy: a highly relevant document with no new information is useless, yet most IR measures don't measure redundancy.
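One common way to act on the query-by-query advice: pair each query's score under the two systems and run a paired significance test. The sketch below uses scipy's paired t-test purely as an illustration (a sign test or Wilcoxon test is often preferred when normality is doubtful); the numbers are hypothetical.

    from scipy import stats

    # Per-query precision@30 for the same queries under two systems (made-up values).
    system_a = [0.40, 0.20, 0.55, 0.10, 0.35]
    system_b = [0.45, 0.25, 0.50, 0.20, 0.40]

    t, p = stats.ttest_rel(system_a, system_b)
    print(f"paired t = {t:.3f}, p = {p:.3f}")   # small p -> difference unlikely to be chance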

Resources
- FOA 4.3
- MIR Ch.
- Ellen Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, ACM SIGIR 1998.
- Harman, D.K. Overview of the Third Text REtrieval Conference (TREC-3). In: Overview of the Third Text REtrieval Conference (TREC-3), Harman, D.K. (Ed.), NIST Special Publication, 1995, pp. 1-19.
- Marti A. Hearst and Jan O. Pedersen, Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, Proceedings of SIGIR-96, 1996.
- Paul Over, TREC-6 Interactive Track Report, NIST, 1998.

Resources
- MIR Ch. – 10.7
- Donna Harman, Overview of the Fourth Text REtrieval Conference (TREC-4), National Institute of Standards and Technology.
- Cutting, Karger, Pedersen, Tukey. Scatter/Gather. ACM SIGIR.
- Hearst, Cat-a-Cone: An Interactive Interface for Specifying Searches and Viewing Retrieval Results Using a Large Category Hierarchy, ACM SIGIR.