1
Intelligent Search by Dimension Reduction Holger Bast Max-Planck-Institut für Informatik AG 1 - Algorithmen und Komplexität
2
Short History
I started getting involved in April 2002
I gave a series of lectures in Nov./Dec. 2002
We started a subgroup at the beginning of this year; current members are:
– Kurt Mehlhorn (director of AG 1)
– Hisao Tamaki (visiting professor from Japan)
– Kavitha Telikepalli, Venkatesh Srinivasan (postdocs)
– Irit Katriel, Debapriyo Majumdar (PhD students, IMPRS)
3
Dimension Reduction
Given a high-dimensional space of objects, recover the (assumed) underlying low-dimensional space
Formally: given an m×n matrix, possibly full rank, find the best low-rank approximation
Example term-document matrix:
car         1 1 0 1 0 1 0
automobile  1 0 1 1 0 0 0
search      0 0 0 0 1 1 0
engine      1 1 1 0 1 0 1
web         0 0 0 0 1 1 1
4
Dimension Reduction
Given a high-dimensional space of objects, recover the (assumed) underlying low-dimensional space
Formally: given an m×n matrix, possibly full rank, find the best low-rank approximation
Low-rank approximation of the same matrix:
car         1 1 1 1 0 0 0
automobile  1 1 1 1 0 0 0
search      0 0 0 0 1 1 1
engine      1 1 1 1 1 1 1
web         0 0 0 0 1 1 1
5
Generic Method
Find concepts (vectors in term space) c_1, …, c_k
Replace each document by a linear combination of the c_1, …, c_k
That is, replace the term-document matrix by the product C·D', where
– the columns of C are c_1, …, c_k
– the columns of D' are the documents expressed in terms of the concepts
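A minimal numpy sketch of this generic scheme (the toy matrix and the two hand-picked concept vectors are illustrative, not from the talk): fix a concept matrix C and compute D' by least squares, so that C·D' is the best approximation of the term-document matrix for that choice of C.

```python
import numpy as np

# Toy term-document matrix D (rows = terms, columns = documents).
D = np.array([[1, 1, 0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 0],
              [1, 1, 1, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 1, 1]], dtype=float)

# Assumed concept vectors c_1, ..., c_k in term space (columns of C).
# Here: a "cars" concept and a "web search" concept, chosen by hand.
C = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1],
              [0, 1]], dtype=float)

# D' = least-squares representation of each document in concept space,
# so that C @ D_prime best approximates D for this fixed C.
D_prime, *_ = np.linalg.lstsq(C, D, rcond=None)

print("reduced representation D' (k x n):\n", np.round(D_prime, 2))
print("reconstruction C @ D':\n", np.round(C @ D_prime, 2))
```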
7
Specific Methods
Latent Semantic Indexing (LSI)
– Dumais et al. 1989
– orthogonal concepts c_1, …, c_k
– the span of c_1, …, c_k is the k-dimensional subspace that minimizes the squared distances
– choice of basis not specified (at least two sensible ways)
– computable in polynomial time via the singular value decomposition (SVD)
– surprisingly good in practice
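Since the slide notes that LSI is computable via the SVD, here is a minimal sketch of a rank-k truncation with numpy. The split of the singular values between C and D' is just one of the "at least two sensible ways" of choosing a basis mentioned above; the toy matrix is illustrative.

```python
import numpy as np

def lsi(D, k):
    """Rank-k LSI approximation of a term-document matrix D (terms x docs)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    C = U[:, :k]                          # orthogonal concept vectors c_1..c_k
    D_prime = np.diag(s[:k]) @ Vt[:k, :]  # documents expressed in concept space
    return C, D_prime                     # C @ D_prime is the best rank-k fit

# Same toy matrix as before; reduce to k = 2 concepts.
D = np.array([[1, 1, 0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 0],
              [1, 1, 1, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 1, 1]], dtype=float)
C, D_prime = lsi(D, k=2)
print(np.round(C @ D_prime, 2))   # rank-2 reconstruction of D
```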
8
Specific Methods
Probabilistic Latent Semantic Indexing (PLSI)
– Hofmann 1999
– find the stochastic matrix of rank k that maximizes the probability that the given matrix is an instance of it
– connects the problem to statistical learning theory
– hard to compute; approximated by local search techniques
– very good results on some test collections
9
Specific Methods
Concept Indexing (CI)
– Karypis & Han 2000
– c_1, …, c_k = the centroid vectors of a k-clustering
– documents = projections onto these centroids
– computationally easy (given the clustering)
– gave surprisingly good results in a recent DFKI project (Arbeitsamt)
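A minimal sketch of concept indexing along these lines, assuming scikit-learn's k-means for the clustering step (any k-clustering would do): the centroids become the concepts, and the documents are projected onto them, here via least squares.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy term-document matrix (rows = terms, columns = documents).
D = np.array([[1, 1, 0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 0],
              [1, 1, 1, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 1, 1]], dtype=float)

k = 2
# Cluster the documents (columns of D) in term space.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(D.T)

# Concepts c_1..c_k = centroid vectors of the clustering (columns of C).
C = np.column_stack([D[:, labels == j].mean(axis=1) for j in range(k)])

# Documents = projections onto these centroids (least-squares coordinates).
D_prime, *_ = np.linalg.lstsq(C, D, rcond=None)
print(np.round(D_prime, 2))
```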
10
Comparing Methods
Fundamental question: which method is how good under which circumstances?
Few theoretically founded answers to this question:
– seminal paper: A Probabilistic Analysis of Latent Semantic Indexing, Papadimitriou, Raghavan, Tamaki, Vempala, PODS'98 (ten years after LSI was born!)
– follow-up paper: Spectral Analysis of Data, Azar, Fiat, Karlin, McSherry, Saia, STOC'01
– main statement: LSI is robust against the addition of (how much?) noise
11
Why does LSI work so well?
A good method should produce
– small angles between documents on similar topics
– large angles between documents on different topics
A formula for angles in the reduced space:
– Let D = C·G, and let c_1', …, c_k' be the images of the concepts under LSI
– Then the k×k dot products c_i'·c_j' are given by the matrix (G·G^T)^-1
– That is, the pairwise angles are ≥ 90 degrees if and only if (G·G^T)^-1 has nonpositive off-diagonal entries (an M-matrix)
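A small numpy helper that applies the criterion stated above: compute (G·G^T)^-1 for a given mixing matrix G and test whether its off-diagonal entries are nonpositive. The example G is made up for illustration.

```python
import numpy as np

def concept_angles_test(G):
    """Per the slide, for D = C @ G the pairwise dot products of the LSI images
    of the concepts are the entries of (G @ G.T)^{-1}.  All pairwise angles are
    >= 90 degrees iff that inverse has nonpositive off-diagonal entries
    (i.e. it is an M-matrix)."""
    M = np.linalg.inv(G @ G.T)
    off_diag = M - np.diag(np.diag(M))
    return M, bool(np.all(off_diag <= 1e-12))

# Illustrative mixing matrix G (k concepts x n documents), assumed values.
G = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.0, 0.1, 0.9, 1.0]])
M, well_separated = concept_angles_test(G)
print(np.round(M, 3), well_separated)
```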
12
Polysemy and Simonymy
Let T_ij be the dot product of the i-th with the j-th row of a term-document matrix (~ co-occurrence of terms i and j)
– Call term k a polysem if there exist terms i and j such that for some t, T_ik, T_jk ≥ t but T_ij < t
– Two terms i and j are simonyms if T_ij ≥ T_ii or T_ij ≥ T_jj
Without polysems and simonyms we have
1. T_ij ≥ min(T_ik, T_jk) for all i, j, k
2. T_ii > T_ij for all j ≠ i
A symmetric matrix (T_ij) with 1. and 2. is called strictly ultrametric
13
Help from Linear Algebra
Theorem [Martinez, Michon, San Martin 1994]: The inverse of a strictly ultrametric matrix is an M-matrix, i.e., its diagonal entries are positive and its off-diagonal entries are nonpositive
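A numeric sanity check of the theorem, using the two conditions from the previous slide; the 3×3 example matrix is hand-made for illustration.

```python
import numpy as np

def is_strictly_ultrametric(T, tol=1e-12):
    """Check T_ij >= min(T_ik, T_jk) for all i,j,k and T_ii > T_ij for j != i."""
    n = T.shape[0]
    cond1 = all(T[i, j] + tol >= min(T[i, k], T[j, k])
                for i in range(n) for j in range(n) for k in range(n))
    cond2 = all(T[i, i] > T[i, j] for i in range(n) for j in range(n) if i != j)
    return cond1 and cond2

def is_m_matrix_like(M, tol=1e-12):
    """Positive diagonal entries, nonpositive off-diagonal entries."""
    off = M - np.diag(np.diag(M))
    return bool(np.all(np.diag(M) > 0) and np.all(off <= tol))

# Hand-made symmetric matrix satisfying both conditions.
T = np.array([[3.0, 2.0, 1.0],
              [2.0, 3.0, 1.0],
              [1.0, 1.0, 2.0]])
print(is_strictly_ultrametric(T))          # expect True
print(is_m_matrix_like(np.linalg.inv(T)))  # expect True, by the theorem
```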
14
A new LSI theorem
Theorem: If D can be well approximated by a set of concepts free from polysemy and simonymy, then in the reduced LSI space these concepts form large pairwise angles.
Beware: this only holds for the original LSI, not for its widely used variant!
Question: how can we check whether such a set exists? This would yield a method for selecting the optimal (reduced) dimension!
15
Exploiting Link Structure
Achlioptas, Fiat, Karlin, McSherry (FOCS'01):
– documents have a topic (implicit in the distribution of terms)
– and a quality (implicit in the link structure)
– represent each document by a vector: the direction corresponds to the topic, the length corresponds to the quality
– Goal: for a given query, rank documents by their dot product with the topic of the query
16
Model details
Underlying parameters:
– A = [A_1 … A_n]  authority topics, one per document
– H = [H_1 … H_n]  hub topics, one per document
– C = [C_1 … C_k]  translates topics to terms
– q = [q_1 … q_k]  query topic
The input we see:
– D ~ A·C + H·C  term-document matrix
– L ~ H^T·A  link matrix
– Q ~ q·C  query terms
Goal: recover the ordering of A_1·q, …, A_n·q
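A hedged sketch of generating a toy instance of this model with numpy. The slide leaves the exact shapes open, so the sketch fixes one concrete convention (documents as rows of D, topics of dimension k); all sizes and distributions are made up, and the observed objects are generated noise-free.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 3, 50, 10          # topics, terms, documents (made-up sizes)

A = rng.random((k, n))        # column j: authority topic A_j of document j
H = rng.random((k, n))        # column j: hub topic H_j of document j
C = rng.random((k, m))        # row i: term distribution of topic i
q = rng.random(k)             # query topic

# Observed objects (the model only assumes these hold approximately):
D = A.T @ C + H.T @ C         # term-document information, one row per document
L = H.T @ A                   # link matrix: L[i, j] ~ hub i endorsing authority j
Q = q @ C                     # query terms

# Goal: recover the ordering of the true relevance scores A_j . q
true_scores = A.T @ q
print(np.argsort(-true_scores))   # ideal ranking of the documents
```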
17
Model - Problems
Link matrix generation L ~ H^T·A
– is OK, because the presence of a link is related to the hub/authority value
Term-document matrix generation D ~ A·C + H·C
– very unrealistic: the term distribution gives information on the topic, but not on the quality!
– more realistic: D ~ A_0·C + H_0·C, where A_0 and H_0 contain the normed columns of A and H
So far, we could solve the special case where A differs from H by only a diagonal matrix (i.e., hub topic = authority topic)
18
Perspective
Strong theoretical foundations
– unifying framework + comparative analysis for a large variety of dimension reduction methods
– realistic models + performance guarantees
Make proper use of human intelligence
– integrate explicit knowledge
– but only as much as required (automatic detection)
– combine dimension reduction methods with interactive schemes (e.g., phrase browsing)
19
The End!
20
Specific Methods
Latent Semantic Indexing (LSI) [Dumais et al. '89]
– orthogonal concepts c_1, …, c_k
– the span of c_1, …, c_k is the k-dimensional subspace that minimizes the squared distances
Probabilistic Latent Semantic Indexing (PLSI) [Hofmann '99]
– find the stochastic matrix of rank k that maximizes the probability that the given matrix is an instance of it
Concept Indexing (CI) [Karypis & Han '00]
– c_1, …, c_k = the centroid vectors of a k-clustering
– documents = projections onto these centroids
21
Dimension Reduction Methods
Main idea: the high-dimensional space of objects is a variant of an underlying low-dimensional space
Formally: given an m×n matrix, possibly full rank, find the best low-rank approximation
[Example term-document matrix with rows car, automobile, search, engine, web, as on slide 3]
22
I will talk about …
Dimension reduction techniques
– some methods
– a new theorem
Exploiting link structure
– state of the art
– some new ideas
Perspective
23
Overview
Exploiting the link structure
– Google, HITS, SmartyPants
– Trawling
Semantic Web
– XML, XML-Schema
– RDF, DAML+OIL
Interactive browsing
– Scatter/Gather
– Phrase Browsing
24
Scatter/Gather
Cutting, Karger, Pedersen, Tukey, SIGIR'92
Motivation: zooming into a large document collection
Realisation: geometric clustering
Challenge: extremely fast algorithms required, in particular
– linear-time preprocessing
– constant-time query processing
Example: New York Times News Service, articles from August 1990 (~5000 articles, 30 MB of text)
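A minimal sketch of one "scatter" step under stated assumptions (a small bag-of-words matrix and scikit-learn's k-means standing in for the fast clustering the slide calls for): cluster the documents geometrically and print a short top-terms digest per cluster. The "gather" step would select some clusters and scatter them again.

```python
import numpy as np
from sklearn.cluster import KMeans

def scatter(doc_term, vocab, k=2, top=2):
    """One Scatter step: k-means clustering plus a top-terms digest per cluster.
    doc_term: (n_docs x n_terms) bag-of-words matrix; vocab: list of terms."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(doc_term)
    clusters = []
    for j in range(k):
        members = np.where(km.labels_ == j)[0]
        digest = [vocab[t] for t in np.argsort(-km.cluster_centers_[j])[:top]]
        clusters.append((members, digest))
    return clusters

# Tiny illustrative collection; real use would re-scatter the gathered clusters.
vocab = ["car", "automobile", "search", "engine", "web"]
doc_term = np.array([[2, 1, 0, 1, 0],
                     [1, 2, 0, 0, 0],
                     [0, 0, 2, 1, 1],
                     [0, 0, 1, 1, 2]], dtype=float)
for members, digest in scatter(doc_term, vocab, k=2):
    print(digest, "-> documents", members)
```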
25
Scatter/Gather – Example
[Screenshot] taken from Cutting, Karger, Pedersen, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, © 1992 ACM SIGIR
26
Scatter/Gather – Example
[Screenshot] taken from Cutting, Karger, Pedersen, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, © 1992 ACM SIGIR
27
Phrase Browsing
Nevill-Manning, Witten, Moffat, 1997
Formulating a good query requires more or less knowledge of the document collection
– if less, fine
– if more, interaction is a must
Build a hierarchy of phrases
Example: http://www.nzdl.org/cgi-bin/library
Challenge: fast algorithms for finding a minimal grammar, e.g. for S → babaabaabaa
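The slide's example asks for a small grammar for the string babaabaabaa. Below is a hedged sketch of a greedy digram-replacement scheme (in the spirit of such grammar-based phrase hierarchies, not necessarily the algorithm of Nevill-Manning et al.): repeatedly replace the most frequent adjacent symbol pair with a fresh nonterminal.

```python
from collections import Counter

def greedy_grammar(s):
    """Greedily build a small grammar for string s by replacing the most
    frequent adjacent symbol pair with a fresh nonterminal."""
    seq = list(s)
    rules = {}
    next_symbol = iter("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        nt = next(next_symbol)
        rules[nt] = a + b
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(nt)      # replace the pair (a, b) by the nonterminal
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return "".join(seq), rules

start, rules = greedy_grammar("babaabaabaa")
print("S ->", start)
for nt, rhs in rules.items():
    print(nt, "->", rhs)
```

Expanding the rules reproduces the original string; the nonterminals correspond to the reusable phrases a phrase-browsing hierarchy would expose.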
28
Teoma
A more refined concept of authoritativeness, depending on the specific query ("subject-specific popularity")
More sophisticated query refinement
But: coverage is only 10% of that of Google
Example: http://www.teoma.com