1
Intelligent Search by Dimension Reduction Holger Bast Max-Planck-Institut für Informatik AG 1 - Algorithmen und Komplexität
2
Short History
I started getting involved in April 2002
I gave a series of lectures in Nov./Dec. 2002
We started a subgroup at the beginning of this year; current members are:
– Kurt Mehlhorn (director of AG 1)
– Hisao Tamaki (visiting professor from Japan)
– Kavitha Telikepalli, Venkatesh Srinivasan (postdocs)
– Irit Katriel, Debapriyo Majumdar (PhD students, IMPRS)
3
Dimension Reduction
Given a high-dimensional space of objects, recover the (assumed) underlying low-dimensional space
Formally: given an m×n matrix, possibly full rank, find the best low-rank approximation
Example term-document matrix:
car         1 1 0 1 0 1 0
automobile  1 0 1 1 0 0 0
search      0 0 0 0 1 1 0
engine      1 1 1 0 1 0 1
web         0 0 0 0 1 1 1
4
Dimension Reduction
Given a high-dimensional space of objects, recover the (assumed) underlying low-dimensional space
Formally: given an m×n matrix, possibly full rank, find the best low-rank approximation
Low-rank approximation of the same matrix:
car         1 1 1 1 0 0 0
automobile  1 1 1 1 0 0 0
search      0 0 0 0 1 1 1
engine      1 1 1 1 1 1 1
web         0 0 0 0 1 1 1
5
Generic Method
Find concepts (vectors in term space) c_1, …, c_k
Replace each document by a linear combination of the c_1, …, c_k
That is, replace the term-document matrix by the product C·D', where
– the columns of C are c_1, …, c_k
– the columns of D' are the documents expressed in terms of the concepts
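A minimal numpy sketch of this generic scheme (the toy matrix and the two hand-picked concept vectors are illustrative, not from the talk): fix a concept matrix C and compute D' by least squares, so that C·D' is the best approximation of the term-document matrix for that choice of C.

```python
import numpy as np

# Toy term-document matrix D (rows = terms, columns = documents).
D = np.array([[1, 1, 0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 0],
              [1, 1, 1, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 1, 1]], dtype=float)

# Assumed concept vectors c_1, ..., c_k in term space (columns of C).
# Here: a "cars" concept and a "web search" concept, chosen by hand.
C = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1],
              [0, 1]], dtype=float)

# D' = least-squares representation of each document in concept space,
# so that C @ D_prime best approximates D for this fixed C.
D_prime, *_ = np.linalg.lstsq(C, D, rcond=None)

print("reduced representation D' (k x n):\n", np.round(D_prime, 2))
print("reconstruction C @ D':\n", np.round(C @ D_prime, 2))
```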
7
Specific Methods
Latent Semantic Indexing (LSI)
– Dumais et al. 1989
– orthogonal concepts c_1, …, c_k
– the span of c_1, …, c_k is the k-dimensional subspace that minimizes the squared distances
– choice of basis not specified (at least two sensible ways)
– computable in polynomial time via the singular value decomposition (SVD)
– surprisingly good in practice
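Since the slide notes that LSI is computable via the SVD, here is a minimal sketch of a rank-k truncation with numpy. The split of the singular values between C and D' is just one of the "at least two sensible ways" of choosing a basis mentioned above; the toy matrix is illustrative.

```python
import numpy as np

def lsi(D, k):
    """Rank-k LSI approximation of a term-document matrix D (terms x docs)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    C = U[:, :k]                          # orthogonal concept vectors c_1..c_k
    D_prime = np.diag(s[:k]) @ Vt[:k, :]  # documents expressed in concept space
    return C, D_prime                     # C @ D_prime is the best rank-k fit

# Same toy matrix as before; reduce to k = 2 concepts.
D = np.array([[1, 1, 0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 0],
              [1, 1, 1, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 1, 1]], dtype=float)
C, D_prime = lsi(D, k=2)
print(np.round(C @ D_prime, 2))   # rank-2 reconstruction of D
```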
8
Specific Methods
Probabilistic Latent Semantic Indexing (PLSI)
– Hofmann 1999
– find the stochastic matrix of rank k that maximizes the probability that the given matrix is an instance of it
– connects the problem to statistical learning theory
– hard to compute; approximated by local search techniques
– very good results on some test collections
9
Specific Methods
Concept Indexing (CI)
– Karypis & Han 2000
– c_1, …, c_k = the centroid vectors of a k-clustering
– documents = projections onto these centroids
– computationally easy (given the clustering)
– gave surprisingly good results in a recent DFKI project (Arbeitsamt)
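A minimal sketch of concept indexing along these lines, assuming scikit-learn's k-means for the clustering step (any k-clustering would do): the centroids become the concepts, and the documents are projected onto them, here via least squares.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy term-document matrix (rows = terms, columns = documents).
D = np.array([[1, 1, 0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 0],
              [1, 1, 1, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 1, 1]], dtype=float)

k = 2
# Cluster the documents (columns of D) in term space.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(D.T)

# Concepts c_1..c_k = centroid vectors of the clustering (columns of C).
C = np.column_stack([D[:, labels == j].mean(axis=1) for j in range(k)])

# Documents = projections onto these centroids (least-squares coordinates).
D_prime, *_ = np.linalg.lstsq(C, D, rcond=None)
print(np.round(D_prime, 2))
```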
10
Comparing Methods
Fundamental question: which method is how good under which circumstances?
Few theoretically founded answers to this question:
– seminal paper: A Probabilistic Analysis of Latent Semantic Indexing, Papadimitriou, Raghavan, Tamaki, Vempala, PODS'98 (ten years after LSI was born!)
– follow-up paper: Spectral Analysis of Data, Azar, Fiat, Karlin, McSherry, Saia, STOC'01
– main statement: LSI is robust against the addition of (how much?) noise
11
Why does LSI work so well?
A good method should produce
– small angles between documents on similar topics
– large angles between documents on different topics
A formula for angles in the reduced space:
– Let D = C·G, and let c_1', …, c_k' be the images of the concepts under LSI
– Then the k×k dot products c_i'·c_j' are given by the matrix (G·G^T)^-1
– That is, the pairwise angles are ≥ 90 degrees if and only if (G·G^T)^-1 has nonpositive off-diagonal entries (an M-matrix)
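A small numpy helper that applies the criterion stated above: compute (G·G^T)^-1 for a given mixing matrix G and test whether its off-diagonal entries are nonpositive. The example G is made up for illustration.

```python
import numpy as np

def concept_angles_test(G):
    """Per the slide, for D = C @ G the pairwise dot products of the LSI images
    of the concepts are the entries of (G @ G.T)^{-1}.  All pairwise angles are
    >= 90 degrees iff that inverse has nonpositive off-diagonal entries
    (i.e. it is an M-matrix)."""
    M = np.linalg.inv(G @ G.T)
    off_diag = M - np.diag(np.diag(M))
    return M, bool(np.all(off_diag <= 1e-12))

# Illustrative mixing matrix G (k concepts x n documents), assumed values.
G = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.0, 0.1, 0.9, 1.0]])
M, well_separated = concept_angles_test(G)
print(np.round(M, 3), well_separated)
```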
12
Polysemy and Simonymy
Let T_ij be the dot product of the i-th with the j-th row of a term-document matrix (~ co-occurrence of terms i and j)
– Call term k a polysem if there exist terms i and j such that for some t, T_ik, T_jk ≥ t but T_ij < t
– Two terms i and j are simonyms if T_ij ≥ T_ii or T_ij ≥ T_jj
Without polysems and simonyms we have
1. T_ij ≥ min(T_ik, T_jk) for all i, j, k
2. T_ii > T_ij for all j ≠ i
A symmetric matrix (T_ij) with 1. and 2. is called strictly ultrametric
13
Help from Linear Algebra
Theorem [Martinez, Michon, San Martin 1994]: The inverse of a strictly ultrametric matrix is an M-matrix, i.e., its diagonal entries are positive and its off-diagonal entries are nonpositive
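A numeric sanity check of the theorem, using the two conditions from the previous slide; the 3×3 example matrix is hand-made for illustration.

```python
import numpy as np

def is_strictly_ultrametric(T, tol=1e-12):
    """Check T_ij >= min(T_ik, T_jk) for all i,j,k and T_ii > T_ij for j != i."""
    n = T.shape[0]
    cond1 = all(T[i, j] + tol >= min(T[i, k], T[j, k])
                for i in range(n) for j in range(n) for k in range(n))
    cond2 = all(T[i, i] > T[i, j] for i in range(n) for j in range(n) if i != j)
    return cond1 and cond2

def is_m_matrix_like(M, tol=1e-12):
    """Positive diagonal entries, nonpositive off-diagonal entries."""
    off = M - np.diag(np.diag(M))
    return bool(np.all(np.diag(M) > 0) and np.all(off <= tol))

# Hand-made symmetric matrix satisfying both conditions.
T = np.array([[3.0, 2.0, 1.0],
              [2.0, 3.0, 1.0],
              [1.0, 1.0, 2.0]])
print(is_strictly_ultrametric(T))          # expect True
print(is_m_matrix_like(np.linalg.inv(T)))  # expect True, by the theorem
```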
14
A new LSI theorem
Theorem: If D can be well approximated by a set of concepts free from polysemy and simonymy, then in the reduced LSI space these concepts form large pairwise angles.
Beware: this only holds for the original LSI, not for its widely used variant!
Question: how can we check whether such a set exists? This would yield a method for selecting the optimal (reduced) dimension!
15
Exploiting Link Structure
Achlioptas, Fiat, Karlin, McSherry (FOCS'01):
– documents have a topic (implicit in the distribution of terms)
– and a quality (implicit in the link structure)
– represent each document by a vector: the direction corresponds to the topic, the length corresponds to the quality
– Goal: for a given query, rank documents by their dot product with the topic of the query
16
Model details
Underlying parameters:
– A = [A_1 … A_n]  authority topics, one per document
– H = [H_1 … H_n]  hub topics, one per document
– C = [C_1 … C_k]  translates topics to terms
– q = [q_1 … q_k]  query topic
The input we see:
– D ~ A·C + H·C  term-document matrix
– L ~ H^T·A  link matrix
– Q ~ q·C  query terms
Goal: recover the ordering of A_1·q, …, A_n·q
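A hedged sketch of generating a toy instance of this model with numpy. The slide leaves the exact shapes open, so the sketch fixes one concrete convention (documents as rows of D, topics of dimension k); all sizes and distributions are made up, and the observed objects are generated noise-free.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 3, 50, 10          # topics, terms, documents (made-up sizes)

A = rng.random((k, n))        # column j: authority topic A_j of document j
H = rng.random((k, n))        # column j: hub topic H_j of document j
C = rng.random((k, m))        # row i: term distribution of topic i
q = rng.random(k)             # query topic

# Observed objects (the model only assumes these hold approximately):
D = A.T @ C + H.T @ C         # term-document information, one row per document
L = H.T @ A                   # link matrix: L[i, j] ~ hub i endorsing authority j
Q = q @ C                     # query terms

# Goal: recover the ordering of the true relevance scores A_j . q
true_scores = A.T @ q
print(np.argsort(-true_scores))   # ideal ranking of the documents
```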
17
Model - Problems
Link matrix generation L ~ H^T·A
– is OK, because the presence of a link is related to the hub/authority value
Term-document matrix generation D ~ A·C + H·C
– very unrealistic: the term distribution gives information on the topic, but not on the quality!
– more realistic: D ~ A_0·C + H_0·C, where A_0 and H_0 contain the normed columns of A and H
So far, we could solve the special case where A differs from H by only a diagonal matrix (i.e., hub topic = authority topic)
18
Perspective
Strong theoretical foundations
– unifying framework + comparative analysis for a large variety of dimension reduction methods
– realistic models + performance guarantees
Make proper use of human intelligence
– integrate explicit knowledge
– but only as much as required (automatic detection)
– combine dimension reduction methods with interactive schemes (e.g., phrase browsing)
19
The End!
20
Specific Methods
Latent Semantic Indexing (LSI) [Dumais et al. '89]
– orthogonal concepts c_1, …, c_k
– the span of c_1, …, c_k is the k-dimensional subspace that minimizes the squared distances
Probabilistic Latent Semantic Indexing (PLSI) [Hofmann '99]
– find the stochastic matrix of rank k that maximizes the probability that the given matrix is an instance of it
Concept Indexing (CI) [Karypis & Han '00]
– c_1, …, c_k = the centroid vectors of a k-clustering
– documents = projections onto these centroids
21
Dimension Reduction Methods
Main idea: the high-dimensional space of objects is a variant of an underlying low-dimensional space
Formally: given an m×n matrix, possibly full rank, find the best low-rank approximation
[Example term-document matrix with rows car, automobile, search, engine, web, as on slide 3]
22
I will talk about …
Dimension reduction techniques
– some methods
– a new theorem
Exploiting link structure
– state of the art
– some new ideas
Perspective
23
Overview
Exploiting the link structure
– Google, HITS, SmartyPants
– Trawling
Semantic Web
– XML, XML-Schema
– RDF, DAML+OIL
Interactive browsing
– Scatter/Gather
– Phrase Browsing
24
Scatter/Gather
Cutting, Karger, Pedersen, Tukey, SIGIR'92
Motivation: zooming into a large document collection
Realisation: geometric clustering
Challenge: extremely fast algorithms required, in particular
– linear-time preprocessing
– constant-time query processing
Example: New York Times News Service, articles from August 1990 (~5000 articles, 30 MB of text)
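A minimal sketch of one "scatter" step under stated assumptions (a small bag-of-words matrix and scikit-learn's k-means standing in for the fast clustering the slide calls for): cluster the documents geometrically and print a short top-terms digest per cluster. The "gather" step would select some clusters and scatter them again.

```python
import numpy as np
from sklearn.cluster import KMeans

def scatter(doc_term, vocab, k=2, top=2):
    """One Scatter step: k-means clustering plus a top-terms digest per cluster.
    doc_term: (n_docs x n_terms) bag-of-words matrix; vocab: list of terms."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(doc_term)
    clusters = []
    for j in range(k):
        members = np.where(km.labels_ == j)[0]
        digest = [vocab[t] for t in np.argsort(-km.cluster_centers_[j])[:top]]
        clusters.append((members, digest))
    return clusters

# Tiny illustrative collection; real use would re-scatter the gathered clusters.
vocab = ["car", "automobile", "search", "engine", "web"]
doc_term = np.array([[2, 1, 0, 1, 0],
                     [1, 2, 0, 0, 0],
                     [0, 0, 2, 1, 1],
                     [0, 0, 1, 1, 2]], dtype=float)
for members, digest in scatter(doc_term, vocab, k=2):
    print(digest, "-> documents", members)
```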
25
Scatter/Gather – Example
[Screenshot] taken from Cutting, Karger, Pedersen, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, © 1992 ACM SIGIR
26
Scatter/Gather – Example
[Screenshot] taken from Cutting, Karger, Pedersen, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, © 1992 ACM SIGIR
27
Phrase Browsing
Nevill-Manning, Witten, Moffat, 1997
Formulating a good query requires more or less knowledge of the document collection
– if less, fine
– if more, interaction is a must
Build a hierarchy of phrases
Example: http://www.nzdl.org/cgi-bin/library
Challenge: fast algorithms for finding a minimal grammar, e.g. for S → babaabaabaa
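The slide's example asks for a small grammar for the string babaabaabaa. Below is a hedged sketch of a greedy digram-replacement scheme (in the spirit of such grammar-based phrase hierarchies, not necessarily the algorithm of Nevill-Manning et al.): repeatedly replace the most frequent adjacent symbol pair with a fresh nonterminal.

```python
from collections import Counter

def greedy_grammar(s):
    """Greedily build a small grammar for string s by replacing the most
    frequent adjacent symbol pair with a fresh nonterminal."""
    seq = list(s)
    rules = {}
    next_symbol = iter("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        nt = next(next_symbol)
        rules[nt] = a + b
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(nt)      # replace the pair (a, b) by the nonterminal
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return "".join(seq), rules

start, rules = greedy_grammar("babaabaabaa")
print("S ->", start)
for nt, rhs in rules.items():
    print(nt, "->", rhs)
```

Expanding the rules reproduces the original string; the nonterminals correspond to the reusable phrases a phrase-browsing hierarchy would expose.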
28
Teoma
A more refined concept of authoritativeness, depending on the specific query ("subject-specific popularity")
More sophisticated query refinement
But: coverage is only 10% of that of Google
Example: http://www.teoma.com