Presentation on theme: "Why Spectral Retrieval Works" (Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany; joint work with Debapriyo Majumdar; SIGIR 2005) — Presentation transcript:

1 Why Spectral Retrieval Works. Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany. Joint work with Debapriyo Majumdar. SIGIR 2005 in Salvador, Brazil, August 15–19.

2–3 What we mean by spectral retrieval

• Ranked retrieval in the term space: documents are the columns of a term-document matrix, the query q is a vector in the term space, and documents are ranked by the cosine similarity q^T d_j / (|q| |d_j|).

                d1  d2  d3  d4  d5
    internet     2   0   0   1   0        query q = (1, 0, 0, 0)^T   ("internet")
    web          1   2   0   1   0
    surfing      1   1   0   2   1
    beach        0   0   1   1   2

    cosine similarities:          0.82   0.00   0.00   0.38   0.00
    "true" similarities to q:     1.00   1.00   0.00   0.50   0.00

• Spectral retrieval = linear projection to an eigensubspace by a projection matrix L, then cosine similarity in the subspace: (Lq)^T (Ld_j) / (|Lq| |Ld_j|).

    projection matrix L:    0.42   0.51   0.66   0.37
                            0.33   0.43  -0.08  -0.84

             Lq       Ld1     Ld2     Ld3     Ld4     Ld5
             0.42     2.01    1.67    0.37    2.61    1.39
             0.33     1.01    0.79   -0.84   -0.21   -1.75

    cosine similarities in the subspace:   0.98   0.98  -0.25   0.73   0.01

  In the term space, d2 gets similarity 0.00 to the query although it is about the web; after the projection it ranks as high as d1.
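A minimal numpy sketch of this pipeline (the matrix, query, and subspace dimension k = 2 are the toy values from the slides; that L consists of the top-k left singular vectors is the standard LSI construction and an assumption here):

    import numpy as np

    # toy term-document matrix (rows: internet, web, surfing, beach)
    A = np.array([[2, 0, 0, 1, 0],
                  [1, 2, 0, 1, 0],
                  [1, 1, 0, 2, 1],
                  [0, 0, 1, 1, 2]], dtype=float)
    q = np.array([1.0, 0.0, 0.0, 0.0])            # query: "internet"

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # ranked retrieval in the term space
    print(np.round([cosine(q, d) for d in A.T], 2))        # ≈ [0.82 0. 0. 0.38 0.]

    # spectral retrieval: project query and documents to a k-dimensional subspace
    k = 2
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    L = U[:, :k].T                                # projection matrix, k x #terms
    print(np.round([cosine(L @ q, L @ d) for d in A.T], 2))
    # ≈ [0.98 0.98 -0.25 0.73 0.01]: d2 now scores high
    # although it shares no term with the query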

4 Why and when does this work?

• Previous work: if the term-document matrix is a slight perturbation of a rank-k matrix, then projection to a k-dimensional subspace works
  – Papadimitriou, Tamaki, Raghavan, Vempala PODS'98
  – Ding SIGIR'99
  – Ando and Lee SIGIR'01
  – Azar, Fiat, Karlin, McSherry, Saia STOC'01
• Our explanation: spectral retrieval works through its ability to identify pairs of terms with similar co-occurrence patterns
  – no single subspace is appropriate for all term pairs
  – we fix that problem

5–7 Spectral retrieval — alternative view

• Spectral retrieval = linear projection to an eigensubspace. The cosine similarity in the subspace can be rewritten as a similarity in the term space:

    (Lq)^T (Ld_j) / (|Lq| |Ld_j|)  =  q^T (L^T L d_j) / (|Lq| |L^T L d_j|)

• So the matrix L^T L acts as an expansion matrix applied to the documents:

    expansion matrix L^T L:    0.29   0.36   0.25  -0.12
                               0.36   0.44   0.30  -0.17
                               0.25   0.30   0.44   0.30
                              -0.12  -0.17   0.30   0.84

    expanded documents:
                 L^T L d1   L^T L d2   L^T L d3   L^T L d4   L^T L d5
    internet        1.18       0.96      -0.12       1.03       0.01
    web             1.45       1.19      -0.17       1.22      -0.05
    surfing         1.24       1.04       0.30       1.73       1.04
    beach          -0.11      -0.04       0.84       1.15       1.98

• Ranked retrieval in the term space with the expanded documents, using the similarities q^T (L^T L d_j) / (|q| |L^T L d_j|), gives the same ranking: replacing |Lq| by |q| only rescales all scores by a constant that does not depend on the document.

• Spectral retrieval = document expansion (not query expansion)
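A small numpy check of this equivalence, continuing the toy example (again assuming L is built from the top-2 left singular vectors):

    import numpy as np

    A = np.array([[2, 0, 0, 1, 0],
                  [1, 2, 0, 1, 0],
                  [1, 1, 0, 2, 1],
                  [0, 0, 1, 1, 2]], dtype=float)
    q = np.array([1.0, 0.0, 0.0, 0.0])
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    L = U[:, :2].T
    E = L.T @ L                                   # expansion matrix L^T L
    norm = np.linalg.norm

    for d in A.T:
        lhs = (L @ q) @ (L @ d) / (norm(L @ q) * norm(L @ d))
        rhs = q @ (E @ d) / (norm(L @ q) * norm(E @ d))
        assert abs(lhs - rhs) < 1e-9              # equal: L has orthonormal rows

    # replacing |Lq| by |q| rescales every score by the same constant, so
    # ranking by the document-expansion similarities gives the same order
    scores = [q @ (E @ d) / (norm(q) * norm(E @ d)) for d in A.T]
    print(np.argsort(scores)[::-1] + 1)           # document ranking: d1, d2, d4, d5, d3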

8–9 Why document "expansion"

  A 0-1 expansion matrix rewrites a document by adding related terms, e.g. add "internet" if "web" is present:

               internet  web  surfing  beach
    internet       1      1      0       0            0        1
    web            1      1      0       0      ·     1   =    1
    surfing        0      0      1       0            1        1
    beach          0      0      0       1            0        0
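The same toy expansion as a sketch in code (term order internet, web, surfing, beach as above):

    import numpy as np

    # 0-1 expansion matrix: add "internet" if "web" is present (and vice versa)
    E01 = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1]])
    d = np.array([0, 1, 1, 0])    # document containing "web" and "surfing"
    print(E01 @ d)                # [1 1 1 0]: "internet" has been added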

10 Why document "expansion"

• An ideal expansion matrix has
  – high scores for intuitively related terms
  – low scores for intuitively unrelated terms

    matrix L projecting to 2 dimensions:    0.42   0.51   0.66   0.37
                                            0.33   0.43  -0.08  -0.84

    expansion matrix L^T L:    0.29   0.36   0.25  -0.12
                               0.36   0.44   0.30  -0.17
                               0.25   0.30   0.44   0.30
                              -0.12  -0.17   0.30   0.84

  Applied to the document (0, 1, 1, 0)^T it adds "internet" where "web" is present:

    L^T L · (0, 1, 1, 0)^T = (0.61, 0.74, 0.74, 0.13)^T

11 Why document "expansion"

• An ideal expansion matrix has
  – high scores for intuitively related terms
  – low scores for intuitively unrelated terms

    matrix L projecting to 3 dimensions:    0.42   0.51   0.66   0.37
                                            0.33   0.43  -0.08  -0.84
                                           -0.80   0.59   0.06  -0.01

    expansion matrix L^T L:    0.93  -0.12   0.20  -0.11
                              -0.12   0.80   0.34  -0.18
                               0.20   0.34   0.44   0.30
                              -0.11  -0.18   0.30   0.84

    L^T L · (0, 1, 1, 0)^T = (0.08, 1.13, 0.78, 0.12)^T

  Now "internet" is barely added at all: the expansion matrix depends heavily on the subspace dimension!
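A sketch reproducing this dependence on the subspace dimension (same toy matrix; building L from the top-k left singular vectors is again the assumed construction):

    import numpy as np

    A = np.array([[2, 0, 0, 1, 0],
                  [1, 2, 0, 1, 0],
                  [1, 1, 0, 2, 1],
                  [0, 0, 1, 1, 2]], dtype=float)
    d = np.array([0.0, 1.0, 1.0, 0.0])       # document with "web" and "surfing"
    U, _, _ = np.linalg.svd(A, full_matrices=False)

    for k in (2, 3, 4):
        E = U[:, :k] @ U[:, :k].T            # expansion matrix L^T L for dimension k
        print(k, np.round(E @ d, 2))
    # k=2 adds "internet" strongly (0.61), k=3 barely (0.08),
    # and k=4 (full rank) not at all: L^T L is then the identity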

12–13 Our Key Observation

• We studied how the entries in the expansion matrix depend on the dimension of the subspace to which documents are projected.

  [Three plots: expansion-matrix entry as a function of the subspace dimension (0 to 600) for the term pairs node/vertex, logic/logics, and logic/vertex.]

• No single dimension is appropriate for all term pairs, but the shape of the curve is a good indicator for relatedness!
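A sketch of how such a curve can be computed: the (i, j) entry of the expansion matrix is a cumulative sum over rank-1 contributions of the left singular vectors, so all dimensions come out of one pass (the term indices here are illustrative):

    import numpy as np

    def expansion_curve(A, i, j):
        """Entry (i, j) of the expansion matrix U_k U_k^T for k = 1, 2, ..."""
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        return np.cumsum(U[i, :] * U[j, :])   # k-th prefix sum = entry at dimension k

    # toy usage: curve for the pair ("internet", "web") = indices (0, 1)
    A = np.array([[2, 0, 0, 1, 0],
                  [1, 2, 0, 1, 0],
                  [1, 1, 0, 2, 1],
                  [0, 0, 1, 1, 2]], dtype=float)
    print(np.round(expansion_curve(A, 0, 1), 2))   # ≈ [ 0.21  0.36 -0.12  0.  ]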

14 Curves for related terms

• We call two terms perfectly related if they have an identical co-occurrence pattern.

  [Matrix sketch: two term rows with identical co-occurrence patterns across the documents.]

  [Three plots of the expansion-matrix entry vs. subspace dimension (0 to 600): proven shape for perfectly related terms; provably small change after a slight perturbation; half way to a real matrix, the up-and-then-down shape remains.]

  But the point of fall-off is different for every term pair!

15 Curves for unrelated terms

• Co-occurrence graph:
  – vertices = terms
  – edge = two terms co-occur
• We call two terms perfectly unrelated if no path connects them in the graph.

  [Three plots of the expansion-matrix entry vs. subspace dimension (0 to 600): proven shape for perfectly unrelated terms; provably small change after a slight perturbation; half way to a real matrix.]

  Curves for unrelated terms are random oscillations around zero.

16 Telling the shapes apart — TN

1. Normalize the term-document matrix so that the theoretical point of fall-off is equal for all term pairs.
2. For each term pair: if the curve is never negative before this point, set the entry in the expansion matrix to 1, otherwise to 0.

  [Three plots: curves that stay non-negative up to the fall-off point get entry 1, the others entry 0.]

  A simple 0-1 classification, no fractional entries!
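A heavily hedged sketch of the TN test (the paper's normalization and the fall-off point it yields are not reproduced here; `k_star` stands in for that theoretically derived point, and the curve construction is the one sketched above):

    import numpy as np

    def tn_entry(A, i, j, k_star):
        """TN: 0-1 entry for term pair (i, j): 1 iff the curve is never
        negative before the (assumed common) point of fall-off k_star."""
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        curve = np.cumsum(U[i, :] * U[j, :])
        return 1 if np.all(curve[:k_star] >= 0) else 0

    A = np.array([[2, 0, 0, 1, 0],
                  [1, 2, 0, 1, 0],
                  [1, 1, 0, 2, 1],
                  [0, 0, 1, 1, 2]], dtype=float)
    print(tn_entry(A, 0, 1, k_star=2))    # "internet"/"web": 1 (related)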

17 An alternative algorithm — TM

1. Again, normalize the term-document matrix so that the theoretical point of fall-off is equal for all term pairs.
2. For each term pair compute the monotonicity of its initial curve (= 1 if perfectly monotone, → 0 as the number of turns increases).
3. If the monotonicity is above some threshold, set the entry in the expansion matrix to 1, otherwise to 0.

  [Three plots with example monotonicity scores 0.82, 0.69, and 0.07: the first two lie above the threshold and get entry 1, the third gets entry 0.]

  Again: a simple 0-1 classification!
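One plausible monotonicity score, as a sketch (illustrative, not necessarily the measure used in the paper: it equals 1 for a monotone curve and decays with the number of direction changes; the threshold is likewise an assumed parameter):

    import numpy as np

    def monotonicity(curve):
        """1.0 for a perfectly monotone curve, smaller as the number of turns grows."""
        diffs = np.diff(curve)
        signs = np.sign(diffs[diffs != 0])
        turns = int(np.sum(signs[1:] != signs[:-1]))   # direction changes
        return 1.0 / (1.0 + turns)

    def tm_entry(curve, threshold=0.5):
        """TM: entry 1 iff the initial curve is monotone enough."""
        return 1 if monotonicity(curve) >= threshold else 0

    print(monotonicity([0.1, 0.3, 0.5, 0.7]))   # monotone: 1.0
    print(monotonicity([0.1, 0.5, 0.2, 0.6]))   # two turns: ~0.33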

18–19 Experimental results (average precision)

              TIME           REUTERS         OHSUMED
              425 docs       21578 docs      233445 docs
              3882 terms     5701 terms      99117 terms

    COS       63.2%          36.2%           13.2%
    LSI*      62.8%          32.0%            6.9%
    LSI-RN*   58.6%          37.0%           13.0%
    CORR*     59.1%          32.3%           10.9%
    IRR*      62.2%          ——              ——
    TN        64.9%          41.9%           14.4%
    TM        64.1%          42.9%           15.3%

  * The numbers for LSI, LSI-RN, CORR, IRR are for the best subspace dimension!

    COS    = baseline: cosine similarity in the term space
    LSI    = Latent Semantic Indexing (Dumais et al. 1990)
    LSI-RN = term-normalized LSI (Ding et al. 2001)
    CORR   = correlation-based LSI (Dupret et al. 2001)
    IRR    = Iterative Residual Rescaling (Ando & Lee 2001)
    TN     = our non-negativity test
    TM     = our monotonicity test
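For reference, a minimal sketch of the average-precision metric reported above (standard definition; the document ids and relevance judgments are illustrative):

    def average_precision(ranked_ids, relevant_ids):
        """Mean of precision@k over the ranks k of the relevant documents."""
        relevant = set(relevant_ids)
        hits, precisions = 0, []
        for k, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                hits += 1
                precisions.append(hits / k)
        return sum(precisions) / len(relevant) if relevant else 0.0

    print(average_precision(["d1", "d2", "d4", "d5", "d3"], {"d1", "d2", "d4"}))  # 1.0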

20–21 Conclusions

• Main message: spectral retrieval works through its ability to identify pairs of terms with similar co-occurrence patterns
  – a simple 0-1 classification that considers a sequence of subspaces is at least as good as schemes that commit to a fixed subspace
• Some useful corollaries …
  – new insights into the effect of term-weighting and other normalizations for spectral retrieval
  – straightforward integration of known word relationships
  – consequences for spectral link analysis?

Obrigado!


23 Why document "expansion"

• An ideal expansion matrix has
  – high scores for related terms
  – low scores for unrelated terms
• The expansion matrix L^T L depends on the subspace dimension.

    matrix L projecting to all 4 dimensions:    0.42   0.51   0.66   0.37
                                                0.33   0.43  -0.08  -0.84
                                               -0.80   0.59   0.06  -0.01
                                                0.27   0.45  -0.75   0.41

    expansion matrix L^T L:    1  0  0  0
                               0  1  0  0
                               0  0  1  0
                               0  0  0  1

    L^T L · (0, 1, 1, 0)^T = (0, 1, 1, 0)^T

  At full dimension L^T L is the identity, so "add internet if web is present" has no effect: nothing is added at all.

