Why Spectral Retrieval Works

Why Spectral Retrieval Works
Holger Bast & Debapriyo Majumdar, "Why Spectral Retrieval Works", in Proceedings of SIGIR'05, pages 11-18, August 15-19, 2005. Presenter: Suhan Yu

Abstract
Spectral retrieval: in LSI the characteristics are latent topics; these characteristics are combined into a vector, and the vector is then used to compare terms and documents. This process is called spectral retrieval.
Instead of a fixed low dimension, vary the dimension and study, for each term pair, the resulting curve of relatedness scores.
"Spectral" refers to the weight at each frequency; this is the signal-processing interpretation, where each frequency can be seen as a feature. In IR a document's features are its terms; in LSA a document's features are the latent topics. Each feature carries a weight, and since there are many features, stringing them together gives a vector. Comparisons are then made with this feature vector, hence the name spectral retrieval.

What we mean by spectral retrieval
Ranked retrieval in the term space.
[Slide figure: a small term-document matrix over the terms internet, web, surfing, beach for documents d1-d5 and a query q. The "true" similarities to the query (1.00, 1.00, 0.00, 0.50, 0.00) are contrasted with the cosine similarities qTdi / (|q||di|) = (0.82, 0.00, 0.00, 0.38, 0.00).]
Speaker note: Let me first explain the classical view on spectral retrieval, and introduce some basic notation on the way. This will be our example for the next 5 minutes, so let me try to make it perfectly clear.

What we mean by spectral retrieval
Spectral retrieval = linear projection to an eigensubspace.
[Slide figure: the same example; a projection matrix L maps the documents and the query into a low-dimensional subspace (Ld1, ..., Ld5, Lq), and the ranking uses the cosine similarities in the subspace, (Lq)T(Ld1) / (|Lq| |Ld1|), ..., giving 0.98, 0.98, -0.25, 0.73, 0.01.]
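To make the projection step concrete, here is a minimal numpy sketch of classical retrieval versus spectral (subspace) retrieval. This is not the authors' code; the tiny matrix A and query q below are made-up stand-ins for the example on the slide.

```python
import numpy as np

# Toy term-document matrix: rows = terms (internet, web, surfing, beach),
# columns = documents d1..d5. The values are illustrative only.
A = np.array([[2., 1., 0., 1., 0.],
              [1., 1., 0., 1., 0.],
              [0., 0., 1., 1., 1.],
              [0., 0., 1., 0., 1.]])
q = np.array([1., 1., 0., 0.])            # query containing "internet web"

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# Classical ranked retrieval in the term space.
print([round(cosine(q, A[:, j]), 2) for j in range(A.shape[1])])

# Spectral retrieval: project onto the top-k left singular vectors of A.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
L = U[:, :k].T                            # projection matrix L
print([round(cosine(L @ q, L @ A[:, j]), 2) for j in range(A.shape[1])])
```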

Viewing LSI as document expansion
The cosine similarity of a query q with a document of A: [equation on the slide]. The diagonal matrix (of singular values) can be dropped.

Viewing LSI as document expansion
1.-2. [Equations on the slide; the projection directions are eigenvectors.] The query needs to be projected online, whereas the expanded documents can be computed offline; this is document expansion.

Spectral retrieval — alternative view
Ranked retrieval in the term space.
[Slide figures: the same example. The subspace cosine similarity (Lq)T(Ld1) / (|Lq| |Ld1|) equals qT(LTLd1) / (|Lq| |LTLd1|), so the projection matrix L can be replaced by the expansion matrix LTL applied to the documents. Expanding the documents (LTLd1, ..., LTLd5) and scoring them against the raw query with qT(LTLd1) / (|q| |LTLd1|) gives the similarities after document expansion.]
Spectral retrieval = document expansion (not query expansion).
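A quick numerical check of this equivalence, reusing the same toy matrix and query as above (a sketch, not the paper's code):

```python
import numpy as np

A = np.array([[2., 1., 0., 1., 0.],      # toy term-document matrix
              [1., 1., 0., 1., 0.],
              [0., 0., 1., 1., 1.],
              [0., 0., 1., 0., 1.]])
q = np.array([1., 1., 0., 0.])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
L = U[:, :2].T                           # projection matrix
E = L.T @ L                              # expansion matrix LTL (terms x terms)
expanded = E @ A                         # expanded documents, computable offline

subspace = [cosine(L @ q, L @ A[:, j]) for j in range(A.shape[1])]
doc_exp = [cosine(q, expanded[:, j]) for j in range(A.shape[1])]
# Both score lists induce the same ranking: they differ only by the constant
# factor |Lq| / |q|, because |LTL d| = |L d| for every document d.
print(np.argsort(subspace)[::-1])
print(np.argsort(doc_exp)[::-1])
```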

The curve of relatedness scores
Instead of looking at all the entries of the expansion matrix for one fixed dimension, look at one fixed entry for all dimensions: for a pair of terms i and j, take the (i, j) entry of the expansion matrix at dimension k (in the latent semantic space) and follow it as k varies.
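As a sketch of what such a curve is in code, assuming the expansion matrix at dimension k is U_k U_k^T (consistent with the document-expansion view above); the matrix and term indices are again illustrative:

```python
import numpy as np

A = np.array([[2., 1., 0., 1., 0.],      # toy term-document matrix
              [1., 1., 0., 1., 0.],
              [0., 0., 1., 1., 1.],
              [0., 0., 1., 0., 1.]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)

def relatedness_curve(U, i, j):
    """Entry (i, j) of the expansion matrix U_k U_k^T for k = 1..rank."""
    # Cumulative sum over the rank-1 contributions U[i, k] * U[j, k].
    return np.cumsum(U[i, :] * U[j, :])

print(relatedness_curve(U, 0, 1))        # a related pair (internet, web)
print(relatedness_curve(U, 0, 3))        # an unrelated pair (internet, beach)
```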

The curve of relatedness scores
Most curves quite naturally fall into one of three categories.
[Slide figure: example curves; the visible label reads "unrelated terms".]

Perfectly related terms
Perfectly related terms: terms that have identical co-occurrence patterns.
Definition: [equation shown on the slide].

Perfectly related terms (cont.)
If A = UΣV^T is the singular value decomposition of A, define C = AA^T (since the eigenvectors are orthogonal, U^T U is the identity matrix) and work with C instead of A.

Perfectly related terms (cont.)
[C is written in block form, with blocks of sizes (m-2)×(m-2), (m-2)×1, (m-2)×1, 1×(m-2), 1×1 and 1×1. The presenter notes that the corresponding formula in the paper is wrong here.]

Perfectly related terms (cont.)
Calculate the eigenvectors; one of the eigenvalues is x - y.

Perfectly related terms (cont.)
Substitute the values of x and y.

Perfectly related terms (cont.)
Calculate the eigenvector and normalize it to norm 1.

Perfectly related terms (cont.)
Then [the vector shown] is an eigenvector of C; assume [the vectors shown] are the other eigenvectors of C (eigenvectors are orthogonal).

Perfectly related terms (cont.)
We get (Lemma 1): the vector [shown on the slide] is a left singular vector of A, the corresponding singular value is [shown on the slide], and all other left singular vectors of A have their last two entries equal.
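The equations on these slides were lost in the transcript. The following is a hedged reconstruction of the calculation, assuming (consistent with the block structure above) that the two perfectly related terms are the last two rows of A, so that C = AA^T has equal diagonal entries x, off-diagonal entries y, and identical columns b in its last two rows and columns:

```latex
C = AA^{T} =
\begin{pmatrix}
  B     & b & b \\
  b^{T} & x & y \\
  b^{T} & y & x
\end{pmatrix},
\qquad
u = \tfrac{1}{\sqrt{2}}\,(0,\dots,0,1,-1)^{T},
\qquad
C u = \tfrac{1}{\sqrt{2}}
\begin{pmatrix} b - b \\ x - y \\ y - x \end{pmatrix}
= (x - y)\,u .
```

So u is an eigenvector of C with eigenvalue x - y, i.e. a left singular vector of A (with singular value sqrt(x - y)), and every other left singular vector u' satisfies u^T u' = 0, hence its last two entries are equal, which is what Lemma 1 states.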

Perfectly related terms (cont.)
[Slide figure: relatedness curves for term 1 and term 2; the point of fall-off is different for every term pair!]

Adding perturbations
Perfect relatedness is robust under small perturbations of the term-document matrix. Lemma 2: [stated on the slide].
Speaker note: if perfectly related terms can tolerate some noise, then we do not have to find exactly perfectly related terms; small deviations are also acceptable.

Adding perturbations
[Slide equations: the k most significant left singular vectors of A, an orthogonal k×k matrix, and an arbitrary perturbation matrix whose Frobenius norm is less than 1/4; Stewart's theorem on the perturbation of symmetric matrices is applied to obtain the bound.]

Adding perturbations
[Proof step on the slide: define the relevant quantity and apply Cauchy's inequality.]

Adding perturbations
The up-and-then-down shape remains: sufficiently small perturbations change the curve of relatedness scores only a little at any dimension before its fall-off.

Curves for unrelated terms
Co-occurrence graph: vertices = terms; edge = the two terms co-occur in some document.
Lemma 3: we call two terms perfectly unrelated if no path connects them in the graph.
[Slide figures: expansion matrix entry versus subspace dimension, showing the proven shape for perfectly unrelated terms, the provably small change after a slight perturbation, and the curve half way to a real matrix.]
Curves for unrelated terms are random oscillations around zero.
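A small sketch of the co-occurrence-graph test for perfect unrelatedness (a hypothetical helper, not code from the paper; two terms are joined by an edge whenever at least one document contains both):

```python
import numpy as np
from collections import deque

def perfectly_unrelated(A, i, j):
    """True if no path connects terms i and j in the co-occurrence graph of A."""
    occurs = (A > 0).astype(int)                 # term-document incidence
    adj = (occurs @ occurs.T) > 0                # terms sharing >= 1 document
    np.fill_diagonal(adj, False)
    seen, queue = {i}, deque([i])                # breadth-first search from term i
    while queue:
        t = queue.popleft()
        for u in np.nonzero(adj[t])[0]:
            if u not in seen:
                seen.add(int(u))
                queue.append(int(u))
    return j not in seen

A = np.array([[2., 1., 0., 0., 0.],   # toy matrix: terms 0-1 and terms 2-3
              [1., 1., 0., 0., 0.],   # never share a document
              [0., 0., 1., 1., 1.],
              [0., 0., 1., 0., 1.]])
print(perfectly_unrelated(A, 0, 3))   # True
print(perfectly_unrelated(A, 0, 1))   # False
```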

Review of the three lemmas
Lemma 1 (perfectly related terms): the vector [from the slide] is a left singular vector of A, the corresponding singular value is [from the slide], and all other left singular vectors of A have their last two entries equal.
Lemma 2 (adding perturbations): the curves of relatedness scores change only slightly under small perturbations of the term-document matrix.
Lemma 3 (definition of unrelated terms): we call two terms perfectly unrelated if no path connects them in the co-occurrence graph.

Dimensionless algorithm
Choosing a dimension: every choice is inappropriate for a significant fraction of the term pairs.

Dimensionless algorithm

Algorithm TN (dimensionless)
Normalize the rows of A to length 1. Compute the SVD of the normalized A. For each pair of terms i, j, compute the size of the set [of dimensions at which the curve is negative, as defined on the slide]. Perform document expansion with the zero-one matrix T, where T_ij = 1 (meaning the two terms are related) if and only if the corresponding curve of relatedness scores is never negative.

Telling the shapes apart — TN
Normalize the term-document matrix so that the theoretical point of fall-off is equal for all term pairs. For each term pair: if the curve is never negative before this point, set the entry in the expansion matrix to 1, otherwise to 0.
[Slide figures: three example curves of expansion matrix entry versus subspace dimension (up to 600); the first two get entry 1, the third gets entry 0.]
A simple 0-1 classification, no fractional entries! (A sketch follows below.)
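A minimal sketch of the TN test (not the authors' implementation; as an assumption, the fall-off point is taken to be one common cut-off k0 for every pair after row normalization, and the toy matrix is illustrative):

```python
import numpy as np

def tn_expansion_matrix(A, k0):
    """0-1 expansion matrix: T[i, j] = 1 iff the relatedness curve of the
    term pair (i, j) is never negative at dimensions 1..k0."""
    # Row-normalize so that the theoretical fall-off point is the same
    # for all term pairs.
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    m = A.shape[0]
    T = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            curve = np.cumsum(U[i, :] * U[j, :])   # (U_k U_k^T)_{ij}, k = 1..r
            T[i, j] = 1.0 if np.all(curve[:k0] >= 0) else 0.0
    return T

A = np.array([[2., 1., 0., 1., 0.],
              [1., 1., 0., 1., 0.],
              [0., 0., 1., 1., 1.],
              [0., 0., 1., 0., 1.]])
T = tn_expansion_matrix(A, k0=3)
expanded_docs = T @ A     # document expansion with the zero-one matrix T
print(T)
```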

Algorithm TN (dimensionless)
Lemma 1: all perfectly related terms pass the test (their curves are non-negative before the fall-off).
Lemma 2: these assignments to T are invariant under small perturbations of the underlying term-document matrix.
Lemma 3: completely unrelated terms have all-zero curves.

Algorithm TS (dimensionless)
Compute the same matrix U as for TN. For each pair of terms i, j, compute the smoothness of their curve [formula on the slide]; the smoothness is high if and only if the scores go only up or only down, and low if the curve zig-zags. Perform document expansion with the zero-one matrix T, where T_ij = 1 if and only if the smoothness exceeds a threshold s. (A sketch follows below.)
Disadvantage: the threshold s must be found. Advantage: finding s is more intuitive than fixing a dimension as in the previous methods.
In these experiments s was set so that 0.2% of the entries in T are 1.
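The smoothness formula itself is not in the transcript. As a stand-in, the sketch below uses one plausible definition, the net change of the curve divided by its total variation, which is 1 for a monotone curve and close to 0 for a zig-zag; this is an assumption, not necessarily the paper's formula:

```python
import numpy as np

def smoothness(curve):
    """Hypothetical smoothness score: 1.0 for a monotone curve,
    close to 0.0 for a curve that zig-zags up and down."""
    steps = np.diff(curve)
    total_variation = np.sum(np.abs(steps)) + 1e-12
    return abs(np.sum(steps)) / total_variation

print(smoothness(np.array([0.1, 0.3, 0.5, 0.6])))         # 1.0 (only up)
print(smoothness(np.array([0.1, -0.2, 0.3, -0.1, 0.2])))  # small (zig-zag)
```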

An alternative algorithm — TM
Again, normalize the term-document matrix so that the theoretical point of fall-off is equal for all term pairs. For each term pair compute the monotonicity of its initial curve (= 1 if perfectly monotone, approaching 0 as the number of turns increases). If the monotonicity is above some threshold, set the entry in the expansion matrix to 1, otherwise to 0.
[Slide figures: three example curves of expansion matrix entry versus subspace dimension, with monotonicity scores 0.82, 0.69 and 0.07; the two high-monotonicity pairs get entry 1, the third gets entry 0.]
Again: a simple 0-1 classification! (A sketch follows below.)
Speaker note: one more slide with the s_ij replaced by T_ij (or one after the other).
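As before, the exact monotonicity formula is not in the transcript; this hedged sketch uses one reasonable choice, the fraction of consecutive steps that do not reverse direction:

```python
import numpy as np

def monotonicity(curve):
    """Hypothetical monotonicity score: 1.0 if the curve is perfectly
    monotone, decreasing towards 0.0 as the number of turns grows."""
    steps = np.sign(np.diff(curve))
    steps = steps[steps != 0]
    if len(steps) < 2:
        return 1.0
    turns = np.sum(steps[1:] != steps[:-1])      # direction reversals
    return 1.0 - turns / (len(steps) - 1)

def tm_entry(curve, threshold=0.5):
    """Zero-one expansion matrix entry for one term pair."""
    return 1.0 if monotonicity(curve) > threshold else 0.0

print(monotonicity(np.array([0.1, 0.3, 0.5, 0.6])))         # 1.0
print(monotonicity(np.array([0.1, -0.2, 0.3, -0.1, 0.2])))  # 0.0
```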

Computational complexity
Original LSI: the cost depends on the number of nonzero entries [formula on the slide]. For TN/TS one can save work by discarding pairs of terms that do not co-occur in at least one document; the cost then depends on the average number of related terms per term.

Experimental evaluation
Three collections:
Time collection (3882 terms × 425 documents), 83 queries
Reuters collection (5701 terms × 21578 documents), 120 queries (topic labels)
Ohsumed collection (99117 terms × 233445 documents), 63 queries

Experimental evaluation
Other spectral retrieval schemes from the literature are used for comparison: LSI; LSI-RN (term-normalized variant); CORR (correlation method); IRR (iterative residual rescaling). Baseline method: COS (cosine similarity).
COS, LSI, IRR and LSI-RN use the standard tf-idf matrix; CORR, TN and TS use the row-normalized matrix.

Results of the experiments
[Slide figures: results at subspace dimensions 300 and 400.]

Results of the experiments
[Slide figures: results at subspace dimensions 800, 1000 and 1200.]

Experimental results (average precision on the TIME collection, 425 docs, 3882 terms)

        COS     LSI*    LSI-RN*  CORR*   IRR*    TN      TM
TIME    63.2%   62.8%   58.6%    59.1%   62.2%   64.9%   64.1%

COS: baseline, cosine similarity in the term space. LSI: Latent Semantic Indexing (Dumais et al. 1990). LSI-RN: term-normalized LSI (Ding et al. 2001). CORR: correlation-based LSI (Dupret et al. 2001). IRR: Iterative Residual Rescaling (Ando & Lee 2001). TN: our non-negativity test. TM: our monotonicity test.
* The numbers for LSI, LSI-RN, CORR and IRR are for the best subspace dimension!
Speaker note: mention the unpredictable effects of LSI and its relatives: sometimes even below the COS baseline!

Experimental results (average precision)

            COS     LSI*    LSI-RN*  CORR*   IRR*   TN      TM
TIME        63.2%   62.8%   58.6%    59.1%   62.2%  64.9%   64.1%   (425 docs, 3882 terms)
REUTERS     36.2%   32.0%   37.0%    32.3%   ——     41.9%   42.9%   (21578 docs, 5701 terms)
OHSUMED     13.2%    6.9%   13.0%    10.9%   ——     14.4%   15.3%   (233445 docs, 99117 terms)

* The numbers for LSI, LSI-RN, CORR and IRR are for the best subspace dimension!

Binary vs. fractional relatedness The finding is that algorithms which do a simple binary classification into related and unrelated term pairs outperform schemes which seem to have additional power by giving a fractional assessment for each term pair.

Binary vs. fractional relatedness
Most curves have scores at or below zero either at very few dimensions or at quite a lot of dimensions; this indicates that most term pairs are either essentially perfectly related or completely unrelated.

Conclusions and outlook
This paper introduced the curves of relatedness scores as a new angle for looking at spectral retrieval. The dimensionless algorithms outperform previous schemes and are more intuitive. Currently a symmetric matrix is used, but some relations between terms are asymmetric, such as "nucleic" and "acid".