
9/7 Agenda
–Project 1 discussion
–Correlation Analysis
–PCA (LSI)

The first rule of the social fabric – that in times of crisis you protect the vulnerable – was trampled. -- David Brooks (NYT 9/4), conservative political commentator

Improving Vector Space Ranking
We will consider two classes of techniques:
–Correlation analysis, which looks at correlations between keywords (and thus effectively computes a thesaurus based on word occurrence in the documents)
–Principal Components Analysis (also called Latent Semantic Indexing), which subsumes correlation analysis and does dimensionality reduction.

Correlation/Co-occurrence analysis
Co-occurrence analysis: terms that are related to terms in the original query may be added to the query. Two terms are related if they have high co-occurrence in documents.
Let n be the number of documents, n1 and n2 the number of documents containing terms t1 and t2, and m the number of documents containing both t1 and t2.
–If t1 and t2 are independent: m/n = (n1/n)(n2/n)
–If t1 and t2 are correlated: m/n >> (n1/n)(n2/n); inversely correlated if m/n << (n1/n)(n2/n)
–Measure of the degree of correlation: (m/n) / ((n1/n)(n2/n))
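A minimal sketch of this co-occurrence test in Python/NumPy; the corpus, the term indices, and the `cooccurrence_ratio` helper are illustrative, not from the slides:

```python
import numpy as np

def cooccurrence_ratio(M, t1, t2):
    """M is a binary term-document incidence matrix (terms x docs).
    Returns (m/n) / ((n1/n)(n2/n)); >> 1 suggests correlation,
    ~1 independence, << 1 inverse correlation."""
    n = M.shape[1]                      # number of documents
    n1 = M[t1].sum()                    # docs containing t1
    n2 = M[t2].sum()                    # docs containing t2
    m = np.sum(M[t1] & M[t2])           # docs containing both
    return (m / n) / ((n1 / n) * (n2 / n))

# Toy incidence matrix: 3 terms x 5 documents
M = np.array([[1, 1, 0, 1, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1]], dtype=int)
print(cooccurrence_ratio(M, 0, 1))   # high: terms 0 and 1 always co-occur
print(cooccurrence_ratio(M, 0, 2))   # 0: they never co-occur
```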

Association Clusters
Let M (with entries Mij) be the term-document matrix
–For the full corpus (global), or for the docs in the set of initial results (local)
–(also sometimes, stems are used instead of terms)
Correlation matrix C = MM^T (term-doc x doc-term = term-term)
–C itself is the un-normalized association matrix
–Normalized association matrix: Suv = Cuv / (Cuu + Cvv - Cuv)
The nth association cluster for a term tu is the set of terms tv such that the Suv are the n largest values among Su1, Su2, ..., Suk.
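A sketch of the (global) association-cluster computation in NumPy; the matrix, the cluster size, and the choice to exclude a term from its own cluster are illustrative assumptions:

```python
import numpy as np

def association_clusters(M, n_cluster=1):
    """M: term-document frequency matrix (terms x docs).
    Returns, for each term u, the indices of its nth association cluster."""
    C = M @ M.T                                      # un-normalized association matrix
    diag = np.diag(C)
    S = C / (diag[:, None] + diag[None, :] - C)      # Suv = Cuv / (Cuu + Cvv - Cuv)
    np.fill_diagonal(S, -np.inf)                     # exclude the term itself from its cluster
    # for each term, take the n_cluster terms with the largest Suv
    return np.argsort(-S, axis=1)[:, :n_cluster]

M = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 3, 1]])
print(association_clusters(M))   # 1st association cluster for each term
```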

Example
(figure: a small term-document matrix over terms K1, K2, K3 and documents d1 ... d7, with its correlation matrix and normalized correlation matrix; the numeric entries did not survive the transcript)
The 1st association cluster for K2 is {K3}.

Scalar clusters
Consider the normalized association matrix S. The "association vector" Au of term u is (Su1, Su2, ..., Suk).
To measure neighborhood-induced correlation between terms, take the cosine between the association vectors of terms u and v.
Even if terms u and v have low direct correlation, they may be transitively correlated (e.g., some term w has high correlation with both u and v).
The nth scalar cluster for a term tu is the set of terms tv such that the cosine values are the n largest among those for tu.
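A small self-contained sketch (the normalized association matrix S is made up); scalar clusters are then read off this cosine matrix the same way association clusters are read off S:

```python
import numpy as np

def scalar_cluster_matrix(S):
    """S: normalized association matrix (k x k, diagonal = 1).
    Returns the matrix of cosines between association vectors (rows of S)."""
    A = S / np.linalg.norm(S, axis=1, keepdims=True)   # unit-length association vectors
    return A @ A.T                                     # cosine between every pair of rows

# Example: a small normalized association matrix (values illustrative)
S = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.6],
              [0.1, 0.6, 1.0]])
print(scalar_cluster_matrix(S).round(3))
```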

Example
(Lisp trace of pairwise COSINE-METRIC calls over the rows of the normalized correlation matrix, producing the scalar (neighborhood) cluster matrix; the numeric vectors and most of the returned values did not survive the transcript)
The 1st scalar cluster for K2 is still {K3}.

Metric Clusters
Let r(tu, tv) be the minimum distance (in number of separating words) between tu and tv in any single document (infinity if they never occur together in a document).
–Define the cluster matrix Suv = 1/r(tu, tv)  (a variant uses the average distance instead of the minimum)
The nth metric cluster for a term tu is the set of terms tv such that the Suv are the n largest values among Su1, Su2, ..., Suk.
r(tu, tv) is also useful for proximity queries and phrase queries.
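A sketch of the distance computation on tokenized documents (the documents are made up; `min_term_distance` is an illustrative helper):

```python
def min_term_distance(docs, t1, t2):
    """docs: list of token lists. Returns the minimum number of words
    separating t1 and t2 within any single document (inf if never together)."""
    best = float("inf")
    for tokens in docs:
        pos1 = [i for i, w in enumerate(tokens) if w == t1]
        pos2 = [i for i, w in enumerate(tokens) if w == t2]
        for i in pos1:
            for j in pos2:
                best = min(best, abs(i - j) - 1)   # words strictly between the two
    return best

docs = [["query", "expansion", "uses", "term", "correlation"],
        ["term", "clustering", "and", "query", "reformulation"]]
print(min_term_distance(docs, "query", "term"))   # 2 in both docs -> 2
```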

Similarity Thesaurus
The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence.
–It is obtained by considering the terms as concepts in a concept space: each term is indexed by the documents in which it appears.
–Terms assume the original role of documents, while documents are interpreted as indexing elements.

Motivation
(figure: the query Q and terms Ka, Kb, Ki, Kj, Kv plotted as vectors in the concept space; terms whose vectors lie close to the query's terms are candidates for expansion)

Similarity Thesaurus
The relationship between two terms ku and kv is computed as a correlation factor cu,v given by the dot product of their term vectors: cu,v = ku . kv = sum over documents dj of wu,j * wv,j.
The global similarity thesaurus is built by computing the correlation factor cu,v for each pair of indexing terms [ku, kv] in the collection.
–Expensive, but incremental updates are possible...
This is similar to the scalar-clusters idea, but with tf/itf weighting defining the term vectors (next slide).

Similarity Thesaurus
Terminology:
–t: number of terms in the collection
–N: number of documents in the collection
–fi,j: frequency of occurrence of the term ki in the document dj
–tj: vocabulary size (number of distinct terms) of document dj
–itfj: inverse term frequency for document dj, itfj = log(t / tj)
To ki is associated a vector ki = (wi,1, wi,2, ..., wi,N), where wi,j is a normalized tf-itf weight: the (maximum-normalized) frequency of ki in dj multiplied by itfj, with the vector normalized to unit length.
Idea: it is no surprise if the Oxford dictionary mentions the word!
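A hedged sketch of the term-vector construction and the correlation factor cu,v in NumPy; the exact weighting (0.5/0.5 max-normalized tf times itf, as in Qiu and Frei's formulation) is an assumption about the elided formula on the slide:

```python
import numpy as np

def term_vectors(F):
    """F: term-document frequency matrix (t terms x N docs).
    Returns unit-length term vectors with tf-itf weights (assumed
    0.5/0.5 max-normalized tf, zero where the term does not appear)."""
    t, N = F.shape
    tj = (F > 0).sum(axis=0)                         # vocabulary size of each doc
    itf = np.log(t / tj)                             # inverse term frequency per doc
    max_f = F.max(axis=1, keepdims=True)             # max frequency of each term
    w = np.where(F > 0, (0.5 + 0.5 * F / max_f) * itf, 0.0)
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def correlation_factor(K):
    """c_{u,v} = k_u . k_v for all term pairs (the similarity thesaurus)."""
    return K @ K.T

F = np.array([[3, 0, 1],
              [1, 2, 0],
              [0, 1, 2]])
K = term_vectors(F)
print(correlation_factor(K).round(3))
```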

Beyond Correlation analysis: PCA/LSI
Suppose I start with documents described in terms of just two keywords, u and v, but then:
–I add a bunch of new keywords (of the form 2u-3v, 4u-v, etc.) and give the new doc-term matrix to you. Will you be able to tell that the documents are really 2-dimensional (in that there are only two independent keywords)?
–Suppose, in the above, I also add a bit of noise to each of the new terms (i.e., 2u-3v+noise, 4u-v+noise, etc.). Can you now discover that the documents are really 2-D?
–Suppose further that I remove the original keywords u and v from the doc-term matrix and give you only the new, linearly dependent keywords. Can you now tell that the documents are 2-dimensional?
Notice that in this last case, the true dimensions of the data are not even present in the representation! You have to re-discover the true dimensions as linear combinations of the given dimensions.

Data Generation Models
The fact that keywords in the documents are not actually independent, and that they have synonymy and polysemy among them, often manifests itself as if some malicious oracle mixed up the data as above. We need dimensionality reduction techniques.
–If the keyword dependence is only linear (as above), a general polynomial-time technique called Principal Components Analysis is able to do this dimensionality reduction. PCA applied to documents is called Latent Semantic Indexing.
–If the dependence is nonlinear, you need nonlinear dimensionality reduction techniques (such as neural networks), which are much costlier.

Visual Example Classify Fish –Length –Height

Move Origin: to the centroid of the data. But are these the best axes?

Better if one axis accounts for most of the data variation. What should we call the red axis? Size (a "factor").

Reduce Dimensions: what if we only consider "size"?
We retain 1.75/2.00 x 100 = 87.5% of the original variation. Thus, by discarding the yellow axis we lose only 12.5% of the original information.

If you can do it for fish, why not for docs?
We have documents as vectors in the space of terms. We want to:
–"Transform" the axes so that the new axes are orthonormal (independent) and can be ordered in terms of the amount of variation in the documents they capture
–Pick the top k dimensions (axes) in this ordering, and use these new k dimensions to do the vector-space similarity ranking
Why?
–Can reduce noise
–Can eliminate dependent variables
–Can capture synonymy and polysemy
How? SVD (Singular Value Decomposition); a sketch follows below.
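A minimal sketch of this transform-and-truncate idea with NumPy (the doc-term matrix is made up; k is chosen arbitrarily):

```python
import numpy as np

# Toy doc-term matrix: 4 documents x 5 terms (values are illustrative)
D = np.array([[2., 3., 0., 0., 1.],
              [1., 2., 0., 0., 0.],
              [0., 0., 4., 3., 1.],
              [0., 1., 3., 4., 0.]])

# Center the documents, then find the principal axes via SVD
Dc = D - D.mean(axis=0)
U, s, Vt = np.linalg.svd(Dc, full_matrices=False)

k = 2                                  # keep the top-k axes
coords = Dc @ Vt[:k].T                 # documents in the new k-dim space
explained = (s[:k]**2).sum() / (s**2).sum()
print(coords)
print(f"variation retained: {explained:.1%}")   # analogue of the 87.5% above
```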

What happens if you multiply a vector by a matrix?
In general, when you multiply a vector by a matrix, the vector gets "scaled" as well as "rotated"
–..except when the vector happens to be in the direction of one of the eigen vectors of the matrix
–..in which case it only gets scaled (stretched)
A symmetric square matrix has all real eigen values, and the values give an indication of the amount of stretching that is done for vectors in that direction.
The eigen vectors of the matrix define a new orthonormal space. You can model the multiplication of a general vector by the matrix as:
–First decompose the general vector into its projections in the eigen vector directions (i.e., take the dot product of the vector with each unit eigen vector)
–Then multiply the projections by the corresponding eigen values to get the new vector
This explains why the power method converges to the principal eigen vector: if a vector has a non-zero projection in the principal eigen vector direction, then repeated multiplication keeps stretching the vector in that direction, so that eventually all other directions vanish by comparison (see the sketch below).
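An illustrative power-iteration sketch (the matrix and iteration count are arbitrary):

```python
import numpy as np

def power_method(A, iters=100):
    """Repeatedly multiply and renormalize; converges to the principal
    eigen vector when the starting vector has a non-zero projection on it."""
    v = np.random.default_rng(0).normal(size=A.shape[0])
    for _ in range(iters):
        v = A @ v
        v = v / np.linalg.norm(v)
    eigenvalue = v @ A @ v            # Rayleigh quotient
    return eigenvalue, v

A = np.array([[2., 1.],
              [1., 3.]])              # symmetric, so real eigen values
lam, v = power_method(A)
print(lam, v)                         # matches np.linalg.eigh(A)'s largest pair (up to sign)
```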

SVD, Rank and Dimensionality (optional)
Suppose we did SVD on a doc-term matrix d-t, kept only the top-k singular values, and reconstructed the matrix d-t_k. We know:
–d-t_k has rank k (since we zeroed out all the other singular values when we reconstructed d-t_k)
–There is no rank-k matrix M such that ||d-t - M|| < ||d-t - d-t_k||
In other words, d-t_k is the best rank-k (dimension-k) approximation to d-t! This is the guarantee given by SVD.
The rank of a matrix M is defined as the size of the largest square sub-matrix of M which has a non-zero determinant.
–The rank of M is also equal to the number of non-zero singular values it has
–The rank of M is related to the true dimensionality of M. If you add a bunch of rows to M that are linear combinations of the existing rows of M, the rank of the new matrix is still the same as the rank of M.
The distance ||M - M'|| between two equal-sized matrices M and M' is defined here as the sum of the squares of the differences between the corresponding entries, Sum (m_uv - m'_uv)^2; it is zero exactly when M = M'.
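A quick numerical check of the rank-k guarantee, using the squared-difference matrix distance defined above (the random matrix and k are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
dt = rng.normal(size=(6, 5))                 # a toy doc-term matrix

U, s, Vt = np.linalg.svd(dt, full_matrices=False)
k = 2
dt_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]    # zero out all but the top-k singular values

def dist(A, B):
    return np.sum((A - B) ** 2)              # the matrix distance defined on the slide

print(np.linalg.matrix_rank(dt_k))           # k
print(dist(dt, dt_k))                        # equals the sum of the dropped s_i^2
print(np.sum(s[k:] ** 2))                    # same value (Eckart-Young)
```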

Terms and Docs as vectors in "factor" space
(figure: document vectors and term vectors plotted in the same space)
In addition to doc-doc similarity, we can compute term-term similarity. If terms were independent, the T-T similarity matrix would be diagonal.
–If it is not diagonal, we can use the correlations to add related terms to the query
–But we can also ask: "Are there independent dimensions which define the space where terms and docs are vectors?"

Singular Value Decomposition: Overview of Latent Semantic Indexing
Convert the doc-term matrix d-t into three matrices d-f, f-f, t-f such that d-f * f-f * (t-f)^T gives the original matrix back:
–d-f: doc-factor matrix (eigen vectors of d-t * d-t')
–f-f: factor-factor matrix, diagonal (positive square roots of the eigen values of d-t * d-t' or d-t' * d-t; both are the same)
–(t-f)^T: transposed term-factor matrix (eigen vectors of d-t' * d-t)
Dimensions: [d x t] = [d x f][f x f][f x t].
Reduce dimensionality: throw out the low-order rows and columns, keeping only k factors.
Recreate the matrix: multiply to produce the approximate doc-term matrix d-t_k = d-f_k * f-f_k * (t-f_k)^T, with dimensions [d x k][k x k][k x t] = [d x t].
d-t_k is the rank-k matrix that is closest to d-t.
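A sketch of the decomposition and rank-k reconstruction in NumPy (the doc-term matrix is a toy; the names df, ff, tf mirror the slide's notation):

```python
import numpy as np

# Toy doc-term matrix (docs x terms), values illustrative
dt = np.array([[1., 1., 0., 0.],
               [1., 0., 1., 0.],
               [0., 0., 1., 1.],
               [0., 1., 0., 1.]])

df, sigma, tf_T = np.linalg.svd(dt, full_matrices=False)
ff = np.diag(sigma)                       # factor-factor: singular values on the diagonal
tf = tf_T.T                               # term-factor

assert np.allclose(df @ ff @ tf.T, dt)    # df * ff * tf' recovers the original matrix

k = 2
dt_k = df[:, :k] @ ff[:k, :k] @ tf[:, :k].T   # best rank-k approximation
print(dt_k.round(2))
```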

New document coordinates: d-f * f-f
(figure: the D-F, F-F, and T-F matrices for a 6-term example, t1=database, t2=SQL, t3=index, t4=regression, t5=likelihood, t6=linear; the numeric entries did not survive the transcript)
–F-F holds the 6 singular values (positive square roots of the eigen values of MM' or M'M)
–D-F holds the eigen vectors of MM' (principal document directions)
–T-F holds the eigen vectors of M'M (principal term directions)

For the database/regression example
(figure: the reduced-dimension coordinates for the terms t1=database, t2=SQL, t3=index, t4=regression, t5=likelihood, t6=linear; the numeric entries did not survive the transcript)
Suppose D1 is a new doc containing "database" 50 times and D2 contains "SQL" 50 times.

LSI Ranking...
Given a query, either:
–Add the query as a document in the D-T matrix and redo the SVD, OR
–Convert the query vector q (separately) to the LSI space: since df_q * ff = q * tf, the product q * tf is the weighted query document in LSI space (the same coordinates as d-f * f-f for the documents)
–Reduce dimensionality as needed
Then do the vector-space similarity ranking in the LSI space (see the sketch below).
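A hedged sketch of folding a query into the LSI space and ranking by cosine similarity, following the df*ff document representation above (the matrix, query, and k are illustrative):

```python
import numpy as np

def lsi_rank(dt, q, k):
    """dt: doc-term matrix, q: query term vector, k: number of factors.
    Returns document indices ranked by cosine similarity in LSI space."""
    df, sigma, tf_T = np.linalg.svd(dt, full_matrices=False)
    docs_lsi = df[:, :k] * sigma[:k]          # document coordinates: d-f * f-f
    q_lsi = q @ tf_T[:k].T                    # query coordinates: q * tf (= df_q * ff)
    sims = (docs_lsi @ q_lsi) / (
        np.linalg.norm(docs_lsi, axis=1) * np.linalg.norm(q_lsi) + 1e-12)
    return np.argsort(-sims), sims

dt = np.array([[2., 1., 0., 0.],
               [1., 2., 0., 0.],
               [0., 0., 1., 2.],
               [0., 1., 2., 1.]])
q = np.array([1., 1., 0., 0.])                # query mentioning the first two terms
order, sims = lsi_rank(dt, q, k=2)
print(order, sims.round(3))
```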

Using LSI (added based on class discussion)
Can be used on the entire corpus:
–First compute the SVD of the entire corpus
–Store the first k columns of the df*ff matrix, [df*ff]_k
–Keep the tf matrix handy
–When a new query q comes, take the first k columns of q*tf
–Compute the vector similarity between [q*tf]_k and all rows of [df*ff]_k, rank the documents, and return
Can be used as a way of clustering the results returned by normal vector-space ranking:
–Take the top 50 or 100 documents returned by some ranking (e.g., vector ranking)
–Do LSI on these documents
–Take the first k columns of the resulting [df*ff] matrix
–Each row in this matrix is the representation of the original documents in the reduced space
–Cluster the documents in this reduced space (we will talk about clustering later)
–MANJARA did this; it needs fast SVD computation, and the MANJARA folks developed approximate algorithms for SVD

SVD Computation complexity
For an m x n matrix, SVD computation has O(k m^2 n + k' n^3) complexity, with k ~ 4 and k' ~ 22 for the best algorithms.
Approximate algorithms that exploit the sparsity of M are available (and being developed).

Bunch of Facts about SVD (added based on the discussion in class; optional)
Relation between SVD and eigen value decomposition:
–Eigen value decomposition is defined only for square matrices, and only symmetric matrices are guaranteed to have real-valued eigen values
–SVD is defined for all matrices
Given a matrix M, consider the eigen decomposition of the correlation matrices MM^T and M^T M. The SVD of M is (eigen vectors of MM^T) * (positive square roots of the eigen values of MM^T) * (eigen vectors of M^T M)^T.
–Both MM^T and M^T M are symmetric (they are correlation matrices), and they have the same non-zero eigen values
–Unless M is symmetric, MM^T and M^T M are different, so in general their eigen vectors are different (although their eigen values are the same)
–Since SVD is defined in terms of the eigen values and vectors of the "correlation matrices" of M, the singular values are always real valued (even if M is not symmetric)
In general, the SVD decomposition of a matrix M equals its eigen decomposition only if M is both square and symmetric (and, strictly speaking, has no negative eigen values).
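A quick NumPy check of these facts (the random matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 3))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Eigen values of MM^T and M^T M: same non-zero values, equal to s^2
ev_left = np.sort(np.linalg.eigvalsh(M @ M.T))[::-1]
ev_right = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]
print(np.allclose(ev_left[:3], ev_right))          # True: shared non-zero eigen values
print(np.allclose(np.sqrt(ev_right), s))           # True: singular values are their square roots
```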

Ignore beyond this slide (hidden)

Yet another Example
(figure: a 9-term x 8-document matrix over the terms controllability, observability, realization, feedback, controller, observer, transfer function, polynomial, matrices and the documents ch2 ... ch9, together with its factors U (9x7), S (7x7), and V (7x8); the numeric entries did not survive the transcript)
This happens to be a rank-7 matrix, so only 7 dimensions are required.
Singular values = square roots of the eigen values of AA^T.

(figure: the truncated factors DF2 (9x2), FF2 (2x2), TF2 (8x2), alongside the full DF (9x7), FF (7x7), TF (7x8); the numeric entries did not survive the transcript)
DF2 * FF2 * TF2^T will be a 9x8 matrix that approximates the original matrix.
Formally, this is the rank-k (k=2) matrix that is closest to M in the matrix-norm sense.

What should be the value of k?
(figure: reconstructions df_k * ff_k * tf_k^T of the full df_7 * ff_7 * tf_7^T for k=6 (one component ignored), k=4 (3 components ignored), and k=2 (5 components ignored); the numeric entries did not survive the transcript)

Coordinate transformation inherent in LSI
Doc rep: T-D = T-F * F-F * (D-F)^T
The mapping of keywords into LSI space is given by T-F * F-F.
(figure: for k=2, the (LSIx, LSIy) coordinates of controllability, observability, realization, feedback, controller, observer, transfer function, polynomial, matrices, and of the document ch3; the numeric entries did not survive the transcript)
The mapping of a doc d = [w1 ... wk] into LSI space is given by d' * T-F * (F-F)^-1: the base keywords of the doc are first mapped to LSI keywords and then differentially weighted by (F-F)^-1.

Querying
To query for "feedback controller", the query vector would be q = [0 0 0 1 1 0 0 0 0]' (' indicates transpose), since feedback and controller are the 4th and 5th terms in the index and no other terms are selected.
Let q be the query vector. Then the document-space vector corresponding to q is given by q' * TF(2) * inv(FF(2)) = Dq, i.e., the centroid of the terms in the query (with scaling by FF^-1).
To find the best document match, we compare the Dq vector against all the document vectors in the 2-dimensional V2 space. The document vector that is nearest in direction to Dq is the best match, ranked by the cosine between each document vector and Dq.
(the numeric values of Dq and of the eight cosines did not survive the transcript)

FF is a diagonal matrix, so its inverse is diagonal too. Diagonal matrices are symmetric.

Variations in the examples
DB-Regression example:
–Started with the D-T matrix
–Used the term axes as T-F, and the doc rep as D-F*F-F
–Q is converted into q'*T-F
Chapter/Medline etc. examples:
–Started with the T-D matrix
–Used the term axes as T-F*F-F and the doc rep as D-F
–Q is converted to q'*T-F*(F-F)^-1
We will stick to this convention.

Medline data from Berry’s paper

Within the .40 threshold; K is the number of singular values used.