Latent Semantic Indexing


1 Latent Semantic Indexing

2 Term-document matrices are very large, though most cells are “zeros”. But the number of topics that people talk about is small (in some sense): clothes, movies, politics, … Each topic can be represented as a cluster of (semantically) related terms, e.g. clothes = golf, jacket, shoe… Can we represent the term-document space by a lower-dimensional “latent” space (latent space = set of topics)?

3 Searching with latent topics. Given a collection of documents, LSI learns clusters of frequently co-occurring terms (e.g. information retrieval, ranking and web). If you query with ranking, information retrieval, LSI “automatically” extends the search to documents that also (and even ONLY) include web.

4 [Figure: a document base of 20 short documents, each shown as a bag of words, e.g. “Golf Car Topgear Petrol GTI”, “Golf Car Clarkson Petrol Badge”, “Golf Petrol Topgear Polo Red”, “Golf Tiger Woods Belfry Tee”, “Car Petrol Topgear GTI Polo”, plus 15 others about motorbikes, sailing, fish ponds, PCs, gardening, office work, etc.] Selection based on ‘Golf’: with the standard VSM, 4 documents are selected (those containing the word Golf).

5 [Figure: the same document base of 20 documents and the 4 documents selected by ‘Golf’.] The most relevant words associated with Golf in these docs are Car, Topgear and Petrol. Rank (wf·idf) of the words in the selected documents: Car 2 × (20/3) ≈ 13; Topgear 2 × (20/3) ≈ 13; Petrol 3 × (20/16) ≈ 4.
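
The weighting above can be reproduced with a minimal Python sketch. The formula is an assumption read off the slide (labelled wf·idf there): the count of each word in the documents selected by ‘Golf’, multiplied by N/df, with N = 20 documents.

```python
# A minimal sketch of the slide's wf·idf-style score (assumed formula: count * N / df).
N = 20
counts = {"Car": 2, "Topgear": 2, "Petrol": 3}   # occurrences in the 4 docs selected by 'Golf'
df = {"Car": 3, "Topgear": 3, "Petrol": 16}      # document frequencies in the whole base

for w in counts:
    print(w, round(counts[w] * N / df[w]))        # Car 13, Topgear 13, Petrol 4
```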

6 [Figure: the same 20-document base and the documents selected by ‘Golf’.] Selection based on ‘Golf’; rank of the selected docs (weight of co-occurrences): Car 2 × (20/3) ≈ 13; Topgear 2 × (20/3) ≈ 13; Petrol 3 × (20/16) ≈ 4. Since words are also weighted with respect to the idf, Car and Topgear turn out to be related to Golf more than Petrol.

7 [Figure: the 20-document base; selection based on ‘Golf’ next to a selection based on the semantic domain of Golf (Car, Wheel, Topgear, GTI, Polo, …).] Rank of the selected docs: Car 2 × (20/3) ≈ 13; Topgear 2 × (20/3) ≈ 13; Petrol 3 × (20/16) ≈ 4. We now search with all the words in the “semantic domain” of Golf. The list of retrieved docs now includes one new document and no longer includes a previously retrieved document.

8 [Figure: the 20-document base; selection based on ‘Golf’ next to the selection based on the semantic domain of Golf, with the new ranks shown in the slide.] Rank of the selected docs: Car 2 × (20/3) ≈ 13; Topgear 2 × (20/3) ≈ 13; Petrol 3 × (20/16) ≈ 4. The co-occurrence-based ranking improves the performance. Note that one of the most relevant docs does NOT include the word Golf, and a doc with a “spurious” sense of Golf disappears.

9 Linear Algebra Background. The LSI method: how to detect “topics”.

10 Eigenvalues & Eigenvectors. For a square m × m matrix S, a (right) eigenvector v ≠ 0 with eigenvalue λ satisfies Sv = λv. How many eigenvalues are there, at most? The equation Sv = λv, i.e. (S − λI)v = 0, only has a non-zero solution v if det(S − λI) = 0. This is an m-th order equation in λ, which can have at most m distinct solutions (the roots of the characteristic polynomial); they can be complex even though S is real.

11 Example of eigenvectors/eigenvalues. Def: a vector v ∈ R^m, v ≠ 0, is an eigenvector of an m × m matrix S with corresponding eigenvalue λ if: Sv = λv.

12 Example of computation [the matrix S and the algebra are shown in the slide]. Setting the characteristic polynomial det(S − λI) to zero and solving for λ gives the eigenvalues; remember that 2 and 4 are the eigenvalues of S. Notice that we computed only the DIRECTION of the eigenvectors.
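
A minimal sketch of this computation in Python/NumPy: the matrix S below is a hypothetical stand-in (the slide’s matrix is only in the figure), chosen so that its eigenvalues are 2 and 4 as above.

```python
import numpy as np

# Hypothetical symmetric matrix, not the one in the slide; its eigenvalues are 2 and 4.
S = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# Eigenvalues = roots of the characteristic polynomial det(S - lam*I) = 0
eigenvalues, eigenvectors = np.linalg.eig(S)
print(eigenvalues)                     # 4 and 2 (order may vary)

# Check the definition S v = lam v; only the DIRECTION of v matters,
# NumPy returns unit-norm representatives.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(S @ v, lam * v)
```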

13 Matrix-vector multiplication. Eigenvectors define an orthonormal space O; two eigenvectors of different eigenvalues of a symmetric matrix are orthogonal. Therefore, any generic vector x can be viewed as a combination of the eigenvectors, e.g. x = 2v1 + 4v2 + 6v3, where 2, 4 and 6 are the coordinates of x in the orthonormal space O. Notice that there is not “the” set of eigenvectors of a matrix, but rather an orthonormal basis (an eigenspace).

14 Example of two bidimensional orthonormal spaces [shown in the slide figure].

15 Difference between orthonormal and orthogonal? Orthogonal means that the dot product is null. Orthonormal means that the dot product is null and the norms are equal to 1. If two or more vectors are orthonormal they are also orthogonal, but the converse is not true.

16 Matrix-vector multiplication. Thus a matrix-vector multiplication such as Sx (S and x as in the previous slides) can be rewritten in terms of the eigenvalues/eigenvectors of S: Sx = S(2v1 + 4v2 + 6v3) = 2Sv1 + 4Sv2 + 6Sv3 = 2λ1v1 + 4λ2v2 + 6λ3v3. Even though x is an arbitrary vector, the action of S on x (the transformation) is determined by the eigenvalues/eigenvectors. Why?
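
A small NumPy sketch of this point, using a hypothetical symmetric 3×3 matrix (not the one from the slides): the action of S on x is fully determined by how x decomposes over the eigenvectors.

```python
import numpy as np

# Hypothetical symmetric 3x3 matrix; symmetric => orthonormal eigenvectors.
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
S = (B + B.T) / 2

lam, V = np.linalg.eigh(S)             # columns of V are orthonormal eigenvectors

# x with coordinates (2, 4, 6) in the eigenbasis, as on the previous slide
x = 2 * V[:, 0] + 4 * V[:, 1] + 6 * V[:, 2]

# S x = 2*lam1*v1 + 4*lam2*v2 + 6*lam3*v3
rhs = 2 * lam[0] * V[:, 0] + 4 * lam[1] * V[:, 1] + 6 * lam[2] * V[:, 2]
assert np.allclose(S @ x, rhs)
```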

17 Matrix-vector multiplication. Matrix multiplication by a vector = a linear transformation of the initial vector.

18 Geometric explanation. Multiplying a matrix by a vector has two effects on the vector: rotation (the coordinates of the vector change) and scaling (the length changes). The maximum compression and rotation depend on the matrix’s eigenvalues λi (s1, s2 and s3 in the picture; x is a generic vector, the v are the eigenvectors).

19 Geometric explanation. The eigenvalues describe the distortion operated by the matrix on the original vector; in this distortion, the largest effect is played by the biggest eigenvalues (s1 and s2 in the picture).

20 Geometric explanation. In this linear transformation, the image of “La Gioconda” is modified but the central vertical axis stays the same. The blue vector has changed, the red one has not; hence the red one is an eigenvector of the linear transformation. Furthermore, the red vector has been neither enlarged, shortened, nor inverted, therefore its eigenvalue is 1: Sv = v (the transformation shifts the image pixels along the blue direction, while the red direction is left unchanged).

21 Eigen/diagonal decomposition. Let S be a square matrix with m orthogonal eigenvectors. Theorem: there exists an eigen decomposition S = UΛU^-1, with Λ diagonal (cf. the matrix diagonalization theorem). The columns of U are the eigenvectors of S; the diagonal elements of Λ are the eigenvalues of S. The decomposition is unique for distinct eigenvalues.

22 Diagonal decomposition: why/how. Let U have the eigenvectors as columns: U = [v1 … vm]. Then SU can be written SU = [Sv1 … Svm] = [λ1v1 … λmvm] = UΛ. Thus SU = UΛ, or U^-1 S U = Λ, and S = UΛU^-1.

23 Example [the matrix and the intermediate algebra are shown in the slide]. From one eigenvalue we get v21 = 0 and v22 any real; from the other we get v11 = −2·v12.

24 Diagonal decomposition, example 2. Recall the matrix S of the slide. Its eigenvectors form the columns of U; inverting, we have U^-1. Then S = UΛU^-1 (recall that UU^-1 = I).
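
A quick NumPy check of the decomposition, on a small hypothetical matrix (the actual numbers of the slide’s example are only in the figures):

```python
import numpy as np

# Hypothetical 2x2 symmetric matrix, not necessarily the slide's.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, U = np.linalg.eig(S)              # columns of U: eigenvectors; lam: eigenvalues
Lambda = np.diag(lam)

assert np.allclose(S @ U, U @ Lambda)                 # S U = U Lambda
assert np.allclose(S, U @ Lambda @ np.linalg.inv(U))  # S = U Lambda U^-1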

25 So what? What do these matrices have to do with IR? Recall the M × N term-document matrices… But everything so far needs square matrices, so you need to be patient and learn one last notion.

26 Singular Value Decomposition for non-square matrices. For an M × N matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows: A = UΣV^T, where U is M × M, Σ is M × N and V is N × N. The columns of U are orthogonal eigenvectors of AA^T (left singular vectors). The columns of V are orthogonal eigenvectors of A^T A (right singular vectors). The eigenvalues λ1 … λr of AA^T are equal to the eigenvalues of A^T A, and the singular values on the diagonal of Σ are σi = √λi.
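
A minimal NumPy sketch of the SVD of a non-square matrix; A is a small hypothetical term-document matrix, not the one used in the slides.

```python
import numpy as np

# Hypothetical M x N term-document matrix (M = 3 terms, N = 4 documents).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=True)
# U (M x M): left singular vectors = eigenvectors of A A^T
# Vt (N x N): rows are right singular vectors = eigenvectors of A^T A
# s: singular values, sigma_i = sqrt(lambda_i)

lambdas = np.linalg.eigvalsh(A @ A.T)[::-1]   # eigenvalues of A A^T, descending
assert np.allclose(s ** 2, lambdas)
```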

27 An example [the numerical SVD is shown in the slide figure].

28 Singular Value Decomposition Illustration of SVD dimensions and sparseness

29 Reduced SVD. If we retain only k singular values, and set the rest to 0, then we don’t need the matrix parts shown in red in the slide. Then Σ is k × k, U is M × k, V^T is k × N, and A_k is M × N. This is referred to as the reduced SVD. It is the convenient (space-saving) and usual form for computational applications. http://www.bluebit.gr/matrix-calculator/calculate.aspx

30 Let’s recap. Since the yellow part (in the slide figure) is zero, an exact representation of A is A = U_r Σ_r V_r^T, whether M < N or M > N. But for some k < r, a good approximation is A ≈ A_k = U_k Σ_k V_k^T.
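
A NumPy sketch of the reduced (rank-k) form above, on a hypothetical matrix:

```python
import numpy as np

# Hypothetical term-document matrix.
A = np.random.default_rng(1).integers(0, 3, size=(6, 5)).astype(float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # "thin" SVD

k = 2
U_k  = U[:, :k]            # M x k
S_k  = np.diag(s[:k])      # k x k
Vt_k = Vt[:k, :]           # k x N

A_k = U_k @ S_k @ Vt_k     # still M x N, but of rank k
print(A_k.shape, np.linalg.matrix_rank(A_k))
```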

31 Example of rank approximation [the slide shows a matrix, its SVD factors and the rank-reduced product A*: the approximated entries (e.g. 0.981, 1.985, 3.000, 4.000) are close to, but not exactly equal to, the original values].

32 Approximation error. How good (bad) is this approximation? It’s the best possible, as measured by the Frobenius norm of the error: the minimum of ‖A − X‖_F over the rank-k matrices X is ‖A − A_k‖_F = √(σ_(k+1)² + … + σ_r²), where the σi are ordered such that σi ≥ σi+1. This suggests why the Frobenius error drops as k increases.
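
This bound is easy to check numerically; a short NumPy sketch with a hypothetical matrix:

```python
import numpy as np

A = np.random.default_rng(2).normal(size=(8, 6))     # hypothetical matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in range(1, len(s)):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    err = np.linalg.norm(A - A_k, 'fro')
    # Frobenius error of the best rank-k approximation:
    # sqrt(sigma_{k+1}^2 + ... + sigma_r^2), which shrinks as k grows.
    assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```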

33 SVD low-rank approximation. Whereas the term-doc matrix A may have M = 50000, N = 10 million (and rank close to 50000), we can construct an approximation A_100 with rank 100!!! Of all rank-100 matrices, it would have the lowest Frobenius error. C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.

34 Images give a better intuition.

35 K=10

36 K=20

37 K=30

38 K=100

39 K=322 (the original image)

40 We save space!! But this is only one issue.
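
A sketch of the image experiment in NumPy, with a synthetic grayscale stand-in (the original 322-rank picture is not available here); it also shows the space saving of the rank-k form:

```python
import numpy as np

img = np.random.default_rng(3).random((322, 400))    # stand-in grayscale image
U, s, Vt = np.linalg.svd(img, full_matrices=False)

for k in (10, 20, 30, 100, 322):
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k version of the image
    stored = k * (img.shape[0] + img.shape[1] + 1)    # numbers kept by the rank-k form
    print(k, stored, img.size)                        # e.g. k=100 keeps ~56% of the numbers
```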

41 But what about texts and IR? (finally!!) Latent Semantic Indexing via the SVD. If A is a term-document matrix, then AA^T and A^T A are the matrices of term and document co-occurrences, respectively.

42 Example 2

43 Term-document matrix A

44 Co-occurrences of terms in documents. L_ij is the number of documents in which words wi and wj co-occur; L^T_ij is the number of co-occurring words in docs di and dj (or vice versa, if A is a document-term matrix). Each entry sums, over the documents k, the product “word i in doc k” × “word j in doc k”.

45 Term co-occurrences example: L_trees,graph = (0 0 0 0 0 1 1 1 0) · (0 0 0 0 0 0 1 1 1)^T = 2.

46 A numerical example [the 3 × 3 term-document matrix A (terms w1–w3, docs d1–d3) and the product AA^T are shown in the slide]. AA^T is a term × term matrix with entries w_ij; since w_ij = w_ji, the matrix is symmetrical.
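
The same computation in a few lines of NumPy, with a hypothetical 0/1 incidence matrix (the slide’s numbers are only in the figure):

```python
import numpy as np

# Hypothetical 0/1 term-document matrix: rows = terms w1..w3, columns = docs d1..d3.
A = np.array([[1, 1, 0],
              [0, 1, 0],
              [1, 0, 1]])

term_cooc = A @ A.T    # (i, j): number of documents in which wi and wj co-occur
doc_cooc  = A.T @ A    # (i, j): number of words shared by documents di and dj

print(term_cooc)       # symmetric, since w_ij = w_ji
print(doc_cooc)
```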

47 LSI: the intuition [figure: three axes t1, t2, t3]. Projecting A in the term space, the green, yellow and red vectors are documents. The black vectors are the unit eigenvectors of A: they represent the main “directions” of the document vectors. The blue segments give the intuition of the eigenvalues of A: bigger values are those for which the projection of all the document vectors on the direction of the corresponding eigenvector is larger.

48 LSI intuition [figure: axes t1, t2, t3]. We now project our document vectors on the reference orthonormal system represented by the 3 black vectors.

49 LSI intuition. …and we remove the dimension(s) with the lower eigenvalues. Now the two remaining axes represent a combination of words, i.e. a latent semantic space.

50 Example. Term-document matrix A (terms t1–t4 on the rows, docs d1–d4 on the columns):
1.000 1.000 0.000 0.000
1.000 0.000 0.000 0.000
0.000 0.000 1.000 1.000
0.000 0.000 1.000 1.000
Its SVD A = U Σ V^T is
U =
0.000 -0.851 -0.526 0.000
0.000 -0.526 0.851 0.000
-0.707 0.000 0.000 -0.707
-0.707 0.000 0.000 0.707
Σ = diag(2.000, 1.618, 0.618, 0.000)
V^T =
0.000 0.000 -0.707 -0.707
-0.851 -0.526 0.000 0.000
0.526 -0.851 0.000 0.000
0.000 0.000 -0.707 0.707
We project terms and docs on two dimensions, v1 and v2 (the principal eigenvectors). We obtain two latent semantic coordinates: s1: (t1, t2) and s2: (t3, t4). The approximated new term-doc matrix is
1.172 0.724 0.000 0.000
0.724 0.448 0.000 0.000
0.000 0.000 1.000 1.000
0.000 0.000 1.000 1.000
Even if t2 does not occur in d2, now if we query with t2 the system will also return d2!!

51 Co-occurrence space. In principle, the space of document or term co-occurrences is much (much!) larger than the original space of terms!! But with the SVD we consider only the most relevant dimensions, through rank reduction.

52 LSI: what are the steps. From the term-doc matrix A, compute the approximation A_k with the SVD. Project docs, terms and queries in a space of k << r dimensions (the k “surviving” eigenvectors) and compute cos-similarity as usual. These dimensions are not the original axes, but those defined by the orthonormal space of the reduced matrix A_k. The new coordinates of a query q in the orthonormal space of A_k are σi·qi (i = 1, 2, …, k << r).

53 Projecting terms, documents and queries in the LS space. If A = UΣV^T we also have that V = A^T U Σ^-1. A term (a row t of A) can be written as t = t̂^T Σ V^T; a document is projected as d̂ = d^T U Σ^-1 and a query as q̂ = q^T U Σ^-1. After the rank-k approximation A ≅ A_k = U_k Σ_k V_k^T, the projections become d_k ≅ d^T U_k Σ_k^-1 and q_k ≅ q^T U_k Σ_k^-1, and sim(q, d) = sim(q^T U_k Σ_k^-1, d^T U_k Σ_k^-1).
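
Putting the formulas above together, here is a compact NumPy sketch of the whole LSI query pipeline. It uses the term-document matrix from the slide-50 example; the query and helper names are hypothetical.

```python
import numpy as np

# Term-document matrix from the slide-50 example: terms t1..t4 (rows), docs d1..d4 (columns).
A = np.array([[1., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k_inv = U[:, :k], np.diag(1.0 / s[:k])

def to_latent(x):
    """Project a term-space vector into the k-dim latent space: x^T U_k Sigma_k^-1."""
    return x @ U_k @ S_k_inv

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

docs_k = [to_latent(A[:, j]) for j in range(A.shape[1])]   # d_k = d^T U_k Sigma_k^-1
q_k = to_latent(np.array([0., 1., 0., 0.]))                # query containing only t2

print([round(cos(q_k, d), 3) for d in docs_k])
# ~ [1.0, 1.0, 0.0, 0.0]: d2 ranks as high as d1 even though t2 does not occur in d2.
```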

54 Consider a term-doc matrix M × N (M = 11, N = 3) and a query q [both shown in the slide figure].

55 1. Compute the SVD: A = U Σ V^T.

56 2. Obtain a low-rank approximation (k = 2): A_k = U_k Σ_k V_k^T.

57 3a. Compute doc/query similarity. For N documents, A_k has N columns, each representing the coordinates of a document d_i projected in the k LSI dimensions. A query is treated like a document, and is projected in the LSI space.

58 3c. Compute the query vector: q_k = q^T U_k Σ_k^-1. q is projected in the 2-dimensional LSI space!

59 Documents and queries projected in the LSI space

60 q/d similarity

61 An overview of a semantic network of terms based on the top 100 most significant latent semantic dimensions (Zhu & Chen).

62 Another LSI Example

63

64 Term co-occurrences [figure: the document-term matrix A^T (docs d1–d3 × terms t1–t4) and the term co-occurrence matrix AA^T (t1–t4 × t1–t4)].

65 A_k = U_k Σ_k V_k^T ≈ A. Now it is as if t3 belongs to d1!

