Presentation is loading. Please wait.

Presentation is loading. Please wait.

Homework Define a loss function that compares two matrices (say mean square error) b = svd(bellcore) b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])

Similar presentations


Presentation on theme: "Homework Define a loss function that compares two matrices (say mean square error) b = svd(bellcore) b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])"— Presentation transcript:

1 Homework Define a loss function that compares two matrices (say mean square error) b = svd(bellcore) b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2]) b3 = b$u[,1:3] %*% diag(b$d[1:3]) %*% t(b$v[,1:3]) More generally, for all possible r – Let b.r = b$u[,1:r] %*% diag(b$d[1:r]) %*% t(b$v[,1:r]) Compute the loss between bellcore and b.r as a function of r Plot the loss as a function of r

2 IR Models Keywords (and Boolean combinations thereof) Vector-Space ‘‘Model’’ (Salton, chap 10.1) – Represent the query and the documents as V- dimensional vectors – Sort vectors by Probabilistic Retrieval Model – (Salton, chap 10.3) – Sort documents by

3 Information Retrieval and Web Search Alternative IR models Instructor: Rada Mihalcea Some of the slides were adopted from a course tought at Cornell University by William Y. Arms

4 Latent Semantic Indexing Objective Replace indexes that use sets of index terms by indexes that use concepts. Approach Map the term vector space into a lower dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.

5 Deficiencies with Conventional Automatic Indexing Synonymy: Various words and phrases refer to the same concept (lowers recall). Polysemy: Individual words have more than one meaning (lowers precision) Independence: No significance is given to two terms that frequently appear together Latent semantic indexing addresses the first of these (synonymy), and the third (dependence)

6 Bellcore’s Example http://en.wikipedia.org/wiki/Latent_semantic_analysis http://en.wikipedia.org/wiki/Latent_semantic_analysis c1Human machine interface for Lab ABC computer applications c2A survey of user opinion of computer system response time c3The EPS user interface management system c4System and human system engineering testing of EPS c5Relation of user-perceived response time to error measurement m1The generation of random, binary, unordered trees m2The intersection graph of paths in trees m3Graph minors IV: Widths of trees and well-quasi-ordering m4Graph minors: A survey

7 Term by Document Matrix

8 "bellcore"<- structure(.Data = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),.Dim = c( 12, 9),.Dimnames = list(c("human", "interface", "computer", "user", "system", "response", "time", "EPS", "survey", "trees", "graph", "minors"), c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"))) help(dump) help(source)

9 Query Expansion Query: Find documents relevant to human computer interaction Simple Term Matching: Matches c1, c2, and c4 Misses c3 and c5

10 Large Correl- ations

11 Correlations: Too Large to Ignore

12 How to compute correlations round(100 * cor(bellcore)) c1 c2 c3 c4 c5 m1 m2 m3 m4 c1 100 -19 0 0 -33 -17 -26 -33 -33 c2 -19 100 0 0 58 -30 -45 -58 -19 c3 0 0 100 47 0 -21 -32 -41 -41 c4 0 0 47 100 -31 -16 -24 -31 -31 c5 -33 58 0 -31 100 -17 -26 -33 -33 m1 -17 -30 -21 -16 -17 100 67 52 -17 m2 -26 -45 -32 -24 -26 67 100 77 26 m3 -33 -58 -41 -31 -33 52 77 100 56 m4 -33 -19 -41 -31 -33 -17 26 56 100 round(100 * cor(t(bellcore))) human interface computer user system response time EPS survey trees graph minors human 100 36 36 -38 43 -29 -29 36 -29 -38 -38 -29 interface 36 100 36 19 4 -29 -29 36 -29 -38 -38 -29 computer 36 36 100 19 4 36 36 -29 36 -38 -38 -29 user -38 19 19 100 23 76 76 19 19 -50 -50 -38 system 43 4 4 23 100 4 4 82 4 -46 -46 -35 response -29 -29 36 76 4 100 100 -29 36 -38 -38 -29 time -29 -29 36 76 4 100 100 -29 36 -38 -38 -29 EPS 36 36 -29 19 82 -29 -29 100 -29 -38 -38 -29 survey -29 -29 36 19 4 36 36 -29 100 -38 19 36 trees -38 -38 -38 -50 -46 -38 -38 -38 -38 100 50 19 graph -38 -38 -38 -50 -46 -38 -38 -38 19 50 100 76 minors -29 -29 -29 -38 -35 -29 -29 -29 36 19 76 100

13 plot(hclust(as.dist(-cor(t(bellcore)))))

14 plot(hclust(as.dist(-cor(bellcore))))

15 Correcting for Large Correlations

16 Thesaurus

17 Term by Doc Matrix: Before & After Thesaurus

18 Singular Value Decomposition (SVD) X = UDV T X= U VTVT D t x dt x mm x dm x m m is the rank of X < min(t, d) D is diagonal – D 2 are eigenvalues (sorted in descending order) U U T = I and V V T = I – Columns of U are eigenvectors of X X T – Columns of V are eigenvectors of X T X

19 m is the rank of X < min(t, d) D is diagonal – D 2 are eigenvalues (sorted in descending order) U U T = I and V V T = I – Columns of U are eigenvectors of X X T – Columns of V are eigenvectors of X T X

20 Dimensionality Reduction X = t x dt x kk x dk x k k is the number of latent concepts (typically 300 ~ 500) U DVTVT ^

21 Dimension Reduction in R b = svd(bellcore) b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2]) dimnames(b2) = dimnames(bellcore) par(mfrow=c(2,2)) plot(hclust(as.dist(-cor(bellcore)))) plot(hclust(as.dist(-cor(t(bellcore))))) plot(hclust(as.dist(-cor(b2)))) plot(hclust(as.dist(-cor(t(b2)))))

22

23 SVD B B T = U D 2 U T B T B = V D 2 V T Latent Term Doc

24 Dimension Reduction  Block Structure round(100*cor(bellcore)) c1 c2 c3 c4 c5 m1 m2 m3 m4 c1 100 -19 0 0 -33 -17 -26 -33 -33 c2 -19 100 0 0 58 -30 -45 -58 -19 c3 0 0 100 47 0 -21 -32 -41 -41 c4 0 0 47 100 -31 -16 -24 -31 -31 c5 -33 58 0 -31 100 -17 -26 -33 -33 m1 -17 -30 -21 -16 -17 100 67 52 -17 m2 -26 -45 -32 -24 -26 67 100 77 26 m3 -33 -58 -41 -31 -33 52 77 100 56 m4 -33 -19 -41 -31 -33 -17 26 56 100 > round(100*cor(b2)) c1 c2 c3 c4 c5 m1 m2 m3 m4 c1 100 91 100 100 84 -86 -85 -85 -81 c2 91 100 91 88 99 -57 -56 -56 -50 c3 100 91 100 100 84 -86 -85 -85 -81 c4 100 88 100 100 81 -89 -88 -88 -84 c5 84 99 84 81 100 -44 -44 -43 -37 m1 -86 -57 -86 -89 -44 100 100 100 100 m2 -85 -56 -85 -88 -44 100 100 100 100 m3 -85 -56 -85 -88 -43 100 100 100 100 m4 -81 -50 -81 -84 -37 100 100 100 100

25 Dimension Reduction  Block Structure round(100*cor(t(bellcore))) human interface computer user system response time EPS survey trees graph minors human 100 36 36 -38 43 -29 -29 36 -29 -38 -38 -29 interface 36 100 36 19 4 -29 -29 36 -29 -38 -38 -29 computer 36 36 100 19 4 36 36 -29 36 -38 -38 -29 user -38 19 19 100 23 76 76 19 19 -50 -50 -38 system 43 4 4 23 100 4 4 82 4 -46 -46 -35 response -29 -29 36 76 4 100 100 -29 36 -38 -38 -29 time -29 -29 36 76 4 100 100 -29 36 -38 -38 -29 EPS 36 36 -29 19 82 -29 -29 100 -29 -38 -38 -29 survey -29 -29 36 19 4 36 36 -29 100 -38 19 36 trees -38 -38 -38 -50 -46 -38 -38 -38 -38 100 50 19 graph -38 -38 -38 -50 -46 -38 -38 -38 19 50 100 76 minors -29 -29 -29 -38 -35 -29 -29 -29 36 19 76 100 > round(100*cor(t(b2))) human interface computer user system response time EPS survey trees graph minors human 100 100 93 94 99 82 82 100 -12 -85 -84 -83 interface 100 100 95 96 100 85 85 100 -7 -82 -80 -80 computer 93 95 100 100 96 98 98 93 26 -59 -57 -56 user 94 96 100 100 97 97 97 94 23 -62 -60 -59 system 99 100 96 97 100 88 88 100 -2 -79 -78 -77 response 82 85 98 97 88 100 100 83 46 -40 -38 -37 time 82 85 98 97 88 100 100 83 46 -40 -38 -37 EPS 100 100 93 94 100 83 83 100 -11 -84 -83 -82 survey -12 -7 26 23 -2 46 46 -11 100 63 65 66 trees -85 -82 -59 -62 -79 -40 -40 -84 63 100 100 100 graph -84 -80 -57 -60 -78 -38 -38 -83 65 100 100 100 minors -83 -80 -56 -59 -77 -37 -37 -82 66 100 100 100

26 t1t1 t2t2 t3t3 d1d1 d2d2  The space has as many dimensions as there are terms in the word list. The term vector space

27 term document query --- cosine > 0.9 Latent concept vector space


Download ppt "Homework Define a loss function that compares two matrices (say mean square error) b = svd(bellcore) b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])"

Similar presentations


Ads by Google