1
Homework
Define a loss function that compares two matrices (say mean square error).
b = svd(bellcore)
b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
b3 = b$u[,1:3] %*% diag(b$d[1:3]) %*% t(b$v[,1:3])
More generally, for all possible r:
– Let b.r = b$u[,1:r] %*% diag(b$d[1:r]) %*% t(b$v[,1:r])
– Compute the loss between bellcore and b.r as a function of r
– Plot the loss as a function of r
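A minimal sketch of one way to do this (assuming the bellcore matrix defined on slide 8 and mean square error as the loss; the helper name mse is illustrative):
mse <- function(a, b) mean((a - b)^2)
b <- svd(bellcore)
r.max <- length(b$d)                     # maximum possible rank
loss <- sapply(1:r.max, function(r) {
  b.r <- b$u[, 1:r, drop = FALSE] %*% diag(b$d[1:r], nrow = r) %*% t(b$v[, 1:r, drop = FALSE])
  mse(bellcore, b.r)                     # loss between bellcore and its rank-r reconstruction
})
plot(1:r.max, loss, type = "b", xlab = "r", ylab = "loss")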
2
IR Models
Keywords (and Boolean combinations thereof)
Vector-Space "Model" (Salton, chap. 10.1)
– Represent the query and the documents as V-dimensional vectors
– Sort documents by their similarity to the query (e.g., the cosine between query and document vectors)
Probabilistic Retrieval Model (Salton, chap. 10.3)
– Sort documents by their estimated probability of relevance
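To make the vector-space ranking concrete, a minimal R sketch (assuming the bellcore term-by-document matrix defined on slide 8; the helper cos.sim and the toy query vector are illustrative, not from the original slides):
cos.sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
q <- as.numeric(rownames(bellcore) %in% c("human", "computer", "interface"))  # toy query vector over the 12 terms
sort(apply(bellcore, 2, cos.sim, a = q), decreasing = TRUE)                   # documents sorted by cosine with the query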
3
Information Retrieval and Web Search
Alternative IR models
Instructor: Rada Mihalcea
Some of the slides were adapted from a course taught at Cornell University by William Y. Arms
4
Latent Semantic Indexing
Objective: Replace indexes that use sets of index terms by indexes that use concepts.
Approach: Map the term vector space into a lower-dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.
5
Deficiencies with Conventional Automatic Indexing
Synonymy: Various words and phrases refer to the same concept (lowers recall).
Polysemy: Individual words have more than one meaning (lowers precision).
Independence: No significance is given to two terms that frequently appear together.
Latent semantic indexing addresses the first of these (synonymy) and the third (dependence).
6
Bellcore’s Example
http://en.wikipedia.org/wiki/Latent_semantic_analysis
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
7
Term by Document Matrix
8
"bellcore"<- structure(.Data = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),.Dim = c( 12, 9),.Dimnames = list(c("human", "interface", "computer", "user", "system", "response", "time", "EPS", "survey", "trees", "graph", "minors"), c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"))) help(dump) help(source)
9
Query Expansion
Query: Find documents relevant to human computer interaction
Simple Term Matching: Matches c1, c2, and c4; misses c3 and c5
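A minimal sketch of how the latent space recovers c3 and c5 (assuming the bellcore matrix and a rank-2 SVD, as on the later slides; the query-folding step q.hat = D^-1 U^T q is standard LSI practice, and the variable names are mine):
b <- svd(bellcore)
U2 <- b$u[, 1:2]; D2 <- diag(b$d[1:2]); V2 <- b$v[, 1:2]
q <- as.numeric(rownames(bellcore) %in% c("human", "computer"))  # "human computer interaction" hits only these two index terms
q.hat <- solve(D2) %*% t(U2) %*% q                               # fold the query into the 2-dimensional latent space
cos.sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
round(apply(t(V2), 2, cos.sim, a = as.vector(q.hat)), 2)         # each document's cosine with the folded query
With only two latent dimensions, all five c documents (including c3 and c5) should score well above the m documents.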
10
Large Correlations
11
Correlations: Too Large to Ignore
12
How to compute correlations

round(100 * cor(bellcore))
    c1  c2  c3  c4  c5  m1  m2  m3  m4
c1 100 -19   0   0 -33 -17 -26 -33 -33
c2 -19 100   0   0  58 -30 -45 -58 -19
c3   0   0 100  47   0 -21 -32 -41 -41
c4   0   0  47 100 -31 -16 -24 -31 -31
c5 -33  58   0 -31 100 -17 -26 -33 -33
m1 -17 -30 -21 -16 -17 100  67  52 -17
m2 -26 -45 -32 -24 -26  67 100  77  26
m3 -33 -58 -41 -31 -33  52  77 100  56
m4 -33 -19 -41 -31 -33 -17  26  56 100

round(100 * cor(t(bellcore)))
(rows and columns in order: human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors)
human      100  36  36 -38  43 -29 -29  36 -29 -38 -38 -29
interface   36 100  36  19   4 -29 -29  36 -29 -38 -38 -29
computer    36  36 100  19   4  36  36 -29  36 -38 -38 -29
user       -38  19  19 100  23  76  76  19  19 -50 -50 -38
system      43   4   4  23 100   4   4  82   4 -46 -46 -35
response   -29 -29  36  76   4 100 100 -29  36 -38 -38 -29
time       -29 -29  36  76   4 100 100 -29  36 -38 -38 -29
EPS         36  36 -29  19  82 -29 -29 100 -29 -38 -38 -29
survey     -29 -29  36  19   4  36  36 -29 100 -38  19  36
trees      -38 -38 -38 -50 -46 -38 -38 -38 -38 100  50  19
graph      -38 -38 -38 -50 -46 -38 -38 -38  19  50 100  76
minors     -29 -29 -29 -38 -35 -29 -29 -29  36  19  76 100
13
plot(hclust(as.dist(-cor(t(bellcore)))))   # cluster the terms (negative correlation as distance)
14
plot(hclust(as.dist(-cor(bellcore))))   # cluster the documents (negative correlation as distance)
15
Correcting for Large Correlations
16
Thesaurus
17
Term by Doc Matrix: Before & After Thesaurus
18
Singular Value Decomposition (SVD)
X = U D V^T
where X is t x d, U is t x m, D is m x m, and V^T is m x d
m is the rank of X, m <= min(t, d)
D is diagonal
– D^2 are the eigenvalues (sorted in descending order)
U^T U = I and V^T V = I
– Columns of U are eigenvectors of X X^T
– Columns of V are eigenvectors of X^T X
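These identities are easy to check numerically in R (a quick sketch using the bellcore matrix from slide 8):
b <- svd(bellcore)
max(abs(bellcore - b$u %*% diag(b$d) %*% t(b$v)))          # ~0: X = U D V^T
max(abs(t(b$u) %*% b$u - diag(ncol(b$u))))                 # ~0: U^T U = I
max(abs(t(b$v) %*% b$v - diag(ncol(b$v))))                 # ~0: V^T V = I
max(abs(eigen(t(bellcore) %*% bellcore)$values - b$d^2))   # ~0: D^2 are the eigenvalues of X^T X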
20
Dimensionality Reduction
X̂ = U D V^T (keeping only the top k singular values)
where X̂ is t x d, U is t x k, D is k x k, and V^T is k x d
k is the number of latent concepts (typically 300 ~ 500)
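A generic version of this truncation in R (a sketch; the function name reduce.rank is mine, and reduce.rank(bellcore, 2) is equivalent to the b2 computed on the next slide):
reduce.rank <- function(X, k) {
  s <- svd(X)
  s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k, drop = FALSE])
}
b2 <- reduce.rank(bellcore, 2)   # rank-2 approximation of the term-by-document matrix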
21
Dimension Reduction in R
b = svd(bellcore)
b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
dimnames(b2) = dimnames(bellcore)
par(mfrow=c(2,2))
plot(hclust(as.dist(-cor(bellcore))))
plot(hclust(as.dist(-cor(t(bellcore)))))
plot(hclust(as.dist(-cor(b2))))
plot(hclust(as.dist(-cor(t(b2)))))
23
SVD
B B^T = U D^2 U^T   (term x term)
B^T B = V D^2 V^T   (doc x doc)
U links terms to the latent concepts; V links documents to the latent concepts.
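Both identities can be checked directly in R (a quick sketch, with bellcore standing in for B):
b <- svd(bellcore)
max(abs(bellcore %*% t(bellcore) - b$u %*% diag(b$d^2) %*% t(b$u)))   # ~0: B B^T = U D^2 U^T
max(abs(t(bellcore) %*% bellcore - b$v %*% diag(b$d^2) %*% t(b$v)))   # ~0: B^T B = V D^2 V^T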
24
Dimension Reduction Block Structure

round(100 * cor(bellcore))
    c1  c2  c3  c4  c5  m1  m2  m3  m4
c1 100 -19   0   0 -33 -17 -26 -33 -33
c2 -19 100   0   0  58 -30 -45 -58 -19
c3   0   0 100  47   0 -21 -32 -41 -41
c4   0   0  47 100 -31 -16 -24 -31 -31
c5 -33  58   0 -31 100 -17 -26 -33 -33
m1 -17 -30 -21 -16 -17 100  67  52 -17
m2 -26 -45 -32 -24 -26  67 100  77  26
m3 -33 -58 -41 -31 -33  52  77 100  56
m4 -33 -19 -41 -31 -33 -17  26  56 100

round(100 * cor(b2))
    c1  c2  c3  c4  c5  m1  m2  m3  m4
c1 100  91 100 100  84 -86 -85 -85 -81
c2  91 100  91  88  99 -57 -56 -56 -50
c3 100  91 100 100  84 -86 -85 -85 -81
c4 100  88 100 100  81 -89 -88 -88 -84
c5  84  99  84  81 100 -44 -44 -43 -37
m1 -86 -57 -86 -89 -44 100 100 100 100
m2 -85 -56 -85 -88 -44 100 100 100 100
m3 -85 -56 -85 -88 -43 100 100 100 100
m4 -81 -50 -81 -84 -37 100 100 100 100
25
Dimension Reduction Block Structure

round(100 * cor(t(bellcore)))
(rows and columns in order: human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors)
human      100  36  36 -38  43 -29 -29  36 -29 -38 -38 -29
interface   36 100  36  19   4 -29 -29  36 -29 -38 -38 -29
computer    36  36 100  19   4  36  36 -29  36 -38 -38 -29
user       -38  19  19 100  23  76  76  19  19 -50 -50 -38
system      43   4   4  23 100   4   4  82   4 -46 -46 -35
response   -29 -29  36  76   4 100 100 -29  36 -38 -38 -29
time       -29 -29  36  76   4 100 100 -29  36 -38 -38 -29
EPS         36  36 -29  19  82 -29 -29 100 -29 -38 -38 -29
survey     -29 -29  36  19   4  36  36 -29 100 -38  19  36
trees      -38 -38 -38 -50 -46 -38 -38 -38 -38 100  50  19
graph      -38 -38 -38 -50 -46 -38 -38 -38  19  50 100  76
minors     -29 -29 -29 -38 -35 -29 -29 -29  36  19  76 100

round(100 * cor(t(b2)))
(rows and columns in the same order)
human      100 100  93  94  99  82  82 100 -12 -85 -84 -83
interface  100 100  95  96 100  85  85 100  -7 -82 -80 -80
computer    93  95 100 100  96  98  98  93  26 -59 -57 -56
user        94  96 100 100  97  97  97  94  23 -62 -60 -59
system      99 100  96  97 100  88  88 100  -2 -79 -78 -77
response    82  85  98  97  88 100 100  83  46 -40 -38 -37
time        82  85  98  97  88 100 100  83  46 -40 -38 -37
EPS        100 100  93  94 100  83  83 100 -11 -84 -83 -82
survey     -12  -7  26  23  -2  46  46 -11 100  63  65  66
trees      -85 -82 -59 -62 -79 -40 -40 -84  63 100 100 100
graph      -84 -80 -57 -60 -78 -38 -38 -83  65 100 100 100
minors     -83 -80 -56 -59 -77 -37 -37 -82  66 100 100 100
26
The term vector space
[Figure: document vectors d1 and d2 plotted on term axes t1, t2, t3]
The space has as many dimensions as there are terms in the word list.
27
Latent concept vector space
[Figure: term, document, and query vectors plotted in the latent concept space; cosine > 0.9]