Homework
Define a loss function that compares two matrices (say, mean squared error)
b = svd(bellcore)
b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
b3 = b$u[,1:3] %*% diag(b$d[1:3]) %*% t(b$v[,1:3])
More generally, for all possible r:
– Let b.r = b$u[,1:r] %*% diag(b$d[1:r]) %*% t(b$v[,1:r])
– Compute the loss between bellcore and b.r as a function of r
– Plot the loss as a function of r
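A minimal sketch of one way to do this in R, assuming mean squared error as the loss; the names mse, r.max, loss, and b.r are illustrative, and bellcore is the term-by-document matrix defined later in these slides:
b = svd(bellcore)
mse = function(x, y) mean((x - y)^2)          # loss between two matrices
r.max = length(b$d)                           # at most min(12, 9) = 9 for bellcore
loss = sapply(1:r.max, function(r) {
  # diag(b$d[1:r], r, r) keeps a 1 x 1 matrix when r = 1 (plain diag() would not)
  b.r = b$u[, 1:r, drop = FALSE] %*% diag(b$d[1:r], r, r) %*% t(b$v[, 1:r, drop = FALSE])
  mse(bellcore, b.r)
})
plot(1:r.max, loss, type = "b", xlab = "r", ylab = "MSE loss")
The curve should drop to (numerically) zero once r reaches the rank of bellcore.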
IR Models
Keywords (and Boolean combinations thereof)
Vector-Space "Model" (Salton, chap 10.1)
– Represent the query and the documents as V-dimensional vectors
– Sort document vectors by similarity to the query vector (e.g., cosine)
Probabilistic Retrieval Model (Salton, chap 10.3)
– Sort documents by probability of relevance
Information Retrieval and Web Search
Alternative IR Models
Instructor: Rada Mihalcea
Some of the slides were adapted from a course taught at Cornell University by William Y. Arms
Latent Semantic Indexing
Objective: Replace indexes that use sets of index terms by indexes that use concepts.
Approach: Map the term vector space into a lower-dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.
Deficiencies with Conventional Automatic Indexing
Synonymy: Various words and phrases refer to the same concept (lowers recall).
Polysemy: Individual words have more than one meaning (lowers precision).
Independence: No significance is given to two terms that frequently appear together.
Latent semantic indexing addresses the first of these (synonymy) and the third (dependence).
Bellcore’s Example
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Term by Document Matrix
"bellcore" <- structure(
  .Data = c(
    1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
    0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
    0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
    1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
    0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
    0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
    0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
    0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
    0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),  # m4
  .Dim = c(12, 9),
  .Dimnames = list(
    c("human", "interface", "computer", "user", "system", "response",
      "time", "EPS", "survey", "trees", "graph", "minors"),
    c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")))
help(dump)
help(source)
Query Expansion
Query: Find documents relevant to human computer interaction
Simple Term Matching:
– Matches c1, c2, and c4
– Misses c3 and c5
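A small illustrative sketch (not from the original slides) of what simple term matching does here, using the bellcore matrix defined above; query.terms and matches are made-up names:
query.terms = c("human", "computer")             # query words that appear in the index ("interaction" does not)
matches = colSums(bellcore[query.terms, ]) > 0   # documents containing at least one query term
colnames(bellcore)[matches]                      # "c1" "c2" "c4" -- c3 and c5 are missed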
Large Correlations
Correlations: Too Large to Ignore
How to compute correlations
round(100 * cor(bellcore))      # 9 x 9 document-document correlations (c1 ... m4)
round(100 * cor(t(bellcore)))   # 12 x 12 term-term correlations (human ... minors)
(The numeric tables shown on the slide are not reproduced here.)
plot(hclust(as.dist(-cor(t(bellcore)))))   # cluster the 12 terms by their correlations
plot(hclust(as.dist(-cor(bellcore))))      # cluster the 9 documents by their correlations
Correcting for Large Correlations
Thesaurus
Term by Doc Matrix: Before & After Thesaurus
Singular Value Decomposition (SVD)
X = U D V^T
where X is t x d, U is t x m, D is m x m, and V^T is m x d
m is the rank of X, m <= min(t, d)
D is diagonal
– the entries of D^2 are eigenvalues (sorted in descending order)
U^T U = I and V^T V = I
– Columns of U are eigenvectors of X X^T
– Columns of V are eigenvectors of X^T X
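A quick sketch for checking these properties numerically in R on the bellcore matrix (the max-abs values should be zero up to floating-point rounding):
b = svd(bellcore)                                  # b$u is 12 x 9, b$d has length 9, b$v is 9 x 9
all(diff(b$d) <= 0)                                # singular values are sorted in descending order
max(abs(t(b$u) %*% b$u - diag(9)))                 # t(U) U = I
max(abs(t(b$v) %*% b$v - diag(9)))                 # t(V) V = I
max(abs(b$u %*% diag(b$d) %*% t(b$v) - bellcore))  # X = U D V^T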
Dimensionality Reduction
X̂ = U_k D_k V_k^T
where X̂ (the rank-k approximation of X) is t x d, U_k is t x k, D_k is k x k, and V_k^T is k x d
k is the number of latent concepts (typically 300 ~ 500)
Dimension Reduction in R
b = svd(bellcore)
# rank-2 reconstruction of the term-by-document matrix
b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
dimnames(b2) = dimnames(bellcore)
# compare document and term clusterings before and after the reduction
par(mfrow=c(2,2))
plot(hclust(as.dist(-cor(bellcore))))
plot(hclust(as.dist(-cor(t(bellcore)))))
plot(hclust(as.dist(-cor(b2))))
plot(hclust(as.dist(-cor(t(b2)))))
SVD
B B^T = U D^2 U^T   (term-term similarities, through the latent concepts)
B^T B = V D^2 V^T   (document-document similarities, through the latent concepts)
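A hedged sketch verifying these two identities in R, with B taken to be the bellcore matrix:
b = svd(bellcore)
max(abs(bellcore %*% t(bellcore) - b$u %*% diag(b$d^2) %*% t(b$u)))  # B B^T = U D^2 U^T, ~0
max(abs(t(bellcore) %*% bellcore - b$v %*% diag(b$d^2) %*% t(b$v)))  # B^T B = V D^2 V^T, ~0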
Dimension Reduction Block Structure
round(100 * cor(bellcore))   # document-document correlations, original matrix
round(100 * cor(b2))         # document-document correlations, rank-2 reconstruction
(The two 9 x 9 tables shown on the slide are not reproduced; after reduction the c documents and the m documents separate into two blocks.)
Dimension Reduction Block Structure
round(100 * cor(t(bellcore)))   # term-term correlations, original matrix
round(100 * cor(t(b2)))         # term-term correlations, rank-2 reconstruction
(The two 12 x 12 tables shown on the slide are not reproduced; after reduction the c-document terms and the m-document terms separate into two blocks.)
The term vector space
(Figure: axes t1, t2, t3 with document vectors d1 and d2.)
The space has as many dimensions as there are terms in the word list.
Latent concept vector space
(Figure: terms, documents, and the query plotted in the same latent concept space; closeness is measured by cosine, e.g., cosine > 0.9.)
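As an illustration, a sketch of the standard LSI query-folding step (not code from the slides; q, q.hat, docs, and scores are made-up names): the query is mapped into the k-dimensional space as q^T U_k D_k^(-1) and documents are ranked by cosine.
b = svd(bellcore)
k = 2
q = as.numeric(rownames(bellcore) %in% c("human", "computer"))   # query "human computer interaction" as a term vector
q.hat = t(q) %*% b$u[, 1:k] %*% diag(1 / b$d[1:k])               # query coordinates in the latent space (1 x k)
docs = b$v[, 1:k]                                                # document coordinates (9 x k)
scores = drop(docs %*% t(q.hat)) / (sqrt(rowSums(docs^2)) * sqrt(sum(q.hat^2)))
names(scores) = colnames(bellcore)
round(sort(scores, decreasing = TRUE), 2)   # the c documents (including c3 and c5) should now outrank the m documents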