Homework

Define a loss function that compares two matrices (say, mean square error).

b = svd(bellcore)
b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
b3 = b$u[,1:3] %*% diag(b$d[1:3]) %*% t(b$v[,1:3])

More generally, for all possible r:
– Let b.r = b$u[,1:r] %*% diag(b$d[1:r]) %*% t(b$v[,1:r])
– Compute the loss between bellcore and b.r as a function of r
– Plot the loss as a function of r
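
A minimal sketch of one way to approach this, assuming bellcore is already loaded and using mean squared error as the loss (the mse and loss names are illustrative, not part of the assignment):

mse = function(a, b) mean((a - b)^2)   # elementwise mean squared error

b = svd(bellcore)
r.max = length(b$d)                    # at most min(nrow, ncol) = 9 here

loss = sapply(1:r.max, function(r) {
  # drop = FALSE and nrow = r keep the r = 1 case from collapsing
  b.r = b$u[, 1:r, drop = FALSE] %*%
        diag(b$d[1:r], nrow = r) %*%
        t(b$v[, 1:r, drop = FALSE])
  mse(bellcore, b.r)
})

plot(1:r.max, loss, type = "b", xlab = "r", ylab = "MSE")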

IR Models

Keywords (and Boolean combinations thereof)

Vector-Space ‘‘Model’’ (Salton, chap. 10.1)
– Represent the query and the documents as V-dimensional vectors
– Sort documents by the similarity (e.g., cosine) between the query vector and each document vector

Probabilistic Retrieval Model (Salton, chap. 10.3)
– Sort documents by the estimated probability of relevance to the query
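
As a hedged illustration, vector-space ranking takes only a few lines of R (the cosine name is ours; bellcore is the term-by-document matrix defined later in these slides):

cosine = function(q, d) sum(q * d) / (sqrt(sum(q^2)) * sqrt(sum(d^2)))

# Build a query vector over the same terms and score every document column
q = as.numeric(rownames(bellcore) %in% c("human", "computer"))
sort(apply(bellcore, 2, function(d) cosine(q, d)), decreasing = TRUE)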

Information Retrieval and Web Search
Alternative IR models
Instructor: Rada Mihalcea
Some of the slides were adapted from a course taught at Cornell University by William Y. Arms.

Latent Semantic Indexing

Objective: Replace indexes that use sets of index terms by indexes that use concepts.

Approach: Map the term vector space into a lower-dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.

Deficiencies with Conventional Automatic Indexing

Synonymy: Various words and phrases refer to the same concept (lowers recall).
Polysemy: Individual words have more than one meaning (lowers precision).
Independence: No significance is given to two terms that frequently appear together.

Latent semantic indexing addresses the first of these (synonymy) and the third (dependence).

Bellcore’s Example

c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

Term by Document Matrix

"bellcore"<- structure(.Data = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),.Dim = c( 12, 9),.Dimnames = list(c("human", "interface", "computer", "user", "system", "response", "time", "EPS", "survey", "trees", "graph", "minors"), c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"))) help(dump) help(source)

Query Expansion

Query: Find documents relevant to human computer interaction
Simple term matching:
– Matches c1, c2, and c4
– Misses c3 and c5
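
A small check of the term-matching claim in R (the query-vector construction is ours; only the query terms that appear in the term list are used):

q.terms = c("human", "computer")      # "interaction" is not in the term list
colSums(bellcore[q.terms, ]) > 0      # TRUE for c1, c2, c4; FALSE for c3 and c5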

Large Correlations

Correlations: Too Large to Ignore

How to compute correlations

round(100 * cor(bellcore))      # 9 x 9 document-document correlations, as percentages
round(100 * cor(t(bellcore)))   # 12 x 12 term-term correlations, as percentages

[The numeric output on the slide did not survive transcription; only the row/column labels (c1–c5, m1–m4; human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors) remain.]

plot(hclust(as.dist(-cor(t(bellcore)))))   # cluster the terms (rows)

plot(hclust(as.dist(-cor(bellcore))))      # cluster the documents (columns)

Correcting for Large Correlations

Thesaurus

Term by Doc Matrix: Before & After Thesaurus

Singular Value Decomposition (SVD)

X = U D V^T

where X is t x d, U is t x m, D is m x m, and V^T is m x d.

m is the rank of X, m ≤ min(t, d)
D is diagonal
– The squared singular values D^2 are the eigenvalues (sorted in descending order)
U^T U = I and V^T V = I
– Columns of U are eigenvectors of X X^T
– Columns of V are eigenvectors of X^T X
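
As a hedged sanity check in R (the equalities hold only up to floating-point error):

b = svd(bellcore)   # b$u is 12 x 9, b$d has length 9, b$v is 9 x 9
# X is recovered exactly from its factors
all.equal(bellcore, b$u %*% diag(b$d) %*% t(b$v), check.attributes = FALSE)
# Columns of U and of V are orthonormal
all.equal(t(b$u) %*% b$u, diag(length(b$d)))
all.equal(t(b$v) %*% b$v, diag(length(b$d)))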

Dimensionality Reduction

X̂ = U_k D_k V_k^T

where X̂ is t x d, U_k is t x k, D_k is k x k, and V_k^T is k x d.
k is the number of latent concepts (typically 300 ~ 500).

Dimension Reduction in R

b = svd(bellcore)
b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
dimnames(b2) = dimnames(bellcore)
par(mfrow=c(2,2))
plot(hclust(as.dist(-cor(bellcore))))
plot(hclust(as.dist(-cor(t(bellcore)))))
plot(hclust(as.dist(-cor(b2))))
plot(hclust(as.dist(-cor(t(b2)))))

SVD

B B^T = U D^2 U^T   (term space)
B^T B = V D^2 V^T   (document space)

[Slide diagram labels: Latent, Term, Doc]
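
A hedged numerical check that the squared singular values are the eigenvalues of B B^T:

b = svd(bellcore)
e = eigen(bellcore %*% t(bellcore), symmetric = TRUE)
# Agreement up to floating-point error
max(abs(e$values[1:length(b$d)] - b$d^2))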

Dimension Reduction → Block Structure

round(100 * cor(bellcore))   # document correlations before reduction
round(100 * cor(b2))         # document correlations after rank-2 reduction

[The numeric output did not survive transcription; on the slide, cor(b2) shows a clear block structure: high correlations within the c group and within the m group, low or negative correlations across the two groups.]

Dimension Reduction → Block Structure

round(100 * cor(t(bellcore)))   # term correlations before reduction
round(100 * cor(t(b2)))         # term correlations after rank-2 reduction

[The numeric output did not survive transcription; on the slide, cor(t(b2)) shows the same block structure over terms, splitting the vocabulary into a human-computer-interaction block and a graph-theory block.]

The term vector space

[Figure: two document vectors d1 and d2 plotted against term axes t1, t2, t3]

The space has as many dimensions as there are terms in the word list.

Latent concept vector space

[Figure: terms, documents, and the query plotted together in the latent concept space; a cosine threshold (cosine > 0.9) around the query picks out the nearby documents]
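
A hedged sketch of folding a query into the 2-dimensional latent space and ranking documents there (the variable names and the fold-in step q^T U_k D_k^-1 are standard LSI practice, not taken from the slides):

b = svd(bellcore)
U.k = b$u[, 1:2]
D.k = diag(b$d[1:2])
V.k = b$v[, 1:2]

# Fold the query "human computer interaction" into the latent space
q = as.numeric(rownames(bellcore) %in% c("human", "computer"))
q.hat = t(q) %*% U.k %*% solve(D.k)

# Document coordinates in the latent space, one row per document
docs = V.k %*% D.k
sims = apply(docs, 1, function(d)
  sum(q.hat * d) / (sqrt(sum(q.hat^2)) * sqrt(sum(d^2))))
names(sims) = colnames(bellcore)
round(sort(sims, decreasing = TRUE), 2)  # the c documents, including c3 and c5, now rank high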