Singular Value Decomposition in Text Mining
Ram Akella
University of California Berkeley Silicon Valley Center/SC
Lecture 4b, February 9, 2011

Class Outline
- Summary of last lecture
- Indexing
- Vector Space Models
- Matrix Decompositions
- Latent Semantic Analysis
- Mechanics
- Example

Summary of previous class
- Principal Component Analysis
- Singular Value Decomposition
- Uses
- Mechanics
- Example: swap rates

Introduction
How can we retrieve information using a search engine? We can represent the query and the documents as vectors (the vector space model). However, to construct these vectors we must first perform some preliminary document preparation. Documents are then retrieved by finding the document vectors closest to the query vector. Which distance (or similarity) measure is the most suitable for retrieving documents?

Search engine

Document File Preparation  Manual Indexing Relationships and concepts between topics can be established It is expensive and time consuming It may not be reproduced if it is destroyed. The huge amount of information suggest a more automated system

Document File Preparation  Automatic indexing To buid an automatic index, we need to perform two steps: Document Analysis Decide what information or parts of the document should be indexed Token analysis Decide with words should be used in order to obtain the best representation of the semantic content of documents.

Document Normalization
After this preliminary analysis we need to perform further preprocessing of the data:
- Remove stop words
  - Function words: a, an, as, for, in, of, the, ...
  - Other very frequent words
- Stemming
  - Group morphological variants: plurals ("streets" -> "street"), adverbs ("fully" -> "full")
  - Current stemming algorithms can make some mistakes: "police", "policy" -> "polic"
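As an illustration (not part of the original slides), here is a minimal preprocessing sketch in Python. The stop-word list and tokenizer are simplified stand-ins, and NLTK's PorterStemmer is used only as one example of a stemmer, not the specific algorithm referenced in the lecture.

```python
import re
from nltk.stem import PorterStemmer  # one possible stemmer; any stemmer could be substituted

STOP_WORDS = {"a", "an", "as", "for", "in", "of", "the", "to", "and"}  # toy stop-word list

def normalize(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(normalize("The streets were fully policed"))
# stop word "the" is removed; stems include 'street' (from "streets") and 'polic' (from "policed")
```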

File Structures
Once we have eliminated the stop words and applied the stemmer to the documents we can construct:
- Document file: we extract the terms that should be used in the index and assign a number to each document.

File Structures  Dictionary We will construct a searchable dictionary of terms by arranging them alphabetically and indicating the frequency of each term in the collection TermGlobal Frequency banana1 cranb2 Hanna2 hunger1 manna1 meat1 potato1 query1 rye2 sourdough1 spiritual1 wheat2

File Structures  Inverted List For each term we find the documents and its related position associated with Term(Doc, Position) banana(5,7) cranb(4,5); (6,4) Hanna(1,7); (8,2) hunger(9,4) manna(2,6) meat(7,6) potato(4,3) query(3,8) rye(3,3);(6,3) sourdough(5,5) spiritual(7,5) wheat(3,5);(6,6)

Vector Space Model
- The vector space model can be used to represent terms and documents in a text collection.
- A collection of n documents indexed by m terms can be represented as an m × n matrix, where the rows correspond to the terms and the columns to the documents.
- Once we construct the matrix, we can normalize its columns so that each document vector has unit length.
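A small NumPy sketch of the term-document matrix construction and column normalization described above; the helper names and the toy vocabulary are illustrative assumptions.

```python
import numpy as np

def term_document_matrix(docs, vocabulary):
    """m x n matrix of raw term counts: rows are terms, columns are documents."""
    A = np.zeros((len(vocabulary), len(docs)))
    term_index = {t: i for i, t in enumerate(vocabulary)}
    for j, tokens in enumerate(docs):
        for t in tokens:
            if t in term_index:
                A[term_index[t], j] += 1
    return A

def normalize_columns(A):
    """Scale each document (column) to unit Euclidean length."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0  # leave empty documents untouched instead of dividing by zero
    return A / norms

vocab = ["hanna", "manna", "rye", "query", "wheat"]
A = normalize_columns(term_document_matrix(docs, vocab))  # reuses the toy documents above
```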

Vector Space Models

Query Matching  If we want to retrieve a document we should: Transform the query to a vector look for the most similar document vector to the query. One of the most common similarity methods is the cosine distance between vectors defined as: Where a is the document and q is the query vector

Example:  Using the book titles we want to retrieve books of “Child Proofing” Book titles Query Cos  2 =Cos  3 = Cos  5 =Cos  6 =0.500 With a threshold of 0.5, the 5 th and the 6 th would be retrieved.

Term weighting  In order to improve the retrieval, we can give to some terms more weight than others. Where Local Term WeightsGlobal Term Weights

Synonymy and Polysemy
- Example term sets from the slide: {auto, engine, bonnet, tyres, lorry, boot} vs. {car, emissions, hood, make, model, trunk}, and {make, hidden Markov model, emissions, normalize}.
- Synonymy: documents that describe the same concepts with different words will have a small cosine even though they are related.
- Polysemy: documents that share ambiguous words will have a large cosine even though they are not truly related.

Matrix Decomposition
To produce a reduced-rank approximation of the document matrix, we first need to identify the dependence between the columns (documents) and the rows (terms). Two factorizations can be used:
- QR factorization
- Singular value decomposition (SVD)

QR Factorization  The document matrix A can be decomposed as below: Where Q is an mXm orthogonal matrix and R is an mX m upper triangular matrix  This factorization can be used to determine the basis vectors for any matrix A  This factorization can be used to describe the semantic content of the corresponding text collection

Example: A = (term-document matrix shown on the slide)

Example

Query Matching  We can rewrite the cosine distance using this decomposition  Where r j refers to column j of the matrix R

Singular Value Decomposition (SVD)
- This decomposition provides reduced-rank approximations in both the column space and the row space of the document matrix.
- The decomposition is defined as

  A = U Σ V^T

  where A is m × n, U is m × m, Σ is m × n diagonal, and V is n × n.
- The columns of U are orthogonal eigenvectors of AA^T, the columns of V are orthogonal eigenvectors of A^T A, and the singular values σ_1 ≥ … ≥ σ_r on the diagonal of Σ are the square roots of the nonzero eigenvalues of A^T A (which coincide with those of AA^T).
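A short NumPy check of these facts on the toy matrix (illustrative only):

```python
U, s, Vt = np.linalg.svd(A_counts)                        # A = U @ Sigma @ V^T
k = len(s)
assert np.allclose(U[:, :k] @ np.diag(s) @ Vt, A_counts)  # the factors reproduce A
eigvals = np.sort(np.linalg.eigvalsh(A_counts.T @ A_counts))
assert np.allclose(np.sort(s**2), eigvals[-k:])           # sigma_i^2 = eigenvalues of A^T A
```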

Latent Semantic Analysis (LSA)
- LSA is the application of the SVD to text mining: we decompose the term-document matrix A into three matrices, A = U Σ V^T.
- With terms as rows and documents as columns, the U matrix refers to terms and the V matrix refers to documents.

Latent Semantic Analysis  Once we have decomposed the document matrix A we can reduce its rank We can account for synonymy and polysemy in the retrieval of documents Select the vectors associated with the higher value of  in each matrix and reconstruct the matrix A

Latent Semantic Analysis

Query Matching  The cosines between the vector q and the n document vectors can be represented as:  where e j is the canonical vector of dimension n This formula can be simplified as where

Example
Apply the LSA method to the following technical memo titles:
 c1: Human machine interface for ABC computer applications
 c2: A survey of user opinion of computer system response time
 c3: The EPS user interface management system
 c4: System and human system engineering testing of EPS
 c5: Relation of user perceived response time to error measurement
 m1: The generation of random, binary, ordered trees
 m2: The intersection graph of paths in trees
 m3: Graph minors IV: Widths of trees and well-quasi-ordering
 m4: Graph minors: A survey

Example
First we construct the term-document matrix (shown on the slide); a sketch of the construction follows.
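A hedged sketch of how this matrix could be built with the helpers above. Keeping only terms that occur in more than one title is an assumption about the slide's indexing rule, as are the stop-word list and stemmer used in normalize().

```python
from collections import Counter

titles = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}
docs_lsa = [normalize(t) for t in titles.values()]
doc_freq = Counter(term for tokens in docs_lsa for term in set(tokens))
vocab_lsa = sorted(t for t, c in doc_freq.items() if c > 1)   # keep terms occurring in > 1 title
A_lsa = term_document_matrix(docs_lsa, vocab_lsa)
```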

Example
The resulting decomposition is the following; the {U}, {S}, and {V} matrices are shown on the slides.

Example  We will perform a 2 rank reconstruction: We select the first two vectors in each matrix and set the rest of the matrix to zero We reconstruct the document matrix

Example
In the rank-2 reconstruction, the word "user" now has weight in the documents where the word "human" appears, even though the two words never co-occur in the original titles: the reduced-rank space captures their latent association.