The Mathematics of Information Retrieval 11/21/2005 Presented by Jeremy Chapman, Grant Gelven and Ben Lakin.

Acknowledgments This presentation is based on the following paper: "Matrices, Vector Spaces, and Information Retrieval" by Michael W. Berry, Zlatko Drmač, and Elizabeth R. Jessup.

Indexing of Scientific Works Indexing is primarily done using the title, author list, abstract, key word list, and subject classification. These are created in large part to allow documents to be found in a search of the scientific literature. The use of automated information retrieval (IR) has improved consistency and speed.

Vector Space Model for IR The basic mechanism of this model is the encoding of each document as a vector. All document vectors are stored in a single matrix. Latent Semantic Indexing (LSI) replaces the original matrix with a matrix of smaller rank, while maintaining similar information, by use of rank reduction.

Creating the Database Matrix Each document is represented by a column of the matrix (d is the number of documents). Each term is represented by a row (t is the number of terms). This gives us a t x d matrix. The document vectors span the content of the collection.

Simple Example Let the six terms be as follows:
T1: bak(e, ing)
T2: recipes
T3: bread
T4: cake
T5: pastr(y, ies)
T6: pie
The following are the d = 5 documents:
D1: How to Bake Bread Without Recipes
D2: The Classical Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes
Thus the document matrix A becomes the 6 x 5 term-by-document incidence matrix sketched below.
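The matrix itself appeared only as an image in the original slides. As a hedged reconstruction from the term and document lists above (a 0/1 incidence matrix, where entry (i, j) is 1 if term i occurs in document j), a minimal NumPy sketch might look like this:

```python
import numpy as np

# Rows are the terms T1..T6, columns are the documents D1..D5.
# A[i, j] = 1 if term i appears in the title of document j, else 0.
A = np.array([
    [1, 0, 0, 1, 0],  # T1: bak(e, ing)
    [1, 0, 1, 1, 1],  # T2: recipes
    [1, 0, 0, 1, 0],  # T3: bread
    [0, 0, 0, 1, 0],  # T4: cake
    [0, 1, 0, 1, 1],  # T5: pastr(y, ies)
    [0, 0, 0, 1, 0],  # T6: pie
], dtype=float)
```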

The matrix A after Normalization Thus, after normalizing each column of A to unit Euclidean length, we get the following:
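A minimal sketch of that normalization step, continuing the NumPy example above:

```python
# Scale each column of A to unit Euclidean length; with unit columns,
# the cosine between a document and a query reduces to a dot product.
column_norms = np.linalg.norm(A, axis=0)
A_hat = A / column_norms
```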

Making a Query Next we will use the document matrix to ease our search for related documents. Referring to our example, we will make the following query: baking bread. We now encode the query using the term definitions given before: q = (1 0 1 0 0 0)^T.

Matching the Documents to the Query Matching the documents to a given query is typically done by using the cosine of the angle between the query vector and each document vector. The cosine is given as follows: cos θ_j = (a_j^T q) / (||a_j||_2 ||q||_2), where a_j is the j-th column of the document matrix.
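A short sketch of this matching step, under the same assumptions as the earlier snippets:

```python
# Query vector for "baking bread": terms T1 (bake) and T3 (bread).
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)

# cos(theta_j) = (a_j . q) / (||a_j|| ||q||) for every document column a_j.
cosines = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Keep documents whose cosine exceeds the cutoff of 0.5 (D1 and D4 here).
hits = np.where(cosines > 0.5)[0]
```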

A Query By using the cosine formula we obtain the cosine for each document. We will set our lower limit on the cosine at .5. Thus, by conducting the query baking bread, we get the following two articles:
D1: How to Bake Bread Without Recipes
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes

Singular Value Decomposition The Singular Value Decomposition (SVD) is used to reduce the rank of the matrix while also giving a good approximation of the information stored in it. The decomposition is written in the following manner: A = U Σ V^T, where U spans the column space of A, Σ is the matrix with the singular values of A along its main diagonal, and V spans the row space of A. U and V are also orthogonal.
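Assuming the normalized matrix A_hat from the earlier sketches, the factorization can be computed directly with NumPy:

```python
# Thin SVD of the normalized term-document matrix: A_hat = U @ Sigma @ Vt.
U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)
Sigma = np.diag(s)  # singular values in decreasing order on the diagonal

# U and V have orthonormal columns: U.T @ U and Vt @ Vt.T are identities.
```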

SVD continued Unlike the QR factorization, the SVD provides us with a lower-rank representation of both the column space and the row space. We know A_k is the best rank-k approximation to A by the Eckart-Young theorem, which states that ||A - A_k|| = min over all B with rank(B) ≤ k of ||A - B||. Thus the rank-k approximation of A is given as follows: A_k = U_k Σ_k V_k^T, where U_k is the matrix of the first k columns of U, Σ_k is the k x k diagonal matrix whose diagonal holds the decreasing singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_k, and V_k^T is the k x d matrix whose rows are the first k rows of V^T.
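Continuing the NumPy sketch, truncating the factors gives the rank-k approximation:

```python
def rank_k_approx(U, s, Vt, k):
    """Best rank-k approximation A_k = U_k Sigma_k V_k^T (Eckart-Young)."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A3 = rank_k_approx(U, s, Vt, 3)  # the rank-3 approximation used below
```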

SVD Factorization

Interpretation From the factorization given on the slide before we notice that Σ has only four non-zero singular values, so A is a rank-4 matrix. The zero rows of Σ also tell us that the last two columns of U do not contribute, so the first four columns of U give us a basis for the column space of A.

Analysis of the Rank-k Approximations Using the following formula we can calculate the relative error from the original matrix to its rank-k approximation: ||A - A_k||_F / ||A||_F = sqrt(σ_{k+1}^2 + ... + σ_r^2) / sqrt(σ_1^2 + ... + σ_r^2), where r is the rank of A. Thus only a 19% relative error is incurred in passing from the rank-4 matrix to its rank-3 approximation, whereas a 42% relative error is incurred in passing to a rank-2 approximation. As expected, these values are smaller than those of the corresponding rank-k approximations from the QR factorization.
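A hedged sketch of how those relative errors would be computed from the singular values in the running NumPy example (the 19% and 42% figures above are the ones reported in the slides, not values recomputed here):

```python
def relative_error(s, k):
    """Relative Frobenius-norm error of the rank-k approximation."""
    return np.sqrt(np.sum(s[k:] ** 2)) / np.sqrt(np.sum(s ** 2))

print(relative_error(s, 3))  # rank-4 matrix -> rank-3 approximation
print(relative_error(s, 2))  # rank-4 matrix -> rank-2 approximation
```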

Using the SVD for Query Matching Using the following formula we can calculate the cosine of the angle between the query and each column of our rank-k approximation of A: cos θ_j = ((A_k e_j)^T q) / (||A_k e_j||_2 ||q||_2). Using the rank-3 approximation, and again using the cutoff of .5, we return the first and fourth books.
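One straightforward way to carry this out, assuming the rank-3 matrix A3 from the earlier sketch (the paper evaluates an algebraically equivalent factored form rather than building A_k explicitly):

```python
# Cosines between the query q and the columns of the rank-3 approximation.
cosines_k = (A3.T @ q) / (np.linalg.norm(A3, axis=0) * np.linalg.norm(q))
hits_k = np.where(cosines_k > 0.5)[0]  # again expected to return D1 and D4
```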

Term-Term Comparison It is possible to modify the vector space model used for comparing queries with documents in order to compare terms with terms. When this is added to a search engine it can act as a tool to refine the results. First we run our search as before and retrieve a certain number of documents; in the following example five documents are retrieved. We will then create another document matrix from the retrieved documents; call it G.

Another Example Let the seven terms be as follows:
T1: run(ning)
T2: bike
T3: endurance
T4: training
T5: band
T6: music
T7: fishes
The d = 5 retrieved documents are:
D1: Complete Triathlon Endurance Training Manual: Swim, Bike, Run
D2: Lake, River, and Sea-Run Fishes of Canada
D3: Middle Distance Running, Training and Competition
D4: Music Law: How to Run Your Band's Business
D5: Running: Learning, Training, Competing

Analysis of the Term-Term Comparison For this we use the cosine of the angle between pairs of term vectors (the rows of G): cos θ_{ij} = (g_i^T g_j) / (||g_i||_2 ||g_j||_2), where g_i denotes the i-th row of G.
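A minimal sketch of this term-term comparison, assuming G is reconstructed from the titles above as a 0/1 incidence matrix (an assumption, since the original slide showed it only as an image):

```python
# Rows are the terms T1..T7, columns are the retrieved documents D1..D5.
G = np.array([
    [1, 1, 1, 1, 1],  # T1: run(ning)
    [1, 0, 0, 0, 0],  # T2: bike
    [1, 0, 0, 0, 0],  # T3: endurance
    [1, 0, 1, 0, 1],  # T4: training
    [0, 0, 0, 1, 0],  # T5: band
    [0, 0, 0, 1, 0],  # T6: music
    [0, 1, 0, 0, 0],  # T7: fishes
], dtype=float)

# Cosine between every pair of term vectors (rows of G).
row_norms = np.linalg.norm(G, axis=1, keepdims=True)
term_cosines = (G @ G.T) / (row_norms @ row_norms.T)
```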

Clustering Clustering is the process by which related terms are grouped together, such as bike, endurance, and training. First the terms are split into groups whose members are related. The terms in each group are placed such that their vectors are almost parallel.

Clusters In this example the first cluster is running; the second cluster is bike, endurance, and training; the third is band and music; and the fourth is fishes.

Analyzing the Term-Term Comparison We will again use the SVD rank-k approximation, this time of G. Thus the cosine of the angles becomes: cos θ_{ij} = (g_i^(k) · g_j^(k)) / (||g_i^(k)||_2 ||g_j^(k)||_2), where g_i^(k) denotes the i-th row of G_k = U_k Σ_k V_k^T.

Conclusion Through the use of this model, many libraries and smaller collections can index their documents. However, as the next presentation will show, a different approach is used for large collections such as the Internet.