Linear Algebra for Machine Learning and IR
Manoj Kumar Singh
DST-Centre for Interdisciplinary Mathematical Sciences (DST-CIMS), Banaras Hindu University (BHU), Varanasi-221005, INDIA. E-mail: manoj.dstcims@bhu.ac.in
DST-CIMS International Workshop on Machine Learning and Text Analytics (MLTA 2013), South Asian University (SAU), New Delhi, December 15, 2013.
Contents
- Vector Matrix Model in IR, ML and Other Areas
- Vector Space: formal definition, linear combination, independence, generators and basis, dimension, inner product, norm, orthogonality, examples
- Linear Transformation: definition, matrix and determinant, LT using matrices, rank and nullity, column space and row space, invertibility, singularity and non-singularity, eigenvalues and eigenvectors
- Different Types of Matrices and Matrix Algebra
- Matrix Factorization
- Applications
Vector Matrix Model in IR
A collection consisting of the following five documents is queried for "latent semantic indexing" (q):
d1 = LSI tutorials and fast tracks.
d2 = Books on semantic analysis.
d3 = Learning latent semantic indexing.
d4 = Advances in structures and advances in indexing.
d5 = Analysis of latent structures.
Rank the documents in decreasing order of relevance to the query.

Recommendation System: item-based collaborative filtering

        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      3
User2     4      3      4      3      5
User3     3      3      1      5      4
User4     1      5      5      2      1

Classification
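As a sketch of how the item-based prediction above can be computed, here is a small Python example. The helper names and the plain-cosine weighting scheme are illustrative choices, not from the slides: it measures item-item cosine similarity on the other users' rating columns and predicts Alice's missing Item5 rating as a similarity-weighted average of her own ratings.

```python
import math

# Ratings from the slide's table; None marks Alice's missing rating.
ratings = {
    "Alice": [5, 3, 4, 4, None],
    "User1": [3, 1, 2, 3, 3],
    "User2": [4, 3, 4, 3, 5],
    "User3": [3, 3, 1, 5, 4],
    "User4": [1, 5, 5, 2, 1],
}

def item_column(i):
    # Ratings of item i by every user except Alice (the target user).
    return [r[i] for u, r in ratings.items() if u != "Alice"]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Similarity of Item1..Item4 to the target Item5 (index 4).
target = item_column(4)
sims = [cosine(item_column(i), target) for i in range(4)]

# Predict Alice's rating for Item5 as a similarity-weighted
# average of her ratings for the other items.
alice = ratings["Alice"]
pred = sum(s * r for s, r in zip(sims, alice[:4])) / sum(sims)
```

With these numbers the prediction comes out close to 4, so Item5 looks worth recommending to Alice. A production recommender would more commonly use the adjusted cosine, which subtracts each user's mean rating before comparing item columns.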
Blind Source Separation
Source vs. measured signals (figures omitted).
Imaging Application
Vector Space
Def.: An algebraic structure consisting of a set V of vectors, a field F of scalars, and binary operations (vector addition and scalar multiplication) is a vector space if these operations satisfy the vector space axioms.
Vector Space
Note:
1. Elements of V are called vectors and elements of F are called scalars.
2. "Vector" here does not mean a vector quantity in the sense of vector algebra, i.e., a directed line segment.
3. We say V is a vector space over the field F and denote it by V(F).
Linear Algebra:
Vector Space
Linear Combination:
Subspace:
Generator:
e.g.
Vector Space
Linear Span:
Note:
Linear Dependence (LD):
Linear Independence (LI):
Basis:
Dimension:
e.g.
Vector Space
Inner Product:
Norm / Length:
Distance:
Note:
Orthogonality:
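A minimal NumPy sketch of these notions (the vectors are illustrative, not from the slides):

```python
import numpy as np

# Illustrative vectors in R^3 (not from the slides).
u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 0.0, -1.0])

inner = float(np.dot(u, v))          # inner product <u, v>
norm_u = float(np.linalg.norm(u))    # length ||u|| = sqrt(<u, u>)
dist = float(np.linalg.norm(u - v))  # distance ||u - v||

# u and v are orthogonal exactly when <u, v> = 0; that is the
# case here: inner == 0.0, while ||u|| == 3.0.
```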
Linear Transformation
Definition (LT):
Linear Operator:
Range Space of an LT:
Null Space of an LT:
Note:
Rank and Nullity of an LT:
Note:
Non-Singular Transformation:
Singular Transformation:
Matrices
Definition:
Unit / Identity Matrix:
Diagonal Matrix:
Scalar Matrix:
Matrices
Upper Triangular Matrix:
Lower Triangular Matrix:
Symmetric:
Skew-Symmetric:
Matrices
Transpose:
Trace:
Row/Column Vector Representation of a Matrix:
Matrices
Row Space and Row Rank of a Matrix:
Column Space and Column Rank of a Matrix:
Rank of a Matrix:
Determinant of a Square Matrix:
Determinant
Some Properties of the Determinant:
Cofactor Expansion
Minors:
Leading Minors:
Cofactors:
Cofactor Expansion
Evaluation of the Determinant:
Cofactor Matrix:
Inverse of a Matrix:
Singular and Non-Singular Matrices:
Cofactor Expansion
Invertibility of a Matrix:
Rank of a Matrix:
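A short NumPy sketch tying these notions together (illustrative matrices, not from the slides): a square matrix is invertible exactly when its determinant is nonzero, equivalently when it has full rank.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])

# A square matrix is non-singular (invertible) iff det(A) != 0,
# equivalently iff it has full rank.
d = np.linalg.det(A)            # 2*1 - 1*1 = 1
r = np.linalg.matrix_rank(A)    # 2 (full rank)

Ainv = np.linalg.inv(A)
# The defining property of the inverse: A @ A^{-1} = I.
assert np.allclose(A @ Ainv, np.eye(2))

# A singular matrix: the second row is a multiple of the first.
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])
assert np.isclose(np.linalg.det(B), 0.0)
assert np.linalg.matrix_rank(B) == 1
```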
LT using Matrix
Example:
Eigenvalues and Eigenvectors
Eigenvalue and Eigenvector of an LT:
Eigenvalue and Eigenvector of a Matrix:
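The defining relation Av = λv can be checked numerically; a small NumPy sketch with an illustrative matrix (not from the slides):

```python
import numpy as np

# Illustrative symmetric matrix (not from the slides).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

vals, vecs = np.linalg.eig(A)

# Check the defining relation A v = lambda v for each pair.
for lam, v in zip(vals, vecs.T):
    assert np.allclose(A @ v, lam * v)

# The characteristic polynomial (2 - t)^2 - 1 = 0 gives t = 1, 3.
assert np.allclose(np.sort(vals.real), [1.0, 3.0])
```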
Eigenvalues and Eigenvectors
Properties:
Similarity of Matrices
Diagonalizable Matrix: Def.
Singular Value Decomposition
Def.: A singular value and corresponding singular vectors of a rectangular matrix A are, respectively, a scalar σ and a pair of vectors u and v that satisfy
Av = σu and Aᵀu = σv.
With the singular values on the diagonal of a diagonal matrix Σ and the corresponding singular vectors forming the columns of two orthogonal matrices U and V, we have
AV = UΣ and AᵀU = VΣ.
Since U and V are orthogonal, this becomes the singular value decomposition
A = UΣVᵀ.
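A NumPy sketch verifying these defining relations on a small illustrative matrix (note that np.linalg.svd returns V transposed):

```python
import numpy as np

# Illustrative rectangular matrix (not from the slides).
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Defining relations: A v = sigma u and A^T u = sigma v.
for i, sigma in enumerate(s):
    assert np.allclose(A @ Vt[i], sigma * U[:, i])
    assert np.allclose(A.T @ U[:, i], sigma * Vt[i])

# Putting the pieces together: A = U Sigma V^T.
assert np.allclose(A, U @ np.diag(s) @ Vt)
```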
Matrix Factorizations
LU Factorization: LU factorization, or Gaussian elimination, expresses any square matrix A as the product of a permutation of a lower triangular matrix and an upper triangular matrix, A = LU, where L is a permutation of a lower triangular matrix with ones on its diagonal and U is an upper triangular matrix.
Cholesky Factorization: The Cholesky factorization expresses a symmetric matrix as the product of a triangular matrix and its transpose, A = RᵀR, where R is an upper triangular matrix. Not all symmetric matrices can be factored in this way; the matrices that have such a factorization are said to be positive definite. The Cholesky factorization allows the linear system Ax = b to be replaced by the pair of triangular systems Rᵀy = b and Rx = y, which are solved easily by forward and backward substitution.
QR Factorization: The orthogonal, or QR, factorization expresses any rectangular matrix as the product of an orthogonal or unitary matrix and an upper triangular matrix, A = QR, where Q is orthogonal or unitary and R is upper triangular.
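A NumPy sketch of the Cholesky and QR factorizations, and of solving Ax = b through two triangular systems. The matrix is illustrative; NumPy's cholesky returns the lower-triangular factor L = Rᵀ, and the triangular solves are done with the generic solver for brevity.

```python
import numpy as np

# Illustrative symmetric positive definite matrix (not from the slides).
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
b = np.array([2.0, 1.0])

# Cholesky: NumPy returns the lower-triangular L with A = L L^T,
# i.e. R = L^T in the slide's A = R^T R convention.
L = np.linalg.cholesky(A)
assert np.allclose(A, L @ L.T)

# Replace A x = b by the two triangular systems L y = b and L^T x = y
# (solved here with the generic solver for brevity).
y = np.linalg.solve(L, b)
x = np.linalg.solve(L.T, y)
assert np.allclose(A @ x, b)

# QR: A = Q R with Q orthogonal and R upper triangular.
Q, R = np.linalg.qr(A)
assert np.allclose(A, Q @ R)
assert np.allclose(Q.T @ Q, np.eye(2))
```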
APPLICATION: Document Ranking
Document Ranking
A collection consisting of the following five documents:
d1 = LSI tutorials and fast tracks.
d2 = Books on semantic analysis.
d3 = Learning latent semantic indexing.
d4 = Advances in structures and advances in indexing.
d5 = Analysis of latent structures.
is queried for "latent semantic indexing" (q). Rank the documents in decreasing order of relevance to the query, i.e., in decreasing order of cosine similarity.
Assume that the documents are linearized, tokenized, and their stopwords removed; stemming is not used. The surviving terms are used to construct a term-document matrix A, populated with term weights.
Document Ranking (Procedure)
Term-Document Matrix (raw term frequencies):

             d1  d2  d3  d4  d5
LSI           1   0   0   0   0
tutorials     1   0   0   0   0
fast          1   0   0   0   0
tracks        1   0   0   0   0
books         0   1   0   0   0
semantic      0   1   1   0   0
analysis      0   1   0   0   1
learning      0   0   1   0   0
latent        0   0   1   0   1
indexing      0   0   1   1   0
advances      0   0   0   2   0
structures    0   0   0   1   1
Document Ranking
Step 1: Weight matrix. Each entry is tf × log10(5/df):

A =
             d1      d2      d3      d4      d5
LSI         0.6990  0       0       0       0
tutorials   0.6990  0       0       0       0
fast        0.6990  0       0       0       0
tracks      0.6990  0       0       0       0
books       0       0.6990  0       0       0
semantic    0       0.3979  0.3979  0       0
analysis    0       0.3979  0       0       0.3979
learning    0       0       0.6990  0       0
latent      0       0       0.3979  0       0.3979
indexing    0       0       0.3979  0.3979  0
advances    0       0       0       1.3980  0
structures  0       0       0       0.3979  0.3979

q = (0 0 0 0 0 1 0 0 1 1 0 0)ᵀ, i.e. semantic, latent, and indexing each weighted 0.3979.
Document Ranking
Step 2: Normalization (each column scaled to unit length):

An =
             d1      d2      d3      d4      d5
LSI         0.5000  0       0       0       0
tutorials   0.5000  0       0       0       0
fast        0.5000  0       0       0       0
tracks      0.5000  0       0       0       0
books       0       0.7790  0       0       0
semantic    0       0.4434  0.4054  0       0
analysis    0       0.4434  0       0       0.5774
learning    0       0       0.7121  0       0
latent      0       0       0.4054  0       0.5774
indexing    0       0       0.4054  0.2640  0
advances    0       0       0       0.9277  0
structures  0       0       0       0.2640  0.5774

qn = (0 0 0 0 0 0.5774 0 0 0.5774 0.5774 0 0)ᵀ
Document Ranking
Step 3: Compute the cosine similarities qnᵀAn:

          d1      d2      d3      d4      d5
sim(q,d)  0       0.2560  0.7022  0.1524  0.3334

Documents ranked in decreasing order of relevance: d3 > d5 > d2 > d4 > d1.

Exercises
1. Repeat the above calculations, this time including all stopwords. Explain any difference in computed results.
2. Repeat the above calculations, this time scoring global weights using probabilistic IDF (IDFP). Explain any difference in computed results.
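Steps 1-3 can be reproduced with a short Python sketch. The tokens are hard-coded from the slides (stopwords already removed), and the normalization step is folded into the cosine function; the helper names are illustrative.

```python
import math

# The five documents and the query from the slides,
# after tokenization and stopword removal.
docs = {
    "d1": ["lsi", "tutorials", "fast", "tracks"],
    "d2": ["books", "semantic", "analysis"],
    "d3": ["learning", "latent", "semantic", "indexing"],
    "d4": ["advances", "structures", "advances", "indexing"],
    "d5": ["analysis", "latent", "structures"],
}
query = ["latent", "semantic", "indexing"]

terms = sorted({t for d in docs.values() for t in d})
N = len(docs)
df = {t: sum(t in d for d in docs.values()) for t in terms}

def weight_vec(tokens):
    # tf * log10(N / df), the weighting scheme used on the slides.
    return [tokens.count(t) * math.log10(N / df[t]) for t in terms]

def cosine(u, v):
    # Normalization (Step 2) is folded into the similarity (Step 3).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q = weight_vec(query)
sims = {name: cosine(weight_vec(d), q) for name, d in docs.items()}

# Decreasing relevance: d3 > d5 > d2 > d4 > d1.
ranking = sorted(sims, key=sims.get, reverse=True)
```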
APPLICATION: Latent Semantic Indexing (LSI) Using SVD
Latent Semantic Indexing
Use LSI to cluster terms and to find terms that could be used to expand or reformulate the query.
Example: the collection consists of the following documents:
d1 = Shipment of gold damaged in a fire.
d2 = Delivery of silver arrived in a silver truck.
d3 = Shipment of gold arrived in a truck.
Assume that the query is "gold silver truck". Apply the SVD.
Latent Semantic Indexing (Procedure)
Step 1: Score term weights and construct the term-document matrix A and the query vector q:

           d1  d2  d3        q
a           1   1   1        0
arrived     0   1   1        0
damaged     1   0   0        0
delivery    0   1   0        0
fire        1   0   0        0
gold        1   0   1        1
in          1   1   1        0
of          1   1   1        0
shipment    1   0   1        0
silver      0   2   0        1
truck       0   1   1        1
Latent Semantic Indexing (Procedure)
Step 2: Decompose the matrix A using the SVD procedure into U, S and V:

U =
-0.42012  -0.07480  -0.04597
-0.29949   0.20009   0.40783
-0.12063  -0.27489  -0.45380
-0.15756   0.30465  -0.20065
-0.12063  -0.27489  -0.45380
-0.26256  -0.37945   0.15467
-0.42012  -0.07480  -0.04597
-0.42012  -0.07480  -0.04597
-0.26256  -0.37945   0.15467
-0.31512   0.60930  -0.40129
-0.29949   0.20009   0.40783

S = diag(4.098872, 2.361571, 1.273669)

V =
-0.49447  -0.64918  -0.57799
-0.64582   0.71945  -0.25556
-0.58174  -0.24691   0.77500

Step 3: Rank-2 approximation. Keep the first two columns of U and V and the leading 2×2 block of S:
Uk = U(:, 1:2), Sk = diag(4.098872, 2.361571), Vk = V(:, 1:2).
Latent Semantic Indexing (Procedure)
Step 4: Find the new term-vector coordinates in this reduced 2-dimensional space. The rows of Uk hold the coordinates of the individual term vectors:

 1  a          -0.42012  -0.07480
 2  arrived    -0.29949   0.20009
 3  damaged    -0.12063  -0.27489
 4  delivery   -0.15756   0.30465
 5  fire       -0.12063  -0.27489
 6  gold       -0.26256  -0.37945
 7  in         -0.42012  -0.07480
 8  of         -0.42012  -0.07480
 9  shipment   -0.26256  -0.37945
10  silver     -0.31512   0.60930
11  truck      -0.29949   0.20009

Step 5: Find the new query-vector coordinates in the reduced 2-dimensional space, using q̂ = qᵀ Uk Sk⁻¹ with q = (0 0 0 0 0 1 0 0 0 1 1):
q̂ = [-0.2140  -0.1821]
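A NumPy sketch reproducing the decomposition and the query fold-in q̂ = qᵀ·Uk·Sk⁻¹. The sign conventions of computed singular vectors may differ from the slides, so only magnitudes are compared.

```python
import numpy as np

# Term-document matrix from the slides; term order:
# a, arrived, damaged, delivery, fire, gold, in, of, shipment, silver, truck.
A = np.array([
    [1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0],
    [1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0],
    [0, 1, 1],
], dtype=float)

# Query "gold silver truck".
q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-2 approximation: keep the two largest singular triplets.
k = 2
Uk, sk = U[:, :k], s[:k]

# Fold the query into the 2-D concept space: q_hat = q^T Uk Sk^{-1}.
q_hat = (q @ Uk) / sk

# Up to the SVD's sign ambiguity, |q_hat| matches the slide's
# query coordinates [0.2140, 0.1821].
```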
Latent Semantic Indexing (Procedure)
Step 6: Group terms into clusters. Grouping is done by comparing the cosine of the angle between each pair of term vectors. The following clusters are obtained:
1. a, in, of
2. gold, shipment
3. damaged, fire
4. arrived, truck
5. silver
6. delivery
Some vectors are not shown in the plot since they are completely superimposed; this is the case for points 1-4. If unit vectors are used and small deviations are ignored, clusters 3 and 4, and clusters 4 and 5, can be merged.
Latent Semantic Indexing (Procedure)
Step 7: Find terms that could be used to expand or reformulate the query. The query is "gold silver truck". Note that clusters 1, 2 and 3 are far away from the query; similarity-wise these could be viewed as belonging to a "long tail". If we insist on combining these with the query, possible expanded queries could be:
gold silver truck shipment
gold silver truck damaged
gold silver truck shipment damaged
gold silver truck damaged in a fire
shipment of gold silver truck damaged in a fire
etc.
Looking around the query, the closer clusters are 4, 5, and 6. We could use these clusters to expand or reformulate the query. For example, the following are some of the expanded queries one could test:
gold silver truck arrived
delivery gold silver truck
gold silver truck delivery
gold silver truck delivery arrived
etc.
Documents containing these terms should be more relevant to the initial query.
APPLICATION: Latent Semantic Indexing (LSI) Exercise
Latent Semantic Indexing (Exercise)
The SVD was the original factorization proposed for Latent Semantic Indexing (LSI), the process of replacing a term-document matrix A with a low-rank approximation Ap which reveals implicit relationships among documents that do not necessarily share common terms.
Example:

Term         D1  D2  D3  D4  D5
twain        53  65   0  30   1
clemens      10  20  40  43   0
huckleberry  30  10  25  52  70

A query on clemens will retrieve D1, D2, D3, and D4. A query on twain will retrieve D1, D2, and D4. For p = 2, the SVD gives

Term         D1  D2  D3  D4  D5
twain        49  65   7  34  -5
clemens      23  22  14  30  21
huckleberry  25   9  34  57  63

Now a query on clemens will retrieve all documents. A query on twain will retrieve D1, D2, D4, and possibly D3. The negative entry is disturbing to some and motivates the nonnegative factorizations.
References
1. A.M. Mathai, Linear Algebra I, Module 1: Vectors and Matrices, Centre for Mathematical Sciences (CMS), Pala.
2. A.M. Mathai, Linear Algebra II, Module 2: Determinants and Eigenvalues, Centre for Mathematical Sciences (CMS), Pala.
3. G. Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press, 1993.
4. G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.
5. A.R. Vasishtha and J.N. Sharma, Linear Algebra, Krishna Prakashan.
6. A.R. Vasishtha and J.N. Sharma, Matrices, Krishna Prakashan.
7. Ramji Lal, Linear Algebra, Sail Publication, Allahabad.
8. C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press.