Latent Semantic Analysis
John Martin
Small Bear Technologies, Inc.


Problem

Don't just search... Understand

© 2011 Small Bear Technologies, Inc.

The Goal
Maintaining and increasing the value of information by facilitating the understanding of meaning.
Collecting information is not the problem; gaining understanding is the issue.
Too much information is just as useless as too little information.

Some Terms
Data: individual items or facts
Information: data with context
Meaning: how we understand collections of information

Unstructured Text
News feeds
Call center logs traffic
Surveys
Social network postings
Publishing
Observational data

Automated Methods Required
Volume of information
Speed of change/production
Complexity
Need impartial/consistent analysis

Some Common Methods
Lexical matching
Statistical evaluation
Vector space models
Rule-based systems
Part-of-speech analysis

The Problem
Failure to capture meaning and provide insight
Methods not universally applicable
  Language or domain dependent
Need for specialized prior knowledge of data
Require human interaction
  Tagging, keyword identification, categorization
Not practical for large data sets

The Cost
Too much information is just as useless as too little information.
Failure to understand the information we have leads to:
  Lost opportunities
  Unsatisfied customers
  Inability to fulfill mission
  Financial repercussions

Latent Semantic Analysis
Theory of meaning [8]
Creates a mapping of meaning acquired from the text itself
Computational model
Can perform many of the cognitive tasks that humans do essentially as well as humans [7]

A Cognitive Model
LSA processing constructs a mapping of meaning in a semantic space.
The mapping gives the meaning of words and documents, not vice versa.

Compositionality Constraint
The meaning of a document is the sum of the meaning of its words.
The meaning of a word is defined by the documents in which it appears (and does not appear).

LSA Space Construction
LSA models a document as a simple linear equation.
A collection of documents (corpus) is a large set of simultaneous equations.
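
One way to write the linear model described above is sketched below. The notation is an assumption of this sketch and does not come from the slides: a_{ij} is the (possibly weighted) count of word type i in document j, t_i is the meaning vector of word type i, and d_j is the meaning vector of document j.

```latex
\mathbf{d}_j \;=\; \sum_{i=1}^{m} a_{ij}\,\mathbf{t}_i ,
\qquad j = 1, \dots, n
```

With m word types and n documents this gives n simultaneous equations, one per document, all sharing the same unknown word vectors.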

Processing a Corpus
Divide text corpus into units (documents)
  Typically paragraphs of text
Raw matrix is constructed from units
  One row for each word type
  One column for each unit (document)
  Cells contain the number of times a particular word appears in a particular document
Weighting functions may be applied [5]
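
The construction of the raw matrix can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function name `build_term_document_matrix`, the whitespace tokenization, and the three toy documents are all assumptions made for the example, and no weighting function is applied.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # One row per distinct word type, in sorted order for reproducibility.
    terms = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # One column per document; Counter returns 0 for absent words.
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


# Hypothetical toy corpus: each string stands in for one text unit.
docs = ["the cat sat", "the dog sat", "the cat chased the dog"]
terms, matrix = build_term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:>7}", row)
```

Real corpora would need more careful tokenization, and the matrix would be stored in a sparse format instead of nested lists.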

Term x Document Matrix
(Table: columns D1, D2, D3, D4, D5, ..., Dn; one row per term, ending with Tm. Cell values were lost in extraction; the only surviving residue is the final row, Tm: 100022.)

Sparse Matrix
The weighted term by document matrix represents a large set of simultaneous equations.
The term by document matrix is sparse.
  Typically less than 1% of the values are nonzero [2]
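
To make the sparsity point concrete, the sketch below measures the fraction of nonzero cells and stores only those cells. The helper names `nonzero_fraction` and `to_sparse` and the toy matrix are hypothetical, chosen for illustration; production systems would use a dedicated sparse-matrix library instead of a dictionary.

```python
def nonzero_fraction(matrix):
    """Fraction of cells in a dense list-of-lists matrix that are nonzero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


def to_sparse(matrix):
    """Keep only the nonzero cells, keyed by (row, column)."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }


# Hypothetical 3 x 4 term-by-document matrix with two nonzero cells.
toy = [
    [1, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 0],
]
print(nonzero_fraction(toy))
print(to_sparse(toy))
```

When fewer than 1% of the cells are nonzero, storing only the nonzero entries cuts memory use by roughly two orders of magnitude compared with the dense form.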

Solving Simultaneous Equations
The system of simultaneous equations is solved for the meaning of each word type and document.
Sparse matrix Singular Value Decomposition
  Lanczos algorithm is typically used
  Only solve for a reduced number of dimensions
Produces vectors representing the meaning of each term and document
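
The reduced-dimension solve can be illustrated on a toy matrix. Note that this sketch deliberately swaps in NumPy's dense `numpy.linalg.svd` and then truncates, which is only practical for small matrices; it is not the sparse Lanczos solver the slide refers to. The function name `truncated_svd`, the toy matrix, the choice k = 2, and the scaling of the vectors by the singular values (one common convention) are all assumptions of the example.

```python
import numpy as np


def truncated_svd(matrix, k):
    """Return (U_k, s_k, Vt_k), keeping only the k largest singular values."""
    u, s, vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    # NumPy returns singular values in descending order, so slicing keeps the largest.
    return u[:, :k], s[:k], vt[:k, :]


# Hypothetical 5-term x 3-document count matrix.
A = np.array(
    [
        [1, 0, 1],
        [0, 0, 1],
        [0, 1, 1],
        [1, 1, 0],
        [1, 1, 2],
    ],
    dtype=float,
)

U_k, s_k, Vt_k = truncated_svd(A, k=2)

# One row per term and one row per document, scaled by the singular values.
term_vectors = U_k * s_k
doc_vectors = Vt_k.T * s_k

# Rank-k reconstruction of the original matrix.
A_k = U_k @ np.diag(s_k) @ Vt_k
print(np.round(A_k, 2))
```

For a corpus-sized sparse matrix, an iterative solver that computes only the leading k singular triplets is needed, since the dense decomposition above computes all of them.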

Singular Value Decomposition [10]
The rows of matrix U are the vectors for the word types.
Columns of U are the eigenvectors defining the axes for word type space.

Singular Value Decomposition [10]
The rows of matrix V are the text unit (document) vectors.
Columns of V are eigenvectors defining the axes for document space.
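
For reference, the standard form of the decomposition behind the two slides above can be written as follows. The symbols A (the m x n term by document matrix), Σ (the diagonal matrix of singular values), and k (the number of retained dimensions) are notation assumed here; the slides name only U and V.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
A A^{\mathsf{T}} U = U \Sigma^{2}, \qquad
A^{\mathsf{T}} A \, V = V \Sigma^{2}
```

The last two identities are the sense in which the columns of U and V are eigenvectors: they are eigenvectors of A Aᵀ and Aᵀ A respectively, with the squared singular values as eigenvalues.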

Orthogonal Axes
Every dimension is independent of every other dimension.
In the Term/Document matrix direct comparison was not possible.

Dimensional Reduction
With enough variables every object is different.
With too few variables every object is the same.
[a r f j] [a r c g]
Typically solve for 300 to 500 dimensions [10]

Consider a geographic map
(Figure: map showing Knoxville, Cincinnati, Atlanta, and Nashville.)

Semantic Space
Vectors represent the meaning of a document (or term).
Items similar in meaning are near each other in the semantic space.
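
Nearness in the semantic space needs a concrete measure. The slides do not name one; the sketch below assumes cosine similarity, a common choice for comparing LSA vectors. The function name and the two-dimensional example vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reduced-dimension document vectors.
doc_a = [0.9, 0.1]
doc_b = [0.8, 0.2]
doc_c = [0.1, 0.9]

# doc_a points in nearly the same direction as doc_b, and away from doc_c.
print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Because cosine depends only on direction, a long document and a short one about the same subject can still score as similar.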

Computational Issues
Nontrivial computation
  Large sparse symmetric eigenproblem
Scalability concerns [11]
  Size of document set
  Speed of processing
Accuracy issues
  Finite arithmetic introduces significant error

Misconceptions and Misunderstandings
Driven by term co-occurrence
Word order issues
Data collection size
Content and Meaning

Co-occurrence
LSA starts with a kind of co-occurrence.
Appearing in the same document does not make words similar.
Similarity is determined by the effect of the word meaning on the system of equations.

Word Order
Word order effects almost entirely within single sentences
Research indicates only around 10% of meaning is word order dependent (for English)
[“order syntax? Much. Ignoring word Missed by is how”] [8]
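
The reason word order is invisible to the count matrix can be shown directly: two sentences with the same words in a different order produce identical count vectors. The sketch below is illustrative only; the function name `bag_of_words`, the tokenization rule, and the example sentences are assumptions.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring case, punctuation, and word order."""
    # Replace every non-alphanumeric character with a space, then split.
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return Counter(cleaned.split())


# Hypothetical sentences: same words, different order, different meaning.
original = "The dog chased the cat."
reordered = "The cat chased the dog."

print(bag_of_words(original))
print(bag_of_words(original) == bag_of_words(reordered))
```

Both sentences would fill a matrix column with the same counts, so whatever meaning depends on their order is not available to the decomposition.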

Data Collection Size
Beware of small data collections.
Generally, use at least 100,000 documents.

Content
LSA builds its notion of meaning from the content of the data collection.
Garbage In = Garbage Out

The Problem Revisited
Failure to capture meaning and provide insight
Methods not universally applicable
Need for specialized prior knowledge of data
Require human interaction
Not practical for large data sets

Operations
Retrieval
Clustering
Comparison
Interpretation
Completion

Applications of LSA
Library Illustration
Retrieval
Content analysis
Evaluation of “fit” into an existing collection
Comparison of multiple collections
Indexing of multilingual collections

Applications of LSA
Repairing/cleaning data
Education
Grading
Summarizing
Non-textual applications
Bio-informatics
Personality profiles/compatibility analysis

Conclusion
LSA offers powerful capabilities for gaining insight and understanding the contents of a data collection.
LSA provides analysis techniques not available with other Text Analytic methods.
Small Bear Technologies provides the core technology, tools, and support for performing Latent Semantic Analysis.

Suggested Reading
Handbook of Latent Semantic Analysis; Landauer, T., McNamara D., Dennis, S., Kintsch, W., Eds.; Lawrence Erlbaum Associates, Inc.: Mahwah, New Jersey,
“Indexing by Latent Semantic Analysis” Deerwester, S.; Dumais, S.; Furnas, G.; Landauer, T.; Harshman, R., Journal of the American Society for Information Sciences 1990, 41,
“Improving the retrieval of information from external sources” Dumais, S., Behavior Research Methods, Instruments, & Computers 1991, 23,
“An Introduction to Latent Semantic Analysis” Landauer, T.; Foltz, P.; Laham, D., Discourse Processes 1998, 25,
“A solution to Plato's problem: The Latent Semantic Analysis Theory of acquisition, induction, and representation of Knowledge” Landauer, T.; Dumais, S., Psychological Review 1997, 104,

References
[1] Berry, M.W., Large Sparse Singular Value Computations. In International Journal of Supercomputer Applications, 1992, Vol. 6, pp
[2] Berry, M.W., & Browne, M., Understanding Search Engines: Mathematical Modeling and Text Retrieval (2nd ed.) SIAM, Philadelphia,
[3] Berry, M.W., & Martin, D., Principal Component Analysis for Information Retrieval Applications. In Statistics: A series of textbooks and monographs: Handbook of Parallel Computing and Statistics, Chapman & Hall/CRC, Boca Raton, 2005, pp
[4] Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R., Indexing by Latent Semantic Analysis. In Journal of the American Society of Information Sciences, 1990, Vol. 41, pp
[5] Dumais, S., Improving the Retrieval of Information from External Sources. In Behavior Research Methods, Instruments, and Computers, 1991, Vol. 23, pp

References (cont.)
[6] Grimes, S., Text Analytics 2009: User Perspectives on Solutions and Providers. Alta Plana Research:
[7] Landauer, T.K., On the Computational Basis of Cognition: Arguments from LSA. In The Psychology of Learning and Motivation. B.H. Ross (Ed.), Academic Press, New York, 2002; pp
[8] Landauer, T. K., LSA as a Theory of Meaning. In The Handbook of Latent Semantic Analysis, Landauer, McNamara, Dennis, & Kintsch (Eds.), Lawrence Erlbaum Associates, Inc., Mahwah, New Jersey, 2007; pp
[9] Landauer, T.K., & Dumais, S., A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. In Psychological Review, 1997, Vol. 104, pp
[10] Martin, D., Berry, M., Mathematical Foundations Behind Latent Semantic Analysis, In The Handbook of Latent Semantic Analysis, Landauer, McNamara, Dennis, & Kintsch (Eds.), Lawrence Erlbaum Associates, Inc., Mahwah, New Jersey, 2007; pp

References (cont.)
[11] Martin, D., Martin, J., Berry, M., Browne, M., Out-of-Core SVD performance for document indexing. In Applied Numerical Mathematics, 2007, Vol. 14, No. 10.

John Martin Small Bear Technologies, Inc.