Latent Semantic Analysis, Probabilistic Topic Models & Associative Memory


Latent Semantic Analysis, Probabilistic Topic Models & Associative Memory

The Psychological Problem
How do we learn semantic structure? From the covariation between words and the contexts they appear in (e.g., LSA).
How do we represent semantic structure? As semantic spaces (e.g., LSA) or as probabilistic topics.

Latent Semantic Analysis (Landauer & Dumais, 1997)
Word-document counts are mapped into a high-dimensional space via SVD. [Figure: words such as RIVER, STREAM, MONEY, and BANK plotted as points in the reduced space.]
Each word is a single point in semantic space; similarity is measured by the cosine of the angle between word vectors.
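A minimal sketch of this pipeline, assuming NumPy and a toy three-document corpus (the analyses in these slides use a large word-document matrix and a few hundred SVD dimensions):

```python
# Minimal LSA sketch: word-document counts -> truncated SVD -> cosine similarity.
# The tiny corpus and rank k=2 are illustrative assumptions, not the real setup.
import numpy as np

docs = [
    "money bank loan bank money loan",
    "river bank stream river stream",
    "money loan loan bank money",
]
vocab = sorted({w for d in docs for w in d.split()})
# counts[i, j] = how often vocab[i] occurs in docs[j]
counts = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]   # each word is a single point in the k-dim space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

i, j = vocab.index("river"), vocab.index("stream")
print(f"cos(river, stream) = {cosine(word_vecs[i], word_vecs[j]):.2f}")  # high: shared contexts
```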

Critical Assumptions of Semantic Spaces (e.g., LSA)
Psychological distance should obey three axioms: minimality, symmetry, and the triangle inequality.

For conceptual relations, violations of the distance axioms are often found.
Similarities can be asymmetric: "North Korea" is judged more similar to "China" than vice versa, and "Pomegranate" more similar to "Apple" than vice versa.
The triangle inequality is also violated: A can be similar to B, and B to C, without A being similar to C, even though Euclidean distance requires AC ≤ AB + BC.

The Triangle Inequality in Semantic Spaces Might Not Always Hold
[Figure: PLAY lies close to both SOCCER and THEATER, even though SOCCER and THEATER are dissimilar.]
Cosine similarity obeys the analogous bound cos(w1, w3) ≥ cos(w1, w2) cos(w2, w3) − sin(w1, w2) sin(w2, w3), just as Euclidean distance obeys AC ≤ AB + BC.
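A quick numeric check of this bound on random vectors; it illustrates that no arrangement of points can produce cosine similarities violating it, which is why spatial models struggle with the conceptual data above:

```python
# Check the spherical triangle inequality for cosine similarity:
#   cos(w1,w3) >= cos(w1,w2)cos(w2,w3) - sin(w1,w2)sin(w2,w3)
# for arbitrary vector triples (random 5-dim vectors are an assumption here).
import numpy as np

rng = np.random.default_rng(0)

def angle(a, b):
    c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))

for _ in range(10_000):
    w1, w2, w3 = rng.normal(size=(3, 5))
    t12, t23, t13 = angle(w1, w2), angle(w2, w3), angle(w1, w3)
    assert np.cos(t13) >= np.cos(t12) * np.cos(t23) - np.sin(t12) * np.sin(t23) - 1e-9
print("bound holds on all random triples")
```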

Nearest Neighbor Problem (Tversky & Hutchinson, 1986)
In similarity data, "Fruit" is the nearest neighbor of 18 out of 20 fruit words.
In a 2D spatial solution, "Fruit" can be the nearest neighbor of at most 5 items.
High-dimensional solutions might solve this, but they are less appealing.

Probabilistic Topic Models
A probabilistic version of LSA, with no spatial constraints.
Originated in statistics and machine learning (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003).
Extracts topics from large collections of text; the topics are interpretable, unlike the arbitrary dimensions of LSA.

DATA: a corpus of text, reduced to word counts for each document.
The topic model finds parameters that "reconstruct" the data; the model is generative.

Probabilistic Topic Models Each document is a probability distribution over topics (distribution over topics = gist) Each topic is a probability distribution over words

Document generation as a probabilistic process:
1. For each document, choose a mixture of topics.
2. For every word slot, sample a topic z ∈ {1..T} from the mixture.
3. Sample a word from that topic.

Example
Mixture components (topics): TOPIC 1 = money, loan, bank; TOPIC 2 = river, stream, bank.
Mixture weights: each document has its own distribution over the two topics (the example shows a weight of .7).
DOCUMENT 1: money¹ bank¹ bank¹ loan¹ river² stream² bank¹ money¹ river² bank¹ money¹ bank¹ loan¹ money¹ stream² bank¹ money¹ bank¹ bank¹ loan¹ river² stream² bank¹ money¹ river² bank¹ money¹ bank¹ loan¹ bank¹ money¹ stream²
DOCUMENT 2: river² stream² bank² stream² bank² money¹ loan¹ river² stream² loan¹ bank² river² bank² bank¹ stream² river² loan¹ bank² stream² bank² money¹ loan¹ river² stream² bank² stream² bank² money¹ river² stream² loan¹ bank² river² bank² money¹ bank¹ stream² river² bank² stream² bank² money¹
Bayesian approach: use priors. Mixture weights ~ Dirichlet(α); mixture components ~ Dirichlet(β).
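A toy simulation of this generative process, with illustrative (assumed) word probabilities for the two topics:

```python
# Generate a document as on the slide: sample a topic per word slot, then a
# word from that topic. Probabilities and document length are illustrative.
import numpy as np

rng = np.random.default_rng(1)
topics = {
    1: (["money", "loan", "bank"], [0.4, 0.3, 0.3]),    # "financial" topic
    2: (["river", "stream", "bank"], [0.4, 0.3, 0.3]),  # "geographic" topic
}
theta = [0.7, 0.3]  # this document's mixture weights; in full LDA, theta ~ Dirichlet(alpha)

def generate_doc(theta, n_words=20):
    doc = []
    for _ in range(n_words):
        z = rng.choice([1, 2], p=theta)   # step 2: sample a topic for this slot
        words, probs = topics[z]
        w = rng.choice(words, p=probs)    # step 3: sample a word from that topic
        doc.append(f"{w}^{z}")            # keep the topic assignment, as on the slide
    return " ".join(doc)

print(generate_doc(theta))
```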

Inverting ("fitting") the model
Now observe the same two documents with every topic assignment unknown (money? bank? bank? loan? …): from the word counts alone, infer the mixture components (TOPIC 1, TOPIC 2) and each document's mixture weights.
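A stand-in sketch of the fitting step using scikit-learn's variational LDA (the toy corpus is an assumption; the work described in these slides uses collapsed Gibbs sampling instead):

```python
# Fit a 2-topic model to raw counts and recover P(word | topic) and P(topic | doc).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "money bank bank loan money bank loan money stream",
    "river stream bank stream river bank river stream",
    "money loan bank bank money loan money bank loan",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # P(word | topic)
theta = lda.transform(X)                                            # P(topic | doc)

words = vec.get_feature_names_out()
for k, row in enumerate(phi):
    top = ", ".join(words[i] for i in row.argsort()[::-1][:3])
    print(f"TOPIC {k + 1}: {top}")
```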

Application to corpus data
TASA corpus: text from first grade to college; a representative sample of text.
26,000+ word types (stop words removed), 37,000+ documents, 6,000,000+ word tokens.

Example: topics from an educational corpus (TASA): 37K docs, 26K words, 1,700 topics, e.g.:
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

Polysemy
The same word can appear with high probability in several topics, capturing different senses: PLAY appears in both the theater topic and the sports topic, CHARACTERS in both the printing and theater topics, and COURT in both the sports and courtroom topics (see the topics above).

Three documents with the word "play" (numbers and colors indicate topic assignments).

No Problem with the Triangle Inequality
[Figure: SOCCER and FIELD appear together in TOPIC 1; MAGNETIC and FIELD appear together in TOPIC 2.]
Topic structure easily explains violations of the triangle inequality: a word can be strongly related to two other words through different topics, even when those two words share no topic.

Applications

Enron data: 500,000 emails, 5,000 authors.

Enron topics, e.g.:
TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
[Figure: timeline of topic usage; May 22, 2000 marks the start of the California energy crisis.]

Applying the Model to Psychological Data

Network of Word Associations
[Figure: association network linking BASEBALL, BAT, BALL, GAME, PLAY, STAGE, and THEATER.]
(Association norms from Doug Nelson et al., 1998.)

Explaining the structure with topics
[Figure: the same network, with the baseball-related words (BASEBALL, BAT, BALL, GAME) covered by topic 1 and the theater-related words (STAGE, THEATER) covered by topic 2; PLAY falls under both.]

Modeling Word Association
Word association is modeled as prediction: given that a single cue word is observed, what other words are likely to occur?
Under a single-topic assumption: P(response | cue) = Σ_z P(response | z) P(z | cue).
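A sketch of this prediction rule; `phi` (a topics-by-words matrix of P(word | topic)) and `vocab` (a word list) are assumed to come from a fitted model such as the one above, and a uniform topic prior is assumed for simplicity:

```python
# Word association under the single-topic assumption:
#   P(response | cue) = sum_z P(response | z) * P(z | cue),
# with P(z | cue) obtained by Bayes' rule.
import numpy as np

def predict_associates(cue, phi, vocab, n=5):
    prior = np.full(phi.shape[0], 1.0 / phi.shape[0])  # assumed uniform P(z)
    p_z = phi[:, vocab.index(cue)] * prior             # Bayes: P(z | cue) ∝ P(cue | z) P(z)
    p_z /= p_z.sum()
    p_resp = p_z @ phi                                 # P(response | cue) for every word
    ranked = np.argsort(p_resp)[::-1]
    return [(vocab[j], float(p_resp[j])) for j in ranked if vocab[j] != cue][:n]
```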

Observed associates for the cue “play”

Model predictions
[Figure: model-predicted associates for the cue "play"; the observed first associate appears at RANK 9.]

Median rank of the first associate
[Figure: median rank of the observed first associate under the model's predictions.]

Recall: an example study list
STUDY: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
FALSE RECALL: "Sleep" 61%

Recall as a Reconstructive Process
Reconstruct the study list based on the stored "gist", where the gist is represented by a distribution over topics.
Under a single-topic assumption: P(retrieved word | study list) = Σ_z P(word | z) P(z | study list).
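A sketch of this reconstruction, reusing the assumed `phi` and `vocab` from the association sketch; a high score for an unstudied word such as "sleep" predicts false recall:

```python
# Reconstructive recall: infer the gist of the study list as a distribution
# over topics, then score every vocabulary word by P(word | gist).
import numpy as np

def recall_scores(study_list, phi, vocab):
    p_z = np.full(phi.shape[0], 1.0 / phi.shape[0])  # assumed uniform P(z)
    for w in study_list:                  # P(z | list) ∝ P(z) ∏_w P(w | z)
        p_z *= phi[:, vocab.index(w)]
    p_z /= p_z.sum()                      # the gist: a distribution over topics
    return p_z @ phi                      # P(retrieved word | gist) for all words
```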

Predictions for the "Sleep" list
[Figure: predicted retrieval probabilities for the STUDY LIST words and for the top 8 EXTRA-LIST words.]