Concepts & Categorization
Measurement of Similarity
– Geometric approach
– Featural approach
Both can be cast as vector representations
Vector representations for words
– Words are represented as vectors of feature values
– Similar words have similar vectors
How to get vector representations
– Multidimensional scaling on similarity ratings
– Tversky’s (1977) contrast model
– Latent Semantic Analysis (Landauer & Dumais, 1997)
– Topics Model (e.g., Griffiths & Steyvers, 2004)
Multidimensional Scaling (MDS) Approach
– Suppose we have N stimuli
– Measure the (dis)similarity between every pair of stimuli (N × (N − 1) / 2 pairs)
– Represent each stimulus as a point in a multidimensional space
– Similarity is measured by geometric distance, e.g., the Minkowski distance metric:
  d(i, j) = [ Σ_k |x_ik − x_jk|^r ]^(1/r)
  where x_ik is the coordinate of stimulus i on dimension k; r = 1 gives the city-block metric, r = 2 the Euclidean metric
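The Minkowski metric is easy to sketch directly; this minimal function follows the formula above (the name `minkowski` and the example vectors are illustrative, not from the slides):

```python
def minkowski(x, y, r=2):
    """Minkowski distance between vectors x and y.

    r=1 gives the city-block metric, r=2 the Euclidean metric.
    """
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)
```

For example, `minkowski([0, 0], [3, 4], r=2)` gives the Euclidean distance 5.0, while `minkowski([0, 0], [3, 4], r=1)` gives the city-block distance 7.0.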
Multidimensional Scaling
– Represent observed similarities in a multidimensional space: close neighbors should have high similarity
– MDS is an iterative procedure that places points in a (low-dimensional) space so as to model the observed similarities
Data: Matrix of (dis)similarity
MDS procedure: move points in space to best model observed similarity relations
Example: 2D solution for bold faces
2D solution for fruit words
Critical Assumptions of the Geometric Approach
Psychological distance should obey three axioms:
– Minimality: d(A, B) ≥ d(A, A) = 0
– Symmetry: d(A, B) = d(B, A)
– Triangle inequality: d(A, B) + d(B, C) ≥ d(A, C)
For conceptual relations, violations of the distance axioms are often found
– Similarities can be asymmetric: “North Korea” is judged more similar to “China” than vice versa; “Pomegranate” is judged more similar to “Apple” than vice versa
– Violations of the triangle inequality: “Lemon” is similar to “Orange”, and “Orange” is similar to “Apricot”, yet “Lemon” is not similar to “Apricot”
The triangle inequality constrains words with multiple meanings: for Euclidean distance, AC ≤ AB + BC. If FIELD is close to both MAGNETIC and SOCCER, then MAGNETIC and SOCCER are forced to be close to each other, even though they are unrelated.
Nearest neighbor problem (Tversky & Hutchinson, 1986)
– In similarity data, “Fruit” is the nearest neighbor of 18 out of 20 items
– In a 2D solution, “Fruit” can be the nearest neighbor of at most 5 items
– High-dimensional solutions might solve this, but they are less appealing
Feature Contrast Model (Tversky, 1977)
– Represent stimuli as sets of discrete features
– Similarity is an increasing function of common features and a decreasing function of distinctive features:
  sim(I, J) = a · f(I ∩ J) − b · f(I − J) − c · f(J − I)
  where I ∩ J are the common features, I − J the features unique to I, J − I the features unique to J, and a, b, and c are weighting parameters
The contrast model predicts asymmetries
– With weighting parameter b > c, pomegranate is more similar to apple than vice versa, because pomegranate has fewer distinctive features
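The asymmetry prediction can be checked with a small sketch of the contrast model, using set cardinality for the measure f (an assumption; Tversky allows any additive measure) and hypothetical feature sets for the two fruits:

```python
def contrast_similarity(I, J, a=1.0, b=0.7, c=0.3):
    # Tversky (1977): sim(I, J) = a*f(common) - b*f(unique to I) - c*f(unique to J)
    # f is set cardinality here; a, b, c are illustrative weights with b > c.
    return a * len(I & J) - b * len(I - J) - c * len(J - I)

# Hypothetical feature sets: apple has more distinctive features than pomegranate.
apple = {"fruit", "round", "red", "sweet", "common", "crunchy"}
pomegranate = {"fruit", "round", "red", "seeds"}
```

Because b > c penalizes the first argument's distinctive features more heavily, `contrast_similarity(pomegranate, apple)` comes out larger than `contrast_similarity(apple, pomegranate)`: the item with fewer distinctive features is the more similar one when it is the subject of the comparison.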
The contrast model predicts violations of the triangle inequality
– With weighting parameters a > b > c (common features are weighted more heavily than distinctive features)
Additive Tree solution
Latent Semantic Analysis (LSA) — Landauer & Dumais (1997)
Assumptions:
1) Words similar in meaning occur in similar verbal contexts (e.g., magazine articles, book chapters, newspaper articles)
2) We can count the number of times words occur in documents and construct a large word × document matrix
3) This co-occurrence matrix contains a wealth of latent semantic information that can be extracted by statistical techniques
4) Words can be represented as points in a multidimensional space
Latent Semantic Analysis (Landauer & Dumais, 1997): words such as FIELD, GRASS, CORN, MEADOW, BASEBALL, MAJOR, FOOTBALL are placed in a high-dimensional space. Information in the matrix is compressed; relationships between words through other words are used.
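The first stage of the pipeline, building the word × document count matrix and comparing word vectors, can be sketched as follows (the function names and the three toy "documents" are illustrative; full LSA additionally compresses the matrix with singular value decomposition, which is omitted here):

```python
import math
from collections import Counter

def word_doc_matrix(docs):
    # Rows: word types, columns: documents, cells: occurrence counts.
    vocab = sorted({w for doc in docs for w in doc})
    matrix = [[Counter(doc)[w] for doc in docs] for w in vocab]
    return vocab, matrix

def cosine(u, v):
    # Cosine similarity between two count vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus: three tiny "documents".
docs = [["field", "magnetic", "wire"],
        ["field", "soccer", "ball"],
        ["magnetic", "wire", "current"]]
vocab, matrix = word_doc_matrix(docs)
row = dict(zip(vocab, matrix))
```

Here `cosine(row["magnetic"], row["wire"])` is high because the two words share contexts, while `cosine(row["soccer"], row["magnetic"])` is zero because they never co-occur; the SVD step would additionally let words become similar through shared neighbors rather than direct co-occurrence.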
Problem: LSA has to obey the triangle inequality. For Euclidean distance, AC ≤ AB + BC: if FIELD is close to both MAGNETIC and SOCCER, then MAGNETIC and SOCCER are forced to be close to each other.
The Topics Model (Griffiths & Steyvers, 2002, 2003)
– A probabilistic version of LSA: no spatial constraints
– Each document (i.e., context) is a mixture of topics
– Each topic is a distribution over words
– Each word is chosen from a single topic:
  P(w) = Σ_j P(w | z = j) P(z = j)
  where P(w | z = j) is the word probability in topic j and P(z = j) is the probability of topic j in the document
A toy example: two mixture components (topics) and a topic mixture
– Topic 1, P(w | z = 1): HEART 0.3, LOVE 0.2, SOUL 0.2, TEARS 0.1, MYSTERY 0.1, JOY 0.1
– Topic 2, P(w | z = 2): SCIENTIFIC 0.4, KNOWLEDGE 0.2, WORK 0.1, RESEARCH 0.1, MATHEMATICS 0.1, MYSTERY 0.1
Words can occur in multiple topics (e.g., MYSTERY).
– All probability to topic 1 (P(z = 1) = 1, P(z = 2) = 0) → Document: HEART, LOVE, JOY, SOUL, HEART, …
– All probability to topic 2 (P(z = 1) = 0, P(z = 2) = 1) → Document: SCIENTIFIC, KNOWLEDGE, SCIENTIFIC, RESEARCH, …
– Mixing topics 1 and 2 (P(z = 1) = 0.5, P(z = 2) = 0.5) → Document: LOVE, SCIENTIFIC, HEART, SOUL, KNOWLEDGE, RESEARCH, …
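The generative process behind these toy documents can be sketched directly: for each word, first draw a topic z from the document's mixture, then draw a word from that topic's distribution (the function name `generate_document` is illustrative; the two topic distributions are the ones from the toy example):

```python
import random

TOPICS = [
    {"HEART": 0.3, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.1, "MYSTERY": 0.1, "JOY": 0.1},
    {"SCIENTIFIC": 0.4, "KNOWLEDGE": 0.2, "WORK": 0.1, "RESEARCH": 0.1,
     "MATHEMATICS": 0.1, "MYSTERY": 0.1},
]

def generate_document(mixture, n_words, rng):
    # For each token: sample a topic z from the mixture, then a word from P(w | z).
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(len(TOPICS)), weights=mixture)[0]
        words, probs = zip(*TOPICS[z].items())
        doc.append(rng.choices(words, weights=probs)[0])
    return doc
```

With mixture `[1.0, 0.0]` every sampled word comes from topic 1 (HEART, LOVE, …); with `[0.5, 0.5]` the document interleaves words from both topics, as in the slides.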
Application to corpus data
TASA corpus: text from first grade to college, a representative sample of text
– 26,000+ word types (stop words removed)
– 37,000+ documents
– 6,000,000+ word tokens
A selection from 500 topics:
– THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS
– SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES
– ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS
– STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP GRADE SHOULD GRADES CLASSES PUPIL GIVEN
– BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS
– CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE
– FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
– SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
– BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
– JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
Polysemy: words with multiple meanings (e.g., FIELD) are represented in different topics.
No problem of triangle inequality: FIELD can have high probability in both Topic 1 (with MAGNETIC) and Topic 2 (with SOCCER) while MAGNETIC and SOCCER share no topic. Topic structure easily explains violations of the triangle inequality.
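A small sketch makes this concrete. Using two hypothetical topics and a toy similarity measure (summed products of topic-word probabilities; the constant `TOY_TOPICS`, the function name, and all probability values are illustrative, not fit to data):

```python
# Hypothetical two-topic model: FIELD occurs in both topics,
# MAGNETIC and SOCCER each occur in only one.
TOY_TOPICS = [
    {"FIELD": 0.5, "MAGNETIC": 0.5},  # a "magnetism" topic
    {"FIELD": 0.5, "SOCCER": 0.5},    # a "sports" topic
]

def topic_overlap(w1, w2, topics=TOY_TOPICS):
    # Toy similarity: probability mass the two words share across topics.
    return sum(t.get(w1, 0.0) * t.get(w2, 0.0) for t in topics)
```

Here FIELD is similar to MAGNETIC and similar to SOCCER, yet MAGNETIC and SOCCER have zero similarity, a pattern no low-dimensional Euclidean embedding can reproduce because of the triangle inequality.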
Recap: how to get vector representations
– Multidimensional scaling on similarity ratings
– Tversky’s (1977) contrast model
– Latent Semantic Analysis (Landauer & Dumais, 1997)
– Topics Model (e.g., Griffiths & Steyvers, 2004)