Hierarchical Bayesian Models for Aggregating Retrieved Memories across Individuals
Mark Steyvers, Department of Cognitive Sciences, University of California, Irvine
Joint work with: Michael Lee, Brent Miller, Pernille Hemmer, Bill Batchelder, Paolo Napoletano
Ordering problem: what is the correct order of these Presidents in time? [Figure: cards for George Washington, John Adams, Thomas Jefferson, James Monroe, and Andrew Jackson, shown in shuffled orders]
Goal: aggregating responses. [Figure: individual orderings (D A B C, A B D C, B A D C, A C B D, A D B C) are fed into an aggregation algorithm that produces a group answer A B C D; ground truth = ?]
Bayesian Approach. [Figure: the same individual orderings are treated as data produced by a generative model in which the ground truth (A B C D) is the latent common cause]
Important notes:
- No communication between individuals
- There is always a true answer (ground truth)
- The aggregation algorithm never has access to the ground truth; ground truth is only used for evaluation
Matching problem: [Figure: four paintings labeled A, B, C, D to be matched to the painters Rembrandt, Van Gogh, Monet, Renoir]
Wisdom of crowds phenomenon: the crowd estimate is often better than any individual in the crowd (think of independent noise influencing each individual).
Examples of the wisdom of crowds phenomenon: "Who Wants to Be a Millionaire?"; Galton's Ox (1907), where the median of individual estimates comes close to the true answer.
Limitations of current "wisdom of crowds" research:
- Studies restricted to numeric or categorical judgments
- Simple averaging schemes: mode, median, mean
- No treatment of individual differences: every "vote" is treated equally, downplaying the role of expertise
Cultural Consensus Theory (CCT), e.g. Romney, Batchelder, and Weller (1987):
- Finds the "answer key" to multiple-choice questions when the ground truth is lost
- Takes person and item differences into account
- An informal version of CCT has also been developed for ranking data
Research Goals: generalize the "wisdom of crowds" effect to more complex data, namely the aggregation of permutations:
- Ranking data
- Matching (assignment) data
Hierarchical Bayesian Models:
- Probability distributions over all permutations of items; with N items there are N! combinations (e.g., when N = 44 we have 44! > 10^53 combinations)
- Approximate inference methods: MCMC
- Cognitively plausible generative processes
- Treatment of individual differences
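A quick sanity check on these counts (a minimal Python sketch, not part of the talk):

```python
import math

# The number of possible orderings of N items grows as N!.
print(math.factorial(10))  # 3,628,800 orderings for a 10-item problem
print(math.factorial(44))  # roughly 2.7e54, i.e. > 10^53, so exhaustive enumeration is hopeless
```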
Part I: Ordering Problems
Experiment 1. Task: order all 44 US presidents. Methods:
- 26 participants (college undergraduates)
- Names of presidents written on cards
- Cards could be shuffled on a large table
Measuring performance with Kendall's tau: the number of adjacent pair-wise swaps needed to turn the participant's ordering into the ground truth. [Figure: worked example in which two swaps are needed, so tau = 1 + 1 = 2]
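A minimal sketch of this performance measure (my own helper, not code from the talk): Kendall's tau distance equals the number of item pairs that appear in opposite relative order in the two rankings, which is also the minimum number of adjacent swaps.

```python
from itertools import combinations

def kendall_tau_distance(ordering, ground_truth):
    """Number of discordant pairs = minimum number of adjacent pair-wise swaps."""
    true_pos = {item: i for i, item in enumerate(ground_truth)}
    swaps = 0
    for a, b in combinations(ordering, 2):
        # a precedes b in the participant's ordering; is that pair reversed in the truth?
        if true_pos[a] > true_pos[b]:
            swaps += 1
    return swaps

print(kendall_tau_distance(list("ABDC"), list("ABCD")))  # 1
print(kendall_tau_distance(list("BADC"), list("ABCD")))  # 2
```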
Empirical Results. [Figure: distribution of individual performance, with a reference line for random guessing]
Many approaches for analyzing rank data…
- Probabilistic models: Thurstone (1927), Mallows (1957), Plackett-Luce (1975), Lebanon-Mao (2008)
- Spectral methods: Diaconis (1989)
- Heuristic methods from voting theory: Borda count, …
… however, many of these approaches were developed for preference rankings.
Bayesian Thurstonian Approach: each item has a true coordinate on some dimension. [Figure: items A, B, C located on a line]
Bayesian Thurstonian Approach: … but there is noise because of encoding and/or retrieval error. [Figure: Person 1's noisy distributions around the true locations of A, B, C]
Bayesian Thurstonian Approach: each person's mental representation is based on (latent) samples from these distributions. [Figure: Person 1's samples for A, B, C]
Bayesian Thurstonian Approach: the observed ordering is based on the ordering of the samples. [Figure: Person 1's samples yield the observed ordering A < B < C]
Bayesian Thurstonian Approach: people draw from distributions with a common mean but different variances. [Figure: Person 1's samples yield the observed ordering A < B < C; Person 2's noisier samples yield A < C < B]
Graphical Model Notation: shaded nodes = observed, unshaded nodes = latent; plates indicate replication (e.g., j = 1..3).
Graphical Model of the Bayesian Thurstonian Model. [Figure: plate over j individuals; the latent ground truth and each individual's ability determine that person's mental representation, which determines the observed ordering]
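A minimal generative sketch of the model just described (the names mu, sigma, x, y are my own notation, and the Gaussian noise is an assumption in the spirit of the Thurstonian setup on the previous slides): each item has a true location, each person draws one noisy sample per item with person-specific noise, and the reported ordering is the sort order of those samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_people = 5, 20

mu = np.arange(n_items, dtype=float)          # latent ground-truth locations
sigma = rng.uniform(0.5, 3.0, size=n_people)  # individual ability (noise level)

# Mental representation: one noisy sample of every item for every person
x = mu[None, :] + sigma[:, None] * rng.standard_normal((n_people, n_items))

# Observed ordering: each person reports the items sorted by their samples
y = np.argsort(x, axis=1)
print(y[:3])  # the first three people's reported orderings (item indices)
```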
Inference: we need the posterior distribution over the latent ground truth given the observed orderings. Markov chain Monte Carlo: Gibbs sampling on the latent mental representations, Metropolis-Hastings on the ground-truth locations and the individual ability parameters. Draw 400 samples; the group ordering is based on the average of the ground-truth locations across samples.
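Once the MCMC run has produced posterior samples of the ground-truth locations, the group ordering described above is just the sort order of their posterior means. A sketch, assuming mu_samples is an array of shape (n_samples, n_items) from such a run:

```python
import numpy as np

def group_ordering(mu_samples):
    """Group answer: items sorted by their average ground-truth location across samples."""
    return np.argsort(mu_samples.mean(axis=0))

# Placeholder posterior: 400 samples for 5 items
mu_samples = np.random.default_rng(1).standard_normal((400, 5)) + np.arange(5)
print(group_ordering(mu_samples))  # e.g., [0 1 2 3 4]
```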
Wisdom of Crowds effect. [Figure: the model's ordering is as good as the best individual]
Inferred Distributions for 44 US Presidents: George Washington (1), John Adams (2), Thomas Jefferson (3), James Madison (4), James Monroe (6), John Quincy Adams (5), Andrew Jackson (7), Martin Van Buren (8), William Henry Harrison (21), John Tyler (10), James Knox Polk (18), Zachary Taylor (16), Millard Fillmore (11), Franklin Pierce (19), James Buchanan (13), Abraham Lincoln (9), Andrew Johnson (12), Ulysses S. Grant (17), Rutherford B. Hayes (20), James Garfield (22), Chester Arthur (15), Grover Cleveland 1 (23), Benjamin Harrison (14), Grover Cleveland 2 (25), William McKinley (24), Theodore Roosevelt (29), William Howard Taft (27), Woodrow Wilson (30), Warren Harding (26), Calvin Coolidge (28), Herbert Hoover (31), Franklin D. Roosevelt (32), Harry S. Truman (33), Dwight Eisenhower (34), John F. Kennedy (37), Lyndon B. Johnson (36), Richard Nixon (39), Gerald Ford (35), James Carter (38), Ronald Reagan (40), George H.W. Bush (41), William Clinton (42), George W. Bush (43), Barack Obama (44). [Figure legend: median and minimum sigma]
Model is calibrated: individuals with a large sigma are far from the truth. [Figure]
Alternative Models: many heuristic methods from voting theory, e.g., the Borda count method. Suppose we have 10 items: assign a count of 10 to the first item, 9 to the second item, and so on; add the counts over individuals; order items by their Borda count, i.e., rank by average rank across people (sketched below).
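A minimal sketch of the Borda count heuristic as just described (illustrative code, not from the talk):

```python
import numpy as np

def borda_count(orderings, n_items):
    """orderings: list of rankings, each a list of item indices from first to last."""
    counts = np.zeros(n_items)
    for ordering in orderings:
        for rank, item in enumerate(ordering):
            counts[item] += n_items - rank  # first item gets n_items points, second n_items - 1, ...
    # Group ordering: items sorted by descending Borda count (= ascending average rank)
    return list(np.argsort(-counts))

orderings = [[0, 1, 3, 2], [0, 2, 1, 3], [1, 0, 2, 3]]
print(borda_count(orderings, 4))  # [0, 1, 2, 3]
```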
Model Comparison. [Figure]
Experiment 2:
- 78 participants
- 17 problems, each with 10 items: chronological events, physical measures, and purely ordinal problems (e.g., the Ten Amendments, the Ten Commandments)
Ordering states west-east: Oregon (1), Utah (2), Nebraska (3), Iowa (4), Alabama (6), Ohio (5), Virginia (7), Delaware (8), Connecticut (9), Maine (10)
Ordering Ten Amendments. [Figure]
Ordering Ten Commandments: Worship any other God (1), Make a graven image (7), Take the Lord's name in vain (2), Break the Sabbath (3), Dishonor your parents (4), Murder (6), Commit adultery (8), Steal (5), Bear false witness (9), Covet (10)
Average results over 17 problems. [Figure: individual performance compared with the mean of individuals, the mode, the Borda count, and the Thurstonian model]
Effect of Group Composition: how many individuals do we need to average over?
Effect of Group Size: random groups. [Figure]
Experts vs. Crowds: can we find experts in the crowd? Can we form small groups of experts? Approach:
- Form a group for some particular task
- Select individuals with the smallest sigma ("experts") based on previous tasks
- Vary the number of previous tasks
Group composition based on prior performance. [Figure: performance as a function of group size (best individuals first), for T = 0, T = 2, and T = 8 previous tasks]
Methods for Selecting Experts: endogenous (no feedback required) vs. exogenous (selecting people based on actual performance).
Model incorporating overall person ability. [Figure: graphical model with plates over j individuals and m tasks; each person's overall ability governs their task-specific ability]
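A minimal generative sketch of this hierarchical extension (the variable names and the log-normal link are my assumptions, not necessarily the talk's parameterization): each person has an overall ability, and their noise level on each task varies around it.

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_tasks = 78, 17

# Overall (log-)ability for each person
overall = rng.normal(loc=0.0, scale=0.5, size=n_people)

# Task-specific ability varies around each person's overall ability
task_ability = overall[:, None] + rng.normal(scale=0.3, size=(n_people, n_tasks))

# Map to a positive Thurstonian noise level per person per task
sigma = np.exp(task_ability)
print(sigma.shape)  # (78, 17): one noise level for every person on every task
```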
Average results over 17 problems. [Figure: mean of individuals compared with the new model]
Part II: Ordering Problems in Episodic Memory
Another ordering problem: place events A, B, C, D in their order in time. [Figure]
Experiment 3:
- 26 participants
- 6 videos: 3 videos with stereotyped event sequences (e.g., a wedding) and 3 "unpredictable" videos (e.g., the example video); 10 stills extracted from each for testing
- Method: study a video, followed by an immediate ordering test of the 10 stills
Bayesian Thurstonian Model. [Figure: example ordering with tau = 3]
Two other examples. [Figure: orderings with tau = 1 and tau = 0]
Overall Results. [Figure: performance, with the mean of individuals shown for comparison]
Part III: Matching Problems
Example Matching Problem (one-to-one): match each language (Dutch, Danish, Yiddish, Thai, Vietnamese, Chinese, Georgian, Russian, Japanese) to its New Year's greeting, labeled A through I (e.g., godt nytår, gelukkig nieuwjaar, a gut yohr, С Новым Годом, สวัสดีปีใหม่, Chúc Mừng Nǎm Mới, გილოცავთ ახალ წელს, …).
Experiment:
- 17 participants
- 8 matching problems, e.g., car logos and brand names, first and last names of philosophers, flags and countries, Greek symbols and letter names
- Number of items varied between 10 and 24; with 24 items there are 24! possibilities
Overall Results. [Figure]
Heuristic Aggregation Approach: a combinatorial optimization problem that maximizes agreement in assigning N items to N responses. Hungarian algorithm: construct a count matrix M with M_ij = the number of people who paired item i with response j, then find row and column permutations that maximize the diagonal sum; this runs in O(n^3) time (sketch below).
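A minimal sketch of this aggregation step using SciPy's assignment solver (an O(n^3) method; the count matrix below is a toy example, not data from the experiment):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# M[i, j] = number of people who paired item i with response j (toy counts)
M = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
])

# One-to-one assignment of items to responses that maximizes total agreement
rows, cols = linear_sum_assignment(M, maximize=True)
print(list(zip(rows, cols)))  # optimal pairing: item 0 -> response 0, 1 -> 1, 2 -> 2
print(M[rows, cols].sum())    # 21 item-response agreements
```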
Hungarian Algorithm Example. [Figure: legend distinguishes correct and incorrect assignments]
Hungarian Algorithm Results (2). [Figure]
Bayesian Matching Model. Proposed process:
- match "known" items
- guess among the remaining ones
Individual differences:
- some items are easier to know
- some participants know more
[Figure: example with Dutch, Danish, Yiddish, Russian and the greetings godt nytår, gelukkig nieuwjaar, a gut yohr, С Новым Годом]
Graphical Model. [Figure: plates over i items and j individuals; the latent ground truth and the knowledge state determine the observed matching; the probability of knowing an item depends on person ability and item easiness]
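A minimal generative sketch of the process described on this and the previous slide (the logistic form for the probability of knowing and all variable names are my assumptions): each person knows each item with a probability driven by person ability and item easiness; known items are matched correctly and the rest are guessed among the leftover responses.

```python
import numpy as np

rng = np.random.default_rng(3)
n_items, n_people = 10, 17

ability = rng.normal(size=n_people)   # person ability
easiness = rng.normal(size=n_items)   # item easiness

def simulate_matching(j):
    """One person's matching; the correct matching is taken to be item i <-> response i."""
    p_know = 1.0 / (1.0 + np.exp(-(ability[j] + easiness)))  # probability of knowing each item
    known = rng.random(n_items) < p_know                     # latent knowledge state
    match = np.empty(n_items, dtype=int)
    match[known] = np.flatnonzero(known)                     # known items matched correctly
    match[~known] = rng.permutation(np.flatnonzero(~known))  # guess among the remaining responses
    return match

print(simulate_matching(0))  # response index assigned to each item by person 0
```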
Overall Modeling Results. [Figure]
Calibration at the level of items and people (for the paintings problem). [Figure: separate panels for items and individuals]
How predictive are subject-provided confidence ratings? [Figure: accuracy plotted against the number of guesses estimated by the individual (r = -.42) and the number of guesses estimated by the model, based on variable A (r = -.77)]
Part IV: Open Issues
When do we get the wisdom of crowds effect?
- Independent errors: different people knowing different things
- Population responses centered around the ground truth
- Some minimal number of individuals (a relatively small number of individuals is often sufficient)
What are methods for finding experts?
1) Self-reported expertise: unreliable; has led to claims of a "myth of expertise"
2) Explicit scores obtained by comparing to the ground truth: but the ground truth might not be immediately available
3) Endogenously discover experts: use the crowd to discover experts; small groups of experts can be effective
What to do about systematic biases? In some tasks, individuals systematically distort the ground truth:
- spatial and temporal distortions
- memory distortions (e.g., false memory)
- decision-making distortions
Does this diminish the wisdom of crowds effect? Maybe… but a model that predicts these systematic distortions might be able to "undo" them.
Can we build domain-specific models? The Thurstonian model has been applied to a wide variety of problems. How about domain-specific models? E.g., apply serial recall models to the serial recall tasks, better specify the sources of noise, and model systematic biases.
That's all. Do the experiments yourself:
Other slides
Results separated by problem. [Figure]
Notes:
- Noise in Thurstonian models: acquisition/encoding noise and retrieval noise
- Link to the "crowd within" (Ed Vul): are our results due to the wisdom of crowds or of individuals? Probably a bit of both, and we cannot tell with our experiments. However, there is probably a fair amount of encoding noise that would not benefit from repeated measurements within individuals, and different individuals probably do know different things.
To Do:
- Compare explicitly estimated number of guesses with latent confidence
- Identifiability issue: fix mean A?
- Test the hierarchical model on small numbers of subjects
- Model comparisons on small sets of subjects
- Look at the kurtosis of the sigma distributions
Modeling Group Serial Recall. Goal: infer a distribution over orderings of events given verbal reports, i.e., P(original order | verbal report). There are many models for serial recall, e.g., the Estes perturbation model (1972), Shiffrin & Cook (1978), SOB (2002), SIMPLE (2007), but many of these models do not have a likelihood function p(item 1, item 2, …, item N | memory contents).
Bayesian Algorithm: not every person has equal weight. [Figure: legend distinguishes correct and incorrect assignments]
Summary of Findings:
- Extended the wisdom of crowds to combinatorial problems, using approximate inference (MCMC) to infer probability distributions over permutations
- Bayesian methods that are calibrated: we can tell who is likely to be accurate without having the ground truth available
Graphical Model. [Figure: plates over i items and j individuals; latent ground truth, knowledge state, probability of knowing, and observed matching, with item and person parameters]
When do we get the Wisdom of Crowds effect? Analyze model performance in a variety of tasks.
MDS solution of pairwise tau distances. [Figure: annotated with distance to the truth]
MDS solution of pairwise tau distances. [Figure]
Modeling Performance Across Tasks: the current model is applied independently across tasks. Extend the hierarchical model with a random-effects approach to tasks: each person has an overall ability (Pearson's "g"), and ability in a specific task varies around that overall ability.