Preference Analysis Joachim Giesen and Eva Schuberth May 24, 2006
Outline: Motivation; Approximate sorting; Lower bound; Upper bound; Aggregation; Algorithm; Experimental results; Conclusion
Motivation: Find the preference structure of a consumer w.r.t. a set of products. Common approach: assign a value function to the products; the value function determines a ranking of the products. Elicitation: pairwise comparisons. Problem: deriving a metric value function from non-metric information. We restrict ourselves to finding the ranking.
Motivation: Find a ranking for every respondent individually. Efficiency measure: number of comparisons. Comparison-based sorting algorithm; lower bound: $\Omega(n \log n)$ comparisons. As the set of products can be large, this is too much.
Motivation: Possible solutions: approximation; aggregation; modeling and distribution assumptions.
Approximation (joint work with J. Giesen and M. Stojaković) 1. Lower bound (proof) 2. Algorithm
Approximation: The consumer's true ranking of n products corresponds to the identity (increasing) permutation id on {1, ..., n}. Wanted: an approximation of the ranking, corresponding to a permutation $\pi \in S_n$ such that $D(\pi, \mathrm{id})$ is small.
Metric on $S_n$: Needed: a metric on $S_n$ that is meaningful in the market research context. Spearman's footrule metric D: $D(\pi, \sigma) = \sum_{i=1}^{n} |\pi(i) - \sigma(i)|$. Note: [...]
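The footrule distance is easy to compute directly. The following is a minimal sketch, assuming permutations are given as Python lists of the values 1..n; the function name `footrule` is ours and not from the slides.

```python
def footrule(pi, sigma=None):
    """Spearman's footrule distance between two permutations of 1..n.

    If sigma is omitted, the distance to the identity permutation is returned.
    """
    if sigma is None:
        sigma = range(1, len(pi) + 1)  # identity permutation id
    return sum(abs(p - s) for p, s in zip(pi, sigma))


identity_distance = footrule([1, 2, 3, 4])   # a perfect ranking
swap_distance = footrule([2, 1, 3, 4])       # one adjacent swap
reversed_distance = footrule([4, 3, 2, 1])   # the reversed ranking
```

For n = 4 the reversed ranking gives distance 8, which equals n^2/2; no permutation is farther from id under this metric.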
We show: to approximate a ranking within expected distance [...], at least [...] comparisons are necessary and [...] comparisons are always sufficient.
Lower bound: Let A be a randomized approximate sorting algorithm. If for every input permutation the expected distance of the output to id is at most r, then A performs at least [...] comparisons in the worst case.
Lower bound: Proof (follows from Yao's Minimax Principle). Assume fewer than [...] comparisons for every input. Fix a deterministic algorithm. Then for at least [...] permutations the output is at distance more than 2r, so the expected distance of [...] is larger than r. Hence there is a [...] such that the expected distance of [...] is larger than r. Contradiction.
Lower bound: Lemma For r>0 : ball centered at with radius r id r Lemma:
uniquely determined by the sequence Lower bound: Proof of Lemma # sequences of n non-negative integers whose sum is at most r: If then For sequence of non-negative integers : at most 2 n permutations satisfy
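The two counting facts in the proof combine into a bound on the size of the ball. The block below is only a sketch of that step: the notation $B_r(\mathrm{id})$ and $d_i$ is assumed here, and the lemma on the slide may state the bound in a different (for example, further simplified) form.

```latex
% Every permutation \pi in the ball yields a sequence d_i = |\pi(i) - i|
% of non-negative integers with \sum_i d_i = D(\pi, \mathrm{id}) \le r,
% and each such sequence arises from at most 2^n permutations.
\[
  |B_r(\mathrm{id})|
  \;\le\; 2^n \cdot
  \#\Bigl\{(d_1,\dots,d_n) \in \mathbb{Z}_{\ge 0}^n : \sum_{i=1}^{n} d_i \le r\Bigr\}
  \;=\; 2^n \binom{r+n}{n}.
\]
```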
Lower bound: deterministic case. Now to show: for a fixed deterministic algorithm, the number of input permutations whose output is at distance more than 2r to id is more than [...]
Lower bound: deterministic case. k comparisons: $2^k$ classes of same outcome.
Lower bound: deterministic case. For permutations in the same class: [...]
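Lower bound: deterministic case. At most $2^k$ input permutations have the same output.
Lower bound: deterministic case. At most $2^k \, |B_{2r}(\mathrm{id})|$ input permutations have their output in the ball $B_{2r}(\mathrm{id})$ of radius 2r around id.
Lower bound: deterministic case. At least $n! - 2^k \, |B_{2r}(\mathrm{id})|$ input permutations have their output outside the ball $B_{2r}(\mathrm{id})$ of radius 2r around id.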
Upper bound: The algorithm (suggested by Chazelle) approximates any ranking within distance [...] with fewer than [...] comparisons.
Algorithm: Partition the elements into equal-sized bins. Elements within a bin are smaller than any element in a subsequent bin. No ordering of the elements within a bin. Output: a permutation consistent with the sequence of bins.
Algorithm [figure: bins after rounds 0, 1 and 2]
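The binning procedure can be sketched as repeated median partitioning. This is only an illustration under stated assumptions: the items are distinct, and `sorted()` stands in for the linear-time median selection of Blum et al., so the code does not achieve the comparison count discussed on the slides. The name `approximate_sort` is ours.

```python
import random


def approximate_sort(items, rounds):
    """Split distinct items into bins by halving every bin once per round.

    After the given number of rounds, every element of a bin is smaller than
    every element of any later bin; the order inside a bin is arbitrary.
    """
    bins = [list(items)]
    for _ in range(rounds):
        split = []
        for b in bins:
            if len(b) < 2:
                split.append(b)  # nothing left to split
                continue
            # Assumption: sorted() replaces a linear-time median search.
            pivot = sorted(b)[len(b) // 2]
            split.append([x for x in b if x < pivot])
            split.append([x for x in b if x >= pivot])
        bins = split
    return bins


random.seed(0)
ranking = random.sample(range(1, 17), 16)   # a random input permutation of 1..16
bins = approximate_sort(ranking, rounds=2)  # 2 rounds give 4 bins of size 4
output = [x for b in bins for x in b]       # any order consistent with the bins
distance = sum(abs(pos - x) for pos, x in enumerate(output, start=1))
```

Every element ends up inside its own bin, so with n elements and m rounds it is displaced by fewer than n/2^m positions, and the footrule distance of the output to id stays below n^2/2^m.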
Analysis of algorithm: m rounds yield $2^m$ bins. Output: a ranking consistent with the ordering of the bins. Running time: median search and partitioning of n elements takes fewer than 6n comparisons (algorithm by Blum et al.), so m rounds take fewer than 6nm comparisons. Distance: set [...]
Algorithm: Theorem. Any ranking consistent with the bins computed in [...] rounds, i.e. with fewer than [...] comparisons, has distance at most [...]
Approximation: Summary. For sufficiently large error, fewer comparisons are needed than for exact sorting: error [...], const: [...] comparisons; error [...]: [...] comparisons. For real applications this is still too much: individual elicitation of the value function is not possible. Second approach: aggregation.
Aggregation (joint work with J. Giesen and D. Mitsche). Motivation: We think that the population splits into preference/customer types. Respondents answer according to their type (but deviation is possible). Instead of individual preference analysis or aggregation over the whole population: aggregate within customer types.
Aggregation: Idea: Ask only a constant number of questions (pairwise comparisons). Ask many respondents. Cluster the respondents according to their answers into types. Aggregate the information within a cluster to get type rankings. Philosophy: first segment, then aggregate.
Algorithm The algorithm works in 3 phases: (1) Estimate the number k of customer types (2) Segment the respondents into the k customer types (3) Compute a ranking for each customer type
Algorithm: Every respondent performs pairwise comparisons. Basic data structure: matrix $A = [a_{ij}]$. Entry $a_{ij} \in \{-1, 1, 0\}$ refers to respondent i and the j-th product pair (x, y): $a_{ij} = -1$ if respondent i prefers y over x, $a_{ij} = 1$ if respondent i prefers x over y, $a_{ij} = 0$ if respondent i has not compared x and y.
Algorithm: Define $B = AA^T$. Then $B_{ij}$ = number of product pairs on which respondents i and j agree minus the number of pairs on which they disagree (not counting 0's).
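A small numerical example makes the meaning of the entries of B concrete. The answer matrix below is made up for illustration (3 respondents, 4 product pairs) and follows the sign convention used in phase 3, where a positive value means x is preferred over y.

```python
import numpy as np

# Rows: respondents, columns: product pairs (x, y).
# +1: prefers x over y, -1: prefers y over x, 0: pair not compared.
A = np.array([
    [ 1, -1,  0,  1],
    [ 1, -1,  1,  0],
    [-1,  1,  0, -1],
])

# B[i, j] = (#pairs on which i and j agree) - (#pairs on which they disagree);
# pairs that either respondent has not compared contribute 0.
B = A @ A.T
```

Respondents 0 and 1 agree on the two pairs they both answered, so B[0, 1] = 2; respondents 0 and 2 disagree on all three pairs they share, so B[0, 2] = -3. The diagonal entry B[i, i] is the number of pairs respondent i answered.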
Algorithm: phase 1. Estimation of the number k of customer types: use matrix B and analyze the spectrum of B. We expect the k largest eigenvalues of B to be substantially larger than the other eigenvalues. Search for a gap in the eigenvalues.
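One simple way to search for the gap is to sort the eigenvalues and take the position of the largest drop between neighbours. This rule, the function name `estimate_num_types` and the noise-free toy data are assumptions for illustration; the slides do not specify how the gap is detected.

```python
import numpy as np


def estimate_num_types(B):
    """Estimate k as the position of the largest gap in the sorted spectrum of B."""
    eigenvalues = np.sort(np.linalg.eigvalsh(B))[::-1]  # descending order
    gaps = eigenvalues[:-1] - eigenvalues[1:]
    return int(np.argmax(gaps)) + 1


# Toy data: two customer types, three respondents each, answering six
# product pairs exactly according to their type (no deviations).
type_1 = [1, 1, 1, 1, 1, 1]
type_2 = [1, 1, 1, -1, -1, -1]
A = np.array([type_1] * 3 + [type_2] * 3)
B = A @ A.T

k = estimate_num_types(B)
```

In this toy example B has the eigenvalue 18 twice and all other eigenvalues are 0, so the largest gap sits after the second eigenvalue and the estimate is k = 2. On real data the gap is less clear-cut and a more careful criterion may be needed.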
Algorithm: phase 2. Cluster the respondents into customer types: use matrix B again. Compute the projector P onto the space spanned by the eigenvectors belonging to the k largest eigenvalues of B. Every respondent corresponds to a column of P. Cluster the columns of P.
Algorithm: phase 2 Intuition for using projector – example on graphs:
Algorithm: phase 2 [graph example: matrices Ad, P and P'; entries not recovered]
Algorithm: phase 2 Embedding of the columns of P
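Phase 2 can be sketched with NumPy as follows. The projector is built from the eigenvectors of the k largest eigenvalues; the clustering step uses a simple greedy distance threshold, which is a stand-in chosen for brevity (the slides do not name a clustering method, and k-means or similar could be used instead). The names `projector_onto_top_k` and `cluster_columns`, the tolerance `tol` and the toy data are assumptions.

```python
import numpy as np


def projector_onto_top_k(B, k):
    """Orthogonal projector onto the span of the eigenvectors of the k largest eigenvalues."""
    _, eigenvectors = np.linalg.eigh(B)  # eigenvalues come in ascending order
    top = eigenvectors[:, -k:]
    return top @ top.T


def cluster_columns(P, tol=0.5):
    """Greedy clustering: a column joins the first cluster whose representative is within tol."""
    representatives, labels = [], []
    for column in P.T:
        for label, rep in enumerate(representatives):
            if np.linalg.norm(column - rep) < tol:
                labels.append(label)
                break
        else:
            representatives.append(column)
            labels.append(len(representatives) - 1)
    return labels


# Toy data: two types, three respondents each, no deviations.
type_1 = [1, 1, 1, 1, 1, 1]
type_2 = [1, 1, 1, -1, -1, -1]
A = np.array([type_1] * 3 + [type_2] * 3)
B = A @ A.T

P = projector_onto_top_k(B, k=2)
labels = cluster_columns(P)
```

In this noise-free example the columns of P belonging to respondents of the same type coincide, while columns of different types are about 0.82 apart, so the threshold separates the two types cleanly.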
Algorithm: phase 3. Compute the ranking for each type. For each type t compute the characteristic vector $c_t$: $(c_t)_i = 1$ if respondent i belongs to that type, $(c_t)_i = 0$ otherwise. For each type t compute $A^T c_t$. If the entry for product pair (x, y) is positive: x is preferred over y by t; negative: y is preferred over x by t; zero: type t is indifferent.
Experimental study: On real-world data: 21 data sets from Sawtooth Software, Inc. (conjoint data sets). Questions: Do real populations decompose into different customer types? How does our algorithm compare to Sawtooth's algorithm?
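The aggregation step is a single matrix-vector product. The sketch below uses a made-up answer matrix with 4 respondents and 3 product pairs, and assumes that respondents 0, 1 and 2 were assigned to type t in phase 2.

```python
import numpy as np

# Rows: respondents, columns: product pairs (x, y).
# +1: prefers x over y, -1: prefers y over x, 0: pair not compared.
A = np.array([
    [ 1, -1,  0],
    [ 1, -1,  1],
    [ 1,  1, -1],
    [-1,  1,  1],
])

# Characteristic vector of type t: 1 for members of the type, 0 otherwise.
c_t = np.array([1, 1, 1, 0])

# Net vote of the type on every product pair, and its sign.
scores = A.T @ c_t
preferences = np.sign(scores)
```

Here the scores are 3, -1 and 0: type t prefers x over y on the first pair, y over x on the second pair, and is indifferent on the third.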
Conjoint structures: Attributes: sets $A_1, \dots, A_n$ with $|A_i| = m_i$. An element of $A_i$ is called a level of the i-th attribute. A product is an element of $A_1 \times \dots \times A_n$. In practical conjoint studies: [...] Example: Car. Number of seats = {5, 7}; Cargo area = {small, medium, large}; Horsepower = {240hp, 185hp}; Price = {$29000, $33000, $37000}; …
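The product set of a conjoint study is the Cartesian product of the attribute level sets. The sketch below enumerates it for the four car attributes that are listed on the slide; the slide indicates further attributes, which are left out here, and the dictionary keys are our own labels.

```python
from itertools import product

# Attribute name -> list of levels (only the attributes shown on the slide).
attributes = {
    "seats": [5, 7],
    "cargo_area": ["small", "medium", "large"],
    "horsepower": ["240hp", "185hp"],
    "price": [29000, 33000, 37000],
}

# A product picks exactly one level per attribute.
products = [
    dict(zip(attributes, levels))
    for levels in product(*attributes.values())
]

num_products = len(products)  # 2 * 3 * 2 * 3
```

With these four attributes there are 2 x 3 x 2 x 3 = 36 products, which is the same way the "size of study" figures in the experiments are obtained from the numbers of levels.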
Quality measures: Difficulty: we do not know the real type rankings, so we cannot directly measure the quality of the result. Other quality measures: Number of inverted pairs: average number of inversions in the partial rankings of respondents in type i with respect to the j-th type ranking. Deviation probability. Hit rate (leave-one-out experiments).
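The inversion measure can be sketched as follows. The data layout is an assumption made for illustration: each respondent is a list of (preferred, other) product pairs, and a type ranking is a list of products from best to worst. The function names are ours.

```python
def count_inversions(comparisons, type_ranking):
    """Number of a respondent's answers that contradict the given type ranking."""
    position = {item: rank for rank, item in enumerate(type_ranking)}
    return sum(
        1
        for preferred, other in comparisons
        if position[preferred] > position[other]  # ranking puts 'other' first
    )


def average_inversions(respondents, type_ranking):
    """Average number of inversions over all respondents of one type."""
    total = sum(count_inversions(c, type_ranking) for c in respondents)
    return total / len(respondents)


type_ranking = ["a", "b", "c", "d"]             # best to worst
respondents = [
    [("a", "b"), ("c", "b"), ("d", "a")],       # two answers contradict the ranking
    [("a", "c"), ("b", "d")],                   # fully consistent
]
avg = average_inversions(respondents, type_ranking)
```

In this example the first respondent contributes 2 inversions and the second none, so the average is 1.0.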
Study 1: # respondents = 270; size of study: 8 x 3 x 4 = 96; # questions = 20. Largest eigenvalues of matrix B [plot not recovered]
Study 1: two types. Size of clusters: 179 – 91. Number of inversions and deviation probability [table with rows Type 1, Type 2 and columns "Ranking for type 1", "Ranking for type 2", "1-p"; values not recovered]
Study 1: Hit rates: Sawtooth: ?; our algorithm: 69%
Study 2: # respondents = 539; size of study: 4 x 3 x 3 x 5 = 180; # questions = 30. Largest eigenvalues of matrix B [plot not recovered]
Study 2: four types. Size of clusters: 81 – 119 – 130 – 209. Number of inversions and deviation probability [table with rows Type 1 to Type 4 and columns "Ranking for type 1" to "Ranking for type 4", "1-p"; values not recovered]
Study 2: Hit rates: Sawtooth: 87%; our algorithm: 65%
Study 3: # respondents = 1184; size of study: 9 x 6 x 5 = 270; # questions = 48. Largest eigenvalues of matrix B [plot not recovered]. Size of clusters: 6 – 3 – 1164 – 8 – 3. Size of clusters: 3 – 1175 – 6. 1-p = 12%
Study 3: Hit rates: Sawtooth: 78%; our algorithm: 62%
Study 4: # respondents = 300; size of study: 6 x 4 x 6 x 3 x 2= 3456; # questions = 40. Largest eigenvalues of matrix B [plot not recovered]
Study 4: Hit rates: Sawtooth: 85%; our algorithm: 51%
Aggregation - Conclusion: Segmentation seems to work well in practice. Hit rates are not good. Reason: the information is too sparse. Additional assumptions are necessary: exploit the conjoint structure; make distribution assumptions.
Thank you!
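Yao's Minimax Principle: $\mathcal{I}$: finite set of input instances. $\mathcal{A}$: finite set of deterministic algorithms. C(i, a): cost of algorithm a on input i, where $i \in \mathcal{I}$ and $a \in \mathcal{A}$. For all distributions p over $\mathcal{I}$ and q over $\mathcal{A}$: $\min_{a \in \mathcal{A}} \mathbf{E}[C(i_p, a)] \le \max_{i \in \mathcal{I}} \mathbf{E}[C(i, a_q)]$, where $i_p$ is an input drawn according to p and $a_q$ an algorithm drawn according to q.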