1
Learning and Decision-making from Rank Data
Lirong Xia IJCAI Tutorial, 2017
2
Decision Making from Rank Data
Example (slide figure): agents Kyle, Stan, and Eric each submit a ranking over the alternatives. Criteria for a good aggregation rule: quality, fairness, strategy-proofness, computation. Example rule: plurality.
3
Other examples. Rank data vs. cardinal data: rank data is easier to elicit, and there is a lot of it (big data).
4
The Condorcet Jury theorem [Condorcet 1785]
Given two alternatives {O, M} and 0.5 < p < 1, suppose each agent's preference is generated i.i.d.: with probability p it agrees with the ground truth, and with probability 1-p it differs from the ground truth, i.e., Pr(O | O) = Pr(M | M) = p > 0.5. Then, as n→∞, the majority of the agents' preferences converges in probability to the ground truth. The theorem "lays, among other things, the foundations of the ideology of the democratic regime" (Paroush 1998).
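To make the convergence concrete, here is a minimal simulation sketch (not part of the tutorial); the accuracy p = 0.6, the trial count, and the group sizes are arbitrary illustrative choices.

```python
import random

def majority_correct_rate(n_agents, p, trials=20000):
    """Fraction of trials in which the majority of n_agents i.i.d. agents,
    each agreeing with the ground truth with probability p, is correct."""
    wins = 0
    for _ in range(trials):
        correct_votes = sum(random.random() < p for _ in range(n_agents))
        if correct_votes > n_agents / 2:
            wins += 1
    return wins / trials

if __name__ == "__main__":
    p = 0.6  # each agent agrees with the ground truth w/p 0.6
    for n in (1, 11, 101, 1001):  # odd sizes avoid ties
        print(n, majority_correct_rate(n, p))  # the rate approaches 1 as n grows
```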
5
End of story? Problem solved!
For two alternatives, majority voting gives a great statistical guarantee, is perfectly fair, strategy-proof, and easy to compute. But what if there are at least 3 alternatives?
6
Overview: 1. Learning (2 hr); 2. Elicitation (3 slides); 3. Decision-making (1 hr).
7
Rank data: linear order (O≻M≻N), weak order (O=M≻N), partial order, pairwise comparisons.
8
Focus of this tutorial: linear orders
9
Rank the futures of AI according to AI100, Stanford University, 2016
10
Outline: 1. Learning (2 hr); 2. Elicitation (3 slides); 3. Decision-making (1 hr).
11
1. Learning Statistical models Inference algorithms
Statistical models: distance-based models (Mallows' model, Condorcet's model), random utility models (Plackett-Luce model), mixture models. Inference algorithms: maximum likelihood estimator (MLE), Bayesian approach.
12
Statistical models. A statistical model has three parts:
a parameter space Θ; a sample space S = Rankings(A)^n, where A is the set of alternatives and n = #voters (votes assumed i.i.d.); and a set of probability distributions over S, {Pr_θ(s) for each s ∈ S and θ ∈ Θ}. Notes: the parameter space is not strictly necessary; Pr_θ(s) is sometimes written as Pr(s | θ).
13
Example Condorcet’s model for two alternatives Parameter space Θ={ , }
Sample space S = { , }n Probability distributions, i.i.d. Pr( | ) = Pr( | ) = p>0.5
14
Mallows’ model [Mallows-1957]
Fixed dispersion φ < 1. Parameter space: all full rankings over the candidates. Sample space: i.i.d. generated full rankings. Probabilities: Pr_W(V) ∝ φ^Kendall(V, W).
15
Example: Mallows for three alternatives (Stan, Kyle, Eric), with normalization Z = 1 + 2φ + 2φ² + φ³.
A ranking at Kendall-tau distance k from the ground truth has probability φ^k / Z: the ground truth itself has probability 1/Z, the two rankings at distance 1 have probability φ/Z, the two at distance 2 have probability φ²/Z, and the reverse of the ground truth has probability φ³/Z.
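As a sanity check, these probabilities can be computed directly from the definition; the sketch below (with arbitrary φ = 0.5, not code from the tutorial) enumerates all rankings and verifies the normalization.

```python
from itertools import permutations

def kendall_tau(r1, r2):
    """Number of pairs of alternatives that r1 and r2 order differently."""
    pos2 = {a: i for i, a in enumerate(r2)}
    return sum(1 for i in range(len(r1)) for j in range(i + 1, len(r1))
               if pos2[r1[i]] > pos2[r1[j]])

def mallows_pmf(truth, phi):
    """Probability of every ranking under Mallows with ground truth `truth`."""
    weights = {V: phi ** kendall_tau(V, truth) for V in permutations(truth)}
    Z = sum(weights.values())  # for m = 3 this equals 1 + 2*phi + 2*phi**2 + phi**3
    return {V: w / Z for V, w in weights.items()}

pmf = mallows_pmf(("Stan", "Kyle", "Eric"), phi=0.5)
for V, pr in sorted(pmf.items(), key=lambda kv: -kv[1]):
    print(V, round(pr, 3))
```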
16
Condorcet’s model [Condorcet-1785, Young-1988, ES UAI-14, APX NIPS-14]
Fixed dispersion φ < 1. Parameter space: all binary relations over the candidates. Sample space: i.i.d. generated binary relations. Probabilities: Pr_W(V) ∝ φ^Kendall(V, W).
17
Distance-based models
Given a metric space (T, d(·,·)) and a fixed dispersion φ < 1, the distance-based model has: parameter space Θ = T; sample space S = T^n (i.i.d. generated); probabilities Pr_W(V) ∝ φ^d(V, W).
18
Random utility model (RUM) [Thurstone 27]
Continuous parameters θ = (θ1, …, θm), where m is the number of alternatives. Each alternative ci is modeled by a utility distribution μi, and θi is a vector that parameterizes μi. An agent's latent utility Ui for alternative ci is generated independently according to μi, and agents rank the alternatives according to their perceived utilities. For example, Pr(c2≻c1≻c3 | θ1, θ2, θ3) = Pr_{Ui∼μi}(U2 > U1 > U3).
19
Generating a preference profile.
Each agent draws a ranking i.i.d. from the model, e.g. P1 = c2≻c1≻c3, …, Pn = c1≻c2≻c3, so Pr(Data | θ1, θ2, θ3) = ∏_{V∈Data} Pr(V | θ1, θ2, θ3).
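A minimal sketch of this generative process with Gaussian utility distributions; the means, the common variance, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rum_profile(means, n_agents, sd=1.0):
    """Sample a profile from a RUM with Gaussian utility distributions:
    each agent independently draws U_i ~ N(means[i], sd) for every alternative
    and ranks the alternatives by decreasing utility."""
    profile = []
    for _ in range(n_agents):
        utilities = rng.normal(means, sd)
        profile.append(tuple(np.argsort(-utilities)))  # alternative indices, best first
    return profile

# alternative 0 has the highest mean utility, so it tends to be ranked first
print(sample_rum_profile([1.0, 0.5, 0.0], n_agents=5))
```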
20
Plackett-Luce model μi’s are Gumbel distributions
A.k.a. the Plackett-Luce (P-L) model [BM 60, Yellott 77] Common parameterization λ1,…,λm Pros: Computationally tractable Analytical solution to the likelihood function The only RUM that was known to be tractable Widely applied in Economics [McFadden 74], learning to rank [Liu 11], and analyzing elections [GM 06,07,08,09] Cons: may not be the best model (later) cm-1 is preferred to cm c1 is the top choice in { c1,…,cm } c2 is the top choice in { c2,…,cm }
21
Plackett-Luce Model: visualization
Drawing balls out of an urn without replacement. Ball sizes: parameters λ1, …, λm. Ranking: the sequence of balls drawn. Example: Pr(ball1 ≻ ball2 ≻ ball3 ≻ ball4 | 0.4, 0.3, 0.2, 0.1) = (0.4/1) × (0.3/0.6) × (0.2/0.3) ≈ 0.13.
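The urn metaphor translates directly into code; the sketch below (with the slide's ball sizes 0.4, 0.3, 0.2, 0.1 and hypothetical alternative names) computes the probability of a ranking and samples one.

```python
import random

def pl_probability(ranking, lam):
    """Pr(ranking | lam) under Plackett-Luce: at each step the next ball is
    drawn with probability proportional to its size among the remaining balls."""
    remaining = list(ranking)
    prob = 1.0
    for a in ranking[:-1]:
        prob *= lam[a] / sum(lam[b] for b in remaining)
        remaining.remove(a)
    return prob

def pl_sample(lam):
    """Sample one ranking by drawing balls without replacement."""
    remaining = list(lam)
    ranking = []
    while remaining:
        a = random.choices(remaining, weights=[lam[b] for b in remaining])[0]
        ranking.append(a)
        remaining.remove(a)
    return ranking

lam = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
print(pl_probability(["a", "b", "c", "d"], lam))  # 0.4/1 * 0.3/0.6 * 0.2/0.3 ≈ 0.133
print(pl_sample(lam))
```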
22
RUM with Gaussian distributions
μi’s are Gaussian distributions Thurstone’s Case V [Thurstone 27] Pros: Intuitive Flexible Cons: believed to be computationally intractable No analytical solution for Pr(P | θ) is known … Um: from -∞ to ∞ Um-1: from Um to ∞ U1: from U2 to ∞
23
Mixture models Given k models The mixture model
Given k models M1 = (Θ1, S, Pr1), …, Mk = (Θk, S, Prk), the mixture model has parameter space (Θ1, …, Θk, α), where α ∈ ℝ^k_{≥0} is the vector of mixing coefficients with α·1 = 1; sample space S; and probability distributions Pr(s | θ1, …, θk, α) = Σ_i αi Pri(s | θi).
24
Mixtures of Plackett-Luce Models
Example: a mixture of two Plackett-Luce models, with component parameters θ(1) and θ(2) (the ball sizes 0.5, 0.2, 0.4, 0.3, 0.2, 0.1 shown in the slide figure). The parameter space gains a new part, the mixing coefficient α: with probability α a ranking is generated from the first model, and with probability 1-α from the second model. Benefits: better fitness, clustering.
25
Irish Election Voting Blocs
Gormley & Murphy (2008) studied Irish election datasets with a mixture model over 5 candidates using 4 mixture components, estimated with their hybrid Expectation-Maximization / Minorization-Maximization (EMM) algorithm, the state of the art. Reference: I. C. Gormley and T. B. Murphy, "Exploring voting blocs within the Irish electorate: A mixture modeling approach," J. Amer. Statistical Assoc., vol. 103, no. 483, pp. 1014-1027, Sep. 2008.
26
Many models, but What if the model is wrong?
"All models are wrong, but some are useful" (George Box). Which model is useful, and for what purpose? This is model selection, largely automated in machine learning.
27
Model selection. Model selection criteria (T = #parameters, n = #data points):
log-likelihood log Pr(D | θ*); predictive log-likelihood E log Pr(D_test | θ*); Akaike information criterion (AIC) 2T − 2 log Pr(D | θ*); Bayesian information criterion (BIC) T log n − 2 log Pr(D | θ*). Tested on datasets from Preflib.org: 209 datasets with full rankings.
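The two criteria are one-liners; the sketch below simply restates them (the numbers in the example call are placeholders, not values from the tutorial's experiments).

```python
import math

def aic(num_params, log_likelihood):
    """Akaike information criterion: lower is better."""
    return 2 * num_params - 2 * log_likelihood

def bic(num_params, num_data, log_likelihood):
    """Bayesian information criterion: lower is better."""
    return num_params * math.log(num_data) - 2 * log_likelihood

print(aic(9, -1234.5), bic(9, 500, -1234.5))
```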
28
AIC = 2T − 2 log Pr(D | θ*), where T = #parameters and n = #data points.
The slide's table compares PL, kPL, RUM, and kRUM on the Preflib datasets under AIC (fractions of datasets on which each model is preferred; table entries: 23.4%, 10.0%, 90.4%, 79.4%, 39.2%, 76.6%, 20.6%, 90.0%, 60.8%, 60.3%). kPL: mixtures of Plackett-Luce; kRUM: mixtures of RUM with Gaussian distributions.
29
BIC = T log n − 2 log Pr(D | θ*), where T = #parameters and n = #data points.
The slide's table compares PL, kPL, RUM, and kRUM on the Preflib datasets under BIC (fractions of datasets on which each model is preferred; table entries: 23.4%, 15.8%, 66.0%, 59.8%, 34.0%, 76.6%, 40.2%, 84.2%). kPL: mixtures of Plackett-Luce; kRUM: mixtures of RUM with Gaussian distributions.
30
Observations on model fitness
On the Preflib full-ranking data, ranked by fitness: 1st, the RUM (Gaussian) mixture model; 2nd, the Plackett-Luce mixture model; 3rd, RUM (Gaussian); 4th, Plackett-Luce. Some models are more useful than others. Next: how to make use of them.
31
1. Learning Statistical models Inference algorithms
Statistical models: distance-based models (Mallows' model, Condorcet's model), random utility models (Plackett-Luce model), mixture models. Inference algorithms: maximum likelihood estimator (MLE), Bayesian approach.
32
Maximum likelihood estimators (MLE)
Model M with ground truth θ generates the votes V1, V2, …, Vn. For any profile P = (V1, …, Vn), the likelihood of θ is L(θ, P) = Pr_θ(P) = ∏_{V∈P} Pr_θ(V). The MLE mechanism: MLE(P) = argmax_θ L(θ, P).
33
Bayesian approach. Given a profile P = (V1, …, Vn) and a prior distribution π over Θ: Step 1, calculate the posterior probability over Θ using Bayes' rule, Pr(θ | P) ∝ π(θ) Pr_θ(P). Step 2, make a decision based on the posterior distribution. Maximum a posteriori (MAP) estimation: MAP(P) = argmax_θ Pr(θ | P); technically equivalent to MLE under a uniform prior.
34
Examples. Θ = {O, M}, S = {O, M}^n, with probability distributions Pr(O | O) = Pr(M | M) = 0.6.
Data: P contains 6 votes for O and 4 votes for M. MLE: L(O) = Pr_O(O)^6 Pr_O(M)^4 = 0.6^6 × 0.4^4 ≈ 0.0012, L(M) = Pr_M(O)^6 Pr_M(M)^4 = 0.4^6 × 0.6^4 ≈ 0.0005; L(O) > L(M), so O wins. MAP with prior O: 0.2, M: 0.8: Pr(O | P) ∝ 0.2 L(O) ≈ 0.00024, Pr(M | P) ∝ 0.8 L(M) ≈ 0.00042; Pr(M | P) > Pr(O | P), so M wins.
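The worked example is a three-line computation; this sketch reproduces it using the slide's numbers (6 vs. 4 votes, accuracy 0.6, prior 0.2/0.8).

```python
p = 0.6                        # Pr(O-vote | ground truth O) = Pr(M-vote | ground truth M)
L_O = p ** 6 * (1 - p) ** 4    # likelihood of ground truth O given 6 O-votes, 4 M-votes
L_M = (1 - p) ** 6 * p ** 4    # likelihood of ground truth M
print("MLE winner:", "O" if L_O > L_M else "M")        # O

post_O = 0.2 * L_O             # unnormalized posterior with prior Pr(O) = 0.2
post_M = 0.8 * L_M             # unnormalized posterior with prior Pr(M) = 0.8
print("MAP winner:", "O" if post_O > post_M else "M")  # M
```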
35
Computing MLE of Mallows
MLE of Mallows’ model = Kemeny’s rule Given P, compute argminW ΣV∈P d(V,W) NP-hard [Bartholdi, Tovey, & Trick, SCW 1989] Practical algorithms Integer Linear Program [Conitzer et al. AAAI-06] Parameterized analysis [Betzler et al. TCS-09] PTAS [Kenyon-Mathieu&Schudy STOC-07] 104 algorithms and combinations [Ali&Meila MSS-12]
36
ILP for the MLE of Mallows. For each pair of alternatives a, b there is a binary variable x_ab: x_ab = 1 if a ≻ b in W, and x_ab = 0 if b ≻ a in W. Minimize Σ_{a,b} (#votes with b ≻ a in P) · x_ab, subject to: for all a, b, x_ab + x_ba = 1 (no pair is ordered in both directions); for all a, b, c, x_ab + x_bc + x_ca ≤ 2 (no cycle of 3 vertices). There is no need to worry about cycles of more than 3 vertices.
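A sketch of this ILP using the PuLP library (assuming it is installed); the three-alternative profile is a made-up example, not data from the tutorial.

```python
import itertools
import pulp

alts = ["a", "b", "c"]
profile = [("a", "b", "c"), ("b", "c", "a"), ("a", "c", "b")]

def n_prefers(x, y):
    """Number of votes that rank x above y."""
    return sum(1 for V in profile if V.index(x) < V.index(y))

prob = pulp.LpProblem("kemeny", pulp.LpMinimize)
x = {(a, b): pulp.LpVariable(f"x_{a}_{b}", cat="Binary")
     for a, b in itertools.permutations(alts, 2)}

# objective: choosing a above b in W costs one unit per voter who prefers b to a
prob += pulp.lpSum(n_prefers(b, a) * x[(a, b)] for a, b in x)
for a, b in itertools.combinations(alts, 2):      # exactly one direction per pair
    prob += x[(a, b)] + x[(b, a)] == 1
for a, b, c in itertools.permutations(alts, 3):   # no cycle of 3 vertices
    prob += x[(a, b)] + x[(b, c)] + x[(c, a)] <= 2

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({pair: int(var.value()) for pair, var in x.items()})
```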
37
Bayesian inference: sampling
Goal: given a prior and data P, sample W with probability Pr(W | P). Idea: use many samples to approximate the posterior distribution. How can this be done even for a single ranking (n = 1)? That is the same as sampling one ranking from the model. Brute force takes O(m!) time; instead, use Markov chain Monte Carlo.
38
Repeated insertion model [Doignon et al. 2004]
Sampling one ranking from Mallows' model. W.l.o.g. the ground truth ranking is W = a1 ≻ a2 ≻ … ≻ am. Insert the alternatives one by one: let V = ∅; then, for each i ≤ m, insert ai into position j of V (counting from the top) with probability φ^{i−j} / (1 + φ + ⋯ + φ^{i−1}). Example (m = 4): start with a1; inserting a2 gives a2≻a1 (∝ φ) or a1≻a2 (∝ 1), each divided by 1+φ; inserting a3 into a2≻a1 gives a3≻a2≻a1 (∝ φ²), a2≻a3≻a1 (∝ φ), or a2≻a1≻a3 (∝ 1), each divided by 1+φ+φ²; inserting a4 into a2≻a3≻a1 gives a4≻a2≻a3≻a1 (∝ φ³), a2≻a4≻a3≻a1 (∝ φ²), a2≻a3≻a4≻a1 (∝ φ), or a2≻a3≻a1≻a4 (∝ 1).
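A minimal sampler following this insertion rule (assuming, as reconstructed above, that position j is counted from the top, so inserting at the bottom has weight 1); φ = 0.5 is an arbitrary choice.

```python
import random

def sample_mallows_rim(ground_truth, phi):
    """Repeated insertion model: insert the i-th alternative of the ground truth
    at position j (1 = top) with probability phi**(i-j) / (1 + phi + ... + phi**(i-1))."""
    V = []
    for i, a in enumerate(ground_truth, start=1):
        weights = [phi ** (i - j) for j in range(1, i + 1)]  # positions 1..i
        j = random.choices(range(1, i + 1), weights=weights)[0]
        V.insert(j - 1, a)
    return V

print(sample_mallows_rim(["a1", "a2", "a3", "a4"], phi=0.5))
```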
39
Why is this correct? Recall: let V = ∅; for each i ≤ m, insert ai into position j of V with probability φ^{i−j} / (1 + φ + ⋯ + φ^{i−1}).
Given a ranking Vi over {a1, …, ai}, consider two adjacent insertions of a_{i+1}: Q_{i+1} = … a_{i+1} ≻ a_{j*} … (insertion at position j) and R_{i+1} = … a_{j*} ≻ a_{i+1} … (insertion at position j+1). We want to show that Pr_W(Q_{i+1}) = φ · Pr_W(R_{i+1}), matching the ratio of the two insertion probabilities. For each full ranking Q that extends Q_{i+1}, exchange a_{i+1} and a_{j*} to obtain a ranking R that extends R_{i+1}; then Kendall(W, R) = Kendall(W, Q) − 1, so Pr_W(Q) = φ · Pr_W(R). Summing over all such pairs of extensions gives Pr_W(Q_{i+1}) = φ · Pr_W(R_{i+1}).
40
Metropolis-Hastings for Mallows [HHX. UAI-15]
The chain moves between rankings: from the current ranking V, a ranking W is proposed; the chain moves to W with probability min(Pr(W | P) / Pr(V | P), 1) / 2, and stays at V otherwise.
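Below is a sketch of such a lazy Metropolis-Hastings sampler. The adjacent-transposition proposal and the uniform prior are assumptions made for illustration, not necessarily the exact chain of [HHX UAI-15]; the acceptance probability is the min(Pr(W|P)/Pr(V|P), 1)/2 rule from the slide.

```python
import random

def kendall_tau(r1, r2):
    pos2 = {a: i for i, a in enumerate(r2)}
    return sum(1 for i in range(len(r1)) for j in range(i + 1, len(r1))
               if pos2[r1[i]] > pos2[r1[j]])

def posterior_weight(W, profile, phi):
    """Unnormalized Pr(W | P) under Mallows with a uniform prior over ground truths."""
    return phi ** sum(kendall_tau(V, W) for V in profile)

def mh_step(W, profile, phi):
    """One lazy MH step: propose an adjacent transposition of W, accept with
    probability min(Pr(W'|P)/Pr(W|P), 1)/2, otherwise stay at W."""
    proposal = list(W)
    k = random.randrange(len(proposal) - 1)
    proposal[k], proposal[k + 1] = proposal[k + 1], proposal[k]
    proposal = tuple(proposal)
    ratio = posterior_weight(proposal, profile, phi) / posterior_weight(W, profile, phi)
    return proposal if random.random() < min(ratio, 1.0) / 2 else W

profile = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c")]
W = ("c", "b", "a")
for _ in range(2000):
    W = mh_step(W, profile, 0.5)
print(W)  # after burn-in, approximately a sample from Pr(W | P)
```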
41
Theoretical guarantee
Theorem 1. The mixing time of the Markov chain for Mallows' model is φ^{−k_max} · Poly(other inputs), where φ is the dispersion parameter of the model and k_max is the maximum cut in the weighted majority graph. Theorem 2. The Markov chain for Condorcet's model is rapidly mixing.
42
Maximum cut k_max. Example (slide figure): a profile consisting of three rankings with multiplicities 2, 3, and 4; in the resulting weighted majority graph (with edge weights including 9, 1, and 5), the maximum cut is k_max = 14.
43
Proof: canonical paths [Sinclair, 1992; Jerrum and Sinclair 1989, 1996; Liu 1996]
Consider the graph on all rankings. For every pair (V, W), define a canonical path γ_VW from V to W; Γ = {one path for every pair}. The goal is to show that no edge is "congested".
44
Canonical paths for Mallows
A canonical path γ_VW from V to W passes through a sequence of intermediate rankings. The mixing time is mainly determined by the maximum edge loading [Sinclair 92]: max over edges e = (E → E′) of [1 / (Pr(E | P) · Pr(E → E′))] · Σ_{γ_VW ∋ e} Pr(V | P) · Pr(W | P) · |γ_VW|.
45
Another Markov Chain for Mallows [Raman & Joachims L@S-15]
Metropolis-Hastings. In each step: generate the proposal ranking W from Pr_V(·); accept with probability Pr(W | P) / Pr(V | P). Used for peer grading; no theoretical guarantee is known.
46
Learning mixture of Mallows
47
Mixture models Given k models The mixture model
Given k models M1 = (Θ1, S, Pr1), …, Mk = (Θk, S, Prk), the mixture model has parameter space (Θ1, …, Θk, α), where α ∈ ℝ^k_{≥0} is the vector of mixing coefficients with α·1 = 1; sample space S; and probability distributions Pr(s | θ1, …, θk, α) = Σ_i αi Pri(s | θi).
48
Learning mixture of Mallows from partial rankings [Lu&Boutilier JMLR-14]
Data are partial rankings, e.g. pairwise comparisons. Model: every agent has a latent full ranking R; each pair of alternatives is independently chosen with probability β (missing at random [Ghahramani & Jordan 95]); the projection of R onto the chosen pairs is observed as data.
49
Lu&Boutlier’s algorithm for computing MLE of mixture of Mallows
Generalized repeated insertion model chain rule: Pr(Insert ai|previous partial ranking) Monte-Carlo Expectation-Maximization: in each iteration, Sample the latent ranking and membership for each agent approximate sampler + Metropolis-Hastings Compute the new parameters to maximize the expected log-likelihood of the latent variables
50
MLE for Plackett-Luce model
Drawing balls out of an urn without replacement. Ball sizes: parameters λ1, …, λm. Ranking: the sequence of balls drawn. Example: Pr(ball1 ≻ ball2 ≻ ball3 ≻ ball4 | 0.4, 0.3, 0.2, 0.1) = (0.4/1) × (0.3/0.6) × (0.2/0.3) ≈ 0.13.
51
Iterative Luce Spectral Ranking [Maystre &Grossglauser NIPS-15]
Choice data (not rankings!): (ai, A) ∈ P means that ai ∈ A was chosen over the other alternatives in A. Log-likelihood: Σ_{(ai, A)∈P} (ln λi − ln λ_A), where λ_A = Σ_{j∈A} λj. First-order conditions: for all i ≤ m, Σ_{(ai, A)∈P} (1/λi − 1/λ_A) − Σ_{(aj, A)∈P: j≠i, ai∈A} 1/λ_A = 0, which can be rewritten as λi Σ_{j≠i} f(D_{j≻i}, λ) = Σ_{j≠i} λj f(D_{i≻j}, λ), where f(S, λ) = Σ_{(a, A)∈S} 1/λ_A and D_{i≻j} = {(ai, A) ∈ P : aj ∈ A} is the set of votes where ai is chosen over aj. In other words, λ is the stationary distribution of a Markov chain M(P, λ).
52
The ILSR algorithm [Maystre &Grossglauser NIPS-15]
Idea: iteratively compute the Markov chain M(P, λ) and its stationary distribution λ. ILSR: choose an arbitrary λ; repeat: set M = 0; for each (ai, A) ∈ P and each aj ∈ A \ {ai}, update M_ji ← M_ji + 1/λ_A; then set λ ← the stationary distribution of M.
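A sketch of ILSR in this spirit (not the authors' code): the stationary distribution is obtained by solving a small linear system, and the choice data below are made up.

```python
import numpy as np

def stationary(M):
    """Stationary distribution pi of rate matrix M (rows sum to 0): pi @ M = 0, sum(pi) = 1."""
    m = M.shape[0]
    A = np.vstack([M.T, np.ones(m)])
    b = np.zeros(m + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    pi = np.clip(pi, 1e-12, None)
    return pi / pi.sum()

def ilsr(choices, m, iters=100):
    """choices: list of (winner, A) pairs with winner in A, A a set of alternative indices."""
    lam = np.full(m, 1.0 / m)
    for _ in range(iters):
        M = np.zeros((m, m))
        for winner, A in choices:
            lam_A = sum(lam[a] for a in A)
            for j in A:
                if j != winner:
                    M[j, winner] += 1.0 / lam_A  # mass flows from each loser j to the winner
        np.fill_diagonal(M, -M.sum(axis=1))      # make every row sum to zero
        lam = stationary(M)
    return lam

choices = [(0, {0, 1, 2}), (0, {0, 1}), (1, {1, 2}), (0, {0, 2}), (2, {1, 2})]
print(ilsr(choices, m=3))  # estimated Plackett-Luce weights, summing to 1
```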
53
Rank-breaking framework [Azari Soufiani, Parkes, & Xia NIPS-13, ICML-14]
Pipeline: ranking data D is broken into pairwise data G(D) using a breaking G, and the parameters Θ′ are then estimated from the pairwise data.
54
Rank-breaking. A breaking is a mapping from {rankings} to 2^{pairwise comparisons}, represented by an undirected graph G over positions {1, …, m}: G(R) = {ci ≻ cj : ci ≻_R cj and (Pos_R(ci), Pos_R(cj)) ∈ G}. Examples (slide figures over 6 positions): full breaking, adjacent breaking, position-k breaking.
55
Examples. D = {c1≻c2≻c3, c2≻c1≻c3}. G_Full(D) = {c1≻c2, c1≻c3, c2≻c3, c2≻c1, c2≻c3, c1≻c3} (as a multiset). G_Adjacent(D) = {c1≻c2, c2≻c3, c2≻c1, c1≻c3}. G_Pos-2(D) = {c2≻c3, c1≻c3}.
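The three breakings are simple position filters; a sketch (alternative names follow the example above):

```python
def full_breaking(R):
    """All pairwise comparisons implied by the ranking R (best first)."""
    return {(R[i], R[j]) for i in range(len(R)) for j in range(i + 1, len(R))}

def adjacent_breaking(R):
    """Only comparisons between adjacent positions."""
    return {(R[i], R[i + 1]) for i in range(len(R) - 1)}

def position_k_breaking(R, k):
    """Compare the alternative at position k with every alternative ranked below it."""
    return {(R[k - 1], R[j]) for j in range(k, len(R))}

D = [("c1", "c2", "c3"), ("c2", "c1", "c3")]
print([full_breaking(R) for R in D])
print([adjacent_breaking(R) for R in D])
print([position_k_breaking(R, 2) for R in D])  # [{('c2','c3')}, {('c1','c3')}]
```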
56
Generalized method of moments
g(R, θ): a vector-valued function, called the moment conditions, such that for any θ, E_{d|θ} g(d, θ) = 0. GMM [Hansen 82]: input data D; output argmin_θ ||Σ_{d∈D} g(d, θ)||². Under mild conditions, GMM is consistent and asymptotically normal.
57
Moment conditions in GMMG
Input: the broken data Dp. Given θ, for any i ≠ j, define g_ij(R, θ) = Freq(ci≻cj) × Pr_Mr(cj≻ci | θ, G) − Freq(cj≻ci) × Pr_Mr(ci≻cj | θ, G), where Freq(ci≻cj) is the frequency of ci≻cj in Dp and Pr_Mr(· | θ, G) is the corresponding marginal probability under the model after breaking. Why is this a valid moment condition? In expectation, E_{R|θ,G} Freq(ci≻cj) = Pr_Mr(ci≻cj | θ, G), so E_{R|θ,G} g_ij(R, θ) = Pr_Mr(ci≻cj | θ, G) × Pr_Mr(cj≻ci | θ, G) − Pr_Mr(cj≻ci | θ, G) × Pr_Mr(ci≻cj | θ, G) = 0.
58
Good breakings. Theorem. For PL, a breaking G is consistent if and only if it is a union of position-k breakings; in particular, top-k breakings (the union of position-1, …, position-k breakings) are consistent. Theorem. For any model in the location family where the PDF of each utility distribution is symmetric, the only consistent breaking is the full breaking; in particular, RUM (Gaussian) is not consistent for any non-full breaking.
59
Running time for PL. MM [Hunter Ann. Stat.-04]: O(m³n) per iteration. ILSR [Maystre & Grossglauser NIPS-15]: O(m²n + m^2.376) per iteration. RGMM_G with full breaking: O(m²n + m^2.376) (breaking: O(m²n); optimization: O(m^2.376)). RGMM_G with top-k breaking: O(mkn + m^2.376).
60
Efficiency tradeoff. For top-k breakings there is a tradeoff between computational efficiency and statistical efficiency (MSE, Kendall correlation); the experiments on the slide use m = 10 and n = 1000.
61
Weighted breaking [Khetan&Oh JMLR-16]
Example (slide figure): the ranking a ≻ b ≻ c ≻ d is broken into weighted groups of pairwise comparisons (e.g. {a≻b, a≻c, a≻d}, {b≻c, b≻d}, …) with position-dependent weights. Complexity of the breaking: O(m²n). Minimax-optimal to some extent; works well in practice; can be extended to certain partial orders.
62
Mixtures of Plackett-Luce Models
Example: a mixture of two Plackett-Luce models, with component parameters θ(1) and θ(2) (the ball sizes 0.5, 0.2, 0.4, 0.3, 0.2, 0.1 shown in the slide figure). The parameter space gains a new part, the mixing coefficient α: with probability α a ranking is generated from the first model, and with probability 1-α from the second model. Benefits: better fitness, clustering.
63
Irish Election Voting Blocs
Gormley & Murphy (2008) studied Irish election datasets with a mixture model over 5 candidates using 4 mixture components, estimated with their hybrid Expectation-Maximization / Minorization-Maximization (EMM) algorithm, the state of the art. Reference: I. C. Gormley and T. B. Murphy, "Exploring voting blocs within the Irish electorate: A mixture modeling approach," J. Amer. Statistical Assoc., vol. 103, no. 483, pp. 1014-1027, Sep. 2008.
64
Are the voting blocs correct?
The identifiability of mixtures of Plackett-Luce models was unknown. A model is identifiable if for all γ, η ∈ Θ, Pr(· | γ) = Pr(· | η) ⟹ γ = η. It is risky to interpret learned parameters of a non-identifiable model: two different mixtures may induce exactly the same distribution over rankings.
65
Identifiability of k-PL [Zhao, Piech, Xia ICML-16]
Identifiability theorems. The mixture of k Plackett-Luce models over m alternatives is: non-identifiable if m ≤ 2k − 1 (the Gormley & Murphy voting blocs have m = 5 and k = 4, so interpreting them is potentially flawed); identifiable if m ≥ 4 and k = 2; generically identifiable if k ≤ ⌊(m−2)/2⌋!.
66
Topics covered so far, organized by model (Mallows, Condorcet, mixtures, Plackett-Luce, general RUM) and by inference approach. MLE: ILP for Mallows; easy for Condorcet; [LB JAIR-14] for mixtures of Mallows; ILSR and rank-breaking for Plackett-Luce; [ZPX IJCAI-16] for mixtures of Plackett-Luce; [Zhao and Xia, ongoing] for general RUM. Bayesian: [HHX UAI-15] for Mallows and Condorcet; [Guiver & Snelson ICML-09; Caron, Teh & Murphy Ann. App. Stat. 14] for Plackett-Luce; open for mixtures and for general RUM.
67
Outline: 1. Learning (2 hr); 2. Elicitation (3 slides); 3. Decision-making (1 hr).
68
Elicitation. Obtaining full rankings is hard, e.g. ranking 10 alternatives at OPRA. What if you can ask each agent no more than 3 pairwise comparisons? What are the best questions to ask: random ones, or the most informative ones? How do we measure the information in one question?
69
Experimental design Greedy w.r.t. expected information gain a?b
Let Inf(P) = f, Inf(P ∪ {a ≻ b}) = f_a, and Inf(P ∪ {b ≻ a}) = f_b. The expected information gain of asking a?b is Pr(a ≻ b | P) · f_a + Pr(b ≻ a | P) · f_b − f. Ask the question with the maximum expected information gain.
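A toy sketch of this greedy rule for three alternatives, using a discrete posterior over ground-truth rankings, a simple noise model for answers (correct with probability p = 0.8), and Inf = negative Shannon entropy; all of these modeling choices are illustrative assumptions, not the tutorial's exact setup.

```python
import math
from itertools import permutations

ALTS = ["a", "b", "c"]
p = 0.8  # an agent's answer agrees with the ground truth ranking w.p. p

def entropy(post):
    return -sum(q * math.log(q) for q in post.values() if q > 0)

def update(post, a, b):
    """Posterior over ground-truth rankings after observing the answer 'a beats b'."""
    new = {W: q * (p if W.index(a) < W.index(b) else 1 - p) for W, q in post.items()}
    Z = sum(new.values())  # Z = Pr(this answer | data so far)
    return {W: q / Z for W, q in new.items()}, Z

post = {W: 1 / 6 for W in permutations(ALTS)}  # uniform prior, no data yet
best, best_gain = None, -math.inf
for a in ALTS:
    for b in ALTS:
        if a < b:
            post_ab, pr_ab = update(post, a, b)
            post_ba, pr_ba = update(post, b, a)
            f, fa, fb = -entropy(post), -entropy(post_ab), -entropy(post_ba)
            gain = pr_ab * fa + pr_ba * fb - f  # expected information gain of asking a?b
            if gain > best_gain:
                best, best_gain = (a, b), gain
print("ask about:", best, "expected gain:", round(best_gain, 4))
```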
70
How does this work exactly?
Choices of Inf(posterior): D-optimality (Shannon information); E-optimality (minimum eigenvalue of the Fisher information matrix); rank-optimality [APX UAI-13] (improve the certainty of the most uncertain pairwise comparison). Caveat: adaptive elicitation is potentially biased (slide example data: 1,1,1,1,1,1,1,1,2,2,2), but it might still be better than random elicitation [APX UAI-13].
71
Outline: 1. Learning (2 hr); 2. Elicitation (3 slides); 3. Decision-making (1 hr).
72
3. Decision-Making Statistical decision theory Fairness
Statistical decision theory: Bayesian estimators. Fairness: social choice, an impossibility theorem, new mechanisms. Automated mechanism design.
73
Decision making under uncertainty
You have a biased coin that lands heads with probability p, and you observe 10 heads and 4 tails. Do you think the next two tosses will be two heads in a row? (Credit: Panos Ipeirotis & Roy Radner.) MLE-based approach: there is an unknown but fixed ground truth, p = 10/14 = 0.714; Pr(2 heads | p = 0.714) = 0.714² = 0.51 > 0.5, so yes. Bayesian approach: the ground truth is captured by a belief distribution; compute Pr(p | data) assuming a uniform prior, then Pr(2 heads | data) = 0.485 < 0.5, so no.
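Both computations fit in a few lines; the Bayesian one uses the closed form E[p²] under the Beta(heads+1, tails+1) posterior that a uniform prior induces. A sketch:

```python
heads, tails = 10, 4

# MLE-based: plug in the point estimate p = 10/14
p_mle = heads / (heads + tails)
print("MLE:", round(p_mle ** 2, 3))  # 0.51 > 0.5 -> bet on two heads

# Bayesian: uniform prior on p gives posterior Beta(heads+1, tails+1), and
# Pr(two heads | data) = E[p^2] = (h+1)(h+2) / ((h+t+2)(h+t+3))
h, t = heads, tails
pr_two_heads = (h + 1) * (h + 2) / ((h + t + 2) * (h + t + 3))
print("Bayesian:", round(pr_two_heads, 3))  # 0.485 < 0.5 -> do not bet
```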
74
Statistical decision theory
Given a statistical model (Θ, S, Pr_θ(s)), a decision space D, and a loss function L(θ, d) ∈ ℝ, the goal is to make a good decision based on data via a decision function f: data ⟶ D. Bayesian expected loss: EL_B(data, d) = E_{θ|data} L(θ, d). Frequentist expected loss: EL_F(θ, f) = E_{data|θ} L(θ, f(data)).
75
Bayesian estimators Inputs Bayesian estimator
Inputs: a statistical model plus a prior (over the unknown ground truth), a decision space, and a loss function L(θ, d) ∈ ℝ. A Bayesian estimator is a decision rule r: Data ⟶ D with minimum Bayesian expected loss: r(P) = argmin_d E_{θ|P} L(θ, d).
76
Example: IMDb's Top 250 movies use a "complex voter weighting system", claimed to be accurate (a "true Bayesian estimate") and claimed to be fair.
77
From the IMDb Votes/Ratings Top Frequently Asked Questions. A different voice, Q: "This is unfair! That film / show has received awards, great reviews, commendations and deserves a much higher vote!" IMDb: "…only votes cast by IMDb users are counted. We do not delete or alter individual votes."
78
How to measure fairness?
Social choice theory
79
Measuring fairness of mechanisms with ties
Strict Condorcet criterion: weak Condorcet winners (if they exist) must win; fairness for obviously strong candidates. (Illustrated with an example profile on the slide.)
80
Fairness Axiom: Condorcet Criterion
(Non-strict) Condorcet criterion: the Condorcet winner (if it exists) must win; fairness for obviously strong candidates. (Illustrated with an example profile on the slide.)
81
Fairness Axiom: Neutrality
Fairness for candidates: permuting the names of the candidates in all votes permutes the set of winners in the same way. (Illustrated on the slide.)
82
Fairness Axiom: Anonymity
Fairness for voters: exchanging the votes of any two voters does not change the winners. (Illustrated on the slide.)
83
Fairness Axiom: Monotonicity
A weak form of strategy-proofness; fairness for non-sophisticated voters: if a winning alternative is raised in some votes while everything else stays the same, it remains a winner. (Illustrated on the slide.)
84
An impossibility theorem in social choice
Theorem. No single-winner (resolute) mechanism satisfies both anonymity and neutrality. Proof sketch: suppose Alice votes a ≻ b and Bob votes b ≻ a, and w.l.o.g. the winner is a. Swapping the names of a and b yields a profile that is the same as the original with the two votes exchanged: by anonymity its winner is still a, but by neutrality its winner must be b; contradiction.
85
Can Bayesian estimators be fair?
OK, so no mechanism is fair in every respect. Can Bayesian estimators be fair?
86
Fairness of Bayesian estimators [Xia UAI-16]
Impossibility theorem: no Bayesian estimator of any finite model satisfies the strict Condorcet criterion. How about other fairness axioms?
87
Mallows’ model [Mallows-1957]
Fixed dispersion φ < 1. Parameter space: all full rankings over the candidates. Sample space: i.i.d. generated full rankings. Probabilities: Pr_W(V) ∝ φ^Kendall(V, W).
88
A Bayesian estimator 𝑓 Ma Top (Mallows with the top loss) [Young 1988]
Mallows’ model Decision: a set of winners Loss: the top loss function L(W, a) =0 if a is top-ranked in W, otherwise it is 1 Uniform prior
89
A New Mechanism 𝑓 Co Borda (Condorcet with Borda loss)
Condorcet’s model Decision: a set of winners Loss: the Borda loss function L(W, a) = # alternatives who beats a in W Uniform prior
90
Properties of Bayesian estimators [Xia UAI-16]
Properties (m = number of alternatives): f_Ma^Top satisfies anonymity, neutrality, and monotonicity, fails the strict Condorcet criterion, satisfies the Condorcet criterion iff φ(1 − φ^{m−1})/(1 − φ) ≤ 1, and is NP-hard to compute [PRS UAI-12]. f_Co^Borda satisfies anonymity, neutrality, and monotonicity, fails the strict Condorcet criterion, satisfies the Condorcet criterion iff φ ≤ 1/(m − 1), and is computable in polynomial time. The slide also lists f_Pair^1 and f_Pair^2, which are Bayesian estimators of a new model.
91
Fine, it would be even better if…
I have a dream, a dream deeply rooted in the AI dream: one day machine intelligence will rise up and help people resolve disputes and make better group decisions. We would like a mechanism with the desired properties, a mechanism for the problem at hand: more fair, more accurate, computable in P, …
92
Strategy-proofness as data generation [Xia AAMAS-13, blue sky track]
Idea: axioms generate data. Slide example: given that r selects a certain winner on one profile, strategy-proofness implies that r must not select certain other winners on the profiles obtained when a voter misreports; each such implication yields a new labeled data point.
93
Automated design [Xia AAMAS-13]
Algorithm: start from data from experts [Procaccia et al. AIJ-09], generate new data from axioms, and learn a mechanism minimizing α_SP · Err_SP + α_IIA · Err_IIA. Open questions: Is this a promising idea? How can new data be generated efficiently? Is there any theoretical guarantee? How should the simplicity of the mechanism be modeled?
94
Which class of mechanisms to learn from?
Not too restrictive: it should include many existing mechanisms. Not too general: e.g. not "the set of all functions". Easy to learn: e.g. like linear classifiers, in some sense.
95
Generalized Decision Scoring Rules [Xia EC-14, Xia&Conitzer EC-08]
Given a preference space P and a decision space D, a Generalized Decision Scoring Rule (GDSR) is defined by a number K and two functions f: P → ℝ^K and g: PreOrders({1, …, K}) → D; the rule is denoted GDSR(f, g). For a profile P = (V1, …, Vn), compute the total score vector f(V1) + ⋯ + f(Vn), take the preorder of its K components, and output g(PreOrder(f(P))).
96
Example: approval voting
f maps each approval vote to its 0/1 score vector, e.g. (1,0,0) + (1,0,1) + (0,1,0) = (2,1,1); the preorder of the totals is "1st > 2nd = 3rd", and g selects the top alternative(s).
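A sketch of approval voting written as a GDSR, with hypothetical alternative names: f maps a vote to its K-dimensional score vector and g looks only at the ordering of the totals.

```python
ALTS = ["a", "b", "c"]  # K = 3 components, one per alternative

def f(approved):
    """Score vector of one approval vote: 1 for each approved alternative."""
    return [1 if x in approved else 0 for x in ALTS]

def g(total):
    """g depends only on the preorder of the K components: return the argmax set."""
    best = max(total)
    return {ALTS[i] for i, s in enumerate(total) if s == best}

def gdsr(profile):
    total = [sum(scores) for scores in zip(*(f(V) for V in profile))]
    return g(total)

# the slide's example: (1,0,0) + (1,0,1) + (0,1,0) = (2,1,1), so the first alternative wins
print(gdsr([{"a"}, {"a", "c"}, {"b"}]))  # {'a'}
```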
97
Other examples. Resolute and irresolute voting rules: all generalized scoring rules [XC EC-08], including positional scoring rules, Copeland, etc., but excluding Dodgson's rule. Preference functions: Kemeny, Slater. Multi-winner rules: Chamberlin & Courant's rule, but excluding Monroe's rule. Statistical estimators: MLE.
98
Summary. 1. Learning (2 hr): models, algorithms. 2. Elicitation (3 slides). 3. Decision-making (1 hr): statistical decision theory, fairness, new mechanisms.
99
Bridging Theory and Practice
Online Preference Reporting and Aggregation
100
Applications UI design, Human Computer Interaction
Learning and elicitation
101
Applications Learning
102
Applications: group decision support system [Xia EC-12]
103
Future directions. Learning: model building, algorithm design, deep learning. System building: e.g. OPRA. Decision-making: fair and ethical decisions [Rossi 2016], automated design. Announcements: COMSOC-18 at RPI; a 2-year postdoc position with Lirong. Thank you!