Mathematical Theories of Interaction with Oracles. Liu Yang, Carnegie Mellon University. © Liu Yang 2013

Thesis Committee: Avrim Blum (co-chair), Jaime Carbonell (co-chair), Manuel Blum, Sanjoy Dasgupta (UC San Diego), Yishay Mansour (Tel Aviv University), Joel Spencer (Courant Institute, NYU)

Outline: Active Property Testing. - Do we need to imitate humans to advance AI? - I see airplanes can fly without flapping their wings.

Property Testing. Given access to a massive dataset: we want to quickly determine whether a given fn f has some given property P or is far from having it. Goal: test using a very small number of queries. One motivation: a preprocessing step before learning.

Property Testing. Instance space X = R^n (distribution D over X). Tested function f : X -> {-1,1}. A property P of Boolean fns is a subset of all Boolean fns h : X -> {-1,1} (e.g. LTFs). dist_D(f, P) := min_{g ∈ P} P_{x~D}[f(x) ≠ g(x)]. Standard type of query: membership query (ask for f(x) at an arbitrary point x).

Property Testing: An Example. E.g. union of d intervals: is f in UINT_4? Accept! Is f in UINT_3? Depends on ε. If f ∈ P, the tester should accept w/ prob 2/3; if dist(f, P) > ε, it should reject w/ prob 2/3. - Model selection: testing can tell us how big d needs to be for a union of d intervals to be close to the target (double and guess: d = 2, 4, 8, 16, ...).

Property Testing and Learning: Motivation. What is property testing for? - Quickly tell whether we have the right fn class to use. - Estimate the complexity of a fn without actually learning it. We want to do this with fewer queries than learning.

Standard Model Uses Membership Queries. Results on testing basic Boolean fns using MQs: constant query complexity for UINT_d, dictators, LTFs, ... However ...

Membership Queries are Unrealistic for ML Problems: An Object Recognition Example. Recognizing cat vs. dog? An MQ gives ... "Is this a dog or a cat?"

An Example: Movie Reviews. Is this a positive or a negative review? Typical representation in ML (bag-of-words): {fell, holding, interest, movie, my, of, short, this}. The original review (what human labelers see): "This movie fell short of holding my interest." - The object a human expert labels has more structure than the internal representation used by the alg. - MQs construct examples in the internal representation. - It can be very difficult to order a constructed example's words so that a human can label it (esp. for long reviews).

Passive: Wastes Too Many Queries. ML people moved on. Can we SAVE #queries? Passive model (sample from D): the queried samples exist in nature, but this is quite wasteful (many examples are uninformative).

Active Testing. Pool of unlabeled data (poly-size). The alg can ask for labels, but only of points in the pool. Goal: a small #queries.

Property Tester. Definition. An s-sample, q-query ε-tester for P over the distribution D is a randomized algorithm A that draws s samples from D (cheap), sequentially queries for the value of f on q of those samples (expensive), and then: 1. Accepts w.p. at least 2/3 when f ∈ P; 2. Rejects w.p. at least 2/3 when dist_D(f, P) > ε.

Same definition as above; the models differ in the sample budget s. Active tester: s = poly(n). Passive tester: s = q. MQ tester: s = ∞ (D = Unif).

Active Property Testing. Testing as a preprocessing step for learning. We need an example where active testing - gets the same query-complexity savings as MQ, - is better in query complexity than passive testing, - needs fewer queries than learning. Union of d intervals: active testing helps! Testing tells how big d needs to be to be close to the target. - #Labels: active testing needs O(1), passive testing needs Θ(d^{1/2}), active learning needs Θ(d).

Our Results. MQ-like on testing UINT_d; passive-like on testing dictators.
Union of d Intervals: active testing O(1); passive testing Θ(d^{1/2}); active learning Θ(d).
Dictator: Θ(log n) for active testing, passive testing, and active learning.
Linear Threshold Fn: active testing O(n^{1/2}); passive testing ~Θ(n^{1/2}); active learning Θ(n).
Cluster Assumption: active testing O(1); passive testing Ω(N^{1/2}); active learning Θ(N).

Testing Unions of Intervals. Theorem. Testing UINT_d in the active testing model can be done using O(1) queries. Recall: learning requires Ω(d) examples.

Testing Unions of Intervals (cont.) Suppose the uniform distribution. Definition: Fix δ > 0. The local δ-noise sensitivity of a fn f: [0,1] -> {0,1} at x ∈ [0,1] is NS_δ(f, x) = P_{y ~ Unif[x−δ, x+δ]}[f(y) ≠ f(x)]. The noise sensitivity of f is NS_δ(f) = E_{x ~ Unif[0,1]}[NS_δ(f, x)]. Proposition (easy): Fix δ > 0. Let f: [0,1] -> {0,1} be a union of d intervals. Then NS_δ(f) ≤ dδ. Lemma (hard): Fix δ = ε²/(32d). Let f: [0,1] -> {0,1} be a fn with noise sensitivity bounded by NS_δ(f) ≤ dδ(1 + ε/4). Then f is ε-close to a union of d intervals.

Easy Lemma. Lemma. If f is a union of d intervals, NS_δ(f) ≤ dδ. Proof sketch: - The probability that x lands within distance δ of one of the boundaries is at most 2d·2δ. - The probability that y crosses a boundary, given that x is within distance δ of it, is at most 1/4. - So P(f(x) ≠ f(y)) ≤ (2d·2δ)·(1/4) = dδ.

Hard Lemma. Lemma. Fix δ = ε²/(32d). If f is ε-far from a union of d intervals, then NS_δ(f) > (1 + ε/4)dδ. Proof strategy: if NS_δ(f) is small, do self-correction. Let g(x) = E[f(y) | y ~ Unif[x−δ, x+δ]], and let f′(x) be g(x) rounded to 0 when it is near 0 and to 1 when it is near 1.

Hard Lemma (cont.) Lemma. Fix δ = ε²/(32d). If f is ε-far from a union of d intervals, then NS_δ(f) > (1 + ε/4)dδ. Proof strategy: - Argue dist(f, f′) ≤ ε/2. - Show f′ is a union of at most d(1 + ε/2) intervals. - Together these imply dist(f, P) ≤ ε, contradicting ε-farness.

[Figure: sample points and their δ-neighborhoods under the uniform distribution.]

Testing Unions of Intervals. Theorem. Testing UINT_d in the active testing model can be done using O(1) queries. If the distribution is non-uniform, use data to stretch/squash the axis, which makes the distribution near-uniform. Total number of unlabeled samples: O(d^{1/2}).
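To make the query pattern concrete, here is a minimal Python sketch of a noise-sensitivity-based active tester. It is an illustration of the idea above, not the exact procedure from the thesis: the name `active_uint_tester`, the pair-sampling heuristic, and the constants are assumptions, and it presumes the unlabeled pool contains enough δ-close pairs.

```python
import random

def active_uint_tester(pool, f, d, eps, num_pairs=None):
    """Illustrative active tester for unions of d intervals on [0, 1].

    pool : unlabeled points, assumed (near-)uniform after stretching the axis
    f    : label oracle returning f(x) in {0, 1}; each call is one label query
    Accept iff the estimated noise sensitivity at scale delta is below a
    threshold separating NS <= d*delta from NS > (1 + eps/4)*d*delta.
    """
    delta = eps ** 2 / (32 * d)
    if num_pairs is None:
        # enough pairs to resolve a gap of order eps * d * delta (heuristic)
        num_pairs = int(64 / (d * delta * eps ** 2)) + 1
    pts = sorted(pool)
    # candidate pairs of delta-close points from the pool
    close = [(x, y) for x, y in zip(pts, pts[1:]) if y - x <= delta]
    if not close:
        return True  # nothing to distinguish at this scale; accept by default
    sample = [random.choice(close) for _ in range(num_pairs)]
    disagree_rate = sum(f(x) != f(y) for x, y in sample) / num_pairs
    threshold = d * delta * (1 + eps / 8)  # midpoint of the two lemmas' bounds
    return disagree_rate <= threshold      # True = accept, False = reject
```

The number of label queries is 2·num_pairs, which does not grow with d, matching the O(1)-in-d query bound; the unlabeled pool is what supplies the δ-close pairs.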

Testing Linear Threshold Fns. Linear threshold functions (LTFs): f(x) = sign(⟨w, x⟩), for w, x ∈ R^n.

Testing Linear Threshold Fns. Theorem. We can efficiently test LTFs under the Gaussian distribution with Õ(n^{1/2}) labeled examples in both the active and passive testing models. We have lower bounds of Ω̃(n^{1/3}) for active testing and Ω̃(n^{1/2}) for passive testing. Learning LTFs needs Ω(n) examples under the Gaussian, so testing is better than learning in this case.

Testing Linear Threshold Fns. [MORS10] => it suffices to estimate E[f(x) f(y) ⟨x, y⟩] up to ± poly(ε). Intuition: an LTF is characterized by a nice linear relation between the inner product ⟨x, y⟩ and the probability of the two points having the same label (f(x)f(y) = 1).

Testing Linear Threshold Fns. [MORS10] => it suffices to estimate E[f(x) f(y) ⟨x, y⟩] up to ± poly(ε). We could take m random pairs and use the empirical average. - But most pairs x, y would have |⟨x, y⟩| on the order of n^{1/2} (CLT), so we would need m = Ω(n) pairs to get within ± poly(ε). Solution: take O(n^{1/2}) random points and average f(x)f(y)⟨x, y⟩ over all O(n) pairs x, y. - Concentration inequalities for U-statistics [Arcones, 95] imply this works.
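The pairwise-averaging trick can be written down directly. This is a hedged sketch assuming numpy and a label oracle `f`; `estimate_ltf_statistic` is an illustrative name, and the quantity computed is the E[f(x) f(y)⟨x, y⟩] statistic referred to above.

```python
import numpy as np

def estimate_ltf_statistic(f, n, num_points):
    """U-statistic estimate of E[f(x) f(y) <x, y>] for x, y ~ N(0, I_n):
    draw one batch of labeled points and average over all pairs from the
    batch, instead of using independent pairs."""
    X = np.random.randn(num_points, n)       # num_points ~ O(sqrt(n)) suffices
    labels = np.array([f(x) for x in X])     # one label query per point
    G = X @ X.T                              # all pairwise inner products
    L = np.outer(labels, labels)             # f(x_i) * f(x_j)
    iu = np.triu_indices(num_points, k=1)    # distinct pairs i < j
    return float(np.mean(L[iu] * G[iu]))
```

With O(√n) labeled points one gets O(n) (dependent) pairs; the Arcones-style U-statistic concentration is what justifies treating them almost like independent samples.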

General Testing Dimension. The testing dimension characterizes (up to constant factors) the intrinsic number of label requests needed to test the given property w.r.t. the given distribution. All our lower bounds are proved via the testing dimension.

Minimax Argument. min_Alg max_f P(Alg mistaken) = max_{π_0} min_Alg P(Alg mistaken). W.l.o.g. π_0 = α π′ + (1−α) π″, with π′ ∈ Π_0 and π″ ∈ Π_ε. Let π′_S, π″_S be the induced distributions on the labels of S. For a given π_0, min_Alg P(Alg makes a mistake | S) is lower-bounded in terms of 1 − d_S(π′_S, π″_S).

Passive Testing Dimension. Define d_passive as the largest q ∈ N such that, for a random sample S of size q, the induced label distributions π′_S and π″_S remain close (small d_S) for the relevant pairs (π′, π″). Theorem: the sample complexity of passive testing is Θ(d_passive). Compare with the VC dimension: there we want a set S such that all labelings occur at least once.

Active Testing Dimension. Fair(π′, π″, U): a distribution over labeled pairs (y, ℓ): w.p. ½ choose y ~ π′_U and ℓ = 1; w.p. ½ choose y ~ π″_U and ℓ = 0. err*(H; P): the error of the optimal fn in H w.r.t. data drawn from distribution P over labeled examples. Given u = poly(n) unlabeled examples, d_active(u) is the largest q ∈ N for which q label queries cannot distinguish π′ from π″ on the pool (defined via err* on Fair(π′, π″, U)). Theorem: active testing with failure prob 1/8 using u unlabeled examples needs Ω(d_active(u)) label queries; it can be done with O(u) unlabeled examples and O(d_active(u)) label queries.

Application: Dictator Fns. Theorem: For dictator functions under the uniform distribution, d_active(u) = Θ(log n) (for any large-enough u = poly(n)). Corollary: Any class that contains the dictator functions requires Ω(log n) queries to test in the active model, including poly-size decision trees, functions of low Fourier degree, juntas, DNFs, etc.

Application: Dictator Fns. Theorem: For dictator functions under the uniform distribution, d_active(u) = Θ(log n) (for any large-enough u = poly(n)). Lower-bound construction: π′ = uniform over dictator fns; π″ = uniform over all Boolean fns.

Application: LTFs. Theorem. For LTFs under the standard n-dim Gaussian distribution, d_passive = Ω((n/log n)^{1/2}) and d_active(u) = Ω((n/log n)^{1/3}) (for any u = poly(n)). - π′: the distribution over LTFs obtained by choosing w ~ N(0, I_{n×n}) and outputting f(x) = sgn(w·x). - π″: the uniform distribution over all functions. - To obtain d_passive: bound the total variation distance between the distribution of Xw/√n and N(0, I_{q×q}). - To obtain d_active: similar to the dictator lower bound, but relying on strong concentration bounds for the spectra of random matrices.

Open Problems. Matching lower/upper bounds for active testing of LTFs (n^{1/3} vs. n^{1/2}?). Tolerant testing: ε/2 vs. ε (UINT_d, LTFs). Testing LTFs under general distributions.

Outline: Learnability of DNF with Representation-Specific Queries. - Liu: We do statistical learning for ... - Marvin: But we haven't done well at the fundamentals, e.g. knowledge representation.

Learning DNF Formulas. Poly-sized DNF: #terms = n^{O(1)}, e.g. f = (x1 ∧ x2) ∨ (x1 ∧ x4). - A natural form of knowledge representation. - PAC-learning DNF appears to be very hard. Setup: n: the number of variables. Concept space C: a collection of fns h: {0,1}^n -> {0,1}. Unknown target fn f*: the true labeling fn. Err(h) = P_{x~D}[h(x) ≠ f*(x)] (distribution D over X). The best known alg in the standard model takes exponential time over arbitrary distributions; over the uniform distribution, no poly-time alg is known.

New Models: Interaction with Oracles. - Boolean queries: K(x, y) = 1 if x and y share some term. - Numerical queries: K(x, y) = the number of terms x and y share. Imagine ... "Hi Tim, do x and y have some term in common?" "Yes!"

Query: Similarity of TYPE. Example: fraud detection. Fraudulent transactions come in types (identity theft, stolen cards, BIN attack, skimming); are two fraudulent examples x and y of the same type? YES: x and y share a term. Type of query: a pair of POSITIVE examples from a random dataset; the teacher says YES if they share some term, or reports how many terms they share. Question: can we efficiently learn DNF with this type of query?

Warm-Up: Disjoint DNF w/ Boolean Queries. Use similarity queries to partition the positive examples into t buckets, one per term. Separately learn a conjunction for each bucket (intersect the positive examples in it), then OR the results (see the sketch below).
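A minimal sketch of this warm-up procedure, assuming examples are 0/1 tuples and `K` is the Boolean similarity oracle; the helper names are illustrative.

```python
def learn_disjoint_dnf(positives, K):
    """Warm-up sketch: partition positive examples into buckets using the
    Boolean similarity oracle K (K(x, y) = 1 iff x and y share a term),
    then learn one conjunction per bucket by intersecting its examples.

    positives : list of positive examples, each a tuple of 0/1 bits
    Returns a hypothesis h(x) = OR of the learned conjunctions.
    """
    buckets = []
    for x in positives:
        for b in buckets:
            if K(x, b[0]):              # shares a term with this bucket
                b.append(x)
                break
        else:
            buckets.append([x])         # start a new bucket

    def conjunction(bucket):
        n = len(bucket[0])
        # keep variable i (with its common value) only if all examples agree
        fixed = {i: bucket[0][i] for i in range(n)
                 if all(x[i] == bucket[0][i] for x in bucket)}
        return lambda x: all(x[i] == v for i, v in fixed.items())

    terms = [conjunction(b) for b in buckets]
    return lambda x: any(t(x) for t in terms)
```

For a truly disjoint DNF each positive example satisfies exactly one term, so "shares a term" behaves like an equivalence relation and comparing against one representative per bucket suffices.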

Positive Result 1: Weakly Disjoint DNF w/ Boolean Queries. - A distinguishing example for T_1: an example satisfying T_1 and no other term. - Weakly disjoint: for each term, at least a 1/poly(n, 1/ε) fraction of random examples satisfy it and no other term. - Graph: nodes are the positive examples; an edge exists if K(·,·) = 1. - Neighbor method: for each example, take all its neighbors in the graph and learn a conjunction from them. - The neighbor method, w.p. 1−δ, produces an ε-accurate DNF if the target is weakly disjoint.

Hardness Result: Boolean Queries. Thm. Learning DNF from random data under arbitrary distributions with Boolean queries is as hard as learning DNF from random data under arbitrary distributions with only labels (no queries). - Group-learning: tell whether data came from D+ or D−. - Reduction from group-learning DNF in the standard model to our model. - How do we use our alg A to group-learn? Simulate the oracle by always saying yes whenever a query is made to two positive examples; given the output of A, we obtain a group-learning alg for the original problem.

Hardness Result: Approximate Numerical Queries. Thm. Learning DNF from random data under arbitrary distributions with approximate numerical queries is as hard as learning DNF from random data under arbitrary distributions with only labels. I.e., if C is the number of terms x_i and x_j satisfy in common, the oracle returns a value in [(1 − τ)C, (1 + τ)C].

Positive Result 3: Learning DNF w/ Numerical Queries. Thm. Under the uniform distribution, with numerical queries, we can learn any poly(n)-term DNF. - Sample m = O((t/ε) log(t/(εδ))) landmark points. - For landmark x_i, the fn F_i(·) = K(x_i, ·) is a sum-of-monotone-terms fn (remove the terms not satisfied by the positive example x_i): F_i(x) = T_1(x) + T_2(x) + ... + T_t(x) over the surviving terms. - Use a subroutine to learn a hypothesis h_i that is ε/(2m)-accurate w.r.t. F_i. Subroutine: learn a sum of t monotone terms over the uniform distribution, using time and samples poly(t, n, 1/ε). - Combine all hypotheses h_i into h: h(x) = 0 if h_i(x) = 0 for all i, else h(x) = 1.

Learning a Sum of Monotone Terms (worked example). Variables x1, ..., x9. Greedy growth of candidate sets with an inclusion check (is the estimated Fourier coefficient of S of magnitude ≥ ε/(16t)?): start with S = {x1}; try S = {x1, x2}, S = {x1, x3}, ...; the check says YES for S = {x1, x3, x4}; continue with S = {x1, x3, x4, x5}, ..., S = {x1, x3, x4, x9}; output the term x1 x3 x4 x9.

Learning a Sum of Monotone Terms: Greedy Alg. Set θ = ε/(8t). Examine each parity fn of size 1 and estimate its Fourier coefficient (up to θ/4 accuracy). Place all coefficients of magnitude ≥ θ/2 into a list L_1. For j = 2, 3, ... repeat: - For each parity fn Φ_S in list L_{j−1} and each x_i not in S, estimate the Fourier coefficient of Φ_{S ∪ {i}}. - If the estimate has magnitude ≥ θ/2, add it to list L_j (if not already there). - This maintains the list L_j of size-j parity fns with coefficient magnitude roughly ≥ θ. Construct the fn g: the weighted sum of the parities with identified coefficients. Output the fn h(x) = [g(x)] (g rounded); see the sketch below.
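A rough Python rendering of the greedy coefficient search, under the assumptions that F is the (real-valued) sum-of-monotone-terms fn, the distribution is uniform over {0,1}^n, and Monte Carlo estimation of Fourier coefficients is accurate enough. Names and sample sizes are illustrative, and the final rounding step h(x) = [g(x)] is omitted.

```python
import random

def estimate_fourier(F, S, n, samples=2000):
    """Monte Carlo estimate of the Fourier coefficient of F on the parity
    chi_S, under the uniform distribution over {0,1}^n."""
    total = 0.0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        chi = 1
        for i in S:
            chi *= (1 - 2 * x[i])          # chi_S(x) = prod_{i in S} (-1)^{x_i}
        total += F(x) * chi
    return total / samples

def greedy_heavy_coefficients(F, n, t, eps):
    """Greedy search for heavy Fourier coefficients of a sum of t monotone
    terms: start from heavy singletons and repeatedly try to extend each
    heavy set by one more variable, keeping extensions that stay heavy."""
    theta = eps / (8 * t)
    heavy = {frozenset([i]) for i in range(n)
             if abs(estimate_fourier(F, [i], n)) >= theta / 2}
    frontier = set(heavy)
    while frontier:
        new = set()
        for S in frontier:
            for i in range(n):
                if i in S:
                    continue
                T = frozenset(S | {i})
                if T not in heavy and abs(estimate_fourier(F, sorted(T), n)) >= theta / 2:
                    new.add(T)
        heavy |= new
        frontier = new
    return heavy
```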

Other Positive Results (binary and numerical queries):
- O(log n)-term DNF (any distribution)
- 2-term DNF (any distribution)
- DNF in which each variable appears in at most O(log n) terms (uniform)
- log(n)-juntas (uniform)
- log(n)-juntas (any distribution)
- DNF having 2^{O(log n)} terms (uniform)
Open problems: - learn arbitrary DNF (uniform distribution, Boolean queries)? - learn arbitrary DNF (any distribution, numerical queries)?

Outline: Active Learning with a Drifting Distribution. - If not every poem has a proof, can we at least try to make every theorem proved beautiful like a poem?

Active Learning with a Drifting Distribution: Model. Scenario: - An unobservable sequence of distributions D_1, D_2, ..., with each D_t in a family 𝒟. - An unobservable, time-independent regular conditional distribution, represented by a fn η. - (X_1, X_2, ...): an infinite sequence of independent r.v.s with X_t ~ D_t, and the conditional distribution of Y_t given X_t determined by η(X_t). Active learning protocol: at each time t, the alg is presented with X_t and is required to predict a label; then it may optionally request to see the true label value Y_t. We are interested in the cumulative #mistakes up to time T and the total #labels requested up to time T.

[Figure: points x_1, x_2, x_3, x_4, ..., x_t in the data space, each x_t drawn from the corresponding D_t; the distributions D_1, D_2, D_3, D_4, ..., D_t drift within the distribution space.]

Definitions and Notation. Instance space X = R^n. Distribution space 𝒟 of distributions on X. Concept space C of classifiers h: X -> {-1,1}; assume C has VC dimension vc < ∞. D_t: the data distribution on X at time t. Unknown target fn h*: the true labeling fn. Err_t(h) = P_{x~D_t}[h(x) ≠ h*(x)]. In the realizable case, h* ∈ C and err_t(h*) = 0.

Definitions: Disagreement Coefficient, TVD. The disagreement coefficient of h* under a distribution P on X is defined as θ(r_0) = sup_{r > r_0} P(DIS(B(h*, r)))/r (r_0 > 0), where B(h*, r) is the ball of radius r around h* and DIS(·) is its region of disagreement. The total variation distance of probability measures P and Q on a sigma-algebra of subsets of the sample space is defined via ||P − Q|| = sup_A |P(A) − Q(A)|.

Assumptions. Independence of the X_t variables; VC-dim < ∞. Assumption 1 (totally bounded): 𝒟 is totally bounded (i.e., for each ε > 0 it admits a finite ε-cover). - For each ε > 0, let 𝒟_ε denote a minimal subset of 𝒟 s.t. every element of 𝒟 is within ε (in total variation) of some member of 𝒟_ε (i.e., a minimal ε-cover of 𝒟). Assumption 2 (poly-covers): |𝒟_ε| ≤ c·ε^{−m}, where c, m ≥ 0 are constants.

Realizable-Case Active Learning: CAL. [The CAL algorithm was shown as a figure: maintain the version space of classifiers consistent with all observed labels, and request a label only when the surviving classifiers disagree on the current point.]
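Since the algorithm itself appeared only as a figure, here is a minimal Python sketch of CAL for a finite hypothesis class in the realizable case; the streaming interface and names are assumptions.

```python
def cal(stream, hypotheses, label_oracle):
    """Sketch of CAL (Cohn-Atlas-Ladner): keep the version space of
    classifiers consistent with all labels seen so far, and request a
    label only when the surviving classifiers disagree on the new point."""
    version_space = list(hypotheses)
    mistakes, queries = 0, 0
    for x, true_y in stream:                 # true_y is used only to count mistakes
        y_hat = version_space[0](x)          # predict with any consistent classifier
        mistakes += (y_hat != true_y)
        if len({h(x) for h in version_space}) > 1:   # disagreement region
            y = label_oracle(x)              # request the true label
            queries += 1
            version_space = [h for h in version_space if h(x) == y]
    return mistakes, queries
```

The drifting-distribution analysis that follows asks how fast this version space shrinks when each x_t comes from a different D_t.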

Sublinear Result: Realizable Case. Theorem. If 𝒟 is totally bounded, then CAL achieves an expected number of mistakes that is sublinear in T; and, under an additional condition (involving the disagreement coefficient), CAL also makes an expected number of queries that is sublinear in T. [Proof Sketch]: Partition 𝒟 into buckets of diameter < ε. Pick a time T_ε past all indices belonging to the finite buckets, so that by then every infinite bucket has received at least L(ε) samples.

Number of Mistakes. Alternative scenario: - Let P_i be a fixed distribution in bucket i. - Swap the first L(ε) samples for bucket i with L(ε) samples from P_i. - Take L(ε) large enough so that E[diam(V)]_alternative < √ε. - Note: E[diam(V)] ≤ E[diam(V)]_alternative + the sum, over the L(ε) swapped t values, of ||P_i − D_t|| < √ε + L(ε)·ε. So E[diam(V)] -> 0 as T -> ∞, and hence E[#mistakes] is sublinear, since the per-round mistake probability is controlled by the diameter of the version space.

Number of Queries. E[#queries] = the sum over t of P(make a query at time t), and P(make a query at time t) = E[P_{D_t}(DIS(V_{t−1}))]. Bounding this disagreement mass via the disagreement coefficient, together with the shrinking version-space diameter, shows that E[#queries] is sublinear in T.

Explicit Bound: Realizable Case. Theorem. If the poly-covers assumption is satisfied (|𝒟_ε| ≤ c·ε^{−m}), then CAL achieves explicit bounds on the expected number of mistakes and the expected number of queries, with rates determined by the cover parameters and the VC dimension. [Proof Sketch] Fix any ε > 0, and enumerate 𝒟_ε. For t ∈ N, let K(t) be the index k of the element of 𝒟_ε closest to D_t. Alternative data sequence: let X′_1, X′_2, ... be independent, with X′_t drawn from the cover element of index K(t). This way all samples corresponding to distributions in a given bucket come from the same distribution. Let V′_t be the corresponding version spaces.

Bounding E[#mistakes]: a classic PAC bound controls the version-space error at time t in terms of the number of previous distributions in D_t's bucket; since each bucket contributes at most T samples, summing over buckets bounds E[#mistakes], and taking ε appropriately gives the stated theorem. To bound E[#queries], the same argument shows the query probability obeys the analogous bound, and again taking ε appropriately gives the stated result.

Learning with Noise: Noise Conditions. Strictly benign noise condition: a mild condition on η under which the Bayes-optimal classifier h* = sign(2η − 1) lies in C. Special case, Tsybakov's noise condition: η satisfies the strictly benign condition and, for some c > 0 and α ≥ 0, the probability mass near the decision boundary decays polynomially (P_x(|η(x) − 1/2| ≤ t) ≤ c·t^α). Uniform Tsybakov assumption: the Tsybakov assumption is satisfied for all D ∈ 𝒟 with the same c and α values.

Agnostic CAL (ACAL) [DHM]. Based on a subroutine (shown as a figure).

Tsybakov Noise: Sublinear Results & Explicit Bound. Theorem. If 𝒟 is totally bounded and η satisfies the strictly benign noise condition, then ACAL achieves an expected excess number of mistakes that is sublinear in T; and, under an additional condition, ACAL also makes a sublinear expected number of queries. Theorem. If the poly-covers assumption and the uniform Tsybakov assumption are satisfied, then ACAL achieves explicit bounds on the expected excess number of mistakes and the expected number of queries, with rates depending on the Tsybakov parameters (c, α) and the cover parameters.

Outline: Transfer Learning. - Do not ask what Bayesians can do for Machine Learning; ask what Machine Learning can do for Bayesians.

Transfer Learning. Principle: solving a new learning problem is easier given that we've solved several already! How does it help? - The new task is directly "related" to previous tasks [e.g., Ben-David & Schuller 03; Evgeniou, Micchelli, & Pontil 2005]. - Previous tasks give us useful sub-concepts [e.g., Thrun 96]. - We can gather statistical info on the variety of concepts [e.g., Baxter 97; Ando & Zhang 04]. Example: speech recognition. - After training a few times, we figured out the dialects. - Next time, just identify the dialect. - Much easier than training a recognizer from scratch.

Model of Transfer Learning. Motivation: learners are often not too altruistic. [Figure: a prior generates targets h_1*, h_2*, ..., h_T*; task t provides data (x^t_1, y^t_1), ..., (x^t_k, y^t_k).] Layer 1: draw each task's target i.i.d. from an unknown prior. Layer 2: per task, draw data i.i.d. and label it with that task's target. Each completed task yields a better estimate of the prior!

- Marvin: So you assume learning French is similar to learning English? - Liu: It indeed seems many English words have a French counterpart ...

Identifiability of Priors from Joint Distributions. Let the prior π be any distribution on C (example: (w, b) ~ multivariate normal). Target h*_π ~ π. Data X = (X_1, X_2, ...) i.i.d. D, independent of h*_π. Z(π) = ((X_1, h*_π(X_1)), (X_2, h*_π(X_2)), ...). Let [m] = {1, ..., m}. Denote X_I = {X_i : i ∈ I} (I a subset of the natural numbers) and Z_I(π) = {(X_i, h*_π(X_i)) : i ∈ I}. Theorem: Z_[vc](π_1) =_d Z_[vc](π_2) iff π_1 = π_2 (where =_d denotes equality in distribution and vc is the VC dimension of C).

Identifiability of Priors from the VC-dim Joint Distribution. Thresholds (vc = 1), writing "·" for an unconstrained label: - For two points x_1 < x_2: Pr(+,+) = Pr(+,·), Pr(−,−) = Pr(·,−), Pr(+,−) = 0. So Pr(−,+) = Pr(·,+) − Pr(+,+) = Pr(·,+) − Pr(+,·). - For any k > 1 points, we can directly reduce the number of labels in the joint probability from k to 1: P((−,+)) = P((·,+)) − P((+,+)) = P((·,+)) − P((+,·)) + P((+,−)) (an unrealized labeling!) = P((·,+)) − P((+,·)).

Theorem: Z_[vc](π_1) =_d Z_[vc](π_2) iff π_1 = π_2. Proof Sketch. Let ρ_m(h, g) = (1/m) Σ_{i=1}^m 1[h(X_i) ≠ g(X_i)]. Then vc < ∞ implies that, w.p. 1, for all h, g ∈ C with h ≠ g, lim_{m→∞} ρ_m(h, g) = ρ(h, g) > 0. ρ is a metric on C by assumption, so w.p. 1 each h in C labels the infinite sequence (X_1, X_2, ...) distinctly as (h(X_1), h(X_2), ...). => w.p. 1 the conditional distribution of the label sequence Z(π)|X identifies π => the distribution of Z(π) identifies π, i.e. Z(π_1) =_d Z(π_2) implies π_1 = π_2.

Identifiability of Priors from Joint Distributions. [Figure: reducing to lower-dimensional conditional distributions of the labels.]

Transfer Learning Setting. A collection Π of distributions on C (known). Target distribution π* ∈ Π (unknown). Independent target fns h_1*, ..., h_T* ~ π* (unknown). Independent i.i.d.-D data sets X^(t) = (X_1^(t), X_2^(t), ...), t ∈ [T]. Define Z^(t) = ((X_1^(t), h_t*(X_1^(t))), (X_2^(t), h_t*(X_2^(t))), ...). The learning alg gets Z^(1), then produces ĥ_1, then gets Z^(2), then produces ĥ_2, etc., in sequence. We are interested in: the values of ρ(ĥ_t, h_t*), and the number of h_t*(X_j^(t)) values the alg needs to access.

Estimating the Prior. Principle: learning would be easier if we knew π*. Fact: π* is identifiable from the distribution of Z_[vc]^(t). Strategy: take the samples Z_[vc]^(i) from past tasks i = 1, ..., t−1, use them to estimate the distribution of Z_[vc]^(i), and convert that into an estimate π̂_{t−1} of π*; then use π̂_{t−1} in a prior-dependent learning alg for the new task h_t*. Assume Π is totally bounded in total variation. Then we can estimate π* at a bounded rate: ||π* − π̂_t|| < δ_t, where δ_t converges to 0 (holds w.h.p.).

Transfer Learning Algorithm. Given a prior-dependent learning alg A(ε, π), with E[#labels accessed] = Λ(ε, π), producing ĥ with E[ρ(ĥ, h*)] ≤ ε. For t = 1, ..., T: - If δ_{t−1} > ε/4, run a prior-independent learner on Z_[vc/ε]^(t) to get ĥ_t. - Else let π_t = argmin_{π ∈ B(π̂_{t−1}, δ_{t−1})} Λ(ε/2, π) and run A(ε/2, π_t) on Z^(t) to get ĥ_t. Theorem: For all t, E[ρ(ĥ_t, h_t*)] ≤ ε, and limsup_{T→∞} E[#labels accessed]/T ≤ Λ(ε/2, π*) + vc. (A simplified sketch follows.)
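A simplified sketch of the loop above, assuming black-box helpers `estimate_prior` (returning an estimate and a confidence radius δ), a prior-independent learner, and a prior-dependent learner `A`. The argmin-over-a-TV-ball refinement in the real algorithm is collapsed into using the estimate directly, so treat this purely as an illustration of the control flow.

```python
def transfer_learning(tasks, A, prior_indep_learner, estimate_prior, eps, vc):
    """For each task: if the prior estimate is still coarse, fall back to a
    prior-independent learner; otherwise hand the estimated prior to the
    prior-dependent learner A.  After each task, keep its first `vc`
    labeled points to refine the prior estimate for future tasks."""
    history, hypotheses = [], []
    for Z in tasks:                              # Z = labeled data stream for task t
        prior_hat, delta = estimate_prior(history)
        if delta > eps / 4:
            h = prior_indep_learner(eps, Z)      # prior-independent fallback
        else:
            h = A(eps / 2, prior_hat, Z)         # prior-dependent learning
        hypotheses.append(h)
        history.append(Z[:vc])                   # keep vc labeled points per task
    return hypotheses
```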

- Yonatan: I'll send you an email to summarize what we just discussed. - Liu: Thank you, but I have now invented a model to transfer knowledge with provable guarantees, so I use that all the time. - Yonatan: But that's an asymptotic guarantee. My life span is finite. So I'm still gonna send you an email.

Outline: Online Allocation and Pricing with Economies of Scale. - Jamie Dimon: Economies of scale are a good thing. If we didn't have them, we'd still be living in tents and eating buffalo.

Setting. Christmas season: - Nov: customer survey. - Dec: purchasing and selling. Buyers arrive online one at a time, with valuations on items sampled i.i.d. from some unknown distribution.

Thrifty Santa Claus. Each shopper wants only one item, though it might prefer some items to others. Buyers: binary valuations. Goal of the seller: satisfy everyone while minimizing the total cost to the seller.

Hardness: Set-Cover. If costs decrease much more rapidly, then even if all customers' valuations were known up front, this would be (roughly) a set-cover problem, and we could not hope to achieve cost o(log n) times optimal. Natural case: for each good, the cost (to the seller) of ordering T copies is sublinear in T. [Figure: production cost and marginal cost vs. #copies, for α = 0, α ∈ (0, 1), and α = 1.]

Thrifty Santa Claus: Results. Marginal costs non-increasing: there exists an optimal strategy of a simple form - order the items by some permutation; give each new buyer the earliest item it desires in the permutation (see the sketch below). What if n (#buyers) >> k (#items) AND the marginal cost does not decrease too rapidly (rate 1/T^α for 0 ≤ α < 1)? - We can efficiently perform the allocation with cost a constant factor greater than OPT.
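The permutation-based policy itself is easy to state in code; this is a hedged illustration with assumed names, not the optimality proof.

```python
def permutation_allocation(perm, arriving_buyers):
    """Fix a permutation of the items; give each arriving buyer the earliest
    item (in that permutation) that she desires.  With non-increasing
    marginal costs, some permutation makes this simple online policy optimal."""
    allocation = []
    for wants in arriving_buyers:                # each `wants` is a set of item ids
        item = next((j for j in perm if j in wants), None)
        allocation.append(item)                  # None if the buyer values nothing
    return allocation
```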

Algorithm. Alg: use the initial buyers to learn about the distribution, then determine how best to allocate to new buyers. If the cost fn is c(x) = Σ_{i=1}^{x} 1/i^α, for α ∈ [0, 1): - run greedy weighted set cover => total cost within roughly a 1/(1−α) factor of OPT. Essentially a smooth variant of set cover (see the sketch below). If the average cost is within some factor of the marginal cost, we have a greedy alg with a constant approximation ratio.
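A sketch of the greedy, weighted-set-cover style allocation under the cost fn c(x) = Σ_{i≤x} 1/i^α, assuming binary valuations given as sets of desired items and that every buyer values at least one item; names and tie-breaking are illustrative.

```python
def cost(copies, alpha):
    """Total production cost of `copies` units when the i-th copy costs 1/i^alpha."""
    return sum(1.0 / (i ** alpha) for i in range(1, copies + 1))

def greedy_allocate(wants, num_items, alpha):
    """Greedy, set-cover style: repeatedly pick the item whose next batch of
    copies satisfies the remaining buyers at the lowest cost per buyer.

    wants : dict buyer -> set of item ids that buyer values (binary valuations)
    """
    unsatisfied = set(wants)
    ordered = {j: 0 for j in range(num_items)}       # copies ordered so far
    allocation = {}
    while unsatisfied:
        best = None
        for j in range(num_items):
            covered = [b for b in unsatisfied if j in wants[b]]
            if not covered:
                continue
            extra = cost(ordered[j] + len(covered), alpha) - cost(ordered[j], alpha)
            ratio = extra / len(covered)             # marginal cost per new buyer
            if best is None or ratio < best[0]:
                best = (ratio, j, covered)
        if best is None:                             # some buyer values no item
            break
        _, j, covered = best
        for b in covered:
            allocation[b] = j
        ordered[j] += len(covered)
        unsatisfied -= set(covered)
    total = sum(cost(ordered[j], alpha) for j in ordered)
    return allocation, total
```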

Sample Complexity Analysis. How complicated does the allocation rule need to be to achieve good performance? Theorem (statement shown as a figure).

Outline: Factor Models for Correlated Auctions.

The Problem. An auctioneer sells a good to a group of n buyers. The seller wants to maximize his revenue. Each buyer maximizes his utility from getting the good: valuation − price. The seller doesn't know the exact valuations of the players; he knows the distribution D from which the vector of valuations (v_1, ..., v_n) is drawn.

Our Contribution. When D is a product distribution, Myerson gives a dominant-strategy truthful auction. For general correlated distributions, it is not known - how to create truthful auctions, - how to use player j's bid to capture info about player i. What if the correlation between buyer valuations is driven by common factors?

Example. Two firms produce the same type of good. Each firm's value is its production cost: it needs to hire workers (wage W) and rent capital (price Z). l_i: #workers firm i needs to produce one unit; k_i: the amount of capital firm i needs; ε_i: fixed costs unique to firm i. Firm costs: C_i = l_i·W + k_i·Z + ε_i. The firms' costs are correlated: they hire workers and rent capital from the same pool.

The Factor Model. Write the factor model as V = ΛF + U, where - V: the vector of observations, - Λ: the matrix of coefficients (factor loadings), - F: the vector of factors, - U: the vector of idiosyncratic components, independent of each other and of the factors.
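To make the notation concrete, here is a tiny simulation of the factor model; the Gaussian choices for F and U are my own illustrative assumptions, not specified on the slide.

```python
import numpy as np

def sample_valuations(Lambda, noise_scale=1.0):
    """Draw one valuation vector V = Lambda @ F + U:
    F : common factors shared by all bidders,
    U : idiosyncratic components, independent of each other and of F,
    Lambda : (n x k) matrix of factor loadings."""
    n, k = Lambda.shape
    F = np.random.randn(k)                    # common factors
    U = noise_scale * np.random.randn(n)      # idiosyncratic components
    return Lambda @ F + U

# e.g. two firms whose costs load on the same wage and capital factors:
# Lambda = np.array([[l1, k1], [l2, k2]]); costs = sample_valuations(Lambda)
```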

Discussion. It is possible that: - the designer and bidders might not know the common factors, - bidders might only know their own valuations, - the seller only knows the joint distribution of bidders' valuations. The seller can RECOVER the factor model by making inferences over observed bids. Aggregate information: the common factors are inferred from the collective knowledge of all players.

The Auction

The Auction (cont.) Thm: When correlation follows this factor model, this auction is dominant strategy truthful, ex-post individually rational, and asymptotically optimal.

Dominant Strategy Truthfulness. Toss a coin and choose between: - a 2nd-price auction: truthful; - a mechanism M that estimates the factors from a random set of bidders S: bidders in S receive utility 0 regardless of the allocation and prices output by M. Players in S are incentivized to be truthful by the (small) incentive they get from possibly participating in the 2nd-price auction.

Dominant Strategy Truthfulness (cont.) The remaining bidders R = {1, ..., n} \ S receive incentives from both the 2nd-price auction and mechanism M. M offers them allocation and price vectors x(b_R), p(b_R) by running Myerson(b_R, V_R | f̂) on these players' bids and on the conditional distributions estimated for them. No player in R can influence the estimated conditional distribution V_R | f̂, and Myerson's optimal auction is truthful.
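A schematic of the coin-toss mechanism, with `estimate_factors` and `myerson_conditional` as assumed black boxes standing in for the factor-estimation and Myerson steps; it only illustrates the information flow, not the actual pricing rule.

```python
import random

def factor_model_auction(bids, estimate_factors, myerson_conditional, q=0.5):
    """Schematic of the mechanism: with prob. q run a plain 2nd-price auction;
    otherwise use a random half S of the bidders only to estimate the common
    factors (they get utility 0), and run a Myerson-style auction for the
    remaining bidders R on the conditional distributions given f_hat."""
    n = len(bids)                                      # assumes n >= 2 bidders
    if random.random() < q:
        winner = max(range(n), key=lambda i: bids[i])
        price = sorted(bids, reverse=True)[1]          # 2nd-highest bid
        return {"winner": winner, "price": price}
    S = set(random.sample(range(n), n // 2))           # estimation set
    R = [i for i in range(n) if i not in S]
    f_hat = estimate_factors([bids[i] for i in S])
    allocation, prices = myerson_conditional([bids[i] for i in R], f_hat)
    # bidders in S receive nothing and pay nothing, regardless of the outcome
    return {"allocation": allocation, "prices": prices, "estimation_set": S}
```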

Thanks! © Liu Yang 2013. Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.