
1 Learning and smoothed analysis. Adam Kalai (Microsoft Research, Cambridge, MA), Shang-Hua Teng* (University of Southern California), Alex Samorodnitsky* (Hebrew University, Jerusalem). *while visiting Microsoft

2 In this talk… Revisit classic learning problems – e.g., learning DNFs from random examples (drawn from product distributions). Barrier = worst-case complexity. Solve in a new model! Smoothed analysis sheds light on the structure of hard problem instances. Also show: a DNF can be recovered from its heavy "Fourier coefficients".

3 P.A.C. learning AND's!? X = {0,1}ⁿ, f: X → {–1,+1}. PAC assumption: the target is an AND, e.g., f(x) = x2 ∧ x4 ∧ x7. Input: noiseless training data ⟨(x^j drawn from D, f(x^j))⟩_{j≤m}, e.g.:

        x1 x2 x3 x4 x5 x6 x7 x8 | f(x)
  x¹ :   1  1  1  0  0  1  0  1 |  +1
  x² :   1  0  0  1  1  1  1  1 |  –1
  x³ :   0  1  1  1  1  1  1  1 |  –1
  x⁴ :   1  1  1  0  0  0  0  1 |  +1
  x⁵ :   1  1  1  1  1  0  1  1 |  +1

[Valiant84]

4 P.A.C. learning AND's!? X = {0,1}ⁿ, f: X → {–1,+1}. PAC assumption: the target is an AND (here f(x) = x2 ∧ x4 ∧ x7). Input: noiseless training data ⟨(x^j from D, f(x^j))⟩_{j≤m}. Spam-filtering illustration: the attributes are word indicators NIGERIA, BANK, VIAGRA, ADAM, LASER, SALE, FREE, IN, and each training e-mail x¹,…,x⁵ is a row of YES/NO values labeled SPAM or LEGIT. [Valiant84]

5 P.A.C. learning AND's!? X = {0,1}ⁿ, f: X → {–1,+1}. PAC assumption: the target is an AND, e.g., f(x) = x2 ∧ x4 ∧ x7. Input: noiseless training data ⟨(x^j from D, f(x^j))⟩_{j≤m}, e.g.:

        x1 x2 x3 x4 x5 x6 x7 x8 | f(x)
  x¹ :   1  1  1  0  0  1  0  1 |  +1
  x² :   1  0  0  1  1  1  1  0 |  –1
  x³ :   0  1  1  1  1  1  1  1 |
  x⁴ :   0  1  1  0  0  0  0  1 |  –1
  x⁵ :   1  1  1  1  1  0  1  1 |

Output: h: X → {–1,+1} with err(h) = Pr_{x←D}[h(x) ≠ f(x)] ≤ ε. *OPTIONAL* "Proper" learning: h is an AND. Requirements: 1. succeed with probability ≥ 0.99; 2. m = # examples = poly(n/ε); 3. polynomial-time learning algorithm. [Valiant84]
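For context (not on the slides): the standard way to PAC-learn an AND in this model is the classic elimination algorithm. Below is a minimal Python sketch, assuming a monotone AND, noiseless {–1,+1} labels, and 0/1 attribute vectors; the toy data is hypothetical, not the slide's table. Negated literals can be handled the same way by also tracking candidate negative literals.

```python
# Minimal sketch of the elimination algorithm for learning a monotone AND:
# start from the conjunction of all variables and keep only those that are 1
# in every positive example.

def learn_and(examples):
    """examples: list of (x, y) with x a tuple of 0/1 bits and y in {-1, +1}."""
    n = len(examples[0][0])
    candidates = set(range(n))                  # variable indices still in the hypothesis
    for x, y in examples:
        if y == +1:                             # positive examples rule out variables set to 0
            candidates = {i for i in candidates if x[i] == 1}

    def h(x):
        return +1 if all(x[i] == 1 for i in candidates) else -1
    return h, candidates

# Hypothetical toy usage (target roughly x2 AND x4 AND x7, 1-indexed as on the slide):
data = [((1, 1, 1, 0, 0, 1, 0, 1), -1),
        ((1, 1, 0, 1, 0, 1, 1, 1), +1),
        ((0, 1, 1, 1, 1, 1, 1, 1), +1)]
h, kept = learn_and(data)                       # kept = indices the hypothesis still requires
```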

6 P.A.C. learning AND's!? Agnostic [Kearns Schapire Sellie92]. X = {0,1}ⁿ, f: X → {–1,+1}; the PAC assumption (target is an AND, e.g., f(x) = x2 ∧ x4 ∧ x7) is dropped. Input: training data ⟨(x^j from D, f(x^j))⟩_{j≤m} (table as on the previous slide, labels no longer assumed consistent with any AND). Output: h: X → {–1,+1} with err(h) = Pr_{x←D}[h(x) ≠ f(x)] ≤ opt + ε, where opt = min_{AND g} err(g). Requirements: 1. succeed with probability ≥ 0.99; 2. m = # examples = poly(n/ε); 3. polynomial-time learning algorithm.

7 Some related work (columns: PAC | agnostic).
– ANDs, e.g., x2∧x4∧x7∧x9: EASY | ?
– Decision trees (a small tree over x1, x2, x7, x9 is pictured): ?? | ??
– DNF, e.g., (x1∧x4)∨(x2∧x4∧x7∧x9): ?? | ??
With uniform D + membership queries: decision trees [Kushilevitz-Mansour'91; Goldreich-Levin'89], DNF [Jackson'94], agnostic decision trees [Gopalan-K-Klivans'08]; with membership queries [Bshouty'94].

8 Some related work (columns: PAC | agnostic), now over product distributions.
– ANDs, e.g., x2∧x4∧x7∧x9: EASY | ?
– Decision trees (a small tree over x1, x2, x7, x9 is pictured): ?? | ??
– DNF, e.g., (x1∧x4)∨(x2∧x4∧x7∧x9): ?? | ??
With product D + membership queries: decision trees [Kushilevitz-Mansour'91; Goldreich-Levin'89], DNF [Jackson'94], agnostic decision trees [Gopalan-K-Klivans'08]; with membership queries [Bshouty'94]. New entries: product D [KST'09] (smoothed analysis), with no membership queries.

9 Outline
1. PAC-learn decision trees over smoothed (constant-bounded) product distributions: describe a practical heuristic; define the smoothed product-distribution setting; structure of Fourier coefficients over random product distributions.
2. PAC-learn DNFs over smoothed (constant-bounded) product distributions: why a DNF can be recovered from its heavy coefficients (information-theoretically).
3. Agnostically learn decision trees over smoothed (constant-bounded) product distributions: rough idea of the algorithm.

10 Feature Construction "Heuristic". Approach: greedily learn a sparse polynomial, bottom-up, using least-squares regression. 1. Normalize the input (x¹,y¹),(x²,y²),…,(x^m,y^m) so that each attribute x_i has mean 0 and variance 1. 2. Initialize the feature set to {1, x1, x2, …, xn}. 3. Repeat m^{1/4} times: add to the feature set the product t·x_i, over features t already in the set and coordinates i, of minimum regression error. [SuttonMatheus91] (A sketch of this procedure appears below.)
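The following is a minimal numpy sketch of the heuristic above (my own rendering, not the authors' code): it grows a sparse multilinear polynomial bottom-up, at each round adding the product feature t·x_i that most reduces least-squares regression error. The number of rounds is left as a parameter rather than fixed to m^{1/4}.

```python
import numpy as np

def feature_construction(X, y, rounds=5):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = X.shape

    # 1. normalize each attribute to mean 0, variance 1
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant columns
    Z = (X - X.mean(axis=0)) / std

    # 2. start with the constant feature and the single coordinates
    feats = [np.ones(m)] + [Z[:, i] for i in range(n)]
    names = ["1"] + [f"x{i}" for i in range(n)]

    def sq_err(cols):                        # least-squares error of regressing y on cols
        A = np.column_stack(cols)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ coef
        return float(r @ r)

    # 3. greedily add the product t * x_i with minimum regression error
    for _ in range(rounds):
        best = None
        for t, tname in zip(list(feats), list(names)):
            for i in range(n):
                cand = t * Z[:, i]
                err = sq_err(feats + [cand])
                if best is None or err < best[0]:
                    best = (err, cand, f"{tname}*x{i}")
        feats.append(best[1])
        names.append(best[2])

    A = np.column_stack(feats)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return list(zip(names, coef))            # sparse polynomial as (feature, weight) pairs
```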

11 Guarantee for that Heuristic. For μ ∈ [0,1]ⁿ, let π_μ be the product distribution on {0,1}ⁿ with E_{x←π_μ}[x] = μ. Theorem 1. For any size-s decision tree f: {0,1}ⁿ → {–1,+1}, with probability ≥ 0.99 over uniformly random μ ∈ [0.49,0.51]ⁿ and m = poly(ns/ε) training examples ⟨(x^j, f(x^j))⟩_{j≤m} with x^j i.i.d. from π_μ, the heuristic outputs h with Pr_{x←π_μ}[sgn(h(x)) ≠ f(x)] ≤ ε.

12 Guarantee for that Heuristic. For μ ∈ [0,1]ⁿ, let π_μ be the product distribution on {0,1}ⁿ with E_{x←π_μ}[x] = μ. Theorem 1. For any size-s decision tree f: {0,1}ⁿ → {–1,+1} and any ν ∈ [.02,.98]ⁿ, with probability ≥ 0.99 over uniformly random μ ∈ ν + [–.01,.01]ⁿ and m = poly(ns/ε) training examples ⟨(x^j, f(x^j))⟩_{j≤m} with x^j i.i.d. from π_μ, the heuristic outputs h with Pr_{x←π_μ}[sgn(h(x)) ≠ f(x)] ≤ ε. *The same statement holds for the DNF algorithm.

13 [Diagram: the smoothed-analysis assumption. The adversary picks a target f: {0,1}ⁿ → {–1,1} (a small decision tree over x1, x2, x7, x9 is pictured) and a cube ν + [–.01,.01]ⁿ; nature picks μ in that cube, giving the product distribution π_μ; the learning algorithm receives (x^(1), f(x^(1))),…,(x^(m), f(x^(m))) i.i.d. from π_μ and must output h with Pr[h(x) ≠ f(x)] ≤ ε.]

14 “Hard” instance picture. [Diagram: the cube of μ ∈ [0,1]ⁿ, μ_i = Pr[x_i = 1], for a fixed tree f: {0,1}ⁿ → {–1,1}; red = the μ's (product distributions π_μ) on which the heuristic fails. Caption: the red set "can't be this" (a large region).]

15 “Hard” instance picture. [Same diagram: μ ∈ [0,1]ⁿ, μ_i = Pr[x_i = 1]; red = the μ's on which the heuristic fails.] Theorem 1: "hard" instances are few and far between, for any tree.

16 Fourier over product distributions. For x ∈ {0,1}ⁿ and μ ∈ [0,1]ⁿ, normalize each coordinate to mean 0 and variance 1: z_i = (x_i – μ_i) / √(μ_i(1–μ_i)). For S ⊆ [n], the basis function is χ_S(x) = ∏_{i∈S} z_i, and every f expands as f(x) = Σ_S f̂(S) χ_S(x) with f̂(S) = E_{x←π_μ}[f(x) χ_S(x)].

17 Heuristic over product distributions. (μ can easily be estimated from the data; any individual coefficient is easy to approximate from samples.) 1) …; 2) repeat m^{1/4} times: …, where S is chosen to maximize …. (A sketch of the estimation step appears below.)
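The two parenthetical claims can be made concrete. Here is a minimal sketch (using the standard product-distribution basis from the previous slide; not code from the paper) of estimating μ from the sample and then estimating a single coefficient f̂(S) by its empirical average. It assumes 0 < μ_i < 1 for every coordinate, as in the constant-bounded setting.

```python
import numpy as np

def estimate_coefficient(X, y, S):
    """X: m-by-n array of 0/1 samples from pi_mu; y: labels in {-1,+1}; S: set of indices."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = X.mean(axis=0)                                  # estimated mu
    Z = (X - mu) / np.sqrt(mu * (1.0 - mu))              # coordinates with mean 0, variance 1
    S = sorted(S)
    chi_S = np.prod(Z[:, S], axis=1) if S else np.ones(len(y))
    return float(np.mean(y * chi_S))                     # empirical estimate of f_hat(S)

# e.g. estimate_coefficient(X, y, {2, 4}) approximates the coefficient of chi_{2,4}.
```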

18 Example: f(x) = x2 ⊕ x4 ⊕ x9 (a small decision tree over x2, x4, x9 is pictured). For the uniform distribution, μ = (.5,.5,.5), with coordinates rescaled so x_i ∈ {–1,+1}, the expansion is simply f(x) = x2 x4 x9. For μ = (.4,.6,.55),† lower-order terms appear: f(x) = .9 x2x4x9 + .1 x2x4 + .3 x4x9 + .2 x2x9 + .2 x2 – .2 x4 + .1 x9. †figures not to scale. (A small numeric check appears below.)
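As a concrete check of this kind of expansion, here is a small sketch (my own, with one sign convention for the parity assumed; recall the slide's numbers are illustrative, "not to scale") that computes the exact product-distribution Fourier coefficients of the parity of the three relevant bits for a given μ, by enumerating all 8 settings.

```python
from itertools import combinations, product
from math import sqrt, prod

def parity_coefficients(mu):
    """mu: dict {coordinate: P[x_i = 1]}, e.g. {2: .4, 4: .6, 9: .55}."""
    idx = sorted(mu)
    coeffs = {}
    for r in range(len(idx) + 1):
        for S in combinations(idx, r):
            c = 0.0
            for bits in product([0, 1], repeat=len(idx)):
                b = dict(zip(idx, bits))
                weight = prod(mu[i] if b[i] else 1 - mu[i] for i in idx)  # P[x = b]
                f = 1 if sum(bits) % 2 == 1 else -1                       # parity, one convention
                chi = prod((b[i] - mu[i]) / sqrt(mu[i] * (1 - mu[i])) for i in S) if S else 1.0
                c += weight * f * chi
            coeffs[S] = round(c, 3)
    return coeffs

# parity_coefficients({2: .5, 4: .5, 9: .5}) gives coefficient 1 on {2, 4, 9} and 0 elsewhere,
# matching the uniform-distribution expansion f(x) = x2 x4 x9.
```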

19 Fourier structure over random product distributions. Lemma. For any f: {0,1}ⁿ → {–1,1}, α, β > 0, and d ≥ 1, …

20 Fourier structure over random product distributions. Lemma. For any f: {0,1}ⁿ → {–1,1}, α, β > 0, and d ≥ 1, … Lemma. Let p: Rⁿ → R be a degree-d multilinear polynomial with a leading coefficient of 1 (e.g., p(x) = x1x2x9 + .3x7 – 0.2). Then, for any ε > 0, …

21 An older perspective. [Kushilevitz-Mansour'91] and [Goldreich-Levin'89] find the heavy Fourier coefficients. They really use the fact that … Every decision tree is well approximated by its heavy coefficients, because … In the smoothed product-distribution setting, the heuristic finds the heavy (log-degree) coefficients.

22 Outline
1. PAC-learn decision trees over smoothed (constant-bounded) product distributions: describe a practical heuristic; define the smoothed product-distribution setting; structure of Fourier coefficients over random product distributions.
2. PAC-learn DNFs over smoothed (constant-bounded) product distributions: why a DNF can be recovered from its heavy coefficients (information-theoretically).
3. Agnostically learn decision trees over smoothed (constant-bounded) product distributions: rough idea of the algorithm.

23 Learning DNF. Adversary picks a DNF f(x) = C1(x) ∨ C2(x) ∨ … ∨ Cs(x) (and ν ∈ [.02,.98]ⁿ). Step 1: find f^{≥ε} (the heavy-coefficient part of f). [BFJKMR'94, Jackson'95]: "KM gives a weak learner," combined with careful boosting. Boosting cannot be used in the smoothed setting. Solution: learn the DNF from f^{≥ε} alone – design a robust membership-query DNF learning algorithm and give it query access to f^{≥ε}.

24 DNF learning algorithm. f(x) = C1(x) ∨ C2(x) ∨ … ∨ Cs(x), e.g., (x1∧x4) ∨ (x2∧x4∧x7∧x9). Each clause Ci is a "linear threshold function," e.g., x1∧x4 = sgn(x1 + x4 – 1.5). Uses the [KKanadeMansour'09] approach + other stuff. (See the small illustration below.)
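A tiny illustration of the threshold remark (sign convention assumed: sgn(t) = +1 if t > 0, else –1): on 0/1 inputs, the clause x1 ∧ x4 equals sgn(x1 + x4 – 1.5).

```python
def sgn(t):
    return 1 if t > 0 else -1

def clause_x1_and_x4(x1, x4):
    return sgn(x1 + x4 - 1.5)      # +1 exactly when x1 = x4 = 1

# Truth table over 0/1 inputs matches the conjunction:
assert [clause_x1_and_x4(a, b) for a in (0, 1) for b in (0, 1)] == [-1, -1, -1, 1]
```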

25 I’m a burier (of details) burier noun, pl. –s, One that buries.

26 DNF recoverable from heavy coefficients. Information-theoretic lemma (uniform distribution): for any s-term DNF f and any g: {0,1}ⁿ → {–1,1}, … Thanks, Madhu! Maybe similar to Bazzi/Braverman/Razborov?

27 DNF recoverable from heavy coefficients. Information-theoretic lemma (uniform distribution): for any s-term DNF f and any g: {0,1}ⁿ → {–1,1}, … Proof. f(x) = C1(x) ∨ … ∨ Cs(x), where f(x) ∈ {–1,1} but Cᵢ(x) ∈ {0,1}. …

28 Outline
1. PAC-learn decision trees over smoothed (constant-bounded) product distributions: describe a practical heuristic; define the smoothed product-distribution setting; structure of Fourier coefficients over random product distributions.
2. PAC-learn DNFs over smoothed (constant-bounded) product distributions: why heavy coefficients characterize a DNF.
3. Agnostically learn decision trees over smoothed (constant-bounded) product distributions: rough idea of the algorithm.

29 Agnostically learning decision trees. Adversary picks an arbitrary f: {0,1}ⁿ → {–1,+1} and ν ∈ [.02,.98]ⁿ. Nature picks μ ∈ ν + [–.01,.01]ⁿ. These determine the best size-s decision tree f*. Guarantee: get err(h) ≤ opt + ε, where opt = err(f*).

30 Agnostically learning decision trees. Design a robust membership-query learning algorithm that works as long as the queries are to some g where … Solve: … Robustness: … (Approximately) solved using the [GKK'08] approach.

31 The gradient-project descent algorithm (closely following [GopalanKKlivans'08]). Find f^{≥ε}: {0,1}ⁿ → R using the heuristic. Set h¹ = 0. For t = 1,…,T: h^{t+1} = proj_s(KM(…)). Output h(x) = sgn(h^t(x) – θ) for the t ≤ T and θ ∈ [–1,1] that minimize error on a held-out data set.

32–34 projection: proj_s(h) = … (definition shown pictorially; from [GopalanKKlivans'08])
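The slides show proj_s only pictorially. Below is a minimal sketch, assuming (as in [GopalanKKlivans'08]) that proj_s means Euclidean projection of the hypothesis's Fourier-coefficient vector onto the l1 ball of radius s (size-s decision trees have coefficient l1 norm at most s). In the loop of slide 31, each h^{t+1} would be obtained by applying this projection to the updated coefficient vector; the KM-based update itself is elided on the slide.

```python
import numpy as np

def proj_l1(v, s):
    """Project the coefficient vector v onto {w : ||w||_1 <= s}."""
    v = np.asarray(v, dtype=float)
    if np.abs(v).sum() <= s:
        return v.copy()                                     # already inside the ball
    u = np.sort(np.abs(v))[::-1]                            # magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - s)[0][-1]
    theta = (css[rho] - s) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)  # soft-threshold by theta

# e.g. proj_l1([3.0, 1.0], 2.0) -> array([2., 0.])
```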

35 Conclusions. Smoothed complexity [SpielmanTeng01] – a compromise between worst-case and average-case analysis – here given a novel application to learning over product distributions. Assumption: a not-completely-adversarial relationship between the target f and the distribution D; weaker than "margin" assumptions. Future work – non-product distributions – other applications of smoothed analysis.

36 Thanks! Sorry!

37 Average-case complexity [JacksonServedio05]. [JS05] give a polynomial-time algorithm that learns most decision trees under the uniform distribution on {0,1}ⁿ. Random decision trees are sometimes easier than real ones – "Random is not typical" (courtesy of Dan Spielman).

