Learning, testing, and approximating halfspaces Rocco Servedio Columbia University DIMACS-RUTCOR Jan 2009

Overview. Halfspaces over {-1,1}^n: testing, learning, approximation.

Joint work with: Ronitt Rubinfeld, Kevin Matulef, Ryan O’Donnell, Ilias Diakonikolas.

Approximation. Given a function f: {-1,1}^n → {-1,1}, the goal is to obtain a “simpler” function g such that dist(f, g) ≤ ε. Distance between functions is measured under the uniform distribution: dist(f, g) = Pr_x[f(x) ≠ g(x)].
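As a small aside, this distance is easy to estimate empirically by sampling. A minimal Python sketch (the names `estimated_distance`, `f`, `g` are illustrative, not from the talk):

```python
import random

def estimated_distance(f, g, n, samples=10000, seed=0):
    """Estimate dist(f, g) = Pr_x[f(x) != g(x)] under the uniform
    distribution on {-1,1}^n by sampling random inputs."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(samples):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        if f(x) != g(x):
            disagreements += 1
    return disagreements / samples
```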

Approximating classes of functions. Interested in statements of the form: “Every function in class C has a simple approximator.” Example statement: every size-s decision tree can be ε-approximated by a decision tree of depth log₂(s/ε).

Testing. Goal: infer a “global” property of a function via few “local” inspections. The tester makes black-box queries to an oracle for an arbitrary f, and must output “yes” whp if f ∈ C, and “no” whp if f is ε-far from every g ∈ C; any answer is OK if f is ε-close to some g ∈ C. Here distance is dist(f, g) = Pr_x[f(x) ≠ g(x)] as before. Usual focus: the information-theoretic number of queries required.

Some known property testing results (classes of functions over {-1,1}^n):
– parity functions [BLR93]
– degree-d polynomials [AKK+03]
– literals [PRS02]
– conjunctions [PRS02]
– k-juntas [FKRSS04]
– s-term monotone DNF [PRS02]
– s-term DNF [DLM+07]
– size-s decision trees [DLM+07]
– s-sparse polynomials [DLM+07]
In each case the number of queries depends only on the relevant size parameter and ε, not on n.

We’ll get to learning later

Halfspaces. A function f: {-1,1}^n → {-1,1} is a halfspace if there exist w_1, …, w_n, θ ∈ R such that f(x) = sign(w_1x_1 + … + w_nx_n − θ) for all x. Also called linear threshold functions (LTFs), threshold gates, etc. Fundamental to learning theory: halfspaces are at the heart of many learning algorithms (Perceptron, Winnow, boosting, Support Vector Machines, …). Well studied in complexity theory.

Some examples of halfspaces. Weights can be all the same, as in Majority(x) = sign(x_1 + … + x_n), but don’t have to be: e.g. sign(2^{n-1}x_1 + 2^{n-2}x_2 + … + 2x_{n-1} + x_n) is a decision list.
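A small Python sketch of the definition and of these two examples (the helper `halfspace` and the particular weight choices are illustrative, not from the slides):

```python
def halfspace(w, theta):
    """Return f(x) = sign(w_1*x_1 + ... + w_n*x_n - theta) on {-1,1}^n,
    taking sign(0) = +1."""
    def f(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else -1
    return f

# All weights the same: Majority on 5 variables.
maj = halfspace([1, 1, 1, 1, 1], 0)

# Geometrically decreasing weights: a decision-list-style halfspace
# (each weight outweighs the sum of all the weights below it).
declist = halfspace([16, 8, 4, 2, 1], 0)

print(maj([1, 1, -1, -1, 1]), declist([-1, 1, 1, 1, 1]))   # prints: 1 -1
```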

What’s a “simple” halfspace? Every halfspace has an equivalent representation with integer weights: the domain is finite, so we can “nudge” the weights to rational numbers and scale up to integers. But some halfspaces over {-1,1}^n require integer weights as large as 2^{Ω(n log n)} [MTT61, H94]. Low-weight halfspaces are nice for complexity and for learning.

Approximating halfspaces using small weights? Let f be an arbitrary halfspace. If g is a halfspace which ε-approximates f, how large do the weights of g need to be? Let’s warm up with a concrete example. Consider the comparison function f(x, y) = +1 iff x ≥ y, where x and y are viewed as n-bit binary numbers. This is a halfspace, f(x, y) = sign(Σ_i 2^{n−i}(x_i − y_i)), and any halfspace computing it exactly requires weight 2^{Ω(n)} … but it’s easy to ε-approximate it with weight poly(1/ε), by comparing only the top ~log(1/ε) bits.
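A quick numerical sanity check of this example (a sketch only; the 0/1 bit convention, parameter choices, and function names are mine, not the talk’s): compare two n-bit numbers exactly versus using only their top k ≈ log₂(1/ε) bits, and estimate the disagreement probability by sampling.

```python
import math
import random

def comparison(bits):
    """COMPARISON on 2n bits (0/1-valued, most significant bit first): the
    first n bits encode x, the last n encode y; output +1 iff x >= y.
    As a halfspace this needs weights 2^{n-1}, ..., 2, 1."""
    n = len(bits) // 2
    x, y = bits[:n], bits[n:]
    s = sum(2 ** (n - 1 - i) * (x[i] - y[i]) for i in range(n))
    return 1 if s >= 0 else -1

def truncated_comparison(bits, k):
    """Low-weight approximator: compare only the top k bits, so all weights
    are at most 2^{k-1}.  It can disagree with COMPARISON only when the top
    k bits of x and y coincide, which happens with probability 2^{-k}."""
    n = len(bits) // 2
    x, y = bits[:n], bits[n:]
    s = sum(2 ** (k - 1 - i) * (x[i] - y[i]) for i in range(k))
    return 1 if s >= 0 else -1

rng = random.Random(0)
n, eps, trials = 30, 0.01, 20000
k = math.ceil(math.log2(1 / eps))          # approximator's weights are ~ 1/eps
errs = sum(comparison(b) != truncated_comparison(b, k)
           for b in ([rng.randint(0, 1) for _ in range(2 * n)] for _ in range(trials)))
print(f"k = {k}, empirical disagreement = {errs / trials:.4f} (target eps = {eps})")
```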

Approximating all halfspaces using small weights? So there are halfspaces that require weight 2^{Ω(n)} to compute exactly but can be ε-approximated with weight poly(1/ε). Let f be an arbitrary halfspace. If g is a halfspace which ε-approximates f, how large do the weights of g need to be? Can every halfspace be approximated by a small-weight halfspace? Yes.

Every halfspace has a low-weight approximator. Theorem [S06]: let f be any halfspace. For any ε > 0 there is an ε-approximator g with integer weights, each of magnitude at most √n · 2^{Õ(1/ε²)}. How good is this bound? Can’t do better in terms of n: some dependence on n is unavoidable. And the dependence on ε must be exponential in a power of 1/ε [H94].

Idea behind the approximation. Let f = sign(w_1x_1 + … + w_nx_n − θ); WLOG |w_1| ≥ |w_2| ≥ … ≥ |w_n|. Key idea: look at how these weights decrease. If the weights decrease rapidly, then f is well approximated by a junta. If the weights decrease slowly, then f is “nice” – we can get a handle on the distribution of w·x.

A few more details. How do these weights decrease? Def: the critical index of w is the first index k such that w_k is “small relative to the remaining weights”: |w_k| ≤ τ·σ_k, where σ_k = sqrt(w_k² + … + w_n²) and τ = τ(ε) is a suitable threshold.
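A literal translation of this definition into code (a sketch; the parameter `tau` stands for the ε-dependent threshold that the talk leaves unspecified):

```python
import math

def critical_index(w, tau):
    """tau-critical index of a weight vector: assuming
    |w[0]| >= |w[1]| >= ... >= |w[n-1]|, return the first index k with
    |w[k]| <= tau * sqrt(w[k]^2 + ... + w[n-1]^2), i.e. the first weight
    that is small relative to the remaining weights.  Returns len(w) if
    no such index exists (all weights decrease rapidly)."""
    tail = sum(wi * wi for wi in w)      # sum_{i >= k} w[i]^2, updated as k grows
    for k, wk in enumerate(w):
        if abs(wk) <= tau * math.sqrt(tail):
            return k
        tail -= wk * wk
    return len(w)

# Rapidly decreasing weights -> large critical index; "smooth" weights -> 0.
print(critical_index([2.0 ** -i for i in range(20)], tau=0.3))   # 20
print(critical_index([1.0] * 20, tau=0.3))                       # 0
```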

Sketch of approximation: case 1. First case: the critical index is large, i.e. greater than some K = K(ε). The first K weights all decrease rapidly (the tail weight σ_j drops by a constant factor at each step), so the remaining weight after the first K variables is very small. Can show f is ε-close to its truncation onto the first K variables, so we can approximate f just by truncating. The truncation has only K relevant variables, so it can be expressed with integer weights each at most 2^{O(K log K)}.

Why does truncating work? Write H(x) = w_1x_1 + … + w_Kx_K for the “head” and T(x) = w_{K+1}x_{K+1} + … + w_nx_n for the “tail”; the truncation is sign(H(x) − θ). We have sign(H(x) + T(x) − θ) ≠ sign(H(x) − θ) only if either |T(x)| ≥ t or |H(x) − θ| < t, for a suitable threshold t. The first event is unlikely by a Hoeffding bound, since each tail weight is small; the second is unlikely by a more complicated argument (split the head into blocks; a symmetry argument on each block bounds the probability by ½; use independence across blocks).
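A small empirical illustration of this head/tail decomposition (a sketch with made-up weights and threshold, not the talk’s parameters): the disagreement probability is bounded by the probability that the tail is large plus the probability that the head is small.

```python
import random

rng = random.Random(2)
K, tail_len, trials = 8, 100, 20000

head_w = [2.0 ** -i for i in range(K)]       # rapidly decreasing "head" weights
tail_w = [0.002] * tail_len                  # small leftover "tail" weights
t = 0.05                                     # threshold separating the two bad events

disagree = bad_tail = bad_head = 0
for _ in range(trials):
    H = sum(w * rng.choice((-1, 1)) for w in head_w)
    T = sum(w * rng.choice((-1, 1)) for w in tail_w)
    disagree += (H + T >= 0) != (H >= 0)     # truncation changes the sign
    bad_tail += abs(T) >= t                  # unlikely: Hoeffding, tail weights tiny
    bad_head += abs(H) < t                   # unlikely: head rarely lands near 0
print(f"Pr[disagree] ~ {disagree / trials:.4f} <= "
      f"{bad_tail / trials:.4f} + {bad_head / trials:.4f}")
```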

Sketch of approximation: case 2. Second case: the critical index k is small (at most K(ε)), i.e. the weights w_k, …, w_n are “smooth”: no single one of them is large compared to the rest. Intuition: w_kx_k + … + w_nx_n behaves like a Gaussian. Can show it’s OK to round these weights to a coarse grid, giving small integer weights.

Why does rounding work? Let g = sign(v_1x_1 + … + v_nx_n − θ′), where each v_i is w_i rounded to the grid, so every |w_i − v_i| is small. We have f(x) ≠ g(x) only if either the total rounding error |Σ_i (w_i − v_i)x_i| is large or |w·x − θ| is small. The first event is unlikely by a Hoeffding bound, since each individual rounding error is small; the second is unlikely since a Gaussian is “anticoncentrated”: it puts little mass in any short interval.
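A numerical illustration of the rounding step (again a sketch: the Gaussian weights, grid size `m`, and helper names are mine): round “smooth” weights to small integers and estimate the disagreement probability by sampling.

```python
import random

def sign(v):
    return 1 if v >= 0 else -1

def round_weights(w, theta, m):
    """Round each weight to an integer multiple of max|w_i|/m, so the rounded
    halfspace uses integer weights of magnitude at most about m."""
    scale = max(abs(wi) for wi in w) / m
    return [round(wi / scale) for wi in w], round(theta / scale)

rng = random.Random(1)
n, m, trials = 200, 50, 20000
w = [rng.gauss(0.0, 1.0) for _ in range(n)]   # "smooth": no coordinate dominates
theta = 0.0
v, theta_r = round_weights(w, theta, m)

err = sum(
    sign(sum(wi * xi for wi, xi in zip(w, x)) - theta)
    != sign(sum(vi * xi for vi, xi in zip(v, x)) - theta_r)
    for x in ([rng.choice((-1, 1)) for _ in range(n)] for _ in range(trials))
) / trials
print(f"integer weights of magnitude <= {m}: empirical disagreement {err:.4f}")
```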

Sketch of approximation: case 2, continued. Recall: when the critical index k is small, the weights w_k, …, w_n are “smooth”, the sum behaves like a Gaussian, and it is OK to round those weights to small integers. We still need to deal with the first k − 1 (large) weights, but there are at most K(ε) of them, so handling them separately costs only an extra factor depending on ε. END OF SKETCH

Extensions. We saw: Theorem [S06]: for any halfspace f and any ε > 0 there is an ε-approximator with integer weights each at most √n · 2^{Õ(1/ε²)}. Recent improvement [DS09]: the 2^{Õ(1/ε²)} factor can be replaced with 2^{Õ(1/ε^{2/3})}. Useful notion: the influence of variable i is Inf_i(f) = Pr_x[f(x) ≠ f(x^{⊕i})], where x^{⊕i} is x with the i-th bit flipped. Standard fact: every halfspace has total influence Σ_i Inf_i(f) = O(√n) (but it can be much less).

Proof combines Littlewood–Offord type theorems on “anticoncentration” of w·x with delicate linear programming arguments, uses structural properties of halfspaces from testing & learning, and gives a new proof of the original bound that does not use the critical index. It can be viewed as an (exponential) sharpening of Friedgut’s theorem: Friedgut’s theorem says every Boolean f is ε-close to a function on 2^{O(Inf(f)/ε)} variables, where Inf(f) = Σ_i Inf_i(f). We show: every halfspace is ε-close to a function on only polynomially many (in Inf(f) and 1/ε) variables.

So halfspaces have low-weight approximators. What about testing? Use the approximation viewpoint: there are two possibilities, depending on the critical index. First case: the critical index is large, so f is close to a junta halfspace over K(ε) variables. Implicitly identify the junta variables (the high-influence ones) and do Occam-type “implicit learning” similar to [DLMORSW07] (building on [FKRSS02]): check every possible halfspace over the junta variables. If f is a halfspace, it will be close to some function you check; if f is far from every halfspace, it will be close to no function you check.

So halfspaces have low-weight approximators. What about testing? Second case: the critical index is small, so for most restrictions ρ of the high-influence variables the restricted function is “regular”: all of its weights & influences are small. Low-influence halfspaces have nice Fourier properties, so Fourier analysis can be used to check that each such restriction is close to a low-influence halfspace. Also need to check: cross-consistency of the different restrictions (are they close to low-influence halfspaces with the same weights?), and global consistency with a single set of high-influence weights.

A taste of Fourier. A helpful Fourier result about low-influence halfspaces. “Theorem” [MORS07]: let f be any Boolean function such that (i) all the degree-1 Fourier coefficients of f are small, and (ii) the degree-0 Fourier coefficient “synchs up” with the degree-1 coefficients. Then f is close to a halfspace.

A taste of Fourier, continued. “Theorem” [MORS07]: let f be any Boolean function such that (i) all the degree-1 Fourier coefficients of f are small, and (ii) the degree-0 Fourier coefficient “synchs up” with the degree-1 coefficients. Then f is close to a halfspace – in fact, close to the halfspace sign(f̂(1)x_1 + … + f̂(n)x_n − θ) whose weights are f’s own degree-1 coefficients. Useful for the soundness portion of the test.

Testing halfspaces. When all the dust settles: Theorem [MORS07]: the class of halfspaces over {-1,1}^n is testable with poly(1/ε) queries, independent of n.

What about learning? Learning halfspaces from random labeled examples is easy using polynomial-time linear programming (see the sketch below). But there are other, harder learning models:
1. The RFA model.
2. Agnostic learning under the uniform distribution.
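For concreteness, here is a minimal sketch of the standard LP route mentioned above (not the talk’s algorithm): find a halfspace consistent with random labeled examples by solving a feasibility linear program. It assumes NumPy and SciPy; `learn_halfspace_lp` is an illustrative name.

```python
import numpy as np
from scipy.optimize import linprog

def learn_halfspace_lp(X, y):
    """Find a halfspace consistent with labeled examples via linear
    programming.  X is an m x n array of {-1,1} examples, y an array of
    {-1,1} labels produced by some halfspace.  We solve the feasibility LP
        y_i * (w . x_i - theta) >= 1   for all i,
    which is feasible (after scaling) whenever some halfspace strictly
    separates the labeled sample."""
    m, n = X.shape
    # Variables: (w_1, ..., w_n, theta).  Constraint rows: -y_i * (x_i, -1).
    A_ub = -y[:, None] * np.hstack([X, -np.ones((m, 1))])
    b_ub = -np.ones(m)
    c = np.zeros(n + 1)                      # pure feasibility: any objective works
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1), method="highs")
    if not res.success:
        raise ValueError("no consistent halfspace found")
    return res.x[:n], res.x[n]

# Toy usage: labels generated by a random halfspace sign(w_true . x - 0.5).
rng = np.random.default_rng(0)
n, m = 20, 500
w_true = rng.standard_normal(n)
X = rng.choice([-1.0, 1.0], size=(m, n))
y = np.where(X @ w_true >= 0.5, 1.0, -1.0)
w_hat, theta_hat = learn_halfspace_lp(X, y)
print(np.mean(np.where(X @ w_hat - theta_hat >= 0, 1.0, -1.0) == y))  # 1.0 on the sample
```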

The RFA learning model. Introduced by [BDD92]: “restricted focus of attention”. Examples are drawn from the uniform distribution over {-1,1}^n, and for each labeled example the learner gets to choose one bit of the example that he can see (plus the label, of course). Goal: construct an ε-accurate hypothesis. Question [BDD92, ADJKS98, G01]: are halfspaces learnable in the RFA model?

The RFA learning model in action.
Learner: “May I have a random example, please?”
Oracle: “Sure, which bit would you like to see?”
Learner: “Oh, man… uh, x_7.”
Oracle: “Watch your manners. Here’s your example: (x_7, f(x)).”
Learner: “Thanks, I guess.”

Very brief Fourier interlude. Every f: {-1,1}^n → {-1,1} has a unique Fourier representation f(x) = Σ_{S ⊆ [n]} f̂(S) Π_{i∈S} x_i. The n+1 coefficients of degree at most 1, namely f̂(∅) = E[f(x)] and f̂({i}) = E[f(x)·x_i], are sometimes called the Chow parameters of f.

Another view of the RFA learning model. Again, every f: {-1,1}^n → {-1,1} has a unique Fourier representation f(x) = Σ_{S ⊆ [n]} f̂(S) Π_{i∈S} x_i, and the degree-0 and degree-1 coefficients are the Chow parameters of f. In the RFA model the learner, after choosing an index i, gets to see (x_i, f(x)) for a uniform random x. Not hard to see: in the RFA model, all the learner can really do is estimate the Chow parameters. With O(1/γ²) examples, the learner can estimate any given Chow parameter to additive accuracy ±γ (with high probability).
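A sketch of what “all the learner can do” looks like in code (the interface `one_rfa_oracle(i)`, returning the pair (x_i, f(x)) for a fresh uniform x, is my hypothetical modeling of the RFA example oracle):

```python
import random

def estimate_chow_parameters(one_rfa_oracle, n, samples_per_coord):
    """Estimate hat{f}(0) = E[f(x)] and hat{f}(i) = E[f(x) * x_i] from RFA
    examples.  one_rfa_oracle(i) is assumed to draw a uniform x in {-1,1}^n
    and return (x_i, f(x)) -- all the learner ever sees in this model.
    By a Chernoff/Hoeffding bound, each estimate is accurate to roughly
    +/- 1/sqrt(samples_per_coord) with high probability."""
    chow0, chow = 0.0, [0.0] * n
    for i in range(n):
        for _ in range(samples_per_coord):
            xi, label = one_rfa_oracle(i)
            chow[i] += label * xi
            chow0 += label
        chow[i] /= samples_per_coord
    chow0 /= n * samples_per_coord
    return chow0, chow

# A toy oracle for Majority on 11 bits (illustrative only).
def majority_oracle(i, rng=random.Random(1), n=11):
    x = [rng.choice((-1, 1)) for _ in range(n)]
    return x[i], (1 if sum(x) >= 0 else -1)

chow0, chow = estimate_chow_parameters(majority_oracle, 11, 2000)
print(round(chow0, 3), [round(c, 2) for c in chow])
```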

(Approximately) reconstructing halfspaces from their (approximate) Chow parameters. Perfect information about the Chow parameters suffices for halfspaces. Theorem [C61]: if f is a halfspace and g: {-1,1}^n → {-1,1} has ĝ(S) = f̂(S) for all |S| ≤ 1, then g = f. To solve the 1-RFA learning problem, we need a version of Chow’s theorem which is both robust and effective. Robust: we only get approximate Chow parameters (and can only hope for an approximation to f). Effective: we want an actual poly(n)-time algorithm!

Previous results. [ADJKS98] proved: Theorem: let f be a weight-W halfspace and let g be any Boolean function whose Chow parameters are each within a sufficiently small additive error (inverse polynomial in n, W and 1/ε) of those of f; then g is an ε-approximator for f. Good for low-weight halfspaces, but W could be as large as 2^{Θ(n log n)}. [Goldberg01] proved: Theorem: let f be any halfspace and let g be any function whose Chow parameters are each within a sufficiently small additive error of those of f; then g is an ε-approximator for f. A better bound for high-weight halfspaces, but the required accuracy is superpolynomially small in n. Neither of these results is algorithmic.

Robust, effective version of Chow’s theorem. Theorem [OS08]: for any constant ε and any halfspace f, given accurate enough approximations of the Chow parameters of f, the algorithm runs in Õ(n²) time and w.h.p. outputs a halfspace that is ε-close to f. This is the fastest runtime dependence on n of any algorithm for learning halfspaces, even in the usual random-examples model: the previous best runtime was a larger polynomial in n even for learning to constant accuracy, and any algorithm needs Ω(n) examples, i.e. Ω(n²) bits of input. Corollary [OS08]: halfspaces are learnable to any constant accuracy in poly(n) time in the RFA model.

A tool from testing halfspaces. Recall the helpful Fourier result about low-influence halfspaces. “Theorem”: let f be any function such that (i) all the degree-1 Fourier coefficients of f are small and (ii) the degree-0 Fourier coefficient synchs up with the degree-1 coefficients; then f is close to the halfspace whose weights are those degree-1 coefficients. We know (approximations to) exactly these coefficients in the RFA setting! So if f itself is a low-influence halfspace, this means we can plug the estimated degree-1 Fourier coefficients in as weights and get a good approximator – in polynomial time. We also need to deal with the high-influence case… a hassle, but doable.
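A toy end-to-end sketch of this idea for a balanced target (Majority), using estimated degree-1 coefficients as the weights. Choosing the threshold from the degree-0 coefficient in the unbalanced case is the part glossed over here, and all names and parameters are mine.

```python
import random

def sign(v):
    return 1 if v >= 0 else -1

rng = random.Random(0)
n, samples, tests = 25, 4000, 10000

def f(x):                      # target: Majority, a balanced low-influence halfspace
    return sign(sum(x))

# Estimate the degree-1 Fourier coefficients hat{f}(i) = E[f(x) * x_i] by sampling.
chow = [0.0] * n
for _ in range(samples):
    x = [rng.choice((-1, 1)) for _ in range(n)]
    y = f(x)
    for i in range(n):
        chow[i] += y * x[i] / samples

# Plug the estimates in as weights (threshold 0 because f is balanced).
def h(x):
    return sign(sum(c * xi for c, xi in zip(chow, x)))

agree = sum(f(x) == h(x)
            for x in ([rng.choice((-1, 1)) for _ in range(n)] for _ in range(tests)))
print(f"agreement with target: {agree / tests:.3f}")
```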

Recap of whole talk. Halfspaces over {-1,1}^n:
1. Every halfspace can be approximated to any constant accuracy with small integer weights.
2. Halfspaces can be tested with poly(1/ε) queries, independent of n.
3. Halfspaces can be efficiently learned from (approximations of) their degree-0 and degree-1 Fourier coefficients.

Future directions
– Better quantitative results (what is the right dependence on ε?) for testing, for approximating, and for learning from the Chow parameters.
– What about {approximating, testing, learning} w.r.t. other distributions? There is a rich theory of distribution-independent PAC learning, but a less fully developed theory of distribution-independent testing [HK03, HK04, HK05, AC06]. Things are harder there; what is doable? For example, [GS07]: any distribution-independent algorithm for testing whether f is a halfspace requires a number of queries that grows with n.

Thank you for your attention

II. Learning a concept class. Setup: the learner is given a sample of labeled examples (x, f(x)). The target function f ∈ C is unknown to the learner, and each example x in the sample is independent and uniform over {-1,1}^n. Goal: for every f ∈ C, with probability 1 − δ the learner should output a hypothesis h such that Pr_x[h(x) ≠ f(x)] ≤ ε. This is “PAC learning the concept class C under the uniform distribution”.