Lecture 4. Importance of being small


Lecture 4. Importance of being small Neural networks can approximate any function. Overfitting is a major concern. In some sense, one can view the development of deep learning as a sequence of attempts to reduce the (Kolmogorov) complexity of neural networks: CNN, RNN, dropout, regularization, and especially depth. Occam's Razor: commonly attributed to William of Ockham (1290--1349), and formulated about fifteen hundred years after Epicurus. In sharp contrast to the principle of multiple explanations, it states: Entities should not be multiplied beyond necessity. Commonly explained as: when you have a choice, choose the simplest theory. Bertrand Russell: "It is vain to do with more what can be done with fewer." Newton (Principia): "Natura enim simplex est, et rerum causis superfluis non luxuriat" (Nature is simple, and does not luxuriate in superfluous causes).

Example. Inferring a deterministic finite automaton (DFA) A DFA accepts: 1, 111, 11111, 1111111; and rejects: 11, 1111, 111111. What is it? There are actually infinitely many DFAs consistent with these data. The first DFA makes a nontrivial inductive inference; the second does not. The second one "overfits" the data and cannot make any further predictions. [The slide shows the two candidate DFA diagrams, omitted here.]
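To make the contrast concrete, here is a minimal sketch (Python, not from the slides): a two-state DFA that accepts strings with an odd number of 1s, versus a "memorizing" recognizer that accepts exactly the listed strings. Both are illustrative choices, not necessarily the automata drawn in the omitted diagrams; the point is that both fit the data, but only the small one predicts anything about unseen strings.

def simple_dfa(s: str) -> bool:
    """Two-state DFA: accept iff the number of 1s is odd."""
    state = 0
    for ch in s:
        if ch == '1':
            state = 1 - state
    return state == 1

def memorizing_dfa(s: str) -> bool:
    """Overfitting hypothesis: accept exactly the strings seen so far."""
    return s in {"1", "111", "11111", "1111111"}

accepts = ["1", "111", "11111", "1111111"]
rejects = ["11", "1111", "111111"]

# Both hypotheses are consistent with the data...
assert all(simple_dfa(s) for s in accepts) and not any(simple_dfa(s) for s in rejects)
assert all(memorizing_dfa(s) for s in accepts) and not any(memorizing_dfa(s) for s in rejects)

# ...but only the small DFA makes a prediction about an unseen string.
print(simple_dfa("1" * 9), memorizing_dfa("1" * 9))   # True False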

Example. History of Science Maxwell's (1831-1879) equations say that: (a) an oscillating magnetic field gives rise to an oscillating electric field; (b) an oscillating electric field gives rise to an oscillating magnetic field. Item (a) was known from M. Faraday's experiments. However, (b) is a theoretical inference by Maxwell, guided by his aesthetic appreciation of simplicity. The existence of such electromagnetic waves was demonstrated by the experiments of H. Hertz in 1888, 8 years after Maxwell's death, and this opened the new field of radio communication. Maxwell's theory is even relativistically invariant, long before Einstein's special relativity. In fact, it is likely that Maxwell's theory influenced Einstein's 1905 paper on relativity, which was actually titled 'On the electrodynamics of moving bodies'. J. Kemeny, a former assistant to Einstein, explains the transition from the special theory to the general theory of relativity: at the time, there were no new facts that failed to be explained by the special theory of relativity. Einstein was motivated purely by his conviction that the special theory was not the simplest theory that could explain all the observed facts. Reducing the number of variables obviously simplifies a theory. By the requirement of general covariance, Einstein succeeded in replacing the previous 'gravitational mass' and 'inertial mass' by a single concept. Double helix vs. triple helix --- 1953, Watson & Crick.

Bayesian Inference Bayes' formula: P(H|D) = P(D|H)P(H)/P(D). By Occam's razor, take P(H) = 2^(-K(H)) (the smallest hypothesis is the most likely). Taking -log, maximizing P(H|D) becomes minimizing -log P(D|H) + K(H) (modulo log P(D), which is a constant), where -log P(D|H) is the code length of D given H, and K(H) is the length of the smallest description of the model H (the Kolmogorov complexity of H).
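As a toy illustration of this two-part code (the data, the candidate models, and their assumed description lengths K(H) are all made up for the example, not taken from the lecture), consider choosing between a fair-coin model and a biased-coin model for a short sequence of coin flips:

import math

# Data: 20 coin flips, 15 heads.
n, heads = 20, 15

def neg_log_likelihood(p_heads: float) -> float:
    """-log2 P(D | H) for an i.i.d. Bernoulli model with bias p_heads."""
    return -(heads * math.log2(p_heads) + (n - heads) * math.log2(1 - p_heads))

# H1: "fair coin" -- essentially no parameters, say K(H1) ~ 2 bits (assumed).
# H2: "coin with bias 0.75" -- one specified parameter, say K(H2) ~ 10 bits (assumed).
candidates = {
    "fair coin (p=0.5)":    (0.5, 2.0),
    "biased coin (p=0.75)": (0.75, 10.0),
}

for name, (p, K) in candidates.items():
    total = neg_log_likelihood(p) + K
    print(f"{name}: -log P(D|H) + K(H) = {total:.1f} bits")

# The hypothesis with the smaller total code length is the MAP/MDL choice.

With these illustrative numbers the simpler model wins despite its worse fit, which is exactly the trade-off that the -log P(D|H) + K(H) criterion formalizes.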

Note The next four slides are only for theoretically oriented students. Here is an informal statement: given data (positive and negative examples drawn from a distribution D), if you find a model M that agrees with the data and whose size is polynomially smaller than the size of the data, then with high probability (according to D), M is approximately correct: it errs only on a set of small probability.

PAC Learning (L. Valiant, 1984) Fix a distribution over the sample space (P(v) for each v in the sample space). A concept class C is PAC-learnable (probably approximately correct learnable) iff there exists a learning algorithm A such that, for each f in C and each ε (0 < ε < 1), algorithm A halts in a polynomial number of steps and examples, and outputs a concept h in C which satisfies the following: with probability at least 1 - ε, Σ_{v: f(v) ≠ h(v)} P(v) < ε.
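The quantity Σ_{v: f(v) ≠ h(v)} P(v) is just the probability mass on which h disagrees with f. A minimal sketch, with an illustrative target f, hypothesis h, and distribution P (none of them from the lecture), computes it exactly and estimates it by sampling:

import random

random.seed(0)

def f(v: int) -> bool:          # target concept: "v is divisible by 3"
    return v % 3 == 0

def h(v: int) -> bool:          # learned hypothesis: also (wrongly) accepts v == 7
    return v % 3 == 0 or v == 7

def sample_v() -> int:          # distribution P: uniform over {0, ..., 99}
    return random.randrange(100)

# Exact error: sum of P(v) over the points where f and h disagree.
exact_error = sum(1 / 100 for v in range(100) if f(v) != h(v))

# Monte Carlo estimate of the same quantity.
m = 100_000
estimate = sum(f(v) != h(v) for v in (sample_v() for _ in range(m))) / m

print(f"exact error = {exact_error:.3f}, estimated error = {estimate:.3f}")  # both ~0.01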

Simplicity means understanding We will prove that, given a set of positive and negative data, any consistent concept of size 'reasonably' shorter than the size of the data is an 'approximately' correct concept with high probability. That is, if one finds a shorter representation of the data, then one learns. The shorter the conjecture, the more efficiently it explains the data, and hence the more precise the future predictions. Let α < 1, β ≥ 1, let m be the number of examples, and let s be the length (in bits) of the smallest concept in C consistent with the examples. An Occam algorithm is a polynomial-time algorithm which finds a hypothesis h in C consistent with the examples and satisfying K(h) ≤ s^β m^α.
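A minimal sketch of an Occam-style learner, under the assumption that the concept class is monotone conjunctions over n boolean attributes (an illustrative choice, not a class discussed in the lecture): enumerate hypotheses in order of increasing size, i.e. increasing description length, and return the first one consistent with the examples.

from itertools import combinations

def occam_conjunction(examples, n):
    """examples: list of (attribute tuple, label). Returns the smallest
    set of attribute indices whose conjunction is consistent, or None."""
    for size in range(n + 1):                       # shorter hypotheses first
        for attrs in combinations(range(n), size):
            h = lambda x: all(x[i] for i in attrs)  # conjunction of the chosen attributes
            if all(h(x) == y for x, y in examples):
                return attrs                        # first hit = shortest consistent
    return None

# Toy data labeled by the conjunction x0 AND x2 (4 attributes).
data = [((1, 0, 1, 0), True), ((1, 1, 1, 1), True),
        ((0, 1, 1, 0), False), ((1, 1, 0, 1), False), ((0, 0, 0, 0), False)]

print(occam_conjunction(data, 4))   # (0, 2) -- a short consistent hypothesis

A conjunction of k out of n attributes takes roughly k·log n bits to describe, so searching by size is searching by description length.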

Occam Razor Theorem (Blumer, Ehrenfeucht, Haussler, Warmuth; this Kolmogorov complexity version is from Li-Tromp-Vitanyi, IPL) Theorem. A concept class C is polynomially PAC-learnable if there is an Occam algorithm for it. Proof. Fix an error tolerance ε (0 < ε < 1). Choose m such that m ≥ max { (2s^β/ε)^(1/(1-α)), (2/ε) log(1/ε) }. I.e., m is polynomial in s and 1/ε. Let S be a set of r concepts, and let f be one of them. Claim. The probability that some concept h in S satisfies P(f ≠ h) ≥ ε and is consistent with m independent examples of f is less than (1 - ε)^m r. Proof: Let E_h be the event that hypothesis h agrees with all m examples of f. If P(h ≠ f) ≥ ε, then h is a bad hypothesis; that is, h and f disagree with probability at least ε on a random example. The set of bad hypotheses is denoted B. Since the m examples of f are independent, P(E_h) ≤ (1 - ε)^m. Since there are at most r bad hypotheses, P(∪_{h in B} E_h) ≤ (1 - ε)^m r. QED (Claim)

Proof of the theorem continues The postulated Occam algorithm finds a hypothesis of Kolmogorov complexity at most s^β m^α. The number r of hypotheses of this complexity satisfies log r ≤ s^β m^α. By the assumption on m, r ≤ (1 - ε)^(-m/2) (use ε < -log(1 - ε) < ε/(1 - ε) for 0 < ε < 1). Using the Claim, the probability of producing a hypothesis with error larger than ε is less than (1 - ε)^m r ≤ (1 - ε)^(m/2) < ε. The last inequality is by substituting m. QED
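As a quick numeric sanity check of the bound (with illustrative values of s, α, β, and ε that are assumptions, not from the lecture), the sketch below computes the required m and evaluates the failure bound (1 - ε)^m · r, with log r = s^β m^α, in log space:

import math

def occam_sample_bound(s, alpha, beta, eps):
    """m >= max{ (2 s^beta / eps)^(1/(1-alpha)), (2/eps) * log(1/eps) }, logs base 2."""
    m1 = (2 * s**beta / eps) ** (1 / (1 - alpha))
    m2 = (2 / eps) * math.log2(1 / eps)
    return math.ceil(max(m1, m2))

s, alpha, beta, eps = 50, 0.5, 1.0, 0.1   # illustrative parameter values
m = occam_sample_bound(s, alpha, beta, eps)

# Failure bound from the Claim: (1 - eps)^m * r, where log2 r <= s^beta * m^alpha.
# Worked in log space, since the raw numbers over/underflow ordinary floats.
log2_r = s**beta * m**alpha
log2_failure = m * math.log2(1 - eps) + log2_r

print(f"m = {m}")
print(f"log2 of failure bound = {log2_failure:.0f}, log2(eps) = {math.log2(eps):.2f}")
# The failure bound is (vastly) below eps, as the theorem promises.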