1 Theory of Differentiation in Statistics Mohammed Nasser Department of Statistics

2 Relation between Statistics and Differentiation

Statistical Concept/Technique — Use of Differentiation Theory
- Study of shapes of univariate pdfs — an easy application of first-order and second-order derivatives
- Calculation/stabilization of the variance of a random variable — an application of Taylor's theorem
- Calculation of moments from the MGF/CF — differentiating the MGF/CF

3 Relation between Statistics and Differentiation (continued)

- Description of a density / a model — differential equations such as dy/dx = k, dy/dx = kx
- Optimization of a risk functional / regularized functional / empirical risk functional, with or without constraints — needs heavy tools of nonlinear optimization: techniques that depend on multivariate differential calculus and functional differential calculus
- Influence function to assess robustness of a statistical functional — an easy application of the directional derivative in function space

4 Relation between Statistics and Differentiation (continued)

- Classical delta theorem to find asymptotic distributions — an application of the ordinary Taylor's theorem
- Von Mises calculus — extensive application of functional differential calculus
- Relation between probability measures and probability density functions — the Radon–Nikodym theorem

5 Monotone Functions f(x)

- Monotone increasing: strictly increasing, or non-decreasing
- Monotone decreasing: strictly decreasing, or non-increasing

6 Increasing/Decreasing Test

If f′(x) > 0 for all x in an interval, then f is increasing on that interval; if f′(x) < 0 for all x in the interval, then f is decreasing on it.

7 Example of a Monotone Increasing Function [Figure: graph of a monotone increasing function]

8 Maximum/Minimum

[Figure: a function on the closed interval [a, b]]

Is there any sufficient condition that guarantees existence of a global max, a global min, or both?

9 Some Results to Mention

- If a function is continuous and its domain is compact, the function attains its extrema. This is a very general result: it holds on any compact space, not only on compact subsets of R^n.
- Any convex (concave) function attains its global minimum (maximum).
- Some functions have a global minimum (maximum) without satisfying any of the above conditions.

Firstly, prove the existence of an extremum; then calculate it.

10 What Does f′ Say about f?

Fermat's Theorem: if f has a local maximum or minimum at c, and if f′(c) exists, then f′(c) = 0. The converse is not true: for example, f(x) = x³ has f′(0) = 0 but no extremum at 0.

11 Concavity

If f″(x) < 0 for all x in (a, b), then the graph of f is concave on (a, b); if f″(x) > 0 for all x in (a, b), it is convex. If f″ changes sign at c, then f has a point of inflection at c.

12 Maximum/Minimum

Let f(x) be a differentiable function on an interval I.

- f has a maximum at an interior point c if f′(c) = 0 and f″(c) < 0, and a minimum if f′(c) = 0 and f″(c) > 0.
- If f′(x) ≤ 0 for all x in an interval, then f is maximum at the first endpoint of the interval if the left side is closed, and minimum at the last endpoint if the right side is closed.
- If f′(x) ≥ 0 for all x in an interval, then f is minimum at the first endpoint of the interval if the left side is closed, and maximum at the last endpoint if the right side is closed.

13 Normal Distribution

The probability density function is given as

f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)),  x ∈ R.

- f is continuous on R
- f(x) ≥ 0
- f is differentiable on R

[Figure: bell curve, concave near the center and convex in the tails, with points of inflection]

14 Normal Distribution

Taking logs on both sides:

log f(x) = −log(σ√(2π)) − (x − μ)²/(2σ²).

Setting the first derivative equal to zero:

d/dx log f(x) = −(x − μ)/σ² = 0  ⟹  x = μ.

Now, f′(x) = −f(x)(x − μ)/σ² is positive for x < μ and negative for x > μ.

15 Normal Distribution

Therefore f is maximum at x = μ, with maximum value f(μ) = 1/(σ√(2π)).

16 Normal Distribution

Setting the second derivative equal to zero:

f″(x) = f(x)[(x − μ)²/σ⁴ − 1/σ²] = 0  ⟹  (x − μ)² = σ²  ⟹  x = μ ± σ.

Therefore f has points of inflection at x = μ ± σ.
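A quick symbolic check of these two results (a minimal sketch using SymPy; not part of the original slides):

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# Normal pdf
f = sp.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))

# Critical point: f'(x) = 0 gives the mode x = mu
print(sp.solve(sp.diff(f, x), x))       # [mu]

# Inflection points: f''(x) = 0 gives x = mu - sigma, mu + sigma
print(sp.solve(sp.diff(f, x, 2), x))    # [mu - sigma, mu + sigma]
```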

17 Logistic Distribution

The (standard logistic) distribution function is defined as

F(x) = 1/(1 + e^(−x)),  x ∈ R.

[Figure: S-shaped curve, convex for x < 0 and concave for x > 0]

18 Logistic Distribution

Taking the first derivative with respect to x:

F′(x) = e^(−x)/(1 + e^(−x))² > 0 for all x.

Therefore F is strictly increasing.

Taking the second derivative and setting it equal to zero:

F″(x) = e^(−x)(e^(−x) − 1)/(1 + e^(−x))³ = 0  ⟹  e^(−x) = 1  ⟹  x = 0.

Therefore F has a point of inflection at x = 0.

19 Logistic Distribution

Since F′(x) > 0 for all x, F has no maximum and no minimum. Since F″(x) > 0 for x < 0 and F″(x) < 0 for x > 0, F is convex on (−∞, 0) and concave on (0, ∞).
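The same kind of symbolic check works here (a sketch with SymPy, assuming the standard logistic CDF above):

```python
import sympy as sp

x = sp.symbols('x', real=True)
F = 1 / (1 + sp.exp(-x))              # standard logistic CDF

F1 = sp.simplify(sp.diff(F, x))       # first derivative (the logistic pdf)
# Holds on all of R, so F is strictly increasing:
print(sp.solve_univariate_inequality(F1 > 0, x))

F2 = sp.diff(F, x, 2)
print(sp.solve(F2, x))                # [0]: the point of inflection
```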

20 Variance of a Function of a Poisson Variate Using Taylor's Theorem

We know that if X ~ Poisson(λ), then E(X) = Var(X) = λ. We are interested in finding the variance of a smooth function g(X).

21 Variance of a Function of a Poisson Variate Using Taylor's Theorem

The Taylor series expansion of g about λ is

g(X) ≈ g(λ) + (X − λ)g′(λ).

Therefore the variance of g(X) is approximately

Var(g(X)) ≈ [g′(λ)]² Var(X) = λ[g′(λ)]².

For example, with g(x) = √x we get Var(√X) ≈ λ · 1/(4λ) = 1/4, free of λ: the square root is a variance-stabilizing transformation for the Poisson.
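A quick Monte Carlo check of this delta-method approximation (a minimal sketch with NumPy; the choice g(x) = √x and the values of λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

for lam in (5, 20, 100):
    x = rng.poisson(lam, size=1_000_000)
    # The delta method predicts Var(sqrt(X)) ~ 1/4, independent of lambda
    print(lam, np.sqrt(x).var())
```

The sample variances come out close to 0.25, and the approximation improves as λ grows.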

22 Risk Functional

Risk functional: R_{L,P}(g) = E_P[L(X, Y, g(X))] = ∫ L(x, y, g(x)) dP(x, y)

Population regression functional / classifier: g* = argmin_g R_{L,P}(g)

- From a sample D, we will select g_D by a learning method (???)
- P is chosen by nature; L is chosen by the scientist
- Both R_{L,P}(g*) and g* are unknown

23 Empirical Risk Minimization

Empirical risk functional: R_emp(g) = (1/n) Σ_{i=1}^n L(x_i, y_i, g(x_i))

Empirical risk minimization selects g_D = argmin_g R_emp(g).

Problems of empirical risk minimization: minimized over too rich a class of functions, the solution can fit the sample perfectly yet generalize poorly (overfitting).

24 What Can We Do?

- We can restrict the set of functions over which we minimize the empirical risk functional (structural risk minimization).
- We can modify the criterion to be minimized, e.g. by adding a penalty for 'complicated' functions (regularization).
- We can combine the two.

25 Regularized Error Function

In linear regression, we minimize the error function

Σ_n (y_n − w^T x_n − b)² + (λ/2)‖w‖².

Replace the quadratic error function by the ε-insensitive error function

E_ε(r) = 0 if |r| < ε, and |r| − ε otherwise,

and minimize C Σ_n E_ε(w^T x_n + b − y_n) + (1/2)‖w‖².

[Figure: the quadratic error function vs. an ε-insensitive error function, flat on (−ε, ε)]
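For concreteness, the ε-insensitive loss is easy to write down (a minimal sketch in NumPy; the function name and test values are mine, not from the slides):

```python
import numpy as np

def eps_insensitive(residual, eps=0.1):
    """epsilon-insensitive loss: zero inside the tube, linear outside."""
    return np.maximum(0.0, np.abs(residual) - eps)

# Residuals inside (-0.1, 0.1) cost nothing; larger ones grow linearly.
print(eps_insensitive(np.array([-0.5, -0.05, 0.0, 0.05, 0.5])))
# [0.4 0.  0.  0.  0.4]
```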

26 Linear SVR: Derivation — meaning of equation (3) [figure not transcribed]

27 Linear SVR: Derivation [Figure: trade-off between "tube" complexity and the sum of errors; Case I vs. Case II]

28 Linear SVR: Derivation — the role of C [Figure: "tube" complexity vs. sum of errors in Cases I and II; fits with small C vs. big C]

29 Linear SVR: Derivation

Minimize (1/2)‖w‖² + C Σ_n (ξ_n + ξ_n*)

Subject to:
y_n − w^T x_n − b ≤ ε + ξ_n
w^T x_n + b − y_n ≤ ε + ξ_n*
ξ_n, ξ_n* ≥ 0

30 Lagrangian

Minimize

L = (1/2)‖w‖² + C Σ_n (ξ_n + ξ_n*) − Σ_n (μ_n ξ_n + μ_n* ξ_n*)
  − Σ_n α_n (ε + ξ_n − y_n + w^T x_n + b)
  − Σ_n α_n* (ε + ξ_n* + y_n − w^T x_n − b)

with dual variables α_n, α_n*, μ_n, μ_n* ≥ 0.

31 Dual Form of the Lagrangian

Maximize (over α_n, α_n* with 0 ≤ α_n, α_n* ≤ C and Σ_n (α_n − α_n*) = 0):

−(1/2) Σ_n Σ_m (α_n − α_n*)(α_m − α_m*) x_n^T x_m − ε Σ_n (α_n + α_n*) + Σ_n (α_n − α_n*) y_n

Prediction can be made using:

f(x) = Σ_n (α_n − α_n*) x_n^T x + b

32 How to determine b?

The Karush–Kuhn–Tucker (KKT) conditions imply, at the optimal solution:

α_n (ε + ξ_n − y_n + w^T x_n + b) = 0
α_n* (ε + ξ_n* + y_n − w^T x_n − b) = 0
(C − α_n) ξ_n = 0
(C − α_n*) ξ_n* = 0

For any point with 0 < α_n < C we have ξ_n = 0, so b = y_n − ε − w^T x_n. Support vectors are points that lie on the boundary of the tube or outside it. These equations imply many important things.

33 Important Interpretations

- α_n α_n* = 0: a point cannot lie above and below the tube at once, so at most one of α_n, α_n* is nonzero.
- Points strictly inside the tube have α_n = α_n* = 0.
- Points outside the tube have α_n = C or α_n* = C.

34 Support Vectors: The Sparsity of the SV Expansion

Points strictly inside the tube have α_n = α_n* = 0 and drop out of the expansion w = Σ_n (α_n − α_n*) x_n: the solution is sparse, and only the support vectors contribute.

35 Dual Form of the Lagrangian (Nonlinear Case)

Maximize (over α_n, α_n* with 0 ≤ α_n, α_n* ≤ C and Σ_n (α_n − α_n*) = 0):

−(1/2) Σ_n Σ_m (α_n − α_n*)(α_m − α_m*) k(x_n, x_m) − ε Σ_n (α_n + α_n*) + Σ_n (α_n − α_n*) y_n

Prediction can be made using:

f(x) = Σ_n (α_n − α_n*) k(x_n, x) + b

36 Non-linear SVR: Derivation

Minimize (1/2)‖w‖² + C Σ_n (ξ_n + ξ_n*)

Subject to:
y_n − w^T φ(x_n) − b ≤ ε + ξ_n
w^T φ(x_n) + b − y_n ≤ ε + ξ_n*
ξ_n, ξ_n* ≥ 0

where φ is the feature map with k(x, x′) = φ(x)^T φ(x′).

37 Non-linear SVR: Derivation

A saddle point of the Lagrangian L has to be found: minimize with respect to the primal variables w, b, ξ_n, ξ_n*, and maximize with respect to the dual variables α_n, α_n*, μ_n, μ_n*.

38 Non-linear SVR: Derivation

Setting the derivatives of L with respect to the primal variables to zero:

∂L/∂w = 0  ⟹  w = Σ_n (α_n − α_n*) φ(x_n)
∂L/∂b = 0  ⟹  Σ_n (α_n − α_n*) = 0
∂L/∂ξ_n = 0  ⟹  C = α_n + μ_n
∂L/∂ξ_n* = 0  ⟹  C = α_n* + μ_n*

Substituting these back into L yields the dual form above.
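In practice this optimization is solved by off-the-shelf software; here is a minimal usage sketch with scikit-learn's SVR (the data and hyperparameter values are illustrative, not from the slides):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

# epsilon controls the tube width, C the error penalty,
# and the RBF kernel plays the role of k(x_n, x_m).
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

print(model.support_.size, "support vectors out of", X.shape[0])
print(model.predict([[2.5]]))   # f(x) = sum_n (a_n - a_n*) k(x_n, x) + b
```

Note the sparsity discussed on slide 34: only a fraction of the 200 training points end up as support vectors.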

39 What is Differentiation?

Let U be a Banach space, V another Banach space, and f: U → V a nonlinear function. Differentiation is nothing but local linearization: in differentiation we approximate a non-linear function locally by a (continuous) linear function.

40 Fréchet Derivative

Definition 1: f: R^n → R^m is differentiable at x if there is a linear map df(x): R^n → R^m such that

lim_{‖h‖→0} ‖f(x + h) − f(x) − df(x)(h)‖ / ‖h‖ = 0.

This can easily be generalized to Banach-space-valued functions f: B_1 → B_2, where df(x) is a linear map. But it can be shown that a linear map between infinite-dimensional spaces is not always continuous.

41 Fréchet Derivative

We have just mentioned that Fréchet recognized that Definition 1 could easily be generalized to normed spaces in the following way: f is Fréchet differentiable at x with derivative df(x) if

lim_{‖h‖→0} ‖f(x + h) − f(x) − df(x)(h)‖ / ‖h‖ = 0,   (2)

where df(x) ∈ L(B_1, B_2), the set of all continuous linear functions between B_1 and B_2. If we write the remainder of f at x + h as

Rem(x + h) = f(x + h) − f(x) − df(x)(h),

42 S-Derivative

then (2) becomes ‖Rem(x + h)‖ / ‖h‖ → 0 as ‖h‖ → 0.

Soon the definition was generalized (S-differentiation) to general topological vector spaces in such a way that: i) a particular case of the definition becomes equivalent to the previous definition when the domain of f is a normed space; ii) the Gâteaux derivative remains the weakest derivative among all types of S-differentiation.

43 S-Derivatives

Definition 2: Let S be a collection of subsets of B_1 and let t ∈ R. Then f is S-differentiable at x with derivative df(x) if, for each A ∈ S,

[f(x + th) − f(x)] / t → df(x)(h) as t → 0, uniformly in h ∈ A.

Definition 3:
- When S = all singletons of B_1, f is called Gâteaux differentiable, with Gâteaux derivative.
- When S = all compact subsets of B_1, f is called Hadamard or compactly differentiable, with Hadamard or compact derivative.
- When S = all bounded subsets of B_1, f is called Fréchet or boundedly differentiable, with Fréchet or bounded derivative.

44 Equivalent Definitions of the Fréchet Derivative

(a) For each bounded set A ⊆ B_1, [f(x + th) − f(x)] / t → df(x)(h) as t → 0 in R, uniformly in h ∈ A.

(b) For each bounded sequence {h_n} ⊂ B_1 and each sequence t_n → 0 in R, [f(x + t_n h_n) − f(x)] / t_n − df(x)(h_n) → 0.

45 Equivalent Definitions of the Fréchet Derivative (continued)

(c) [f(x + th) − f(x)] / t → df(x)(h) as t → 0, uniformly in h with ‖h‖ ≤ 1.

(d) ‖f(x + h) − f(x) − df(x)(h)‖ / ‖h‖ → 0 as ‖h‖ → 0.

(e) f(x + h) = f(x) + df(x)(h) + o(‖h‖). Statisticians generally use this form, or some slight modification of it.

46 Relations among the Usual Forms of the Definitions

The set of Gâteaux differentiable functions at x ⊇ the set of Hadamard differentiable functions at x ⊇ the set of Fréchet differentiable functions at x.

In applications, to find a Fréchet or Hadamard derivative we should generally first determine the form of the derivative by deducing the Gâteaux derivative acting on h, df(x)(h), for a collection of directions h which span B_1. This reduces to computing the ordinary derivative (with respect to t ∈ R) of the mapping t ↦ f(x + th), which is closely related to the influence function, one of the central concepts in robust statistics.

It can easily be shown that: (i) when B_1 = R with the usual norm, all three derivatives coincide; (ii) when B_1 is a finite-dimensional Banach space, the Fréchet and Hadamard derivatives are equal, and the two coincide with the familiar total derivative.

47 Properties of the Fréchet Derivative

- Hadamard differentiability implies continuity, but Gâteaux differentiability does not.
- Hadamard differentiability satisfies the chain rule, but Gâteaux differentiability does not.
- Meaningful versions of the Mean Value Theorem, Inverse Function Theorem, Taylor's Theorem and Implicit Function Theorem have been proved for the Fréchet derivative.

49 Lebesgue and counting measures [slide content not transcribed]

50 Mathematical Foundations of Robust Statistics

Qualitative robustness is a continuity requirement on the statistical functional T: if d_1(F, G) < δ, then d_2(T(F), T(G)) < ε.

Differentiability of T gives the local linearization

T(G) ≈ T(F) + dT_F(G − F),  i.e.  T(G) − T(F) ≈ ∫ IF(x; T, F) d(G − F)(x),

where IF is the influence function.
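To make the directional-derivative idea concrete, here is a small numerical sketch (mine, not from the slides): the Gâteaux derivative of the mean functional T(F) = ∫ x dF in the direction of a point mass δ_z is the influence function IF(z) = z − T(F), which we can approximate by finite differences under contamination F_t = (1 − t)F + tδ_z.

```python
import numpy as np

def T_mean(points, weights):
    """The mean functional T(F) for a discrete distribution F."""
    return np.sum(weights * points)

# A discrete 'F': uniform weights on a sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.full(5, 0.2)

z, t = 10.0, 1e-6   # contamination point and step size

# F_t = (1 - t) F + t * delta_z, pushed through T
Tt = T_mean(np.append(x, z), np.append((1 - t) * w, t))
IF_numeric = (Tt - T_mean(x, w)) / t

print(IF_numeric, "vs exact", z - T_mean(x, w))   # both ~ 7.0
```

The unbounded growth of IF(z) in z is exactly why the mean is not robust.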

54 Given a Measurable Space (Ω, F)

There exist many measures on F. If Ω is the real line, the standard measure is "length": that is, the measure of each interval is its length. This is known as "Lebesgue measure". The σ-algebra must contain intervals. The smallest σ-algebra that contains all open sets (and hence intervals) is called the "Borel" σ-algebra and is denoted B. A course in real analysis will deal a lot with the measurable space.

55 Given a Measurable Space (Ω, F)

A measurable space combined with a measure is called a measure space. If we denote the measure by μ, we would write the triple (Ω, F, μ). Given a measure space (Ω, F, μ), if we decide instead to use a different measure, say ν, then we call this a "change of measure". (We should just call this using another measure!)

Let μ and ν be two measures on (Ω, F). Then:
- ν is "absolutely continuous" with respect to μ (notation: ν ≪ μ) if μ(A) = 0 implies ν(A) = 0 for all A ∈ F.
- μ and ν are "equivalent" if ν ≪ μ and μ ≪ ν.

56 The Radon–Nikodym Theorem

If ν ≪ μ, then ν is actually the integral of a function with respect to μ:

ν(A) = ∫_A g dμ for all A ∈ F.

g is known as the Radon–Nikodym derivative and is denoted g = dν/dμ.
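For two discrete measures the Radon–Nikodym derivative is just a pointwise ratio of masses; a minimal sketch (the numbers are illustrative):

```python
import numpy as np

# Two measures on the finite space {0, 1, 2, 3}, with nu << mu
mu = np.array([0.25, 0.25, 0.25, 0.25])
nu = np.array([0.10, 0.40, 0.50, 0.00])

g = nu / mu                    # the Radon-Nikodym derivative dnu/dmu

# Check nu(A) = integral over A of g dmu, for A = {1, 2}
A = [1, 2]
print(np.sum(g[A] * mu[A]), "==", np.sum(nu[A]))   # 0.9 == 0.9
```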

57 The Radon–Nikodym Theorem

If ν ≪ μ, then ν is actually the integral of a function with respect to μ.

Idea of proof: create the function g through its superlevel sets. Consider the set function ν − γμ (this is actually a signed measure). Choose γ ≥ 0 and let A_γ be the largest set such that (ν − γμ)(A) ≥ 0 for all A ⊆ A_γ. (You must prove such an A_γ exists.) Then A_γ is the γ-superlevel set of g. Now, given the superlevel sets, we can construct the function by

g(ω) = sup{γ : ω ∈ A_γ}.

58 The Riesz Representation Theorem

All continuous linear functionals on L^p are given by integration against a function g ∈ L^q, with 1/p + 1/q = 1. That is, let L: L^p → R be a continuous linear functional. Then

L(f) = ∫ f g dμ for some g ∈ L^q.

Note, in L^2 this becomes L(f) = ⟨f, g⟩ with g ∈ L^2.

59 The Riesz Representation Theorem

What is the idea behind the proof? Linearity allows you to break things into building blocks, operate on them, then add them all together. What are the building blocks of measurable functions? Indicator functions! Of course! Let's define a set function from indicator functions:

λ(A) = L(1_A).

60 The Riesz Representation Theorem

How does L operate on simple functions?

L(Σ_i a_i 1_{A_i}) = Σ_i a_i L(1_{A_i}) = Σ_i a_i λ(A_i).

This looks like an integral with λ the measure! And it is not too hard to show that λ is a (signed) measure (countable additivity follows from continuity). Furthermore, λ ≪ μ. Radon–Nikodym then says dλ = g dμ.

61 The Riesz Representation Theorem

So L(f) = ∫ f g dμ for simple functions f; for general measurable functions the representation follows from limits and continuity. The details are left as an "easy" exercise for the reader...
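In a finite-dimensional stand-in for L^2 this is just the statement that every linear functional is an inner product with a fixed vector; a toy sketch (the functional is my own example), where the standard basis vectors play the role of indicator functions:

```python
import numpy as np

rng = np.random.default_rng(1)

# A continuous linear functional on R^5 (a stand-in for L^2)
L = lambda f: 2 * f[0] - f[3] + 0.5 * f[4]

# Its representer g: evaluate L on the basis of "indicators"
g = np.array([L(e) for e in np.eye(5)])

f = rng.standard_normal(5)
print(L(f), "==", np.dot(f, g))   # the functional is <f, g>
```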

62 Probability

A probability measure P is a measure that satisfies P(Ω) = 1; that is, the measure of the whole space is 1. A random variable is a measurable function X. The expectation of a random variable is its integral:

E[X] = ∫ X dP.

A density function is the Radon–Nikodym derivative with respect to Lebesgue measure:

f = dP/dm,  so that  E[X] = ∫ x f(x) dx.
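A quick numerical illustration (a sketch; the standard normal is my choice of example): computing E[X] = ∫ x f(x) dx with f the density, i.e. the Radon–Nikodym derivative with respect to Lebesgue measure.

```python
import numpy as np

# Standard normal density f = dP/dm
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

x = np.linspace(-10, 10, 100_001)
dx = x[1] - x[0]

# E[X] = integral of x f(x) dx, and total mass P(R) = 1
print(np.sum(x * f(x)) * dx)   # ~0.0
print(np.sum(f(x)) * dx)       # ~1.0
```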

63 Expectations under Different Measures

In finance we will talk about expectations with respect to different measures. A probability measure P is a measure that satisfies P(Ω) = 1; that is, the measure of the whole space is 1. Given two equivalent probability measures P and Q, the Radon–Nikodym derivative links them:

dP = (dP/dQ) dQ,  or equivalently  dP/dm = (dP/dQ)(dQ/dm),

and we write expectations in terms of the different measures:

E_P[X] = ∫ X dP = ∫ X (dP/dQ) dQ = E_Q[X dP/dQ].
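A discrete sanity check of the change-of-measure identity E_P[X] = E_Q[X dP/dQ] (a minimal sketch; the two measures are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # values of X on a three-point space
P = np.array([0.2, 0.5, 0.3])        # probability measure P
Q = np.array([0.4, 0.4, 0.2])        # an equivalent measure Q

dPdQ = P / Q                          # Radon-Nikodym derivative dP/dQ

E_P = np.sum(x * P)                   # direct expectation under P
E_Q_weighted = np.sum(x * dPdQ * Q)   # E_Q[X dP/dQ]

print(E_P, "==", E_Q_weighted)        # 2.1 == 2.1
```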