Solving High-Dimensional Partial Differential Equations Using Deep Learning. Jiequn Han, Arnulf Jentzen, and Weinan E. Presented by Aishwarya Singh.

Introduction: Partial Differential Equations (PDEs) are a ubiquitous tool for modeling dynamic multivariate systems in science, engineering, and finance. PDEs are inherently multidimensional by the definition of partial differentiation. "Solving" a PDE means recovering an unknown function of multiple variables (or an approximation of it) from a known relation among its partial derivatives (the PDE itself) together with boundary or terminal conditions. Stochastic PDEs can be thought of as a generalization of ordinary (deterministic) PDEs.

Partial Differential Equations in Finance: The Black-Scholes Equation is likely the most famous PDE in finance. The Hamilton-Jacobi-Bellman (HJB) Equation is another prominent PDE, used for dynamic asset allocation in an actively traded portfolio. Han et al. take on the challenge of solving extremely high-dimensional versions of these PDEs using a neural-network approximation method.

Motivation – The Curse of Dimensionality: The "Curse of Dimensionality" (as coined by Richard Bellman in 1957) is an issue that arises when the number of objects being modeled in a system grows: a quantum physical system with many particles, a biological system with many genes, a financial system with many assets. The issue is most often discussed in data mining and combinatorics contexts. In the context of PDE-based modeling, the "curse" refers to the explosion in computational cost of numerical algorithms, which are often the only option when analytical solutions are unavailable.

Neural Networks Introduction: Neural Networks are a family of algorithms that can be used to build powerful models for regression or classification problems on high-dimensional data. Neural Networks are compositions of simple functions that together approximate a much more complex function. Neural Networks are very heavily parametrized, with huge numbers of 'weights' that are optimized simultaneously.

Neural Networks Basics: A 'perceptron' (a.k.a. 'artificial neuron') is the building block of a Neural Network. A perceptron feeds the dot product of a vector of inputs and a vector of 'weights' into an Activation Function. An Activation Function mimics the 'firing' of a biological neuron: the output is negligible until the input reaches a certain threshold, producing a nonlinear micro-model. Popular activation functions are the step function, the smoothed step (sigmoid) function, the linear rectifier, and the smoothed rectifier (the 'softplus' function). Each 'layer' of a Neural Network consists of several parallel perceptrons, typically processing the same inputs but with different weights. A minimal sketch of a single perceptron is given below. (Illustration omitted; image credit: towardsdatascience.com.)
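A minimal Python sketch of the perceptron described above; it is not taken from the slides, and the function names and example numbers are illustrative only.

```python
import numpy as np

def sigmoid(z):
    """Smoothed step activation: output is negligible for very negative
    inputs and saturates at 1 for large positive inputs."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b, activation=sigmoid):
    """A single artificial neuron: the dot product of inputs and weights
    (plus a bias) is fed through an activation function."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.8,  0.1, 0.4])   # weights (learned during training)
print(perceptron(x, w, b=-1.0))
```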

Multilayer Networks – 'Deep Learning': Neural Networks consist of an input layer, which simply distributes input vectors; an output layer, which collates outputs; and one or more 'hidden' layers, which construct and store the model. 'Feed-forward' refers to the architectural property of most NNs in which each hidden layer's outputs act as inputs for the next layer, composing perceptrons and weights on top of one another, as in the sketch below. 'Deep Learning' refers to the use of NNs with many layers and complex architectures. (Illustration omitted; image credit: neuraldesigner.com.)
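A small sketch of the feed-forward composition described above, assuming random illustrative weights and a ReLU activation; the names and layer shapes are not from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Feed-forward pass: each layer's output becomes the next layer's
    input, composing the per-layer models on top of one another."""
    a = x
    for W, b in layers:
        a = relu(W @ a + b)
    return a

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),   # hidden layer 1
          (rng.normal(size=(2, 4)), np.zeros(2))]   # hidden layer 2
print(forward(np.array([1.0, 0.5, -0.3]), layers))
```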

Learning Through Optimization – Gradient Descent: Each output of a Neural Network is ultimately a huge composition of perceptron functions and weights. Neural Networks are trained by comparing these outputs to target values (training data) and minimizing a Loss Function between them; Sum-of-Squared-Errors is a popular general-purpose Loss Function. Because the component perceptrons use (piecewise) differentiable Activation Functions, a gradient of the Loss can be computed with respect to the weights of any given layer. Gradient Descent is an optimization algorithm in which the weights across a layer are simultaneously adjusted in steps (with step size controlled by an exogenous 'learning rate' parameter) so as to reduce the Loss Function. 'Backpropagation' refers to cascading this weight-updating process backwards from the final Loss Function to the first hidden layer, so that weights in all layers are trained; a toy example follows.
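A toy Python example of gradient descent with backpropagation on a one-hidden-layer network and a sum-of-squared-errors loss; the data, network size, and learning rate are arbitrary illustrations, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # training inputs
y = (X[:, 0] * X[:, 1] > 0).astype(float)  # training targets

W1, b1 = 0.5 * rng.normal(size=(4, 2)), np.zeros(4)
w2, b2 = 0.5 * rng.normal(size=4), 0.0
lr = 0.01                                  # exogenous learning-rate parameter

for step in range(5000):
    # forward pass
    h = sigmoid(X @ W1.T + b1)             # hidden-layer outputs
    p = sigmoid(h @ w2 + b2)               # network outputs
    loss = np.sum((p - y) ** 2)            # sum-of-squared-errors

    # backward pass (backpropagation): cascade gradients from the loss
    dp = 2 * (p - y) * p * (1 - p)         # gradient at the output pre-activation
    dw2, db2 = h.T @ dp, dp.sum()
    dh = np.outer(dp, w2) * h * (1 - h)    # gradient at the hidden pre-activation
    dW1, db1 = dh.T @ X, dh.sum(axis=0)

    # simultaneous gradient-descent step on all weights
    W1 -= lr * dW1; b1 -= lr * db1
    w2 -= lr * dw2; b2 -= lr * db2

print("final loss:", loss)
```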

Methodology – Semilinear Parabolic PDEs: Han et al. consider a highly general class of PDEs, semilinear parabolic PDEs, as the target of their NN-solver algorithm. The general form of these equations is reconstructed below. In it: t is time; x is a d-dimensional space variable; u is the unknown (scalar-valued) solution being sought; mu is a known vector-valued drift function; sigma is a known d×d-matrix-valued function; f is a known nonlinear function; nabla denotes the gradient, or first-derivative vector, of u with respect to x; the Hessian is the matrix of second derivatives of u with respect to x; and a trace is the sum of the diagonal elements of a matrix.
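A reconstruction of the general form, based on the published paper (the equation appeared as an image on the original slide):

```latex
\frac{\partial u}{\partial t}(t,x)
  + \tfrac{1}{2}\,\mathrm{Tr}\!\left(\sigma\sigma^{\mathsf T}(t,x)\,(\mathrm{Hess}_x u)(t,x)\right)
  + \nabla u(t,x)\cdot\mu(t,x)
  + f\!\left(t,\,x,\,u(t,x),\,\sigma^{\mathsf T}(t,x)\,\nabla u(t,x)\right) = 0,
\qquad u(T,x) = g(x).
```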

Methodology – Backward Stochastic Differential Equations: Han et al. replace the d-dimensional deterministic variable x with a random variable X_t that evolves in time according to a stochastic process driven by Brownian motion (reconstructed below). Itô's formula then implies that the solution u of the semilinear parabolic PDE, evaluated along these paths, satisfies a relation that is a Backward Stochastic Differential Equation (BSDE); see the second equation below.
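The two relations, reconstructed from the published paper (the equations were images on the original slides): the forward process for X, and the BSDE satisfied by u along its paths.

```latex
X_t = \xi + \int_0^t \mu(s, X_s)\,\mathrm{d}s + \int_0^t \sigma(s, X_s)\,\mathrm{d}W_s,

u(t, X_t) - u(0, X_0)
  = -\int_0^t f\!\left(s, X_s, u(s,X_s), \sigma^{\mathsf T}(s,X_s)\,\nabla u(s,X_s)\right)\mathrm{d}s
    + \int_0^t \left[\nabla u(s,X_s)\right]^{\mathsf T} \sigma(s,X_s)\,\mathrm{d}W_s.
```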

Methodology – Discretizing the BSDE: The next step is to Euler-discretize the BSDE in time (a reconstruction of the scheme is given below). This discretization allows sample paths of the random variable X to be generated easily. The initial values of u and of its gradient, u(0, X_0) and (nabla u)(0, X_0), are treated as model parameters.
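The Euler-discretized scheme, reconstructed from the published paper, where Δt_n = t_{n+1} − t_n and ΔW_n = W_{t_{n+1}} − W_{t_n}:

```latex
X_{t_{n+1}} \approx X_{t_n} + \mu(t_n, X_{t_n})\,\Delta t_n + \sigma(t_n, X_{t_n})\,\Delta W_n,

u(t_{n+1}, X_{t_{n+1}}) \approx u(t_n, X_{t_n})
  - f\!\left(t_n, X_{t_n}, u(t_n, X_{t_n}), \sigma^{\mathsf T}(t_n, X_{t_n})\,\nabla u(t_n, X_{t_n})\right)\Delta t_n
  + \left[\nabla u(t_n, X_{t_n})\right]^{\mathsf T}\sigma(t_n, X_{t_n})\,\Delta W_n.
```

A short Python sketch of generating sample paths of X with this Euler scheme; the drift and diffusion functions here are illustrative stand-ins rather than the paper's exact choices.

```python
import numpy as np

def sample_paths(x0, mu, sigma, T, N, n_paths, rng):
    """Euler scheme: X_{n+1} = X_n + mu(t_n, X_n)*dt + sigma(t_n, X_n) @ dW_n.
    mu returns a d-vector, sigma a d x d matrix; both are hypothetical stand-ins."""
    d, dt = len(x0), T / N
    X = np.empty((n_paths, N + 1, d))
    X[:, 0] = x0
    for n in range(N):
        t = n * dt
        dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, d))
        for p in range(n_paths):
            X[p, n + 1] = X[p, n] + mu(t, X[p, n]) * dt + sigma(t, X[p, n]) @ dW[p]
    return X

rng = np.random.default_rng(0)
paths = sample_paths(np.full(3, 1.0),
                     mu=lambda t, x: np.zeros_like(x),                 # e.g. zero drift
                     sigma=lambda t, x: np.sqrt(2) * np.eye(len(x)),   # e.g. sqrt(2)*I diffusion
                     T=1.0, N=20, n_paths=4, rng=rng)
print(paths.shape)  # (4, 21, 3)
```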

Methodology – Neural Network Approximation: The randomly generated X paths are used to train 'sub-networks', one per time step, each seeking an approximate function for the mapping from x to sigma-transpose times the gradient of u at that time step (the first equation in the reconstruction below). Each approximate function is defined in terms of a multilayer feed-forward network with parameters 'theta'. The sub-networks for all time steps are then stacked into an overall network architecture whose goal is to approximate the function u, i.e. the solution to the original PDE. The final expected Loss Function takes the form of the second equation below.
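The per-time-step mapping and the expected loss, reconstructed from the published paper (the equations were images on the original slide):

```latex
x \;\mapsto\; \sigma^{\mathsf T}(t_n, x)\,\nabla u(t_n, x)
  \;\approx\; \left(\sigma^{\mathsf T}\nabla u\right)\!(t_n, x \mid \theta_n),

\ell(\theta) = \mathbb{E}\!\left[\,\Big| g(X_{t_N})
  - \hat{u}\big(\{X_{t_n}\}_{0 \le n \le N},\, \{W_{t_n}\}_{0 \le n \le N}\big) \Big|^2 \right].
```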

Case 1 – Black-Scholes with Default Risk: The basic Black-Scholes setting considers a single underlying asset that is not itself contingent on anything else; the underlying cannot default. Han et al. consider a Black-Scholes PDE for pricing a European derivative on 100 underlying assets, conditional on no default having occurred. Default risk is modelled as the first jump of a Poisson process with a level-dependent intensity Q and an asset recovery rate given default, delta. The final PDE for the option price u is reconstructed below; x is a 100-dimensional vector, and most parameters (the sigmas, delta, R, etc.) are exogenous.
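A reconstruction of the pricing PDE based on the published paper rather than the original slide image; in the paper, Q is a piecewise-linear intensity function of the current option value and the terminal payoff is the minimum of the 100 asset prices.

```latex
\frac{\partial u}{\partial t}(t,x)
  + \bar{\mu}\, x \cdot \nabla u(t,x)
  + \frac{\bar{\sigma}^2}{2}\sum_{i=1}^{100} |x_i|^2\,\frac{\partial^2 u}{\partial x_i^2}(t,x)
  - (1-\delta)\,Q\!\left(u(t,x)\right) u(t,x) - R\,u(t,x) = 0,
\qquad u(T,x) = \min\{x_1, \dots, x_{100}\}.
```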

Case 1 Results Discussion: The paper shows a plot of the time-0 option price generated by the NN approximation, given that all 100 assets start at a price of 100 (plot not reproduced here). The blue line is the mean and the shaded area the standard deviation across 5 independent runs. 40 time steps and a learning rate of 0.008 are used. The price converges to a value of ~57.3 given enough optimizer iterations.

Case 2 – HJB Equation: Han et al. consider a 100-dimensional control problem in which a cost functional is to be minimized. The related HJB Equation is reconstructed below; 'lambda' is a positive constant controlling the strength of the control. This PDE can actually be solved explicitly, given a function g(x) as a terminal condition at time T for u(t,x); the terminal function chosen by Han et al. and the resulting explicit solution are also reconstructed below. The explicit solution is used as a benchmark against which to test the BSDE NN approximation algorithm.
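Reconstructions from the published paper of the HJB equation, its explicit solution (a Cole-Hopf-type formula evaluated by Monte Carlo expectation over Brownian noise), and the chosen terminal function; the original slide showed these as images.

```latex
\frac{\partial u}{\partial t}(t,x) + \Delta u(t,x) - \lambda\,\|\nabla u(t,x)\|^2 = 0,
\qquad u(T,x) = g(x),

u(t,x) = -\frac{1}{\lambda}\,
  \ln\!\left( \mathbb{E}\!\left[ \exp\!\big( -\lambda\, g(x + \sqrt{2}\,W_{T-t}) \big) \right] \right),
\qquad g(x) = \ln\!\left( \frac{1 + \|x\|^2}{2} \right).
```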

Case 2 Results Discussion: The blue line and shaded region again represent the mean and standard deviation across 5 independent runs. The BSDE NN achieved a relative error of 0.17% in 330 s on a MacBook Pro when calculating the optimal cost for all-zero inputs; 20 time steps and a learning rate of 0.01 are used. The Deep BSDE Solver is also compared to Monte Carlo runs of the explicit solution (which has a Brownian noise input) across different values of the control-strength parameter 'lambda'; an intuitive decrease in optimal cost as control strength increases is observed.

Case 3 – Allen-Cahn Equation: The Allen-Cahn Equation is a reaction-diffusion equation from physics. Reaction-diffusion equations are relevant in fields ranging from chemistry and molecular biology to materials science and electronics engineering. Once again, Han et al. consider a 100-dimensional case, reconstructed below along with its initial condition. A relative error of 0.30% was achieved in 647 s, using 20 time steps and a learning rate of 0.0005.
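A reconstruction of the 100-dimensional Allen-Cahn equation and its initial condition, based on the published paper rather than the original slide images:

```latex
\frac{\partial u}{\partial t}(t,x) = \Delta u(t,x) + u(t,x) - \left[u(t,x)\right]^3,
\qquad u(0,x) = \frac{1}{2 + 0.4\,\|x\|^2}.
```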

Overall Results Discussion: For the Black-Scholes case, the benchmark used for comparison was a 100-dimensional system without default risk, handled by Monte Carlo methods; the benchmark price was 60.781, against the BSDE NN with-default model's ~57.300. For the HJB case, the benchmark was the explicit solution of the PDE, with its noise input run through Monte Carlo sampling. For the Allen-Cahn case, a numerical solution produced by another BSDE-related algorithm, the branching diffusion method, was used for comparison. All runs were done on a MacBook Pro with a 2.9 GHz Intel Core i5 CPU and 16 GB of RAM. An attempt to address the quantum many-body problem in physics did not succeed, due to difficulties related to the Pauli Exclusion Principle.

Deep Learning Network Architecture: The architecture was implemented in TensorFlow and trained with the 'Adam' stochastic gradient descent optimizer.

Deep Learning Network Architecture: Each time step has an associated fully-connected sub-network. Each sub-network has a d-dimensional input layer, two (d+10)-dimensional hidden layers, and a d-dimensional output layer; a sketch follows below. The Activation Function is the linear rectifier (the perceptrons in this case are referred to as Rectified Linear Units, or ReLUs). Another reaction-diffusion PDE, with an explicit oscillating solution, was used to test the effect of varying the number of hidden layers: the number of hidden layers per sub-network was varied from 0 to 4, with 30 time steps and 40,000 optimization iterations (the resulting table is not reproduced in this transcript). Note that 'number of layers' there refers to the total number of layers with free parameters (hidden and output) across the 30 time steps.
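A minimal tf.keras sketch of one per-time-step sub-network with the layer sizes described above, assuming d = 100 as in the paper's examples; this illustrates the stated dimensions and the Adam optimizer choice, not the authors' actual implementation, which may include further details not shown here.

```python
import tensorflow as tf

def make_subnetwork(d: int) -> tf.keras.Model:
    """One per-time-step sub-network: d-dimensional input, two
    (d + 10)-dimensional ReLU hidden layers, and a d-dimensional linear
    output (approximating sigma^T grad u at that time step)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(d,)),
        tf.keras.layers.Dense(d + 10, activation="relu"),
        tf.keras.layers.Dense(d + 10, activation="relu"),
        tf.keras.layers.Dense(d),  # linear output layer
    ])

d = 100                                  # dimensionality used in the paper's examples
subnet = make_subnetwork(d)
optimizer = tf.keras.optimizers.Adam()   # 'Adam' SGD variant, as stated on the slide
subnet.summary()
```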

Conclusions & Perspective: The BSDE Neural Network approximation can be considered a type of numerical method. In essence, the neural network takes a Monte Carlo simulation as training data and tries to reverse-engineer a complex nonlinear model out of it. The apparent benefit is that, once trained, the NN can store a high-dimensional model for further use. Initial simulation and training can take considerable time, and that time grows further when additional accuracy is sought by adding more hidden layers. Does the approach have potential? Probably.

The End