Financial Data Modelling Dr Nikolay Nikolaev Department of Computing Goldsmiths College University of London 2018
Dynamic Nonlinear Models
Lecture 3 (FDM 2018)

When processing time series data, feedforward time-delay neural networks (TDNN), which are static by design, accommodate time using sliding window vectors (also called tapped delay lines). The sliding input window shifts over the data series at each discrete time step, taking in the next data point and dropping the oldest one (according to a predefined lag/dimension). In this way, delay windows help us to process temporal patterns.

The TDNN models have several drawbacks:
- they limit the duration of the temporal events because they do not have implicit memory, and they require the lag space and delay time to be determined in advance;
- they face difficulties in capturing long-term temporal relationships in the data;
- they are trained with static learning algorithms (like backpropagation and standard optimizers).

Proper handling of sequential time series data with single-layer and multilayer neural networks is accomplished by adding memory that retains past outputs. Adding feedback connections to the network structure gives the model the potential to capture temporal relationships between serial data better, and to describe better the hidden dynamics of the unknown data generator.
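The sliding-window construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `sliding_windows` and the toy series are made up for the example and are not part of the lecture material.

```python
def sliding_windows(series, lag):
    """Return (window, target) pairs: each window holds `lag` consecutive
    values and the target is the value that follows the window."""
    pairs = []
    for t in range(lag, len(series)):
        window = series[t - lag:t]        # oldest value first, newest last
        pairs.append((window, series[t]))
    return pairs


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # toy data, assumed for illustration
pairs = sliding_windows(series, lag=3)
for window, target in pairs:
    print(window, "->", target)           # first line: [1.0, 2.0, 3.0] -> 4.0
```

Each shift of the window takes in one new data point and drops the oldest one, so the lag fixes how far back the static network can see.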
Having feedback connections makes these recurrent neural networks dynamic systems; in other words, it is the memory that makes recurrent networks powerful tools for learning temporal dependencies in serial data. Having memory also makes recurrent networks well suited to describing non-stationarity, so they are especially useful for learning from nonstationary time series.

There are two main advantages of having memory in neural network models:
- the memory stores the state of the dynamical neural system and determines the evolution of the output;
- the memory enables learning of longer time dependencies without the need to determine the input size accurately in advance; in other words, it makes it possible to learn with imprecise embeddings from time series data.

The maximum likelihood estimation (MLE) method can also serve as a common learning framework, but for nonlinear models it is usually implemented with exact derivatives rather than numerical integration.
Dynamic NARMA Models

Nonlinear versions of autoregressive moving average (NARMA) models can be developed using neural network representations. These are NARMA connectionist architectures in which we pass as inputs the latest time series measurements together with fed-back past outputs. The NARMA model is defined as follows:

$$y_t = f(\mathbf{x}_t, \mathbf{e}_t) + \varepsilon_t = f(x_{t-1}, x_{t-2}, \ldots, x_{t-p}, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots, \varepsilon_{t-q}) + \varepsilon_t$$

where $\varepsilon_{t-1} = y_{t-1} - f(\mathbf{x}_{t-1}, \mathbf{e}_{t-1})$ are the recent prediction errors.

Consider a simple recurrent single-neuron (Perceptron) network having such inputs:

$$z_{t-l} = \begin{cases} 1 & \text{if } l = 0 \\ x_{t-l} & \text{if } 1 \le l \le p \\ f_{t-l+p} & \text{if } p+1 \le l \le p+q \end{cases}$$

where $p$ is the number of lagged inputs, and $q$ is the number of recurrent connections.
Assuming that the output node uses the $\tanh$ activation function, the model computes:

$$f(\mathbf{x}_t, \mathbf{e}_t) = \tanh\left( \sum_{l=1}^{p} w_l x_{t-l} + \sum_{l=p+1}^{p+q} w_l f_{t-l+p} + w_0 \right)$$

where the temporal variables capture information from the past and send it back via the feedback loop, thus providing memory capacity. This is what helps to capture time-varying patterns in data.
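The computation of the recurrent single-neuron model can be sketched in Python as follows. The split of the weights into `w_bias`, `w_x` (length p) and `w_f` (length q), the function names, the zero initial state for the fed-back outputs and the toy series are all assumptions made for this illustration.

```python
import math


def neuron_output(x_lags, f_lags, w_bias, w_x, w_f):
    """tanh of the weighted sum of lagged inputs, fed-back outputs and bias."""
    s = w_bias
    s += sum(w * x for w, x in zip(w_x, x_lags))
    s += sum(w * f for w, f in zip(w_f, f_lags))
    return math.tanh(s)


def run_neuron(series, w_bias, w_x, w_f):
    """Run the recurrent neuron over a series with fixed weights."""
    p, q = len(w_x), len(w_f)
    f_lags = [0.0] * q                    # assumed initial state: no past outputs
    outputs = []
    for t in range(p, len(series)):
        x_lags = [series[t - l] for l in range(1, p + 1)]   # x_{t-1}, ..., x_{t-p}
        f_t = neuron_output(x_lags, f_lags, w_bias, w_x, w_f)
        outputs.append(f_t)
        f_lags = ([f_t] + f_lags)[:q]     # newest output first, oldest dropped
    return outputs


outputs = run_neuron([0.2, 0.4, 0.6, 0.8], w_bias=0.0, w_x=[0.5], w_f=[0.5])
print([round(f, 4) for f in outputs])
```

Because each output is pushed back into `f_lags`, the value produced at one step influences all later steps, which is the memory effect described above.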
Training NARMA Networks

There are two algorithms for computing dynamic gradients in such recurrent single-neuron networks:
- BackPropagation-Through-Time (BPTT): this algorithm unfolds the network back in time and calculates the error derivatives backwards as an expansion;
- Real-Time Recurrent Learning (RTRL): this algorithm computes the error derivatives forward in time.

Having the dynamic, temporal derivatives, one can plug them into a standard optimizer or implement gradient-descent training with first-order or second-order methods.
Online Gradient Descent Training

The first-order online gradient-descent training algorithm updates the weights at each time step in the direction opposite to the instantaneous gradient of the cost function:

$$w_{j,t} = w_{j,t-1} - \eta \frac{\partial C_t}{\partial w_{j,t-1}} = w_{j,t-1} + \eta\, \varepsilon_t f'_t \frac{\partial s_t}{\partial w_{j,t-1}}$$

where $f'_t$ denotes the derivative of the activation function, $\varepsilon_t = y_t - f(\mathbf{x}_t, \mathbf{w}_t)$ is the error, and $s_t$ is the summation at the output node:

$$s_t = \sum_{l=1}^{p} w_l x_{t-l} + \sum_{l=p+1}^{p+q} w_l f_{t-l+p} + w_0$$

This derivative is obtained according to the maximum likelihood principle, starting from the instantaneous cost function:

$$C_t = \tfrac{1}{2} \left( y_t - f(\mathbf{x}_t, \mathbf{w}_t) \right)^2$$

The so-called Real-Time Recurrent Learning (RTRL) derivatives are calculated using the chain rule in the following way:

$$\frac{\partial C_t}{\partial w_j} = \frac{\partial C_t}{\partial f_t} \frac{\partial f_t}{\partial s_t} \frac{\partial s_t}{\partial w_j} = -\varepsilon_t \left( 1 - f(\mathbf{x}_t, \mathbf{w}_t)^2 \right) \frac{\partial s_t}{\partial w_j}$$

where $\partial C_t / \partial f_t = -\varepsilon_t$, the derivative of the tanh activation function is $f'_t = 1 - f(\mathbf{x}_t, \mathbf{w}_t)^2$, and the time subscripts for the weights are omitted for clarity.
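As a sanity check, the following Python sketch verifies numerically that the update term $\varepsilon_t f'_t \, \partial s_t / \partial w_j$ points opposite to the gradient of the instantaneous cost, so that adding it to the weights lowers the cost. It is a simplified illustration: the inputs `z` are held fixed (so $\partial s_t / \partial w_j = z_j$ and the recurrent dependence is ignored), and the function names and numerical values are assumptions chosen for the example.

```python
import math


def output(w, z):
    return math.tanh(sum(wj * zj for wj, zj in zip(w, z)))


def cost(w, z, y):
    return 0.5 * (y - output(w, z)) ** 2


def descent_direction(w, z, y):
    """eps * f' * ds/dw_j, with ds/dw_j = z_j because z is held fixed here."""
    f = output(w, z)
    eps = y - f
    return [eps * (1.0 - f * f) * zj for zj in z]


def numerical_gradient(w, z, y, h=1e-6):
    """Central finite-difference estimate of dC/dw_j."""
    grad = []
    for j in range(len(w)):
        up = list(w)
        down = list(w)
        up[j] += h
        down[j] -= h
        grad.append((cost(up, z, y) - cost(down, z, y)) / (2 * h))
    return grad


w = [-0.0391, 0.1461, 0.0779]     # illustrative weights: bias, input, feedback
z = [1.0, 0.9524, 0.5]            # illustrative inputs, treated as constants
y = 0.9801                        # illustrative target
eta = 0.1

direction = descent_direction(w, z, y)
gradient = numerical_gradient(w, z, y)
w_new = [wj + eta * dj for wj, dj in zip(w, direction)]
print("cost before:", cost(w, z, y), "cost after:", cost(w_new, z, y))
```

The analytic direction and the finite-difference gradient agree up to sign, which is why the weight update adds $\eta \varepsilon_t f'_t \, \partial s_t / \partial w_j$ rather than subtracting it.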
Temporal RTRL Derivatives

The derivatives of the output node summation with respect to the weights are taken as follows:

$$\frac{\partial s_t}{\partial w_j} = \frac{\partial}{\partial w_j} \left( \sum_{l=1}^{p+q} w_l z_{t-l} + w_0 \right) = \sum_{l=1}^{p+q} \left( w_l \frac{\partial z_{t-l}}{\partial w_j} + z_{t-l} \frac{\partial w_l}{\partial w_j} \right) + \frac{\partial w_0}{\partial w_j}$$

$$= \sum_{l=1}^{p} w_l \frac{\partial x_{t-l}}{\partial w_j} + \sum_{l=p+1}^{p+q} w_l \frac{\partial f_{t-l+p}}{\partial w_j} + z_{t-j} = \sum_{l=p+1}^{p+q} w_l \frac{\partial f_{t-l+p}}{\partial w_j} + z_{t-j}$$

where the assumption is that $\partial x_{t-l} / \partial w_j = 0$, and $z_{t-j} = 1$ for the bias weight ($j = 0$).

Note here that the first term accounts for the implicit effect of weight $w_j$ on the network output through the fed-back past outputs, while the second term is the explicit effect of this weight on the network summation.

With knowledge of how to train such a recurrent single-neuron network, recurrent multilayer Perceptrons can also be designed if severe nonlinearities are present in the data (after performing initial checks with some diagnostic tests).
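The recursion for the temporal derivatives can be combined with the online weight update into a short training loop. The Python sketch below is one possible arrangement for the smallest case (one lagged input and one feedback connection, p = q = 1); it is not the lecture's implementation. The function name `rtrl_train`, the zero initial state, the use of the current output in the error, and the toy sine series are all assumptions.

```python
import math


def rtrl_train(series, w, eta=0.1, epochs=1):
    """Online RTRL for a single tanh neuron with weights
    w = [bias, input weight, feedback weight]."""
    w = list(w)
    mse_per_epoch = []
    for _ in range(epochs):
        f_prev = 0.0                  # assumed initial output
        df_prev = [0.0, 0.0, 0.0]     # d f_{t-1} / d w_j, assumed zero at the start
        squared_error = 0.0
        for t in range(1, len(series)):
            z = [1.0, series[t - 1], f_prev]
            s = sum(wj * zj for wj, zj in zip(w, z))
            f = math.tanh(s)
            f_prime = 1.0 - f * f
            # ds_t/dw_j = w_feedback * d f_{t-1}/dw_j (implicit) + z_j (explicit)
            ds = [w[2] * df_prev[j] + z[j] for j in range(3)]
            df = [f_prime * d for d in ds]
            eps = series[t] - f
            w = [w[j] + eta * eps * df[j] for j in range(3)]
            f_prev, df_prev = f, df
            squared_error += eps * eps
        mse_per_epoch.append(squared_error / (len(series) - 1))
    return w, mse_per_epoch


series = [0.5 * math.sin(0.3 * t) for t in range(200)]   # toy series, assumed
w_trained, mse = rtrl_train(series, [0.0, 0.1, 0.1], eta=0.1, epochs=20)
print("first epoch MSE:", mse[0], "last epoch MSE:", mse[-1])
```

The derivatives are carried forward from one time step to the next, so no past inputs need to be stored, which is the defining property of RTRL.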
Example: RTRL training of a nonlinear recurrent single-layer network

Let a simple network be given with one output node, one input node, a bias constant, and an output-to-input feedback connection. The output node has a hyperbolic tangent activation function. Suppose that the initial weights are $\mathbf{w}_t = [-0.0391 \;\; 0.1461 \;\; 0.0779]$, the gradients from the previous time step are $\partial s_t / \partial \mathbf{w}_t = [0.1 \;\; 0.2 \;\; 0.3]$, and the learning rate is $\eta = 0.1$. The given time series data are: external input $x_{t-1} = 0.9524$ and target $x_t = 0.9801$.

Assuming that the network output generated with the previous data point is $f_{t-1} = 0.5$, the error is calculated as follows:

$$e_t = x_t - f_{t-1} = 0.9801 - 0.5 = 0.4801$$

Next, the input vector is constructed as follows:

$$\mathbf{z}_t = [1.0 \;\; x_{t-1} \;\; f_{t-1}] = [1.0 \;\; 0.9524 \;\; 0.5]$$

Then, we perform the forward propagation:

$$s_t = w_1 z_1 + w_2 z_2 + w_3 z_3 = -0.0391 \cdot 1 + 0.1461 \cdot 0.9524 + 0.0779 \cdot 0.5 = 0.139$$

$$f_t = \tanh(s_t) = 0.1381$$
Example: RTRL training (continuation)

After that, we apply the chain rule with the corresponding values, where the error $\varepsilon_t = 0.4801$ and the activation derivative are both evaluated with the previous output $f_{t-1} = 0.5$:

$$-\frac{\partial C_t}{\partial f_t} \frac{\partial f_t}{\partial s_t} = \varepsilon_t (1 - f_{t-1}^2) = 0.4801 \cdot (1 - 0.5^2) = 0.3601$$

Having the past derivatives, the weight deltas are computed as follows:

$$\eta \varepsilon_t f'_t \, \partial s_t / \partial w_1 = 0.1 \cdot 0.3601 \cdot 0.1 = 0.0036$$
$$\eta \varepsilon_t f'_t \, \partial s_t / \partial w_2 = 0.1 \cdot 0.3601 \cdot 0.2 = 0.0072$$
$$\eta \varepsilon_t f'_t \, \partial s_t / \partial w_3 = 0.1 \cdot 0.3601 \cdot 0.3 = 0.0108$$

Therefore, the weights are updated in the following way:

$$w_1 = (-0.0391) + 0.0036 = -0.0355$$
$$w_2 = 0.1461 + 0.0072 = 0.1533$$
$$w_3 = 0.0779 + 0.0108 = 0.0887$$

Finally, the derivatives for the next time step are produced using the updated weight $w_3 = 0.0887$ and the stored derivatives $[0.1 \;\; 0.2 \;\; 0.3]$:

$$\partial s_t / \partial w_1 = w_3 \, \partial f_t / \partial w_1 + z_1 = 0.0887 \cdot 0.1 + 1.0 = 1.0089$$
$$\partial s_t / \partial w_2 = w_3 \, \partial f_t / \partial w_2 + z_2 = 0.0887 \cdot 0.2 + 0.9524 = 0.9701$$
$$\partial s_t / \partial w_3 = w_3 \, \partial f_t / \partial w_3 + z_3 = 0.0887 \cdot 0.3 + 0.5 = 0.5266$$
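The arithmetic of this example can be recomputed with a short Python script, which is convenient for checking each figure to four decimals. The script follows the steps of the example as stated (error and activation derivative evaluated with the previous output 0.5, stored derivatives reused in the final step); the variable names are chosen for illustration.

```python
import math

# Starting values taken from the example.
w = [-0.0391, 0.1461, 0.0779]         # bias, input and feedback weights
stored = [0.1, 0.2, 0.3]              # derivatives from the previous time step
eta = 0.1
x_prev, target, f_prev = 0.9524, 0.9801, 0.5

error = target - f_prev               # e_t = x_t - f_{t-1}
z = [1.0, x_prev, f_prev]             # input vector
s = sum(wj * zj for wj, zj in zip(w, z))
f = math.tanh(s)                      # forward propagation

factor = error * (1.0 - f_prev ** 2)  # chain-rule factor
deltas = [eta * factor * dj for dj in stored]
w_new = [wj + dj for wj, dj in zip(w, deltas)]
next_derivs = [w_new[2] * dj + zj for dj, zj in zip(stored, z)]

print("error      ", round(error, 4))
print("s, f       ", round(s, 4), round(f, 4))
print("factor     ", round(factor, 4))
print("deltas     ", [round(v, 4) for v in deltas])
print("new weights", [round(v, 4) for v in w_new])
print("next derivs", [round(v, 4) for v in next_derivs])
```

The script works with unrounded intermediate values, so the last digit of a printed figure can differ from a hand calculation that rounds at every step.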
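Exercise: Programming the RTRL algorithm in Matlab

First we need to initialize all data structures and load the time series data:

    % odim, x and targets are assumed to be already defined in the workspace
    NVCTS = odim;                 % number of training vectors
    NOUTS = 1;                    % number of output nodes
    NINPUTS = 10;                 % number of external inputs
    m = NINPUTS+1;                % external inputs plus the bias
    NNODES = 3;                   % number of recurrent nodes
    nrows = NNODES;
    NCOLS = m+NNODES;             % bias + inputs + fed-back node outputs
    eta = 0.1;                    % learning rate

    e = zeros(NNODES,1);          % errors
    s = zeros(NNODES,1);          % node summations
    out = zeros(NNODES,1);        % node outputs
    yprime = zeros(NNODES,1);     % activation derivatives
    w = zeros(NNODES,NCOLS);      % weights
    delw = zeros(NNODES,NCOLS);   % weight changes
    z = zeros(NVCTS,NCOLS);       % input vectors
    d = zeros(NVCTS,NCOLS);       % targets
    p = zeros(NNODES,NCOLS,NNODES);      % temporal derivatives
    pold = zeros(NNODES,NCOLS,NNODES);   % temporal derivatives from the previous step

    z(:,1) = 1.0;                 % bias input
    for i = 1:NVCTS
      for j = 1:NINPUTS
        z(i,j+1) = x(i,j);        % load the input vectors
      end
      for j = 1:NOUTS
        d(i,j) = targets(i);      % load the targets
      end
    end
    w = 0.5*(rand(NNODES,NCOLS)-0.5);    % initialize the weights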
Exercise: Programming the RTRL algorithm in Matlab (continuation)

Next we develop the training loops to iterate over the data, starting with forward propagation:

    for epoch = 1:50
      for t = 1:NVCTS
        % Compute the error
        for k = 1:NOUTS
          e(k) = d(t,k)-out(k);
        end
        % Set previous out(k)=out(t) as part of the next input z(t,k+m)
        for k = 1:NNODES
          z(t,k+m) = out(k);
        end
        % Generate the summations at each of the k nodes
        for k = 1:NNODES
          s(k) = 0.0;
          for i = 1:NCOLS
            s(k) = s(k)+w(k,i)*z(t,i);
          end
        end

After that, still inside the loop over t, we compute the node outputs and update the weights:

        % Compute the output out at time (t+1): out(k) = out(t+1) = f(s(t))
        for k = 1:NOUTS
          out(k) = s(k);                        % linear output node
        end
        for k = 1:NNODES-NOUTS
          out(k+NOUTS) = tanh(s(k+NOUTS));      % remaining nodes use tanh
        end
        % Compute the weight changes at time t
        for i = 1:NNODES
          for j = 1:NCOLS
            delw(i,j) = 0.0;
            for k = 1:NOUTS
              delw(i,j) = delw(i,j)+eta*e(k)*pold(i,j,k);
            end
          end
        end
        % Update the weights for time (t+1)
        w = w+delw;

Finally, the temporal matrix is computed for the next iteration:

        for k = 1:NOUTS
          yprime(k) = 1.0;                      % derivative of the linear output
        end
        for k = 1:NNODES-NOUTS
          yprime(k+NOUTS) = 1-out(k+NOUTS)^2;   % derivative of tanh
        end
        for i = 1:NNODES
          for j = 1:NCOLS
            for k = 1:NNODES
              krondelta = 0.0;                  % Kronecker delta of i and k
              if (i==k)
                krondelta = 1.0;
              end
              ssum = 0.0;
              for l = 1:NNODES
                ssum = ssum+w(k,l+m)*pold(i,j,l);   % pold = p(t)
              end
              p(i,j,k) = yprime(k)*(ssum+krondelta*z(t,j));
            end
          end
        end
        ptemp = pold; pold = p; p = ptemp;      % pold is now p(t+1)
      end
    end