1
Nonlinear Gated Experts for Time Series: Discovering Regimes and Avoiding Overfitting
A. S. Weigend, M. Mangeas, A. N. Srivastava, Department of Computer Science, University of Colorado. Presented by Bo Han.
2
Outline
Motivation
Methodology of the Gated Network
Experimental Results
Discussion
Conclusion
3
Motivation Regression: Fitting a function to real data points.
Simple case vs. complex case (illustrated with plots on the slide).
Divide-and-conquer strategy: easier to build a local model for each regime and switch between regimes.
4
Basic Idea of Gated Network
Architecture of the gated network (introduced in last class): K experts, Expert 1 through Expert K, with variances σ1², ..., σK², each approximating a local function over one regime and producing outputs y1(x), ..., yK(x). A gating network, a classifier from R^N to R^K with outputs g1(x), ..., gK(x), decides which expert to use for a given input; the combined prediction is E[d | x].
Learning tasks: learn the parameters of each individual local expert and of the gating network.
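To make the combination rule concrete, here is a minimal sketch (not the authors' code) of the forward pass: the gating probabilities g_j(x) come from a softmax over the gating network's scores, and the combined prediction is E[d | x] = Σ_j g_j(x) y_j(x). The layer sizes and helper names below are illustrative only.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP with tanh units (illustrative expert/gate body)."""
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

def gated_prediction(x, experts, gate):
    """Combine K expert outputs with gating probabilities: E[d|x] = sum_j g_j(x) * y_j(x)."""
    # Each expert j produces a scalar prediction y_j(x).
    y = np.array([mlp_forward(x, *p).item() for p in experts])   # shape (K,)
    # The gating network outputs K scores, turned into probabilities by a softmax.
    scores = mlp_forward(x, *gate)                                # shape (K,)
    g = np.exp(scores - scores.max())
    g /= g.sum()
    return g @ y, g, y                                            # E[d|x], g_j(x), y_j(x)

# Tiny illustrative instantiation: 2 experts, 2-dimensional input.
rng = np.random.default_rng(0)
def init(n_in, n_hid, n_out):
    return (rng.normal(size=(n_in, n_hid)), np.zeros(n_hid),
            rng.normal(size=(n_hid, n_out)), np.zeros(n_out))

experts = [init(2, 10, 1) for _ in range(2)]   # expert bodies
gate = init(2, 20, 2)                          # gating network: one score per expert
pred, g, y = gated_prediction(np.array([0.1, -0.3]), experts, gate)
```

These two ingredients, the expert outputs and the gating probabilities, are all that the EM training described on the next slides needs.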
5
Assumptions and Cost Function
Assumption 1: statistical independence among patterns, i.e., for X = <x1, x2, ..., xn>, xi is independent of xj for i ≠ j.
Assumption 2: one and only one expert is responsible for each pattern.
Assumption 3: each expert represents a Gaussian density.
The total likelihood L of observing output d from input x involves the parameters of the gating network and the parameters of the local experts (j = 1, 2, ..., K).
Maximum likelihood estimation: learning means finding the parameters (j = 1, 2, ..., K) that maximize L, i.e., minimizing the cost function C = -ln(L).
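The slide's formulas were lost in extraction; under the three assumptions above (Gaussian experts, exactly one responsible expert per pattern, independent patterns), a sketch of the mixture-of-experts likelihood and cost that the slide's symbols suggest is:

```latex
% Per-pattern likelihood: mixture of K Gaussian experts, gated by g_j(x) with sum_j g_j(x) = 1.
P\big(d^{(t)} \mid x^{(t)}\big) \;=\; \sum_{j=1}^{K} g_j\!\big(x^{(t)}\big)\,
  \frac{1}{\sqrt{2\pi\sigma_j^{2}}}
  \exp\!\left(-\frac{\big(d^{(t)}-y_j(x^{(t)})\big)^{2}}{2\sigma_j^{2}}\right)

% Independence across patterns gives the total likelihood and the cost function:
L \;=\; \prod_{t} P\big(d^{(t)} \mid x^{(t)}\big),
\qquad
C \;=\; -\ln L \;=\; -\sum_{t}\ln P\big(d^{(t)} \mid x^{(t)}\big)
```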
6
Search Parameters By EM Algorithm
Map the problem onto EM by introducing a hidden indicator variable I(t) = [0 ··· 1 ··· 0]: I_j(t) = 1 if pattern t was generated by the jth expert, 0 otherwise (in the slide's figure, the 2nd of the experts y1, y2, y3, ..., yK contributes the output d).
Initially guess the model parameters (j = 1, 2, ..., K), then repeat until convergence:
Expectation step: use the current model parameters to estimate I_j(t).
Maximization step: use the estimated I_j(t) to update the model parameters so as to minimize the cost function.
(Figure: the gated-network architecture again, with E[d | x], expert outputs y1(x), ..., yK(x), gating outputs g1(x), ..., gK(x), Expert 1, ..., Expert K, and the gating network.)
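The equations for the two steps also did not survive extraction; a sketch of them for this model, writing h_j(t) for the expected value of the indicator I_j(t) given the current parameters:

```latex
% E step: posterior probability (responsibility) that expert j generated pattern t;
% this is the expected value of I_j(t) under the current parameters.
h_j(t) \;=\;
\frac{ g_j\!\big(x^{(t)}\big)\, \mathcal{N}\!\big(d^{(t)};\, y_j(x^{(t)}),\, \sigma_j^{2}\big) }
     { \sum_{k=1}^{K} g_k\!\big(x^{(t)}\big)\, \mathcal{N}\!\big(d^{(t)};\, y_k(x^{(t)}),\, \sigma_k^{2}\big) }

% M step: with h_j(t) held fixed, minimize the expected cost
C_M \;=\; -\sum_{t}\sum_{j} h_j(t)\,
           \ln\!\Big[ g_j\!\big(x^{(t)}\big)\,
                      \mathcal{N}\!\big(d^{(t)};\, y_j(x^{(t)}),\, \sigma_j^{2}\big) \Big]
% over the expert and gating parameters; the variance update, for example, is
\sigma_j^{2} \;=\; \frac{\sum_{t} h_j(t)\,\big(d^{(t)} - y_j(x^{(t)})\big)^{2}}
                        {\sum_{t} h_j(t)}
```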
7
Search Parameters By EM Algorithm
E step: model parameters → hidden variable. Assume 3 experts; start from an initial guess of the model parameters. From this guess of the unknown parameters, estimate the unknown hidden structure (which expert is responsible for each pattern).
8
Search Parameters By EM Algorithm
E step: model parameters → hidden variable (from the current guess of the unknown parameters, estimate the unknown hidden structure).
M step: hidden variable → model parameters.
9
Search Parameters By EM Algorithm
E step: model parameters → hidden variable (from the current guess of the unknown parameters, estimate the unknown hidden structure).
M step: hidden variable → model parameters.
10
Search Parameters By EM Algorithm
E step: model parameters → hidden variable (from the current guess of the unknown parameters, estimate the unknown hidden structure).
M step: hidden variable → model parameters.
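The three preceding slides animate this alternation on plots that are not reproduced here. As a compact, self-contained illustration (not the paper's implementation, which uses neural-network experts and gate), the sketch below runs the same E/M alternation with linear experts and a softmax-linear gate; the toy data, the block regime structure, and the gate's learning rate are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data drawn from two linear regimes (illustrative only).
x = rng.uniform(-1, 1, size=500)
d = np.where(x < 0, 2.0 * x + 1.0, -1.5 * x) + 0.05 * rng.normal(size=x.size)
X = np.column_stack([x, np.ones_like(x)])           # inputs with a bias column

K = 2
W_exp = rng.normal(size=(K, 2))                     # linear expert weights
V_gate = rng.normal(size=(K, 2))                    # linear gating weights (softmax scores)
var = np.ones(K)                                    # expert variances sigma_j^2

for it in range(50):
    # ----- E step: responsibilities h_j(t) from the current parameters -----
    y = X @ W_exp.T                                 # (T, K) expert predictions
    scores = X @ V_gate.T
    g = np.exp(scores - scores.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)               # gating probabilities g_j(x)
    dens = np.exp(-(d[:, None] - y) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    h = g * dens
    h /= h.sum(axis=1, keepdims=True)               # posterior h_j(t)

    # ----- M step: update parameters using the responsibilities -----
    for j in range(K):
        # Weighted least squares for expert j's weights.
        A = X * h[:, j:j + 1]
        W_exp[j] = np.linalg.solve(X.T @ A + 1e-6 * np.eye(2), A.T @ d)
        # Closed-form variance update.
        var[j] = (h[:, j] * (d - X @ W_exp[j]) ** 2).sum() / h[:, j].sum()
    # One gradient step for the gate (a partial, generalized-EM style M step).
    V_gate += 0.5 * ((h - g).T @ X) / x.size
```

Updating the gate by a single gradient step rather than to full convergence is a generalized EM step; that is still enough for the cost to keep decreasing, which is the point of the next slide.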
11
Search Parameters By EM Algorithm
Why it converges: EM maximizes the likelihood function, i.e., minimizes the cost function; Jordan and Xu (1993) proved convergence of the EM approach for mixtures of experts.
M step: update the parameters using the expected value of I_j(t) computed in the E step, minimizing the expected cost C_M.
Analysis: since g_j approximates h_j, minimizing C_M drives each g_j toward 0 or 1.
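The inequality behind this argument is not written out on the slide; the standard EM bound it relies on is, with h_j(t) computed from the current parameters θ_old,

```latex
C(\theta) \;=\; -\sum_{t}\ln\sum_{j} g_j\big(x^{(t)};\theta\big)\,
                 p_j\big(d^{(t)} \mid x^{(t)};\theta\big)
\;\le\; C_M(\theta; h) \;+\; \sum_{t}\sum_{j} h_j(t)\,\ln h_j(t)
```

with equality at θ = θ_old (Jensen's inequality). Since the second term does not depend on θ, any M step that lowers C_M also lowers the true cost C, so the cost decreases monotonically.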
12
Experiment 1: Computer-Generated Data
Data: mixture of two processes; training set and test set each contain 1,000 samples.
Source 1: x_{t+1} = tanh(-1.2 x_t + ε_{t+1}), ε ~ N(0, 0.1)
Source 2: x_{t+1} = 2 (1 - x_t²) - 1
Structure: 2 input attributes (the last 2 values) for each local expert; 4 input attributes (the last 4 values) for the gating network; 3 local experts (each with 1 hidden layer of 10 nodes); gating network with 1 hidden layer of 20 nodes.
Measure: E_NMS (normalized mean squared error).
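A small sketch of how this data set could be simulated. The slide does not say how the two sources are interleaved, nor whether 0.1 is the variance or the standard deviation of ε, so the block-wise switching and the variance reading below are assumptions.

```python
import numpy as np

def generate_mixture(n_samples=1000, block=50, noise_std=np.sqrt(0.1), seed=0):
    """Generate a series that alternates between the two sources in blocks."""
    rng = np.random.default_rng(seed)
    regime = (np.arange(n_samples) // block) % 2   # 0 = source 1, 1 = source 2 (assumed switching scheme)
    x = np.empty(n_samples)
    x[0] = rng.uniform(-0.5, 0.5)
    for t in range(n_samples - 1):
        if regime[t] == 0:
            # Source 1: x_{t+1} = tanh(-1.2 x_t + eps), eps ~ N(0, 0.1) (read here as variance 0.1)
            x[t + 1] = np.tanh(-1.2 * x[t] + noise_std * rng.normal())
        else:
            # Source 2: x_{t+1} = 2 (1 - x_t^2) - 1 (deterministic map)
            x[t + 1] = 2.0 * (1.0 - x[t] ** 2) - 1.0
    return x, regime

train_x, train_regime = generate_mixture(seed=0)
test_x, test_regime = generate_mixture(seed=1)
```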
13
Experiment 1: Analysis and Interpretation
1. The gated network splits the input space very well.
2. The variance associated with each expert approaches the real variance.
14
Experiment 1: Analysis and Interpretation
(Figure: minimum E_NMS for a single network with 1 hidden layer of 50 nodes, a single network with 1 hidden layer of 10 nodes, and the gated network.)
Analysis 3: the gated network performs better than a single network.
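E_NMS is never defined on the slides; a common definition of the normalized mean squared error, which I assume is what is meant here, normalizes the squared prediction error by the variance of the target series:

```latex
E_{\mathrm{NMS}} \;=\;
\frac{\sum_{t}\big(d^{(t)} - \hat{d}^{(t)}\big)^{2}}
     {\sum_{t}\big(d^{(t)} - \bar{d}\big)^{2}}
```

where \hat{d}^{(t)} is the model prediction and \bar{d} is the mean of the targets; E_NMS = 1 corresponds to always predicting the mean.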
15
Experiment 2: Laboratory Data -- Laser
Data: laser data approximated by three different nonlinear equations; training data: 10,000 samples; test data: two groups, each with 1,250 samples.
Structure: 10 input attributes (the last 10 values); 8-15 local experts (each with 1 hidden layer of 5 nodes); gating network with 1 hidden layer of 10 nodes.
16
Experiment 2: Analysis and Interpretation
1. In the end, 5/6 local experts survived.
2. Each local expert is responsible for a different part of the function (partitions, valleys, and peaks).
17
Experiment 3: Electricity Demand of France
Data: multivariate (51 input attributes, e.g., weather, day of week, ...); multi-scale (daily, weekly, yearly); multi-regime (holidays, weekdays, summer, winter, ...).
Training set: Jan. 1 to Dec. 31, 1992. Test set: Jan. 1 to March 1, 1994.
Structure: 51 input attributes; 8 local experts (each with 1 hidden layer of 5 nodes); gating network with 1 hidden layer of 10 nodes.
18
Experiment 3: Analysis and Interpretation
The gated network gives a good explanation of the results:
Expert 1: the days around holidays
Expert 2: holidays
Expert 3: the warmer season
Expert 4: the colder season
19
Experiment 3: Model Comparison
Analysis: 1. The gated network shows the least degree of overfitting.
20
Discussion
Advantages:
The gated network uses a divide-and-conquer strategy to partition the input space.
Each local expert can really concentrate on one part of the problem and ignore the rest.
The gated network is trained with the EM algorithm.
Disadvantages:
Local experts are built only over local regions, so how can we discover the global trend across all regions (such a trend is commonly observed in time series)?
The EM algorithm cannot guarantee a global optimum, so how does this affect the performance of the gated network?
21
Conclusion
The gated network performs significantly better than a single network.
It discovers hidden regimes, which helps us understand the input space.
It shows less overfitting than other regression models that assume a global noise scale.