CS 189 Brian Chu Slides at:
Office Hours: Cory 246, 6-7p Mon. (hackerspace lounge)
Terminology
Unit – each “neuron”
2-layer neural network: a neural network with one hidden layer (what you’re building)
Epoch – one pass through the entire training data (N = number of training examples)
For SGD, this is N iterations
For mini-batch gradient descent (batch size of B), this is N/B iterations
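The epoch arithmetic above, as a tiny sketch. The helper name `iterations_per_epoch` is made up for illustration, and rounding up when B does not divide N is an assumption (the slide only covers the even case).

```python
import math

def iterations_per_epoch(n_examples, batch_size):
    """Number of parameter updates in one pass over the training data."""
    # ceil covers a smaller final mini-batch when batch_size does not divide n_examples
    return math.ceil(n_examples / batch_size)

sgd_iters = iterations_per_epoch(20, 1)         # SGD: batch size 1, so N iterations
minibatch_iters = iterations_per_epoch(20, 10)  # mini-batch: N/B iterations
```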
First off…
Many of you will struggle to even finish. In that case, you can ignore my bells and whistles.
My 2.6GHz quad-core, 16GB RAM MacBook takes ~1.5 hours to train to ~96-97% accuracy.
First off… Add a signal handler + snapshotting
E.g. implement functionality where, if you press Ctrl-C (on Unix systems, this sends the interrupt signal, SIGINT), your code saves a snapshot of the state of the training (current iteration, decayed learning rate, momentum, current weights, anything else), then exits. Look into the Python “signal” and “pickle” libraries.
Art of tuning Training neural nets is an art, not a science
Cross-validation? Pfffft
“I used to tune that parameter but I’m too lazy and I don’t bother any more” – grad student talking about the weight decay hyperparameter.
There are way too many hyperparameters for you to tune.
Training is too slow for you to bother using cross-validation.
Many hyperparameters: just use what is standard and spend your time elsewhere.
Knobs
Learning: SGD/mini-batch/full batch, momentum, RMSprop, Adagrad, NAG, etc.
How to decay?
Activations: ReLU, tanh, sigmoid
Loss: MSE or cross-entropy (with softmax)
Regularization: L1, L2, max-norm, Dropout, DropConnect
Convolutional layers
Initialization: Xavier, Gaussian, etc.
When to stop? Early stop? Stopping rule? Or just run forever
I recommend
* = what everyone in the literature, in practice, uses
Cross-entropy, softmax *
Only decay per epoch (or more than 1 epoch) * (e.g. don’t just divide by # iterations)
Epoch = one training pass through the entire data. Only decay after a round of seeing every data point.
Note: if your mini-batch size is 10 and N = 20, then one epoch is 2 iterations.
Momentum learning rate ( ?) *
Maybe RMSProp?
Mini-batch (somewhere between ) *
No regularization.
Gaussian initialization (mean 0, std. dev. 0.01) *
Run forever, take a snapshot when you feel like stopping (seriously!)
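A rough sketch of how these recommendations might fit together: Gaussian initialization, mini-batch updates with momentum, and a learning rate that is decayed once per epoch rather than per iteration. The function names and the multiplicative decay rule are assumptions for illustration. The hyperparameter values in the usage lines are arbitrary picks for a toy least-squares problem, not the values from the slide, and `grad_fn` stands in for your backprop.

```python
import numpy as np

def init_weights(shape, rng):
    # Gaussian initialization as recommended: mean 0, std. dev. 0.01
    return rng.normal(loc=0.0, scale=0.01, size=shape)

def train(W, X, Y, grad_fn, lr, momentum, decay, batch_size, n_epochs, rng):
    """Mini-batch SGD with momentum; lr is decayed once per epoch."""
    n = X.shape[0]
    velocity = np.zeros_like(W)
    for epoch in range(n_epochs):
        order = rng.permutation(n)              # every data point is seen once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = grad_fn(W, X[idx], Y[idx])
            velocity = momentum * velocity - lr * grad
            W = W + velocity
        lr *= decay                             # decay per epoch, not per iteration
    return W

# Toy usage: a least-squares gradient stands in for the network's gradient.
def least_squares_grad(W, X, Y):
    return X.T @ (X @ W - Y) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true
W = train(init_weights(3, rng), X, Y, least_squares_grad,
          lr=0.1, momentum=0.9, decay=0.95, batch_size=20, n_epochs=50, rng=rng)
```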
Activation functions
tanh >>> sigmoid (tanh is just a scaled and shifted sigmoid anyways: tanh(x) = 2*sigmoid(2x) - 1)
ReLU = stacked sigmoid
ReLU is basically standard in computer vision
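A small numeric check of the two relationships on this slide. The tanh identity is exact. For “ReLU = stacked sigmoid”, the sketch assumes the usual reading: a sum of sigmoids with biases shifted by 0.5, 1.5, 2.5, ... approximates softplus, log(1 + e^x), which is a smooth version of ReLU. The function names and the choice of 50 stacked units are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def stacked_sigmoid(x, n_units=50):
    """Sum of n_units sigmoids with offsets 0.5, 1.5, 2.5, ..."""
    offsets = np.arange(n_units) + 0.5
    return sigmoid(x[:, None] - offsets[None, :]).sum(axis=1)

x = np.linspace(-5.0, 5.0, 101)

# tanh is a sigmoid stretched to (-1, 1) with its input scaled by 2
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0

# the stacked sigmoids track softplus, the smooth cousin of ReLU
softplus = np.log1p(np.exp(x))
stacked = stacked_sigmoid(x)
```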
Almost certainly will improve accuracy, but total overkill
Considered “standard” today:
Convolutional layers (with max-pooling)
Dropout (DropConnect?)
If using numpy Not a single for-loop should be in your code.
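For instance, the forward pass of a 2-layer network can handle a whole mini-batch with matrix products and no loop over examples or units. This is a sketch assuming a tanh hidden layer and a softmax output; the names and layer sizes are hypothetical. (The outer loops over epochs and mini-batches still have to exist somewhere.)

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Class probabilities for every row of X at once."""
    hidden = np.tanh(X @ W1 + b1)                  # (batch, n_hidden)
    logits = hidden @ W2 + b2                      # (batch, n_classes)
    logits -= logits.max(axis=1, keepdims=True)    # stabilise exp without changing softmax
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                        # 5 examples, 4 features
W1, b1 = rng.normal(0.0, 0.01, size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(0.0, 0.01, size=(3, 2)), np.zeros(2)
probs = forward(X, W1, b1, W2, b2)
```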
Avoid unnecessary memory allocation: use the “out=” keyword argument to re-use numpy arrays.
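A short sketch of the `out=` pattern, assuming a tanh hidden layer: allocate the buffer once, then write every intermediate result into it. The function name and array sizes are hypothetical. Note that `np.dot` requires the `out` array to be C-contiguous and of the dtype the product would have anyway.

```python
import numpy as np

def hidden_layer_inplace(X, W1, b1, buf):
    """Compute tanh(X @ W1 + b1) into the preallocated array buf."""
    np.dot(X, W1, out=buf)      # matrix product written straight into buf
    np.add(buf, b1, out=buf)    # bias added in place (b1 broadcasts over rows)
    np.tanh(buf, out=buf)       # activation applied in place
    return buf

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))            # hypothetical mini-batch
W1 = rng.normal(0.0, 0.01, size=(100, 20))
b1 = np.zeros(20)
buf = np.empty((50, 20))                  # allocated once, reused every iteration
H = hidden_layer_inplace(X, W1, b1, buf)
```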
May want to consider Faster implementation than Python w/ numpy:
Cython, Java, Go, Julia, etc.
Honestly, if you want to win…
(if you have a compatible graphics card)
Write a CUDA or OpenCL implementation, train for many days. (You might consider adding regularization in this case.)
I didn’t do this: I used other generic tricks that you can read about in the literature.
Debugging
Check your dimensions
Check your numpy dtypes
Check your derivatives – comment all your backprop steps
Numerical gradient calculator:
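A generic gradient check, sketched here with central differences. This is not the calculator the slide linked to. It compares a hand-derived gradient of a softmax cross-entropy loss (a single linear layer, for brevity) against a numerical estimate. All function names are illustrative. Run it in float64, since float32 round-off swamps the finite differences, and a Python loop is fine here because this is debug-only code.

```python
import numpy as np

def softmax_cross_entropy(W, X, Y):
    """Mean cross-entropy of softmax(X @ W) vs. one-hot Y, and its gradient w.r.t. W."""
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -(Y * log_probs).sum() / X.shape[0]
    grad = X.T @ (np.exp(log_probs) - Y) / X.shape[0]    # dL/dW = X^T (p - y) / n
    return loss, grad

def numerical_gradient(f, W, eps=1e-5):
    """Central-difference estimate of df/dW, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        loss_plus = f(W)
        W[idx] = old - eps
        loss_minus = f(W)
        W[idx] = old                                     # restore the entry
        grad[idx] = (loss_plus - loss_minus) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y = np.eye(3)[rng.integers(0, 3, size=6)]                # one-hot labels, 3 classes
W = rng.normal(0.0, 0.01, size=(4, 3))

_, analytic = softmax_cross_entropy(W, X, Y)
numeric = numerical_gradient(lambda w: softmax_cross_entropy(w, X, Y)[0], W)
max_abs_diff = np.abs(analytic - numeric).max()
```

If `max_abs_diff` is not tiny compared with the gradient entries, suspect the backprop step first.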
Connection with SVMs / linear classifiers with kernels
Kernel SVM can be thought of as:
1st layer: |units| = |support vectors|
Value of each unit i = K(query, train(i))
2nd layer: linear combo of first layer
Simplest training for 1st layer: store all training points as templates.
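The two-layer view above, as a sketch. The RBF kernel is an assumed choice of K, and the names `templates`, `coeffs`, `bias` and `gamma` are illustrative. In a trained SVM the second-layer coefficients would come from the solver; here they are just given.

```python
import numpy as np

def rbf_kernel(query, templates, gamma):
    """1st layer: unit i outputs K(query, train(i)) = exp(-gamma * ||query - train(i)||^2)."""
    sq_dist = ((templates - query) ** 2).sum(axis=1)
    return np.exp(-gamma * sq_dist)

def kernel_machine_predict(query, templates, coeffs, bias, gamma):
    hidden = rbf_kernel(query, templates, gamma)   # one unit per stored template
    return hidden @ coeffs + bias                  # 2nd layer: linear combo of the units

templates = np.array([[0.0, 0.0], [1.0, 0.0]])    # stored training points
coeffs = np.array([2.0, -1.0])
score = kernel_machine_predict(np.array([0.0, 0.0]), templates, coeffs, bias=0.5, gamma=1.0)
```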