Zip Codes and Neural Networks: Machine Learning for Handwritten Number Recognition

Taylor Harbold, Michelle Page, Courtney Rasmussen
Supervisor: Dr. Cuixian Chen
UNCW Department of Mathematics and Statistics

Introduction

The neural network is an idea from neuroscience that dates back to the 1940s. It began as an attempt to model how neurons work using electrical circuits, and it has since led to major advances in artificial intelligence. Neural networks use a simplified model of the synaptic processes that occur in the brain to interpret information: raw input is taken in, organized and interpreted, and a conclusion is reached. In statistics, neural networks mimic these processes by employing projection pursuit regression and back propagation. The network passes the input data through non-linear equations whose parameters are adjusted repeatedly, so it gets better at interpreting data with practice, much like our brains. By doing this, we create a simpler method for solving complex problems. Although neural networks have a wide array of uses, such as facial recognition and stock prediction, we applied these techniques to predict the true value of handwritten digits from ZIP code data. Using MATLAB and the RSNNS package in R (through RStudio), we created prediction models that can "read" human handwriting.

Statistical Techniques

A neural network consists of three kinds of layers: an input layer, a hidden layer, and an output layer. The input layer is made up of the vector $x = (x_1, x_2, \dots, x_p)^T$, where $p$ denotes the number of dimensions of the input. The subsequent layer is the hidden layer, which consists of hidden neurons denoted by $Z_m$. To calculate $Z_m$ we use $Z_m = \sigma(\alpha_{0m} + \alpha_m^T X)$ for $m = 1, 2, \dots, M$, where $M$ is the number of hidden neurons in the layer. The most widely used activation function is the sigmoid function, denoted by $\sigma$ and given by $\sigma(v) = \frac{1}{1 + e^{-v}}$.
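The hidden-layer formula above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the authors' MATLAB or RSNNS code; the function names `sigmoid` and `hidden_layer`, the layer sizes, and the random input and weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(v):
    """Sigmoid activation: sigma(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def hidden_layer(x, alpha0, alpha):
    """Compute Z_m = sigma(alpha_0m + alpha_m^T x) for m = 1..M.

    x      : input vector, shape (p,)
    alpha0 : hidden-neuron biases, shape (M,)
    alpha  : hidden-neuron weights, shape (M, p), one row per neuron
    """
    return sigmoid(alpha0 + alpha @ x)

# Hypothetical sizes and values, chosen only for illustration.
rng = np.random.default_rng(0)
p, M = 256, 5
x = rng.uniform(0.0, 1.0, size=p)            # made-up input vector
alpha0 = rng.uniform(-1.0, 1.0, size=M)      # made-up biases
alpha = rng.uniform(-1.0, 1.0, size=(M, p))  # made-up weights

z = hidden_layer(x, alpha0, alpha)           # one activation per hidden neuron
```

Each entry of `z` lies between 0 and 1 because the sigmoid squashes any real number into that interval.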
The output layer is the response variable, in the form $y = (y_1, y_2, \dots, y_K)$. To calculate the response variable we use $y_k = g_k(T)$ with $T_k = \beta_{0k} + \beta_k^T Z$. The initial weights (or slopes) and biases (or intercepts) for $Z$ and $Y$ are randomly generated from the range $[-1, 1]$. In most cases the sigmoid function also serves as the non-linear function $g_k$, so that $y_k = \sigma(T_k)$. With $K = 1$ the network performs regression, whereas $K > 1$ calls for a classification network.

Model Refinement

To find the optimal model, we needed to find the number of hidden neurons in the hidden layer that gives the best accuracy. We tested 5, 50, 75, and 100 hidden neurons. After the first series of analyses, 100 hidden neurons proved best. Using more hidden neurons usually gives a more flexible model that can capture more of the structure in the data. That flexibility was needed here because each observation has 256 input values. The learning rate is the length of the step taken when the weights and biases of the model are updated at each iteration of back propagation. Holding the learning rate constant at 0.1, the model with 100 hidden neurons had a test accuracy of 0.9387145, compared with 0.937718 for the model with 75 hidden neurons. Because the accuracies of these two models are so close, we decided to compare models at different learning rates to find the most accurate one.
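To make the accuracy figures concrete, the sketch below shows one common way to score a classification network with $K > 1$ outputs: assign each observation to the class with the largest output and report the share of correct assignments. The poster does not state how its accuracy was computed, so this scoring rule is an assumption, and the function names and toy numbers are hypothetical (three classes are used for brevity; the ZIP code problem has ten).

```python
import numpy as np

def predict_classes(outputs):
    """Pick, for each row, the index of the largest network output."""
    return np.argmax(outputs, axis=1)

def accuracy(outputs, targets_one_hot):
    """Share of rows whose predicted class matches the true class."""
    predicted = predict_classes(outputs)
    actual = np.argmax(targets_one_hot, axis=1)
    return float(np.mean(predicted == actual))

# Made-up network outputs for four observations and K = 3 classes.
outputs = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.6],
    [0.8, 0.1, 0.3],
])
# Made-up true classes in binary (one-hot) form: 0, 1, 2, 1.
targets = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])

acc = accuracy(outputs, targets)  # three of the four rows are classified correctly
```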
Table of Model Accuracy: Single Hidden Layer Back-Propagation Network for Classification

| Number of Hidden Neurons | Learning Rate | Minimum Weighted SSE | Accuracy of Training Model | Accuracy of Test Model |
|---|---|---|---|---|
| 5 (Base Model) | 0.1 | 2700 | 0.8777945 | 0.7927255 |
| 50 | 0.1 | 900 | 0.9950624 | 0.9317389 |
| 75 | 0.1 | 850 | 0.9954739 | 0.937718 |
| 100 | 0.1 | 825 | 0.9962968 | 0.9387145 |
| 100 | 0.3 | 800 | 0.995611 | 0.935725 |
| | 0.5 | 780 | 0.996434 | |
| | | 750 | 0.9969826 | 0.9431988 |
| | 0.6 | | 0.9965711 | 0.9342302 |
| | | | 0.9968454 | 0.9402093 |

The second stage of model refinement was to find the best learning rate. Using 100 hidden neurons, we tested the model at learning rates of 0.1, 0.3, 0.5, and 0.6. Test accuracy rose from 0.9387145 at a learning rate of 0.1 to 0.9431988 at 0.5, with a slight dip at 0.3 (0.935725), and it declined once the learning rate went past 0.5. The learning rate that produced the best model with 100 hidden neurons was therefore 0.5, with an accuracy of 0.9431988. This shows that we were able to predict the ZIP code digits with 94.32% accuracy and 5.68% error.

Data

In the United States, ZIP codes are a vital part of the postal service. A ZIP code is made up of five digits, each ranging from zero to nine, and usually follows a pattern that depends on geographical location within each region of the country. For example, North Carolina has ZIP codes starting with 27 and 28, whereas Massachusetts has ZIP codes starting with 010 through 027. In some states every ZIP code begins with the same two digits; for example, all ZIP codes in Utah start with 84.

[Figure: Geographical location of the first two digits of ZIP code]

[Figure: Example of handwritten ZIP codes]

[Figure: network diagram with INPUT, HIDDEN LAYER, and RESPONSE columns]

$$Z_1 = \sigma(\alpha_{01} + \alpha_{11} X_1 + \alpha_{21} X_2 + \alpha_{31} X_3) = \sigma(-0.4 + 0.2 \cdot 1 + 0.4 \cdot 0 + (-0.5 \cdot 1)) = \sigma(-0.7) = \frac{1}{1 + e^{0.7}} = 0.332$$

Therefore $Z_1 = 0.332$.

$$Y_1 = g_1(0.1 + (-0.3 \cdot 0.332) + (-0.2 \cdot 0.525)) = g_1(-0.105) = \sigma(-0.105) = \frac{1}{1 + e^{0.105}} = 0.474$$

Therefore $Y_1 = 0.474$.
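The arithmetic in the worked example for $Z_1$ and $Y_1$ can be checked numerically. The short sketch below uses the weights, biases, and inputs from that example, with $Z_2 = 0.525$ taken as given there; the variable names are illustrative only.

```python
import math

def sigmoid(v):
    """Sigmoid activation: sigma(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

# Hidden neuron Z_1: inputs X = (1, 0, 1), bias -0.4, weights (0.2, 0.4, -0.5).
x = [1, 0, 1]
alpha0_1 = -0.4
alpha_1 = [0.2, 0.4, -0.5]
pre_z1 = alpha0_1 + sum(a * xi for a, xi in zip(alpha_1, x))  # linear part, -0.7
z1 = sigmoid(pre_z1)                                          # about 0.332

# Output Y_1: hidden values (0.332, 0.525), bias 0.1, weights (-0.3, -0.2).
z = [0.332, 0.525]  # Z_1 rounded as in the example, Z_2 as given there
beta0_1 = 0.1
beta_1 = [-0.3, -0.2]
pre_y1 = beta0_1 + sum(b * zi for b, zi in zip(beta_1, z))    # linear part, about -0.105
y1 = sigmoid(pre_y1)                                          # about 0.474

print(round(z1, 3), round(y1, 3))
```

The linear part for $Z_1$ is negative, which is why the result falls below 0.5.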
The data used to analyze handwritten ZIP codes were taken from envelopes in the United States that had been automatically scanned by the U.S. Postal Service. Each digit of a ZIP code was isolated and overlaid with a 16 by 16 grid to create an eight-bit grayscale map. Each pixel in the map takes a value from 0 to 255, depending on the size and clarity of the handwriting in that isolated digit.

Neural networks made it possible to predict handwritten ZIP code digits using a training data set. The training and test data each had ten response columns, one for each possible digit (0, 1, 2, …, 9). The responses were recorded in binary form, with 0 meaning false and 1 meaning true. The training data had 7,291 rows for modeling the handwritten digits. The test data were comprised of the same ten columns and 2,007 rows of binary responses.

[Figure: Training Data and Testing Data]

[Figure: Weighted SSE Graphical Summary, with panels Base Model and Final Model]

References
http://www.whereig.com/usa/zipcodes/
https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

After running the model, we calculate the error of each perceptron using a formula that depends on the layer. For the output layer we use $Err_k = O_k (1 - O_k)(T_k - O_k)$, where $O_k$ is the output value for $y_k$ and $T_k$ is the true value (here $T_k$ denotes the target, not the linear combination used in the output layer). For the hidden layer we use $Err_m = z_m (1 - z_m) \sum_{k=1}^{K} Err_k \beta_{mk}$. We then use a function of these values to update the weights and biases between the perceptrons and run the model again.
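The two error formulas above can be sketched as follows, together with one update of the hidden-to-output weights using the rule $w^{new} = w^{old} + \ell \, Err_j \, O_i$ that the poster gives for weights and biases. This is a minimal NumPy illustration, not the authors' code: the hidden values, output, and weights are borrowed from the worked example, while the target value of 1, the learning rate of 0.5, and all function and variable names are assumptions.

```python
import numpy as np

def output_errors(o, t):
    """Err_k = O_k (1 - O_k) (T_k - O_k) for each output neuron."""
    return o * (1.0 - o) * (t - o)

def hidden_errors(z, err_out, beta):
    """Err_m = z_m (1 - z_m) * sum_k Err_k beta_mk for each hidden neuron.

    beta has shape (M, K): beta[m, k] links hidden neuron m to output k.
    """
    return z * (1.0 - z) * (beta @ err_out)

# Values from the worked example; the target t is hypothetical.
z = np.array([0.332, 0.525])       # hidden-layer values Z_1, Z_2
o = np.array([0.474])              # network output Y_1
t = np.array([1.0])                # assumed true value
beta = np.array([[-0.3], [-0.2]])  # hidden-to-output weights
beta0 = np.array([0.1])            # output bias
lr = 0.5                           # learning rate (the symbol l in the update rule)

err_k = output_errors(o, t)
err_m = hidden_errors(z, err_k, beta)

# One update step: each weight moves by lr * Err_j * O_i, each bias by lr * Err_j.
beta_new = beta + lr * np.outer(z, err_k)
beta0_new = beta0 + lr * err_k
```

Because the output is below the assumed target, `err_k` is positive and the update increases both hidden-to-output weights, which pushes the next output closer to the target.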
The weights and biases are updated with the learning rate $\ell$:

For weights: $w_{ij}^{new} = w_{ij}^{old} + \ell \, Err_j \, O_i$

For biases: $\theta_j^{new} = \theta_j^{old} + \ell \, Err_j$

We then run the model once again with the new weights and biases, bringing our output values closer to the true values of $Y$ with each iteration.

| Model | Hidden Neurons | Learning Rate | Accuracy of Test Model |
|---|---|---|---|
| Base Model | 5 | 0.1 | 0.7927255 |
| Final Model | 100 | 0.5 | 0.9431988 |

We found that predicting the handwritten ZIP code digits most accurately required a model with 100 hidden neurons and a learning rate of 0.5. Using this model, the handwritten digits can be predicted with 94.32% accuracy and 5.68% error.

Future Studies

In future research, it would be worthwhile to increase the number of hidden layers in the model. Varying the number of hidden layers would influence the model and help determine whether regression or clustering would be more suitable. Additional hidden layers could later be used to create a hierarchy that analyzes the images at different levels of resolution.