Non-volatile Memory Based Convolutional Neural Networks for Transfer Learning
Zhen Dong (1,2), Haitong Li (1), and H.-S. Philip Wong (1)
(1) Stanford SystemX Alliance, Stanford University; (2) Institute of Microelectronics, Peking University

Introduction

Deep learning has succeeded in many fields: AlphaGo won its match against Ke Jie in May 2017, progress in object detection is pushing automated driving forward, and AI for medicine and healthcare is flourishing.

Training large-scale networks is energy- and time-consuming. Deep learning is driven by huge amounts of data, such as the roughly 160 GB ImageNet dataset; training a network like ResNet-50 requires eight powerful Tesla P100 GPUs, and the training period can last for days.

Utilizing a non-volatile memory array can speed up DNNs. Integrating memory and computing breaks through the von Neumann bottleneck and provides parallelism naturally. RRAM (resistive RAM, a type of non-volatile memory) offers high speed, small area, a simple structure, and CMOS compatibility, and it provides multilevel resistance states and the potential for 3D integration.

RRAM Characteristics

Typical I-V curve and multilevel resistances: the figure shows the typical I-V curve of RRAM devices, where the black curve corresponds to the set process and the red curve to the reset process. The high-resistance state (HRS) can be set abruptly to the low-resistance state (LRS), while the LRS can be reset to the HRS gradually.

Compact Model of RRAM

The physical model of RRAM is based on conductive filament theory and is used for all the network simulations in this work. The schematic of the 3D RRAM array shows all vertical layers sharing the same selective layer, which consists of transistors.

Utilizing an RRAM Array in a Small CNN Architecture

The crossbar array performs the multiplication between a vector and a matrix. Two rows of RRAM represent one kernel, since a weight value can be positive or negative. Switching the order of pooling and activation makes the circuit implementation easier.

Influence of RRAM conductance variation on recognition accuracy: as the variation increases, the average accuracy begins to fall and the fluctuation of the recognition accuracy grows. Keeping the variation below 50% is therefore necessary.

Utilizing RRAM in Large-scale Neural Networks

(a) Architecture of VGG-16: VGG-16 is used as a representative of traditional CNN architectures, and Google Inception V3 is also tested as a state-of-the-art structure.
(b) Influence of variation: because large-scale CNN architectures have more parameters and layers, the influence of conductance variation on accuracy is more severe than in the small networks.
(c) Influence of quantization: in model compression, weight sharing or the method described in BWN can achieve high accuracy. In transfer learning tasks, however, any pre-trained model can be simulated with multilevel RRAM without affecting the final transfer performance.

Transfer learning implementation: Scheme II

Scheme II uses multilevel RRAM in the last layer. Additional memory is needed to store the full-precision weight values, and the RRAM crossbar array is refreshed by quantizing the weight array before every iteration. Operating multilevel RRAM is simpler than operating analog RRAM; even binary RRAM can be used in this scheme, offering high speed, low variation, and high endurance.
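A minimal NumPy sketch of the differential crossbar mapping and the Scheme II quantization refresh described above: a signed weight matrix is split onto two non-negative conductance arrays (one row pair per kernel), quantized to a small number of RRAM levels, optionally perturbed by conductance variation, and the layer output is read as the difference of the two crossbar currents. The function names, level count, conductance scale, and variation magnitude are illustrative assumptions, not values from the poster.

```python
import numpy as np

def program_crossbar(w, n_levels=8, g_max=1.0, variation=0.0, rng=None):
    """Map a signed weight matrix onto a differential pair of crossbars.

    w         : (out, in) full-precision weights (kept in extra memory, Scheme II)
    n_levels  : number of programmable conductance levels per multilevel cell
    g_max     : conductance corresponding to the largest |weight|
    variation : relative std of conductance (device-to-device variation)
    Returns (g_pos, g_neg): non-negative conductances; each kernel uses two rows.
    """
    rng = np.random.default_rng() if rng is None else rng
    w_max = np.abs(w).max() + 1e-12
    # Positive weights go to one row of the pair, negative weights to the other.
    g_pos = np.clip(w, 0, None) / w_max * g_max
    g_neg = np.clip(-w, 0, None) / w_max * g_max
    # Quantize to the available multilevel conductance states (refresh step).
    step = g_max / (n_levels - 1)
    g_pos = np.round(g_pos / step) * step
    g_neg = np.round(g_neg / step) * step
    # Optional conductance variation (e.g., the 10-50% range studied here).
    if variation > 0:
        g_pos *= 1 + variation * rng.standard_normal(g_pos.shape)
        g_neg *= 1 + variation * rng.standard_normal(g_neg.shape)
    return np.clip(g_pos, 0, None), np.clip(g_neg, 0, None)

def crossbar_matvec(x, g_pos, g_neg):
    """Output currents of the row pairs, proportional to the ideal w @ x."""
    return x @ g_pos.T - x @ g_neg.T

# Toy usage: a 16-input, 4-kernel layer with 8 levels and 10% variation.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)) * 0.1
x = rng.standard_normal(16)
g_pos, g_neg = program_crossbar(w, n_levels=8, variation=0.1, rng=rng)
print(crossbar_matvec(x, g_pos, g_neg))
```

In a Scheme II training loop, the full-precision w kept in the additional memory would be updated by backpropagation and program_crossbar re-applied to refresh the array before every iteration.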
Measured Analog Characteristics of RRAM

Different resistive states can be obtained by applying AC pulses: the ratio among these resistive states is around 1:3:7. The variation of the high-conductance state is about 10%, while that of the low-conductance state is about 40%.

A gradual reset process can be achieved by applying continuous small pulses, reaching up to hundreds of resistance states, each with rather small variation. However, this operation involves hundreds of pulses and is therefore complicated and time-consuming, which is why analog RRAM cannot be used to represent all the weights in a network.

Modeling & Calculating Energy and Delay

3D RRAM tool for analyzing energy and delay: panel (a) shows results on the training data, while panel (b) shows the prediction performance. Parameters of the 3D RRAM array serve as high-dimensional inputs to train this tool, which combines random forest and SVM algorithms and gives decent estimates without working through the physical details of every RRAM cell in the huge array. The energy used for weight updates and inference in one specific network is about 105 mJ, and the time delay of the transfer learning system is dominated by the write operations; a single RRAM write takes on the order of nanoseconds, so the total time delay is around 70 ms.

Transfer learning implementation: Scheme I

Scheme I uses an analog RRAM array as the last layer, with all other layers simulated by multilevel RRAM and kept fixed. Given the analog characteristics of RRAM, there is a trade-off between the range and the precision of the weights. If the precision is insufficient, many values in ΔW cannot be written to the RRAM array, which slows down training. If the weight range is too small, the learning capability of the network tends to be insufficient and the accuracy stops improving after hundreds of iterations, since most of the weights have reached their extremes, as illustrated by the contrast between the two figures on the right.
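To make the Scheme I trade-off concrete, the sketch below models an analog last layer whose weights sit on a uniform conductance grid inside a symmetric range: updates smaller than about half a programming step round back to the old state (limited precision), and weights that reach the range boundary saturate. The state count, weight range, and gradient scale are illustrative assumptions; the poster does not specify them.

```python
import numpy as np

def analog_update(w, delta_w, w_range=0.5, n_states=100):
    """One weight-update step on an analog-RRAM last layer (Scheme I sketch).

    w        : current weights, assumed to lie on the device's conductance grid
    delta_w  : desired full-precision update from backpropagation
    w_range  : maximum representable |weight| (set by the conductance window)
    n_states : distinguishable analog states across [-w_range, w_range]
    """
    step = 2.0 * w_range / (n_states - 1)
    # Range limit: weights saturate at the conductance extremes, so learning
    # stalls once most weights have been pushed to +/- w_range.
    target = np.clip(w + delta_w, -w_range, w_range)
    # Precision limit: the device can only land on discrete states, so updates
    # smaller than about half a step round back to the old state and are lost.
    return np.round(target / step) * step

# Toy usage: with small late-training gradients, most updates are dropped.
rng = np.random.default_rng(1)
step0 = 2 * 0.5 / 99
w = np.round(rng.uniform(-0.5, 0.5, size=(10, 256)) / step0) * step0
delta_w = 0.001 * rng.standard_normal(w.shape)
w_new = analog_update(w, delta_w)
print("fraction of weights unchanged:", np.mean(np.isclose(w_new, w)))
```

Widening w_range makes each state coarser so more of ΔW is dropped, while narrowing it drives weights to their extremes sooner: the range-versus-precision trade-off described above.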
Acknowledgement

This work was accomplished under the guidance of Haitong Li and Professor H.-S. Philip Wong, and is supported in part by the Stanford SystemX Alliance and the UGVR Program.

Reference

[1] Wong, H.-S. Philip, et al. "Metal–oxide RRAM." Proceedings of the IEEE (2012).
[2] Li, Haitong, et al. "A SPICE model of resistive random access memory for large-scale memory array simulation." IEEE Electron Device Letters 35.2 (2014).
[3] Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint (2015).
[4] Ambrogio, S., Balatti, S., Milo, V., et al. "Neuromorphic learning and recognition with one-transistor-one-resistor synapses and bistable metal oxide RRAM." IEEE Transactions on Electron Devices 63.4 (2016).
[5] Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. "BinaryConnect: Training deep neural networks with binary weights during propagations." Advances in Neural Information Processing Systems (2015).
[6] Li, Haitong, et al. "Four-layer 3D vertical RRAM integrated with FinFET as a versatile computing unit for brain-inspired cognitive information processing." 2016 IEEE Symposium on VLSI Technology.