Building a Deep Learning Skyscraper from Scratch, Part 3: Convolutional Neural Network Fundamentals (5-9)
(1) I challenged myself to hand-write the code to verify the theory, which brought insights and reflections that AI tools cannot provide; for the questions I cannot answer myself, please advise in the comment section.
(2) This series involves many details that deserve clarification, but considering readers' capacity and article length, only the key points are shared; if anything is unclear or explained incorrectly, please point it out in the comments.
(3) The notes were originally written in English and are not translated into Chinese here; I hope readers will understand.
(4) This series is based on the textbook Dive into Deep Learning (《动手学深度学习》) by Mu Li et al.:
《动手学深度学习》 — 动手学深度学习 2.0.0 documentation
(5) Since the amount of code is large, it is uploaded to my personal space as a free resource so that readers can run and use it.
Note: AlexNet is implemented in both Pytorch and MxNet, while LeNet is implemented only with the MxNet framework.
The parameters of the models trained by the original authors can also be downloaded directly through the deep learning frameworks, but since these experiments aim to explore the theoretical foundations and implementation ideas of CNNs, different versions of "LeNet" and "AlexNet" are trained from scratch.
Different code implementation schemes and analysis approaches are also proposed. For models that are costly to train, the free compute platform provided by Google Colaboratory is recommended; it is essentially an Ubuntu-based server preconfigured with deep learning frameworks such as Pytorch and Tensorflow.
The previous article mainly analyzed:
【1】the working principles of the convolutional, pooling, batch-normalization, activation and dropout layers in a CNN;
(Previous article: 从零开始搭建深度学习大厦系列-2.卷积神经网络基础(5-9), https://blog.csdn.net/2302_80464577/article/details/149260533)
This article mainly analyzes:
【2】the composition of training time under single-CPU-core training, its experimental verification, and the acceleration effect of framework interfaces;
【3】how to tune hyperparameters such as the learning rate, optimization method, batch size and activation function;
【4】the performance of a convolutional neural network (LeNet, 1998) and a deep convolutional neural network (AlexNet, 2012) on the MNIST, Fashion_MNIST and CIFAR100 datasets, together with a possibly feasible method for adaptively adjusting the parameter size;
【5】visualization of CNN activation-layer features, an intuitive comparison with the filtering effect of hand-designed kernels, and an understanding of the information-extraction process of a CNN;
【6】the role of the confusion matrix and how to plot a customized one.
Contents
Environment Setting
Experiment Goals
1. Edge Detection
1.1 Basic Principle
1.2 Function Design
1.3 Carrying-out Result
2. Shape of layers and kernels in a CNN
2.1 Basic Theories
2.2 Code implementation (numpy, mxnet.gluon.nn, mxnet.nd)
2.3 Result
3. 1x1 Convolution
3.1 Basic Theory
3.2 Code implementation (3 lines)
3.3 Result
4-5 CNN Architecture Implementation and Evaluation
About Data loaders
About num_workers and prefetching processes
4. LeNet Implementation (MxNet based)
4.1 Basic Theories
4.2 Code Implementation
4.3 Model Evaluation on Fashion-MNIST dataset
4.3.1 Pooling: Maximum-pooling VS Average-pooling
4.3.2 Optimization: sgd vs sgd+momentum(nag)
4.3.3 Activation Function: ReLU vs sigmoid
4.3.4 Normalization Layer: Batch Normalization VS None
4.3.5 Batch size: 64 vs 128
4.3.6 Textbook Result (Batch Normalization) and Running Snapshot
4.4 LeNet Evaluation on MNIST dataset
4.5 Evaluating LeNet on CIFAR100
4.5.1 Coarse Classification (20 classes)
4.5.2 Fine Classification (100 classes)
4.5.3 Running Snapshot
5. AlexNet Architecture
5.1 Code Implementation
5.2 Fashion_MNIST Dataset (Mxnet vs Pytorch)
5.3 MNIST Dataset (Pytorch only)
5.4 CIFAR100 (100 classes, fine labels) - Pytorch Only
5.4.1 Learning rate setting
6. CNN activation layer characteristics visualization
6.1 MNIST Dataset
6.2 Fashion_MNIST Dataset
7. Confusion Matrix
7.1 MNIST
7.2 Fashion_MNIST
References
Environment Setting
All four experiments are carried out in a virtual environment based on the Python 3.7.0 interpreter. The main packages used are the deep learning package mxnet 1.7.0.post2 (CPU version), the visualization package matplotlib.pyplot, the image processing package opencv-python, and the array manipulation package numpy.
Experiment Goals
- Design appropriate kernels with fixed parameters and detect edges with horizontal, vertical and diagonal orientations separately;
- Derive the shape transformation formula for the forward propagation of a CNN (Convolutional Neural Network) and verify the result both by fundamental coding and by calling library scripts (a small sketch follows this list);
- Understand the effect and principle of 1x1 kernels, and then explore different implementations of 1x1 convolution in the 2-dimensional plane, such as cross-correlation calculation and matrix multiplication;
- Construct LeNet [2] by hand using mxnet.gluon.nn and explore how different hyperparameter settings impact the training result and model performance;
- Construct AlexNet [3] by hand using torch.nn and explore how different hyperparameter settings impact the training result and model performance.
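As a quick illustration of the second goal: the output size of a convolution along one dimension is floor((n + 2p - k) / s) + 1 for input size n, kernel size k, per-side padding p and stride s. The following minimal sketch (my own, assuming MXNet Gluon is installed; the helper name conv_out_size is hypothetical) checks the formula against an actual layer:

```python
from mxnet import nd
from mxnet.gluon import nn

def conv_out_size(n, k, p, s):
    """Output size along one dimension: floor((n + 2*p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

conv = nn.Conv2D(channels=6, kernel_size=5, padding=2, strides=1)
conv.initialize()
X = nd.random.uniform(shape=(1, 1, 28, 28))  # NCHW
print(conv(X).shape)                         # (1, 6, 28, 28)
print(conv_out_size(28, k=5, p=2, s=1))      # 28
```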
4-5 CNN Architecture Implementation and Evaluation
AlexNet (2012) is much more complex than LeNet (1998): it was designed for the 1000-class classification challenge published on ImageNet, whereas LeNet initially aimed to classify the Arabic digits 0-9.
Considering training cost, overfitting and underfitting problems, and dataset differences, the parameter size of AlexNet is shrunk while its basic structure (the number of layers and their types) is kept on the MNIST, Fashion_MNIST and CIFAR100 datasets. Specifically, the numbers of input and output channels of the convolutional and linear layers are lowered to a certain ratio of the original setting. The resulting parameter size is roughly 1/251 of the original version for 10 classes, 1/109 for 20 classes, and 1/16 for 100 classes.
At the same time, the parameter size of LeNet is expanded as needed for the 20-class and 100-class classification problems, with expansion ratios of approximately 2 and 10.
What's more, the (initial) learning rate of AlexNet is set much smaller than that of LeNet (e.g., 0.01/0.03 vs 0.9 in the textbook). From my own implementation, it is surprising to find that although AlexNet finally performs a bit worse than LeNet, it reaches a much higher accuracy on the testing dataset and a higher accuracy on the training dataset within the first several epochs (see Figure 1).
About Data loaders
Data loaders are elementary tools for training. Generally speaking, it is essential to set shuffle=True for the training dataset; note, however, that the shuffling operation happens before the batches of an epoch are drawn, not on the fly while training proceeds. This is verified by code.
Figure 14 X2 records last epoch last batch information of X
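One simple way to check when shuffling happens (my own sketch, not the exact verification code behind the figure above) is to wrap the sample indices in a dataset and look at the first batch seen at the start of consecutive epochs:

```python
from mxnet import nd
from mxnet.gluon.data import ArrayDataset, DataLoader

dataset = ArrayDataset(nd.arange(10))   # each "sample" is simply its own index
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for epoch in range(2):
    first_batch = next(iter(loader))    # creating the iterator is when the shuffle is applied
    print('epoch', epoch, 'first batch:', first_batch.asnumpy())
```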
This explanation may resolve a question that puzzled me for a long time: the total training and testing time of the CNN accounts for only about 1/10 to 1/6 (12-20 s out of 120 s) of one actual epoch. The duration of the theoretical epoch is measured with time.time(), while the actual epoch time is observed at the start of every evaluation in every epoch (running on the mxnet.cpu() context).
What is the reason behind this? Experiments suggest that most of the elapsed time is actually spent fetching the shuffled (probably index-based first) batches of data into CPU memory rather than executing the FP and BP process; the real FP and BP account for only a small fraction of the time. Another interesting passage explains what time.clock() (time.process_time()) and time.time() record [4]: time.process_time() counts only the CPU time of the current process, so time spent sleeping or waiting on data-loading worker processes is excluded, whereas time.time() records wall-clock time, which is why the two measurements diverge so strongly.
Figure 15 Actual epoch vs theoretical epoch, CIFAR100 (20 classes), time.time()
Figure 16 Actual epoch vs theoretical epoch, MNIST (10 classes), time.process_time()
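A tiny self-contained illustration (my own, unrelated to the figures above) of what the two timers measure: waiting, which stands in for data loading, shows up in the wall-clock measurement but not in the process CPU time.

```python
import time

wall0, cpu0 = time.time(), time.process_time()
time.sleep(1.0)                        # stands in for waiting on data loading / I/O
_ = sum(i * i for i in range(10**6))   # stands in for actual FP/BP computation
wall1, cpu1 = time.time(), time.process_time()

print('wall-clock time: %.2f s' % (wall1 - wall0))  # roughly 1.1 s
print('process CPU time: %.2f s' % (cpu1 - cpu0))   # roughly 0.1 s, sleep not counted
```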
About num_workers and prefetching processes
These parameters are intended to smooth the bootstrapping of training and testing, since it takes time to load batches of data into CPU memory (notice: not main memory); the idea is similar to pipelining in RISC processor design.
Actual test: they increase memory usage but bring little acceleration of training or testing; see 4.5.3.
Figure 17 parameter annotations of data.Dataloader
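A rough way to compare worker settings is to time a pure data-fetching pass with no training involved (a sketch; it assumes Fashion-MNIST is already downloaded, and the numbers will vary from machine to machine):

```python
import time
from mxnet.gluon.data import DataLoader
from mxnet.gluon.data.vision import FashionMNIST, transforms

dataset = FashionMNIST(train=True).transform_first(transforms.ToTensor())

for workers in (0, 4):
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=workers)
    start = time.time()
    for X, y in loader:   # iterate once, fetching batches only
        pass
    print('num_workers=%d: %.1f s' % (workers, time.time() - start))
```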
4. LeNet Implementation (MxNet based)
4.1 Basic Theories
Figure 18 LeNet Architecture(1998)
Practically speaking, when altering the activation functions in LeNet, only those in the convolutional layers can be changed: the default sigmoid may cause a gradient-vanishing issue in the convolutional layers, but it is more robust in the MLP section of LeNet.
Figure 19 altering all activation functions to ReLU in LeNet causes problem
4.2 Code Implementation
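The full scripts are uploaded as a free resource, so only a minimal Gluon sketch of the network definition is shown here. The option names are mine and the uploaded code may differ in detail; note that, following 4.1, the activation option only affects the convolutional blocks while the MLP part keeps sigmoid.

```python
from mxnet.gluon import nn

def lenet(activation='sigmoid', pool='avg', batch_norm=False):
    """LeNet-style network; the options mirror the comparisons in 4.3."""
    Pool = nn.MaxPool2D if pool == 'max' else nn.AvgPool2D
    net = nn.Sequential()
    for channels in (6, 16):                       # two convolutional blocks
        net.add(nn.Conv2D(channels, kernel_size=5))
        if batch_norm:
            net.add(nn.BatchNorm())
        net.add(nn.Activation(activation))
        net.add(Pool(pool_size=2, strides=2))
    net.add(nn.Dense(120, activation='sigmoid'),   # MLP section keeps sigmoid
            nn.Dense(84, activation='sigmoid'),
            nn.Dense(10))
    return net

net = lenet(activation='relu', pool='max', batch_norm=True)
net.initialize()
```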
4.3 Model Evaluation on Fashion-MNIST dataset
4.3.1 Pooling: Maximum-pooling VS Average-pooling
Average-pooling: better stability and less sensitivity to noise.
Figure 20 Mxnet-LeNet Architecture, (32,0.9,sgd,avg_pooling,sigmoid)
Maximum-pooling: better at capturing sharp details and edge information, but more sensitive to noise.
Figure 21 Mxnet-LeNet Architecture, (32,0.9,sgd,max_pooling,sigmoid)
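A tiny example of my own showing the difference on a single 2x2 window containing one noisy spike: the spike dominates the maximum but is diluted by the average.

```python
from mxnet import nd

patch = nd.array([[0, 0], [0, 9]]).reshape(1, 1, 2, 2)    # NCHW window with a spike
print(nd.Pooling(patch, kernel=(2, 2), pool_type='max'))  # -> 9.0
print(nd.Pooling(patch, kernel=(2, 2), pool_type='avg'))  # -> 2.25
```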
4.3.2 Optimization: sgd vs sgd+momentum(nag)
Nag:
Figure 22 Mxnet-LeNet Architecture, (128,0.9,nag,max_pooling,sigmoid)
SGD: See 4.3.1.
NAG optimization achieves higher accuracy in the first several epochs but slightly lower accuracy at the end.
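For reference, the two settings compared here can be created in Gluon roughly as follows (a sketch; the stand-in network is mine, 0.9 is the learning rate used above, and the runs above use the optimizer's default momentum):

```python
from mxnet import gluon
from mxnet.gluon import nn

net = nn.Dense(10)   # stand-in block; the LeNet from 4.2 would be used in practice
net.initialize()

trainer_sgd = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.9})
trainer_nag = gluon.Trainer(net.collect_params(), 'nag', {'learning_rate': 0.9})
# an explicit momentum could also be passed, e.g. {'learning_rate': 0.9, 'momentum': 0.9}
```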
4.3.3 Activation Function: ReLU vs sigmoid
ReLU:
Figure 23 Mxnet-LeNet Architecture, (128,0.9,nag,max_pooling,relu)
Sigmoid: See 4.3.2.
It is important to notice how pixel values behave inside a CNN. There are mainly two situations:
- they are simply normalized to the range [0, 1];
- channel (instance) normalization is applied as a data preprocessing step, so the values generally lie in a small range (-M, M).
However, the kernel parameters are small values that can be negative or positive, so in either situation the feature maps within a convolutional layer may contain values of both signs. After passing through an activation function such as ReLU or sigmoid the values become all positive, yet they may easily turn negative again once they enter the next convolutional layer. That is why characteristic visualization should be done after the activation functions.
4.3.4 Normalization Layer: Batch Normalization VS None
Batch Normalization:
Figure 24 Mxnet-LeNet Architecture, (128,0.9,nag,max_pooling,relu,batch_normalization)
No normalization: See 4.3.3.
Batch normalization leads to a larger gap between the training and testing curves, which is harder to interpret and may indicate poorer generalization ability.
Figure 25 Explanation for batch normalization in textbook (2023)
4.3.5 Batch size: 64 vs 128
Figure 26 MxNet-LeNet Architecture, (64,0.9, nag (default rho), max_pooling, ReLU, batch_normalization)
Batch size==128: See 4.3.4.
Actually, training time has little to do with the number of prefetching processes in the data loader setting; it is determined by the sample capacity and the batch size: the smaller the batch size, the less time one training iteration takes. (Epoch <-> sample space; iteration <-> one mini-batch.)
This is because the batch is the basic unit of parameter updates: for the samples within a batch, forward and backward propagation are executed to obtain the gradient. A larger batch means a bigger matrix input, but we only need the final result as an average over the individual computations. For this reason, multiplying the batch size by k increases the computation and storage of forward propagation by a factor of k, and likewise brings k times as many additions when summing the gradients over the samples.
To conclude, per-iteration training time grows roughly linearly with the batch size, but the time per epoch remains almost constant for a fixed sample space because the number of iterations shrinks accordingly, whereas the required (hardware) storage depends directly on the batch size. This is significantly different from a transformer, whose complexity grows quadratically with the input sequence length.
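In rough terms (my own back-of-the-envelope model, ignoring constant overheads): with N training samples and batch size B, the cost per iteration is proportional to B and the number of iterations per epoch is N/B, so the cost per epoch is proportional to (N/B) x B = N, independent of B; only the memory needed per iteration grows with B.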
With smaller batch sizes, the training process is rather unstable and noisy, and the loss value may easily explode.
Figure 27 AlexNet training on Fashion_MNIST on Colab, batch size=8, initial learning rate=0.03
4.3.6 Textbook Result (Batch Normalization) and Running Snapshot
Figure 28 Configuration my computer can handle: batch size=64, prefetching processes=4, input image size=1x28x28, parameter size=1x LeNet, i.e., (64, 4, 1x28x28, 1)
4.4 LeNet Evaluation on MNIST dataset
Figure 29 Mxnet-LeNet Architecture (64,0.9(cosine decay), nag(default rho), max_pooling, ReLU, batch_normalization) on MNIST
MNIST dataset is simpler and has stronger characteristics, thus LeNet reaches an accuracy of 99.6% on training dataset and 99.0% on testing dataset.
To make a small comparison, an MLP was trained and evaluated on the MNIST dataset: training accuracy is 97.9% and testing accuracy is 97.3% after 10 epochs of training with batch size 256, initial learning rate 0.5 (cosine-type learning rate decay) and SGD+momentum optimization.
Figure 30 2 layers MLP performance on MNIST
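For reference, a two-layer MLP of the kind used in this comparison can be defined in a few lines of Gluon (a sketch of mine; the hidden width of 256 and the momentum value are assumptions, not taken from the uploaded code):

```python
from mxnet import gluon
from mxnet.gluon import nn

mlp = nn.Sequential()
mlp.add(nn.Dense(256, activation='relu'),   # hidden layer (width assumed)
        nn.Dense(10))                       # output layer for the 10 digit classes
mlp.initialize()
trainer = gluon.Trainer(mlp.collect_params(), 'sgd',
                        {'learning_rate': 0.5, 'momentum': 0.9})
```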
4.5 Evaluating LeNet on CIFAR100
4.5.1 Coarse Classification (20 classes)
Use coarse labels in this case.
Figure 31 CIFAR100 color images demonstration (coarse label)
Figure 32 Mxnet-LeNet Architecture (32,0.9,nag(default rho), max_pooling, ReLU, batch_normalization)
4.5.2 Fine Classification (100 classes)
Figure 33 CIFAR100 (100 classes with fine labels)
[5 Prefetching processes]
Obvious overfitting problem under the following set of hyperparameters.
Figure 34 MxNet-LeNet on CIFAR100(100 classes), (64, constant lr=0.9, max_pooling,nag(default rho), relu, batch_normalization, nearly 10 times parameter size)
4.5.3 Running Snapshot
13 workers in parallel vs 1 worker for batch prefetching (during network training and also testing evaluation): no obvious benefit in running time, but noticeably larger memory usage.
5. AlexNet Architecture
5.1 Code Implementation
Mxnet Implementation
Pytorch Implementation (CPU/GPU supported)
It is built on the following basic tool functions:
- AlexNet_torch() -> interface API
- train_ch() -> model training API, plus saving of the model structure and parameters
- train_epoch_ch() -> actual training execution per epoch
- evaluate_accuracy() -> model evaluation at the end of every training epoch
- (Optional) softmaxcrossentropy_loss() -> an alternative to torch.nn.CrossEntropyLoss()
(1) AlexNet_torch()
(2) train_ch()
(3) train_epoch_ch()
Inputs are in NCHW form; C=3 for color image input, such as the CIFAR10 and CIFAR100 datasets.
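Since the full code is distributed as a resource, only a sketch of what a shrunk AlexNet_torch() constructor might look like is reproduced here (my own sketch; the channel-scaling argument ratio and the exact counts are illustrative assumptions, not the uploaded implementation):

```python
import torch
from torch import nn

def AlexNet_torch(num_classes=10, ratio=4, in_channels=1):
    """AlexNet with every channel/unit count divided by `ratio`; structure unchanged."""
    r = ratio
    return nn.Sequential(
        nn.Conv2d(in_channels, 96 // r, kernel_size=11, stride=4, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(96 // r, 256 // r, kernel_size=5, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(256 // r, 384 // r, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384 // r, 384 // r, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384 // r, 256 // r, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Flatten(),
        nn.Linear((256 // r) * 5 * 5, 4096 // r), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096 // r, 4096 // r), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096 // r, num_classes))

net = AlexNet_torch(num_classes=10, ratio=4, in_channels=1)
X = torch.randn(1, 1, 224, 224)   # NCHW, as noted above
print(net(X).shape)               # torch.Size([1, 10])
```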
The built-in SGD optimizer does not require computing the mean loss first; one can back-propagate directly. With a hand-coded version, whether to average depends on the form of the loss.
(4) evaluate_accuracy()
(5) softmaxcrossentropy_loss()
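A hand-written replacement along these lines might look as follows (my sketch, not the uploaded code; note that it averages over the batch, matching torch.nn.CrossEntropyLoss with its default mean reduction, whereas the losses graphed in 5.3-5.4 are not divided by the batch size):

```python
import torch

def softmaxcrossentropy_loss(logits, labels):
    """Cross-entropy from raw logits: log-softmax, pick the true-class term, average."""
    log_probs = torch.log_softmax(logits, dim=1)               # shape (N, C)
    picked = log_probs[torch.arange(labels.shape[0]), labels]  # shape (N,)
    return -picked.mean()

# quick check against the built-in loss
logits = torch.randn(4, 10)
labels = torch.tensor([1, 0, 3, 9])
print(softmaxcrossentropy_loss(logits, labels))
print(torch.nn.CrossEntropyLoss()(logits, labels))             # should match
```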
5.2 Fashion_MNIST Dataset (Mxnet vs Pytorch)
Figure 35 Training AlexNet on Fashion_MNIST on Google Colaboratory on CPU with the Pytorch DL framework, (64,2,1x224x224,1/256)
Figure 36 CPU + local device + Pytorch, (32,4,1x224x224,1/256)
Figure 37 Mxnet does not manage memory well and thus takes much more time to train than Pytorch
Training based on the Mxnet DL framework is far slower and less efficient when the time for memory interchange is taken into account; for example, with 4 prefetching processes it takes 7-8 minutes to go through one epoch with batch size 64, whereas the pure training time of one epoch is only 30-odd seconds (measured by time.time()).
5.3 MNIST Dataset (Pytorch only)
Notice: the loss function (as graphed) is not divided by the batch size in this part.
Figure 38 Colab + Pytorch + CPU, (64,2,1x224x224,1/256)
5.4 CIFAR100 (100 classes, fine labels) - Pytorch Only
5.4.1 Learning rate setting
Notice: the loss function (as graphed) is not divided by the batch size in this part.
Case1: Initial lr=0.01, constant, optimization: SGD
Figure 39 Training AlexNet on CIFAR100 on Colaboratory on T4 GPU, (32,1,3x32x32,0.251)
Case2: Initial lr=0.03, cosine type decay, optimization: SGD
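One possible way to wire up such a cosine-type decay in PyTorch (a sketch; the stand-in model, the T_max value and the loop structure are placeholders of mine, not the original training script):

```python
import torch
from torch import nn, optim

model = nn.Linear(3 * 32 * 32, 100)      # stand-in model for illustration
optimizer = optim.SGD(model.parameters(), lr=0.03)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

for epoch in range(30):
    # ... one epoch of training steps using `optimizer` ...
    scheduler.step()                      # decay the learning rate once per epoch
```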
The final accuracy of Case 2 on the training dataset is about 5% higher than that of Case 1, and its accuracy on the testing dataset is about 4% higher than that of Case 1. To make a simile: give this shrunk version of AlexNet (a student) 3 questions, and it can only answer one correctly.
But don't be too disappointed in CNNs or deep learning, since training depends greatly on hardware, and this is just a glimpse of the CV skyscraper.
Interestingly, the testing performance has outperformed the training performance for AlexNet on every dataset tried so far.
6. CNN activation layer characteristics visualization
Take the 10-epoch training of LeNet on MNIST and Fashion_MNIST as examples; only the first channel of the feature map produced from one image is visualized.
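One simple way to grab such post-activation feature maps from a Gluon Sequential network is to forward the input through its first k layers and plot the first channel (a sketch of mine, assuming the LeNet-style net from 4.2; k indexes the layer after which to visualize):

```python
import matplotlib.pyplot as plt

def show_first_channel(net, X, k):
    """Forward X through layers 0..k-1 of a Gluon Sequential and plot channel 0."""
    feat = X
    for i in range(k):
        feat = net[i](feat)
    plt.imshow(feat[0, 0].asnumpy(), cmap='gray')
    plt.title('output of layer %d, channel 0, shape %s' % (k - 1, feat.shape[2:]))
    plt.show()

# e.g. show_first_channel(net, X, k=3)  # after the first activation layer
```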
6.1 MNIST Dataset
Figure 40 LeNet: 28x28(image)->24x24(relu0)->8x8(relu1) Epoch1
epoch 1, loss 0.1266, train acc 0.964, test acc 0.985 duration 6.3 s
Figure 41 LeNet: 28x28(image)->24x24(relu0)->8x8(relu1) Epoch2
epoch 2, loss 0.0538, train acc 0.984, test acc 0.986 duration 6.1 s
Figure 42 LeNet: 28x28(image)->24x24(relu0)->8x8(relu1) Epoch3
epoch 3, loss 0.0409, train acc 0.987, test acc 0.987 duration 6.6 s
Figure 43 LeNet: 28x28(image)->24x24(relu0)->8x8(relu1) Epoch4
epoch 4, loss 0.0330, train acc 0.990, test acc 0.985 duration 6.6 s
Figure 44 LeNet: 28x28(image)->24x24(relu0)->8x8(relu1) Epoch5
epoch 5, loss 0.0289, train acc 0.991, test acc 0.988 duration 7.5 s
Figure 45 LeNet: 28x28(image)->24x24(relu0)->8x8(relu1) Epoch6
epoch 6, loss 0.0253, train acc 0.992, test acc 0.988 duration 7.7 s
Figure 46 LeNet: 28x28(image)->24x24(relu0)->8x8(relu1) Epoch7
epoch 7, loss 0.0219, train acc 0.993, test acc 0.989 duration 6.4 s
Figure 47 LeNet: 28x28(image)->24x24(relu0)->8x8(relu1) Epoch8
epoch 8, loss 0.0202, train acc 0.993, test acc 0.990 duration 7.7 s
Figure 48 LeNet: 28x28(image)->24x24(relu0)->8x8(relu1) Epoch9
epoch 9, loss 0.0176, train acc 0.994, test acc 0.988 duration 6.5 s
6.2 Fashion_MNIST Dataset
Figure 49 LeNet, fashion_mnist, Epoch 1, sofa
Figure 50 LeNet, fashion_mnist, Epoch 1, sneaker
epoch 1, loss 0.5140, train acc 0.813, test acc 0.841 duration 3.1 s
Figure 51 LeNet, fashion_mnist, Epoch 10, shirt
epoch 10, loss 0.2270, train acc 0.915, test acc 0.905 duration 5.3 s
7. Confusion Matrix
The dimension along the height stands for the true class index, while the dimension along the width stands for the predicted class index.
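A minimal way to accumulate and draw such a matrix with numpy and matplotlib (my own sketch; the uploaded plotting code may differ in styling):

```python
import numpy as np
import matplotlib.pyplot as plt

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes, as described above."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# toy labels for illustration only
y_true = np.array([0, 6, 6, 4, 2, 2])
y_pred = np.array([6, 6, 0, 2, 2, 2])
cm = confusion_matrix(y_true, y_pred, num_classes=10)

plt.imshow(cm, cmap='Blues')
plt.xlabel('predicted class')
plt.ylabel('true class')
plt.colorbar()
plt.show()
```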
7.1 MNIST
Figure 52 Epoch 1 of LeNet on MNIST
7.2 Fashion_MNIST
Figure 53 Epoch 2/4/6/8/10 of LeNet on Fashion_MNIST
testing accuracy: (85.4%,88.0%,89.5%,89.4%,90.5%)
From Figure 53 we can infer that LeNet judges some samples of class 0 (T-shirt/top) to be class 6 (shirt) and some samples of class 4 (coat) to be class 2 (pullover). This helps us gain insight into the data characteristics and take targeted actions, such as additionally training LeNet on a dedicated dataset of classes 0 and 6, or 2 and 4.
This is one application of confusion matrix.
References
[1] http://vision.stanford.edu/teaching/cs231n/schedule.html
[2] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[3] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
[4] https://zhuanlan.zhihu.com/p/33450843