Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network

Hsing-Hung Chou; Ching-Te Chiu; Yi-Ping Liao

doi:10.1017/ATSIP.2021.16

Published online by Cambridge University Press: 17 November 2021

Hsing-Hung Chou*: Affiliation:
Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan
Ching-Te Chiu: Affiliation:
Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan; Institute of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Yi-Ping Liao: Affiliation:
Institute of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
*: Corresponding author: Hsing-Hung Chou Email: [email protected]

Abstract

Deep neural networks (DNNs) have solved many tasks, including image classification, object detection, and semantic segmentation. However, when a DNN model has a huge number of parameters and a high level of computation, it becomes difficult to deploy on mobile devices. To address this difficulty, we propose an efficient compression method that can be split into three parts. First, we propose a cross-layer matrix to extract more features from the teacher's model. Second, we adopt Kullback-Leibler (KL) Divergence in an offline environment to make the student model find a wider robust minimum. Finally, we propose using offline ensemble pre-trained teachers to teach a student model. To address dimension mismatch between teacher and student models, we adopt a $1\times 1$ convolution and two-stage knowledge distillation to release this constraint. We conducted experiments with VGG and ResNet models, using the CIFAR-100 dataset. With VGG-11 as the teacher's model and VGG-6 as the student's model, experimental results showed that the Top-1 accuracy increased by 3.57% with a $2.08\times$ compression rate and $3.5\times$ computation rate. With ResNet-32 as the teacher's model and ResNet-8 as the student's model, experimental results showed that Top-1 accuracy increased by 4.38% with a $6.11\times$ compression rate and $5.27\times$ computation rate. In addition, we conducted experiments using the ImageNet $64\times 64$ dataset. With MobileNet-16 as the teacher's model and MobileNet-9 as the student's model, experimental results showed that the Top-1 accuracy increased by 3.98% with a $1.59\times$ compression rate and $2.05\times$ computation rate.

Keywords

Deep convolutional model compression; Knowledge distillation; Transfer learning

Type: Original Paper
Information: APSIPA Transactions on Signal and Information Processing, Volume 10, 2021, e18

DOI: https://doi.org/10.1017/ATSIP.2021.16
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (https://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.
Copyright: Copyright © The Author(s), 2021. Published by Cambridge University Press

I. INTRODUCTION

Recently, the area of deep learning has been booming owing to the availability of high-computation GPGPUs and the ability to process massive data. State-of-the-art performance has been achieved with deep learning on many tasks, including image classification [Reference Deng, Dong, Socher, Li, Li and Fei-Fei1], object detection [Reference Everingham, Van Gool, Williams, Winn and Zisserman2], and semantic segmentation [Reference Cordts3], and on new tasks such as iterative reconstruction [Reference Choy, Xu, Gwak, Chen and Savarese4] and depth estimation [Reference Silberman, Hoiem, Kohli and Fergus5]. However, as neural networks become deeper and wider, the complexity of deep neural network (DNN) models grows rapidly. Consequently, DNN models with a large number of parameters and high levels of computation are impractical to deploy, notably on Internet of Things (IoT) devices and self-driving (autonomous) cars.

There are five approaches to achieving a compact yet accurate model: frugal architecture, pruning, matrix decomposition, quantization, and specialist knowledge distillation (KD). KD studies how to train a student deep neural network (S-DNN) by learning from a teacher deep neural network (T-DNN). In general, the S-DNN does not have the same layers as the T-DNN; thus, it is difficult to train it to an optimal point. There are two branches of KD approaches. One is the conventional approach, referred to as offline methods [Reference Yim, Joo, Bae and Kim6–Reference Park, Kim, Lu and Cho11], which train the T-DNN first and then have the S-DNN mimic the pre-trained T-DNN. Although this two-phase approach incurs considerable computation time, it yields a better-performing S-DNN. Sometimes, this S-DNN even performs better than the pre-trained T-DNN, because the T-DNN is already pre-trained and has more layers than the S-DNN, while the S-DNN has a better initial weight. The other approach is referred to as online methods [Reference Zhang, Xiang, Hospedales and Lu12–Reference Lan, Zhu and Gong14], in which both models start from scratch and are trained together. This one-phase approach is expected to train a better model than training the S-DNN alone. Compared with the offline methods, the online methods do not need to train a T-DNN first, so training takes less time. However, in this research, we focus on how to get a better-performing S-DNN model at the same compression rate. Because most of the conventional (offline) methods showed better results in experiments, we adopt the offline approach in this paper.

Once the training method is chosen, the S-DNN's performance is very sensitive to what is taught from the pre-trained model as knowledge. To address this, FSP [Reference Yim, Joo, Bae and Kim6] proposed using the correlation between the input and output feature maps of a layer module, in the form of a Gramian matrix.

In this work, our contributions are as follows:

1) We propose a cross-layer matrix to extract more knowledge, and add Kullback-Leibler (KL) Divergence and an offline ensemble to improve image classification at the same compression rate.
2) We propose $1\times 1$ convolutional layers to tune the channels of the T-DNN to be identical to those of the S-DNN, which resolves the constraints of our proposed method.
3) We propose using two-step KD to improve image classification when there is a huge difference in layers between the T-DNN and the S-DNN, to avoid the loss of KD.
4) Our method can be used not only for image classification tasks [Reference Krizhevsky, Sutskever and Hinton15, Reference Simonyan and Zisserman16] but also for other tasks, such as object detection [Reference Ren, He, Girshick and Sun17, Reference Redmon and Farhadi18], semantic segmentation [Reference Liu, Chen, Liu, Qin, Luo and Wang19], and action estimation [Reference Ji, Xu, Yang and Yu20] in videos.
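
To make the channel-matching idea in contribution 2 concrete, the following is a minimal sketch of a $1\times 1$ convolution in plain Python. It treats the layer as a per-pixel linear map over channels with no bias and no activation, which is an assumption made for illustration; the function name `conv1x1` and the nested-list layout are hypothetical and are not taken from the paper's implementation.

```python
def conv1x1(feature_map, kernel):
    """Apply a 1x1 convolution to a feature map.

    feature_map: nested list of shape h x w x c_in.
    kernel: nested list of shape c_in x c_out (assumed: no bias, no activation).
    Returns a nested list of shape h x w x c_out.
    """
    c_out = len(kernel[0])
    return [
        [
            # Each output channel is a weighted sum of the input channels
            # at the same spatial position.
            [sum(pixel[c] * kernel[c][o] for c in range(len(pixel)))
             for o in range(c_out)]
            for pixel in row
        ]
        for row in feature_map
    ]


# Map a 3-channel "teacher" feature map onto 2 channels.
teacher_features = [[[1, 2, 3]]]          # h = 1, w = 1, c_in = 3
projection = [[1, 0], [0, 1], [1, 1]]     # c_in = 3, c_out = 2
student_sized = conv1x1(teacher_features, projection)
```

Because the spatial size is unchanged and only the channel count is remapped, such a layer can bring teacher and student feature maps to the same number of channels before they are compared.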

II. RELATED WORK

The compression methods can be divided into four categories: knowledge distillation, pruning, low-rank decomposition, and quantization.

A) Knowledge distillation

Hinton et al. [Reference Hinton, Vinyals and Dean7] distilled knowledge from a very large teacher model to promote a small student model by using a softened softmax of a teacher network. The rationale is to take advantage of extra supervision provided by the teacher model during the training of the student model, beyond a conventional supervised learning objective such as the cross-entropy loss subject to the training data labels. Romero et al. [Reference Romero, Ballas, Kahou, Chassang, Gatta and Bengio8] proposed “Hint training” to train partial layers, then used the final output layers to train the student model to enhance the performance. Park et al. [Reference Park, Kim, Lu and Cho11] proposed to transfer mutual relations of data examples. Yim et al. [Reference Yim, Joo, Bae and Kim6] transferred knowledge from the T-DNN to the S-DNN as output feature maps rather than as layer parameters. They treat a group of layers in the network as a layer group and define the correlation between the input and output feature maps of the layer group as a Gram matrix, so that the feature correlations of the S-DNN and T-DNN become similar. We extend this idea by adding Gram matrices that cross more than one layer. Furthermore, the approach taken by Lee et al. [Reference Lee, Kim and Song10] uses the correlation between two feature maps as knowledge, obtained by Singular Value Decomposition (SVD). However, this approach [Reference Lee, Kim and Song10] costs significant computation time because the feature maps must be decomposed on the CPU. As a result, we take the FSP [Reference Yim, Joo, Bae and Kim6] model as the baseline on which our proposed methods build.

Moreover, while earlier distillation methods often take an offline learning strategy that needs two phases of training, the more recently proposed deep mutual learning (DML) [Reference Zhang, Xiang, Hospedales and Lu12] method overcomes this limitation by conducting online distillation in one-phase training between two peer student models. We also add KL Divergence to our proposed method, but, unlike the online methods, we use it in an offline setting. Anil et al. [Reference Anil, Pereyra, Passos, Ormandi, Dahl and Hinton13] extend [Reference Zhang, Xiang, Hospedales and Lu12] to decrease the training time of large-scale distributed neural networks. Lan et al. [Reference Lan, Zhu and Gong14] present an On-the-fly Native Ensemble (ONE) learning strategy for one-stage online distillation. However, existing online methods lack a strong “teacher” model, which limits the efficacy of knowledge discovery.

In [Reference Wang and Yoon21], Wang and Yoon provide a comprehensive survey on the recent progress of KD methods together with S-T frameworks typically used for vision tasks and systematically analyze the research status of KD in vision applications. KDGAN consisting of a classifier, a teacher, and a discriminator is proposed in [Reference Wang, Zhang, Sun and Qi22]. The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses. From the concrete distribution, continuous samples are generated to obtain low-variance gradient updates, which speed up the training. To efficiently transmit extracted useful teacher information to the student DNN, Bae et al. propose to perform bottom-up step-by-step transfer of densely distilled knowledge [Reference Bae, Yeo, Yim, Kim, Pyo and Kim23].

B) Deep neural network compression and efficient processing

Scientists found that network pruning can be used not only to reduce network complexity but also to prevent over-fitting. An early pruning method [Reference Hanson and Pratt24] was biased weight decay. Han et al. [Reference Han, Pool, Tran and Dally25] first proposed that it is safe to remove neurons with zero input or output connections from the neural network. By using L1/L2 regularization, some of the weights converge to zero after training. By combining pruning, quantization, and Huffman coding [Reference Han, Mao and Dally26], the compression of AlexNet can reach 35$\times$. CLIP-Q [Reference Tung and Mori27] flexibly makes weight pruning choices that adapt during training to compress the DNN. Apart from weight pruning, there is another pruning approach, named channel pruning, that assesses neuron importance. Li et al. [Reference Li, Kadav, Durdanovic, Samet and Graf28] computed the importance of each filter by calculating its absolute weight sum. Furthermore, Hsiao et al. [Reference Hsiao, Chang, Chou and Chiu29] measured the significance of each filter by calculating the largest singular value.

In [Reference Deng, Li, Han, Shi and Xie30], Deng et al. provide a comprehensive survey of the mainstream compression approaches, such as compact models, tensor decomposition, data quantization, and network sparsification, which compress DNNs without compromising accuracy. In [Reference Sze, Chen, Yang and Emer31], Sze et al. present a tutorial and survey on the key design considerations for DNNs and on evaluating different DNN hardware implementations with benchmarks in order to achieve processing efficiency.

C) Low-rank decomposition

Low-rank approximation [Reference Denil, Shakibi, Dinh and De Freitas32–Reference Zhang, Zou, He and Sun35] approaches have been widely studied. However, low-rank approximation is inconvenient because each decomposition of feature maps is computationally expensive. Moreover, the methods of low-rank approximation only consider a few layers; therefore, they cannot consider the compression of the whole network.

D) Quantization

Quantization is a method for reducing the number of bits used for the weights and biases of each layer. Methods can be divided into those that use auxiliary data [Reference Basu and Varshney36, Reference Gupta, Agrawal, Gopalakrishnan and Narayanan37] and those that do not [Reference Dettmers38–Reference Ji, Ovsiannikov, Wang, Shi and Zhang40]. Additionally, there are two research approaches to bit-level compression. One is to use a fixed-point implementation, and the other is to use common quantization methods, for example, K-means and scalar quantization. For fixed-point implementation, Hwang and Sung [Reference Hwang and Sung39] proposed a design with ternary weights, 3-bit signals, and an optimization process performed by back-propagation-based re-training. For the other approach, Ji et al. [Reference Ji, Ovsiannikov, Wang, Shi and Zhang40] designed a supervised iterative quantization to reduce the bit resolution of the weights. They applied K-means-based adaptive quantization methods, such as vector quantization using K-means, product quantization, and residual quantization, and discussed the efficacy of their design for compressing deep convolutional networks.

E) Deep learning

Deep learning framework architectures, hardware acceleration, and DNNs over the cloud, fog, edge, and end devices are elaborated in a special issue [Reference Kang41]. In addition, methods and applications, with particular emphasis on recent advances in perceptual applications, are addressed and discussed [Reference Kang41]. In [Reference Wang, Peng and Ko42], Wang et al. propose learning a proper prior from data for adversarial autoencoders. The notion of code generators is presented to transform manually selected simple priors into ones that can better characterize the data distribution.

III. PROPOSED ARCHITECTURE

The core idea of KD is to define the vital information and then transfer that knowledge from the T-DNN to the S-DNN. Accordingly, we divide our approach into four parts. Section A describes the knowledge that the T-DNN transfers, its mathematical expression, and the definition of the loss term $L_{KD}$. In Section B, we add another loss function, $L_{KL}$, between the predictions of the T-DNN and the S-DNN. In Section C, we use offline ensemble pre-trained T-DNNs to teach one student. The overall loss function is discussed in Section D. Finally, we will discuss the constraints of our proposed method and their solutions. The overall architecture of our three proposed compression methods is shown in Fig. 1.

Fig. 1. Overall architecture of our proposed methods. There are three parts of our architecture. First, we propose a cross-layer matrix to extract more features, adopting the Gramian matrix proposed by FSP [Reference Yim, Joo, Bae and Kim6] (orange part). Second, we adopt KL Divergence in the offline environment to make the S-DNN find a wider robust minimum (brown part). Finally, we propose the use of offline ensemble pre-trained T-DNNs to teach an S-DNN by using the stochastic mean (red part).

A) Cross-layer matrix

1) Proposed distilled knowledge

Yim et al. [Reference Yim, Joo, Bae and Kim6] proposed “FSP”, which uses a Gramian matrix to mimic the generated features of the T-DNN; this can be a hard constraint for the S-DNN. Based on [Reference Yim, Joo, Bae and Kim6], we generate more Gramian matrices by crossing more than one layer. The number of cross-layer matrices we add to the loss function depends on how many layer modules are in the DNN model. We believe that adding more Gramian matrices to the loss function helps the S-DNN achieve better performance.

We use the Gramian matrix created from feature maps because we believe that, instead of teaching the student the right answer, it is better to teach the student the solution procedure. Imagine a classroom in which a teacher is helping a student with a math question. It is better to first teach how to use a formula and work through the solution procedure than to provide the correct answer directly.

2) Mathematical expression of the knowledge distillation

Based on FSP [Reference Yim, Joo, Bae and Kim6], the Gramian matrix can be defined by two output feature maps. We propose the Gramian matrix as the knowledge to transfer. The Gramian matrix $G \in \mathbb {R}^{m \times n}$ is generated by the features from two layers. One output feature map is defined as $F^{1} \in \mathbb {R}^{h \times w \times m }$, where $h$ and $w$ represent the height and width of the output feature maps and $m$ represents the number of output channels. The other output feature map is defined as $F^{2} \in \mathbb {R}^{h \times w \times n }$. Then, the Gramian matrix $G \in \mathbb {R}^{m \times n}$ is calculated by (1)

\begin{equation} G_{i,j}(x; W)=\sum_{s=1}^{h} \sum_{t=1}^{w} \frac{ F_{s,t,i}^{1}(x;W) \times F_{s,t,j}^{2}(x;W)}{h \times w}, \tag{1} \end{equation}

where $i,\,j$ represent the points of cross-one-layer results, $x$ represents the input image and $W$ are the weights of the network model. Unlike FSP [Reference Yim, Joo, Bae and Kim6], we select several points not only from cross-one module layer but also from cross-more-than-one module layer to generate more Gramian matrices as shown in (2) and (3).

\begin{align} G_{i,q}(x; W)& =\sum_{s=1}^{h} \sum_{t=1}^{w} \frac{ F_{s,t,i}^{1}(x;W) \times F_{s,t,q}^{2}(x;W)}{h \times w}, \tag{2} \end{align}

\begin{align} G_{i,r}(x; W)& =\sum_{s=1}^{h} \sum_{t=1}^{w} \frac{ F_{s,t,i}^{1}(x;W) \times F_{s,t,r}^{2}(x;W)}{h \times w}, \tag{3} \end{align}

where $i,\,q$ represent the points of cross-two-layer results as shown in Fig. 2(b) and $i,\,r$ represent the points of cross-three-layer results as shown in Fig. 2(c). Figure 2 shows three different kinds of cross-layer matrix. Our proposed method combines all of these Gramian matrices, as in Fig. 2(d), as the knowledge to transfer.

Fig. 2. (a) Cross one layer. (b) Cross two layers. (c) Cross three layers. (d) Our proposed method.
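
As an illustration of Equation (1) and of how cross-layer pairs could be enumerated, here is a minimal sketch in plain Python. It assumes that all feature maps share the same height and width, and that "cross-$k$-layer" means pairing module outputs that are $k$ positions apart; both are assumptions made for this example, since the exact pairing is defined by Fig. 2. The names `gramian` and `cross_layer_gramians` are hypothetical.

```python
def gramian(f1, f2):
    """Eq. (1): G[i][j] = sum over (s, t) of f1[s][t][i] * f2[s][t][j] / (h * w).

    f1 has shape h x w x m and f2 has shape h x w x n (nested lists),
    so the result is an m x n matrix.
    """
    h, w = len(f1), len(f1[0])
    m, n = len(f1[0][0]), len(f2[0][0])
    g = [[0.0] * n for _ in range(m)]
    for s in range(h):
        for t in range(w):
            for i in range(m):
                for j in range(n):
                    g[i][j] += f1[s][t][i] * f2[s][t][j]
    return [[value / (h * w) for value in row] for row in g]


def cross_layer_gramians(features, max_cross):
    """Gramians between feature maps that are 1..max_cross positions apart.

    features: list of feature maps with identical h and w (assumed).
    max_cross=1 yields only adjacent pairs; larger values add the
    cross-two, cross-three, ... pairs on top of them.
    """
    result = {}
    for a in range(len(features)):
        for step in range(1, max_cross + 1):
            b = a + step
            if b < len(features):
                result[(a, b)] = gramian(features[a], features[b])
    return result


# Tiny example: h = 1, w = 2, with 2 and 1 channels respectively.
f_a = [[[1.0, 2.0], [3.0, 4.0]]]
f_b = [[[1.0], [1.0]]]
g_ab = gramian(f_a, f_b)                      # 2 x 1 matrix
pairs = cross_layer_gramians([f_a, f_b, f_a], max_cross=2)
```

With three feature maps and `max_cross=2`, the sketch produces the two adjacent pairs plus one pair that skips the middle feature map, which mirrors the idea of combining several kinds of cross-layer matrix.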

3) KD loss for the Gramian matrix

As discussed previously, the T-DNN teaches the S-DNN the solution procedure by using the Gramian matrix. We assume that there are $B$ Gramian matrices $G_{b}^{T}$, $b=1,\,\ldots,\,B$, which are generated by the T-DNN, and $B$ Gramian matrices $G_{b}^{S}$, $b=1,\,\ldots,\,B$, which are generated by the S-DNN. Next, the cost of each pair of Gramian matrices is calculated by using the squared L2 norm. The cost function of knowledge distillation $L_{KD}(W_{t}; W_{s})$ is defined as (4):

\begin{equation} L_{KD}(W_{t}; W_{s}) = \frac{1}{B} \sum_{x} \sum_{b=1}^{B} \lambda_{b} \times ||G_{b}^{T}(x;W_{t})-G_{b}^{S}(x;W_{s})||_{2}^{2}, \tag{4} \end{equation}

where $\lambda _{b}$ represents the weight for each KD loss and $B$ represents the number of Gramian matrices. Because our proposed method adds more Gramian matrices by creating the cross matrices, we initially set all KD losses to the same weight. As a result, the values of $\lambda _{b}$ are identical in our experiments.
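
The KD loss in Equation (4) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the squared L2 norm of a matrix difference is taken to be the sum of squared entries, and the identical weights are set to 1.0, a value the text does not state. The names `squared_l2` and `kd_loss` are hypothetical.

```python
def squared_l2(g_t, g_s):
    """Sum of squared entry-wise differences between two matrices."""
    return sum(
        (a - b) ** 2
        for row_t, row_s in zip(g_t, g_s)
        for a, b in zip(row_t, row_s)
    )


def kd_loss(teacher_grams, student_grams, weights=None):
    """Eq. (4): (1 / B) * sum over samples and over b of weight_b * ||G_b^T - G_b^S||^2.

    teacher_grams, student_grams: one entry per input sample, each entry
    being a list of B Gramian matrices (nested lists).
    weights: B per-matrix weights; defaults to all 1.0 (assumed value).
    """
    num_matrices = len(teacher_grams[0])
    if weights is None:
        weights = [1.0] * num_matrices
    total = 0.0
    for grams_t, grams_s in zip(teacher_grams, student_grams):
        for lam, g_t, g_s in zip(weights, grams_t, grams_s):
            total += lam * squared_l2(g_t, g_s)
    return total / num_matrices


# One sample, B = 2 Gramian matrices of shape 1 x 2 each.
teacher = [[[[1.0, 0.0]], [[2.0, 2.0]]]]
student = [[[[0.0, 0.0]], [[2.0, 0.0]]]]
loss = kd_loss(teacher, student)
```

Here the two matrix pairs differ by squared distances of 1 and 4, so the loss is their sum divided by $B=2$; a student whose Gramian matrices match the teacher's exactly would give a loss of zero.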

B) KL Divergence

We propose using KL Divergence, which was used in DML [Reference Zhang, Xiang, Hospedales and Lu12], as our second-order loss function. In contrast to the online method [Reference Zhang, Xiang, Hospedales and Lu12] with two-direction learning, our offline method transfers knowledge in one direction only, from the T-DNN to the S-DNN. Given $D$ data examples $X={\{x_{n}\}_{n=1}^{D}}$ from $C$ classes, we represent the corresponding label set as $Y = \{ y_{n}\}_{n=1}^{D}$ with $y_{n} \in \{1,\,2,\,\ldots,\,C\}$. The probability of class $c$ for data example $x_{n}$ is given by a neural network $\theta _{1}$ and computed as

\begin{equation} p_{1}^{c}(x_{n})=\frac{\exp(z_{1}^{c})}{\sum_{c'=1}^{C} \exp(z_{1}^{c'})}, \tag{5} \end{equation}

where $p_{1}^{c}(x_{n})$ represents the probability that $\theta _{1}$ assigns to class $c$ and the logit $z_{1}^{c}$ is the output of the final layer of $\theta _{1}$, before the softmax normalization. As a result, the formulation of KL Divergence can be computed as

\begin{equation} L_{KL}({p_{T}||p_{S}}) = \sum_{d=1}^{D} \sum_{c=1}^{C} p_{T}^{c}(x_{d}) \log \frac{p_{T}^{c}(x_{d})}{p_{S}^{c}(x_{d})}, \tag{6} \end{equation}

where $p_{T}$ and $p_{S}$ represent the probability distributions predicted by the teacher and student models, respectively. We believe that the student model can absorb the teacher model's knowledge by making its distribution similar to the teacher's distribution.
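
The softmax of Equation (5) and the one-directional KL Divergence of Equation (6) can be sketched as below. Subtracting the maximum logit and adding a small `eps` inside the logarithm are numerical-stability choices added for this example and are not part of the paper's formulation; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Eq. (5): turn a list of logits into class probabilities."""
    peak = max(logits)                      # shift for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits, eps=1e-12):
    """Eq. (6): sum over examples and classes of p_T * log(p_T / p_S).

    teacher_logits, student_logits: one logit vector per data example.
    eps guards against log(0); it is an implementation detail assumed here.
    """
    loss = 0.0
    for z_t, z_s in zip(teacher_logits, student_logits):
        p_t, p_s = softmax(z_t), softmax(z_s)
        loss += sum(
            pt * math.log((pt + eps) / (ps + eps))
            for pt, ps in zip(p_t, p_s)
        )
    return loss


# The loss is zero when the student reproduces the teacher's distribution
# and positive otherwise.
same = kl_divergence([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])
different = kl_divergence([[0.0, 0.0]], [[0.0, 1.0986122886681098]])
```

Only the student's logits would receive gradients in an offline setting, since the teacher is pre-trained and fixed; that is what distinguishes this one-directional use from mutual learning between peers.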

C) Offline ensemble

The original FSP method [Reference Yim, Joo, Bae and Kim6] uses one T-DNN to transfer knowledge to one S-DNN. In contrast to FSP [Reference Yim, Joo, Bae and Kim6], we propose using offline ensemble pre-trained teachers to generate the stochastic mean and improve the image classification result. The cost functions of knowledge distillation and KL Divergence are defined as

\begin{align} & L_{KD}^{Ensemble}= \frac{1}{K}\sum_{k=1}^{K} L_{KD,k}, \tag{7} \end{align}

\begin{align} & L_{KL}^{Ensemble}= \frac{1}{K}\sum_{k=1}^{K} L_{KL}({p_{k}||p_{S}}), \tag{8} \end{align}

where $L_{KD}^{Ensemble}$ represents the loss function of offline ensemble knowledge distillation, $L_{KL}^{Ensemble}$ represents the loss function of offline ensemble KL Divergence, and $K$ represents the number of pre-trained teacher models ($K=3$). We believe that offline ensemble pre-trained teacher models with the same architecture but different weights will transfer knowledge to the student model through the stochastic mean.

D) Overall loss function

We have proposed $L_{KD}$, $L_{KL}$, and the stochastic mean for our method. Hence, the overall loss function $L_{total}(\theta _{1})$ for training the S-DNN is shown as (9)

\begin{align} L_{total}(\theta_{1})=L_{CE}(\theta_{1}) + \frac{1}{K} \sum_{k=1}^{K}L_{KL}(p_{k}||p_{S}) + \frac{1}{K} \sum_{k=1}^{K}L_{KD,k}, \tag{9} \end{align}

where the objective function of multi-class image classification $L_{CE}(\theta _{1})$, used to train the network $\theta _{1}$, is defined as the cross-entropy error between the predicted values and the correct labels:

\begin{equation} L_{CE}(\theta_{1}) ={-}\sum_{d=1}^{D} \sum_{c=1}^{C} I(y_{d},c) \log(p_{1}^{c}(x_{d})), \tag{10} \end{equation}

with an indicator function $I$ defined as

\begin{equation} I(y_{d},c)=\left\{ \begin{array}{@{}ll} 1, & y_{d}=c \\ 0, & y_{d}\neq c, \end{array} \right. \tag{11} \end{equation}
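
To show how the terms of Equations (7) to (11) fit together, the following minimal sketch computes the cross-entropy term and then combines it with per-teacher KL and KD losses by taking their means over the $K$ teachers. The per-teacher loss values are passed in as plain numbers, labels are 0-based class indices, and the function names are hypothetical; these are conventions chosen for this example only.

```python
import math


def cross_entropy(probabilities, labels):
    """Eq. (10): the indicator of Eq. (11) selects the true-class probability.

    probabilities: one probability vector per data example.
    labels: 0-based true class index per data example (assumed convention).
    """
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels))


def total_loss(ce, kl_per_teacher, kd_per_teacher):
    """Eq. (9): L_CE plus the means of the K per-teacher KL and KD losses."""
    k = len(kl_per_teacher)
    if k == 0 or k != len(kd_per_teacher):
        raise ValueError("need the same non-zero number of KL and KD losses")
    kl_mean = sum(kl_per_teacher) / k       # Eq. (8)
    kd_mean = sum(kd_per_teacher) / k       # Eq. (7)
    return ce + kl_mean + kd_mean


# Two examples with true classes 0 and 1, then K = 3 teachers.
ce = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
combined = total_loss(1.0, [0.3, 0.6, 0.9], [2.0, 4.0, 6.0])
```

Averaging over teachers keeps the scale of the distillation terms independent of $K$, so adding a teacher changes the target that the student matches rather than the overall weight of distillation in the total loss.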

To prevent gradient explosion when $L_{KD,k}$ is larger than $L_{CE}(\theta _{1})$, we adopt gradient clipping [Reference Pascanu, Mikolov and Bengio43] to limit the gradient of knowledge distillation $\nabla (\theta _{1})_{KD}^{clipped}$ during the training procedure, as shown in Equation (12):

\begin{align} & \nabla(\theta_{1})_{KD}^{clipped}=\left\{ \begin{array}{@{}ll} \beta \times \nabla(\theta_{1})_{KD}, & \nabla(\theta_{1})_{KD} < \nabla(\theta_{1})_{CE} \\ \nabla(\theta_{1})_{KD}, & \text{otherwise}, \end{array} \right. \tag{12} \end{align}

\begin{align} & \beta = \frac{1}{1+\exp(-\tau +p)}, \tag{13} \end{align}

\begin{align} & \tau = \frac{||\nabla(\theta_{1})_{CE}||_{2}}{||\nabla(\theta_{1})_{KD}||_{2}}, \tag{14} \end{align}

where $\beta$ is a sigmoid function of $\tau - p$. In Equation (13), $p$ denotes the current training epoch. In Equation (14), $\tau$ is the ratio of the $L_{2}$ norms of the gradients of $L_{CE}$ and $L_{KD,k}$. Hence, the rich knowledge distilled from the T-DNN can be transferred to the S-DNN without worrying about gradient explosion.
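
Equations (12) to (14) can be transcribed directly, as in the sketch below. Gradients are represented as flat lists of numbers, the comparison in Equation (12) is read as a comparison of L2 norms, and the branch condition is copied as printed in Equation (12); these are assumptions of this example rather than details confirmed by the text. The zero-norm guard is also an addition for robustness, and the function names are hypothetical.

```python
import math


def l2_norm(gradient):
    """L2 norm of a gradient given as a flat list of numbers."""
    return math.sqrt(sum(g * g for g in gradient))


def clip_kd_gradient(grad_kd, grad_ce, epoch):
    """Scale the KD gradient following Eqs. (12)-(14).

    grad_kd, grad_ce: flattened gradients of L_KD and L_CE (assumed layout).
    epoch: the current training epoch p.
    """
    norm_kd, norm_ce = l2_norm(grad_kd), l2_norm(grad_ce)
    if norm_kd == 0.0:
        return list(grad_kd)                # nothing to scale (added guard)
    tau = norm_ce / norm_kd                             # Eq. (14)
    beta = 1.0 / (1.0 + math.exp(-tau + epoch))         # Eq. (13)
    # Branch condition taken verbatim from Eq. (12), compared via L2 norms.
    if norm_kd < norm_ce:
        return [beta * g for g in grad_kd]
    return list(grad_kd)


# With ||grad_ce|| / ||grad_kd|| = 2 and epoch 2, beta is exactly 0.5.
scaled = clip_kd_gradient([3.0, 4.0], [6.0, 8.0], epoch=2)
unchanged = clip_kd_gradient([6.0, 8.0], [3.0, 4.0], epoch=2)
```

Because the epoch $p$ enters the sigmoid with a negative sign relative to $\tau$, $\beta$ shrinks as training progresses for a fixed norm ratio, so the scaled distillation gradient contributes less in later epochs.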

IV. EXPERIMENTAL RESULTS

In this section, we evaluate our proposed compression method with two datasets and three different models. The two datasets are the familiar CIFAR-100 [Reference Krizhevsky and Hinton44] and the rich collection of images, ImageNet64*64 [Reference Chrabaszcz, Loshchilov and Hutter45], as shown in Fig. 3. Two models, VGG and ResNet, are trained and tested on CIFAR-100, and one model, MobileNet, is trained and tested on ImageNet64*64.

Fig. 3. (a) CIFAR-100. (b) ImageNet64*64.

A) Environment and datasets

Our proposed method is implemented in TensorFlow [Reference Abadi46] with the Python 3.5 interface on a computer (CPU: Intel$^\circledR$ Core$^{TM}$ i7-7800X $@$ 3.5 GHz, main memory: 32 GB DRAM, GPU: NVIDIA GEFORCE $^\circledR$ GTX 1080).

The CIFAR-100 dataset consists of 60000 images with a size of $32\times 32$ in 100 classes, divided into 50000 training images and 10000 test images. We used random shift, random rotation, and horizontal flip as data augmentations. Our proposed method was tested under the same conditions as FSP [Reference Yim, Joo, Bae and Kim6], and, to increase the dependability of the testing results, we ran the experiments three times and took the average as the final experimental result. We take VGG and ResNet as the DNNs to demonstrate that our proposed method works. The T-DNN and S-DNN models are shown in Fig. 4. We picked VGG as our first model because its architecture is very simple and can be implemented quickly. As in SSKD_SVD [Reference Lee, Kim and Song10], we defined VGG-11 as the T-DNN and VGG-6 as the S-DNN. As in FSP [Reference Yim, Joo, Bae and Kim6], we defined ResNet-32 as the T-DNN and partially reduced the residual modules to create ResNet-8 as the S-DNN.

Fig. 4. T-DNN and S-DNN of the VGG and ResNet models. T-DNN: VGG-11 and ResNet-32. S-DNN: VGG-6 and ResNet-8.

The ImageNet64*64 dataset consists of about 1.2 million images with a size of 64$\times$64 in 1000 classes, divided into about 1.2 million training images and 50 000 test images. We used the same data augmentations as with CIFAR-100, ran the experiments three times, and took the average as the final result. On ImageNet64*64, we defined MobileNet-16 as the T-DNN and MobileNet-9 as the S-DNN, as shown in Fig. 5.

Fig. 5. T-DNN and S-DNN of the MobileNet models. T-DNN: MobileNet-16. S-DNN: MobileNet-9.

On CIFAR-100, the training procedure for the networks followed FSP [Reference Yim, Joo, Bae and Kim6] and SSKD_SVD [Reference Lee, Kim and Song10]. We set the batch size to 128 and the number of training epochs to 200, optimized with stochastic gradient descent [Reference Kiefer and Wolfowitz47], and adopted Nesterov accelerated gradient [Reference Nesterov48]. The initial learning rate was set to $10^{-2}$ and the momentum was set to 0.9. The decay parameter was set to $10^{-4}$. The learning rate was reduced by a factor of 0.1 every 50 epochs. For ImageNet64*64, we set the batch size to 64 and the number of training epochs to 40, and the learning rate was reduced by a factor of 0.1 every 10 epochs.
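
The learning-rate schedule described above can be written as a simple step decay. The sketch below interprets the reduction as multiplying the rate by 0.1 at fixed epoch intervals and counts epochs from zero; both are interpretations made for this example, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-2, decay=0.1, step=50):
    """Step decay: multiply the rate by `decay` once every `step` epochs.

    epoch is 0-based (assumed). The defaults follow the CIFAR-100 settings
    in the text; pass step=10 for the ImageNet64*64 settings.
    """
    return initial_lr * decay ** (epoch // step)


# CIFAR-100: 200 epochs with a drop every 50 epochs.
cifar_schedule = [learning_rate(e) for e in (0, 49, 50, 100, 150, 199)]

# ImageNet64*64: 40 epochs with a drop every 10 epochs.
imagenet_schedule = [learning_rate(e, step=10) for e in (0, 10, 20, 30, 39)]
```

Under this reading, both runs pass through the same four learning-rate levels, from $10^{-2}$ down to $10^{-5}$, over their respective training lengths.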

B) Results

In this section, we show the final results of our proposed method in terms of compression rate, computation, Top-1 accuracy, and inference time. With VGG-11 as the teacher's model and VGG-6 as the student's model, experimental results show that the student's model gains 0.57% in Top-1 accuracy while using 53.9% fewer parameters and 72.8% less computation than the T-DNN, and reduces inference time from 61.6 to 49.8 ms. Furthermore, with ResNet-32 as the teacher's model and ResNet-8 as the student's model, experimental results indicate that the student's model loses 0.55% in Top-1 accuracy while using 83.65% fewer parameters and 82.6% less computation than the T-DNN, and reduces inference time from 115.2 to 51.6 ms. The experimental results of VGG and ResNet are shown in Tables 1 and 2, respectively. In most offline methods, the S-DNN loses some accuracy relative to the T-DNN. However, it is surprising that, with our proposed method, the S-DNN can perform even better than the T-DNN, as shown in Table 1.

Table 1. Classification results after knowledge distillation (VGG-11->6) on CIFAR-100 dataset.

Table 2. Classification results after knowledge distillation (ResNet-32->8) on CIFAR-100 dataset.

With MobileNet-16 as the teacher's model and MobileNet-9 as the student's model, experimental results show that the student's model loses 3.92% in Top-1 accuracy while using 37.4% fewer parameters and 51.3% less computation than the T-DNN, and reduces inference time from 98.7 ms to 52.9 ms. The results of MobileNet are shown in Table 3.

Table 3. Classification results after knowledge distillation (MobileNet-16->9) on ImageNet64*64.
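
The percentage reductions reported in this section can be related to the "$\times$" rates quoted in the abstract with a one-line conversion. The sketch below assumes that a rate is the teacher-to-student ratio, so a reduction of $r$ percent corresponds to a rate of $1/(1-r/100)$; this definition is a common convention and is not stated explicitly in the text. The function name is hypothetical.

```python
def reduction_to_rate(reduction_percent):
    """Convert a percentage reduction into a teacher/student ratio.

    Assumes rate = teacher_cost / student_cost, so removing r percent of
    the cost leaves (1 - r / 100) of it and gives a rate of 1 / (1 - r / 100).
    """
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


# MobileNet-16 -> MobileNet-9: 37.4% fewer parameters, 51.3% less computation.
mobilenet_compression = reduction_to_rate(37.4)
mobilenet_computation = reduction_to_rate(51.3)

# ResNet-32 -> ResNet-8: 83.65% fewer parameters.
resnet_compression = reduction_to_rate(83.65)
```

Under this assumption, the MobileNet reductions come out at roughly $1.6\times$ and $2.05\times$ and the ResNet parameter reduction at roughly $6.1\times$, close to the rates quoted in the abstract for those models.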

C) Ablation

1) Cross-layer matrix

The different options of the cross matrix are shown in Fig. 6. Figure 6 (a) represents the “original” method FSP [Reference Yim, Joo, Bae and Kim6]. Figures 6(b) and 6(c) are our proposed methods in VGG and ResNet models. The combination of cross-one layer and cross-two layers is named as “P1(Cross two layers)”. Furthermore, the combination of cross-one layer, cross-two layers, and cross-three layers is named as “P1(Cross three layers)”.

Fig. 6. (a) Cross one layer. (b) Cross two layers. (c) Cross three layers.

The simulation results of the VGG models are shown in Table 4. The result of the S-DNN is set as the baseline. Compared with the baseline, the methods “P1 (Cross two layers)” and “P1 (Cross three layers)” achieve a low performance in the testing result. Additionally, “P1 (Cross three layers)” shows an increase in performance of 0.4% compared with FSP [Reference Yim, Joo, Bae and Kim6]. Next, we consider the deeper ResNet architecture, as shown in Table 5. Again, the methods “P1 (Cross two layers)” and “P1 (Cross three layers)” achieve a low performance in the testing results. “P1 (Cross three layers)” shows an increase in performance of 0.13% compared with FSP [Reference Yim, Joo, Bae and Kim6].

Table 4. Different proposed method of cross matrix (VGG-11->6) with CIFAR-100. T-DNN: VGG-11, S-DNN: VGG-6.

Table 5. Different proposed method of cross matrix (ResNet-32->8) with CIFAR-100. T-DNN: ResNet-32, S-DNN: ResNet-8.

Additionally, the different choices of the cross-layer matrix are shown in Fig. 7. Figure 7(a) represents the “original” method FSP [Reference Yim, Joo, Bae and Kim6]. Figures 7(b)–7(d) are our proposed methods with the MobileNet model. The combination of cross-one layer and cross-two layers is named “P1 (Cross two layers)”. Additionally, the combination of cross-one layer, cross-two layers, and cross-three layers is named “P1 (Cross three layers)”. Finally, the combination of cross-one layer, cross-two layers, cross-three layers, and cross-four layers is named “P1 (Cross four layers)”.

Fig. 7. (a) Cross-one layer. (b) Cross-two layers. (c) Cross-three layers. (d) Cross-four layers.

The results of the MobileNet model are shown in Table 6. Compared with the baseline, the methods “P1 (Cross two layers)” and “P1 (Cross three layers)” achieve a low performance in the testing result. Furthermore, “P1 (Cross three layers)” shows an increase in performance of 3.38% compared with FSP [Reference Yim, Joo, Bae and Kim6].

Table 6. Different proposed method of cross matrix (MobileNet-16->9) with ImageNet64*64. T-DNN: MobileNet-16, S-DNN: MobileNet-9.

2) Influence of adding KL Divergence

The addition of KL Divergence as our second-order loss function is illustrated in Fig. 8. We believe that the student's model can absorb the teacher model's knowledge by making its distribution similar to the teacher's distribution. The combination of FSP [Reference Yim, Joo, Bae and Kim6] and KL Divergence is named “P2 (KL Divergence)”. First, we consider the results of the VGG models, as shown in Table 7. The result of the S-DNN is set as the baseline. We compare against two competitors: the approach proposed by Hinton et al. [Reference Hinton, Vinyals and Dean7] and our basis, FSP [Reference Yim, Joo, Bae and Kim6]. The results demonstrate that adding KL Divergence yields a 1.54% increase over the competitor FSP [Reference Yim, Joo, Bae and Kim6]. Furthermore, the experimental results of ResNet are shown in Table 8. They indicate that adding KL Divergence yields a 1.54% increase over the competitor FSP [Reference Yim, Joo, Bae and Kim6].

Fig. 8. Illustration of using KL Divergence.
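As an illustration of the KL Divergence term, the sketch below compares the softmax output distributions of a teacher and a student with NumPy. The direction KL(teacher || student), the temperature of 1, and the logit values are assumptions made only for this example; the loss actually used is the one defined in Section III.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for each row of two probability arrays."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Made-up logits for a batch of two samples and three classes.
teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student_logits = np.array([[2.5, 1.5, 0.5], [0.5, 2.0, 0.4]])

p_t = softmax(teacher_logits)
p_s = softmax(student_logits)

# Assumed direction: the teacher distribution is the reference.
kl_loss = kl_divergence(p_t, p_s).mean()
print(f"KL loss: {kl_loss:.4f}")
```

The term is zero only when the two distributions coincide, so minimizing it pulls the student's predicted class probabilities toward the teacher's.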

Table 7. Differential of adding KL Divergence (VGG-11->6) with CIFAR-100. T-DNN: VGG-11, S-DNN: VGG-6.

Table 8. Differential of adding KL Divergence (ResNet-32->8) with CIFAR-100. T-DNN: ResNet-32, S-DNN: ResNet-8.

On ImageNet64*64, the experimental results of the MobileNet models are shown in Table 9. We again have two competitors: the approach proposed by Hinton et al. [7] and our basis, FSP [6]. Adding KL Divergence obtains a 2.59% increase over FSP [6].

Table 9. Differential of adding KL Divergence (MobileNet-16->9) with ImageNet64*64. T-DNN: MobileNet-16, S-DNN: MobileNet-9.

3) Offline ensemble

In this section, we discuss the influence of the number of pre-trained teachers, as shown in Fig. 9. Figure 9(a) uses one pre-trained teacher, as in FSP [6]. Figures 9(b) and 9(c) represent two and three pre-trained teachers, named “P3 (two teachers)” and “P3 (three teachers)”, respectively. First, the results for the VGG model are given in Table 10, with the result of S-DNN set as the baseline. Table 10 shows that using more pre-trained teachers with the stochastic mean can increase the Top-1 accuracy: offline ensemble yields a 1.66% increase over FSP [6]. Next, for the deeper ResNet model, Table 11 also shows that using more teachers can increase the Top-1 accuracy: offline ensemble obtains a 1.31% increase over FSP [6]. Finally, the results for the MobileNet models are shown in Table 12. It is demonstrated that adding KL Divergence yields a 2.59% increase over FSP [6].

Fig. 9. (a) One pre-trained teacher. (b) Two pre-trained teachers. (c) Three pre-trained teachers.
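The text above refers to a stochastic mean over pre-trained teachers without restating its definition here. As a rough sketch only, the snippet below uses a plain arithmetic mean of the teachers' Gramian matrices as the distillation target and an L2 loss against the student's matrix. Both choices are stand-in assumptions, and the function names and matrix shapes are hypothetical.

```python
import numpy as np

def ensemble_target(teacher_matrices):
    """Arithmetic mean of same-shaped Gramian matrices from several teachers.

    Assumption: a plain mean stands in for the paper's stochastic mean.
    """
    return np.mean(np.stack(teacher_matrices, axis=0), axis=0)

def l2_distill_loss(student_matrix, target_matrix):
    """Mean squared difference between student and target matrices."""
    return float(np.mean((student_matrix - target_matrix) ** 2))

# Three hypothetical pre-trained teachers and one student, same layer group.
rng = np.random.default_rng(1)
teachers = [rng.standard_normal((16, 32)) for _ in range(3)]
student = rng.standard_normal((16, 32))

target = ensemble_target(teachers)
loss = l2_distill_loss(student, target)
print(target.shape, f"{loss:.4f}")
```

Because the teachers are pre-trained and fixed, the target can be computed offline once per batch and reused, which is what distinguishes this setting from online mutual learning.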

Table 10. Different numbers of teachers (VGG-11->6) with CIFAR-100. T-DNN: VGG-11, S-DNN: VGG-6.

Table 11. Different numbers of teachers (ResNet-32->8) with CIFAR-100. T-DNN: ResNet-32, S-DNN: ResNet-8.

Table 12. Different numbers of teachers (MobileNet-16->9) with ImageNet64*64. T-DNN: MobileNet-16, S-DNN: MobileNet-9.

4) Combination of proposed methods

Building on the preceding ablation results, we combine the proposed methods. The combination of crossing three layers, adding KL Divergence, and offline ensemble is shown in Tables 13, 14, and 15.

Table 13. Combination of proposed methods (VGG-11->6) with CIFAR-100. T-DNN: VGG-11, S-DNN: VGG-6. P1: cross-three layers. P2: KL Divergence. P3: three pre-trained teachers.

Table 14. Combination of proposed methods (ResNet-32->8) with CIFAR-100. T-DNN: ResNet-32, S-DNN: ResNet-8. P1: cross-three layers. P2: KL Divergence. P3: three pre-trained teachers.

First, consider VGG with the combination of crossing three layers, adding KL Divergence, and offline ensemble. Table 13 shows that adding the proposed methods increases the Top-1 accuracy; the combination yields an increase of 2.27% over FSP [6]. Second, consider the deeper ResNet model with the combination of proposed methods. Table 14 shows that adding the proposed methods again increases the Top-1 accuracy; the combination yields an increase of 2.98% over FSP [6]. Finally, Table 15 shows that adding the proposed methods increases the Top-1 accuracy for MobileNet, where the combination yields a 4.04% increase over FSP [6].

Table 15. Combination of proposed methods (MobileNet) with ImageNet64*64. T-DNN: MobileNet-16, S-DNN: MobileNet-9. P1: cross-three layers. P2: KL Divergence. P3: three pre-trained teachers.

D) Comparison with other work

With the same compression of S-DNN, our proposed method achieves state-of-the-art results on the VGG and ResNet models compared with the competitors [6, 7, 10, 11]. As shown in Table 16, our proposed method achieves a 66.67% Top-1 accuracy with a 2.08x compression rate and a 3.5x computation rate. As shown in Table 17, it achieves a 68.45% Top-1 accuracy with a 6.11x compression rate and a 5.27x computation rate. Furthermore, Table 18 shows that it achieves a 49.86% Top-1 accuracy with a 1.59x compression rate and a 2.05x computation rate.

Table 16. Computation, parameters, and average Top-1 accuracy comparison with VGG-11 and VGG-6 on CIFAR-100. T-DNN: VGG-11, S-DNN: VGG-6.

Table 17. Computation, parameters, and average Top-1 accuracy comparison with ResNet-32 and ResNet-8 on CIFAR-100. T-DNN: ResNet-32, S-DNN: ResNet-8.

Table 18. Computation, parameters, and average Top-1 accuracy comparison with MobileNet-16 and MobileNet-9 on ImageNet64*64. T-DNN: MobileNet-16, S-DNN: MobileNet-9.

V. DISCUSSION

In this section, we describe the limitations of our proposed method. We then present solutions, compare them with previously proposed methods, and discuss why some of our new ideas allow S-DNN to achieve better performance with the same compression.

A) Constraints of our proposed method

FSP [6] defines layer groups in the network and expresses the correlation between the input and output feature maps of each layer group as a Gramian matrix. The limitation is that the T-DNN and S-DNN must have the same number of channels, as shown in Fig. 10. If the Gramian matrices of T-DNN and S-DNN do not have the same dimensions, that is, unless $m_{T}=m_{S}$ and $n_{T}=n_{S}$, the loss function cannot be computed. Furthermore, we find that a large difference in the number of layers between T-DNN and S-DNN may make it difficult to transfer knowledge, as shown in Fig. 11. Hence, we define “the T-DNN and S-DNN should have the same channels” and “the large difference in layers between T-DNN and S-DNN” as constraint 1 and constraint 2, respectively.

Fig. 10. Limitation of FSP [6]. $m_{T},\,n_{T}$ represent the dimensions of the T-DNN Gramian matrix and $m_{S},\,n_{S}$ represent the dimensions of the S-DNN Gramian matrix.

Fig. 11. Illustration of huge layer number difference. C, convolutional layer; FC, fully-connected layer.

B) Solutions

The key to solving constraint 1 is to make the Gramian matrices of T-DNN and S-DNN have the same dimensions. We therefore propose using $1\times 1$ convolutional layers to adjust the output channels of T-DNN. In Fig. 12, the orange layers are the additional $1\times 1$ convolutional layers. Because the additional convolutional layers are used only during training, the parameters of T-DNN and S-DNN remain the same as the original ones. Moreover, we propose two-stage compression to solve constraint 2, as shown in Fig. 13: the “Teacher” T-DNN first teaches a temporary neural network model, “Temporal”, and “Temporal” then teaches the final target neural network model, “Student”.

Fig. 12. Using $1\times 1$ convolutional layers to decrease channels.
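The dimension constraint and its workaround can be sketched as follows. A $1\times 1$ convolution without bias is a per-pixel linear map over channels, so applying one to each teacher feature map changes the teacher's Gramian matrix from $m_{T}\times n_{T}$ to $m_{S}\times n_{S}$. The channel counts, the random weights (which would be trained in practice), and the L2 loss are assumptions chosen only for illustration.

```python
import numpy as np

def conv1x1(feat, weight):
    """1x1 convolution without bias: (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)."""
    return feat @ weight

def gram(feat_in, feat_out):
    """Gramian matrix between two (H, W, C) feature maps, averaged over pixels."""
    h, w, _ = feat_in.shape
    return np.einsum("hwi,hwj->ij", feat_in, feat_out) / (h * w)

rng = np.random.default_rng(2)

# Hypothetical layer group: teacher has 64 -> 128 channels, student 32 -> 64.
t_in, t_out = rng.standard_normal((8, 8, 64)), rng.standard_normal((8, 8, 128))
s_in, s_out = rng.standard_normal((8, 8, 32)), rng.standard_normal((8, 8, 64))

# Extra training-only projections on the teacher side (random stand-ins here).
w_in = rng.standard_normal((64, 32)) * 0.1
w_out = rng.standard_normal((128, 64)) * 0.1

g_t_raw = gram(t_in, t_out)                              # (64, 128): mismatched
g_t = gram(conv1x1(t_in, w_in), conv1x1(t_out, w_out))   # (32, 64): matched
g_s = gram(s_in, s_out)                                  # (32, 64)

loss = float(np.mean((g_t - g_s) ** 2))
print(g_t_raw.shape, g_t.shape, g_s.shape, f"{loss:.4f}")
```

The projections exist only to form the loss, so they can be discarded after training and add nothing to the deployed student.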

Fig. 13. The illustration of two-stage knowledge distillation. C, convolutional layer; FC, fully connected layer.
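The control flow of two-stage knowledge distillation can be written as a short chain. In the sketch below, `distill` is a hypothetical placeholder for one complete teacher-to-student training run, and the model names are plain strings; only the order of the stages is meant to be illustrative.

```python
from typing import Callable, List, TypeVar

M = TypeVar("M")

def multi_stage_distill(models: List[M], distill: Callable[[M, M], M]) -> M:
    """Distil along a chain: models[0] teaches models[1], which teaches models[2], ..."""
    current = models[0]
    for nxt in models[1:]:
        current = distill(current, nxt)  # the trained student becomes the next teacher
    return current

# Placeholder for a real training run: records the pairing and returns the student.
log = []

def fake_distill(teacher, student):
    log.append((teacher, student))
    return student

final = multi_stage_distill(["ResNet-50", "ResNet-18", "ResNet-10"], fake_distill)
print(log, final)
```

With three models this reduces to the Teacher, Temporal, Student sequence of Fig. 13; a longer list would give further intermediate stages, which we did not evaluate.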

With our proposed method, we did not run experiments that greatly reduce the depth of the student model, for two reasons. First, if the depth of the student model is greatly reduced, the image sizes after passing through the reduced-depth student modules differ from those of the original student model, so the issue of how to align them with the teacher model needs to be resolved. Second, reducing the number of Res-blocks to three in some Res-blocks could make the Top-1 accuracy drop heavily, since the CNN model may not learn enough details from the feature maps.

As a result, we do not change the structure of ResNet. We reduce the student model to the lowest number of ResNet layers, which gives ResNet-10 the smallest size and computation, as shown in Fig. 14.

Fig. 14. ResNet-50/ResNet-18/ResNet-10.

C) Experiments and results

Using the same environment as for the CIFAR-100 experiments, we set the batch size to 64 and the number of training epochs to 200. We use ResNet-50 as the teacher's model, ResNet-18 as the temporal model, and ResNet-10 as the student's model, as shown in Fig. 14. Experimental results show that the student's model increases the Top-1 accuracy by 3.45%, reduces parameters by 72.19% and computation by 73.16% compared with T-DNN, and reduces inference time from 188.9 ms to 66.3 ms. The experimental results of ResNet are shown in Table 19. Compared with ResNet-10 in Table 20, by using $1\times 1$ convolutional layers and our proposed method, ResNet-18 can obtain a 6.20% differential accuracy. Furthermore, by using two-stage KD, we could increase the differential accuracy by a further 6.54%. We believe that using two-stage knowledge distillation can prevent knowledge from being lost during KD.

Table 19. Classification results after knowledge distillation (ResNet-50->10) with CIFAR-100 dataset.

Table 20. Methods of adding $1\times 1$ convolution to solve the limitation of the proposed method and multi-step compression with ResNet models and CIFAR-100. T-DNN1: ResNet-50. T-DNN2: ResNet-18. S-DNN: ResNet-10.
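The percentage reductions reported above can be converted to the "x" rates used elsewhere in the paper. The sketch below assumes that a compression or computation rate is the ratio of the teacher's cost to the student's cost, so a reduction of r% corresponds to a rate of 1 / (1 - r/100); the helper names are hypothetical.

```python
def rate_from_reduction(reduction_percent):
    """Convert a percentage reduction into a teacher/student ratio."""
    return 1.0 / (1.0 - reduction_percent / 100.0)

def reduction_from_rate(rate):
    """Convert a teacher/student ratio into a percentage reduction."""
    return 100.0 * (1.0 - 1.0 / rate)

# ResNet-50 -> ResNet-10 figures quoted in the text.
param_rate = rate_from_reduction(72.19)   # parameters
flops_rate = rate_from_reduction(73.16)   # computation
speedup = 188.9 / 66.3                    # inference time in ms

print(f"parameters: {param_rate:.2f}x")
print(f"computation: {flops_rate:.2f}x")
print(f"inference: {speedup:.2f}x")

# The reverse direction, e.g. for the 2.08x VGG compression rate.
print(f"2.08x is a {reduction_from_rate(2.08):.2f}% reduction")
```

Under this assumption, the ResNet-50 to ResNet-10 result corresponds to roughly a 3.6x compression rate and a 3.7x computation rate, with inference about 2.8x faster.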

VI. CONCLUSION

In this paper, we propose a method that uses cross-layer knowledge distillation with KL divergence and offline ensemble to transfer more knowledge from T-DNN to S-DNN and improve the Top-1 accuracy of image classification. Moreover, we use a $1\times 1$ convolutional layer to adjust the dimensions of the Gramian matrix, which resolves the limitation of our proposed method, and we further propose two-stage KD to avoid the loss of knowledge during transfer.

Hsing-Hung Chou received a B.S. degree in Communications, Navigation and Control Engineering from National Taiwan Ocean University, Taipei, Taiwan. He received an M.S. degree from the Department of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan. His research focused on compressing deep neural network models with heuristic methods. He now works at Cisco Taiwan as a consulting engineer, focusing on the 5G Core Network and on introducing 5GCN to manufacturing industries with automation tools to help customers with digital transformation.

Ching-Te Chiu received her B.S. and M.S. degrees from National Taiwan University, Taipei, Taiwan. She received her Ph.D. degree from University of Maryland, College Park, Maryland, USA, all in electrical engineering. She is a Professor at the Computer Science Department and Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan. Currently, she is the Director of Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu, Taiwan. She was a member of technical staff with AT&T, Lucent Technologies, Murray Hill, NJ, USA, and with Agere Systems, Murray Hill, NJ, USA. Her research interests include Machine Learning, Pattern Recognition, High Dynamic Range Image and Video Processing, Super Resolution, High Speed SerDes design, Multi-chip Interconnect, and Fault Tolerance for Network-on-Chip design. Dr. Chiu won the first prize award, the best advisor award, and the best innovation award of the Golden Silicon Award. She served as the Chair of IEEE Signal Processing Society, Design and Implementation of Signal Processing Systems (DISPS) TC. She is a TC member of the IEEE Circuits and Systems Society, Nanoelectronics and Gigascale Systems Group. She was the program chair of the first IEEE Signal Processing Society Summer School at Hsinchu, Taiwan, in 2011 and technical program chair of the IEEE Workshop on Signal Processing Systems (SiPS) 2013. She served as associate editor of IEEE Transactions on Circuits and Systems I, IEEE Signal Processing Magazine, and Journal of Signal Processing Systems.

Ms. Yi-Ping Liao is a student at National Tsing Hua University.

References

[1] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L.: Imagenet: a large-scale hierarchical image database, in IEEE Conf. on Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE, Florida, USA, 2009, pp. 248–255.

[2] Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A.: The pascal visual object classes (voc) challenge. Int. J. Comput. Vis., 88 (2) (2010), 303–338.

[3] Cordts, M.; et al.: The cityscapes dataset for semantic urban scene understanding, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. IEEE, Las Vegas, USA, 2016, pp. 3213–3223.

[4] Choy, C.B.; Xu, D.; Gwak, J.; Chen, K.; Savarese, S.: 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction, in European Conf. on Computer Vision. Springer, Amsterdam, the Netherlands, 2016, pp. 628–644.

[5] Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R.: Indoor segmentation and support inference from rgbd images, in European Conf. on Computer Vision. Springer, Florence, Italy, 2012, pp. 746–760.

[6] Yim, J.; Joo, D.; Bae, J.; Kim, J.: A gift from knowledge distillation: Fast optimization, network minimization and transfer learning, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, Hawaii, 2017, pp. 4133–4141.

[7] Hinton, G.; Vinyals, O.; Dean, J.: Distilling the knowledge in a neural network, preprint arXiv:1503.02531, 2015.

[8] Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y.: Fitnets: hints for thin deep nets, preprint arXiv:1412.6550, 2014.

[9] Chen, T.; Goodfellow, I.; Shlens, J.: Net2net: accelerating learning via knowledge transfer, preprint arXiv:1511.05641, 2015.

[10] Lee, S.H.; Kim, D.H.; Song, B.C.: Self-supervised knowledge distillation using singular value decomposition, in European Conf. on Computer Vision. Springer, Munich, Germany, 2018, pp. 339–354.

[11] Park, W.; Kim, D.; Lu, Y.; Cho, M.: Relational knowledge distillation, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, Long Beach, California, 2019, pp. 3967–3976.

[12] Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H.: Deep mutual learning, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, Utah, 2018, pp. 4320–4328.

[13] Anil, R.; Pereyra, G.; Passos, A.; Ormandi, R.; Dahl, G.E.; Hinton, G.E.: Large scale distributed neural network training through online distillation, 2018.

[14] Lan, X.; Zhu, X.; Gong, S.: Knowledge distillation by on-the-fly native ensemble, arXiv preprint arXiv:1806.04606, 2018.

[15] Krizhevsky, A.; Sutskever, I.; Hinton, G.E.: Imagenet classification with deep convolutional neural networks, in 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3–6, 2012, Lake Tahoe, Nevada, United States, 2012, pp. 1097–1105.

[16] Simonyan, K.; Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[17] Ren, S.; He, K.; Girshick, R.; Sun, J.: Faster r-cnn: towards real-time object detection with region proposal networks, in Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, 2015, pp. 91–99.

[18] Redmon, J.; Farhadi, A.: Yolo9000: better, faster, stronger, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. IEEE, Honolulu, Hawaii, USA, 2017, pp. 7263–7271.

[19] Liu, Y.; Chen, K.; Liu, C.; Qin, Z.; Luo, Z.; Wang, J.: Structured knowledge distillation for semantic segmentation, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. IEEE, Long Beach, California, USA, 2019, pp. 2604–2613.

[20] Ji, S.; Xu, W.; Yang, M.; Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern. Anal. Mach. Intell., 35 (1) (2013), 221–231.

[21] Wang, L.; Yoon, K.-J.: Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks, IEEE Trans. Pattern Anal. Mach. Intell. (2021). doi:10.1109/TPAMI.2021.3055564.

[22] Wang, X.; Zhang, R.; Sun, Y.; Qi, J.: Kdgan: Knowledge distillation with generative adversarial networks, in Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper/2018/file/019d385eb67632a7e958e23f24bd07d7-Paper.pdf.

[23] Bae, J.-H.; Yeo, D.; Yim, J.; Kim, N.-S.; Pyo, C.-S.; Kim, J.: Densely distilled flow-based knowledge transfer in teacher-student framework for image classification. IEEE Trans. Image. Process., 29, (2020), 5698–5710.

[24] Hanson, S.J.; Pratt, L.Y.: Comparing biases for minimal network construction with back-propagation, in Advances in Neural Information Processing Systems 2, NIPS Conference, Denver, Colorado, USA, 1989, pp. 177–185.

[25] Han, S.; Pool, J.; Tran, J.; Dally, W.: Learning both weights and connections for efficient neural network, in Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, 2015, pp. 1135–1143.

[26] Han, S.; Mao, H.; Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding, preprint arXiv:1510.00149, 2015.

[27] Tung, F.; Mori, G.: Clip-q: Deep network compression learning by in-parallel pruning-quantization, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, Utah, USA, 2018, pp. 7873–7882.

[28] Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P.: Pruning filters for efficient convnets, preprint arXiv:1608.08710, 2016.

[29] Hsiao, T.-Y.; Chang, Y.-C.; Chou, H.-H.; Chiu, C.-T.: Filter-based deep-compression with global average pooling for convolutional networks. J. Syst. Arch., 95, (2019), 9–18.

[30] Deng, L.; Li, G.; Han, S.; Shi, L.; Xie, Y.: Model compression and hardware acceleration for neural networks: a comprehensive survey. Proc. IEEE, 108 (4) (2020), 485–532.

[31] Sze, V.; Chen, Y.-H.; Yang, T.-J.; Emer, J.S.: Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE, 105 (12) (2017), 2295–2329.

[32] Denil, M.; Shakibi, B.; Dinh, L.; De Freitas, N.; et al.: Predicting parameters in deep learning, in 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, USA, pp. 2148–2156.

[33] Denton, E.L.; Zaremba, W.; Bruna, J.; LeCun, Y.; Fergus, R.: Exploiting linear structure within convolutional networks for efficient evaluation, in 28th Annual Conference on Neural Information Processing Systems 2014, Montreal, Canada, 2014, pp. 1269–1277.

[34] Jaderberg, M.; Vedaldi, A.; Zisserman, A.: Speeding up convolutional neural networks with low rank expansions, preprint arXiv:1405.3866, 2014.

[35] Zhang, X.; Zou, J.; He, K.; Sun, J.: Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern. Anal. Mach. Intell., 38 (10) (2015), 1943–1955.

[36] Basu, S.; Varshney, L.R.: Universal source coding of deep neural networks, in DCC 2017 Data Compression Conference, Snowbird, Utah, USA, 2017, pp. 310–319.

[37] Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; Narayanan, P.: Deep learning with limited numerical precision, in The 32nd International Conference on Machine Learning (ICML 2015), France, 2015, pp. 1737–1746.

[38] Dettmers, T.: 8-bit approximations for parallelism in deep learning, preprint arXiv:1511.04561, 2015.

[39] Hwang, K.; Sung, W.: Fixed-point feedforward deep neural network design using weights +1, 0, and -1, in SIPS 2014: IEEE Workshop on Signal Processing Systems, Belfast, Ireland, UK, 2014, pp. 1–6.

[40] Ji, Z.; Ovsiannikov, I.; Wang, Y.; Shi, L.; Zhang, Q.: Reducing weight precision of convolutional neural networks towards large-scale on-chip image recognition, in Independent Component Analyses, Compressive Sampling, Large Data Analyses (LDA), Neural Networks, Biosystems, and Nanoengineering XIII, vol. 9496. International Society for Optics and Photonics, 2015, p. 94960A.

[41] Kang, L.-W.: Special issue on deep learning based detection and recognition for perceptual tasks with applications. APSIPA Trans. Signal Inform. Proc., 8, (2019), e21.

[42] Wang, H.-P.; Peng, W.-H.; Ko, W.-J.: Learning priors for adversarial autoencoders. APSIPA Trans. Signal Inform. Proc., 9, (2020), e4.

[43] Pascanu, R.; Mikolov, T.; Bengio, Y.: On the difficulty of training recurrent neural networks, 2012.

[44] Krizhevsky, A.; Hinton, G.; et al.: Learning multiple layers of features from tiny images, Citeseer, Tech. Rep., 2009.

[45] Chrabaszcz, P.; Loshchilov, I.; Hutter, F.: A downsampled variant of imagenet as an alternative to the cifar datasets, preprint arXiv:1707.08819, 2017.

[46] Abadi, M.; et al.: Tensorflow: a system for large-scale machine learning, in Proceedings of OSDI '16: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Savannah, GA, USA, 2016, pp. 265–283.

[47] Kiefer, J.; Wolfowitz, J.; et al.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat., 23 (3) (1952), 462–466.

[48] Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^{2})$. Doklady AN USSR, 269, (1983), 543–547.
