Research on the fast implementation method of Winograd transposed convolution

doi:10.19665/j.issn1001-2400.20230308

Abstract

Winograd transposed convolution is a widely used convolution acceleration method on field-programmable gate arrays (FPGAs). It avoids the zero-padding problem of transposed convolution by splitting the computation into groups and applying Winograd convolution to each group. However, this approach requires grouping both the input feature map and the convolution kernel, and the partial results must then be reorganized into a complete output feature map; the element-coordinate calculations this involves add considerable design complexity. To address these problems, a Winograd transposed convolution method based on a unified transformation matrix is proposed. The unified transformation matrix replaces the grouping of the input feature map and the convolution kernel, and thereby resolves the problems of overlapping summation, zero padding, kernel flipping, decomposition and reorganization. Guided by this method, and combining data reuse, double buffering and pipelining, a transposed convolution accelerator is designed on an FPGA. The Gaussian-Poisson generative adversarial network is used for experimental verification, and the design is compared with mainstream transposed convolution designs. Experimental results show that the proposed method reduces resource consumption and power consumption, and that the effective performance of the accelerator is about 1.13 to 23.92 times that of existing transposed convolution methods.

Key words: unified transformation matrix, Winograd transposed convolution, field programmable gate array, accelerator
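The zero-padding, kernel-flipping and overlapping-summation issues named in the abstract can be made concrete with a small NumPy sketch. It is not the unified-transformation-matrix method of the paper; it only contrasts the two textbook formulations of a single-channel transposed convolution and checks that they agree. The function names and the test sizes (4×4 input, 4×4 kernel, stride 2, no output cropping) are illustrative assumptions.

```python
import numpy as np


def transposed_conv_scatter(x, w, stride):
    """Direct form: every input pixel scatters a scaled copy of the kernel
    into the output, and overlapping contributions are summed."""
    h, wd = x.shape
    k = w.shape[0]
    out = np.zeros(((h - 1) * stride + k, (wd - 1) * stride + k))
    for i in range(h):
        for j in range(wd):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return out


def transposed_conv_zero_insert(x, w, stride):
    """Zero-insertion form: put stride-1 zeros between input pixels, pad the
    border with k-1 zeros, then slide the flipped kernel over the result."""
    h, wd = x.shape
    k = w.shape[0]
    up = np.zeros(((h - 1) * stride + 1, (wd - 1) * stride + 1))
    up[::stride, ::stride] = x          # inserted zeros stay zero
    up = np.pad(up, k - 1)              # constant (zero) border padding
    w_flipped = w[::-1, ::-1]           # kernel flipped in both dimensions
    oh, ow = up.shape[0] - k + 1, up.shape[1] - k + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(up[i:i + k, j:j + k] * w_flipped)
    return out, up.size


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))
w = rng.standard_normal((4, 4))

direct = transposed_conv_scatter(x, w, stride=2)
via_zeros, padded_size = transposed_conv_zero_insert(x, w, stride=2)

print("outputs agree:", np.allclose(direct, via_zeros))
print("share of padded map that is structurally zero:",
      1 - x.size / padded_size)
```

For this example the padded map has 13×13 = 169 entries, of which at most 16 carry input data, so most multiplications in the zero-insertion form are wasted. The scatter form avoids the zeros but needs overlapping accumulation in the output, which is the trade-off the grouped and unified Winograd formulations aim to sidestep.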

CLC number: TP18

LI Zhao, HUANG Chengcheng, HE Yizhi, SU Xiaojie. Research on the fast implementation method of Winograd transposed convolution[J]. Journal of Xidian University, 2023, 50(6): 148-160.

Figures and tables (11): Figures 1-7, Tables 1-4

References (25)

[1]	YU J, HU Y, NING X, et al. Instruction Driven Cross-Layer CNN Accelerator with Winograd Transformation on FPGA[C]// 2017 International Conference on Field Programmable Technology(ICFPT).Piscataway:IEEE, 2017:227-230.
[2]	LU L, LIANG Y, XIAO Q, et al. Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs[C]// 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines(FCCM).Piscataway:IEEE, 2017:101-108.
[3]	SHEN J, HUANG Y, WANG Z, et al. Towards a Uniform Template-Based Architecture for Accelerating 2D and 3D CNNs on FPGA[C]// The 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays(FPGA'18). New York: ACM, 2018:97-106.
[4]	LIU X Y, POOL J, HAN S, et al. Efficient Sparse-Winograd Convolutional Neural Network[C]// Proceedings of the 6th International Conference on Learning Representations(ICLR 2018).Appleton:ICLR, 2018:1-10.
[5]	WEI X, YU C, ZHANG P, et al. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs[C]// 2017 54th ACM/EDAC/IEEE Design Automation Conference(DAC).Piscataway:IEEE, 2017:1-6.
[6]	YANG C, WANG Y, WANG X, et al. WRA:A 2.2-to-6.3 TOPS Highly Unified Dynamically Reconfigurable Accelerator Using a Novel Winograd Decomposition Algorithm for Convolutional Neural Networks[J]. IEEE Transactions on Circuits and Systems I:Regular Papers, 2019, 66(9):3480-3493. doi: 10.1109/TCSI.8919
[7]	YEPEZ J, KO S B. Stride 2 1-D,2-D,and 3-D Winograd for Convolutional Neural Networks[J]. IEEE Transactions on Very Large Scale Integration Systems, 2020, 28(4):853-863. doi: 10.1109/TVLSI.92
[8]	DENG H, WANG J, YE H, et al. 3D-VNPU:A Flexible Accelerator for 2D/3D CNNs on FPGA[C]// Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines(FCCM 2021).Piscataway:IEEE, 2021:181-185.
[9]	SHEN J, HUANG Y, WEN M, et al. Toward an Efficient Deep Pipelined Template-Based Architecture for Accelerating the Entire 2-D and 3-D CNNs on FPGA[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(7):1442-1455. doi: 10.1109/TCAD.43
[10]	WANG Z L, LAN Q, HE H J, et al. Winograd Algorithm for 3D Convolution Neural Networks[C]// Proceedings of the 26th International Conference on Artificial Neural Networks(ICANN 2017).Berlin:Springer, 2017:609-616.
[11]	KIM M, PARK C, KIM S, et al. Efficient Dilated-Winograd Convolutional Neural Networks[C]// 2019 IEEE International Conference on Image Processing(ICIP).Piscataway:IEEE, 2019:2711-2715.
[12]	DING W, HUANG Z Y, HUANG Z K, et al. Designing Efficient Accelerator of Depthwise Separable Convolutional Neural Network on FPGA[J]. Journal of Systems Architecture, 2019, 97:278-286. doi: 10.1016/j.sysarc.2018.12.008
[13]	KNAPHEIDE J, STABERNACK B, KUHNKE M. A High Throughput MobileNetV2 FPGA Implementation Based on a Flexible Architecture for Depthwise Separable Convolution[C]// 2020 30th International Conference on Field-Programmable Logic and Applications(FPL).Piscataway:IEEE, 2020:277-283.
[14]	YAN J, YIN S, TU F, et al. GNA:Reconfigurable and Efficient Architecture for Generative Network Acceleration[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018, 37(11):2519-2529. doi: 10.1109/TCAD.2018.2857258
[15]	ZHANG X, DAS S, NEOPANE O, et al. A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA(2017)[J/OL].[2020-01-01]. https://arxiv.org/abs/1705.02583v1.
[16]	LIU S, FAN H, NIU X, et al. Optimizing CNN-Based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA[J]. ACM Transactions on Reconfigurable Technology and Systems, 2018, 11(3):1-22.
[17]	XIA L, DIAO L, JIANG Z, et al. PAI-FCNN:FPGA Based Inference System for Complex CNN Models[C]// 2019 IEEE 30th International Conference on Application-Specific Systems,Architectures and Processors(ASAP).Piscataway:IEEE, 2019:107-114.
[18]	BAI L, LYU Y, HUANG X. A Unified Hardware Architecture for Convolutions and Deconvolutions in CNN[C]// 2020 IEEE International Symposium on Circuits and Systems(ISCAS).Piscataway:IEEE, 2020:1-5.
[19]	DI X K, YANG H G, HUANG Z H, et al. Exploring Resource-Efficient Acceleration Algorithm for Transposed Convolution of GANs on FPGA[C]// 2019 International Conference on Field-Programmable Technology(ICFPT).Piscataway:IEEE, 2019:19-27.
[20]	DI X K, YANG H G, JIA Y P, et al. Exploring Efficient Acceleration Architecture for Winograd-Transformed Transposed Convolution of GANs on FPGAs[J]. Electronics, 2020, 9(2):1-21. doi: 10.3390/electronics9010001
[21]	CHANG J, AHN S, KANG K, et al. Towards Design Methodology of Efficient Fast Algorithms for Accelerating Generative Adversarial Networks on FPGAs[C]// 2020 25th Asia and South Pacific Design Automation Conference(ASP-DAC).Piscataway:IEEE, 2020:283-288.
[22]	XU Yin, LIU Shuai, SHAO Meng, et al. Multi-Scale Generation Antagonistic Network for the Low-Dose CT Images Super-Resolution Reconstruction Algorithm[J]. Journal of Xidian University, 2022, 49(2):228-236. (in Chinese)
[23]	GAO Jie, HUO Zhiyong. Algorithm for Image Inpainting in Generative Adversarial Networks Based on Gated Convolution[J]. Journal of Xidian University, 2022, 49(1):216-224. (in Chinese)
[24]	LI Bin, QI Yanrong, ZHOU Qinglei. Design and Optimization of Target Detection Accelerator Based on Winograd Algorithm[J]. Acta Electronica Sinica, 2022, 50(10):2387-2397. doi: 10.12263/DZXB.20201371 (in Chinese)
[25]	HUANG C C, DONG X X, LI Z, et al. Efficient Stride 2 Winograd Convolution Method Using Unified Transformation Matrices on FPGA[C]// 2021 International Conference on Field-Programmable Technology(ICFPT).Piscataway:IEEE, 2021:1-9.

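Convolution algorithm	Reorganized transposed convolution (DSP)	Reorganized transposed convolution (LUT)	Grouped Winograd transposed convolution (DSP)	Grouped Winograd transposed convolution (LUT)	Unified-transformation-matrix Winograd transposed convolution (DSP)	Unified-transformation-matrix Winograd transposed convolution (LUT)
F_tr,2(4²,3²)	36	145	25	265	25	264
F_tr,2(4²,4²)	64	320	36	873	36	608
F_tr,2(4²,5²)	100	433	49	1133	49	1111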
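As a reading aid for the resource table above, the following sketch computes the relative savings implied by its figures: DSP savings of the unified-matrix design against the reorganized design, and LUT savings against the grouped Winograd design. The numbers are copied from the table; the dictionary keys and the helper name `reduction_percent` are illustrative, not from the paper.

```python
# (DSP, LUT) per design, copied from the resource table above.
resources = {
    "F_tr,2(4^2,3^2)": {"reorganized": (36, 145), "grouped": (25, 265), "unified": (25, 264)},
    "F_tr,2(4^2,4^2)": {"reorganized": (64, 320), "grouped": (36, 873), "unified": (36, 608)},
    "F_tr,2(4^2,5^2)": {"reorganized": (100, 433), "grouped": (49, 1133), "unified": (49, 1111)},
}


def reduction_percent(old, new):
    """Relative saving of `new` with respect to `old`, in percent."""
    return 100.0 * (old - new) / old


for name, r in resources.items():
    dsp_saving = reduction_percent(r["reorganized"][0], r["unified"][0])
    lut_saving = reduction_percent(r["grouped"][1], r["unified"][1])
    print(f"{name}: DSP -{dsp_saving:.1f}% vs reorganized, "
          f"LUT -{lut_saving:.1f}% vs grouped Winograd")
```

Both Winograd variants use the same number of DSPs; per the table, the difference between them lies in LUT usage, which is largest for the 4×4 kernel. Both Winograd variants use more LUTs than the reorganized design, so the DSP saving is paid for in logic.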

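	Reorganized transposed convolution	Grouped Winograd transposed convolution	Unified-transformation-matrix Winograd transposed convolution
DSP power	0.125	0.089	0.085
Signal power	0.048	0.066	0.051
Logic power	0.015	0.030	0.017
Total power	0.188	0.185	0.153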
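The power table above can be checked for internal consistency with a few lines of arithmetic. The sketch below sums the three components per design, compares them with the listed totals, and derives the relative saving of the unified-matrix design. Values are copied from the table (which gives no unit, so none is assumed here); the variable names are illustrative.

```python
import math

# (DSP, signal, logic, listed total) per design, copied from the power table above.
power = {
    "reorganized": (0.125, 0.048, 0.015, 0.188),
    "grouped": (0.089, 0.066, 0.030, 0.185),
    "unified": (0.085, 0.051, 0.017, 0.153),
}

# Each listed total should equal the sum of its three components.
totals_consistent = all(
    math.isclose(dsp + signal + logic, total, abs_tol=1e-9)
    for dsp, signal, logic, total in power.values()
)
print("component sums match listed totals:", totals_consistent)

unified_total = power["unified"][3]
savings = {
    name: 100.0 * (values[3] - unified_total) / values[3]
    for name, values in power.items()
    if name != "unified"
}
for name, pct in savings.items():
    print(f"unified design uses {pct:.1f}% less total power than {name}")
```

The grouped Winograd design saves DSP power relative to the reorganized design but gives most of it back in signal and logic power; the unified design keeps the DSP saving while holding signal and logic power close to the reorganized baseline.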

Layer	Input size	Kernel size	Output size	Stride
TransConv-1	4×4×512	4×4×512×256	8×8×256	2
TransConv-2	8×8×256	4×4×256×128	16×16×128	2
TransConv-3	16×16×128	4×4×128×64	32×32×64	2
TransConv-4	32×32×64	4×4×64×3	64×64×3	2
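The layer shapes in the table above follow the usual output-size relation for transposed convolution, out = (in − 1) × stride − 2 × padding + kernel. The table does not state the padding; the sketch below assumes padding = 1 with no output padding, which is the value that reproduces the listed doubling with a 4×4 kernel and stride 2. The multiply-accumulate count is a rough estimate that counts every input-kernel product, including those later cropped by the padding; the helper names are illustrative.

```python
KERNEL, STRIDE = 4, 2
PADDING = 1  # assumption: not stated in the table, chosen to match its output sizes


def transposed_conv_out_size(in_size, kernel=KERNEL, stride=STRIDE, padding=PADDING):
    """Spatial output size of a transposed convolution (no output padding)."""
    return (in_size - 1) * stride - 2 * padding + kernel


def layer_macs(in_size, c_in, c_out, kernel=KERNEL):
    """Every input pixel of every input channel meets every kernel weight
    of every output channel once."""
    return in_size * in_size * c_in * kernel * kernel * c_out


# (name, input size, input channels, output channels, listed output size)
layers = [
    ("TransConv-1", 4, 512, 256, 8),
    ("TransConv-2", 8, 256, 128, 16),
    ("TransConv-3", 16, 128, 64, 32),
    ("TransConv-4", 32, 64, 3, 64),
]

for name, in_size, c_in, c_out, listed_out in layers:
    out = transposed_conv_out_size(in_size)
    print(f"{name}: {in_size}x{in_size}x{c_in} -> {out}x{out}x{c_out} "
          f"(listed {listed_out}), MACs = {layer_macs(in_size, c_in, c_out):,}")
```

Under this counting the first three layers cost the same number of multiply-accumulates, because each halving of the channel counts is offset by a fourfold increase in spatial positions.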

	Ref. [15]	Ref. [16]	Ref. [18]	Ref. [20]	This work
Acceleration algorithm				Winograd	Winograd
Operating frequency (MHz)	100	200	220	200	250
Throughput (GOPS)	2.6	29.0	94.3	639.2	241.0
Number of DSPs	220	900	900	2520	840
DSP efficiency (GOPS/DSP)	0.012	0.032	0.104	0.254	0.287
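The DSP-efficiency row of the comparison table above is throughput divided by DSP count, and the ratios between the last column and the others appear to be the source of the 1.13 to 23.92 range quoted in the abstract. The sketch below recomputes both from the table's own figures; the dictionary layout and names are illustrative.

```python
# (throughput in GOPS, DSP count, listed DSP efficiency), copied from the table above.
designs = {
    "Ref. [15]": (2.6, 220, 0.012),
    "Ref. [16]": (29.0, 900, 0.032),
    "Ref. [18]": (94.3, 900, 0.104),
    "Ref. [20]": (639.2, 2520, 0.254),
    "This work": (241.0, 840, 0.287),
}

# Recompute efficiency and compare with the listed three-decimal values.
recomputed = {name: gops / dsps for name, (gops, dsps, _) in designs.items()}
for name, (_, _, listed) in designs.items():
    print(f"{name}: recomputed {recomputed[name]:.4f}, listed {listed}")

# Ratio of this work's listed efficiency to each reference design.
ours = designs["This work"][2]
ratios = {
    name: round(ours / listed, 2)
    for name, (_, _, listed) in designs.items()
    if name != "This work"
}
print("efficiency ratios:", ratios)
print("range:", min(ratios.values()), "to", max(ratios.values()))
```

Note that Ref. [20] reports a higher absolute throughput than this work while using three times as many DSPs, so the comparison favours this work on efficiency per DSP, not on raw throughput.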