Journal of Xidian University ›› 2023, Vol. 50 ›› Issue (2): 92-100. doi: 10.19665/j.issn1001-2400.2023.02.010

• Information and Communication Engineering •

GPGPU cache bypassing system for 2D and 3D convolution

JIA Shiwei1, ZHANG Yuming1, QIN Xiang2, SUN Chenglu2, TIAN Ze3

  1. School of Microelectronics, Xidian University, Xi'an 710071, China
    2. Department of Integrated Circuit R&D, Xiangteng Microelectronics Corporation, Xi'an 710068, China
    3. Key Laboratory of Aviation Science and Technology on Integrated Circuit and Micro-System Design, China Institute of Aeronautical Computing Technology, Xi'an 710068, China
  • Received: 2022-05-23  Online: 2023-04-20  Published: 2023-05-12
  • About the authors: JIA Shiwei (1993-), male, Ph.D. candidate at Xidian University, E-mail: 18111210124@xidian.edu.cn; ZHANG Yuming (1965-), male, professor, E-mail: zhangym@xidian.edu.cn; QIN Xiang (1991-), male, engineer, E-mail: 18991316149@189.cn; SUN Chenglu (1991-), female, engineer, E-mail: 101449175@qq.com; TIAN Ze (1965-), male, research fellow, E-mail: tarmz@126.com
  • Supported by: the Equipment Joint Fund (6141B05200305)

Abstract:

As the core acceleration platform for convolutional neural networks, the general-purpose graphics processing unit (GPGPU) determines, through its performance on two-dimensional (2D) and three-dimensional (3D) convolution, how effectively such networks can be applied to real-time target recognition and detection. However, limited by the function of its inherent cache system, the current GPGPU architecture cannot accelerate 2D and 3D convolution efficiently. To address this problem, a dynamic L1 data cache (L1D cache) bypassing design is proposed. First, a data structure is defined that dynamically reflects the cache-access characteristics of a memory instruction, and a memory-access-feature record table is built on it to record the execution state of different memory-access instructions when they request the cache. Second, a warp scheduling strategy that prioritizes one thread block is adopted to speed up the sampling of this access state. Then, based on the sampled state, a bypassing decision is made for the memory requests issued under each PC, and requests for low-locality data dynamically bypass the L1D cache. The L1D cache space is thus reserved for high-locality data, the memory-access stall cycles of 2D and 3D convolution are reduced, and the memory-access efficiency of 2D and 3D convolution on the GPGPU is improved. Experimental results show that, compared with the original architecture, the proposed design brings performance improvements of about 2.16% for 2D convolution and about 19.79% for 3D convolution, demonstrating its effectiveness and practicality.
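The abstract describes the mechanism only at a high level, so the following is a minimal C++ sketch rather than the paper's implementation: a PC-indexed memory-access-feature record table whose entries are updated during a sampling phase and then queried to decide which requests bypass the L1D cache. The field names, the 64-access sampling window, the 25% hit-rate cutoff, and the identifiers (`FeatureTable`, `record`, `should_bypass`) are assumptions introduced here for illustration; the thread-block-prioritizing warp scheduling that accelerates sampling is not modeled.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Hypothetical per-PC record of how one memory instruction behaves in the L1D
// cache during sampling (field names are illustrative, not from the paper).
struct AccessFeature {
    uint64_t accesses = 0;   // L1D lookups issued by this PC
    uint64_t hits     = 0;   // lookups that hit in L1D
    bool     sampled  = false;  // set once the sampling window is filled
    bool     bypass   = false;  // final decision: route this PC around L1D
};

// Memory-access-feature record table, indexed by the PC of the memory instruction.
class FeatureTable {
public:
    // Called for every L1D request observed during the sampling phase.
    void record(uint64_t pc, bool hit) {
        AccessFeature &f = table_[pc];
        ++f.accesses;
        if (hit) ++f.hits;
        if (!f.sampled && f.accesses >= kSampleWindow) {
            f.sampled = true;
            // Low-locality heuristic: a PC whose hit rate stays below the
            // threshold is assumed to pollute L1D and is marked for bypassing.
            f.bypass = (static_cast<double>(f.hits) / f.accesses) < kHitRateThreshold;
        }
    }

    // Queried at issue time: should this request skip L1D and go straight to L2?
    bool should_bypass(uint64_t pc) const {
        auto it = table_.find(pc);
        return it != table_.end() && it->second.sampled && it->second.bypass;
    }

private:
    static constexpr uint64_t kSampleWindow     = 64;    // illustrative sampling length
    static constexpr double   kHitRateThreshold = 0.25;  // illustrative locality cutoff
    std::unordered_map<uint64_t, AccessFeature> table_;
};

int main() {
    FeatureTable table;
    // Toy trace: PC 0x100 streams low-locality data and never hits,
    // while PC 0x200 reuses cached lines and always hits.
    for (int i = 0; i < 64; ++i) {
        table.record(0x100, /*hit=*/false);
        table.record(0x200, /*hit=*/true);
    }
    // After sampling, 0x100 is routed around L1D while 0x200 keeps using it.
    std::printf("bypass PC 0x100: %d\n", static_cast<int>(table.should_bypass(0x100)));
    std::printf("bypass PC 0x200: %d\n", static_cast<int>(table.should_bypass(0x200)));
    return 0;
}
```

In a cycle-level GPGPU simulator, `record` could be driven by the L1D controller on every lookup during the sampling phase, and `should_bypass` consulted when a load is issued so that low-locality requests are sent directly to the lower levels of the memory hierarchy.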

Key words: convolution, GPGPU, memory system, cache bypassing

CLC number: TN4