Design of Matrix Decomposer Based on Improved QR Algorithm

doi:10.16180/j.cnki.issn1007-7820.2022.11.004

Abstract

Matrix decomposition is one of the key operations in matrix inversion and is widely used in neural networks, digital signal processing, wireless communication, and other fields. Because traditional decomposition algorithms are poorly suited to hardware implementation, this paper proposes a one-dimensional linear matrix decomposition structure based on a column-wise optimized QR decomposition algorithm and completes its ASIC design. The decomposer supports matrices of order 2 to 32 and runs at a clock frequency of 700 MHz in a TSMC 28 nm process. Simulation and FPGA test results show that the relative error between the decomposer's results and those of MATLAB is less than 10^-12. For matrices of order 12 and above, the decomposer achieves a 2.3× speedup in operation cycles over the traditional one-dimensional linear structure. For order-32 matrices, it achieves a 22.8× speedup in operation cycles over the NVIDIA RTX2070.

Key words: matrix decomposition, QR decomposition, Givens rotation, Column-wise Givens Rotation, FPGA implementation, hardware acceleration, one-dimensional linear structure, ASIC implementation
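
The abstract builds on QR decomposition by Givens rotations. As a point of reference, the following is a minimal software sketch of the classical Givens QR, which zeroes one sub-diagonal element per rotation, written in plain Python. It is not the column-wise variant or the one-dimensional hardware structure proposed in the paper; the names `givens_qr` and `matmul` are illustrative assumptions.

```python
import math
import random


def givens_qr(a):
    """Factor a square matrix (list of rows) as a = q * r.

    Classical Givens rotations: each step rotates two adjacent rows so that
    one sub-diagonal element becomes zero. q is orthogonal, r is upper
    triangular.
    """
    n = len(a)
    r = [[float(v) for v in row] for row in a]          # working copy, becomes R
    qt = [[1.0 if i == j else 0.0 for j in range(n)]    # accumulates Q^T
          for i in range(n)]
    for j in range(n - 1):                 # column being cleared
        for i in range(n - 1, j, -1):      # zero r[i][j] using row i-1, bottom-up
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue                   # both entries already zero
            c, s = x / h, y / h            # cosine and sine of the rotation
            for m in (r, qt):              # apply the same rotation to R and Q^T
                for k in range(n):
                    u, v = m[i - 1][k], m[i][k]
                    m[i - 1][k] = c * u + s * v
                    m[i][k] = c * v - s * u
            r[i][j] = 0.0                  # remove rounding residue
    q = [list(col) for col in zip(*qt)]    # Q is the transpose of Q^T
    return q, r


def matmul(x, y):
    """Plain triple-loop product of two square matrices."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Usage: factor a random 8x8 matrix and check that q * r reproduces it.
random.seed(1)
n = 8
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
q, r = givens_qr(a)
qr = matmul(q, r)
err = max(abs(qr[i][j] - a[i][j]) for i in range(n) for j in range(n))
print(f"max |QR - A| = {err:.3e}")
```

Rotating adjacent rows from the bottom of each column upward keeps the zeros already created in earlier columns, which is why a single pass over the sub-diagonal elements suffices.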

CLC number: TN47

CHEN Wenjie, SONG Yukun, ZHANG Duoli. Design of Matrix Decomposer Based on Improved QR Algorithm[J]. Electronic Science and Technology, 2022, 35(11): 21-28.

Matrix order	1-D array resources	2-D array resources
2	27	20
4		64
6		132
8		224
16		832
32		3 200

Resource	Used	Total	Utilization /%
LUT	32 460	2 532 960	1.28
LUTRAM	1 226	459 360	0.27
BRAM	16	2 520	0.63
DSP	135	2 880	4.69
FF	34 801	5 065 920	0.69

Matrix order	Value range	Maximum relative error	Average relative error
4	[-10⁰,10⁰]	1.546 0×10^-14	3.524 1×10^-15
	[-10¹⁰,10¹⁰]	3.869 2×10^-14	2.262 7×10^-15
	[-10²⁰,10²⁰]	7.484 0×10^-13	2.115 4×10^-14
8	[-10⁰,10⁰]	8.363 6×10^-11	2.651 3×10^-12
	[-10¹⁰,10¹⁰]	2.239 4×10^-13	2.920 9×10^-14
	[-10²⁰,10²⁰]	1.210 5×10^-10	8.332 9×10^-12
16	[-10⁰,10⁰]	1.925 8×10^-12	3.588 2×10^-14
	[-10¹⁰,10¹⁰]	5.480 0×10^-12	1.453 7×10^-13
	[-10²⁰,10²⁰]	2.793 3×10^-10	1.870 3×10^-12
32	[-10⁰,10⁰]	5.412 3×10^-12	3.942 7×10^-14
	[-10¹⁰,10¹⁰]	1.562 3×10^-11	4.494 1×10^-14
	[-10²⁰,10²⁰]	3.895 3×10^-11	1.615 8×10^-13
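
The table above reports a maximum and an average relative error per matrix order and value range. The page does not state how the error is defined, so the sketch below assumes an element-wise definition, |result - reference| / |reference|, with zero reference entries skipped. The function name `relative_errors` and the sample matrices are hypothetical.

```python
def relative_errors(reference, result):
    """Return (max, mean) element-wise relative error of result vs reference.

    Assumed definition: |result - reference| / |reference|, skipping entries
    whose reference value is exactly zero.
    """
    errs = [
        abs(x - ref) / abs(ref)
        for ref_row, res_row in zip(reference, result)
        for ref, x in zip(ref_row, res_row)
        if ref != 0.0
    ]
    return max(errs), sum(errs) / len(errs)


# Usage: a made-up 2x2 reference and a slightly perturbed result.
reference = [[2.0, -4.0], [0.0, 8.0]]
result = [[2.0, -4.0 * (1 + 1e-13)], [0.0, 8.0 * (1 - 3e-13)]]
worst, mean = relative_errors(reference, result)
print(f"max = {worst:.3e}, mean = {mean:.3e}")
```

A norm-based definition, for example the Frobenius norm of the difference divided by that of the reference, would be an equally plausible reading and would give different numbers.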

Matrix size	Traditional 1-D	2-D array	This work
7×7	1 512	954	976
12×12	4 290	1 412	1 854

	Matrix order	Area /mm²	Operating frequency /GHz	Decomposition time /s	Decomposition cycles
RTX2070	16	445.00	1.62	6.83×10^-8	110 665
Decomposer	16	2.25	0.70	35.88×10^-10	2 512
RTX2070	32	445.00	1.62	1.25×10^-7	203 198
Decomposer	32	2.25	0.70	1.27×10^-8	8 894
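
The speedup figures quoted in the abstract can be reproduced from the cycle counts in the two comparison tables above. The sketch below assumes that speedup means the ratio of baseline cycles to the decomposer's cycles; the helper names `speedup` and `implied_seconds` are illustrative.

```python
# Cycle counts copied from the two comparison tables on this page.
CYCLES_VS_1D = {"7x7": (1512, 976), "12x12": (4290, 1854)}   # (traditional 1-D, this work)
CYCLES_VS_GPU = {16: (110665, 2512), 32: (203198, 8894)}      # (RTX2070, decomposer)


def speedup(baseline_cycles, new_cycles):
    """Cycle-count speedup: how many times fewer cycles the new design needs."""
    return baseline_cycles / new_cycles


def implied_seconds(cycles, freq_ghz):
    """Run time implied by a cycle count at a given clock frequency in GHz."""
    return cycles / (freq_ghz * 1e9)


for size, (base, new) in CYCLES_VS_1D.items():
    print(f"{size}: {speedup(base, new):.2f}x vs traditional 1-D")
for order, (gpu, dec) in CYCLES_VS_GPU.items():
    print(f"order {order}: {speedup(gpu, dec):.2f}x vs RTX2070")
print(f"decomposer, order 32: {implied_seconds(8894, 0.70):.3e} s")
```

The 12×12 ratio comes out at about 2.31 and the order-32 ratio at about 22.85, in line with the 2.3× and 22.8× stated in the abstract. Dividing 8 894 cycles by 0.70 GHz gives roughly 1.27×10^-5 s, whereas the time column lists 1.27×10^-8, so the ratios here are taken from the cycle counts rather than from the tabulated times.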