Fast 3D Hand Keypoint Detection Algorithm Based on Anchors

doi:10.16180/j.cnki.issn1007-7820.2024.04.011

Abstract

In human-robot collaboration tasks, hand keypoint detection provides target point coordinates for the robotic arm. A2J (Anchor-to-Joint) is a representative anchor-based keypoint detection method. It takes a depth map as input and achieves good detection accuracy, but its ability to capture global features is limited. This paper designs a Global-Local Feature Fusion (GLFF) module that fuses the shallow and deep features of the backbone network. To increase detection speed, the A2J backbone is replaced with a modified ShuffleNetv2 in which the 3×3 depthwise separable convolutions are replaced with 5×5 depthwise separable convolutions; this enlarges the receptive field and improves the backbone's ability to extract global features. An Efficient Channel Attention (ECA) module is introduced into the anchor weight estimation branch so that the network pays more attention to important anchors. Training and testing on the mainstream ICVL and NYU datasets show that, compared with A2J, the proposed method reduces the mean error by 0.09 mm and 0.15 mm, respectively. It reaches a detection rate of 151 frame·s^-1 on a GTX1080Ti graphics card, which meets the real-time requirement of human-robot collaboration tasks.

Keywords: human-robot collaboration, 3D hand keypoint detection, anchor point, depth map, global-local feature fusion, ShuffleNetv2, depthwise separable convolution, efficient channel attention
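
The anchor-based idea behind A2J can be illustrated with a small sketch. It is not the authors' implementation: it assumes the general A2J formulation from [7], in which every anchor predicts an in-plane offset, a depth value and a response score, and a joint is estimated as the softmax-weighted sum of the anchors' votes. The function names and all numbers below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_joint(anchors, offsets, depths, responses):
    """Estimate one joint as a response-weighted vote over all anchors.

    anchors:   list of (x, y) anchor positions on the image plane
    offsets:   list of (dx, dy) in-plane offsets predicted per anchor
    depths:    list of depth values predicted per anchor
    responses: list of raw anchor scores (higher = more informative anchor)
    """
    weights = softmax(responses)  # anchor weights sum to 1
    x = sum(w * (a[0] + o[0]) for w, a, o in zip(weights, anchors, offsets))
    y = sum(w * (a[1] + o[1]) for w, a, o in zip(weights, anchors, offsets))
    z = sum(w * d for w, d in zip(weights, depths))
    return x, y, z


# Toy example (made-up numbers): three anchors that all point at (2, 2).
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
offsets = [(2.0, 2.0), (-2.0, 2.0), (2.0, -2.0)]
depths = [300.0, 300.0, 300.0]
responses = [0.5, 2.0, -1.0]

joint = aggregate_joint(anchors, offsets, depths, responses)
print(joint)
```

Because the weights come from a softmax, anchors with a low response contribute little to the estimate; an attention module in the weight estimation branch, as described in the abstract, acts on exactly these response scores.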

CLC number: TP391.41

QIN Xiaofei, HE Wen, BAN Dongxian, GUO Hongyu, YU Jing. Research on Fast 3D Hand Keypoint Detection Algorithm Based on Anchor[J]. Electronic Science and Technology, 2024, 37(4): 77-86.

Figures and tables (15): Figures 1-11 and Tables 1-4.

References (31)

[1]	Wang Liping, Wang Cheng, Qiu Feiyue, et al. Survey of 3D hand pose estimation methods using depth map[J]. Journal of Chinese Computer Systems, 2021, 42(6):1227-1235. (in Chinese)
[2]	Zhang Zhe. Research on 3D hand key point detection based on depth image[D]. Beijing: Beijing Jiaotong University, 2021:52-55. (in Chinese)
[3]	Wu Haibo, Wang Chen, Cui Yu. Research on depth image preprocessing algorithm[J]. Electronic Science and Technology, 2021, 34(11):31-36. (in Chinese)
[4]	Sinha A, Choi C, Ramani K. Deephand:Robust hand pose estimation by completing a matrix imputed with deep features[C]. Las Vegas: IEEE Conference on Computer Vision and Pattern Recognition, 2016:4150-4158.
[5]	Guo H, Wang G, Chen X, et al. Region ensemble network:Improving convolutional network for hand pose estimation[C]. Beijing: IEEE International Conference on Image Processing, 2017:4512-4516.
[6]	Ge L H, Liang H, Thalmann D, et al. 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images[C]. Honolulu: IEEE Conference on Computer Vision and Pattern Recognition, 2017:5679-5688.
[7]	Xiong F, Zhang B, Xiao Y, et al. A2J:Anchor-to-Joint regression network for 3D articulated pose estimation from a single depth image[C]. Seoul: IEEE/CVF International Conference on Computer Vision, 2019:110-117.
[8]	Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]. Honolulu: IEEE Conference on Computer Vision and Pattern Recognition, 2017:936-944.
[9]	Fu C Y, Liu W, Ranga A, et al. Dssd:Deconvolutional single shot detector[EB/OL].(2017-1-23) [2022-12-26] https://arxiv.org/abs/1701.06659.
[10]	Li Y, Li J, Lin W, et al. Tiny-DSOD:Lightweight object detection for resource-restricted usages[EB/OL].(2018-7-29) [2022-12-26] https://arxiv.org/abs/1807.11013.
[11]	Zhang X, Zhou X, Lin M, et al. Shufflenet:An extremely efficient convolutional neural network for mobile devices[C]. Salt Lake City: IEEE Conference on Computer Vision and Pattern Recognition, 2018:6848-6856.
[12]	Howard A G, Zhu M, Chen B, et al. Mobilenets:Efficient convolutional neural networks for mobile vision applications[EB/OL].(2017-4-17) [2022-12-26] https://arxiv.org/abs/1704.04861.
[13]	Ma N, Zhang X, Zheng H T, et al. Shufflenetv2:Practical guidelines for efficient CNN architecture design[C]. Munich: The European Conference on Computer Vision, 2018:122-138.
[14]	He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]. Las Vegas: IEEE Conference on Computer Vision and Pattern Recognition, 2016:770-778.
[15]	Wang Q, Wu B, Zhu P, et al. ECA-Net:Efficient channel attention for deep convolutional neural networks[C]. Long Beach: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019:11534-11542.
[16]	Tang D, Jin C H, Tejani A, et al. Latent regression forest:Structured estimation of 3D articulated hand posture[C]. Columbus: IEEE Conference on Computer Vision and Pattern Recognition, 2014:3786-3793.
[17]	Tompson J, Stein M, Lecun Y, et al. Real-time continuous pose recovery of human hands using convolutional networks[J]. ACM Transactions on Graphics, 2014, 33(5):1-10.
[18]	Qin Z, Li Z, Zhang Z, et al. ThunderNet:Towards real-time generic object detection on mobile devices[C]. Seoul: IEEE/CVF International Conference on Computer Vision, 2019:6717-6726.
[19]	Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]. Salt Lake City: IEEE Conference on Computer Vision and Pattern Recognition, 2018:7132-7141.
[20]	Ren S, He K, Girshick R, et al. Faster R-CNN:Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2017, 39(6):1137-1149.
[21]	Zhou X, Wan Q, Zhang W, et al. Model-based deep hand pose estimation[EB/OL].(2016-06-22) [2022-12-26] https://arxiv.org/abs/1606.06854.
[22]	Oberweger M, Lepetit V. Deepprior++:Improving fast and accurate 3D hand pose estimation[C]. Venice: IEEE International Conference on Computer Vision Workshops, 2017:585-594.
[23]	Guo H, Wang G, Chen X, et al. Towards good practices for deep 3D hand pose estimation[EB/OL].(2017-07-23) [2022-12-26] https://arxiv.org/abs/1707.07248.
[24]	Wan C, Probst T, Van Gool L, et al. Dense 3D regression for hand pose estimation[C]. Salt Lake City: IEEE Conference on Computer Vision and Pattern Recognition, 2018:5147-5156.
[25]	Ge L, Cai Y, Weng J, et al. Hand pointnet:3D hand pose estimation using point sets[C]. Salt Lake City: IEEE Conference on Computer Vision and Pattern Recognition, 2018:8417-8426.
[26]	Chen X, Wang G, Guo H, et al. Pose guided structured region ensemble network for cascaded hand pose estimation[J]. Neurocomputing, 2020, 395(5):138-149. doi: 10.1016/j.neucom.2018.06.097
[27]	Du K, Lin X, Sun Y, et al. Crossinfonet:Multi-task information sharing based hand pose estimation[C]. Long Beach: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019:9888-9897.
[28]	Ge L, Ren Z, Yuan J. Point-to-Point regression pointnet for 3D hand pose estimation[C]. Munich: European Conference on Computer Vision, 2018:489-505.
[29]	Moon G, Chang J Y, Lee K M. V2V-posenet:Voxel-to-Voxel prediction network for accurate 3D hand and human pose estimation from a single depth map[C]. Salt Lake City: IEEE Conference on Computer Vision and Pattern Recognition, 2018:5079-5088.
[30]	Menges B, Sarrey M, Henaff P. Implementation, risk assessment and safety human/robot interaction of collaborative robot UR10[C]. Nancy: International Conference on Safety of Industrial Automated Systems, 2018:198-203.
[31]	Quigley M, Conley K, Gerkey B, et al. ROS:An open-source robot operating system[C]. Kobe: ICRA Workshop on Open Source Software, 2009:239-244.

Method	Mean error on ICVL/mm	Mean error on NYU/mm	Parameter size/MB	Frame rate/frame·s^-1
DeepModel^[21]	11.56	17.04	-	-
DeepPrior++^[22]	8.10	12.24	-	30.0
REN-4x6x6^[23]	7.63	13.39	-	-
REN-9x6x6^[5]	7.31	12.69	-	-
DenseReg^[24]	7.30	10.20	5.8	28.0
HandPointNet^[25]	6.94	10.54	2.5	48.0
Pose-REN^[26]	6.79	11.81	-	-
CrossInfoNet^[27]	6.73	10.08	23.8	124.5
Point-to-Point^[28]	6.30	9.10	4.3	41.8
V2V-PoseNet^[29]	6.28	8.42	457.5	3.5
A2J^[7]	6.46	8.61	44.7	105.0
Ours	6.37	8.46	22.4	151.2
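
The improvements quoted in the abstract can be checked against the A2J row and the last row (the proposed method) of the table above. The short calculation below only uses values copied from those two rows; the variable names are illustrative.

```python
# Values copied from the A2J row and the proposed-method row of the table.
a2j = {"icvl": 6.46, "nyu": 8.61, "params_mb": 44.7, "fps": 105.0}
ours = {"icvl": 6.37, "nyu": 8.46, "params_mb": 22.4, "fps": 151.2}

# Absolute reduction in mean error (mm) on each dataset.
icvl_gain = round(a2j["icvl"] - ours["icvl"], 2)
nyu_gain = round(a2j["nyu"] - ours["nyu"], 2)

# Relative speed and model size of the proposed method versus A2J.
speedup = round(ours["fps"] / a2j["fps"], 2)
size_ratio = round(ours["params_mb"] / a2j["params_mb"], 2)

print(icvl_gain, nyu_gain, speedup, size_ratio)
```

The error reductions come out at 0.09 mm and 0.15 mm, matching the abstract, while the proposed method runs about 1.44 times as fast as A2J with roughly half the parameter size.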

Joint (mean error/mm)	Pose-REN^[26]	DenseReg^[24]	V2V-PoseNet^[29]	CrossInfoNet^[27]	A2J^[7]	Ours
Palm	7.4	7.6	6.3	6.9	7.3	6.8
Wrist 1	13.9	13.9	9.9	10.3	9.3	9.6
Wrist 2	11.5	16.1	8.9	9.2	9.3	8.9
Thumb root	10.7	9.4	6.9	8.7	7.9	7.3
Thumb mid	11.3	9.6	7.8	8.8	8.2	8.3
Thumb tip	13.9	11.6	9.8	11.5	10.6	10.5
Index mid	10.9	9.0	7.4	9.7	8.4	7.9
Index tip	16.8	12.6	11.0	14.0	10.8	10.2
Middle mid	8.7	7.7	6.8	8.2	7.4	6.7
Middle tip	14.5	10.4	10.1	12.8	9.6	9.0
Ring mid	8.2	7.7	6.7	7.9	6.9	6.5
Ring tip	13.4	9.8	9.8	12.3	8.7	8.4
Little mid	9.4	8.0	7.0	8.4	7.1	7.1
Little tip	14.9	9.8	9.5	12.5	9.2	9.1
Mean	11.8	10.2	8.4	10.1	8.6	8.5

Joint (mean error/mm)	Pose-REN^[26]	DenseReg^[24]	V2V-PoseNet^[29]	CrossInfoNet^[27]	A2J^[7]	Ours
Palm	9.1	5.1	6.0	5.4	5.7	5.3
Thumb root	9.3	6.4	6.2	5.8	6.0	5.9
Thumb mid	11.3	6.6	4.9	5.9	6.3	5.4
Thumb tip	12.7	8.1	6.2	7.1	6.9	6.3
Index root	9.2	5.7	6.2	6.3	7.0	5.9
Index mid	11.7	7.7	5.7	6.1	5.7	5.2
Index tip	13.7	10.9	7.0	8.3	6.3	6.2
Middle root	9.6	5.3	5.6	5.1	5.5	5.3
Middle mid	14.7	8.3	6.4	7.0	6.4	6.0
Middle tip	16.3	10.8	7.8	9.4	7.6	7.5
Ring root	5.6	5.6	5.7	5.4	5.5	5.3
Ring mid	13.8	7.9	6.8	7.2	6.2	6.3
Ring tip	15.6	11.0	7.5	9.5	7.6	7.7
Little root	6.4	6.4	5.9	6.0	7.2	6.7
Little mid	11.3	6.9	6.0	5.6	6.3	6.5
Little tip	14.9	9.4	7.4	7.6	7.3	6.9
Mean	11.6	7.6	6.3	6.7	6.5	6.4

Method	Parameter size/MB	Frame rate/frame·s^-1	Mean error/mm
Baseline(A2J^[7])	42.8	105	6.46
ShuffleNet-A2J	22.7	154	7.13
Improved_ShuffleNet-A2J+GLFF	23.1	147	6.59
Improved_ShuffleNet-A2J+ECA	22.2	160	6.85
Improved_ShuffleNet-A2J+ECA+GLFF	22.4	151	6.37
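
Two of the rows above add an ECA module. As a rough illustration of what this kind of channel attention computes, the sketch below follows the general ECA-Net design [15] (global average pooling, a 1D convolution across channels, a sigmoid gate, then channel rescaling). It is not the network used in this paper: the function name, the toy feature map and the fixed kernel are assumptions, and in a trained network the kernel weights are learned.

```python
import math


def eca(feature_map, kernel):
    """Apply ECA-style channel attention to a feature map.

    feature_map: list of channels, each a 2D list (H x W) of floats
    kernel:      odd-length list of 1D convolution weights shared by all channels
    """
    # 1) Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]

    # 2) Local cross-channel interaction: 1D convolution over the channel
    #    axis with zero padding, so there is one output value per channel.
    half = len(kernel) // 2
    n = len(pooled)
    mixed = []
    for c in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = c + k - half
            if 0 <= idx < n:
                acc += w * pooled[idx]
        mixed.append(acc)

    # 3) Gate: a sigmoid maps each value to a weight in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]

    # 4) Rescale every channel by its gate.
    scaled = [[[g * v for v in row] for row in ch]
              for g, ch in zip(gates, feature_map)]
    return scaled, gates


# Toy 3-channel, 2x2 feature map and an identity-like kernel (hypothetical).
feature_map = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
scaled, gates = eca(feature_map, kernel=[0.0, 1.0, 0.0])
print(gates)
```

The module adds only as many parameters as the kernel has entries, which is consistent with the small change in parameter size between the rows with and without ECA.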