Method to recognize human action by using the convolutional block attention mechanism

doi:10.19665/j.issn1001-2400.2022.04.017

Abstract

Attention models used in action recognition tend to emphasize the correlation between channels when attending to regions of interest in an image sequence and to neglect the spatial location of the features, so they cannot accurately identify the dynamic regions in a video. To address this, an action recognition method based on the attention mechanism and the convolutional long short-term memory network (ConvLSTM) is proposed. First, a ResNet-50 network extracts the feature representation of each video frame. A convolutional block attention module then applies channel attention to allocate resources among the convolutional channels of the feature map, followed by spatial attention to analyze the spatial relationships among the salient elements of the feature maps. This adjusts the weights of the convolutional feature maps and suppresses or reduces the influence of regions unrelated to the action. Second, the long short-term memory network (LSTM) loses the spatial structure of the image frames when it processes spatiotemporal data, whereas ConvLSTM uses convolution to capture the spatial correlations in the image and therefore gives a more complete representation of the video. ConvLSTM is thus used to model the sequence information of the features and to produce frame-level predictions, and the predictions of all frames are combined to determine the class of the video. Experimental results on three public datasets (YouTube, UCF101, and HMDB51) show that the proposed method effectively highlights the key regions in a video and improves the accuracy of action recognition to a certain extent.

Key words: machine vision, action recognition, attention mechanism, region of interest, convolutional LSTM
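The attention step and the final voting step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it follows the general structure of the convolutional block attention module (channel attention from a shared two-layer MLP over average- and max-pooled descriptors, then spatial attention from a convolution over channel-wise average and max maps). The weights `w1`, `w2`, and `kernel` are random placeholders, the feature map size is arbitrary, and combining frame-level predictions by simple averaging is an assumption, since the abstract only says that the predictions are combined.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Per-channel weights in (0, 1) for a feature map of shape (C, H, W)."""
    avg = feat.mean(axis=(1, 2))   # (C,) average-pooled descriptor
    mx = feat.max(axis=(1, 2))     # (C,) max-pooled descriptor

    def mlp(v):
        # Shared two-layer MLP with a ReLU in between.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))

def spatial_attention(feat, kernel):
    """Per-location weights in (0, 1); kernel has shape (2, k, k) with odd k."""
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = feat.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)

def cbam(feat, w1, w2, kernel):
    """Channel attention first, then spatial attention on the reweighted map."""
    feat = feat * channel_attention(feat, w1, w2)[:, None, None]
    return feat * spatial_attention(feat, kernel)[None, :, :]

def video_prediction(frame_probs):
    """Assumed fusion rule: average frame-level probabilities, take the argmax."""
    return int(np.argmax(np.mean(frame_probs, axis=0)))

# Placeholder sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, H, W, r = 8, 7, 7, 2
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
refined = cbam(feat, w1, w2, kernel)
```

Because both attention maps lie in (0, 1), the refined feature map has the same shape as the input and every activation is scaled down by a learned amount, which is how regions unrelated to the action are suppressed.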

CLC number:

TP391.4

GAO Deyong, KANG Zibing, WANG Song, WANG Yangping. Method to recognize human action by using the convolutional block attention mechanism[J]. Journal of Xidian University, 2022, 49(4): 144-155.

Parameter	Value
Convolution kernel size	3×3
Learning rate	10^-3
Weight decay coefficient	10^-5
Dropout rate	0.9

Model structure	YouTube	UCF101	HMDB51
Without attention mechanism	89.97%	75.64%	40.85%
With attention mechanism	94.01%	80.52%	46.07%

CNN model	Accuracy/%
GoogleNet	90.2
VGG-16	89.6
ResNet-50	91.1

HMDB51 action class	Proposed method	Soft attention mechanism	UCF101 action class	Proposed method	Soft attention mechanism
kick_ball	54.96	41.26	Surfing	83.52	67.45
it	55.46	42.06	BrushingTeeth	70.34	54.46
cartwheel	33.47	20.36	Typing	86.58	72.52
fencing	63.07	50.56	Skijet	90.08	76.36
eat	57.98	46.98	Shotput	82.52	69.48
swing_baseball	18.06	7.21	Archery	91.55	79.72
ride_bike	45.96	35.48	Bowling	92.58	81.84
hit	38.08	28.08	SkateBoarding	89.84	79.35
ride_horse	30.49	20.53	PlayingPiano	88.23	77.92
dive	46.73	37.02	TennisSwing	87.13	77.59
climb	77.83	68.16	PlayingDaf	85.47	76.51
clap	60.00	50.37	Nunchucks	81.37	72.87
drink	62.46	53.64	PlayingCello	86.51	78.05
golf	85.54	76.73	PlayingDhol	87.93	79.64
shoot_gun	40.18	31.94	TableTennisShot	78.28	70.28

Model structure	HMDB51/%	UCF101/%
CNN+LSTM	44.13	78.39
CNN+ConvLSTM	46.07	80.52
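
The difference between the two rows above comes from how the recurrent gates are computed: an LSTM flattens each frame feature into a vector, while a ConvLSTM computes its gates with convolutions, so the hidden and cell states keep their height-by-width layout. A minimal NumPy sketch of one ConvLSTM step is given below. It is a simplified form that omits the peephole terms of the original ConvLSTM formulation, it uses random placeholder weights and arbitrary tensor sizes, and it assumes a 3×3 kernel to match the kernel size in the parameter table; it is not the authors' network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    """Zero-padded 2D convolution (cross-correlation form, as in deep learning).

    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k. Output: (C_out, H, W).
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]                # (C_in, k, k)
            out[:, i, j] = np.tensordot(w, patch, axes=3)  # (C_out,)
    return out

def convlstm_step(x, h_prev, c_prev, w, b):
    """One ConvLSTM step without peephole connections.

    w: (4 * C_h, C_in + C_h, k, k) holds the input, forget, output, and
    candidate gate kernels stacked along the first axis; b: (4 * C_h,).
    """
    z = conv2d_same(np.concatenate([x, h_prev], axis=0), w) + b[:, None, None]
    i, f, o, g = np.split(z, 4, axis=0)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Placeholder sequence: T frame feature maps of shape (C_in, Hh, Ww).
rng = np.random.default_rng(0)
T, C_in, C_h, Hh, Ww, k = 4, 3, 5, 6, 6, 3
frames = rng.standard_normal((T, C_in, Hh, Ww))
w = rng.standard_normal((4 * C_h, C_in + C_h, k, k)) * 0.1
b = np.zeros(4 * C_h)
h = np.zeros((C_h, Hh, Ww))
c = np.zeros((C_h, Hh, Ww))
for x in frames:
    h, c = convlstm_step(x, h, c, w, b)
```

After the loop, `h` still has shape `(C_h, Hh, Ww)`, so a frame-level classifier placed on top of it can use the spatial arrangement of the features, which a vector-valued LSTM state discards.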