Journal of Xidian University, 2022, Vol. 49, Issue (4): 144-155. doi: 10.19665/j.issn1001-2400.2022.04.017

• Computer Science and Technology •

Method to recognize human action by using the convolutional block attention mechanism

GAO Deyong (高德勇)1,2, KANG Zibing (康自兵)1, WANG Song (王松)1,2, WANG Yangping (王阳萍)1,3

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, Gansu, China
  2. Gansu Provincial Engineering Research Center for Artificial Intelligence and Graphic and Image Processing, Lanzhou 730070, Gansu, China
  3. Gansu Provincial Key Lab of System Dynamics and Reliability of Rail Transport Equipment, Lanzhou 730070, Gansu, China
  • Received: 2021-03-24 Online: 2022-08-20 Published: 2022-08-15
  • Contact: KANG Zibing
  • About the authors: GAO Deyong (born 1976), male, associate professor, E-mail: 258680916@qq.com | WANG Song (born 1978), male, associate professor, E-mail: wangsong@mail.lzjtu.cn | WANG Yangping (born 1973), female, professor, E-mail: 1328396793@qq.com
  • Supported by:
    National Natural Science Foundation of China (62067006); Natural Science Foundation of Gansu Province (21JR7RA291); Innovation Fund for Higher Education Institutions of Gansu Province (2021B-113); Science and Technology Program of Gansu Province (18JR3RA104); Industry Support Program for Higher Education Institutions of Gansu Province (2020C-19); Tianyou Innovation Team of Lanzhou Jiaotong University (TY202002)

Abstract:

In action recognition, attention models that locate regions of interest in an image sequence tend to emphasize the correlation between channels and neglect the spatial location of features, so they cannot precisely identify the dynamic regions of a video. To address this, the paper proposes an action recognition method based on an attention mechanism and a convolutional long short-term memory network (ConvLSTM). First, a ResNet-50 network extracts a feature representation of each video frame. A convolutional block attention module then applies channel attention to allocate resources across the convolutional channels of the feature map, followed by spatial attention to analyze the spatial relationships among salient elements in the feature maps. This adjusts the weights of the convolutional feature maps and suppresses or reduces the influence of regions unrelated to the action. Second, a long short-term memory network (LSTM) loses the spatial structure of image frames when it processes spatiotemporal data, whereas ConvLSTM uses convolution to capture spatial correlations within the image and so gives a more complete representation of the video. ConvLSTM is therefore used to model the sequential information in the features and to produce frame-level predictions, and the predictions of all frames are combined to determine the class of the video. Experimental results on three public datasets show that the proposed method effectively highlights the key regions of a video and improves action recognition accuracy to a certain extent.

Key words: machine vision, action recognition, attention mechanism, region of interest, convolutional LSTM
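
The channel-then-spatial ordering described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the internals of the attention module, so the average- and max-pooled descriptors, the shared two-layer MLP with reduction ratio r, and the 7x7 spatial kernel are assumptions taken from the commonly used convolutional block attention formulation. The function names, the toy tensor sizes, and the random weights are hypothetical, and a random array stands in for a ResNet-50 feature map.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w0, w1):
    """feat: (C, H, W). Returns one weight per channel, shape (C,)."""
    avg_desc = feat.mean(axis=(1, 2))  # global average pooling -> (C,)
    max_desc = feat.max(axis=(1, 2))   # global max pooling -> (C,)

    def shared_mlp(v):
        # Assumed shared MLP: C -> C/r -> C with a ReLU in between.
        return w1 @ np.maximum(w0 @ v, 0.0)

    return sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))


def spatial_attention(feat, kernel):
    """feat: (C, H, W); kernel: (2, k, k). Returns weights of shape (H, W)."""
    # Pool across channels, then filter the 2-channel map with a k x k kernel.
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    _, h, w = pooled.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)


def cbam(feat, w0, w1, kernel):
    """Apply channel attention first, then spatial attention."""
    refined = feat * channel_attention(feat, w0, w1)[:, None, None]
    return refined * spatial_attention(refined, kernel)[None, :, :]


# Toy sizes and random weights for illustration only (not the paper's values).
rng = np.random.default_rng(0)
C, H, W, r = 8, 7, 7, 4
feat = rng.standard_normal((C, H, W))  # stand-in for a ResNet-50 feature map
w0 = rng.standard_normal((C // r, C)) * 0.1
w1 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
out = cbam(feat, w0, w1, kernel)
print(out.shape)  # (8, 7, 7)
```

Because both attention maps pass through a sigmoid, each element of the output is the input element scaled by a factor between 0 and 1, which is how regions unrelated to the action would be down-weighted. In the pipeline the abstract describes, the refined per-frame feature maps would then be fed to the ConvLSTM.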

CLC number:

  • TP391.4