Micro-video multi-label classification method based on multi-modal feature encoding

doi:10.19665/j.issn1001-2400.2022.04.013

Abstract:

With the spread of smartphones and the growth of the mobile Internet, micro-videos have proliferated as a new form of user-generated content (UGC), and browsing them has become one of the most popular forms of entertainment. Micro-videos are inherently correlated across modalities and semantics, and exploiting this correlation is the key to micro-video representation learning. To address multi-label classification of micro-videos, a classification model based on multi-modal subspace encoding is proposed, which integrates multi-modal subspace encoding learning and label semantic correlation learning into a unified framework. The model uses a subspace encoding network to mine the consistent and complementary information across the modalities of a micro-video while removing redundant information and reducing the influence of noise, yielding a common, complete representation that fuses the modalities. A graph convolutional network is used to construct a label correlation matrix and learn representations of the semantic correlations among labels, which then guide the multi-label classification task. Feature-level and label-level information thus interact and fuse more fully, which improves classification performance. The algorithm jointly formulates a modality reconstruction loss and a multi-label classification loss, making full use of the multi-modal nature of micro-videos and the correlations among their labels. Experiments on a public dataset demonstrate the effectiveness and superiority of the proposed model on the classification task.

Key words: micro-video, multi-modal fusion, deep learning, multi-label classification, neural networks
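
The label branch summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the cosine-style co-occurrence used to build the correlation matrix, the single layer, the dimensions, and all names (`label_correlation`, `normalize_adjacency`, `gcn_layer`, `H0`, `W0`, `common`) are assumptions made for illustration. Only the propagation rule follows the graph convolution of Kipf and Welling [13].

```python
import numpy as np

def label_correlation(Y, eps=1e-12):
    """Symmetric label correlation from a binary label matrix Y (n_samples x n_labels).

    Assumption: cosine-style co-occurrence; the paper may build its matrix differently.
    """
    Y = np.asarray(Y, dtype=float)
    co = Y.T @ Y                       # co[i, j] = samples carrying both label i and label j
    counts = np.diag(co).copy()        # counts[i] = samples carrying label i
    A = co / (np.sqrt(np.outer(counts, counts)) + eps)
    np.fill_diagonal(A, 0.0)           # self-loops are added during normalization
    return A

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} used by GCN [13]."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One graph-convolution layer with ReLU: H' = max(0, A_norm @ H @ W)."""
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 0]])              # toy labels: 4 videos, 3 labels
A_norm = normalize_adjacency(label_correlation(Y))
H0 = rng.normal(size=(3, 8))           # hypothetical initial label embeddings
W0 = rng.normal(size=(8, 4))           # hypothetical layer weights
label_repr = gcn_layer(A_norm, H0, W0) # one 4-d representation per label
common = rng.normal(size=(5, 4))       # stand-in for the fused common representation of 5 videos
scores = common @ label_repr.T         # per-video, per-label scores, shape (5, 3)
```

In this sketch the learned label representations act as label-wise classifiers applied to the fused representation, which is one common way to let label correlations guide multi-label prediction.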

CLC number: TP391

JING Peiguang, LI Yaxin, SU Yuting. Micro-video multi-label classification method based on multi-modal feature encoding[J]. Journal of Xidian University, 2022, 49(4): 109-117.

References (27)

[1]	SAURA J R, BENNETT D R. A Three-Stage Method for Data Text Mining:Using UGC in Business Intelligence Analysis[J]. Symmetry, 2019, 11(4):519. doi: 10.3390/sym11040519
[2]	LIU M, NIE L, WANG M, et al. Towards Micro-Video Understanding by Joint Sequential-Sparse Modeling[C]// Proceedings of ACM International Conference on Multimedia. New York: ACM, 2017:970-978.
[3]	JING P, SU Y, NIE L, et al. Low-Rank Multi-View Embedding Learning for Micro-Video Popularity Prediction[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 30(8):1519-1532. doi: 10.1109/TKDE.2017.2785784
[4]	CHEN X, LIU D, XIONG Z, et al. Learning and Fusing Multiple User Interest Representations for Micro-Video and Movie Recommendations[J]. IEEE Transactions on Multimedia, 2020, 23:484-496. doi: 10.1109/TMM.2020.2978618
[5]	JIA X, ZHENG X, LI W, et al. Facial Emotion Distribution Learning by Exploiting Low-Rank Label Correlations Locally[C]// Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019:9841-9850.
[6]	CHEN X, SONG X, REN R, et al. Fine-Grained Privacy Detection with Graph-Regularized Hierarchical Attentive Representation Learning[J]. ACM Transactions on Information Systems, 2020, 38(4):1-26.
[7]	D'MELLO S K, KORY J. A Review and Meta-Analysis of Multimodal Affect Detection Systems[J]. ACM Computing Surveys, 2015, 47(3):1-36.
[8]	HARDOON D R, SZEDMAK S, SHAWE-TAYLOR J. Canonical Correlation Analysis:An Overview with Application to Learning Methods[J]. Neural Computation, 2004, 16(12):2639-2664. doi: 10.1162/0899766042321814
[9]	DANG Jisheng, YANG Jun. 3D Model Recognition and Segmentation Based on Multi-Feature Fusion[J]. Journal of Xidian University, 2020, 47(4):149-157. (in Chinese)
[10]	ZHANG C, FU H, ZHOU J T, et al. CPM-Nets:Cross Partial Multi-View Networks[C]// 33rd Conference on Neural Information Processing Systems. San Diego: NeurIPS, 2019:557-567.
[11]	ZHANG Lijuan, CUI Tianshu, JING Peiguang, et al. Micro-Video Classification Based on Deep Multi-Modal Feature Fusion[J]. Journal of Beihang University, 2021, 47(3):478-485. (in Chinese)
[12]	ZHANG Zhichang, ZHANG Zhiman, ZHANG Zhenwen. Classifying Health Questions with Local Semantic and Global Structural Information[J]. Journal of Xidian University, 2020, 47(2):9-15. (in Chinese)
[13]	KIPF T N, WELLING M. Semi-Supervised Classification with Graph Convolutional Networks (2016)[J/OL]. [2020-07-23]. http://arxiv.org/abs/1609.02907.
[14]	HECHT-NIELSEN R. Theory of The Backpropagation Neural Network[J]. Neural Networks, 1988, 1:445. doi: 10.1016/0893-6080(88)90469-8
[15]	BOTTOU L. Large-Scale Machine Learning with Stochastic Gradient Descent[C]// Proceedings of International Conference on Computational Statistics. Heidelberg: Springer, 2010:177-186.
[16]	YANAI K, KAWANO Y. Food Image Recognition Using Deep Convolutional Network with Pre-Training and Fine-Tuning[C]// Proceedings of IEEE International Conference on Multimedia. Piscataway: IEEE, 2015:1-6.
[17]	LOGAN B. Mel Frequency Cepstral Coefficients for Music Modeling[C]// Proceedings of International Society for Music Information Retrieval. Massachusetts: ISMIR, 2000, 270:1-11.
[18]	WANG L, QIAO Y, TANG X. Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors[C]// Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015:4305-4314.
[19]	PENNINGTON J, SOCHER R, MANNING C D. GloVe:Global Vectors for Word Representation[C]// Proceedings of Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014:1532-1543.
[20]	TRAN D, BOURDEV L, FERGUS R, et al. Learning Spatiotemporal Features with 3D Convolutional Networks[C]// Proceedings of IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015:4489-4497.
[21]	YEH C K, WU W C, KO W J, et al. Learning Deep Latent Spaces for Multi-Label Classification[C]// Proceedings of AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2017:2838-2844.
[22]	ZHANG M L, ZHOU Z H. ML-KNN:A Lazy Learning Approach to Multi-Label Learning[J]. Pattern Recognition, 2007, 40(7):2038-2048. doi: 10.1016/j.patcog.2006.12.019
[23]	SZEGEDY C, LIU W, JIA Y, et al. Going Deeper with Convolutions[C]// Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015:1-9.
[24]	ZHU Y, KWOK J T, ZHOU Z H. Multi-Label Learning with Global and Local Label Correlation[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 30(6):1081-1094. doi: 10.1109/TKDE.2017.2785795
[25]	DING Z, FU Y. Robust Multi-View Subspace Learning through Dual Low-Rank Decompositions[C]// Proceedings of AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2016:1181-1187.
[26]	ZHANG J, LUO Z, LI C, et al. Manifold Regularized Discriminative Feature Selection for Multi-Label Learning[J]. Pattern Recognition, 2019, 95:136-150. doi: 10.1016/j.patcog.2019.06.003
[27]	WANG L, LIU Y, QIN C, et al. Dual Relation Semi-Supervised Multi-Label Learning[C]// Proceedings of AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2020:6227-6234.

Algorithm	Average precision	Coverage	Hamming loss	Ranking loss	One-error
A	0.3935	9.4120	0.0241	0.1472	0.8324
T	0.4126	7.3186	0.0173	0.1263	0.7662
V	0.7842	2.0826	0.0131	0.0349	0.2987
A+T	0.4312	7.1863	0.0186	0.1154	0.7521
A+V	0.7914	1.9784	0.0124	0.0472	0.2923
V+T	0.8013	1.7216	0.0114	0.0275	0.2763
A+T+V	0.8254	1.6432	0.0091	0.0187	0.2219

Algorithm	Average precision	Coverage	Hamming loss	Ranking loss	One-error
RMSL [25]	0.8033	4.0065	0.0143	0.0452	0.2432
C2AE [21]	0.8013	3.6942	0.0128	0.0481	0.2381
GoogLeNet [23]	0.6676	4.5680	0.0176	0.4349	0.4349
C3D [20]	0.7149	3.9041	0.0146	0.3694	0.3694
MDFS [26]	0.7847	2.6077	0.0127	0.0352	0.2918
GLOCAL [24]	0.7527	3.9943	0.0133	0.0515	0.2457
ML-KNN [22]	0.7843	4.0204	0.0134	0.0476	0.3087
DRML [27]	0.7905	1.6725	0.0081	0.0192	0.1921
Proposed algorithm	0.8254	1.6432	0.0091	0.0187	0.2219
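
The five columns in the tables above are standard multi-label evaluation metrics: higher is better for average precision, lower is better for the other four. The sketch below shows one common way to compute them, assuming the usual example-based definitions. The paper's exact conventions are not stated here, so the coverage offset (maximum rank minus 1), the 0.5 decision threshold for Hamming loss, and the function name `multilabel_metrics` are assumptions. The sketch also assumes no tied scores and that every sample has at least one relevant and one irrelevant label.

```python
import numpy as np

def multilabel_metrics(Y_true, scores, threshold=0.5):
    """Example-based multi-label metrics under common textbook definitions (assumed)."""
    Y = np.asarray(Y_true).astype(bool)
    S = np.asarray(scores, dtype=float)
    n, q = Y.shape
    if not (Y.any(axis=1).all() and (~Y).any(axis=1).all()):
        raise ValueError("each sample needs at least one relevant and one irrelevant label")

    # ranks[i, j] = rank of label j for sample i, where rank 1 is the highest score
    order = np.argsort(-S, axis=1, kind="stable")
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, q + 1)

    # Hamming loss: fraction of label slots predicted wrongly after thresholding
    hamming = np.mean((S >= threshold) != Y)
    # One-error: fraction of samples whose top-ranked label is irrelevant
    one_error = np.mean(~Y[np.arange(n), order[:, 0]])
    # Coverage: how far down the ranking one must go to cover all relevant labels
    coverage = np.mean(np.where(Y, ranks, 0).max(axis=1) - 1)

    ranking_losses, avg_precisions = [], []
    for y, s, r in zip(Y, S, ranks):
        pos, neg = s[y], s[~y]
        # Ranking loss: share of (relevant, irrelevant) pairs ordered wrongly
        ranking_losses.append(np.mean(pos[:, None] <= neg[None, :]))
        # Average precision: mean precision measured at the rank of each relevant label
        pos_ranks = np.sort(r[y])
        avg_precisions.append(np.mean(np.arange(1, pos_ranks.size + 1) / pos_ranks))

    return {
        "average_precision": float(np.mean(avg_precisions)),
        "coverage": float(coverage),
        "hamming_loss": float(hamming),
        "ranking_loss": float(np.mean(ranking_losses)),
        "one_error": float(one_error),
    }

# Toy example: 2 samples, 3 labels
Y_demo = [[1, 0, 1],
          [0, 1, 0]]
S_demo = [[0.9, 0.2, 0.6],
          [0.7, 0.4, 0.1]]
metrics = multilabel_metrics(Y_demo, S_demo)
print(metrics)
```

Under these definitions the tables show the proposed algorithm leading on average precision, coverage, and ranking loss, while DRML [27] reports lower Hamming loss and one-error.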