Journal of Xidian University ›› 2022, Vol. 49 ›› Issue (4): 109-117. doi: 10.19665/j.issn1001-2400.2022.04.013

• Computer Science and Technology •

A micro-video multi-label classification method with multi-modal feature encoding

JING Peiguang, LI Yaxin, SU Yuting

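  1. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
  • Received: 2021-05-07 Online: 2022-08-20 Published: 2022-08-15
  • Corresponding author: SU Yuting
  • About the authors: JING Peiguang (b. 1988), male, associate professor, E-mail: pgjing@tju.edu.cn | LI Yaxin (b. 1996), male, master's student at Tianjin University, E-mail: curryxin@tju.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61802277); Natural Science Foundation of Tianjin (20JCQNJC01210); China Postdoctoral Science Foundation (2019M651038)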

Micro-video multi-label classification method based on multi-modal feature encoding

JING Peiguang, LI Yaxin, SU Yuting

  1. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
  • Received: 2021-05-07 Online: 2022-08-20 Published: 2022-08-15
  • Contact: Yuting SU

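Abstract (translated from the Chinese):

With the spread of smartphones and the growth of the mobile Internet, micro-videos have spread rapidly as an emerging form of user-generated content, and browsing micro-videos has become one of the most popular forms of entertainment. Micro-videos are inherently correlated across modalities and semantics, and exploiting this correlation is the key to micro-video representation learning. To address the multi-label classification of micro-videos, a classification model based on multi-modal subspace encoding is proposed, which integrates multi-modal subspace encoding learning and label semantic correlation learning into a unified framework. The model uses a subspace encoding network to mine the consistent and complementary information across the modalities of a micro-video while removing redundant information and reducing the influence of noise, yielding a common, complete representation that fuses the modalities. A graph convolutional network is used to build a label correlation matrix and to learn representations of the semantic correlations among labels, which then guide the multi-label classification task. Feature-level and label-level information are fused and made to interact more fully to improve classification performance. The algorithm jointly formulates a modality reconstruction loss and a multi-label classification loss, making full use of the multi-modal nature and multi-label correlations of micro-videos. Experiments on a public dataset demonstrate the effectiveness and superiority of the proposed model on the classification task.

Key words: micro-video, multi-modal fusion, deep learning, multi-label classification, neural networks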

Abstract:

With the popularization of smartphones and the mobile Internet, micro-videos have grown rapidly as a new form of user-generated content (UGC), and browsing micro-videos has become one of the most popular forms of entertainment. Micro-videos are naturally correlated in both modality and semantics, and making full use of this correlation is the key to micro-video representation learning. To better solve the multi-label classification task, a multi-modal subspace encoding algorithm is proposed, which integrates multi-modal subspace coding and label semantic relevance learning in a unified framework. The proposed algorithm uses a subspace coding network to model the consistency and complementarity of the modalities while reducing redundancy and noise, so that a common and complete representation of the fused modalities is obtained. Furthermore, a graph convolutional network is used to construct a label correlation matrix and to learn the semantic relevance and representations of labels, which guide the multi-label classification task. Overall, the proposed algorithm makes full use of feature-level and label-level information to improve classification performance. The reconstruction loss and the multi-label classification loss are formulated jointly, and experiments on a public dataset demonstrate the superiority of the proposed algorithm.

Key words: micro-video, multi-modal fusion, deep learning, multi-label classification, neural networks
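
The label branch and the joint objective summarized in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not say how the label correlation matrix is built, so the conditional co-occurrence estimate, the symmetric normalization with self-loops, the one-hot label embeddings, the fixed weights, the toy annotations, and the fused vector `z` are all assumptions made for illustration. The subspace encoding network itself is not modeled; only a squared-error reconstruction term stands in for it.

```python
import math

def label_correlation(labels):
    """Assumed estimate: A[i][j] = P(label j present | label i present)."""
    n = len(labels[0])
    count = [0] * n
    co = [[0] * n for _ in range(n)]
    for row in labels:
        for i in range(n):
            if row[i]:
                count[i] += 1
                for j in range(n):
                    if row[j] and i != j:
                        co[i][j] += 1
    return [[co[i][j] / count[i] if count[i] else 0.0 for j in range(n)]
            for i in range(n)]

def normalize_adjacency(a):
    """Add self-loops, then scale entries by 1 / sqrt(deg_i * deg_j)."""
    n = len(a)
    a_hat = [[a[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def gcn_layer(a_norm, h, w):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(a_norm, h), w)]

def bce_loss(scores, targets):
    """Mean binary cross-entropy over independent per-label sigmoid outputs."""
    total = 0.0
    for s, t in zip(scores, targets):
        p = 1.0 / (1.0 + math.exp(-s))
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(scores)

def reconstruction_loss(original, rebuilt):
    """Squared error between a modality feature and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, rebuilt))

# Toy multi-hot annotations: four videos, three labels (made-up data).
labels = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
adjacency = label_correlation(labels)
a_norm = normalize_adjacency(adjacency)

# Hypothetical one-hot label embeddings and a fixed 3x2 weight matrix.
label_embed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.6]]
classifiers = gcn_layer(a_norm, label_embed, weights)  # one 2-d vector per label

# Hypothetical fused representation of the first video; each label score is
# the dot product of z with that label's classifier vector.
z = [0.8, -0.1]
scores = [sum(zi * ci for zi, ci in zip(z, c)) for c in classifiers]

# Joint objective: reconstruction term (hypothetical feature and rebuilt
# feature) plus the multi-label classification term.
total_loss = reconstruction_loss([1.0, 0.0], [0.9, 0.2]) + bce_loss(scores, labels[0])
```

In this sketch the graph convolution turns label embeddings into per-label classifier vectors, so correlated labels end up with related classifiers. That is one common way for a label graph to guide multi-label prediction, and the paper's actual coupling between the two branches may differ.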

CLC number:

  • TP391