Journal of Xidian University ›› 2024, Vol. 51 ›› Issue (3): 136-146. doi: 10.19665/j.issn1001-2400.20231004

• Computer Science and Technology & Artificial Intelligence •

Texture-aware video inpainting algorithm based on the multi-attention mechanism

XIA Yilan1, WANG Xiumei1, CHENG Peitao2

  1. School of Electronic Engineering, Xidian University, Xi’an 710071, China
  2. School of Mechano-Electronic Engineering, Xidian University, Xi’an 710071, China
  • Received: 2023-03-13 Online: 2024-06-20 Published: 2023-11-15
  • Contact: CHENG Peitao E-mail: ylxia@stu.xidian.edu.cn; wangxm@xidian.edu.cn; chengpeitao@163.com

Abstract:

Existing video inpainting methods cannot effectively exploit spatially distant content, which leads to implausible structures and textures. To address this problem, this paper proposes a texture-aware video inpainting algorithm based on a multi-attention mechanism. The mechanism combines multi-head spatiotemporal attention with single-image local attention, preserving global structure while enriching local texture. Multi-head spatiotemporal attention attends to the overall spatiotemporal information, whereas single-image local attention distills local information by applying self-attention within local windows. In the feedforward network, a plug-and-play fast Fourier convolution layer residual block replaces vanilla convolution, expanding the receptive field to the entire image so that the global structure and texture of each single frame are enriched. The fast Fourier convolution layer residual block and the single-image local attention complement each other and jointly improve the quality of local textures. Experimental results on the YouTube-VOS and DAVIS datasets show that, although the proposed method ranks second to the best-performing method, FuseFormer, on objective metrics, it reduces the number of parameters by 54.8% and the running time by 21.5%. The proposed method also generates more visually realistic and semantically reasonable content.
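
As an illustration of why a fast Fourier convolution residual block has an image-wide receptive field, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a single frame's feature map of shape (C, H, W) and a hypothetical channel-mixing matrix `weight` of shape (2C, 2C); the function names `fourier_unit` and `ffc_residual_block` are likewise illustrative. A trained block would typically also include normalization and a parallel local convolution branch, which are omitted here.

```python
import numpy as np


def fourier_unit(x, weight):
    """Simplified spectral transform (illustrative, not the paper's code).

    x:      real feature map of shape (C, H, W)
    weight: hypothetical (2C, 2C) matrix acting as a 1x1 convolution
            over the stacked real and imaginary spectrum channels
    """
    c, h, w = x.shape
    # 2-D real FFT over the spatial axes: each coefficient depends on every pixel.
    freq = np.fft.rfft2(x, axes=(-2, -1), norm="ortho")        # (C, H, W//2 + 1)
    # Stack real and imaginary parts as channels.
    stacked = np.concatenate([freq.real, freq.imag], axis=0)   # (2C, H, W//2 + 1)
    # Pointwise channel mixing followed by ReLU, applied per frequency.
    mixed = np.einsum("oc,chw->ohw", weight, stacked)
    mixed = np.maximum(mixed, 0.0)
    # Rebuild the complex spectrum and return to the spatial domain.
    spectrum = mixed[:c] + 1j * mixed[c:]
    return np.fft.irfft2(spectrum, s=(h, w), axes=(-2, -1), norm="ortho")


def ffc_residual_block(x, weight):
    """Residual connection around the spectral transform."""
    return x + fourier_unit(x, weight)
```

Because the pointwise operation acts on frequency coefficients, perturbing a single input pixel generally changes the output across the whole frame, whereas a 3x3 vanilla convolution would change at most nine output positions. This is the sense in which such a block can complement window-restricted local attention.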

Key words: video inpainting, Transformer, fast Fourier convolution, multi-attention mechanism, texture-aware

CLC Number: 

  • TP391
