Texture-aware video inpainting algorithm based on the multi-attention mechanism

doi:10.19665/j.issn1001-2400.20231004

Abstract

Existing video inpainting methods cannot effectively exploit distant spatial content, which leads to implausible structures and textures. To solve this problem, this paper proposes a texture-aware video inpainting algorithm based on a multi-attention mechanism. The mechanism combines multi-head spatiotemporal attention with single-image local attention, preserving global structure while enriching local texture. Multi-head spatiotemporal attention focuses on the overall spatiotemporal information, whereas single-image local attention distills local information by applying self-attention within local windows. A plug-and-play fast Fourier convolution (FFC) residual block replaces the vanilla convolution in the feed-forward network, expanding the receptive field to the entire image so that the global structure and texture of a single frame are enriched. The FFC residual block and the single-image local attention complement each other and jointly improve the quality of local textures. Experimental results on the YouTube-VOS and DAVIS datasets show that, although the proposed method ranks second only to the best-performing method, FuseFormer, on objective metrics, its number of parameters and running time are reduced by 54.8% and 21.5%, respectively. The proposed method also generates more visually realistic and semantically reasonable content.
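The contrast between the two components described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the identity query/key/value projections, the random mixing matrix `weight`, and the toy 16×16 frame are all assumptions made for illustration. The first function restricts self-attention to non-overlapping windows of one frame; the second applies a pointwise channel mixing in the frequency domain with a residual connection, in the spirit of the Fourier convolutions cited in the reference list.

```python
import numpy as np


def window_self_attention(x, window):
    """Self-attention inside non-overlapping window x window tiles of one frame.

    x has shape (C, H, W); H and W must be divisible by `window`.
    Query, key and value projections are the identity to keep the sketch short.
    """
    c, h, w = x.shape
    nh, nw = h // window, w // window
    # (C, H, W) -> (nh, nw, window*window, C): one token sequence per tile.
    tiles = x.transpose(1, 2, 0).reshape(nh, window, nw, window, c)
    tokens = tiles.transpose(0, 2, 1, 3, 4).reshape(nh, nw, window * window, c)
    # Scaled dot-product attention, computed independently for every tile.
    scores = tokens @ tokens.transpose(0, 1, 3, 2) / np.sqrt(c)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    out = attn @ tokens
    # Undo the tiling: (nh, nw, window*window, C) -> (C, H, W).
    out = out.reshape(nh, nw, window, window, c).transpose(0, 2, 1, 3, 4)
    return out.reshape(h, w, c).transpose(2, 0, 1)


def ffc_residual_block(x, weight):
    """Toy Fourier residual block: x is (C, H, W), weight is (2C, 2C)."""
    c, h, w = x.shape
    # Real 2-D FFT over the spatial axes, shape (C, H, W // 2 + 1).
    freq = np.fft.rfft2(x, axes=(-2, -1), norm="ortho")
    # Stack real and imaginary parts as channels, mix them pointwise, apply ReLU.
    stacked = np.concatenate([freq.real, freq.imag], axis=0)
    mixed = np.maximum(np.einsum("oc,chw->ohw", weight, stacked), 0.0)
    # Back to the spatial domain, then add the residual connection.
    spectral = np.fft.irfft2(mixed[:c] + 1j * mixed[c:], s=(h, w),
                             axes=(-2, -1), norm="ortho")
    return x + spectral


rng = np.random.default_rng(0)
frame = rng.standard_normal((4, 16, 16))
weight = 0.1 * rng.standard_normal((8, 8))  # hypothetical mixing weights

bumped = frame.copy()
bumped[0, 0, 0] += 1.0  # change a single pixel in the top-left corner


def changed_fraction(fn):
    """Fraction of spatial positions whose output changes after the bump."""
    diff = np.abs(fn(bumped) - fn(frame)).max(axis=0)
    return float((diff > 1e-9).mean())


local_fraction = changed_fraction(lambda x: window_self_attention(x, window=4))
global_fraction = changed_fraction(lambda x: ffc_residual_block(x, weight))
```

With a window size of 4, the single-pixel change can affect at most the 16 positions of its own window (16 of 256), while every frequency coefficient depends on every pixel, so the Fourier block typically spreads the change across most of the frame. This is the sense in which the two components are complementary.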

Key words: video inpainting, Transformer, fast Fourier convolution, multi-attention mechanism, texture-aware

CLC Number: TP391

XIA Yilan, WANG Xiumei, CHENG Peitao. Texture-aware video inpainting algorithm based on the multi-attention mechanism[J]. Journal of Xidian University, 2024, 51(3): 136-146.

References

[1]	CHAVAN S A, CHOUDHARI N M. Various Approaches for Video Inpainting:A Survey[C]//2019 5th International Conference on Computing,Communication,Control and Automation.Piscataway:IEEE, 2019:1-5.
[2]	PAN Hao. Research on Digital Video Restoration Methods[D]. Hefei: University of Science and Technology of China, 2010. (in Chinese)
[3]	ZHANG X, LI H, QI Y, et al. Rain Removal in Video by Combining Temporal and Chromatic Properties[C]//IEEE International Conference on Multimedia and Expo. Piscataway:IEEE, 2006:461-464.
[4]	HUANG Y, ZHENG F, WANG D, et al. Super-Resolution and Inpainting with Degraded and Upgraded Generative Adversarial Networks[C]//Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence. New York: ACM, 2021:645-651.
[5]	WEI Zhe, LI Congli, SHEN Yan'an, et al. Thick Cloud Region Content Generation of UAV Image Based on Two-Stage Model[J]. Chinese Journal of Computers, 2021, 44(11):2233-2247. (in Chinese)
[6]	TRAN D, BOURDEV L, FERGUS R, et al. Learning Spatiotemporal Features with 3D Convolutional Networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway:IEEE, 2015:4489-4497.
[7]	CHANG Y L, LIU Z Y, LEE K Y, et al. Free-form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway:IEEE, 2019:9066-9075.
[8]	KIM D, WOO S, LEE J Y, et al. Deep Video Inpainting[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2019:5792-5801.
[9]	HU Y T, WANG H, BALLAS N, et al. Proposal-Based Video Completion[C]//Proceedings of the European Conference on Computer Vision. Piscataway:IEEE, 2020:38-54.
[10]	XU R, LI X, ZHOU B, et al. Deep Flow-Guided Video Inpainting[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2019:3723-3732.
[11]	GAO C, SARAF A, HUANG J B, et al. Flow-Edge Guided Video Completion[C]//Proceedings of the European Conference on Computer Vision. Piscataway:IEEE, 2020:713-729.
[12]	LEE S, OH S W, WON D Y, et al. Copy-and-Paste Networks for Deep Video Inpainting[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway:IEEE, 2019:4413-4421.
[13]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is All You Need[J]. Advances in Neural Information Processing Systems, 2017, 30:5998-6008.
[14]	ZENG Y, FU J, CHAO H. Learning Joint Spatial-Temporal Transformations for Video Inpainting[C]//Proceedings of the European Conference on Computer Vision. Piscataway:IEEE, 2020:528-543.
[15]	LIU R, DENG H, HUANG Y, et al. FuseFormer:Fusing Fine-Grained Information in Transformers for Video Inpainting[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway:IEEE, 2021:14040-14049.
[16]	TANCIK M, SRINIVASAN P, MILDENHALL B, et al. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains[J]. Advances in Neural Information Processing Systems, 2020, 33:7537-7547.
[17]	ZHANG R, ISOLA P, EFROS A A, et al. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2018:586-595.
[18]	SIMONYAN K, ZISSERMAN A. Very Deep Convolutional Networks for Large-Scale Image Recognition(2014)[J/OL].[2014-09-04].https://arxiv.org/pdf/1409.1556.pdf.
[19]	SUVOROV R, LOGACHEVA E, MASHIKHIN A, et al. Resolution-Robust Large Mask Inpainting with Fourier Convolutions[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway:IEEE, 2022:2149-2159.
[20]	GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative Adversarial Nets[J]. Advances in Neural Information Processing Systems, 2014, 27:2672-2680.
[21]	WANG C, HUANG H, HAN X, et al. Video Inpainting by Jointly Learning Temporal Structure and Spatial Details[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Menlo Park: AAAI, 2019, 33(1):5232-5239.
[22]	KINGMA D P, BA J. Adam:A Method for Stochastic Optimization[C]//Proceedings of the 3rd International Conference on Learning Representations. San Diego: ICLR, 2015:1-13.
[23]	XU N, YANG L, FAN Y, et al. YouTube-VOS:A Large-Scale Video Object Segmentation Benchmark(2018)[J/OL].[2018-09-06].https://arxiv.org/pdf/1809.03327.pdf.
[24]	CAELLES S, MONTES A, MANINIS K K, et al. The 2018 DAVIS Challenge on Video Object Segmentation(2018)[J/OL].[2018-03-01].https://arxiv.org/pdf/1803.00557.pdf.
[25]	YANG Jingya, QI Yanli, ZHOU Yiqing, et al. Algorithm for Recognition of Lightweight Intelligent Modulation Based on the CNN-Transformer Networks[J]. Journal of Xidian University, 2023, 50(3):40-49. (in Chinese)

Method	YouTube-VOS			DAVIS
	PSNR/dB↑	SSIM↑	LPIPS↓	PSNR/dB↑	SSIM↑	LPIPS↓
VINet [8]	26.9118	0.9151	0.0819	25.6712	0.8990	0.0952
FGVC [11]	29.8334	0.9406	0.0446	27.7828	0.9271	0.0517
STTN [14]	32.0191	0.9598	0.0402	28.5856	0.9367	0.0548
FuseFormer [15]	33.3592	0.9681	0.0349	30.0027	0.9513	0.0486
Proposed method	32.2092	0.9609	0.0390	28.6977	0.9365	0.0554
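For readers unfamiliar with the first metric in the table above, PSNR is conventionally defined as 10·log10(MAX² / MSE). The sketch below shows that standard definition in NumPy. The function names are hypothetical, and the evaluation protocol (8-bit range, per-frame scores averaged over a video) is an assumption, since this page does not state how the reported values were computed.

```python
import numpy as np


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))


def video_psnr(reference_frames, restored_frames, max_value=255.0):
    """Average of per-frame PSNR values (assumed protocol, see lead-in)."""
    scores = [psnr(r, p, max_value)
              for r, p in zip(reference_frames, restored_frames)]
    return float(np.mean(scores))


# Toy example: a uniform error of 16 grey levels on an 8-bit frame.
reference = np.zeros((8, 8))
restored = np.full((8, 8), 16.0)
example_db = psnr(reference, restored)
```

Higher PSNR and SSIM indicate a closer match to the ground truth, while lower LPIPS indicates higher perceptual similarity, which is what the arrows in the table header denote.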

Method	Parameters/M	60 frames		100 frames		140 frames		180 frames		Total time/s
		GPU memory/GB	Time/s	GPU memory/GB	Time/s	GPU memory/GB	Time/s	GPU memory/GB	Time/s	
FuseFormer	36.59	12.51	5.04	21.65	11.80	20.27	21.62	20.75	35.37	931
STTN	16.55	4.06	3.38	5.69	6.67	6.97	10.50	8.59	15.35	656
Proposed method	14.97	5.14	3.79	5.69	8.10	6.97	12.50	8.59	18.08	731
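The running-time saving quoted in the abstract can be checked against the last column of the table above. The short sketch below computes the relative reduction from the total times; the helper name and the dictionary are illustrative only, and the values are copied from the table.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Total running times in seconds, copied from the last column of the table.
total_time_s = {"FuseFormer": 931, "STTN": 656, "proposed": 731}

time_saving = relative_reduction(total_time_s["FuseFormer"],
                                 total_time_s["proposed"])
print(f"{time_saving:.1f}%")  # prints 21.5%
```

The result agrees with the 21.5% running-time reduction stated in the abstract.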

Number	PSNR/dB↑	SSIM↑	LPIPS↓
×1	30.6962	0.9498	0.0468
×2	31.3536	0.9548	0.0435
×3	31.6296	0.9568	0.0420
×4	32.2092	0.9609	0.0390

Method	PSNR/dB↑	SSIM↑	LPIPS↓
S	32.0191	0.9598	0.0402
S+WSA	31.7394	0.9577	0.0422
S+FFCR	31.9750	0.9594	0.0401
S+FFCR+WSA	32.2092	0.9609	0.0390

Method	PSNR/dB↑	SSIM↑	LPIPS↓
F	33.3592	0.9681	0.0349
F+FFCR	33.4384	0.9685	0.0344
