[1] LI G, ZHU L C, LIU P, et al. Entangled transformer for image captioning[C]// 2019 IEEE/CVF International Conference on Computer Vision. 2019:8928-8937.
[2] DENG Z R, JIANG Z Q, LAN R S, et al. Image captioning using DenseNet network and adaptive attention[J]. Signal Processing Image Communication,2020,85:115836. DOI: 10.1016/j.image.2020.115836.
[3] ZHU X X, LI L X, LIU J, et al. Image captioning with word gate and adaptive self-critical learning[J]. Applied Sciences, 2018,8(6):909.
[4] GUAN Z B, LIU K, MA Y, et al. Middle-level attribute-based language retouching for image caption generation[J]. Applied Sciences, 2018,8(10):1850.
[5] YANG X, TANG K H, ZHANG H W, et al. Auto-encoding scene graphs for image captioning[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:10677-10686.
[6] LAWRENCE S, GILES C L, TSOI A C, et al. Face recognition: A convolutional neural network approach[J]. IEEE Transactions on Neural Networks, 1997,8(1): 98-113.
[7] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[8] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016:770-778.
[9] TAN M X, LE Q V. EfficientNet: Rethinking model scaling for convolutional neural networks[J]. arXiv preprint arXiv:1905.11946, 2019.
[10] GRAVES A. Generating sequences with recurrent neural networks[J]. arXiv preprint arXiv:1308.0850, 2013.
[11] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997,9(8):1735-1780.
[12] CHO K, VAN B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. Computer Science, 2014,21(4):62-72.
[13] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// 2017 Advances in Neural Information Processing Systems. 2017:5998-6008.
[14] TAY Y, BAHRI D, METZLER D, et al. Synthesizer: Rethinking self-attention for transformer models[C]// Proceedings of the 38th International Conference on Machine Learning Computer Science. 2021,139:10183-10192.
[15] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:6077-6086.
[16] CORNIA M, BARALDI L, CUCCHIARA R, et al. Show, control and tell: A framework for generating controllable and grounded captions[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:8299-8308.
[17] CHEN S Z, JIN Q, WANG P, et al. Say as you wish: Fine-grained control of image caption generation with abstract scene graphs[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020:9956-9968.
[18] DAI Z H, YANG Z L, YANG Y M, et al. Transformer-XL: Attentive language models beyond a fixed-length context[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019:2978-2988.
[19] HUANG Z H, LIANG D, XU P, et al. Improve transformer models with better relative position embeddings[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020.
[20] WANG B Y, SHANG L F, LIOMA C, et al. On position embeddings in BERT[C]// 2021 International Conference on Learning Representations. 2021.
[21] LIN T Y, MAIRE M, BELONGIE S, et a1. MicrosoftCOCO: Common objects in context[C]// 2014 European Conference on Computer Vision. 2014:740-755.
[22] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016,39(4):664-676.
[23] YONG H W, HUANG J Q, HUA X S, et al. Gradient centralization: A new optimization technique for deep neural networks[J]. arXiv preprint arXiv:2004.01461, 2020.
[24] WANG P, HWEE T N. A beam-search decoder for normalization of social media text with application to machine translation[C]// The 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013:471-481.
[25] PAPINENI K, ROUKOS S, WARD T, et al. Blue: A method for automatic evaluation of machine translation[C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002:311-318.