Multi-modal Disaster Analysis Based on Embracing Fusion

Abstract

Fusing text and image information can improve the accuracy of disaster event analysis compared with using a single modality. However, most existing work simply merges text features and image features, which introduces feature redundancy during extraction and fusion and ignores the correlation between the features of the two modalities. To address this, this article analyzes currently popular multi-modal fusion algorithms and proposes a multi-modal disaster event analysis algorithm based on embrace fusion. First, the text feature vectors and the image feature vectors are compared with each other, so that the correlation between text features and image features is taken into account. Then, multinomial sampling is used to eliminate feature redundancy and to fuse the text features and image features. The experimental results show that Embrace Fusion reaches classification accuracies of 88.2% and 85.1% on the two tasks of the CrisisMMD2.0 dataset, respectively, which is significantly better than other multi-modal fusion models and demonstrates the effectiveness of the model. A second experiment further verifies that the embrace model is applicable to different text and image deep learning models.
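
The multinomial-sampling fusion step can be illustrated with a minimal sketch in the style of EmbraceNet (Choi and Lee, 2019). This is not the authors' implementation. It assumes that the text and image features have already been projected to a common length, and the function name `embrace_fuse`, the parameter `probs`, and the equal default probabilities are hypothetical choices made for illustration only.

```python
import numpy as np

def embrace_fuse(text_feat, image_feat, probs=(0.5, 0.5), rng=None):
    """Fuse two equal-length feature vectors by per-index multinomial selection.

    Assumption: both vectors were already projected to the same length c.
    For every feature index exactly one modality supplies the value, so the
    fused vector has length c instead of 2c (no duplicated slots).
    """
    rng = np.random.default_rng() if rng is None else rng
    feats = np.stack([text_feat, image_feat])   # shape (2, c)
    c = feats.shape[1]
    # Draw, for each of the c indices, which modality (0=text, 1=image) is kept.
    choice = rng.choice(2, size=c, p=probs)     # shape (c,)
    mask = np.eye(2)[choice].T                  # one-hot selection mask, shape (2, c)
    return (mask * feats).sum(axis=0)           # shape (c,)

# Usage with made-up feature values (hypothetical data, not from the paper).
text_vec = np.arange(8, dtype=float) + 100.0
image_vec = -np.arange(8, dtype=float) - 1.0
fused = embrace_fuse(text_vec, image_vec, rng=np.random.default_rng(0))
```

Each entry of `fused` comes from exactly one modality, which is how this kind of sampling avoids the redundancy of simple concatenation. Setting `probs` to `(1.0, 0.0)` reduces the result to the text features alone.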

Key words: multi-modal fusion, embrace fusion, multinomial sampling, multi-modal disaster event, CrisisMMD2.0

MEI Xin, MIAO Zi-jing. Multi-modal Disaster Analysis Based on Embracing Fusion[J]. Computer and Modernization, 2022, 0(10): 82-87.
