Semantic-Guided Remote Sensing Image Captioning With Dual-Path Visual Feature Fusion

Keywords

remote sensing image captioning
semantic-guided encoding

Abstract

Remote sensing image captioning aims to generate natural language descriptions for complex aerial and satellite images. Unlike natural images, remote sensing scenes often contain small objects, dense spatial structures, large-scale background variation, and domain-specific semantic categories. This article proposes a semantic-guided dual-path feature fusion framework for remote sensing image captioning. The framework encodes visual patch features and semantic category cues in separate paths, then fuses them through cross-attention to improve object grounding and sentence generation. Inspired by recent developments in vision-language modeling, the framework emphasizes early multimodal interaction rather than late-stage feature concatenation. The model is designed to improve caption accuracy, semantic completeness, and spatial consistency on benchmark datasets such as UCM-Captions, Sydney-Captions, and RSICD. Attention visualization is used to analyze whether generated words correspond to relevant image regions. The study contributes to remote sensing vision-language research by presenting a structured approach for combining semantic guidance, visual attention, and transformer-based caption decoding.
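
The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes single-head scaled dot-product cross-attention in which visual patch features act as queries and semantic category embeddings act as keys and values, followed by a residual connection. The function name `cross_attention_fuse`, the projection matrices, and all dimensions are hypothetical choices made for illustration; the abstract does not specify the actual architecture.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(patches, semantics, w_q, w_k, w_v):
    """Fuse patch features with semantic cues (illustrative, not the paper's exact layer).

    patches:   (num_patches, dim) visual patch features, used as queries
    semantics: (num_categories, dim) semantic category embeddings, used as keys/values
    """
    q = patches @ w_q                          # (num_patches, dim)
    k = semantics @ w_k                        # (num_categories, dim)
    v = semantics @ w_v                        # (num_categories, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (num_patches, num_categories)
    weights = softmax(scores, axis=-1)         # each patch attends over categories
    fused = patches + weights @ v              # residual keeps the original visual signal
    return fused, weights


# Hypothetical sizes: a 7x7 patch grid, 5 semantic categories, 32-dim features.
rng = np.random.default_rng(0)
num_patches, num_categories, dim = 49, 5, 32
patches = rng.normal(size=(num_patches, dim))
semantics = rng.normal(size=(num_categories, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

fused, weights = cross_attention_fuse(patches, semantics, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Under these assumptions, fusing before the caption decoder lets every patch representation carry category information from the start, which is one way to read the abstract's contrast between early multimodal interaction and late-stage concatenation. The per-patch `weights` matrix is also the kind of quantity an attention visualization could display.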
