Vision-Language Feature Fusion and Wavelet Attention for Fine-Grained Visual Recognition

Keywords

fine-grained recognition

Abstract

Fine-grained visual recognition requires a model to separate categories that differ only in subtle visual details. Recent vision-language models provide strong semantic representations, while wavelet-based convolutional networks are more sensitive to frequency-domain detail. This article proposes a hybrid feature-fusion framework that combines semantically guided visual encoding with multi-scale wavelet attention. The framework first extracts visual features with convolutional and transformer-based encoders, then applies wavelet transform convolution to strengthen texture and contour cues and, in medical images, lesion-related cues. A semantic guidance module introduces category-level descriptions to make the learned features more discriminative and the predictions easier to interpret. Finally, an attention module refines the features along the channel and spatial dimensions before classification or caption generation. The method targets fine-grained recognition tasks such as remote sensing scene analysis, melanoma image recognition, support for breast cancer segmentation in DCE-MRI, medical image interpretation, and industrial defect identification. The central observation is that semantic representation and frequency-aware visual detail are complementary. By integrating these two feature sources, the framework aims to improve both recognition accuracy and interpretability in complex visual domains.
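The pipeline described in the abstract can be illustrated with a small, dependency-free sketch. It is not the article's implementation: the Haar wavelet, the parameter-free channel gate (a CBAM-style channel attention with the learned shared MLP left out), the four-value descriptor, and the hand-written class embeddings are all assumptions chosen to keep the example short. Every function name below is hypothetical.

```python
import math


def haar_dwt2(img):
    """Single-level 2D Haar transform of a grayscale image (list of rows).

    Returns four half-resolution sub-bands: the low-pass approximation and
    three detail bands (row differences, column differences, diagonal).
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    bands = [[], [], [], []]
    for i in range(0, h, 2):
        new_rows = [[], [], [], []]
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            new_rows[0].append((a + b + c + d) / 2.0)  # scaled local average
            new_rows[1].append((a + b - c - d) / 2.0)  # top row minus bottom row
            new_rows[2].append((a - b + c - d) / 2.0)  # left column minus right column
            new_rows[3].append((a - b - c + d) / 2.0)  # diagonal detail
        for band, row in zip(bands, new_rows):
            band.append(row)
    return tuple(bands)


def channel_gate(bands):
    """Assumed simplification of channel attention: one weight per sub-band,
    computed as sigmoid(mean + max) of the coefficient magnitudes."""
    weights = []
    for band in bands:
        mags = [abs(v) for row in band for v in row]
        pooled = sum(mags) / len(mags) + max(mags)
        weights.append(1.0 / (1.0 + math.exp(-pooled)))
    return weights


def band_descriptor(bands, weights):
    """Gated mean magnitude of each sub-band, used as a tiny feature vector."""
    desc = []
    for band, wgt in zip(bands, weights):
        mags = [abs(v) for row in band for v in row]
        desc.append(wgt * sum(mags) / len(mags))
    return desc


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def semantic_scores(descriptor, class_embeddings):
    """Softmax over cosine similarities to per-class description embeddings."""
    sims = {name: cosine(descriptor, emb) for name, emb in class_embeddings.items()}
    z = sum(math.exp(s) for s in sims.values())
    return {name: math.exp(s) / z for name, s in sims.items()}


# Toy input: vertical stripes, so only the column-difference band carries detail.
image = [[1.0, 0.0, 1.0, 0.0] for _ in range(4)]
sub_bands = haar_dwt2(image)
gates = channel_gate(sub_bands)
features = band_descriptor(sub_bands, gates)

# Hypothetical stand-ins for embeddings of category-level descriptions.
classes = {"striped texture": [1.0, 0.0, 1.0, 0.0],
           "smooth surface": [1.0, 0.0, 0.0, 0.0]}
scores = semantic_scores(features, classes)
```

In a trained system the gate would be learned, a spatial attention step would follow the channel step, and the class embeddings would come from a text encoder rather than being written by hand. The sketch only shows how frequency sub-bands and category descriptions can meet in a single scoring step.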
