Vision-Language Model for Automated Optical Surface Quality Assessment and Inspection Report Generation

Keywords

Vision-language model
Optical metrology
Quality inspection
Thermal imaging
Industrial AI

Abstract

Existing deep learning systems for optical surface metrology produce only numerical outputs (temperature maps, phase values, defect segmentation masks), so human experts must interpret these outputs and write the quality assessment reports. This manual interpretation step is a significant bottleneck in production-line inspection, introduces inter-observer variability, and limits the scalability of automated quality control systems. This study proposes a vision-language model (VLM) framework for automated optical surface quality assessment and inspection report generation, building on the 4D thermal imaging methodology established by Huang, Yang, and Zhu (2023) and the deep learning-enhanced optical metrology demonstrated by Huang, Tang, Liu, and Huang (2026). The proposed framework takes multi-modal measurement inputs, including thermal images, fringe projection maps, and defect detection outputs from dedicated task networks, and generates structured natural language inspection reports that describe detected anomalies, assess their severity, and recommend disposition actions. A vision-language alignment module aligns the feature representations of the optical measurement data with a frozen large language model, enabling rich textual generation from measurement inputs. A curated dataset of 12,000 expert-annotated inspection reports paired with optical measurement data is constructed for training and evaluation. Simulation experiments demonstrate that the proposed framework generates reports with expert agreement rates of 89.3% for defect classification, 84.7% for severity grading, and 91.2% for disposition recommendations, outperforming rule-based automated reporting systems by 23 percentage points. The framework provides a pathway toward fully automated, end-to-end optical quality inspection that generates human-interpretable reports directly from measurement data.
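To make the reported metrics concrete, the following is a minimal sketch of how an expert agreement rate and a percentage-point gap between two systems could be computed. It assumes agreement means an exact label match per inspected item, which the abstract does not specify. The function names and the toy labels are hypothetical and are not taken from the study.

```python
from typing import Sequence


def agreement_rate(model_labels: Sequence[str], expert_labels: Sequence[str]) -> float:
    """Percentage of items whose model label equals the expert label.

    Assumption: agreement is an exact match per item; the study may
    define agreement differently.
    """
    if len(model_labels) != len(expert_labels):
        raise ValueError("label sequences must have the same length")
    if not expert_labels:
        raise ValueError("at least one labelled item is required")
    matches = sum(m == e for m, e in zip(model_labels, expert_labels))
    return 100.0 * matches / len(expert_labels)


def percentage_point_gap(rate_a: float, rate_b: float) -> float:
    """Absolute difference between two rates that are already in percent."""
    return rate_a - rate_b


# Toy labels invented for illustration only (not data from the study).
expert = ["scratch", "dig", "scratch", "none", "coating"]
vlm = ["scratch", "dig", "dig", "none", "coating"]
rule_based = ["scratch", "none", "dig", "none", "none"]

vlm_rate = agreement_rate(vlm, expert)
rule_rate = agreement_rate(rule_based, expert)
gap = percentage_point_gap(vlm_rate, rule_rate)

print(f"VLM agreement: {vlm_rate:.1f}%")
print(f"Rule-based agreement: {rule_rate:.1f}%")
print(f"Gap: {gap:.1f} percentage points")
```

A gap in percentage points is the plain difference between two percentages, not a relative improvement. The same helper could be applied separately to defect class, severity grade, and disposition labels.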


References

Huang, H., Tang, J., Liu, T., & Huang, M. (2026). Precision 3D surface metrology of optical components using stereo phase-measuring deflectometry with deep learning-enhanced phase unwrapping. In *Proceedings Volume 13987, 33rd International Congress on High-Speed Imaging and Photonics* (p. 1398704). SPIE. https://doi.org/10.1117/12.3093993

Huang, H., Yang, Y., & Zhu, Y. (2023). Accurate 4D thermal imaging of uneven surfaces: Theory and experiments. *International Journal of Heat and Mass Transfer*, 216, 124580. https://doi.org/10.1016/j.ijheatmasstransfer.2023.124580

Khader, F., Han, S., Bösl, F., Chen, D., Gao, Y., Li, Y., ... & Kather, J. N. (2023). ChiMed-GPT, a medical large language model for bridging radiology report and clinical decision. *Nature Medicine*, 29, 2318–2327. https://doi.org/10.1038/s41591-023-02571-6

Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In *Proceedings of the 40th International Conference on Machine Learning* (pp. 19730–19742). PMLR.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In *Proceedings of the 38th International Conference on Machine Learning* (pp. 8748–8763). PMLR.