ENT Updates

Review

Research Progress in Intelligent Diagnosis of Vocal Fold Lesions Based on Multimodal Deep Learning: A Narrative Review

Gao, G., Zhao, K., & Liu, M. (2026). Research Progress in Intelligent Diagnosis of Vocal Fold Lesions Based on Multimodal Deep Learning: A Narrative Review. ENT Updates, 16(1), 26–41. https://doi.org/10.54963/entu.v16i1.2127

Authors

  • Ge Gao

    Graduate School, Medical School of Chinese PLA, Beijing 100080, China
  • Kai Zhao

    Department of Otolaryngology Head and Neck Surgery, Hainan Hospital of Chinese PLA General Hospital, Sanya 572013, China
  • Mingbo Liu

    Department of Otolaryngology Head and Neck Surgery, Chinese PLA General Hospital, Beijing 100080, China

Received: 19 December 2025; Revised: 9 February 2026; Accepted: 3 March 2026; Published: 12 March 2026

The diagnosis of vocal fold (VF) lesions relies on examinations such as laryngoscopy and voice analysis, which depend heavily on clinicians' experience. As a result, junior physicians face a higher risk of misdiagnosis and missed diagnosis. In recent years, with the rapid advancement of artificial intelligence (AI), numerous deep learning (DL)-based methods have emerged in this field. Early research primarily focused on single-modality image analysis, such as classifying white light or narrow-band images into benign or malignant lesions using convolutional neural networks (CNNs). However, such methods often fail to integrate complementary information from different modalities and fail to meet the clinical demands for multi-class classification and risk stratification. Recently, the combination of DL and multimodal fusion has become a research hotspot. This approach extracts complementary features by integrating laryngoscopic images, videos, voice, and clinical text data (e.g., laryngoscopy reports and medical record information) to construct an end-to-end intelligent diagnostic system. This narrative review summarizes the research progress of DL and multimodal fusion in the diagnosis, classification, and severity grading of VF lesions over the past five years (2020–2025). The reviewed studies indicate that multimodal DL models outperform single-modality models across multiple tasks, significantly improving the identification and classification accuracy of VF lesions. However, DL and multimodal fusion still face numerous challenges, and their clinical translation remains difficult.

Keywords:

Vocal Fold; Multimodal Deep Learning; Intelligent Diagnosis; Laryngoscopic Images; Voice Analysis
