BERT and Beyond: A Comprehensive Survey of Natural Language Processing Techniques for Information Retrieval

Journal of Intelligent Communication

Review

Al‑Sarori, M. H., Sufyan, M. M. A., Al‑Asaly, M., & Al‑Maamari, G. A. A. (2025). BERT and Beyond: A Comprehensive Survey of Natural Language Processing Techniques for Information Retrieval. Journal of Intelligent Communication, 4(2), 93–114. https://doi.org/10.54963/jic.v4i2.1706

Authors

  • Mokhtar H. Al‑Sarori

    Faculty of Information Technology & Computer Science, University of Saba Region, Marib, Yemen
  • Mubarak Mohammed Al‑Ezzi Sufyan

    Department of Computer Information Systems, Al‑Jawf Faculty, University of Saba Region, Marib, Yemen
  • Mahfoudh Al‑Asaly

    Department of Information Technology, College of Computer, Qassim University, Buraydah 51174, Saudi Arabia
  • Ghassan Abdullah Abdulwasea Al‑Maamari

    Faculty of Information Technology & Computer Science, University of Saba Region, Marib, Yemen

Received: 10 July 2025; Revised: 18 August 2025; Accepted: 21 August 2025; Published: 6 September 2025

Information Retrieval (IR) has undergone a profound transformation in the field of Natural Language Processing (NLP), shifting from traditional keyword-based approaches to neural architectures and, more recently, to advanced generative Large Language Models (LLMs). Transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) have substantially improved semantic understanding and retrieval accuracy by enabling contextualized embeddings, deeper interaction modeling, and more effective ranking mechanisms. The rise of Retrieval-Augmented Generation (RAG) represents a significant development: by integrating retrieval with generation, it produces factually grounded, context-aware, and explainable outputs while reducing the likelihood of hallucinations commonly associated with LLMs. This survey provides a comprehensive review of modern IR techniques, focusing on BERT-based retrieval models, emerging generative retrieval frameworks, evaluation methodologies, and key application domains. We provide a structured taxonomy of IR methods and conduct a comparative analysis of state-of-the-art research to highlight performance trends and methodological distinctions. Ongoing challenges, including scalability, computational efficiency, interpretability, and the limitations of current evaluation benchmarks, are critically discussed. Additionally, the survey explores emerging directions such as multimodal and cross-modal retrieval, hybrid dense-sparse architectures, knowledge-graph-enhanced retrieval, and the integration of LLMs as unified retriever-generator systems. These advancements illustrate the rapid evolution of IR and underscore the need for adaptive, reliable, and transparent retrieval solutions in increasingly complex information environments. This work aims to provide researchers and practitioners with a clear and organized overview of the evolution, current landscape, and future research opportunities in IR within the era of LLM-driven NLP.

Keywords:

Information Retrieval; Generative Models; Retrieval-Augmented Generation; Natural Language Processing; Transformer Models; Semantic Search; Deep Retrieval Architectures
