Evaluating Semantic Representation Strategies for Robust Information Retrieval Matching

Digital Technologies Research and Applications

Article

Connell, E. O., McCarroll, N., Rani, S., Curran, K., McNamee, E., Clist, A., & Brammer, A. (2025). Evaluating Semantic Representation Strategies for Robust Information Retrieval Matching. Digital Technologies Research and Applications, 4(3), 51–66. https://doi.org/10.54963/dtra.v4i3.1564

Authors

  • Eoin O Connell

    A&O Shearman, 68 Donegall Quay, Belfast BT1 3NL, Northern Ireland
  • Niall McCarroll

    School of Computing, Engineering and Intelligent Systems, Ulster University, Derry BT48 7JL, Northern Ireland
  • Sujata Rani

    School of Computing, Engineering and Intelligent Systems, Ulster University, Derry BT48 7JL, Northern Ireland
  • Kevin Curran

    School of Computing, Engineering and Intelligent Systems, Ulster University, Derry BT48 7JL, Northern Ireland
  • Eugene McNamee

    School of Law, Ulster University, Belfast BT15 1AP, Northern Ireland
  • Angela Clist

    A&O Shearman, 68 Donegall Quay, Belfast BT1 3NL, Northern Ireland
  • Andrew Brammer

    A&O Shearman, 68 Donegall Quay, Belfast BT1 3NL, Northern Ireland

Received: 20 August 2025; Revised: 3 September 2025; Accepted: 26 September 2025; Published: 11 October 2025

Vector Space Models (VSMs) and neural word embeddings are core components of modern Machine Learning (ML) and Natural Language Processing (NLP) pipelines. By encoding words, sentences and documents as high-dimensional vectors via distributional semantics, they enable Information Retrieval (IR) systems to capture semantic relatedness between queries and answers. This paper compares semantic representation strategies for query-statement matching, evaluating paraphrase identification within an IR framework using partial and syntactically varied queries of different lengths. Motivated by the Word Mover's Distance (WMD) model, we evaluate similarity using the distances between the individual words of queries and statements, rather than the common approach of comparing centroids of neural word embeddings. Results on ranked query and response statements demonstrate significant gains in accuracy when WMD-based similarity ranking is combined with neural word embeddings. Our top-performing WMD + GloVe system consistently outperformed Doc2Vec and an LSA baseline across three return-rate thresholds, achieving 100% correct matches within the top-3 ranked results and 89.83% top-1 accuracy. Beyond the substantial gains from WMD-based similarity ranking, our results indicate that large word embeddings pre-trained on vast general-purpose corpora yield portable, domain-agnostic language processing solutions suitable for diverse business use cases.
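
For illustration only, the sketch below shows one way such WMD-over-GloVe query-statement ranking can be expressed with the gensim library. It is not the authors' implementation; the query, the candidate statements, and the choice of the glove-wiki-gigaword-100 model are assumptions made purely for the example.

```python
# Minimal sketch (not the paper's code): rank candidate statements against a
# query by Word Mover's Distance over pre-trained GloVe vectors.
# Assumes gensim and its downloader data are available; WMD also needs an
# optimal-transport backend (POT or pyemd, depending on the gensim version).
import gensim.downloader as api
from gensim.utils import simple_preprocess

# Hypothetical example data; the paper's query/statement corpus is not shown here.
query = "Which party is responsible for maintaining the premises?"
statements = [
    "The tenant shall keep the property in good repair.",
    "Rent is payable monthly in advance.",
    "The landlord may inspect the premises on reasonable notice.",
]

# Load 100-dimensional GloVe vectors; any other pre-trained embedding set
# offered by gensim's downloader could be substituted.
vectors = api.load("glove-wiki-gigaword-100")

def tokens(text):
    """Lowercase, strip punctuation, and split the text into tokens."""
    return simple_preprocess(text)

# WMD is a distance, so smaller values mean closer semantic matches;
# ranking statements in ascending order of distance yields the top-k results.
ranked = sorted(
    statements,
    key=lambda s: vectors.wmdistance(tokens(query), tokens(s)),
)
for rank, statement in enumerate(ranked, start=1):
    print(rank, statement)
```

Because WMD compares individual word vectors rather than a single averaged document vector, top-k retrieval simply takes the k lowest-distance statements, which is the ranking setup the evaluation above describes.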

Keywords:

Semantic Information Retrieval; Word Embeddings; Document Similarity; Query-Statement Matching; GloVe; WMD

Copyright © UK Scientific Publishing Limited.