Enhancing information retrieval through concept‐based language modeling and semantic smoothing
Journal of the American Society for Information Science and Technology
Published online: June 5, 2015
Abstract
Traditionally, many information retrieval models assume that terms occur in documents independently. Although these models have already shown good performance, the term independence assumption is unrealistic from a natural language point of view, in which terms are related to one another. This assumption gives rise to two well-known problems in information retrieval (IR): synonymy, which causes term mismatch, and polysemy. In language models, these issues have been addressed by considering dependencies such as bigrams, phrasal concepts, or word relationships, but such models are estimated using simple n-gram or concept counting. In this paper, we address the synonymy and polysemy problems with a concept-based language modeling approach that combines ontological concepts from external resources with collocations frequently found in the document collection. In addition, the concept-based model is enriched with subconcepts and semantic relationships through a semantic smoothing technique so as to perform semantic matching. Experiments carried out on TREC collections show that our model achieves significant improvements over a single word-based model and the Markov Random Field (MRF) model.
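To make the modeling idea concrete, the following minimal Python sketch scores query concepts against a document with a two-level smoothing scheme: a standard Dirichlet-smoothed concept language model, mixed with a semantic component that borrows probability mass from ontologically related concepts. The toy RELATED mapping, the mixture weight alpha, and the Dirichlet prior mu are illustrative assumptions only; the paper's actual estimation combines ontological concepts from external resources with collocations mined from the collection.

from collections import Counter
from math import log

# Toy stand-in for an external ontology: each concept maps to the
# subconcepts / synonyms it can borrow probability mass from.
# (Illustrative only; the paper draws these from external resources.)
RELATED = {
    "car": ["automobile", "vehicle"],
    "automobile": ["car", "vehicle"],
    "vehicle": ["car", "automobile"],
}

def make_concept_lm(doc_concepts, coll_concepts, mu=2000.0, alpha=0.3):
    """Return p(c | d): a Dirichlet-smoothed concept model mixed with
    a semantic-smoothing component over related concepts."""
    doc, coll = Counter(doc_concepts), Counter(coll_concepts)
    dlen, clen = sum(doc.values()), sum(coll.values())
    vocab = len(coll) or 1

    def p_coll(c):
        # Add-one collection estimate so unseen concepts keep nonzero mass.
        return (coll[c] + 1) / (clen + vocab)

    def p_dir(c):
        # Standard Dirichlet-smoothed document model.
        return (doc[c] + mu * p_coll(c)) / (dlen + mu)

    def p_sem(c):
        # Semantic smoothing: average the estimates of concepts related
        # to c in the ontology, so a document about "car" can match a
        # query concept "automobile".
        rel = RELATED.get(c, [])
        return sum(p_dir(r) for r in rel) / len(rel) if rel else p_dir(c)

    return lambda c: (1 - alpha) * p_dir(c) + alpha * p_sem(c)

if __name__ == "__main__":
    collection = ["car", "automobile", "vehicle", "road", "engine"] * 10
    document = ["car", "engine", "road", "car"]
    p = make_concept_lm(document, collection)
    # Query-likelihood score: sum of log-probabilities of query concepts.
    query = ["automobile", "engine"]
    print(sum(log(p(c)) for c in query))

In this sketch the semantic component lets the document, which never mentions "automobile", still assign it substantial probability via its related concept "car", which is the kind of semantic matching the abstract describes.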