Automated Geocoding of Textual Documents: A Survey of Current Approaches
Published online on June 17, 2016
Abstract
This survey article describes previous research on text‐based document geocoding, i.e. the task of predicting the geospatial coordinates (latitude and longitude) that best correspond to an entire document, based on its textual contents. We describe (1) early document geocoding systems that apply heuristics over the place names mentioned in the text (e.g. names of cities and states); (2) probabilistic language modeling approaches, in which generative models are built for different regions of the world (usually obtained through a discretization based on a rectangular grid) from the words occurring in a set of georeferenced training documents, and are then used to predict per‐region probabilities for previously unseen test documents; (3) combinations of different models and heuristics, including clustering procedures, feature selection approaches, and/or language models built from different sources; and (4) recent approaches based on discriminative classification models.
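To make the second family of approaches concrete, the following is a minimal sketch of grid-based language-model geocoding. It is an illustration under stated assumptions, not the method of any particular surveyed system: the function names (`cell_of`, `cell_center`, `train`, `geocode`), the one-degree cell size, the whitespace tokenizer, the additive smoothing constant `alpha`, and the uniform prior over cells are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def cell_of(lat, lon, size=1.0):
    """Map a coordinate to the (row, col) index of a size-degree grid cell."""
    return (math.floor(lat / size), math.floor(lon / size))


def cell_center(cell, size=1.0):
    """Return the (lat, lon) at the center of a grid cell."""
    row, col = cell
    return ((row + 0.5) * size, (col + 0.5) * size)


def train(docs, size=1.0):
    """Build per-cell unigram counts from (text, lat, lon) training triples."""
    counts = defaultdict(Counter)
    vocab = set()
    for text, lat, lon in docs:
        words = text.lower().split()  # assumption: naive whitespace tokenizer
        counts[cell_of(lat, lon, size)].update(words)
        vocab.update(words)
    return counts, vocab


def geocode(text, counts, vocab, size=1.0, alpha=0.1):
    """Return the center of the cell whose smoothed unigram model gives the
    test document the highest log-likelihood (uniform prior over cells)."""
    # Words never seen in training are ignored rather than scored.
    words = [w for w in text.lower().split() if w in vocab]
    best_cell, best_score = None, -math.inf
    for cell, cnt in counts.items():
        total = sum(cnt.values())
        denom = total + alpha * len(vocab)  # additive (Lidstone) smoothing
        score = sum(math.log((cnt[w] + alpha) / denom) for w in words)
        if score > best_score:
            best_cell, best_score = cell, score
    return cell_center(best_cell, size)


# Toy georeferenced training set (illustrative data only).
training = [
    ("lisbon tagus river belem", 38.72, -9.14),
    ("porto douro river wine", 41.15, -8.61),
    ("london thames river bridge", 51.51, -0.13),
]
counts, vocab = train(training)
print(geocode("walk along the tagus in lisbon", counts, vocab))  # (38.5, -9.5)
```

The prediction is the center of the winning cell, so its accuracy is bounded by the cell size; a finer grid gives more precise coordinates but leaves fewer training documents per cell, which is one motivation for the combined and clustering-based approaches listed under (3).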