Textual Analysis in Real Estate
Journal of Applied Econometrics
Published online on October 19, 2016
Abstract
This paper incorporates text data from MLS listings into a hedonic pricing model. We show that the comments section of the MLS, which is populated by real estate agents who arguably have the most local market knowledge and know what homebuyers value, provides information that improves the performance of both in‐sample and out‐of‐sample pricing estimates. Text is found to decrease pricing error by more than 25%. Information from text is incorporated into a linear model using a tokenization approach. By doing so, the implicit prices for various words and phrases are estimated. The estimation focuses on simultaneous variable selection and estimation for linear models in the presence of a large number of variables using a penalized regression. The LASSO procedure and variants are shown to outperform least‐squares in out‐of‐sample testing. Copyright © 2016 John Wiley & Sons, Ltd.
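To make the approach concrete, the following is a minimal sketch of tokenized listing remarks entering a penalized hedonic regression. It is an illustration under stated assumptions, not the authors' implementation: the toy listings, the helper names (`tokenize`, `build_design`, `soft_threshold`, `lasso`), the unigram-only tokenizer, the `min_count` cutoff, and the penalty value are all hypothetical. The paper also prices phrases and considers LASSO variants, neither of which is shown here. The LASSO is fit by plain coordinate descent on log price, with every coefficient penalized and no column standardization, which is a simplification.

```python
import math
import re

# Hypothetical toy data: (sale price, square feet, agent remarks).
LISTINGS = [
    (310000, 1800, "granite counters and hardwood floors"),
    (295000, 1750, "hardwood floors, fenced yard"),
    (180000, 1700, "investor special, needs work"),
    (325000, 1900, "granite counters, fenced yard"),
    (175000, 1650, "needs work, sold as is"),
    (300000, 1800, "hardwood floors and granite counters"),
    (240000, 1750, "fenced yard"),
    (235000, 1700, "sold as is"),
]


def tokenize(remarks):
    """Lowercase the remarks and return the set of distinct word tokens."""
    return set(re.findall(r"[a-z]+", remarks.lower()))


def build_design(listings, min_count=2):
    """Return (names, X, y): log sqft plus 0/1 token indicators, and log price."""
    docs = [tokenize(remarks) for _, _, remarks in listings]
    counts = {}
    for doc in docs:
        for token in doc:
            counts[token] = counts.get(token, 0) + 1
    # Keep only tokens that appear in at least min_count listings.
    vocab = sorted(t for t, c in counts.items() if c >= min_count)
    names = ["log_sqft"] + vocab
    X = [
        [math.log(sqft)] + [1.0 if t in doc else 0.0 for t in vocab]
        for (_, sqft, _), doc in zip(listings, docs)
    ]
    y = [math.log(price) for price, _, _ in listings]
    return names, X, y


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, returning 0.0 inside [-gamma, gamma]."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # centered columns
    y_bar = sum(y) / n
    resid = [yi - y_bar for yi in y]
    beta = [0.0] * p
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of column j with the residual that excludes column j.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_ss[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new_beta
    intercept = y_bar - sum(b * m for b, m in zip(beta, means))
    return intercept, beta


names, X, y = build_design(LISTINGS)
intercept, beta = lasso(X, y, lam=0.01)

# Print the variables the penalty left in the model, with their implicit
# prices (approximate proportional effects, since the outcome is log price).
for name, b in zip(names, beta):
    if b != 0.0:
        print(f"{name:>10s} {b:+.3f}")
```

The L1 penalty performs variable selection and estimation in one step: tokens whose coefficients are shrunk exactly to zero drop out, and the surviving coefficients serve as the estimated implicit prices. In practice the penalty `lam` would be chosen by out-of-sample validation rather than fixed by hand, and the numbers produced by this toy data carry no empirical meaning.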