CoreLogic Econ


Turning Words Into Data: Part II

Extracting Property Condition Information from Realtor Comments

Matt Cannon    |    Property Valuation

Although the volume of delinquent borrowers and distressed sales has declined since the height of the Great Recession, lenders still have an incentive to accurately value the stock of real estate-owned (REO) properties and properties sold at foreclosure auctions. In the case of distressed sales, a property’s condition can have a significant impact on the sale price. A well-maintained property can sell for much more than a similar property that was neglected prior to the foreclosure auction or REO sale. However, property condition information is often not available when developing models to estimate property value. One potential source of this information is realtor comments from property listings.

This post follows up on a previous post and examines the property condition information contained in realtors’ comments and its impact on estimated property sales prices. Text can be transformed into quantitative content that can be incorporated into models that estimate property value. Hedonic models estimate property value as a function of directly observable factors, such as the number of bedrooms or bathrooms, as well as less easily observable information, such as property condition, which may be contained in realtor comments.
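As a rough sketch of what such a hedonic specification might look like, the equation below combines observable characteristics with indicators built from comment text. The log-price form, the linear structure, and all symbols are assumptions made for illustration; the analysis described here does not state its exact functional form.

```latex
\ln P_i = \beta_0 + \sum_{j} \beta_j x_{ij} + \sum_{k} \gamma_k w_{ik} + \varepsilon_i
```

Here $P_i$ is the sale price of property $i$, $x_{ij}$ is observable characteristic $j$ (for example, number of bedrooms), $w_{ik}$ indicates whether term $k$ appears in the realtor comments for property $i$, and $\varepsilon_i$ is an error term. Under this assumed form, a negative $\gamma_k$ would mean that term $k$ is associated with a lower price, holding the other factors fixed.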

To concentrate on sales where property condition may be an important factor, the CoreLogic analysis covered single-family residences sold at foreclosure auctions, as well as sales of bank-owned properties, in Los Angeles County from 2013 to 2015, as identified in MLS listings.

As previously noted, raw text can be transformed into quantitative information using a document-term matrix. In this analysis, documents represent individual property listings. Terms can represent individual words or groups of words. In many cases, groups of words can contain more information than individual words. For example, the meanings of “very good” and “not good” can be distinguished by word pairs, but not by the individual word “good.” The current analysis builds on the previous analysis, and uses word pairs (bigrams) to quantify text.
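The idea of a bigram document-term matrix can be illustrated with a minimal Python sketch. The two listing comments are invented for illustration, and the function names are hypothetical; this is not the code used in the analysis.

```python
from collections import Counter


def bigrams(text):
    """Lowercase the text, split on whitespace, and return adjacent word pairs."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def document_term_matrix(documents):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    counts = [Counter(bigrams(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical listing comments, invented for illustration.
listings = [
    "very good condition",
    "not good condition",
]
vocabulary, matrix = document_term_matrix(listings)

for term, column in zip(vocabulary, zip(*matrix)):
    print(term, list(column))
```

Each row of the matrix is one listing and each column is one word pair. The two listings share the column for "good condition" but differ on "very good" and "not good", which is the distinction a matrix built from single words would miss.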

Home prices can vary significantly within a given county. When estimating a model using county-wide data, text analysis may be dominated by words that indicate smaller geographic areas within the county. This was the case in the earlier blog, where the words contained in the illustrated word clouds[1] often related to neighborhoods within Los Angeles County, such as Brentwood or Crestview. In the current analysis, neighborhood traits at the zip code level were accounted for separately in order to better extract property characteristic information from realtor comments.
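One common way to account for zip-code-level differences separately is to subtract each zip code's average before estimating a term's effect (a "within" or fixed-effects transformation). The sketch below assumes that approach, a log-price outcome, and a single word-pair indicator; the article does not say which method was actually used, and the data and function names are invented for illustration.

```python
from collections import defaultdict
from math import log


def demean_by_group(values, groups):
    """Subtract each group's mean from its members (the 'within' transformation)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def within_zip_effect(prices, has_term, zip_codes):
    """Slope of log price on a 0/1 term indicator after removing zip-code means."""
    y = demean_by_group([log(p) for p in prices], zip_codes)
    x = demean_by_group([float(t) for t in has_term], zip_codes)
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("the term does not vary within any zip code")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


# Toy data, invented for illustration: two zip codes with different price levels.
zip_codes = ["zip_A", "zip_A", "zip_B", "zip_B"]
has_term = [1, 0, 1, 0]  # 1 if the comments contain a given word pair
prices = [400_000, 500_000, 800_000, 1_000_000]

effect = within_zip_effect(prices, has_term, zip_codes)
print(round(effect, 4))
```

In this toy data, listings containing the word pair sell for 20 percent less than the other listing in the same zip code, so the estimated effect equals ln(0.8) even though price levels differ sharply between the two zip codes.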

Figures 1 and 2 show the word pairs[2] with the greatest positive and negative effects on estimated sales price. As hypothesized, several of the word pairs associated with lower prices relate to property condition. Examples of word pairs associated with lower prices include “major fixer,” “need major,” and “need repair.”

[Figures 1 and 2: word clouds of the word pairs associated with higher and lower prices]

Likewise, word pairings associated with higher prices for the foreclosure or REO sales often convey information about positive property characteristics. Examples include “indoor outdoor,” “stainless appliances,” “gourmet kitchen,” “grassy yard (or area)” and “master suite.”

Non-traditional data sources such as realtor comments can contain important information on property characteristics that affect valuation. Property condition can have a large impact on sales price, especially for distressed sales, where condition can vary significantly from one property to the next. Transforming realtor comments into numeric data creates opportunities to improve the accuracy of valuation estimates and to better account for property condition.

1 Word clouds can be used to graphically represent the words (or word pairs) that have the most positive or negative effect on estimated sales price. Font size is used to illustrate the variation in effect on sales price, and words with larger font size have a greater absolute value effect on price.

2 The word pairs contained in the word clouds have been stemmed prior to generating the word clouds. Stemming refers to standardizing words, so that ‘run’ and ‘running’ appear as the same word.
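To make the stemming step concrete, here is a deliberately crude Python sketch. It uses a handful of hand-written suffix rules that are assumptions for illustration only; it is not the Porter algorithm or the stemmer used in the analysis, and it will mishandle many English words.

```python
def crude_stem(word):
    """Strip a few common English suffixes. A toy rule set, not a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run" (but keep "call", "miss").
            if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
                word = word[:-1]
            return word
    # Drop a plural "s", leaving words such as "stainless" untouched.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def stem_pair(pair):
    """Stem each word of a word pair separately."""
    return " ".join(crude_stem(w) for w in pair.split())


print(crude_stem("running"), crude_stem("run"))
print(stem_pair("needs repairs"), "|", stem_pair("needed repair"))
```

With these rules, "needs repairs" and "needed repair" both reduce to "need repair", which is why stemmed word pairs such as "need major" can look slightly ungrammatical.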

© 2016 CoreLogic, Inc. All rights reserved.