CoreLogic

Turning Words into Data

Determining Property Value from Public Listing Remarks

Matt Cannon    |    Property Valuation

Homebuyers and homeowners consider multiple factors when determining the value of a home. Physical characteristics such as the number of bedrooms, the number of bathrooms and the size of the property all affect its value. Amenities such as granite countertops and high-end appliances also affect the value, as does the neighborhood in which the property is located. Within the neighborhood, the strength of the school district and proximity to parks and public transportation also play a role.

Information about property characteristics can be used to develop price estimates for properties.  Specifically, hedonic regression can be used to value a property as a function of the property characteristics.  In addition to providing an estimate of property value, hedonic regression provides information on the implicit value of the housing characteristics [1].

Property characteristics that are numeric, such as the number of bedrooms, are obvious candidates for inclusion in hedonic regression analysis. However, it is also possible to improve the model’s accuracy by including non-traditional data sources, such as the public comments that real estate agents provide when properties are listed for sale. These comments frequently offer information about the neighborhood, property amenities and property condition. But how does one convert non-numeric content into numeric characteristics? To be used in the hedonic regression, the text of the agent comments must first be transformed into numeric variables.

The following simple example uses counts of the words contained in the comments [2] section of properties listed and sold in Los Angeles County in 2015.  The variables derived from the word counts are used in the hedonic regression in addition to the number of bedrooms, number of bathrooms and square footage provided in the Multiple Listing Service (MLS) database.
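As a rough illustration of the word-count step, the sketch below turns a few listing remarks into a matrix of counts, one column per word. The remarks, the tokenizer and the `min_listings` cutoff are assumptions made for this example; the post does not describe how the actual text was cleaned or which words were kept.

```python
import re
from collections import Counter

# Hypothetical listing remarks, invented for illustration (not MLS data).
remarks = [
    "Oceanfront condo steps from Montana Avenue",
    "Cozy starter home with swamp cooler in Lancaster",
    "Remodeled kitchen, granite countertops, near Montana shops",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs, min_listings=1):
    """Return the sorted words that appear in at least `min_listings` remarks."""
    listing_counts = Counter()
    for doc in docs:
        listing_counts.update(set(tokenize(doc)))
    return sorted(w for w, n in listing_counts.items() if n >= min_listings)

def count_matrix(docs, vocabulary):
    """One row per listing, one column per vocabulary word; cells are counts."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[w] for w in vocabulary])
    return rows

vocabulary = build_vocabulary(remarks)
X_words = count_matrix(remarks, vocabulary)

# Show the column for one word across the three listings.
col = vocabulary.index("montana")
print("montana:", [row[col] for row in X_words])
```

Each column of `X_words` can then sit alongside bedrooms, bathrooms and square footage as an explanatory variable. With real data, a higher `min_listings` value would keep rare words from adding thousands of nearly empty columns.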

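The estimated coefficients derived from the hedonic regression indicate the impact of each characteristic on the (log) price of the property. Because the dependent variable is the log of price, a coefficient is approximately the proportional change in price associated with a one-unit increase in the characteristic. When comparing characteristics derived from the comment section, coefficients with greater absolute values have a greater impact on price.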
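To make the interpretation of the coefficients concrete, here is a minimal sketch of a log-price regression fitted by ordinary least squares. Everything in it is an assumption for illustration: the listings are synthetic, the "true" coefficients are invented so the fit can be checked against known values, and the small solver stands in for whatever statistical software was actually used.

```python
import math

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def ols(X, y):
    """Least squares via the normal equations; assumes full column rank."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)

def percent_effect(coef):
    """Exact percent change in price for a one-unit increase, given a log-price coefficient."""
    return 100.0 * (math.exp(coef) - 1.0)

names = ["intercept", "bedrooms", "bathrooms", "sqft_thousands",
         "count_oceanfront", "count_swamp"]

# Synthetic listings (invented): columns follow the order of `names`.
X = [
    [1, 2, 1, 0.9, 0, 1],
    [1, 3, 2, 1.5, 0, 0],
    [1, 3, 2, 1.8, 1, 0],
    [1, 4, 3, 2.4, 0, 0],
    [1, 2, 2, 1.1, 1, 0],
    [1, 4, 2, 2.0, 0, 1],
    [1, 5, 3, 3.1, 0, 0],
    [1, 3, 1, 1.3, 0, 1],
]

# Invented coefficients used only to generate noise-free example prices.
true_beta = [12.0, 0.05, 0.04, 0.30, 0.25, -0.15]
prices = [math.exp(sum(b * x for b, x in zip(true_beta, row))) for row in X]

# Fit the hedonic regression on log price.
log_prices = [math.log(p) for p in prices]
beta = ols(X, log_prices)

for name, b in zip(names[1:], beta[1:]):
    print(f"{name}: coefficient {b:+.3f}, about {percent_effect(b):+.1f}% per unit")
```

Because the synthetic prices contain no noise, the fitted coefficients match the invented ones; real listing data would produce estimates with standard errors, and a library routine would be a better choice than solving the normal equations by hand.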

Figures 1 and 2 use word clouds to illustrate the words that have the most positive (Figure 1) and most negative (Figure 2) effect on price. In both figures, a larger font size represents a greater impact on price. Figure 2 contains a greater range of font sizes than Figure 1, indicating that the impact on sales price varies more among the most important negative words than among the most important positive words.
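One plausible way to produce such font sizes is to scale them linearly with the absolute value of each coefficient, using a single scale for both clouds so the two figures can be compared. The post does not say how the sizes were actually computed, so the scaling rule, the point sizes and the coefficients below are all assumptions.

```python
# Hypothetical coefficients on word-count variables (not the post's estimates).
coefficients = {
    "montana": 0.42, "oceanfront": 0.31, "remodeled": 0.05,
    "lancaster": -0.58, "swamp": -0.22, "fixer": -0.12,
}

def font_sizes(coefs, largest, min_pt=10, max_pt=48):
    """Map each word to a font size that grows linearly with |coefficient|."""
    sizes = {}
    for word, c in coefs.items():
        scale = abs(c) / largest  # 0..1 relative to the largest effect overall
        sizes[word] = min_pt + (max_pt - min_pt) * scale
    return sizes

# Shared scale across both clouds so sizes are comparable between figures.
largest = max(abs(c) for c in coefficients.values())

positive = {w: c for w, c in coefficients.items() if c > 0}
negative = {w: c for w, c in coefficients.items() if c < 0}

positive_sizes = font_sizes(positive, largest)
negative_sizes = font_sizes(negative, largest)

for label, sizes in (("positive", positive_sizes), ("negative", negative_sizes)):
    ranked = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
    print(label, [(w, round(s, 1)) for w, s in ranked])
```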

Los Angeles County contains a variety of neighborhoods with different average home prices. This is reflected in the word clouds, where words representing neighborhoods dominate. The word with the largest positive impact on price is “Montana,” which refers to Montana Avenue in Santa Monica. Similarly, the word with the largest negative impact on price is “Lancaster,” a city in Los Angeles County that is 70 miles north of downtown Los Angeles.

The word clouds also contain words that reflect property and neighborhood amenities.  As expected, “oceanfront” is associated with higher home prices while “swamp” is linked with lower home prices, although not for the reason one might suspect.  While actual swamps are not common in Los Angeles, the word “swamp” is used as part of the phrase “swamp coolers” (evaporative coolers), which are a less-costly alternative to air conditioning.

Non-traditional data sources such as realtor comments can contain important information on factors that affect property valuation, both negatively and positively. Transforming these comments into numeric variables provides opportunities to extend traditional analysis and improve the accuracy of valuation estimates.

[1] For more information about hedonic models, see Malpezzi, Stephen, “Hedonic Pricing Models: A Selective and Applied Review,” Wisconsin-Madison CULER Working Papers, University of Wisconsin Center for Urban Land Economic Research.

[2] A more detailed analysis could weight words by how often they appear across listings (term frequency-inverse document frequency, or TF-IDF) or could look at word groupings (n-grams) rather than individual words.
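The two extensions mentioned in note [2] can be sketched briefly. The example below uses one common TF-IDF variant (term frequency as a share of the remark's length, inverse document frequency as the natural log of listings divided by listings containing the term); other variants exist, and the remarks are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical listing remarks, invented for illustration.
remarks = [
    "Charming home with swamp cooler and large yard",
    "Oceanfront home with remodeled kitchen",
    "Starter home with swamp cooler near schools",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bigrams(tokens):
    """Adjacent word pairs, so 'swamp cooler' becomes a single term."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tf_idf(docs_tokens):
    """TF-IDF with tf = count / remark length and idf = ln(N / df)."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # number of remarks containing each term
    weights = []
    for tokens in docs_tokens:
        if not tokens:
            weights.append({})
            continue
        counts = Counter(tokens)
        weights.append({
            t: (c / len(tokens)) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return weights

word_tokens = [tokenize(r) for r in remarks]
weights = tf_idf(word_tokens)

# Words found in every remark (such as "home") get a weight of zero.
print(weights[1])
print(bigrams(word_tokens[0]))
```

Weighting this way downplays words that appear in nearly every listing, while bigrams keep phrases such as "swamp cooler" from being read as two unrelated words.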

© 2016 CoreLogic, Inc. All rights reserved.