Solutions


How Do You Solve Property Data’s Oldest and Biggest Challenge?

A Conversation with Sachin Rajpal

Real properties have a jumble of identifiers rather than a single standardized number that clearly defines each piece of property. Often, these overlapping and incomplete data sets of property information can result in a mess that takes hours to unravel before the property data is clean enough to use, whether that data is being used to understand a property’s vulnerability to a natural hazard, to measure how close a power line is to a home, or to comply with regulations to build a pipeline.

In this episode, host Maiclaire Bolton Smith sits down with Sachin Rajpal, head of enterprise data at CoreLogic, to discuss this issue and the impactful ways that CoreLogic is solving it.

Maiclaire Bolton Smith:

Welcome back to Core Conversations, a CoreLogic podcast. I am your host, Maiclaire Bolton Smith, and I’m the senior leader of research and content strategy with CoreLogic. In this podcast, we’ll have conversations with industry experts about key topics, from housing affordability to the impacts of natural disasters on property. Unique identifiers are common for the people and things we value. People have social security numbers, automobiles have vehicle identification numbers, and banks use a combination of routing and account numbers for our money. Yet real properties, which are among the most highly valued physical assets, have a jumble of identifiers rather than a single standardized number that clearly defines each piece of property. And these identifiers have changed over time, from landmarks like rocks, to addresses, to geographic coordinates.

And oftentimes these overlapping and incomplete data sets of property information can result in a mess that takes countless hours to unravel before the property data is clean enough to use, whether it’s being used to understand how vulnerable a property actually is to a natural hazard, to measure how close a power line is to a home, or to comply with regulations to build a pipeline. So today we’re talking with Sachin Rajpal, our head of enterprise data here at CoreLogic, about this exact issue and the exciting new way that we at CoreLogic are solving it. So Sachin, welcome to Core Conversations.

Sachin Rajpal:

Thank you, Maiclaire, very excited to be here. Unique identification of property is something we are very passionate about. We are doing a lot to solve for this, and I’m happy to talk more about it.

MBS:

Awesome. So good. Well, we’re really happy to have you here. So to get started, why don’t you tell our listeners a little bit about yourself, your background, and what you do here at CoreLogic?

SR:

Yeah, I’m passionate about data and analytics applied to solving complex real-world problems. My background is in technology, SaaS products, and analytics. I’ve been at CoreLogic for three years, and prior to CoreLogic, I was at Dun & Bradstreet, responsible for their risk management platform business.

MBS:

Wow. Well, you are the right person we need here to talk about this today. One thing we like to do is stay away from acronyms. Some people might know this one, but can you just quickly define SaaS?

SR:

Yeah. SaaS is software as a service. These are products that are available to customers on a subscription basis, either workflow or analytics, and they deliver the same solution at scale to multiple customer sets.

MBS:

Perfect. That was exactly what we needed. Okay. So let’s dive in. I’m really excited about this episode because it really gets to the crux of what underpins the entire housing economy. So let’s start with the basics. Data is just information at its core. There are four qualities we think about when it comes to data, and they are crucial to us here at CoreLogic: coverage, currency, completeness, and accuracy. So can you talk a little bit about what those four things actually mean?

SR:

Sure. Let me explain the definitions of these four parameters, and then we’ll go into a little bit of detail on the statistics and numbers CoreLogic has behind them. Starting with coverage: coverage very simply means how much information we have about the property ecosystem in the US. If we think about the total property universe in the US, it’s around 150 million plus. So coverage means, out of those 150 million, how much information CoreLogic has on those properties. Currency means, when a transaction on a property has happened, how soon we are able to make it available to end customers. For example, a transaction may have happened yesterday. Can we make the information on the buyer, the seller, and the sale price available to our customers through our data supply chain, through our different products, through bulk data licensing, or through APIs? Completeness means how much information we have on a particular property.

So for example, we may have coverage, we may have some information on those properties, but how rich is that information? A property comprises hundreds of attributes, for example the number of bedrooms and bathrooms, the land use, the lot size. Completeness means: do we have information on those attributes for those properties? And accuracy is, when we are saying a particular property has, for example, three bedrooms or two bathrooms, how accurate that information is. Does the property indeed have three bedrooms or two bathrooms, and so on and so forth? And this is not just for the key property attributes, but for the hundreds of attributes that describe the property.
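To make the completeness idea concrete, here is a minimal Python sketch that scores a property record by the fraction of its attributes that are populated. The record class, the handful of attribute names, and the scoring rule are all invented for illustration; they are not CoreLogic’s actual schema or metric.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class PropertyRecord:
    # A tiny, hypothetical subset of attributes; real records carry hundreds.
    bedrooms: Optional[int] = None
    bathrooms: Optional[float] = None
    lot_size_sqft: Optional[int] = None
    land_use: Optional[str] = None

def completeness(record: PropertyRecord) -> float:
    """Fraction of attributes that are populated (non-None)."""
    attrs = fields(record)
    filled = sum(1 for f in attrs if getattr(record, f.name) is not None)
    return filled / len(attrs)

# Two of four attributes are filled in, so the score is 0.5.
print(completeness(PropertyRecord(bedrooms=3, bathrooms=2.0)))  # 0.5
```

A production pipeline would likely weight critical attributes more heavily than optional ones, but the fraction-filled idea is the same.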

SR:

So let me talk a little bit about these parameters in terms of what we at CoreLogic provide. Starting with coverage: we at CoreLogic have access to both proprietary and non-proprietary data sources.

And in the enterprise data group, our role is to manage multiple data supply chains to create the nationally recognized, single-source-of-truth data repository of property information that supports our business units in their quest to help people find, buy, and protect the homes that they love. All of this starts with our broad data collection efforts. We have more than 22,000 data sources, both public and proprietary. Among the public sources are municipalities, tax assessment offices, taxing agencies, urban planning departments, and so on and so forth. The proprietary and geospatial information that we collect covers the entire United States, all 150 million plus parcels. We also capture data on transactions, for example purchase, refinance, and home equity, across all counties within days of the transaction being recorded. Transaction information may be everything from what the house actually sold for, to who bought it, to which lender provided the mortgage, etc.

SR:

Among the proprietary sources, we collect data from the MLS. MLS is the multiple listing service; realtors and brokers are typically part of a regional MLS. When you go to a realtor to sell a property, part of their value add is to list the property in their database with information, imagery, and a description of the property, as well as the asking price. That listing data is what then shows up on the websites. Through our proprietary Partner InfoNet program, also called the PIN network, we have longstanding relationships with MLS agencies and capture deep data on 80% of all active listings. We also have access to appraisal data. Appraisers and lenders use our software platforms to do the collateral appraisals that are required for underwriting the property, and as part of enabling that workflow, we collect an incredible amount of highly accurate data on property.

SR:

For about 35% of residential properties, we have detailed information collected by an independent licensed professional who has been inside the home; a number of market-leading mortgage lenders contribute appraisal data to us. We collect a lot of other data as well. For example, building permits, which is data on changes to a home, or reconstruction cost data, for example the cost of materials that go into building a home. There is also our capital markets data, which is loan information critical for assessing portfolio risk for investors. The combination of these different data sources gives us broad coverage of the US market: as I said, data on almost all 150 million parcels of US property. The other aspect of our data supply chain is the speed at which we collect and process the data. We are direct to source: 99% of our data sources are primary, that is, data is collected from the place where it is generated or officially recorded.

SR:

This gives us the speed and the currency of the data, which is very important for our customers. The other aspect is the completeness of the data. Because we have access to both proprietary and non-proprietary sources, a real variety of data sources, we are using advanced AI and machine learning techniques to increase the completeness of the data set. Our data set is rich because we are doing things like blending sources and using machine learning to impute data, so that we can provide intelligence and insights for our customers that may not be available in sources like public records. And while doing that, we are running a variety of quality checks that ensure the high accuracy of our data. So hopefully that provides both the definitions and the details behind the different parameters that you mentioned.

MBS: 

Yeah, no, that was so great, Sachin. That was a long answer with a lot of information in it, but one thing that really stuck out to me is that you just described the underpinnings of a number of things we’ve already covered on this podcast. A lot of what you mentioned we’ve touched on in different episodes, and that’s why you’re here: to really tie everything together, because we like to say here at CoreLogic that our core is our data. That is the core behind CoreLogic. So I’m super excited to have you here to dive into some of this today. We hear it all the time, and we’ve talked about it a few times on this podcast: garbage in, garbage out. But when we’re talking about data specifically, how do we know what is garbage?

SR:

Yeah, that actually touches upon the quality check processes that we have in place, and these are pretty elaborate and extensive. The value add that we provide in the enterprise data group is to make sure that we identify not just what you might call garbage, but when something is out of the ordinary. For example, we use various statistical techniques like variance analysis. To give an example, if the data on a particular residential property says that it has three bedrooms and two bathrooms, but the living area is 20,000 square feet, something is off track, right? So we have these checks, for example variance analysis, sampling, and other statistical techniques, through which we are able to detect these anomalies, and then we use multi-source cross-validation to correct them. Overall, to give you an example of the scale, in our transaction supply chain we have 20,000 data checks in place that ensure 99% accuracy of the data.
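The variance-analysis check described above can be sketched as a simple z-score test: a reported value is flagged when it sits too many standard deviations away from a sample of comparable properties. The sample figures and the 3-sigma threshold below are illustrative assumptions, not CoreLogic’s actual rules.

```python
import statistics

# Hypothetical living areas (sq ft) for comparable 3-bed / 2-bath homes.
comparables = [1450, 1600, 1520, 1710, 1480, 1390, 1650, 1550]

def is_anomalous(value: float, sample: list, z_threshold: float = 3.0) -> bool:
    """Flag a value whose z-score against the sample exceeds the threshold."""
    mean = statistics.mean(sample)
    stdev = statistics.stdev(sample)
    return abs(value - mean) / stdev > z_threshold

print(is_anomalous(1500, comparables))   # False: consistent with its peers
print(is_anomalous(20000, comparables))  # True: 3 bed / 2 bath at 20,000 sq ft
```

A flagged record would then go to the multi-source cross-validation step mentioned above rather than being discarded outright.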

MBS: 

Wow, that’s great.

SR:

So we are not a pass-through of the data that is coming in. Garbage in, garbage out typically happens when a process is a pass-through. We have a variety of data checks, quality checks in place, right from the ingestion of the data as it flows through our supply chain, to make sure that what comes out the other end is of high accuracy.

MBS: 

Okay. That is very helpful. And you also mentioned that we’ve got so much data: 1 billion property records, sourced and updated annually. And when we think about this, our record of data spans over 50 years. Now, how do we do this? It can’t just be a person sitting there manually parsing through billions of data records; that’s just not feasible. How does technology play a role in dealing with this mountain of data?

SR:

Yeah. Some of the numbers that you just mentioned speak to the scale at which we operate, and this is not possible manually. So technology is a huge part, absolutely. And it starts with our data collection efforts, right? I talked about the variety of data assets that we have access to, both proprietary and non-proprietary, which means that our data collection efforts operate at a very high scale and involve a lot of variety. And the data that we receive is not standardized. It’s in different formats; each county has its own protocol. It may be normalized or de-normalized. It may be in one file or in 10 files. So we have invested in RPA, which is robotic process automation, and we have already deployed more than 1,500 bots in production.

SR:

All these repetitive kinds of processes are where RPA comes in and automates the task. We also use OCR, which is optical character recognition, to extract data from non-digitized assets. These are the transaction documents, the deeds that are filed at the county recorder’s office; this data is actually not digitized. So we use OCR techniques to digitize the data and then process it through our supply chain. Once the data is collected, it goes through an extensive curation and transformation process, and all of that is automated through the various systems in our supply chain. And underpinning this are the various quality checks that I talked about; as I mentioned, we have over 20,000 checks using various statistical techniques, and that is underpinned by technology, analytics, and statistics. And then this leads into the big data platform.

SR:

We have a bleeding-edge GCP stack. We migrated over to Google Cloud Platform a couple of years ago, and we are using GCP for our big data. This is where we bring the variety of data sources that we have into one place, and that then enables the advanced analytics that our data scientists are doing to further enrich and curate the data. So as you can see, as the data moves from acquisition all the way to creating value through advanced analytics, technology is underpinning all of it.

MBS: 

Yeah, definitely. I mean, you’re literally taking paper records and making them digital, and that’s just so remarkable at this scale. And that leads me to think: big data is really one of those hot topics in our industry, and we’ve heard so many times that CoreLogic has all this data about all of these properties. When I think about it, I imagine trying to pick up a pack of playing cards from the floor and make sense of what goes where. So can we demystify the data world a little bit? How do we take this kind of data and make it useful? And how do we collate it so that we can actually say, “I live at 123 Main Street,” and pull up every single piece of information about that property?

SR:

That’s the big transformation that is going on at CoreLogic, and through it, what we want to do in the industry. Traditionally, when you look at how the data has been organized, it’s by source: the public records, or MLS, or the appraisals. We talk about the sources, which makes sense as we are collecting the data. But as we are consuming the data, for example when you say, “123 Main Street, give me everything about that property,” the property industry’s various data sources are fragmented; they don’t talk to each other, right? But from a consumer perspective, you want that 360-degree view of the property. What you just said was, “Give me everything about the property.” It doesn’t matter whether that’s in the public record, or in the MLS listings, or in building permits, or in some geospatial data source. As a consumer, we want all the attributes associated with the property through one call, in one view.

SR:

So that’s what we are solving through big data. That’s where we bring all these different, disparate data supply chains, the output of them, into one place, and we link them through CLIP. CLIP is the CoreLogic Integrated Property identifier. Once you link the various data sources to a particular property, think of CLIP as a hanger on which you can keep hanging additional information about the property. The various datasets that traditionally don’t talk to each other, you can keep hanging on that hanger, and that gives you a 360-degree view of the property.
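The “hanger” idea can be sketched as a join on a shared key: each siloed dataset is indexed by the same identifier, so a single lookup assembles a combined view. The dictionaries, attribute names, the `property_360` helper, and the 10-digit key below are all made up for illustration and do not reflect CoreLogic’s actual data model.

```python
# Each dataset lives in its own silo but shares the same hypothetical
# CLIP-style key, so one lookup can assemble a 360-degree view.
public_record = {"1234567890": {"bedrooms": 3, "bathrooms": 2, "last_sale": 415_000}}
mls_listings  = {"1234567890": {"list_price": 450_000, "status": "active"}}
permits       = {"1234567890": {"permits": ["roof replacement (2019)"]}}

def property_360(clip: str, *datasets: dict) -> dict:
    """Merge every dataset's attributes for one property, keyed by its identifier."""
    view = {"clip": clip}
    for ds in datasets:
        view.update(ds.get(clip, {}))
    return view

view = property_360("1234567890", public_record, mls_listings, permits)
print(view["list_price"])  # 450000: the listing attribute, hung on the same hanger
```

The point of the sketch is the design choice: once every dataset carries the same key, new sources can be "hung on the hanger" without the silos ever needing to know about each other.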

MBS: 

Yeah. That is super exciting and really a way forward for the industry, because as you mentioned, the average homeowner really wants to know all of this information about their property, and businesses want to be able to provide it for them. And it’s so exciting that CoreLogic is really moving in this direction of providing this integrated property ID we call CLIP, as you mentioned. So let’s just talk about that a little bit more. How revolutionary is this really? It seems like something that’s never been done before, and yet it’s in high demand. How will it enable all of us as members of the property ecosystem to use it? How will it help us find, buy, and protect the homes and properties that we love?

SR:

Yeah, it is pretty revolutionary when you put it in the context of solving for a seamless experience across the find, buy, protect journey. In and of itself, it’s an identifier, a 10-digit number that uniquely identifies a property. Other providers may have something similar, but what makes it unique is the connection across the different datasets to get to that property 360. It’s the combination with the variety of datasets that we have access to, and the service that it offers. Customers may have their own datasets, and CoreLogic has the datasets around the property. So CLIP is actually offered as a service, where we can help customers translate their datasets into CLIP, use CLIP to connect their property ecosystem, and then also connect it with CoreLogic’s property datasets to create that unique combination where they can do their own analysis.

SR:

So it’s not just a number; it’s also a service to actually bring the various property datasets together, for us and for our customers. In terms of how transformative it can be: CoreLogic has positioned itself as the enabler of the find, buy, protect journey as the house goes through this value chain. And the big problem the property industry is trying to solve is that it still takes 35 to 40 days for somebody to close on a house. I’ve been through house buying and selling, and I think for almost every one of us, it’s still a very disjointed experience, and it doesn’t have to be, right? The power of data, the power of connected data, actually holds the key to solving this problem. I’ll give you some examples. You go through finding the various houses you may be interested in, then you make an offer on a particular house, then it goes to a lender, and then the underwriting happens.

SR:

So similar to how pre-qualification of the borrower happens, why can’t the collateral be pre-qualified, in terms of its risk and in terms of how fast it can go through the underwriting process? That may happen when richer data is available earlier in the process. As the property goes through the various stages, from find, to the bank that does the underwriting in buy, to buying the insurance on the house as it goes through protect, think of a data packet that moves from one stage to the next and just keeps getting richer, so that at every step of the process, each party is not duplicating the effort of finding the data on the property again.

SR:

That takes out the inefficiencies, that takes out the redundancies, and it makes the process faster. The key to that is connectivity, and that’s what CLIP provides. As the data moves through the various stages, it can be referenced through that unique number, and that can help improve efficiencies. So in that regard, I see a lot of potential, to your question of whether it can be revolutionary or not.

MBS: 

Yeah, definitely a lot of potential. And just to wrap up here, sometimes we like to end these podcasts by looking into the crystal ball. The way that you’ve just explained this, it really sounds like we’ve solved the ultimate data puzzle by creating this unique identifier for property, something that’s been highly coveted, and we’re actually there. So is this it? Have we reached the peak of the data mountain? Or what does the future hold? Is there still more to come?

SR:

Oh, absolutely. I wouldn’t say that we have reached the peak; in fact, we just got started. The vision that I explained is built from various building blocks. When you think about CLIP, the first building block was to make sure that we could clip all the properties in the US, which we have already done. The second building block was: can our customers access CLIP? So we built a service on top, where we can take customers’ address-based files or assessor-parcel-number-based files and convert them into CLIP. That’s offering CLIP as a service. And the next step is the integration of CLIP and the associated services into the ecosystems within CoreLogic and those of our customers, so that the vision I just described, making find, buy, protect more efficient, comes to life. So we absolutely haven’t reached the peak. There is a long journey ahead of us in terms of integrating CLIP, our enriched data, and advanced analytics into the ecosystems across find, buy, and protect.

MBS: 

This has been so interesting, Sachin. Thank you so much for coming and shedding some light on this mountain of data that we have here at CoreLogic, how we deal with it, and what we’re doing with it to really push the industry forward and help people and businesses through the property journey as they find, buy, and protect their homes. So thank you so much for joining me today on Core Conversations, a CoreLogic podcast.

SR:

I appreciate the opportunity and we are super excited about what’s ahead of us.

MBS: 

Sounds great. So for more information on the property market and the housing economy, please visit us at https://corelogic.com/intelligence

© 2021 CoreLogic, Inc. All rights reserved.

