Analysing Singapore property prices — can I trust the data?
Can I trust my work and the data underneath it? This is usually one of the first questions my students ask. After planning their first data analysis (understanding patterns in lunchtime queue lengths at the company cafeteria is one example), they then want to talk about trust and integrity.
I think discussing these issues is great. Aside from helping to stop the spread of disinformation, thinking about integrity makes us better analysts. An analysis we can trust is also an analysis that is technically robust.
We can trust our analysis when we can trust our data. While we can and should build statistical methods into our planning, there are also quick sanity checks that can get us 80% of the way towards deciding whether we can trust our data. Here’s a personal favourite:
Check for Source, Size and Spread (3S). After wrangling with many datasets over many projects from many areas, I’ve come to feel that these three principles can be applied to almost any dataset. Even better, doing so only takes about a minute. ⏰
We can use the example of analysing data on the Singapore private property market. Below we have a table tracking the number of private residential units sold every quarter since 1999.
Private residential units (completed) sold in Singapore every quarter
Source: In one click, the dataset is downloadable from data.gov.sg, Singapore’s portal for publicly available datasets on everything from property and education to health and economics. Plus one for thoughtful accessibility. Looking more closely, the Urban Redevelopment Authority (URA) is the entity that provided the data. Plus one for reliability. The metadata clearly states when the dataset was made available (just this year, in 2019). Plus one for recency (not being outdated). There is a Licence associated with the data that we should spend time reading, but for now, knowing the data provider (URA) and seeing the thought and effort that went into presenting and maintaining the dataset means that this data source passes a basic health check! 🔎
Size: There are 20 years’ worth of data here. This is good: if we want to analyse a short time frame, we can subset the data. If we want to look at long-term trends, we have a sample that should be large enough to draw out meaningful patterns.
Spread: Beyond the number of entries, every dataset also has variables (or columns). There aren’t many here: just time and units sold. Unless we have a specific question around time and / or sales measured as units sold, this dataset may not be very helpful.
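To make the size and spread checks concrete, here is a minimal sketch of how they could be run on a downloaded file. It uses only the Python standard library. The column names (`quarter`, `units_sold`) and the sample rows are my own placeholders, not the real headers or figures from the URA dataset, so swap in whatever the actual CSV contains.

```python
import csv
import io

# Made-up placeholder rows, NOT real URA figures.
# The column names are assumptions for illustration only.
SAMPLE_CSV = """quarter,units_sold
1999-Q1,100
1999-Q2,120
1999-Q3,90
2000-Q1,110
"""


def size_and_spread(csv_text):
    """Summarise how many rows, what time range and which columns a CSV has."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # "YYYY-Qn" labels sort chronologically when sorted as plain strings.
    periods = sorted(row["quarter"] for row in rows)
    return {
        "rows": len(rows),                                   # size
        "first_period": periods[0] if periods else None,     # size (time range)
        "last_period": periods[-1] if periods else None,
        "columns": reader.fieldnames,                        # spread
    }


summary = size_and_spread(SAMPLE_CSV)
print(summary)
```

For a real file, the same function could be fed the text read from disk. A short list under `columns` is the quick signal that the spread is thin.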
In fact, spread is an interesting criterion that deserves a deeper look, since the number of variables can be key to determining how much hidden insight can be “mined”. 👷
Take this other, similar, dataset.
Private residential units (uncompleted) sold in Singapore every quarter
This second dataset is quite similar to the first one, except that it starts from 2004 and refers to uncompleted, not completed, private residences. But there is one key difference: each quarter is also broken down by market segment (Core Central Region, Rest of Central Region and Outside Central Region). This gives us slightly more information on each quarter’s sales, especially if we’re looking to add a geographic dimension to the data.
Still, the information here is quite minimal, while the property market is complex. I don’t see this dataset capturing the market’s complexity or giving us any “aha” moments. Most likely, at the end of hours of work, we would find that the Outside Central Region (not the business district but where the affluent have the means to live) has the highest number of units sold. (D’oh!) We’d be On the Road to D’ohwhere. 🤦‍♀️
We can do better.
Private residential property transactions in the whole of Singapore
This is our third dataset, also from data.gov.sg and also provided by the URA.
Good source, good size, and the spread is richer too. What stands out are the combinations between Completed/Uncompleted properties and New Sale/Resale/Subsale transactions. This combinatorial aspect of the data is useful! Combinations let us break the data down into groups and put them back together again to look for patterns. This data is probably a better reflection of the reality of the property market than the first two sources. The spread is good. 🍞
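Here is a rough sketch of what “breaking the data into groups and putting them back together” might look like in code. The field names (`type_of_sale`, `sale_status`, `units`) and every number below are hypothetical placeholders I made up for illustration; they are not the real column headers or values from the URA dataset.

```python
from collections import defaultdict

# Made-up placeholder records, NOT real URA figures.
# Field names are assumptions for illustration only.
SAMPLE_ROWS = [
    {"type_of_sale": "New Sale", "sale_status": "Uncompleted", "units": 50},
    {"type_of_sale": "New Sale", "sale_status": "Completed", "units": 10},
    {"type_of_sale": "Resale", "sale_status": "Completed", "units": 40},
    {"type_of_sale": "Subsale", "sale_status": "Uncompleted", "units": 5},
    {"type_of_sale": "New Sale", "sale_status": "Uncompleted", "units": 30},
]


def totals_by_group(rows, keys, value="units"):
    """Sum `value` for every distinct combination of the given key columns."""
    totals = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += row[value]
    return dict(totals)


# Break down by both variables, then put it back together by one of them.
by_combo = totals_by_group(SAMPLE_ROWS, ["type_of_sale", "sale_status"])
by_sale_type = totals_by_group(SAMPLE_ROWS, ["type_of_sale"])

print(by_combo)
print(by_sale_type)
```

The more variables a dataset has, the more of these groupings we can try, which is exactly why spread matters.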
Real trust is hard to build in a data analysis. But there are quick health checks we can apply to help us decide whether we want to invest effort into verifying our sources. Become a data doctor before becoming a data analyst.