Normal distributions in the wild
Normal distributions, or bell curves, appear often in data analysis. Here is a particularly cute one:
Bell curves are commonly seen because they describe many real-world patterns, from student exam scores to how the public feels about a government policy.
But the smooth bell curves that appear in textbooks are a different breed from the bell curves that emerge in the wild. Human-Computer Interaction researcher Alan Dix explains this with an example about human height. In real life, certain events, like a rare genetic condition that causes gigantism, can be “a single, overwhelming value” that skews the shape of the distribution.
We end up with something that looks more like this:
Putting on my field researcher’s hat and virtual binoculars 🕵, I “found” just such a less cleanly shaped bell curve. It’s the distribution of price per square foot of private property, as published by Singapore’s The Straits Times a few years ago. The small bumps on the right and left show that some offerings have an uncommonly low or uncommonly high price per square foot.
Normal distribution with emphasis on more extreme values
There are other properties that seem to fall into the same normal-ish pattern:
Sometimes, though, there are distributions that don’t look quite “normal” no matter how you turn your head and squint. Sometimes, I wonder whether what I’m seeing is a boa constrictor that swallowed an elephant rather than a distribution:
Boa constrictor distribution?
Other times, the distributions are clearly not normal at all. They would be better described as bimodal or multimodal, that is, as having two peaks or several peaks.
Bimodal distribution
Multimodal distribution
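To make the difference between one peak and two peaks concrete, here is a minimal sketch in Python. The numbers are invented for illustration (they are not the Straits Times figures), and `share_near_mean` is a hypothetical helper. It builds one single-peaked sample and one two-peaked sample with roughly the same average, then checks how much of the data actually sits close to that average.

```python
import random
import statistics


def share_near_mean(values, width):
    """Return the fraction of values lying within `width` of the sample mean."""
    mean = statistics.fmean(values)
    return sum(abs(v - mean) <= width for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up prices per square foot: a single cluster around 1000 ...
unimodal = [rng.gauss(1000, 100) for _ in range(10_000)]

# ... versus two clusters (cheaper units and pricier units) that also average about 1000.
bimodal = [rng.gauss(800, 50) for _ in range(5_000)] + [
    rng.gauss(1200, 50) for _ in range(5_000)
]

print("mean of unimodal sample:", round(statistics.fmean(unimodal)))
print("mean of bimodal sample: ", round(statistics.fmean(bimodal)))
print("share within 100 of the mean (unimodal):", share_near_mean(unimodal, 100))
print("share within 100 of the mean (bimodal): ", share_near_mean(bimodal, 100))
```

Under these assumptions, most of the single-peaked sample sits near its mean, while only a small fraction of the two-peaked sample does. The "average" of a bimodal distribution can describe a price that almost no unit actually has.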
Distributions in the real world are influenced by the arbitrariness we experience every day. Perhaps there was a season when property sellers had the upper hand; perhaps there was a favourable time in the market when units were cheaper. This randomness means that the assumptions behind the Central Limit Theorem, the mathematical theory behind the normal distribution, may not be met (again, Alan Dix covers these assumptions well). For starters, some assumptions, like thinking that each unit measured is independent of all other units, may not hold true.
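The independence assumption can be illustrated with a small simulation. This is only a sketch under invented assumptions: the samplers, the 0.9/0.1 weighting, and the "season" effect are hypothetical and not taken from any real property data. When each unit is drawn independently, the average of a sample settles down as the sample grows. When every unit shares one market-season shock, the average stays just as noisy however many units are measured.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable


def independent_sample(n):
    """n skewed values, each drawn independently of the others."""
    return [rng.expovariate(1.0) for _ in range(n)]


def dependent_sample(n):
    """n values that all share one 'season' shock (a hypothetical market effect)."""
    season = rng.expovariate(1.0)
    return [0.9 * season + 0.1 * rng.expovariate(1.0) for _ in range(n)]


def spread_of_means(sampler, n, repeats=1000):
    """Standard deviation of the sample mean across many repeated samples."""
    means = [statistics.fmean(sampler(n)) for _ in range(repeats)]
    return statistics.stdev(means)


for n in (10, 100):
    print(
        f"n={n:>3}  independent: {spread_of_means(independent_sample, n):.3f}"
        f"  dependent: {spread_of_means(dependent_sample, n):.3f}"
    )
```

In the independent case the spread of the mean should shrink by roughly the square root of the sample size. In the dependent case it barely moves, because measuring more units from the same season adds very little new information.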
Handling these breeds of distributions is what makes applied data analysis interesting, challenging, and more often than not, an exercise in persuasion rather than a presentation of hard facts. There is a subjectivity that needs to be made clear when we discuss an analysis (a theme that will definitely come up again).