A few months ago I wrote my first math-free intro to statistics, where I looked at the basic descriptive statistics of mean, median, mode and standard deviation. In this post, I’ll discuss the very scary concept of NORMALITY (in a statistical sense, not a philosophical one). As a reminder, this series of posts is meant for a theoretical understanding of statistical concepts sans math and (hopefully) without over-complicating things with technical terms. As last time, I’ll share some links at the end of the post so that you can better familiarize yourself with the more technical aspects, should you wish.
Why should I care whether my data is normal in the first place?
As you’ve probably figured out already, there are dozens upon dozens of statistical tests that you can use with your data. Each of these tests relies on a set of assumptions about the data in order to calculate correctly and reliably. One major assumption of many statistical tests is that the data is “normal.” Therefore, it is important to check whether your data is normal before subjecting it to a bunch of statistical tests that might not be right for it. Without considering normality, you might accidentally use the wrong test or compromise the reliability of your findings.
Beyond that, looking at the normality of your data can also tell you a lot about it. Testing for normality is a good reason to start depicting your data graphically, which is one of the best ways to start exploring the trends found within it.
What does it mean to be “normal”? (I wish I knew)
We can say data is “normal” if it follows a normal distribution. When you think of data distribution, it’s best to picture it graphically (i.e. visually). A normal distribution is often called a ‘bell curve’ and can be graphically depicted like this:
How do we know this is a normal distribution? It has a few important qualities:
- It has a mean and median that are the same. That’s the line down the middle.
- It has a peak or bump in the middle, and tapers down towards the left and right.
- The graph is symmetric, meaning there is just as much data below the mean/median as there is above it.
- See those numbers on the bottom? Those are standard deviations (check out my last post for a refresher). In a normal distribution, about 68% of the data falls within one standard deviation of the mean. About 95% of the data falls within two standard deviations, and about 99.7% falls within three standard deviations of the mean.
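If you'd like to check those percentages yourself, here is a small optional sketch in Python (my choice for illustration; it isn't a tool this post otherwise relies on). It uses only the standard library, and the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

# An idealized normal distribution with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    # Prints the share as a percentage with one decimal place.
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The printed values should land very close to the 68%, 95% and 99.7% figures in the list above. Real data will rarely match them exactly.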
Examples of normal distributions that are often given are things like human height, blood pressure or IQ. In a perfect world and under normal, stable conditions, these data would be depicted graphically much like the normal distribution graph above.
But we know the world isn’t perfect and there are plenty of factors that influence data to be ‘biased’ (meaning it leans in one direction). Indeed, almost no collected data is perfectly normal. One reason for this could be sampling bias. In the example of human height, we can’t measure all humans on the planet and might instead choose 100 people to represent the population. However, we might have accidentally chosen 50 extremely short and 50 extremely tall people, making our data look graphically like an inverted bell curve, with two bumps at the ends and a dip in the middle.
Most of the time, our data is just more complicated than this idealized depiction of normality. For example, data can be influenced by environmental or cultural factors. Some data collection processes may also rely on ‘messy’ things like human emotions or reflections. I mean, the last American presidential election probably demonstrated that human behaviour does not always follow logical paths.
So when we put data into a graph, it can take on a lot of other “non-normal” shapes. Here are a few good examples from MathIsFun.com of data that is NOT normal:
And the shape of your data matters when it comes to analyzing it (as mentioned at the start of this post).
What kind of things tell us if data is normal?
When talking about data normality, there are two important properties: skewness and kurtosis.
1. Skewness looks at how symmetrical data is on either side of the mean. More specifically, it considers the size and length of the ‘tails’ on a graph, and whether they are symmetrical on each side, or if they stretch out longer to the left or the right (i.e. are biased). Here are some examples of data with different skewness:
We can measure skewness as ‘positive’ or ‘negative.’ The first photo depicts a negative skew, where the tail reaches out to the left and the peak is on the right. The middle picture is a normal distribution. The third photo is a positive skew, where the tail reaches out to the right and the peak is on the left.
What does this tell us? Let’s go back to the example of human height. In the first photo, the mean is smaller than the median: most of our sample sits on the taller side, and a small number of very short people pull the mean down. In the second photo, the mean and median are the same, demonstrating we have a fairly normal sample of heights represented. In the photo on the right, the mean is larger than the median: most of our sample sits on the shorter side, and a small number of very tall people pull the mean up.
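To make the mean-versus-median idea concrete, here is a tiny optional Python sketch. The heights are invented for illustration, and `describe_skew` is a hypothetical helper of my own. Comparing the mean to the median is only a rough hint about the direction of skew, not a formal skewness statistic.

```python
from statistics import mean, median

# Made-up heights in cm, purely for illustration.
negative_skew = [150, 160, 170, 175, 176, 177, 178, 178, 179, 180]  # a tail of short people
positive_skew = [170, 171, 172, 172, 173, 174, 175, 180, 190, 200]  # a tail of tall people


def describe_skew(heights):
    """Return (mean, median, direction), using mean vs. median as a rough hint of skew."""
    m, med = mean(heights), median(heights)
    if m < med:
        direction = "negative (tail to the left)"
    elif m > med:
        direction = "positive (tail to the right)"
    else:
        direction = "none apparent"
    return m, med, direction


for label, sample in [("first sample", negative_skew), ("second sample", positive_skew)]:
    m, med, direction = describe_skew(sample)
    print(f"{label}: mean={m:.1f}, median={med:.1f}, skew looks {direction}")
```

In the first sample the few short people drag the mean below the median. In the second, the few tall people drag it above.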
2. Kurtosis considers the “peak” and how tall or flat our graph is (i.e. ‘how big is the bump?’). It is also concerned with the tails and how long or short they are. Here are some examples:
A negative kurtosis has a flatter, broader peak (or no real peak) and thin tails, meaning very few extreme values. On the other hand, a positive kurtosis has a sharper peak with heavier tails, meaning more extreme values than a normal distribution would have. Kurtosis describes the shape of the data rather than how spread out it is, so it is a separate idea from the standard deviation.
So what does kurtosis tell us if we think about height again? A positive kurtosis means that our dataset is more concentrated around the median, meaning we have an overwhelmingly large population of participants who are of average height, along with a handful of people who are extremely short or extremely tall (and is, therefore, biased). A negative kurtosis means that our sample is more ‘spread out’ with more even numbers of short, average and tall people. This is also biased, as short or tall people do not occur in equal numbers in nature as those of average height. The normal distribution in the middle would demonstrate that we have participants in various heights of good proportion. We have a majority in the average range, with smaller and equal numbers of short and tall outliers.
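For anyone curious how kurtosis looks as a number, here is an optional Python sketch. It assumes you have NumPy and SciPy installed, which this post doesn't otherwise require. The data is simulated, not real heights. By default `scipy.stats.kurtosis` reports 'excess' kurtosis, where a normal distribution scores about zero.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(seed=0)  # fixed seed so the simulation is repeatable
n = 100_000

# Three simulated shapes, chosen only to illustrate the idea.
flat = rng.uniform(-1, 1, n)   # flat top and almost no tails
bell = rng.normal(0, 1, n)     # the familiar bell curve
heavy = rng.laplace(0, 1, n)   # sharp peak with heavier tails

for name, sample in [("flat", flat), ("bell", bell), ("heavy-tailed", heavy)]:
    # Excess kurtosis: below zero is flatter than normal, above zero is peakier with heavier tails.
    print(f"{name}: excess kurtosis = {kurtosis(sample):.2f}")
```

The flat sample should come out clearly negative, the bell-shaped one close to zero, and the heavy-tailed one clearly positive.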
How do we measure normality in a statistical sense?
You can tell a lot about the normality of your data just by graphing it, but we need a more definite answer about whether data is normal in order to move forward. Luckily, there are a few statistical tests that you can do to measure this. The most commonly used is the Shapiro-Wilk test (although there are a handful of others, depending on special circumstances surrounding your data). For a few good resources on how to calculate and interpret normality in SPSS using these tests, please see the links at the bottom of this post.
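If you don't have SPSS handy, the same test can be sketched in Python. This assumes NumPy and SciPy are installed. The samples are simulated, and both the helper name `looks_normal` and the 0.05 cut-off are my own illustrative choices, not a rule from this post. A small p-value suggests the data departs from a normal distribution. A large one only means the test found no strong evidence against normality.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(seed=42)  # fixed seed so the simulation is repeatable

# Simulated data, for illustration only.
normal_heights = rng.normal(loc=170, scale=10, size=200)   # drawn from a bell curve
skewed_incomes = rng.exponential(scale=30_000, size=200)   # long tail to the right


def looks_normal(sample, alpha=0.05):
    """Run Shapiro-Wilk; return (True if no strong evidence against normality, p-value)."""
    statistic, p_value = shapiro(sample)
    return p_value > alpha, p_value


for name, sample in [("heights", normal_heights), ("incomes", skewed_incomes)]:
    ok, p = looks_normal(sample)
    print(f"{name}: p = {p:.4f}, treat as normal? {ok}")
```

The strongly skewed sample should be flagged as non-normal. A sample drawn from a true bell curve will usually pass, but not always, which is one reason to look at a graph alongside the test.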
So now that I know if my data is normal or not... what now?
If you’ve determined that your data is normal, you’re on your way to being able to use statistical tests that assume normality (EDIT: my colleague Quan made a good point to me about normality assumptions concerning residuals, not the variables themselves. Here’s a website about that for now, and I’ll dive more into this in another post). Don’t jump into these tests just yet, though, as there are other assumptions you need to consider first. That’s the subject of my next post in this series, but check out this link for a quick summary in the meantime.
If your data is not normal, don’t despair. There could be a number of reasons for this with easy solutions to ‘normalize’ your data. Check out this website for a good summary of how to move forward with non-normal data.
Want to get more technical? Here are a few good resources about normal distributions:
Understanding normal distributions with MATH
How to calculate skew and kurtosis with MATH
Testing for normality using SPSS
Normality tests in SPSS
Thank you for reading my second post in this series. My next post will talk about parametric versus non-parametric tests (i.e. WHAT ARE THOSE? And what assumptions do different types of statistical tests rely on?). I’ll make sure to get this next one out a little faster 🙂