I’ve recently had some questions from more qualitative-minded researchers about resources for understanding the foundations of quantitative methods. In attempting to compile a few resources to share, I found it a bit frustrating that there were relatively few ‘theoretical,’ easy-to-understand materials for beginners in this area (read: math-free). While I feel it’s also important to understand the technical details of statistical tools, I hope that a series of ‘math-free intros’ can ease some fear and pique the interest of those considering incorporating quantitative analyses.
So let’s start with the foundation: descriptive statistics.
What are descriptive statistics?
Descriptive statistics are pretty accurately named: they describe the basic features of your data. They don’t make inferences or draw conclusions. If you compare it to literature, descriptive statistics would be a simple narrative tale. In art, they would be the initial sketch or outline. The good thing about descriptive statistics is that they don’t require any fancy analytics programs — you can even do them in Excel.
When do I need descriptive statistics?
Descriptive statistics can help you take a big set of data and condense it to a more manageable form. They help identify simple trends in your data. Incorporating descriptive statistics can also help convey a lot of information in a limited amount of space. If you’re working in mixed methods or research with a qualitative focus, they can add more rigour and ‘back up’ your qualitative analysis or highlight complementary trends on a larger scale. If you’re interested in incorporating further or more advanced quantitative tools, descriptive statistics are the first step in understanding your data. They are the training wheels to your bicycle.
Alright, alright. So what are these descriptive statistics thingys?
Descriptive statistics can help you understand three concepts or trends in your data: (1) distribution, (2) central tendency, and (3) dispersion. Don’t panic yet! I promised to keep this math-free, so here we go:
1.) Distribution: the easy part
When we talk about distribution, think ‘frequencies.’ Simply put, this is when we do things like count or calculate percentages, and start using things like bar charts or pie graphs. Let’s think about classroom grades. A frequency distribution would tell you how many and what percentage of students received scores in each grade category:
50 – 59% (F): 3 (6.5%)
60 – 69% (D): 6 (13%)
70 – 79% (C): 11 (23.9%)
80 – 89% (B): 16 (34.8%)
90 – 100% (A): 10 (21.7%)
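If you’d rather not tally these by hand, a few lines of code will do it for you. Here is a minimal Python sketch, assuming a hypothetical list of scores constructed to match the counts above (any list of percentage scores would work):

```python
from collections import Counter

def grade(score):
    """Map a percentage score to a letter grade."""
    if score >= 90: return "A"
    if score >= 80: return "B"
    if score >= 70: return "C"
    if score >= 60: return "D"
    return "F"

# Hypothetical scores for a class of 46 students (illustrative only).
scores = [55] * 3 + [65] * 6 + [75] * 11 + [85] * 16 + [95] * 10

# Count how many students fall into each grade category.
counts = Counter(grade(s) for s in scores)

for letter in "FDCBA":
    n = counts[letter]
    print(f"{letter}: {n} ({n / len(scores):.1%})")
```

The same counts fall out of Excel’s COUNTIF just as easily; the point is that a frequency distribution is nothing more than counting and dividing by the total.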
This can be interesting and, in some cases, very relevant to your research questions. In some cases, you might want to demonstrate frequencies with micro-level details. However, oftentimes a broader, more macro perspective is needed, and frequency distributions don’t necessarily demonstrate trends in your data. They also take up a lot of space and can cause information overload. Luckily, we have tools to simplify this data:
2.) Central tendency: looking more into data trends
Central tendency measures reduce the information in a frequency distribution to a much more manageable, quick-and-dirty form. They answer: What is the most ‘typical’ value in your data? How do we sum it up quickly? You can think of central tendencies as demonstrating ‘stereotypes’ in your data in a few different ways:
- Mean or ‘average’: You know this term from every news article ever written about research. The mean is when you add up all the values and then divide that sum by how many values you have. In many cases, the mean is sufficient to highlight the ‘typical’ value.
- Median: The median can be thought of as the ‘middle.’ When we list out 1, 2, 3, 4, 5, the median is ‘3’ because it sits exactly in the middle of the values. Why would you use this instead of (or in addition to) a mean? Let’s say you have ‘outliers’ in your data (i.e. values that lie far outside the rest of your data). In this case, a median can give a ‘fairer’ typical value.
- Example: We have three houses on a street with estimated worths of £300,000, £400,000 and £2,000,000. The mean housing price is £900,000, which makes the neighbourhood seem more ‘well-to-do’ than it actually is. In reality, it’s just two ‘normal’ houses and one really fancy one (not three pretty darn nice ones). The median, however, is only £400,000, which gives a much more accurate description of the neighbourhood.
- Mode: The mode is the number that occurs the most often. This can show you the ‘most popular’ or ‘highest frequency’ score or choice. It is possible to have multiple (or many) modes (example: when asking about what pet people own, the most popular answers — cat and dog — might have the same number of responses). You can also have no mode (example: no participants have the exact same weight).
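All three of these measures are one-liners in most tools. As a sketch, here they are in Python’s standard `statistics` module, reusing the house prices from the example above plus a hypothetical pet survey:

```python
import statistics

# The street of three houses from the example above.
prices = [300_000, 400_000, 2_000_000]

mean = statistics.mean(prices)      # pulled upward by the £2m outlier
median = statistics.median(prices)  # a 'fairer' typical house price

# Mode: the most frequent value. multimode() returns every value that
# ties for most frequent, so it handles the 'cat and dog' case gracefully.
pets = ["cat", "dog", "dog", "fish", "cat"]
modes = statistics.multimode(pets)

print(f"mean={mean}, median={median}, modes={modes}")
```

Note that the outlier drags the mean far above the median here, which is exactly the situation where reporting the median is the kinder choice.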
Including one or several of these scores can very quickly highlight and condense data trends in a limited amount of space. At the same time, however, central tendencies don’t tell you very much about how spread out the values are. Have no fear, there’s an easy fix for that:
3.) Dispersion: understanding ‘spread’
Central tendencies can occur in several different ways. Think back to the example of classroom grades. If the class test score mean (‘average’) is 75%, one explanation could be that everyone in the class scored a ‘C’ (70-79%) on the test. However, another explanation could be that half the class scored 50% and half the class scored 100%. As a teacher, this gives two very different perspectives on your classroom. In the first scenario, everyone is performing at a decent level. In the second scenario, half the class is failing and half the class is bored. Thankfully, we have a few tools that can demonstrate how we can ‘read into’ the story told by the central tendency.
- Minimum and maximum: The most primitive way to show the ‘spread’ in the data is to simply state the ‘highest’ and ‘lowest’ scores recorded. In the test score example, we might see (Min = 43%, Max = 100%). This can help demonstrate how far apart observed values are. However, it doesn’t tell us whether these minimums or maximums are outliers, or how common the lowest and highest scores are.
- Range: A quicker way to demonstrate this is through the range, which is the largest value minus the smallest value. If the range of test scores is small, this means everyone scored mostly the same. If the range is large, it means there are wide variations between scores. However, the range can also be misleading due to outliers. If only one student scored 100% and the rest scored 75%, this means that the range is 25, which gives an inaccurate picture of what is actually happening.
- Standard deviation (SD): A more accurate way to describe the spread is the standard deviation. Standard deviation shows us how close or far the overall data is from the mean (average). A small standard deviation means most of the data is close to the average (i.e. the classroom where everyone got a ‘C’). A high standard deviation means there is high variation in the data (i.e. the class where half failed and half aced it). A standard deviation of zero is very rare, but it means that everyone scored or responded exactly the same. Because I promised no math, I won’t go into the details of how this is calculated, but I will provide some resources at the bottom of this post to point you in the right direction for understanding the mechanics.
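To make the classroom scenario concrete without doing any math by hand, here is a small Python sketch comparing the two hypothetical classes. (One honest caveat: Python’s `statistics` module distinguishes `pstdev`, the population SD, from `stdev`, the sample SD; when you are describing exactly the students you measured rather than generalising beyond them, the population version is the natural fit.)

```python
import statistics

# Two hypothetical classes with the same mean (75%) but very different spreads.
steady_class = [75] * 10             # everyone scored a 'C'
split_class = [50] * 5 + [100] * 5   # half failed, half aced it

for name, scores in [("steady", steady_class), ("split", split_class)]:
    spread = max(scores) - min(scores)      # the range
    sd = statistics.pstdev(scores)          # population standard deviation
    print(f"{name}: mean={statistics.mean(scores)}, "
          f"min={min(scores)}, max={max(scores)}, range={spread}, SD={sd}")
```

The means are identical, but the range and standard deviation immediately expose how differently the two classes are actually performing.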
That’s a lot of stuff. Do I have to include all of that in my work?
Not necessarily. I personally find it useful to calculate all of these in my initial analysis phases, just to get a ‘feel’ for the data and familiarize myself with what I’ve collected. However, what you decide to write up, publish or disseminate depends heavily on your data set and your research questions. Some of these calculations may simply not make sense in your context. That’s where understanding the theoretical background and meaning of statistical tools comes in handy.
Want to get more technical? Here are a few good resources for descriptive statistics:
Descriptive statistics in presentation format
How to calculate these using MATH (shudder)
Making these calculations in Excel
Making these calculations in SPSS
Thanks for reading the first part in this series! My next post will dive into normal distributions and kurtosis/skew (i.e. Does my data have ‘bias’? Is it ‘symmetrical’? And why do I care?).