Data visualization is fast and easy for me because I have a solid foundation in spreadsheet skills. I feel totally at home in spreadsheets. Nerd alert: I actually enjoyed doing taxes this year (and FINALLY finishing and getting a refund didn’t hurt, either).
I initially learned about spreadsheets in the early ’90s from my nerdy economist dad. I was approximately 4 feet tall and was supposed to be studying for a 2nd grade spelling test. My dad convinced me that learning formulas early and practicing them often would change my life. As a 2nd grader, that sounded ridiculous. As an adult, I’m so thankful for those lessons. We sat together in the home office, staring at the glowing Lotus 1-2-3 screen, and practiced averages and medians. I didn’t know what a median was. But I was promised that it would be important someday. Turns out, my nerdy dad was right. This is the same guy who got me a toolbox for my 8th birthday that same year. I asked for an American Girl doll. Go figure. But I used that hammer and screwdriver just last weekend, so… thanks, dad.
Nowadays I spend the vast majority of my time teaching others about data visualization. But all too often, my students run into roadblocks because they don’t have a solid foundation in spreadsheet skills. A graph that should take them 5 minutes to design ends up taking an hour. They haven’t been exposed to the time-saving shortcuts yet. They’re not sure how to transform their raw dataset into summary tables (which would then feed into their graphs). Data visualization feels overwhelming because the data analysis is overwhelming.
Today, let’s relieve some of those data analysis burdens! I’m going to share my favorite two techniques for categorizing values.
You might need to categorize values if:
- You have a list of zip codes but you really just care about the states.
- You have a list of states but you really just care about the regions where those states are located.
- You have a list of countries but you really just care about regions of the world.
- You have a list of ages (0, 1, 2, 3, 4, 5, etc.) but you really just care about age ranges (0-9, 10-19, 20-29, etc.).
- You have a list of schools but you really just care about which district the school is located within.
- You have a list of test scores (40%, 55%, 70%) but you really just want to focus on students who passed or didn’t pass the exam.
- You have a list of body mass indices (19, 24, 29, 32, etc.) but you want to categorize the raw numbers into underweight, normal weight, overweight, and obese.
- You have a list of languages spoken but you really want to divide people into those who speak Mandarin and those who don’t.
- You have a list of countries where people were born but you really just want to divide people into born in U.S. and not born in U.S.
- … and so on.
The raw dataset you’re given is rarely the one you actually want for your analyses. Clean, re-code, and re-categorize until you’ve got the groupings you need.
Recategorize Values with =IF()
There are a couple ways to categorize your values: with if or vlookup.
This first option uses if. In this example, we’ll categorize a full list of produce types (apples, bananas, carrots, or watermelon). In the video, I’m demonstrating these formulas on a PC with Excel 2016. The formulas aren’t unique to Excel, though. You can use these formulas in any spreadsheet program (like Google Sheets, which I actually use more often than Excel these days).
First, we’ll categorize the individual values as either being either an apple or other type of food. I use these simple two-category breakdowns in projects a lot. No, I’m not usually categorizing apples. That would be a fun job though–an apple analyst! But I use this same technique to handle a lot of the transformations in the bullet point list above. For example, you could use this exact approach to categorize a long list of countries into being either United States or other.
Think of if formulas as if-then statements. You have to fill in three pieces of the formula:
- A logical argument (a mini statement that includes the greater than symbol, less than symbol, and/or equal to symbol)
- What your computer should do if that mini logic statement is true
- What your computer should do if that mini logic statement is false
Second, we’ll categorize the individual values as being either a fruit or a vegetable. This formula is a little more complex. It’s called a nested if formula because there are if statements inside of if statements inside of if statements inside of if statements! You’ll feel like a million bucks the first time you get a nested if formula right. But nested if statements are also a lot of work… which brings us to the second option, using a lookup formula like vlookup.
Recategorize Values with =VLOOKUP()
Your second option for categorizing your values is using a lookup function like vlookup.
I use lookup functions to join tables together. You can use vlookup in all sorts of scenarios. That’s why it’s one of my all-time favorite features of spreadsheets.
In this scenario–using vlookup to categorize values–we’ll need to build a separate lookup table. In this video, I’ll show you how I have the regular table that I’m trying to fill in. That’s the table where I want to turn my apples, bananas, carrots, and watermelons into apples or others and then into apples or vegetables.
Once we’ve built our lookup table, then we can use vlookup to help those tables talk to each other. I show you how in this video.
You can use either if or vlookup to sort your values into categories. You’ll get the same result either way. If tends to work best when you’ve only got a couple of categories. Vlookup tends to work best when you’ve got lots and lots of categories. Vlookup also tends to work better when your dataset is especially messy (apple and apples and typos like appel and aple). But, to some extent, it’s also a matter of personal preference. Both techniques take a while to master, so go easy on yourself if you have to practice a few times before you’re fluent.
Download the Spreadsheets
Want to practice if and vlookup? Download the materials and practice along with me.
Comment and Share Your Ideas
Share some real-life examples. What types of values have you needed to categorize in your own projects? I tried to add a ton of examples to the bullet point list above, but there are infinite ways to apply these techniques.