Thank you, Ann Emery and thanks especially to the visitors of AnnKEmery.com. It meant a lot to be asked to do a guest blogpost because Ann’s approach is practical, focused on real-world experience, and her dedication to empowerment of others is a key theme at my own site, DataScopic.net.
I’ve been working with Excel and data for 15 years and developed a skill for scrubbing data. So, data quality is always on my mind. This year, I’m teaching more workshops so that others are empowered to manage their own data quality and develop sound spreadsheets.
Hopefully, you’ll find this blogpost useful. Please comment, ask questions and be in touch.
– Oz du Soleil
Today, we’re going to discuss data quality, messy data, or, as I’ve described in several blogposts, crap data. As a complement to this post, there is a dataset of 455 rows for us to work. Click here to download the sample dataset.
We Have Our Data: What Do We Want To Know?
We want to dig in and find out things like:
- Of the members we had in 2009, how many are still active members?
- What were the 2010 donation amounts of the currently active vs. inactive members?
- What were the 2008-2010 donations for the states where we plan to close offices: IN, MA and SC?
- What were the 2008 donation totals of the Gold, Platinum and Lifetime members?
- Are our Lifetime members clustered in a region, in a few states or, is there no correlation between residence and Lifetime membership?
This is so easy with pivot tables. You don’t need subtotals or SUMIFS formulae. But… stop!
Question: Can We Trust This Data?
A clever use of pivot tables is to throw your information into a pivot table as soon as it arrives at your desk. There are two things that we want to know before we start our analysis:
- What are we dealing with?
- We have states in the dataset. Do we have 50 states and Washington, DC? How about Puerto Rico, Guam and US Virgin Islands?
- For Marital Status, do we have Married, Single, Widowed, Divorced and Domestic Partnership? Or, just Married or Single?
- Is this clean?
- Are there empty fields? Where are they and are they critical? We can live with a missing fax number, we can’t live without a missing membership level.
- Are there any duplicates that need to be merged into single entries?
- Is there anything just plain bizarre? Are there complete addresses in the State field or, “NOV” in a field that should only have YES or NO?)
Don’t make a single bar chart or summary table until we know the answers to those two questions.
Answer: Use a Pivot Table to Assess Data Quality
The old way of checking our data quality would be to scroll through, eyeballing for obvious duplicates; we would sort by the State field and eyeball through for blanks and nonsensical entries.
That is painful, tedious, and time-consuming. Eyeballing datasets is also prone to errors and must end. TODAY. Rest your eyes. A pivot table can save hours or even days, depending on the size and complexity of the dataset.
In this video, I generate a pivot table and focus only on the data quality. We see that there are duplicates and bizarre information that render the dataset untrustworthy until we get it cleaned up. As you watch the video, don’t focus too much on the results or the “how to.” Instead, listen to the thought process and questions I ask about the dataset.
Now we know:
- What are we dealing with?
- There are 25 states represented in our dataset, including Puerto Rico. Eight people aren’t assigned to a state.
- There are 5 membership levels: Rookie, Silver, Gold, Platinum, and Lifetime.
- There are 422 members.
- In terms of marital status, we only have Married and Single options represented in the dataset.
- We also know our donation levels between 2008 and 2010.
- Is this clean? NO!
- Kenneth is in the dataset 5 times. Adara is represented 3 times.
- There are 422 members and 455 rows of data. That’s 33 rows too many. They need to be investigated and merged into single entries.
- “17” is not a state. The people who live in “17” need to be researched and corrected. Also, review the data-entry process to see how that was allowed.
- There are 56 people whose active/inactive status is unknown.
Now What? Conclusions
The dataset has to be scrubbed. More importantly, Ann said it best in one of our conversations: “The main skill in working with data is developing your personal computer program: your brain.”
One goal of the video is to show how to think through the ways we might expose crap data. Using pivot tables eliminates the need to eyeball for errant data. This minimizes the filtering, sorting and scrolling that we’d otherwise use. Pivot tables save time and yield more accurate insight than our old ways.
For many years this wasn’t something I even thought of, and I was pretty darned lucky that nothing went wrong. Eventually I just got embarrassed when my boss kept noticing things that didn’t add up. The problem wasn’t my math. It was naive trust of the data that had been handed to me.
I’m curious. How do you go about investigating data quality? How much time do you spend on it? What happens when you expose miscreant information?