Excel: Testing Data Quality in ExcelBy
Microsoft is great at adding new functionality to Excel but hiding it away in an obscure corner of the program where it’s difficult to discover. For instance, there are several data quality tools that were added in 2019, but they’re hidden where most people wouldn’t discover them.
Figure 1 shows the top of a data set with some survey responses. Column A contains a unique ID. Column B contains age; Column C is a multiple-choice question with answers A, B, C, or D; and column D contains a yes/no question where the answers should be Y or N. Before diving into processing the survey data, you’d like to make sure the data is accurate.
LOADING YOUR DATA TO POWER QUERY
The Data tab contains a section of tools labeled “Get & Transform Data” that dates back to Excel 2016. When you use any of the tools in Get & Transform Data, Excel will launch the Power Query Editor. In 2019, Microsoft added three data profiling tools to the View tab in the Power Query Editor. These are easy to use to understand the composition of the values in each column.
The Get & Transform Data tools work best with either a named range or a data set that has been formatted as a table using Ctrl+T. In Figure 1, the data is formatted as a table.
The command that you need on the Data tab has been given different names recently. For many years, it was called “From Table/Range.” Then Microsoft allowed for an array formula to be analyzed, so it changed the command to “From Sheet.” Many people said this was confusing, so Microsoft recently changed it to “From Selection.” While my Office Insider Excel builds currently show “From Selection,” your version of Excel might still use “From Table/Range” or “From Sheet.” In all cases, the location and the icon have remained the same.
Select one cell in your table and choose From Selection. A new window will open with the Power Query Editor:
- The Query Settings pane will be on the right-hand side of the editor. You won’t need this for this task, so close the pane (using the X to the right of “Query Settings”).
- Across the top are tabs for File, Home, Transform, Add Column, and View. Select the View tab.
- In the lower-left is a status bar showing “4 Columns, 999+ Rows Column profiling based on 1000 rows.” Because Power Query is often used for cleaning data, it was designed to show you only the first 1,000 rows. Some people refer to this as “the first inch of water in the fire hose of data.” The 1,000-row limit is fine for writing rules to cleanse data, but it needs to be changed when you’re using the data profiling tools. Click the words “Column profiling based on top 1000 rows,” and a pop-up menu appears with two choices. Choose “Column profiling based on entire data set.”
CHECKING DATA QUALITY
Look on the View tab in the Power Query Editor ribbon (Figure 2). The Data Preview section has five checkboxes, and the Data Profiling Tools are all turned off by default. You can turn on any or all of these tools.
The Column Quality section is outlined in red in the figure. It shows the percentage of cells in the column that are valid, error, or empty. Notice that the Empty is reported as “<1%” for the last two columns. This means that there are some empty cells in that column.
The Column Distribution section is outlined in blue. It contains a histogram of the results. There are some red flags here. The first column is called UniqueID, but the Column Distribution shows only 7,994 unique values out of 7,997 values. That means perhaps three people managed to answer the survey three times. Or maybe one person answered it four times.
Look at the Age column. This survey is of college undergrads. You might expect people to be 18, 19, 20, 21, or 22, and this makes sense given the five tall columns. But there were three other answers that are outside of the normal answers.
The Question 2 column shows a few empty answers. You expected Y or N in this column, but the histogram shows four columns instead of two. To dig deeper into Question 2, click the heading for Question 2 and turn on the Column Profile. This draws the large section outlined in teal at the bottom of the screen.
A bar chart on the bottom right shows the relative popularity of each answer in the column. Hover the cursor over any bar for the exact count.
Once you’re done investigating the data, use the X in the top right of the Power Query Editor to close it. Excel will ask if you want to save your query. Since this exercise was merely for information gathering and not to clean the data, you can choose to discard it.