Skip to main content Skip to Footer


May 14, 2015
To Clean or Not To Clean
By: Joe Bynoe, Data Insights R&D group

Similar to data munging described in my previous post, enterprises must clean their data to derive the best possible insights for the business. Data cleaning is part of the “transform” step in the extract, transform, load (ETL) process. Arguably the most tedious of all the steps in the data manipulation pipeline, cleaning is the act of fixing spelling errors, deleting erroneous white spaces, removing duplicate entries and performing any other action to eliminate inaccurate entries in a data set. With big data containing millions of entries, addressing these issues can be prohibitively time consuming.

What happens if a company decides not to clean its data? Embarrassing mistakes are a best-case scenario; worst case, executives draw incorrect conclusions about the business and act on them.

Thankfully, there are a number of data cleaning programs that when used correctly can reduce a month-long project to mere days. This is a fast growing market, and here are a few new tools that have entered recently:

Open Refine (Formerly Google Refine)
Once owned by Google, Open Refine is an open source project that provides users with an array of cleaning tools, including the ability to quickly remove white spaces and split columns based on delimiters. One useful tool is the built-in clustering algorithm that allows a user to quickly find and group similar entries.

For instance, say an accountant is managing a spreadsheet with entries for Accounts Payable labeled as “AP” and “Accounts Payable.” Intuitively, these are equivalent; however, to a visualization tool, these are two distinct entries. Using the built-in clustering algorithm, Open Refine will attempt to find entries that are related and allow renaming of them all at once. The ability to find these similarities in large data sets by simply clicking a button reduces cleaning time dramatically. 

Growing out of a research project at Stanford University called Data Wrangler, Trifacta offers cleaning tools similar to those found in Open Refine (including the clustering algorithm), but complements them with an intuitive user interface and even more time saving features. Trifacta works on a sample (up to 500MB) of a dataset, allowing users to test transformations before extending them to the entire data set. As the data set grows, so does the run time necessary to perform the transformations. By only working with a sample, the tool keeps run times low, allowing users to quickly iterate the process until completely satisfied with the steps.

Another one of Trifacta’s features is the “Predictive Interaction,” which recommends the next best cleaning step based on the provided data set. If highlighting a “/” in a column, the recommendation engine might suggest splitting the column into two columns using the “/” as the delimiter. Machine-learning algorithms help Trifacta learn from a user’s actions, tailoring its recommendation engine accordingly. This takes some of the guess work out of cleaning data and provides the means to quickly manipulate data.

These are just two of the many available tools capable of reducing the time enterprises spend cleaning data. It is important to evaluate all of the features and functionality to find the tool that will work best for your business.

To learn more about the critical steps companies must take to make big data ready to generate business insights,

More blogs on this topic


      Industry & topics highlighted