Contrary to what this clever twist on a classic Shakespearean line suggests, there is no option when it comes to data: enterprises must munge. Data munging, otherwise known as data wrangling, is the process of converting a raw data set (extracted from any number of sources) into a new data set that is structured for its intended use, such as analysis or visualization. In the extract, transform, load (ETL) process, munging is part of the “transform” step.
It’s rare for fresh data from the pipelines to be in a format that is immediately usable. The data may be spread over multiple files, with varying formats requiring manipulation. Inevitably, it takes some effort to get it into a form that a visualization engine can interpret correctly.
Companies that use Excel or SQL to manipulate a table are already applying munging techniques. Although the list is not exhaustive, some of the most common munging transforms are filtering, joining and creating calculated fields.
Filtering is the process of eliminating data based on a set of criteria. For example, a company might take sales data for the US and only keep the data for California. This is important when delivering a message focused purely on a subset of the total data set.
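As a rough illustration of filtering, the sketch below keeps only the California rows from a small set of US sales records. The field names (state, store, amount), the figures and the helper filter_by_state are all invented for this example; a real data set would have its own layout.

```python
# Hypothetical US sales records; field names and figures are illustrative only.
sales = [
    {"state": "CA", "store": "San Diego", "amount": 1200.0},
    {"state": "NY", "store": "Albany", "amount": 950.0},
    {"state": "CA", "store": "Fresno", "amount": 430.0},
    {"state": "TX", "store": "Austin", "amount": 780.0},
]


def filter_by_state(rows, state):
    """Keep only the rows whose 'state' field matches the given state code."""
    return [row for row in rows if row["state"] == state]


# The filtered set is a new list; the original records are left untouched.
california_sales = filter_by_state(sales, "CA")
```

The same idea corresponds to a WHERE clause in SQL or a column filter in Excel: the criterion is stated once and applied to every row.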
Joining involves bringing together disparate data sources to potentially enrich the visualization message. A local CSV file might provide longitude and latitude information for all of the offices in a network, while a server might contain a file with headcounts for each office. In order to visualize this on a map, the information needs to be combined. There are a number of different joins (inner, left outer, right outer and full outer), all of which combine data into new data sets; they differ in how they treat records that have no match in the other source.
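The office example can be sketched as an inner join, which keeps only the offices that appear in both sources. Everything here is assumed for illustration: the CSV text stands in for the local file, the dictionary stands in for the headcount file on the server, and the office names, coordinates and headcounts are made up.

```python
import csv
import io

# Stand-in for a local CSV file of office coordinates (made-up values).
LOCATIONS_CSV = """office,latitude,longitude
Boston,42.36,-71.06
Denver,39.74,-104.99
Seattle,47.61,-122.33
"""

# Stand-in for the headcount file held on a server (made-up values).
headcounts = {"Boston": 120, "Denver": 45, "Austin": 30}


def inner_join(location_rows, headcount_by_office):
    """Return one combined record per office present in BOTH sources."""
    joined = []
    for row in location_rows:
        office = row["office"]
        if office in headcount_by_office:
            joined.append({
                "office": office,
                "latitude": float(row["latitude"]),
                "longitude": float(row["longitude"]),
                "headcount": headcount_by_office[office],
            })
    return joined


locations = list(csv.DictReader(io.StringIO(LOCATIONS_CSV)))
office_map_data = inner_join(locations, headcounts)
```

In this sketch Seattle is dropped because it has no headcount, and Austin is dropped because it has no coordinates. An outer join would keep such unmatched offices and leave the missing fields empty, which is the main design choice when picking a join type.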
Calculated fields show the result of a specified formula. Sometimes the data doesn’t contain the explicit information needed. For example, if a company wanted to display revenue but the data set only contained net income and expenses, a calculated field would contain the result of adding the two values, since revenue equals net income plus expenses.
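A calculated field can be sketched as a new column derived from existing ones. The example below assumes the simplified relationship that revenue equals net income plus expenses; the quarter labels, the amounts and the helper add_revenue are hypothetical.

```python
# Hypothetical quarterly figures; only net income and expenses are recorded.
records = [
    {"quarter": "Q1", "net_income": 40_000, "expenses": 160_000},
    {"quarter": "Q2", "net_income": 55_000, "expenses": 170_000},
]


def add_revenue(rows):
    """Return new rows with a calculated 'revenue' field added.

    Assumes revenue = net income + expenses (a simplification).
    """
    return [
        {**row, "revenue": row["net_income"] + row["expenses"]}
        for row in rows
    ]


# The input rows are copied, so the raw data set stays unchanged.
with_revenue = add_revenue(records)
```

Building new rows instead of modifying the originals keeps the raw data available in case the formula needs to change later.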
Munging expedites data visualizations
In the workplace, a good munging tool can save data handlers several hours a day. When it is combined with a cleaning tool, it’s common to see a project timeline shrink from weeks to days. But before purchasing a munging tool, companies should evaluate its features and make sure it includes the following:
Visual flow drag-and-drop interface, which allows for easier debugging and quick substitution of transformations
Input/output of various data sources and formats
Large built-in library of standard actions with the option to customize using code
Now that we’ve discussed the power of data munging, I’ll ask again: To munge or not to munge? Clearly, there is no option…enterprises must munge!
In my next post, I’ll talk about another critical step in getting data ready to generate business insights—cleaning.