May 13, 2015
To Munge or Not to Munge?
By: Joe Bynoe, Data Insights R&D group

Contrary to what this clever twist on a classic Shakespearean line suggests, there is no option when it comes to data: enterprises must munge. Data munging, otherwise known as data wrangling, is the process of converting a raw data set (extracted from any number of sources) into a new data set that is structured for analysis. In the extract, transform, load (ETL) process, munging is a part of the “transform” step.

It’s rare for fresh data from the pipelines to be in a format that is immediately usable. The data may be spread over multiple files, with varying formats requiring manipulation. Inevitably, it takes some effort to get it into a form that a visualization engine can interpret correctly.

Companies that use Excel or SQL to manipulate a table have already applied some munging techniques. Although this is not an exhaustive list, some of the most common munging transforms are filtering, joining and creating calculated fields.

Filtering is the process of eliminating data based on a set of criteria. For example, a company might take sales data for the US and only keep the data for California. This is important when delivering a message focused purely on a subset of the total data set.
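To make the California example concrete, here is a minimal sketch of a filter in plain Python. The records, the field names (`state`, `city`, `amount`) and the `filter_rows` helper are all hypothetical, invented purely for illustration; a real munging tool would expose filtering through its own interface.

```python
# Hypothetical sales records; field names and figures are invented for illustration.
sales = [
    {"state": "California", "city": "San Jose", "amount": 1200.0},
    {"state": "Texas", "city": "Austin", "amount": 950.0},
    {"state": "California", "city": "Fresno", "amount": 400.0},
    {"state": "New York", "city": "Albany", "amount": 700.0},
]


def filter_rows(rows, predicate):
    """Return only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


# Keep only the California rows and discard the rest of the US data.
california_sales = filter_rows(sales, lambda row: row["state"] == "California")

for row in california_sales:
    print(row["city"], row["amount"])
```

Passing the criterion in as a function keeps the filter reusable: the same helper could just as easily keep rows above a sales threshold.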


Joining involves bringing together disparate data sources to potentially enrich the visualization message. A local CSV file might provide longitude and latitude information for all of the offices in a network, while a server might contain a file with headcounts for each office. In order to visualize this on a map, the information needs to be combined. There are a number of different joins (inner, left outer, right outer and full outer), all of which combine data into new data sets.
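The office example can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the office names, coordinates and headcounts are invented, the `inner_join` and `left_join` helpers are hypothetical, and the sketch assumes each office appears at most once in the headcount data.

```python
# Hypothetical inputs: office coordinates (as if parsed from a local CSV file)
# and headcounts (as if fetched from a server). All values are invented.
locations = [
    {"office": "Chicago", "lat": 41.88, "lon": -87.63},
    {"office": "Denver", "lat": 39.74, "lon": -104.99},
    {"office": "Miami", "lat": 25.76, "lon": -80.19},
]
headcounts = [
    {"office": "Chicago", "headcount": 120},
    {"office": "Denver", "headcount": 45},
    {"office": "Seattle", "headcount": 60},
]


def inner_join(left, right, key):
    """Keep only rows whose key appears in both inputs."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**l_row, **right_by_key[l_row[key]]}
        for l_row in left
        if l_row[key] in right_by_key
    ]


def left_join(left, right, key):
    """Keep every left row; right-side fields with no match become None."""
    right_by_key = {row[key]: row for row in right}
    right_fields = {f for row in right for f in row if f != key}
    joined = []
    for l_row in left:
        filler = {f: None for f in right_fields}
        match = right_by_key.get(l_row[key], {})
        joined.append({**l_row, **filler, **match})
    return joined


mappable = inner_join(locations, headcounts, "office")    # Chicago and Denver
all_offices = left_join(locations, headcounts, "office")  # Miami kept, headcount None
```

The choice of join matters for the map: the inner join silently drops Miami because it has no headcount, while the left join keeps it with a blank value that the visualization then has to handle.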


Calculated fields show the result of a specified formula. Sometimes the data doesn’t contain the explicit information needed. For example, if a company wanted to display revenue but the data set only contained net income and expenses, a calculated field would contain the result of adding the two values.
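As a rough sketch of a calculated field: net income is revenue minus expenses, so revenue can be recovered by adding expenses back to net income. The quarterly figures and the `add_calculated_field` helper below are hypothetical, and the formula assumes the simplified relationship just stated.

```python
# Hypothetical quarterly figures; the numbers are invented for illustration.
records = [
    {"quarter": "Q1", "net_income": 40_000, "expenses": 160_000},
    {"quarter": "Q2", "net_income": 55_000, "expenses": 170_000},
]


def add_calculated_field(rows, name, formula):
    """Return new rows with an extra field computed from each existing row."""
    return [{**row, name: formula(row)} for row in rows]


# Simplifying assumption: revenue = net income + expenses.
with_revenue = add_calculated_field(
    records, "revenue", lambda row: row["net_income"] + row["expenses"]
)

for row in with_revenue:
    print(row["quarter"], row["revenue"])
```

The helper returns new rows rather than modifying the originals, so the raw data stays available if the formula later needs to change.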


Munging expedites data visualizations
In the workplace, a good munging tool can save data handlers several hours a day. When it is combined with a cleaning tool, it’s common to see a project timeline shrink from weeks to days. But before purchasing a munging tool, companies should evaluate its features and make sure it includes the following:

  1. Visual flow drag-and-drop interface

    This will allow for easier debugging and quick substitution of transformations.

  2. Input/output of various data sources and formats

  3. Large built-in library of standard actions with the option to customize using code

With a standard library of actions (import options, joins, etc.), companies can access commonly used transformations. To account for irregular data sets, the tool should allow the user to customize the standard transformations. The most flexible systems allow for custom code snippets (commonly written in JavaScript) to be inserted into the workflow, providing the user with the flexibility to tailor transformations to their unique data sets.
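To illustrate how a custom snippet might sit alongside standard actions, here is a minimal sketch of a workflow expressed as an ordered list of steps. It is written in Python rather than the JavaScript mentioned above, and every name in it (`drop_empty_rows`, `rename_field`, `parse_headcount`, `run_pipeline`) is hypothetical; it does not represent the API of any particular munging tool.

```python
def drop_empty_rows(rows):
    """Standard action: discard rows in which every value is blank or None."""
    return [r for r in rows if any(v not in ("", None) for v in r.values())]


def rename_field(old, new):
    """Standard action factory: build a step that renames one field in every row."""
    def step(rows):
        return [{(new if k == old else k): v for k, v in r.items()} for r in rows]
    return step


def parse_headcount(rows):
    """Custom snippet: turn strings such as '1,200 staff' into integers."""
    cleaned = []
    for r in rows:
        digits = "".join(ch for ch in str(r["headcount"]) if ch.isdigit())
        cleaned.append({**r, "headcount": int(digits) if digits else None})
    return cleaned


def run_pipeline(rows, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        rows = step(rows)
    return rows


# Hypothetical irregular input that the standard actions alone cannot fix.
raw = [
    {"Office": "Chicago", "headcount": "1,200 staff"},
    {"Office": "", "headcount": ""},
    {"Office": "Denver", "headcount": "45"},
]

result = run_pipeline(
    raw,
    [drop_empty_rows, rename_field("Office", "office"), parse_headcount],
)
```

Because each step is just a function from rows to rows, a custom snippet slots into the workflow exactly like a built-in action, and any single step can be swapped out or tested on its own.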

Now that we’ve discussed the power of data munging, I’ll ask again: To munge or not to munge? Clearly, there is no option…enterprises must munge!

In my next post, I’ll talk about another critical step in getting data ready to generate business insights—cleaning.
