A Google search for “Big Data” returns about 6.4 billion results. Over the past decade, Big Data has become a buzzword across all industries. From Main Street to Wall Street, organizations have generated huge amounts of data, giving rise to a multitude of Big Data use cases.

The main characteristics of Big Data are volume, variety, velocity and veracity (the 4 Vs). Processing Big Data takes extensive computing power, but that is much less of a concern now that cloud platforms such as Amazon Web Services (AWS), Microsoft Azure and Google Cloud offer computing environments that scale on demand.

Beyond processing power, we rely on intelligent algorithms to make sense of all the data. With the improvements in robotic science and technology, these algorithms have become very advanced. For many, though, one question still lingers: is there enough of the right data to run the algorithms on?

Do you have intelligent Big Data?   

Here are some of the best practices we recommend to boost the intelligence of Big Data.

Construe non-conventional data

Consider any industry. Huge amounts of data sit unutilized in non-conventional formats, such as images, PDFs, tapes, audio, video, logs and sensor readings. For example, in the medical industry, many patient vitals and notes are still on paper or in image format. In the automobile industry, a lot of data is captured by sensors and remains unused. Reading and ingesting non-conventional data adds more intelligence for consumers of Big Data.

Complement with external data

Individuals generate enormous amounts of data nowadays, much of which is available on open social media platforms, public administration platforms such as property tax sites, and more. For example, a financial institution can improve its behavior scoring by adding social and wealth data from open data platforms. Adding external data gives our existing Big Data new dimensions.

Correction of raw data

Sometimes parsing data from images, videos and audio files produces incorrect values. For example, an image parser can interpret the letter “H” as “1-1” because of bad image quality. It's wise to correct the raw data using a Natural Language Processing (NLP) engine to avoid false positives.
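
As a rough illustration of that kind of correction, here is a minimal Python sketch. It is not an NLP engine; it stands in for one with a simple lookup. The confusion table, the vocabulary and the function names are all hypothetical, chosen only to mirror the “H” versus “1-1” example. A substitution is accepted only when it turns an unknown token into a known word, so legitimate values such as numbers are left alone.

```python
# Hypothetical table of parser misreads: garbled fragment -> intended character.
CONFUSIONS = {"1-1": "H", "0": "O", "5": "S"}

# Hypothetical vocabulary of words we expect to see in the parsed documents.
VOCABULARY = {"HEART", "RATE", "HIGH", "BLOOD", "PRESSURE"}


def correct_token(token, vocabulary=VOCABULARY, confusions=CONFUSIONS):
    """Return a corrected token, or the original if no safe fix is found."""
    if token.upper() in vocabulary:
        return token  # already a known word, nothing to fix

    # Apply every known substitution to build a candidate correction.
    candidate = token
    for wrong, right in confusions.items():
        candidate = candidate.replace(wrong, right)

    # Accept the candidate only if it is a known word; otherwise keep the
    # raw token so that real numbers and unknown words are not altered.
    if candidate.upper() in vocabulary:
        return candidate
    return token


def correct_text(text):
    """Correct each whitespace-separated token in a line of parsed text."""
    return " ".join(correct_token(token) for token in text.split())


if __name__ == "__main__":
    print(correct_text("1-1EART RATE 1-1IGH 120"))
```

In practice the vocabulary check would be replaced by whatever language model or domain dictionary the NLP engine provides; the point of the sketch is only that corrections are validated before they overwrite the raw value.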

Shape data smarter

Applying appropriate data structures enables data consumers to garner intelligence from Big Data. Without some fundamental data architecture practices, the Big Data Lake can easily become a swamp. The data should be:

  • Organized - Big Data needs to be organized for appropriate use. There are different data models applied for exploring data, reporting from data and building Business Intelligence (BI) solutions on the data.
  • Integrated - Commonly, data is ingested into the Big Data Lake from various sources. In most cases the ingested data needs to be, or appears to be, integrated to serve up consistent information. This may require some form of data mastering, possibly using a Master Data Management (MDM) tool, for example for customer or product integration.
  • Identified - As data is ingested from multiple sources, the data architecture also has to deal with forming common identifiers across the Big Data. Again, this ensures consistency in the intelligence gleaned from the Big Data.
  • Cataloged - To enable data consumers to self-serve intelligence from Big Data, some form of metadata catalog is needed. Consumers need to be able to understand what data is available in the Big Data Lake, when it was last refreshed, where in the Big Data Lake it is located, where it came from, how good its quality is, and more. (Refer to the “Provide transparency of the Big Data” point below.)

Provide transparency of the Big Data

A key feature of an easily usable Big Data Lake is transparency into its contents. This is achievable through a catalog or metadata management capability. The catalog should capture important attributes about the data coming into and being processed within the data lake, from ingestion to publishing. The catalog enables users, or potential users, to understand what data is already managed within the data lake, where it originated, when it was updated, what stage the data is in (raw, refined or published), what rules or derivations were applied to the data and the level of data quality, among other things.

Capitalize upon various stages of data

Typically, as data is processed through the Big Data Lake, it appears in many forms. It lands on the platform, sits in its raw file form, gets refined and ultimately is published for general consumption. As the saying goes, “One man’s trash is another man’s treasure”. Each stage is important for different use cases. For example, data in the raw stage helps us identify errors and improve source system processing. Persisting data in each stage will add more value to the business now or in the near future.
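
To make the catalog idea more concrete, the following Python sketch shows one possible shape for a catalog entry that records the attributes listed above: location, origin, refresh time, stage, applied rules and quality. The class names, field names and the 0.0 to 1.0 quality score are assumptions made for illustration, not the schema of any particular metadata tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Dict, List, Tuple


class Stage(Enum):
    """Stages a dataset passes through in the lake, in processing order."""
    RAW = "raw"
    REFINED = "refined"
    PUBLISHED = "published"


@dataclass
class CatalogEntry:
    """Metadata about one dataset at one stage (hypothetical fields)."""
    name: str                 # logical dataset name
    location: str             # where in the lake the data is stored
    source: str               # system the data originated from
    stage: Stage              # raw, refined or published
    last_refreshed: datetime  # when the data was last updated
    quality_score: float      # assumed scale: 0.0 (poor) to 1.0 (clean)
    rules_applied: List[str] = field(default_factory=list)


class Catalog:
    """In-memory catalog keyed by dataset name and stage."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, Stage], CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Each stage is kept as its own entry so that raw, refined and
        # published copies of the same dataset all remain discoverable.
        self._entries[(entry.name, entry.stage)] = entry

    def find(self, name: str) -> List[CatalogEntry]:
        """Return every stage of a dataset, ordered raw -> published."""
        order = list(Stage)
        matches = [e for e in self._entries.values() if e.name == name]
        return sorted(matches, key=lambda e: order.index(e.stage))


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(CatalogEntry(
        name="customers", location="/lake/published/customers",
        source="crm", stage=Stage.PUBLISHED,
        last_refreshed=datetime(2020, 1, 2), quality_score=0.95,
        rules_applied=["deduplicate", "standardize_address"]))
    catalog.register(CatalogEntry(
        name="customers", location="/lake/raw/customers",
        source="crm", stage=Stage.RAW,
        last_refreshed=datetime(2020, 1, 1), quality_score=0.60))
    for entry in catalog.find("customers"):
        print(entry.stage.value, entry.location, entry.quality_score)
```

Keeping one entry per stage is a design choice that matches the point about persisting data at every stage: a consumer debugging a source system can look up the raw copy, while a report builder picks the published one.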

These are a few ways to add intelligence to your Big Data. As time and technology progress, there will be even more opportunities to reap value from Big Data.

Nick Bonamassa

Data Architecture Lead

Natesan Dhanasekar

Hadoop Lead – Data Business Group
