
BLOG


Nicholas Gulrajani
June 11, 2018

Why Big Data needs DevOps

DevOps aims for shorter development cycles, increased deployment frequency and more dependable releases, along with close alignment to business objectives.

In the healthcare industry, for example, most projects today deal with, or need to deal with, Big Data that changes quickly and must be published rapidly (in near real time) in a consumable form for stakeholders.

What is Big Data?
Big Data consists of data sets so voluminous and complex that traditional data processing application software is inadequate to deal with them.

Big Data challenges include data capture, storage, analysis, search, sharing, transfer, visualization, querying, updating and information privacy.

There are five dimensions to Big Data: volume, variety, velocity, and the more recently added veracity and value.

Data needs to be ingested ever faster from a variety of sources, including mainframes, relational database management systems (RDBMS) and flat files, into targets on a Hadoop cluster (a collection of open-source software utilities), where it is transformed and published. To achieve this, DevOps Continuous Integration and Continuous Deployment (CI/CD) patterns need to be adopted, along with the right set of tools so that data can be ingested and transformed rapidly and tested thoroughly to deliver the expected business value.
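As one concrete illustration of that ingestion step, here is a minimal sketch of a Sqoop import that pulls an RDBMS table into HDFS, wrapped in Python so a pipeline job could call it. The JDBC URL, credentials, table name and target directory are placeholders rather than values from this article.

    import subprocess

    def sqoop_import(jdbc_url: str, table: str, target_dir: str, mappers: int = 4) -> None:
        """Import one RDBMS table into an HDFS raw zone directory using Sqoop."""
        cmd = [
            "sqoop", "import",
            "--connect", jdbc_url,
            "--username", "etl_user",                      # placeholder account
            "--password-file", "/user/etl/.db_password",   # keeps the password off the command line
            "--table", table,
            "--target-dir", target_dir,                    # landing directory in the raw zone
            "--num-mappers", str(mappers),                 # parallel map tasks for the import
        ]
        subprocess.run(cmd, check=True)                    # raise if Sqoop exits non-zero

    if __name__ == "__main__":
        sqoop_import("jdbc:oracle:thin:@//db-host:1521/ORCL", "CLAIMS", "/data/rawz/inbound/claims")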

So how does one apply the DevOps CI/CD patterns to Big Data?
When applying CI/CD patterns, there are three dimensions that need to be considered.

  1. Code assets that ingest and transform data need to be promoted through the pipeline and must comply with quality gates as they are deployed to DEVELOPMENT (DEV), PRE-PRODUCTION (PRE-PROD), and PRODUCTION (PROD). (A minimal quality-gate check sketch follows this list.)

  2. The CI/CD pipeline needs to trigger data ingestion and track deployments into DEV, PRE-PROD and PROD.

  3. The pipeline needs to support testing not only of code assets but also of the data itself, as it is deployed into DEV, PRE-PROD and PROD.
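For the first dimension, a quality gate can be checked programmatically before a promotion is allowed. The sketch below, assuming a SonarQube server at a hypothetical URL and a hypothetical project key, calls SonarQube's project_status endpoint and blocks the promotion if the gate is not green.

    import sys
    import requests

    SONAR_URL = "https://sonarqube.example.com"   # hypothetical SonarQube server
    PROJECT_KEY = "bigdata-ingestion"             # hypothetical project key

    def quality_gate_passed(token: str) -> bool:
        """Return True only if the project's SonarQube quality gate status is OK."""
        resp = requests.get(
            f"{SONAR_URL}/api/qualitygates/project_status",
            params={"projectKey": PROJECT_KEY},
            auth=(token, ""),                     # SonarQube tokens are passed as the username
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["projectStatus"]["status"] == "OK"

    if __name__ == "__main__":
        if not quality_gate_passed(sys.argv[1]):
            sys.exit("Quality gate failed - blocking promotion to the next environment")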

Reference CI/CD pipeline model
The following steps in the CI/CD pipeline show how code assets are deployed to the DEV, PRE-PROD and PROD environments. They are illustrated in Exhibit A, and a tool-neutral orchestration sketch follows the list.

  • Developer creates feature branch from JIRA story to start building code.

  • Developer completes code commits and creates a pull request to merge code to develop branch in Bitbucket.

  • Jenkins initiates the CI/CD pipeline.

  • Static code analysis is performed.

  • On success, the code is packaged as a development snapshot.

  • Code deployed to Dev environment.

  • Automated unit tests are run in the dev environment.

  • On success, the code is packaged as a test snapshot.

  • Code deployed to pre-production environment.

  • Automated data tests are run in the pre-production environment.

  • On success, the code is packaged as a release snapshot.

  • Code deployed to production environment.
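In practice these stages are usually defined in a Jenkinsfile; as a tool-neutral sketch, the Python script below runs the same sequence of gates and stops at the first failure. Every command here (build.sh, deploy.sh, the test paths) is a placeholder for whatever the project actually uses.

    import subprocess

    # Placeholder commands for each stage; a real pipeline would substitute the
    # project's actual scan, package, deploy and test invocations.
    STAGES = [
        ("static code analysis",   ["sonar-scanner"]),
        ("package dev snapshot",   ["./build.sh", "--snapshot", "dev"]),
        ("deploy to DEV",          ["./deploy.sh", "dev"]),
        ("unit tests in DEV",      ["pytest", "tests/unit"]),
        ("package test snapshot",  ["./build.sh", "--snapshot", "test"]),
        ("deploy to PRE-PROD",     ["./deploy.sh", "preprod"]),
        ("data tests in PRE-PROD", ["pytest", "tests/data"]),
        ("package release",        ["./build.sh", "--release"]),
        ("deploy to PROD",         ["./deploy.sh", "prod"]),
    ]

    def run_pipeline() -> None:
        for name, cmd in STAGES:
            print(f"=== {name} ===")
            subprocess.run(cmd, check=True)   # a failing stage stops the promotion here

    if __name__ == "__main__":
        run_pipeline()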

Exhibit A (image)

What about Test Automation?
The following test automation steps review the quality of code and data as they flow through the pipeline and into the various data zones. These steps are illustrated in Exhibit B, and a unit-test sketch follows the list.

  • Unit Testing: Automated test scripts are created by developers to make sure each unit of code functions properly. These test scripts are managed in Bitbucket alongside the associated unit of code.

  • Static Code Analysis: Tools can be leveraged to provide continuous code quality inspection. SonarQube is a recommended tool for code analysis and inspection, ensuring that the code meets quality standards before it is promoted to production.

  • Functional Testing: Data validation ensures that the actual output matches the expected output when data moves from each zone to the next.

  • Integration & Regression Testing: End-to-end test automation covers the movement of data from source systems to the raw zone and all the way to the app zone. This enables fast and accurate end-to-end testing whenever there are code enhancements, configuration changes, and so on.

  • Performance Testing: Cloudera Hadoop (CDH) cluster resource utilization statistics are used to compare average job run times and resource utilization against available benchmarks and baselines.
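As an example of the unit-testing step, the sketch below uses pytest with a local SparkSession to test a small PySpark transformation. The transformation itself (de-duplicating patient records) is hypothetical, chosen only to show the pattern.

    import pytest
    from pyspark.sql import SparkSession

    def dedupe_patients(df):
        """Hypothetical transformation under test: keep one row per patient_id."""
        return df.dropDuplicates(["patient_id"])

    @pytest.fixture(scope="module")
    def spark():
        # A small local session is enough for unit tests; no cluster is needed.
        session = SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()
        yield session
        session.stop()

    def test_dedupe_keeps_one_row_per_patient(spark):
        df = spark.createDataFrame(
            [(1, "2018-01-01"), (1, "2018-01-02"), (2, "2018-01-01")],
            ["patient_id", "visit_date"],
        )
        assert dedupe_patients(df).count() == 2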

Exhibit B (image)

Sample Test Cases for Data (Recommended)

  • Missing data

  • Data truncation

  • Data type mismatch

  • Null translation

  • Wrong translation

  • Misplaced data

  • Extra records
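A few of these checks can be automated directly against the loaded tables. The PySpark sketch below covers missing data and extra records (row counts), null translation, and data truncation; the table and column names are placeholders, not names from this article.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("data-validation").enableHiveSupport().getOrCreate()

    source = spark.table("rawz.claims")        # placeholder source table
    target = spark.table("warehouse.claims")   # placeholder target table

    # Missing data / extra records: source and target row counts should match.
    assert source.count() == target.count(), "Row count mismatch between source and target"

    # Null translation: a mandatory business key should never arrive as NULL.
    null_keys = target.filter(F.col("claim_id").isNull()).count()
    assert null_keys == 0, f"{null_keys} rows have a NULL claim_id"

    # Data truncation: a free-text field whose values all hit the maximum length is suspicious.
    max_len = target.agg(F.max(F.length("diagnosis_desc"))).first()[0]
    assert max_len is None or max_len < 4000, "diagnosis_desc values may have been truncated"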

What about Data Ingestion?
The following steps show how data can be ingested from various sources into targets such as a “Raw Zone” on a Hadoop cluster. These steps are illustrated in Exhibit C, and a PySpark sketch of the file-based pattern follows the list.

  • Control-M Triggered: All raw zone activities are triggered through Control-M automation jobs.

  • Ingestion Patterns include:

    • File-Based Ingestion: Files are sourced to the edge node or HDFS via indirect ingestion paths.

    • Sqoop: Data ingested from RDBMS sources using Sqoop and the accelerator framework.

    • Change Data Capture (CDC) Replication: Data ingested from external sources using one of the change-data capture methods.

    • Kafka Queue: External applications may push data to a Kafka queue, which can then be retrieved into the Data Fabric.

  • Ingestion Apps: Custom ingestion code retrieves and ingests source data using HQL, Python with Spark (PySpark), and Java.

  • Raw Zone Directory Flow includes:

    • Inbound: incoming files land here prior to processing.

    • Stage: temporary, in-process datasets used only during load processes (non-permanent).

    • Backup: inbound files are moved here and retained for 30 days to support process recovery.

    • Warehouse: the final load zone and the RAWZ endpoint for all loaded data.

  • Hive SQL is used for query access to the Hive tables loaded in the RAW zone.
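To make the file-based pattern concrete, the PySpark sketch below reads a landed delimited file from the inbound directory, stages it in a DataFrame with a load timestamp, and appends it to a Hive table in the raw zone warehouse that Hive SQL can then query. The directory path, delimiter and table name are assumptions for illustration.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("rawz-file-ingestion").enableHiveSupport().getOrCreate()

    INBOUND = "/data/rawz/inbound/claims/"   # hypothetical landing directory

    # Stage: read the landed pipe-delimited file into an in-process DataFrame.
    staged = (
        spark.read.option("header", "true").option("delimiter", "|").csv(INBOUND)
        .withColumn("load_ts", F.current_timestamp())   # record when this batch was loaded
    )

    # Warehouse: append into the raw zone Hive table queried downstream via Hive SQL.
    staged.write.mode("append").format("parquet").saveAsTable("rawz.claims")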

Note: Only one example is discussed here. There are multiple other use cases, such as transporting data from a flat file or an RDBMS, that are not covered.

Exhibit C: Data Ingestion into the Raw Zone (image)

Summary
In my experience, typical results from applying these data ingestion and DevOps patterns include:

  • An architecture framework reduces the effort for architecture decision making and implementation by 80 percent.

  • An ingestion and raw data management framework reduces development effort by 60 percent.

  • A CI/CD and automation framework reduces testing and deployment effort by 70 percent.

Thanks to Paul Kuk, Senior Analytics Executive, Accenture Digital, for his help on Data Ingestion Patterns.



COMMENTS (1)


Laxmipat • July 17, 2018

Very insightful and useful content on DataOps. I have a question: can you please explain a little more about what we should do for data assets in Big Data when following the CI/CD pipeline? Are you suggesting moving flat file (stored on HDFS) records from DEV to QA to PROD through the pipeline?
