A primary pain point for most data scientists is installing the software needed to run their jobs and analytical models. Depending on the software and the type of installation, this can take a very long time. The problem is exacerbated when the installation must scale with data volume and user needs. Public cloud providers have addressed this with their Hadoop-as-a-Service offerings, but on-premises solutions are still in their infancy.
To tackle this nontrivial problem, the Accenture Data Insights-Architecture research agenda focuses primarily on automated on-premises architecture deployment. Our approach involved researching vendors that would allow us to set up an on-premises environment for Hadoop and Spark in our data centers. After talking to BlueData at the Strata + Hadoop World conference in New York City last fall, we invited them to our office for an in-depth discussion to understand their platform, its capabilities and how we could work together on this problem.
In a nutshell, BlueData’s EPIC software is an on-premises platform for big data infrastructure. It is primarily designed for large-scale data-intensive computing systems such as Hadoop and Spark. The BlueData software uses virtualization technology to help simplify and streamline big data deployments, providing Hadoop-as-a-Service or Spark-as-a-Service in an on-premises model.
BlueData EPIC can be deployed on top of a cluster of physical servers that run the Hadoop Distributed File System (HDFS). EPIC enables the creation of virtual clusters across these nodes, and those clusters can access the data stored in HDFS. It can also connect the virtual clusters to other file stores (NFS, HDFS, Gluster) through EPIC’s "DataTap" functionality.
After our initial discussions and evaluation, we procured a 5-node software license from BlueData to set up EPIC in our data centers. Here are the technical requirements for deployment:
1. Deployment Requirements
1.1. On-premises hardware
VM Compatible AMD CPU with at least 4 cores. 32GB RAM (64GB-192GB recommended). Two or more hard drives on the Controller and Worker nodes.
1.2. Linux operating system
For Red Hat Enterprise Linux, subscribe via Red Hat Subscription Manager. Our servers were installed with RHEL 6.5. All physical nodes were KVM enabled.
1.3. Networking requirements
10 Gb Ethernet cards for the Private Network Interface, which handles traffic between the virtual nodes. A 1 Gb Ethernet card for the Public Network Interface, which handles traffic between physical machines. A subnet pool of IP addresses allocated for use by the virtual nodes.
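To make the subnet pool requirement concrete, here is a minimal sketch using Python's standard `ipaddress` module. The subnet `10.16.1.0/28` and the reserved address are made-up values for illustration, and the helper name `virtual_node_pool` is ours, not part of EPIC.

```python
import ipaddress


def virtual_node_pool(subnet, reserved=()):
    """Return the usable host addresses in `subnet`, skipping reserved ones."""
    network = ipaddress.ip_network(subnet)
    skip = {ipaddress.ip_address(addr) for addr in reserved}
    # hosts() leaves out the network and broadcast addresses.
    return [str(host) for host in network.hosts() if host not in skip]


# Hypothetical subnet; the first host address is held back for a gateway.
pool = virtual_node_pool("10.16.1.0/28", reserved=["10.16.1.1"])
print(len(pool), pool[0], pool[-1])
```

A list like this shows quickly how many virtual nodes a given subnet can support before the pool is handed to the platform.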
2. Installation Lessons Learned
We had several WebEx sessions with the team at BlueData about installing the EPIC software on our 5-node cluster, and we were able to do the installation on our own. During this process we ran into some known as well as unknown issues; all were minor, and the BlueData team was very supportive and responsive in addressing each of them. The following sections document some of the lessons learned.
2.1. No xfsprogs Error
During installation we ran into a “no xfsprogs” error. This error occurs if the Linux distribution cannot find the utilities needed to manage the XFS File System. This was fixed by selecting the right subscription and adding the following RHN Channels on Red Hat:
Subscription : Red Hat Enterprise Linux Server (v. 6 for 64-bit x86_64)
Channels : RHEL Server Scalable File System (v. 6 for x86_64)
RHEL Server Optional (v. 6 64-bit x86_64)
2.2. Version of hypervisor
Our servers receive nightly security updates, which automatically upgrade packages installed from the yum repo. We found that this process upgraded the hypervisor (KVM) to a newer version that was not yet supported by BlueData EPIC. To fix this, we downgraded the hypervisor to a supported version and added an “exclude” line to the /etc/yum.conf file to prevent the nightly update from upgrading the hypervisor again.
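As a rough illustration of the yum.conf change, the sketch below adds package patterns to the `exclude=` option in the `[main]` section of a yum.conf-style text. The patterns `qemu-kvm*` and `libvirt*` are assumptions about which packages make up the hypervisor on a given server, and the helper `add_yum_exclude` is a hypothetical name; check the actual package names on your own system before using anything like this.

```python
def add_yum_exclude(conf_text, patterns):
    """Return yum.conf text with `patterns` added to exclude= under [main]."""
    out, in_main, done = [], False, False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Leaving [main] without having seen exclude=: append it first.
            if in_main and not done:
                out.append("exclude=" + " ".join(patterns))
                done = True
            in_main = stripped == "[main]"
        elif in_main:
            key, _, value = stripped.partition("=")
            if key.strip() == "exclude":
                # Merge with any patterns that are already excluded.
                existing = value.split()
                merged = existing + [p for p in patterns if p not in existing]
                line = "exclude=" + " ".join(merged)
                done = True
        out.append(line)
    if in_main and not done:
        out.append("exclude=" + " ".join(patterns))
    return "\n".join(out) + "\n"


# Assumed hypervisor package patterns, for illustration only.
sample = "[main]\ncachedir=/var/cache/yum\nkeepcache=0\n"
print(add_yum_exclude(sample, ["qemu-kvm*", "libvirt*"]))
```

The function works on text rather than on the file itself, so the result can be reviewed before it is written back to /etc/yum.conf.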
2.3. Removing a node from a Cluster
The version of BlueData EPIC that we installed did not yet have a feature to remove an existing node from a cluster. One of our nodes developed hard drive problems, and since it was no longer working, the only way to keep the cluster functional was to remove it. The BlueData team responded quickly and created a patch that we applied to our cluster to remove the node.
Testing the BlueData Web GUI
BlueData EPIC has a web-based GUI called “ElasticPlane” that provides administration functionality as well as self-service capabilities for virtual Hadoop or Spark clusters in a multi-tenant environment. We did our due diligence on the BlueData EPIC platform by testing the following ElasticPlane features on our 5-node cluster. Each of the features below worked as expected, and we did not uncover any issues.
Creating user accounts, setting permissions, creating and managing tenants, switching user roles and creating new catalog items.
Creating virtual clusters (Spark & Hadoop) in a given tenant.
Running Spark and Hadoop jobs on existing clusters as well as transient clusters (setup of cluster on the fly for the job followed by a teardown after execution).
Working with APIs
To automate deployment on our end, we needed the capability to programmatically instantiate clusters and run jobs. We worked with the BlueData team, who exposed a partial set of RESTful APIs for our use. We built Python wrapper scripts that call the BlueData APIs to create virtual Spark clusters, submit Spark jobs to run on these virtual clusters and list the existing clusters available to a given user in a tenant. This is an ongoing process, and we will continue to iterate with the BlueData team on additional APIs as we need them.
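To give a feel for what such a wrapper might look like, here is a minimal sketch. It is not the BlueData API: the URL paths, payload fields, bearer-token header and the `EpicClient` name are all hypothetical placeholders, since the actual endpoints are not documented in this post. The transport is injectable so the wrapper can be exercised without a live controller.

```python
import json
import urllib.request


class EpicClient:
    """Thin wrapper over a hypothetical REST API; paths and fields are assumptions."""

    def __init__(self, base_url, token, send=None):
        self.base_url = base_url.rstrip("/")
        self.token = token
        # `send` can be swapped out so no live server is needed for testing.
        self._send = send or self._http_send

    def _http_send(self, method, url, headers, body):
        req = urllib.request.Request(url, data=body, headers=headers, method=method)
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def _call(self, method, path, payload=None):
        body = json.dumps(payload).encode("utf-8") if payload is not None else None
        headers = {"Authorization": "Bearer " + self.token,
                   "Content-Type": "application/json"}
        return self._send(method, self.base_url + path, headers, body)

    def list_clusters(self, tenant):
        return self._call("GET", "/tenants/%s/clusters" % tenant)

    def create_spark_cluster(self, tenant, name, workers):
        return self._call("POST", "/tenants/%s/clusters" % tenant,
                          {"name": name, "type": "spark", "workers": workers})

    def submit_spark_job(self, tenant, cluster_id, app_path, args=()):
        return self._call("POST", "/tenants/%s/clusters/%s/jobs" % (tenant, cluster_id),
                          {"app": app_path, "args": list(args)})


# Exercise the wrapper against a fake transport that records each call.
calls = []

def fake_send(method, url, headers, body):
    calls.append((method, url, json.loads(body) if body else None))
    return {"ok": True}

client = EpicClient("http://bd-controller.example/api/", "TOKEN", send=fake_send)
client.create_spark_cluster("demo", "spark-1", workers=3)
client.list_clusters("demo")
print(calls)
```

Keeping the HTTP call behind one method means that new endpoints can be added as one-line methods as more APIs become available.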
Next Steps - BlueData for Customer Data Use Cases
The next step in our research was to use the BlueData EPIC software platform in Accenture projects that deal with real-world use cases. Our Digital Customer group is working with social and transaction data to build a customer genome. Their software stack includes HDFS, Kafka, Spark Streaming and Cassandra. To support these technologies on-premises, we decided to use BlueData’s EPIC platform for Spark Streaming*.
The jobs created for the project required the different technologies to interface with one another. For example, when social streaming data is ingested using Kafka, the data is pushed to Spark Streaming where, after some initial munging/pre-processing, the customer data is stored in Cassandra. Running these jobs on BlueData required configuring the necessary network IP addresses and ports so the different physical servers could talk to each other. Specifically, we had to set a route that uses the BlueData Controller (BD-Controller) node as the gateway for the virtual IP network. This allows traffic destined for the virtual nodes to be routed via BD-Controller, so the jobs can access the virtual nodes. More to come soon!
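The gateway route described above can be sketched as follows. The helper builds a standard `ip route add ... via ...` command for a Linux host; the subnet and controller address are made-up example values, and `route_command` is a hypothetical name. The command is only printed here, not executed.

```python
import ipaddress


def route_command(virtual_subnet, controller_ip):
    """Build the command that routes the virtual subnet via the controller."""
    # Parsing validates both values before they reach a shell.
    network = ipaddress.ip_network(virtual_subnet)
    gateway = ipaddress.ip_address(controller_ip)
    return ["ip", "route", "add", str(network), "via", str(gateway)]


# Hypothetical virtual subnet and BD-Controller address.
cmd = route_command("10.16.1.0/24", "192.168.10.5")
print(" ".join(cmd))
```

The command would need to be run with root privileges on each server that has to reach the virtual nodes, and a route added this way does not survive a reboot unless it is also written to the host's network configuration.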
*Note: This project is not exclusively using EPIC for Spark deployments.