Self-Service Data Access Unlocks Power of Data
The role of the digital platform for the Internet of Things (IoT) lies in capturing and analyzing data from a massive number of sensors and combining it with other contextually relevant data (such as enterprise, social, and third-party data) to produce actionable insight.
A key objective is agility: significantly reducing the complexity of finding relevant data, transforming it into a consumable format, and processing it. This is made harder because the definition of “relevant data” evolves along with changes in use case, context, and data availability.
Another key objective is interoperability in bringing together data from across sources and directly accessing data from new data stores. Direct access is a critical differentiator that allows for real-time analytics and dynamic decisions made on a single source of truth rather than out-of-date copies.
Traditionally, bringing together the “relevant data” has required significant technology know-how, since data is distributed across a number of disparate data stores with different organizational governance and architectures: for example, an asset hierarchy stored in a relational data store, customer contacts in an enterprise data warehouse, sensor readings in a data lake, and weather data from a third-party API.
Data preparation through manually created pipelines requires specialized skills to render data into a usable format, and the process is made more complex by the need to navigate a variety of data architectures. Manually creating the data processing pipelines that bring data together for application use can take weeks, if not months.
However, the promise of IoT lies in deriving insights from the massive amount of sensor data by enabling data scientists and domain experts, many of whom lack the technology skills needed to access and prepare the data. So one way an IoT platform differentiates itself is through its ability to open up data usage beyond IT and BI specialists, and to democratize these capabilities for data consumers.
A Data Virtualization Solution
To address this promise, we recently tested the viability of data virtualization for democratizing data access in digital platforms for IoT.
Data virtualization is a technology that allows for the definition of virtualized or “logical” views that present data from across data stores and representations in a single queryable format. It abstracts away the details of the specific data stores to be accessed through a set of pre-built integrations and transformations. Creating these views is accessible and uses standard SQL, a language more commonly understood by data scientists, BI analysts, and IT practitioners than many of today’s architecture-specific protocols.
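To make the idea concrete, the sketch below mimics a logical view with SQLite standing in for a real virtualization engine, and two local tables standing in for separate back-end stores. All table and column names here are invented for illustration; a real engine such as Metanautix would resolve the view against remote stores instead.

```python
import sqlite3

# Two "stores" with different shapes, standing in for disparate back ends
# (e.g., an asset hierarchy in a relational store, readings in a data lake).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE assets   (asset_id TEXT, site TEXT);
    CREATE TABLE readings (asset_id TEXT, ts INTEGER, temp_c REAL);
    INSERT INTO assets   VALUES ('pump-1', 'plant-a'), ('pump-2', 'plant-b');
    INSERT INTO readings VALUES ('pump-1', 100, 71.5), ('pump-2', 100, 68.0);

    -- The "logical" view: consumers query it with plain SQL and never
    -- see which underlying store each column came from.
    CREATE VIEW asset_readings AS
        SELECT a.asset_id, a.site, r.ts, r.temp_c
        FROM assets a JOIN readings r ON a.asset_id = r.asset_id;
""")

rows = db.execute("SELECT site, temp_c FROM asset_readings ORDER BY site").fetchall()
print(rows)  # [('plant-a', 71.5), ('plant-b', 68.0)]
```

The point of the pattern is that the consumer writes one standard SQL query against the view; the engine, not the user, knows where each piece of data lives.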
Accenture Technology Labs, along with Accenture Resources, has been creating digital platforms for IoT that handle the large volumes of data created by a massive number of sensors and devices, data that must be processed both in batch and in real time. At its heart, the platform is a version of the lambda architecture, comprising a number of data stores that handle the fast writes of streaming real-time data, comprehensive storage of the massive quantity of all data, and the serving needs of application consumption.
For our first step in using data virtualization, we focused on a version of lambda that uses Cassandra for serving and capturing streaming data from sensors and devices, and Amazon S3 for comprehensive storage. We selected Metanautix Quest, whose roots stem from Google’s Dremel project, as the data virtualization solution to query these large datasets at Internet scale, because of its ability to scale both horizontally and vertically.
We had previously implemented manual pipelines for migrating, tracking, and querying data across data stores. Migration followed a typical tiered-storage approach: newly arrived hot data was directed to Cassandra, the write-optimized operational store, then compressed and migrated into S3 for archival after a pre-determined period of months. Querying the data either meant connecting to a single source (e.g., a query against Cassandra was limited to the data it contained), or, for queries across stores, creating pipelines with custom logic to fetch and ingest data from each store and then join the results.
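The tiered migration described above reduces to a simple rule: move rows older than the retention window from the hot store to the archive. The sketch below shows that rule with SQLite tables standing in for Cassandra and S3; the retention window, table names, and month-based timestamps are all assumptions for illustration.

```python
import sqlite3

RETENTION_MONTHS = 3  # assumed retention window, for illustration only

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hot_store (sensor_id TEXT, month INTEGER, value REAL);  -- Cassandra stand-in
    CREATE TABLE archive   (sensor_id TEXT, month INTEGER, value REAL);  -- S3 stand-in
    INSERT INTO hot_store VALUES ('s1', 1, 1.0), ('s1', 2, 2.0), ('s1', 6, 6.0);
""")

def migrate(current_month: int) -> None:
    """Move rows older than the retention window from the hot store to the archive."""
    cutoff = current_month - RETENTION_MONTHS
    db.execute("INSERT INTO archive SELECT * FROM hot_store WHERE month < ?", (cutoff,))
    db.execute("DELETE FROM hot_store WHERE month < ?", (cutoff,))

migrate(current_month=6)
hot = db.execute("SELECT COUNT(*) FROM hot_store").fetchone()[0]
cold = db.execute("SELECT COUNT(*) FROM archive").fetchone()[0]
print(hot, cold)  # 1 2
```

In the manual-pipeline world this logic, plus compression and error handling, had to be written and operated by hand; in the virtualized setup it was triggered by the engine instead.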
With the introduction of Metanautix, we started by replacing our existing pipelines with data virtualization’s logical views. Out of the box, we used Metanautix’s existing connector and transformations to access S3, and its trigger-based actions to execute the migration. At the time there was no out-of-the-box connector for Cassandra, so we used a user-defined function (UDF) as a call-out to store and fetch the data.
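The UDF call-out pattern can be sketched in miniature: register an ordinary function with the SQL engine so that queries can invoke it per row against a store the engine has no native connector for. Below, SQLite’s UDF registration stands in for Metanautix’s, and a plain dictionary stands in for the remote Cassandra cluster; none of these names come from the actual deployment.

```python
import sqlite3

# Stand-in for the external store behind the call-out (hypothetical data).
EXTERNAL_STORE = {"pump-1": 71.5, "pump-2": 68.0}

def fetch_reading(asset_id: str) -> float:
    """The call-out: in the real setup this would query Cassandra over the wire."""
    return EXTERNAL_STORE[asset_id]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE assets (asset_id TEXT)")
db.executemany("INSERT INTO assets VALUES (?)", [("pump-1",), ("pump-2",)])

# Register the Python function so SQL can invoke the call-out per row.
db.create_function("fetch_reading", 1, fetch_reading)

rows = db.execute(
    "SELECT asset_id, fetch_reading(asset_id) FROM assets ORDER BY asset_id"
).fetchall()
print(rows)  # [('pump-1', 71.5), ('pump-2', 68.0)]
```

The per-row call-out is what makes a UDF easy to build but costly to run, which foreshadows why a native connector was worth pursuing.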
To an end user, the complexity of the multiple data stores and formats sat behind a logical view that fetched and unioned the results from all sources. The consumer could query the data regardless of whether it resided in Cassandra or S3.
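That unioning view is itself just standard SQL. As a minimal sketch, again with SQLite tables standing in for the hot and archival stores (the table and view names are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE cassandra_hot (sensor_id TEXT, ts INTEGER, value REAL);
    CREATE TABLE s3_archive    (sensor_id TEXT, ts INTEGER, value REAL);
    INSERT INTO cassandra_hot VALUES ('s1', 200, 5.0);
    INSERT INTO s3_archive    VALUES ('s1', 100, 3.0);

    -- One logical view over both tiers; the consumer never knows
    -- which store answered any given row.
    CREATE VIEW all_readings AS
        SELECT * FROM cassandra_hot
        UNION ALL
        SELECT * FROM s3_archive;
""")

rows = db.execute("SELECT ts, value FROM all_readings ORDER BY ts").fetchall()
print(rows)  # [(100, 3.0), (200, 5.0)]
```

Because the view is re-resolved at query time, data that migrates from the hot tier to the archive stays visible without any change on the consumer’s side.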
The ease with which the pipelines could be created allowed this use case to be set up in a matter of days, with most of the complexity lying in the custom UDF. This set-up was a great way to quickly evaluate ease of use and to let a user see the relations across the data.
But to scale, we wanted a more optimized solution than a UDF, so we worked with Metanautix, who created a new native connector for Cassandra. With this native connector, logical views can be configured in minutes using standard SQL, which is familiar to a broad set of data users.
The native connector also brought significant performance improvements: we were able to achieve reads of up to 130,000 rows per second. We hit this limit not because of Metanautix, but because we were taxing our single-node Cassandra instance running on an Amazon Elastic Compute Cloud c3.2xlarge instance.
We also added Postgres and MySQL as alternative data stores, both to bring relational stores for traditional BI reporting into the mix and to add new data that enhances the sensor readings in Cassandra and S3. With out-of-the-box connectors for Postgres and MySQL, the change was again straightforward, and it unlocked that data for use across the IoT platform.
Our first steps with data virtualization for IoT have been very promising. IoT presents a use case that must take advantage of new data as it becomes available; it requires ease both in onboarding the associated data store technologies and in decoupling that complexity from end users. Data virtualization is a key enabler of this capability and is required to scale the application of data.