Data continues to be progressively more important in defining business strategies. In an effort to increase the level of detail and efficiency captured by data warehouses, data store types are becoming more diverse and abstracted.
For example, retailers have customer information stored across different architectures between their customer relationship management (CRM), point of service (POS) and loyalty systems. These retailers also want to find ways to incorporate third-party data providers related to product, customer research and social networks. To accommodate this, the processes to manage and integrate data have to become more robust.
In traditional relational database management systems (RDBMS), tables are structurally similar. Storing data in logical row-column structures allows relatively straightforward structural joins between tables using lookup tables or primary keys. Yet, as demand for data grows, so does the types of data collected. Many companies believe that RDBMS are insufficient for their particular data set and are looking to other types of systems to fill the gap.
Examples of alternative store types include variants of NoSQL, which eschews the structures of RDBMS and uses key-value pairs to eliminate the need to store empty, null data. Relational graph databases are also growing in favor due to their ability to represent data through direct relationships from node to node, rather than inferred via shared keys (see Figure 1).
Defining new data integration strategies
Introducing new types of data to an existing warehouse can add new levels of complexity, as data types may not be structurally similar. This lack of similitude can translate to an uneven view of the overall data without a proper integration strategy, which can be difficult to deploy properly.
To design a data integration strategy, companies must create a new data structure to accommodate the new data being integrated, which must also be robust enough to handle additional structural changes. They must also develop a new warehousing strategy to store the new integrated dataset. Both of these steps require strict adherence to methodology and development of good governance practices that are as robust as the data.
Without a good integration strategy, a company can end up with large, inefficient data warehouses that waste valuable resources and computing power—or worse, a warehouse that contains a high number of improperly processed and inaccurate data sets.
Pros and cons of data virtualization
To alleviate some of these issues, data virtualization has become a popular alternative for data integration. Data virtualization integrates and creates a single view of data aggregated from multiple sources, offering it seamlessly to the end-user without requiring knowledge of technical details about the data, such as formatting or storage location.
Rather than creating a new replication of the data to be physically stored on servers, data virtualization acts as a data broker, combining multiple sources of data and matching redundant data deterministically (see Figure 2).
From this view, the original data can be managed, standardized or matched. In this way, data virtualization acts as a meta-database, connecting smaller, disparate databases and presenting the data in a unified manner. Done properly, virtualization reduces the risk of data errors and system workload due to the source data remaining in place. Since companies do not need to build an extensive infrastructure, it can also reduce development and support time.
However, data virtualization is not a magic bullet. Where speed is crucial, such as in real-time recommendation engines, it may be better to consolidate data under a single structure using traditional RDBMS integration tools instead of virtually accessing data from multiple sources.
Furthermore, very complex transformations often take much longer in virtualized scenarios. And regardless of integration method, cross-linking and matching like-for-like data accurately is still a massive challenge for even the newest data integration suites, requiring a deep and thorough understanding of the data itself. Currently, there is no tool that can do this completely automatically.
To sum it up, data integration projects are difficult and often messy. Today’s companies must wrangle multiple types of data stores, methods of access and even multiple physical locations. While there is work being done to automate data integration and make it as seamless as possible, there is still no fully automated method to do so. For now, at least, the most important aspect of a company’s data integration strategy should be the governance and integrity of the data, not necessarily using the newest technology suites.
To learn more about data stores, data integration or data science at large, please contact Hyon S. Chu at firstname.lastname@example.org.