I can be paranoid about many things. Some of us worry that the sky is falling or that the world is going to end. Will it end on December 21st, 2012, as many believe the ancient Mayans predicted? We all have fears in life, and if some recent news is any indication (here is a reference to some disturbing news that should get you thinking), then I should probably be wearing a foil cap, lead-lined underwear, and a personal cooling system.
Whenever I am at a restaurant, I always moan and groan about those in front of me in the buffet line. One has to ponder many questions that validate the desire to consume said food. First, who touched the buffalo wings? Second and third, when did they touch them, and how did they do it without mixing utensils from other food trays? Fourth, why did they have to touch my beloved buffalo wings, which I have been salivating over since I joined the buffet line? Most importantly, why is there a weird green, partially fuzzy, semi-gelatinous blob on the same tray as the buffalo wings? (No thank you, I will pass on the buffalo hot wings today and instead take the greasy, cheesy goodness known as macaroni!) Pondering these basic questions makes my mind race with uncertainty. I begin to distrust what I see before me, and I begin to think of follow-on questions that eat away at me. Why did the disheveled person in front of me in line not wash their hands before leaving the restroom after exiting a stall (sadly, a true story that many have had the horror of witnessing)? Maybe it is my imagination, but I find that many people are not very sanitary, and kids are worse with their perpetually sticky hands. You have to wonder where they have been, what they have been doing, and how they arrived at their current state of affairs.
In addition to who is around you, you also have to start asking what happened to the food before you got to the buffet line, and what happened before the food itself arrived in its little heated stainless steel tray. Did the staff play hockey with the fish patties? Is there MSG in any of this food? How many food violations has the kitchen had in total? And so on. Every time one question arises and is answered, many more can pop up. The same types of worries and concerns permeate many aspects of the enterprise environment. During the process of data retrieval, many of the same questions arise. Tracking the origin of information is the domain of data lineage, and if we could only do this for the food items I like to eat, I would probably NEVER go out to eat again. What is this foreign concept of data lineage that I speak of, why is it so important, and what can we do about it?
Comparing enterprise data lineage to the local hole-in-the-wall buffet sounds harebrained, but the very same questions ring true. Just as one might ask about the greasy buffalo hot wings, so one needs to ask about data: what is data lineage, and what are its implications? Information and data flow fluidly from all points within a company, pass through systems and subsystems, are consumed and exported by applications, may or may not be modified, and can become an aggregate of several other data points. Data may eventually be stored in one or more locations and can reside in databases, documents, spreadsheets, or even emails. Along the way, the origination point of the data and its lineage information (who, what, where, when, why, and how) can be obscured, may contain gaps, or may be lost. Capturing the changes to data over time involves tracking its lineage as applications consume it and interact with it. Data lineage is metadata that captures the history and provenance of data, which is critical to answering key business questions such as:
Who created or modified the data?
What operations were performed on the data?
Were there any elements of uncertainty introduced or injected into the data?
When were modifications or changes introduced to data?
Which business processes and/or systems touched and processed the data?
What were the previous values of the data throughout its life thus far?
From which sources does the data originate?
What is the reasoning for any modifications to the data?
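To make these questions concrete, here is a minimal sketch of what a lineage record might hold. It is not the schema of any particular tool; the class names, field names, and sample values (LineageEvent, LineageTrail, "q3_revenue", and so on) are all hypothetical and chosen only to line up with the who, what, where, when, and why above. Uncertainty tracking is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, List, Optional, Tuple


@dataclass(frozen=True)
class LineageEvent:
    """One change to a data item. All field names are hypothetical."""
    who: str                       # who created or modified the data
    operation: str                 # what operation was performed
    system: str                    # which process or system touched the data
    when: datetime                 # when the change was introduced
    why: str                       # reasoning for the change
    new_value: Any                 # value after the change
    previous_value: Any = None     # value before the change, if any
    sources: Tuple[str, ...] = ()  # identifiers of upstream data it derives from


@dataclass
class LineageTrail:
    """Ordered change history for a single data item."""
    item_id: str
    events: List[LineageEvent] = field(default_factory=list)

    def record(self, who: str, operation: str, system: str, why: str,
               new_value: Any, sources: Tuple[str, ...] = (),
               when: Optional[datetime] = None) -> LineageEvent:
        """Append a change, carrying the last known value forward as 'previous'."""
        previous = self.events[-1].new_value if self.events else None
        event = LineageEvent(
            who=who, operation=operation, system=system,
            when=when or datetime.now(timezone.utc), why=why,
            new_value=new_value, previous_value=previous,
            sources=tuple(sources),
        )
        self.events.append(event)
        return event

    def previous_values(self) -> List[Any]:
        """Every value the item held before its current one, oldest first."""
        return [e.new_value for e in self.events[:-1]]

    def origins(self) -> List[str]:
        """Distinct upstream sources, in the order they were first seen."""
        seen: List[str] = []
        for event in self.events:
            for source in event.sources:
                if source not in seen:
                    seen.append(source)
        return seen


# Made-up example: a revenue figure is loaded, then corrected.
trail = LineageTrail("q3_revenue")
trail.record("alice", "insert", "ledger_app", "initial load", 100,
             sources=("ledger_db.sales",))
trail.record("bob", "update", "reporting_app", "late invoice added", 120,
             sources=("ledger_db.invoices",))
```

With a trail like this, "what were the previous values?" becomes trail.previous_values() and "from which sources does the data originate?" becomes trail.origins().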
The next question to ask is: why is data lineage important? The lineage of data can be imperative for businesses such as financial institutions that must abide by regulations such as Basel II and the Sarbanes-Oxley Act, as enforced by the Federal Reserve and the U.S. Securities and Exchange Commission (SEC). Such regulations require knowledge of the life and timeline of data. In such instances, institutions must provide proof of the veracity and authenticity of information in a timely manner after a request for it (e.g., Section 409 of the Sarbanes-Oxley Act). Acquiring the entire lineage of a piece of data can be time consuming and error prone because of the various people and data retrieval steps involved. The complications of tracking data lineage increase when information branches and various departments or people within an organization hold different versions of the data. Matters become even more complicated when changes from branched data merge again at later steps in the lifecycle of the data. Utilizing data lineage in an enterprise can begin to address several of these business needs:
Shorter business decision-making cycles
More efficient and cost-effective compliance and audits
Enhanced data loss prevention, especially in data aggregation situations
Finer-grained access control
In-depth data analytics
Decision-making occurs in all industries, but the effort involved varies between them. Pharmaceutical companies, for example, spend significant effort determining the lineage of clinical trial data. This can prolong the decision-making process of whether to advance or kill a drug.
Compliance and auditing are a necessity in financial services, and many financial services companies spend significant effort locating and preparing information for audits, much of which is lineage related. Lineage also plays an important role in determining how reliable the risk exposures are that banks report to regulators under compliance initiatives such as Basel III.
Without proper lineage information, combining data from multiple sources can result in unintended information release and data loss. Once data has left an enterprise or entity, it can be very difficult to control its exposure and next to impossible to retract it.
The lack of proper data lineage information can also degrade the impact and value proposition of data analytics. For example, insights have less value if the sources of the data and information are not trusted or are unknown.
Now, imagine opening an Excel spreadsheet or Microsoft Word document and distressing over the numbers that appear before you. The document is a quarterly earnings statement, but something is amiss: the numbers do not appear to add up. Normally when this happens, you would go back and check your numbers. If one of your job tasks is to track where the numbers come from, you may end up talking to people, looking at logs, and finding out who calculated the numbers and from what piece or pieces of information they derive. This could take hours or days depending on the complexity. If you are in a hurry or have a meeting, it is next to impossible to get this information in a timely manner. A data lineage tracking tool that is integrated with common office tools is a click of a mouse away and can ease the anxiety. With our data lineage tool, right-clicking on the questionable numbers and selecting “Get Data Lineage” produces a quick report of changes (the who, what, where, when, and why of data lineage) along with a graph depicting the data flow and how each number arrived at its current value. Armed with this information, you can feel confident that you will be able to answer any questions that arise. With this in mind, we decided to see if it was possible to create a data lineage tool that could tackle some of the issues described above.
One of the challenges in building a data lineage solution is making it usable by both existing and new assets in an enterprise environment. Existing assets and their owners may be unwilling or unable to modify their systems to utilize a lineage system. To accommodate both new and existing assets, a minimally invasive, multi-modal architecture is required. To meet this need, we designed the system to operate in a mediation mode as well as a monitor mode.
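The two modes are not spelled out here, so the following is only one possible reading, sketched under stated assumptions: in "mediation" mode the lineage layer sits in the write path and records each change as it forwards it, while in "monitor" mode the asset is left untouched and changes are detected afterwards by comparing snapshots. Every name below (LineageLog, MediatedStore, StoreMonitor) is hypothetical, and a plain dict stands in for a real data store.

```python
_MISSING = object()  # sentinel so that a stored None is not confused with "absent"


class LineageLog:
    """Collects lineage entries from either mode (hypothetical structure)."""

    def __init__(self):
        self.entries = []

    def add(self, mode, key, old, new, who):
        self.entries.append(
            {"mode": mode, "key": key, "old": old, "new": new, "who": who}
        )


class MediatedStore:
    """Assumed mediation mode: writes go through this wrapper, which
    records lineage and then forwards the write to the real store."""

    def __init__(self, store, log):
        self.store = store
        self.log = log

    def write(self, key, value, who):
        old = self.store.get(key)
        self.store[key] = value
        self.log.add("mediation", key, old, value, who)


class StoreMonitor:
    """Assumed monitor mode: the store is not modified or wrapped; changes
    are found by diffing against the last snapshot. The author of a change
    is unknown in this mode, and deletions are ignored in this sketch."""

    def __init__(self, store, log):
        self.store = store
        self.log = log
        self.snapshot = dict(store)

    def poll(self):
        """Record every key whose value changed since the last poll."""
        changed = 0
        for key, value in self.store.items():
            before = self.snapshot.get(key, _MISSING)
            if before is _MISSING or before != value:
                old = None if before is _MISSING else before
                self.log.add("monitor", key, old, value, who=None)
                changed += 1
        self.snapshot = dict(self.store)
        return changed


log = LineageLog()

# A new asset can opt in to mediation and get full "who" information.
new_asset = {}
MediatedStore(new_asset, log).write("revenue", 10, who="alice")

# A legacy asset keeps writing directly; the monitor only observes it.
legacy_asset = {"cost": 5}
monitor = StoreMonitor(legacy_asset, log)
legacy_asset["cost"] = 7
changes_found = monitor.poll()
```

The trade-off this sketch tries to show is that mediation captures richer lineage but requires routing writes through the lineage layer, whereas monitoring asks nothing of the existing asset at the cost of less detail.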
The data lineage tool allows for integration with a myriad of external tools using web service calls. The real magic comes into play when the data lineage tool is integrated with Microsoft Office applications such as Excel and Word. The integration with the Microsoft Office suite also allows lineage to persist across “copy & paste” operations between applications. Furthermore, the lineage information is embedded with the value itself, so it survives being copied and pasted within the same document as well as across different documents and spreadsheets, whether or not the host system has the data lineage add-in installed. In Excel, a user simply selects the desired asset, resource, and table combination and then retrieves the data lineage. The data lineage and data flow can also be retrieved from the Microsoft Excel ribbon after highlighting an Excel cell of interest (see figure below).
A unique feature of the integration is the ability to monitor and detect anomalies. When a value differs from its previous values by more than a certain degree, red highlights draw attention to it. This provides a starting point for an individual to investigate problems that may exist in a system.
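The exact rule used to decide that a value is "a certain degree different" is not stated, so the check below is just one simple possibility: flag a new value when it deviates from the average of its previous values by more than a relative threshold. The function name, the use of the mean as a baseline, and the 50% default threshold are all assumptions made for illustration.

```python
from statistics import mean


def is_anomalous(previous_values, new_value, max_relative_change=0.5):
    """Return True if new_value deviates from the mean of previous_values
    by more than max_relative_change (0.5 means 50%).

    This is an illustrative rule, not the tool's actual detection logic.
    """
    if not previous_values:
        return False  # no history, so nothing to compare against
    baseline = mean(previous_values)
    if baseline == 0:
        # Relative change is undefined; treat any non-zero value as a jump.
        return new_value != 0
    return abs(new_value - baseline) / abs(baseline) > max_relative_change


history = [100, 102, 98]           # made-up previous values of one cell
small_move = is_anomalous(history, 101)
big_jump = is_anomalous(history, 250)
```

A value flagged this way would be a candidate for the red highlight; a real deployment would likely want a threshold tuned per data item.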
The data lineage provides only limited information about a value at the table level; however, the data flow can enrich the experience even more by showing how the data arrived at its current value as it passed from one database to another through applications. The data lineage tool is able to track data flow not only within a database, but also across databases. The tool displays information as a directed graph with a clear delineation between applications and resource stores (see figure below).
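A data flow of this kind can be modeled as a small directed graph in which every node is tagged as either an application or a resource store. The sketch below is an illustration only, with hypothetical class, method, and node names; it shows how walking the edges backwards answers "where did this value come from?", including the case where two sources merge.

```python
class DataFlowGraph:
    """Directed graph of data flow between applications and stores."""

    KINDS = ("application", "store")

    def __init__(self):
        self.kind = {}      # node name -> "application" or "store"
        self.upstream = {}  # node name -> set of nodes feeding into it

    def add_node(self, name, kind):
        if kind not in self.KINDS:
            raise ValueError(f"unknown node kind: {kind!r}")
        self.kind[name] = kind
        self.upstream.setdefault(name, set())

    def add_flow(self, source, target):
        """Record that data flows from source to target."""
        for name in (source, target):
            if name not in self.kind:
                raise KeyError(f"unknown node: {name!r}")
        self.upstream[target].add(source)

    def trace_back(self, node):
        """Every node that feeds into `node`, directly or indirectly."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent in self.upstream[current]:
                if parent not in seen:  # also guards against cycles
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def origins(self, node):
        """Upstream nodes that have no inputs of their own."""
        return {n for n in self.trace_back(node) if not self.upstream[n]}


# Made-up flow: two databases merge in one application, whose output is
# stored and then picked up by a reporting application.
graph = DataFlowGraph()
for store in ("trades_db", "ledger_db", "risk_db", "report_db"):
    graph.add_node(store, "store")
for app in ("risk_app", "report_app"):
    graph.add_node(app, "application")

graph.add_flow("trades_db", "risk_app")
graph.add_flow("ledger_db", "risk_app")
graph.add_flow("risk_app", "risk_db")
graph.add_flow("risk_db", "report_app")
graph.add_flow("report_app", "report_db")

report_origins = graph.origins("report_db")
```

Keeping the node kind alongside the edges is what lets a display separate applications from resource stores while still drawing a single graph.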
Data lineage has a broad range of applicability. There are many ways of tackling the problem of tracking the life cycle of data, each with its own pros and cons. What we have done is create a method that is minimally intrusive on an enterprise environment. We have done this by creating a tool that can work with both new and old assets, utilizing different access methods for acquiring the data lineage. Tracking data lineage is becoming increasingly important for many companies and can aid in many business processes. While we can’t help much concerning those juicy buffalo wings I mentioned earlier, we do have a way of finding out how data is manipulated throughout its lifetime. As for the buffet line, nothing short of a QR code cooked into my buffalo wings, with a link to an online food database or lineage trail, may help me decide whether I can trust the food or not…
“Laptops damage sperm? What wi-fi study shows”, http://www.cbsnews.com/8301-504763_162-57332822-10391704/laptops-damage-sperm-what-wi-fi-study-shows/
“Laptops May Hurt Male Fertility, Study Suggests”, http://www.cbsnews.com/2100-500165_162-7044716.html
“Laptop WiFi May Damage Sperm, Study Suggests”, http://www.huffingtonpost.com/2011/11/29/laptop-wifi-sperm-damage-electromagnetic-radiation_n_1118726.html
Basel Committee on Banking Supervision, http://www.bis.org/bcbs/index.htm
Sarbanes Oxley Act of 2002, http://www.sec.gov/about/laws/soa2002.pdf
The Federal Reserve Board, http://www.federalreserve.gov
United States Securities and Exchange Commission, http://www.sec.gov
Microsoft Office, http://office.microsoft.com
Microsoft Excel, http://office.microsoft.com/en-us/excel
Microsoft Word, http://office.microsoft.com/en-us/word