July 23, 2013
Part II: Important Features of a Log Content Management and an Analytics Solution
By: Colin Puri

In my last blog, I wrote about the importance of using logs to gather insights. In this edition, I’m exploring how Log Content Management solutions aid in gathering insights by aggregating log files from disparate sources, indexing them, and making them searchable. Every solution offers a slightly different approach, and not all solutions are created equal. Ultimately, the decision about which solution to choose is based on the types of log files that need to be indexed and the type of insight that is desired. Some solutions provide a simple aggregation service, others provide additional statistics, some provide more advanced connections and dashboards, and others enable further analytics. What you can analyze depends on the information the log files contain; what you can do with that information depends on choosing the right vendor platform or support.

There are many vendors out there and they fall into two main categories: free public-use solutions and supported enterprise solutions. Beyond the two main categories are various levels of service and availability that vendors provide. The core facets to focus on when acquiring a solution for log management and log content analysis are:

  • Scalability: Almost all of the vendors provide a solution that can scale with large data; however, they differ in their approach. The following are scalability issues that should be considered:

  • Technically scalable:

  • Can the solution deliver on its technical promises and does it scale horizontally?

  • Economically scalable:

  • The economic scalability of a solution is also paramount for an enterprise looking to keep a tight rein on its expenses. A quick first question is: how expensive is it to run the solution? What is the cost of support, and is the solution elastic? For public-use solutions, what happens when support is needed? For pay-for-use software, what are the costs of ongoing support from the vendor?

  • Ingestion and Parsing: This can be a thorn in the side of anyone who has tried to ingest abnormal or oddball log file formats. The ability of a solution to handle numerous log file formats can mean the difference between garbage-in-garbage-out and getting real insights from indexed data. To that end, a solution must be able to handle the following three issues gracefully:

  • Ill-formed and poor quality log file inputs:

  • Any viable solution must be able to handle log files with dirty or poorly formed data values and do so gracefully (generating notifications, re-formatting data, normalizing data, cleaning data, etc.) while still extracting information from the portions that are of high quality. It must do so without propagating data errors and ill-formed data to other systems that interact with its data store.

  • Unknown log file formats:

  • Many file formats are proprietary and may not follow well-defined formats such as CSV files, tab-delimited files, or Apache web logs. A good solution will provide a wizard or guide for first-time file ingestion and pattern extraction, plus a fallback mechanism for extracting data from log files that do not fit a known template and are too difficult to walk through the wizard or guide.

  • Heterogeneous log file formats:

  • In some instances, a log file may be well-formed and structured but still difficult to parse because of the diversity of its entries. A solution must provide a mechanism that is expressive enough to parse a diverse file and complicated trace entries.
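The three ingestion concerns above can be illustrated with a minimal Python sketch. It is not taken from any particular vendor's product: the two patterns, the `ingest` function, and the `IngestResult` type are hypothetical names chosen for illustration, and real log formats will need their own patterns. The idea is to try several known layouts per line (heterogeneous input), keep whatever parses cleanly, and set aside unparseable lines for review instead of passing them downstream.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for two line layouts; real logs will need their own.
PATTERNS = {
    "apache_common": re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    ),
    "key_value": re.compile(
        r'^(?P<time>\S+) level=(?P<level>\w+) msg="(?P<msg>[^"]*)"$'
    ),
}


@dataclass
class IngestResult:
    records: list = field(default_factory=list)   # parsed, well-formed entries
    rejected: list = field(default_factory=list)  # (line_number, raw_line) kept for review


def ingest(lines):
    """Parse each line with the first matching pattern; quarantine the rest."""
    result = IngestResult()
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines silently
        for name, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match:
                record = match.groupdict()
                record["format"] = name  # remember which layout matched
                result.records.append(record)
                break
        else:
            # No known layout matched: do not guess, do not propagate.
            result.rejected.append((number, raw))
    return result
```

A caller would typically index `records` and route `rejected` to a notification or manual-review queue, so that bad lines are visible without contaminating the data store.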

  • Analysis and Exploration Capabilities: The core functionality that enables log content analytics is the ability to search for and pull out the right data at the right time and display it in the right way. A solution must allow for the following:

  • Data exploration: Existing data may be arcane or archaic in its layout, domain knowledge about it may be scarce, and documentation may not even exist. Formulating the right query is therefore challenging because of the limited semantic understanding of the log content. A log content analytics solution must provide a mechanism that allows an end user to understand the data.

  • Exploration guidance: Lacking domain knowledge about a data set makes it difficult to ascertain what insights lie within it and how to extract them. As a result, determining what analysis to perform is non-trivial, which leads to missed insights or increased time to discovery. An ideal solution should provide a mechanism not only to explore the data but also to suggest where to look and what queries to form.

  • Query expressiveness: While exploring the data is critical, being able to express a query succinctly and precisely is also important. If a query language is arcane and difficult to understand, it can inhibit the discovery process. The query language must allow for complex questions while remaining elegant and capable of pulling out information from heterogeneous sources. It should also be easy to debug when there is a problem in the structure of a query.

  • Visualization Expressiveness: In addition to all of the other challenges of log content analytics, knowing which visualizations to use can flummox many. A viable solution must provide a means to visualize the data, whether in a line graph, bar chart, pie chart, etc. In addition to providing dashboarding capabilities, an optimal solution should also provide, or take steps toward, a mechanism for the following:

  • Visualization optimization: A solution needs to provide an end user with feedback as to which visualizations are best for certain sets of data. For example, while a pie chart can be used to display information about a series of numbers, a line graph may be a better representation. This type of guidance can make dashboards come alive.

  • Preprocessing/Postprocessing: Some visualizations may require additional pre- or post-processing of the source dataset, which adds significant human overhead. For example, if a log file has raw sensor data, cleaning and smoothing may be required to remove aberrations and improve the signal-to-noise ratio (SNR). Failure to do so would result in skewed visualizations and other undesirable results (e.g. pops, hisses, spikes).
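As a small illustration of the smoothing step mentioned in the last bullet, here is a Python sketch of a rolling median, one common way to suppress isolated spikes in sensor readings. The function name `rolling_median` and the default window size are assumptions made for this example; the right filter and window depend on the data, and a real solution might use something quite different.

```python
from statistics import median


def rolling_median(values, window=3):
    """Replace each sample with the median of a centred window.

    The window shrinks at the edges of the series so the output has the
    same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


# Hypothetical readings with one obvious spike in the middle.
readings = [20, 21, 99, 20, 22]
cleaned = rolling_median(readings)
```

A median is used here instead of a moving average because a single outlier cannot drag the median far, whereas it would still distort an average over the same window.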

My next blog will review what can be gleaned from log content analytics and the possibilities I see in the future.
