Mining log files yields a wealth of information: event frequencies, alerts, anomalies, and so on. During mining there may be evidence that events recorded in multiple locations are related in some way, and discovering those relations grows more complex as the number of sources increases beyond two. Why is correlating across data sources worth doing? Cross data source correlation can support root cause analysis, reveal data flows, and surface the behavioral connections that exist within complex multisystem environments.
To make this concrete, imagine that every time a particular user logs into a corporate network, there is a detectable increase in the amount of data transferred between separate, independent locations in that network. A VPN server records the login event for any given user, along with a location and timestamp (see the example log in Table 1). Around the same time as the login event, a network monitoring application may record an anomalous event showing a large data transfer between a network asset and an external system, tagged with the same location as the user in the VPN log (see the example log in Table 2). In such a situation, the administrator managing the VPN server can view all successful and unsuccessful login attempts for all users; similarly, the network administrator can see every anomaly recorded in the network logs. The task is somewhat easier if the same administrator investigates both data sources; however, there is no consolidated view of events across the two systems that might indicate a correlation between them, so finding such correlations can become convoluted. Furthermore, each potential correlation carries a strength that is not easily judged by merely perusing the log files. It is left to the administrator to perform the manual searches, timestamp correlations, and comparisons of overlapping log-event features.
Table 1. Sample event-log of a VPN Authentication Server
Table 2. Sample event-log of a Network Monitoring System
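The manual task described above, matching VPN logins against network anomalies that occur close in time, can be sketched in a few lines. The records below are hypothetical stand-ins in the spirit of Tables 1 and 2, and the five-minute window is an illustrative assumption, not a value prescribed by the text.

```python
from datetime import datetime, timedelta

# Hypothetical sample records standing in for Tables 1 and 2
vpn_logins = [
    {"time": datetime(2023, 5, 1, 9, 0, 12), "user": "alice", "location": "Boston"},
    {"time": datetime(2023, 5, 1, 13, 45, 3), "user": "bob", "location": "Denver"},
]
net_anomalies = [
    {"time": datetime(2023, 5, 1, 9, 2, 40), "event": "large_transfer", "location": "Boston"},
    {"time": datetime(2023, 5, 1, 22, 10, 0), "event": "port_scan", "location": "Austin"},
]

def candidate_pairs(logins, anomalies, window=timedelta(minutes=5)):
    """Pair a login with an anomaly when the two fall within `window` of each other."""
    pairs = []
    for login in logins:
        for anomaly in anomalies:
            if abs(anomaly["time"] - login["time"]) <= window:
                pairs.append((login, anomaly))
    return pairs

for login, anomaly in candidate_pairs(vpn_logins, net_anomalies):
    print(login["user"], "->", anomaly["event"])  # prints: alice -> large_transfer
```

In this sketch only alice's login lands within the window of an anomaly; a real deployment would also compare shared fields such as location before treating the pair as a correlation.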
The event correlation mining process can be applied to multiple data sources to automatically detect and extract correlations between two events, given that they occur close in time and share similar features. Tables 1 and 2 have trace entries bolded that correlate with each other but would otherwise need to be manually parsed and connected by an administrator or analyst. These entries are bolded because both temporal proximity and feature overlap affect the strength of a detected correlation, and thus determine its importance. In more complex situations, correlating events across data sources can reveal behaviors that cascade between systems. With a little math, the strength of a link can be computed from the overlap of features and timestamps, and these strength values can be combined with statistics that track the probability of occurrence for any link discovered between data sources.
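One minimal way to turn "overlap of features and timestamps" into a strength value is to weight a Jaccard similarity over shared field values by an exponential decay in the time gap. Both the decay constant `tau` and the choice of Jaccard similarity are illustrative assumptions, not a formula prescribed by the text.

```python
from math import exp

def feature_overlap(a: dict, b: dict) -> float:
    """Jaccard similarity over the string values of two log records."""
    va, vb = set(map(str, a.values())), set(map(str, b.values()))
    return len(va & vb) / len(va | vb) if (va | vb) else 0.0

def link_strength(dt_seconds: float, overlap: float, tau: float = 300.0) -> float:
    """Score a link between two events.

    Temporal proximity decays exponentially with the gap `dt_seconds`
    (`tau` controls how fast), and the result is weighted by the
    feature overlap, so simultaneous events with identical fields
    score 1.0 and distant, dissimilar events score near 0.0.
    """
    return exp(-dt_seconds / tau) * overlap

login = {"user": "alice", "location": "Boston"}
anomaly = {"event": "large_transfer", "location": "Boston"}
score = link_strength(148, feature_overlap(login, anomaly))
```

Counting how often a given link recurs relative to its source events would then give the probability-of-occurrence statistic mentioned above.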
Correlating information across multiple data sources is important because it detects complex application behaviors, aids root cause analysis, and can uncover previously unknown behaviors. Such anomalies and unknown behaviors may indicate a threat, or an error in an application as it logs information. If you need to understand complex relationships in your data, detect these correlations, and want to see a working example, contact the TechLabs Data Insight group for more information.