Whitepaper: Inertia in Processing Machine Generated Data
The most common kind of machine generated data is logs. They are generated by a vast number of different applications. There is a wide variety of applications even in the typical and relatively constrained enterprise server environment: firewalls, applications providing main business functionality, databases, DHCP servers, web gateways, etc. On top of that come applications running on the client side: workstations, mobile phones, the Internet of Things. The logs of these applications contain a tremendous amount of information, such as the behaviour of customers. They also record a lot about the surrounding environment. Yet logs are rarely used to extract information. Why?
It turns out that extracting useful information from logs is not easy. In fact, obtaining data and preparing it for analytics is a complicated and costly process. The aim of this article is to describe these complexities and bring out the core reasons behind the phenomenon of inertia of machine generated data. Hopefully, it will help readers orient themselves in the rapidly changing landscape of existing and emerging machine data analytics tools.
Basic Differences in Machine Generated Log Analysis and Traditional Data Warehousing
It is important to understand the nature of log data in order to apply appropriate methods to its processing. We compare the most important characteristics of machine generated data with the data used in traditional data warehousing and relational databases since a lot of the methods used to process logs are rooted in these technologies.
Amount of Data
Logs are often referred to as Machine Generated Big Data. This is not without reason since their volume tends to be immense. There are several reasons for that. First, the data used in traditional data warehousing is usually called ``transactional'' data - i.e. it captures details of a certain event. The events are chosen carefully and mostly represent important steps of a system flow. Equally well chosen are the captured details: usually only those providing direct business value are recorded. These details are encoded in a consistent manner. This sparsity is mainly driven by business continuity and performance optimisation.
Logging, on the other hand, is generally considered outside of those constraints (there are some exceptions as usual). Its purpose in many cases is to record the behaviour of a system as a whole. This means a great number of events and a great amount of context and information.
Additionally, encoding practices for logs are significantly looser, which results in a low density of information per recorded bit. For example, it is very common to see textual labels such as DEBUG, INFO, TRACE, etc. instead of the same information encoded as an integer value.
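The density argument above can be illustrated with a minimal sketch. The mapping of level names to integer codes is a hypothetical one chosen for this example; real applications define their own.

```python
import struct

# Hypothetical level codes, assumed for illustration only.
LEVEL_CODES = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4}

def text_size(level: str) -> int:
    """Bytes needed to store the level as ASCII text."""
    return len(level.encode("ascii"))

def packed_size(level: str) -> int:
    """Bytes needed to store the same level as one unsigned byte."""
    return len(struct.pack("B", LEVEL_CODES[level]))

for level in LEVEL_CODES:
    print(level, text_size(level), packed_size(level))
```

The textual form costs several bytes per record for information that fits into a single byte, which is the trade-off logs typically make in favour of human readability.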
Finally, the diversity of sources generating logs is orders of magnitude higher than that of sources generating dedicated ``transactional'' data, especially considering the massive expansion of the Internet of Things.
Volatility
The volatility of data expresses the likelihood of changes in the structure of data.
In software development, logging is generally considered a way to record information for troubleshooting - i.e. debugging, optimising performance, etc. This information is of secondary importance for business functionality. Being logged, this information is not constrained by performance or continuity requirements and is therefore also more open to changes in general.
Another aspect of logs is that they are intended to be used and read by humans. Consequently, it is not that important for a developer to follow a format and structure in data because humans can interpret unstructured information from context. This results in much more flexible additions and changes to data elements compared to modifications in business critical information. That is, data elements in logs can be added, changed or removed at any time without much concern for dependencies. Furthermore, it is not only about the developers. Changes in logs can also appear after operations staff modify the logging configuration. Such modifications often fall outside of change control procedures, further increasing the likelihood of changes in logs.
In some strictly regulated areas (such as finance, e-commerce, telecom) logging is also used to provide an audit trail. Compared to debug logs, more emphasis is put on the success of recording and retaining. Even so, additions or format changes are still treated more lightly than changes to business critical transactional information.
Changes could also be unintentional - i.e. errors made by developers or caused by the runtime environment. When importing data into a warehouse, a lot of emphasis is usually put on ensuring data integrity. This is not always the case when collecting logs. Quite often the volume of logs is immense and UDP is used to increase network throughput. This sacrifices the integrity of data.
Considering the growing number of cyber attacks, changes in data structure could also be caused by log poisoning attacks, i.e. intentional manipulation of logs by a malicious third party to conceal, remove or change information in logs. This is why it is important to be able to detect unexpected changes in logs.
Near Realtime Data Feed
In traditional data warehousing, external data is usually loaded into a database in batches. However, several significant log analytics use cases are based on realtime events: monitoring IT system performance, security and fraud prevention, etc. Analysing realtime events in batches may cause losses, from financial to reputational, because of a delayed response.
Volatility can also stem from data storage location. In relational databases and data warehouses the data is imported into a database - i.e. it resides on the machines where the database is deployed. Logs are just raw data entities stored in a file system or cloud storage and are therefore easy to relocate. It is not uncommon to retain the most recent logs on a server intended for processing ``hot'' data while less recent logs are stored on NAS, in the cloud or even on tapes.
For businesses with a global outreach it is common practice to retain logs of an application in different geographical locations, i.e. close to the location of applications producing those logs.
Complex Structure
The structure of data is mostly dictated by its purpose. We have already mentioned that business critical data is well chosen, consistently encoded and well structured, as its purpose is to provide business functionality.
The purpose of logs has been and largely still is quite different. As mentioned in the volatility section above, logs are for people: developers and operations staff reading debug logs for troubleshooting, customer support solving customer complaints, etc. Again, because humans can easily understand unstructured information, log structures are defined and followed more loosely.
Let's now have a look at log processing methods.
Collection and Storage
First of all, logs must be generated by applications. They are usually written to the local storage of the host where the application is running. Alternatively, they are sometimes written directly to a log collection framework interface. This does not happen by itself - it needs to be configured and set up. This is a one time effort but it needs to be repeatable - usually one of the duties of operations or devops staff.
In the case of application logs, there is also the question of what gets written to a log and in which form. This is not as simple as it may seem as there are a number of factors to consider: the structure of data must be kept consistent, application developers must do that uniformly, privacy must be maintained and malicious attacks on logs must be prevented, just to name a few.
Next, logs must be collected from source hosts for storage and/or forwarding to analytics systems. Log volume, network throughput, security and data integrity during transport must be considered when selecting the optimal solution and maintaining the collection process. Note that storing data for retention and also forwarding it for analysis leads to data duplication.
Preparing Data for Analysis
In order to analyse data, two basic problems must be addressed:
- What are the semantics of the data? I.e. what are the data elements and what are their types? In order to compute meaningful information (such as counts, sums, averages and other computations) we need to extract data elements and assign them a type.
- How do we process data in a scalable manner? I.e. how do we process an increasing amount of data within reasonable time? This is usually solved by using a commonly available distributed processing platform (such as Hadoop) or a custom built distributed processing solution. In either case, the data must be loaded into the analytics system.
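The first problem above, extracting data elements and assigning them a type, can be sketched as follows. This is a minimal illustration and not a reference parser: the regular expression is our own approximation of the Apache common log format, the field names are chosen for this example, and the sample line is a made-up record in that format.

```python
import re
from datetime import datetime

# Approximate pattern for the Apache common log format (an assumption
# of this sketch, not an official grammar).
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Extract fields from one line and assign each a type.

    Returns None when the line does not match the pattern.
    """
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "host": fields["host"],
        "user": None if fields["user"] == "-" else fields["user"],
        "time": datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "request": fields["request"],
        "status": int(fields["status"]),
        "size": 0 if fields["size"] == "-" else int(fields["size"]),
    }

sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_line(sample)
print(record["status"] + 1, record["time"].isoformat())
```

Only after this step can `status` be counted or `size` summed as numbers rather than compared as text.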
The steps for solving the problems above are known as the Extract, Transform and Load (ETL) process. In data warehousing, ETL is applied to infrequent batches of data with low volatility and a well defined structure (compared to the near-realtime feed of log events). Clearly, applying this approach to log processing creates issues.
First, all data objects need to be fully mapped before extracting. This has to be carefully thought through, since adding a field you forgot to consider initially, or correcting a parsing error, would require repeating the ETL process (recall the volatility and the complex structure of logs discussed above). With large data volumes, this can be a very time-consuming and resource-hungry process.
Secondly, when the analytics tool does not allow computing the desired information from input data, additional data must be added during ETL. For instance, if geolocation information needs to be derived, the appropriate data elements have to be computed in advance for IP addresses during ETL. This could significantly limit the analysis queries because they need to be planned ahead and the additional data pre-computed. Therefore this approach is viable only for regular queries on stable data.
Loading large amounts of data can cause significant delays in the availability of data for analysis. For instance, consider a security incident where the responders' team urgently needs logs from a host where log collection has not been set up. Depending on the volume of logs, simply copying these could take hours. When adding the time of procedural access control, the result could be days.
The setup and operation of a horizontally scalable platform requires a specific skillset. It is relatively easy to set up a basic configuration of Hadoop. However, going beyond that to meet real-life demands is considerably more complex.
As we could see from previous sections the process of preparing data for analytics is complicated. It consists of several sequential steps which must be completed flawlessly in order to succeed. The complexity grows as the number of different log sources increases.
The preparation process also requires staff with varying skillsets: operational people must be familiar with log collection and storage, while people involved in ETL must know the structure of data and be familiar with importing it to the analytics tool.
Collecting, storing/forwarding and preparing machine generated data for analytics (collectively called Log Management) is a continuous process. The most important criterion for this process to be sustainable is the ability to manage changes. To be more precise, this means minimising errors arising from complexity and maximising the effectiveness of people with specific skillsets. The common approach that enterprises implement to tackle this is centralisation - i.e. using dedicated staff for operating and implementing changes under well defined change control procedures. While providing efficient use of resources and minimising the likelihood of errors in the process, it is also an inherent bottleneck in the workflow.
Changes in logs can appear for various reasons: due to changes in applications (new functionalities being implemented, bugs fixed), new data elements being added or new record types introduced, record structure getting broken due to developer mistakes or transport being unreliable during log collection. The bottleneck introduced by change control procedures makes it painful to implement desired changes. This is especially relevant in medium and big enterprises, where log consumers (analysts) and log generators (developers) greatly outnumber the Log Management operational team, so that the overall number of changes reaches the throughput limit of change control. It is all too common to meet situations where a change could be simple and easy to implement but nevertheless takes a lot of time to execute.
In practice, change control procedures spend most of their time verifying data formats and structure because an error would result in repeating the ETL process. Therefore, comprehensive test and verification routines are commonplace. Not surprisingly, long implementation cycles appear as a result. It is important to understand that the core reason for this problem is and therefore errors becoming permanent
Reacting to errors in logs is slow for systems using ETL. Fixing errors is complex and slow manual work, so faulty data is often simply discarded and the end users (i.e. the analysts) are notified. If erroneous records are dropped silently, the end users are left completely unaware of the parsing errors, which is even worse: potentially unreliable analysis results will be perceived as valid.
It is also important for enterprises to retain the knowledge on data formats and structure. Quite often the data is extracted and transformed using custom written parser scripts and applications. The variety is high, as the selection is dictated by the skills of the particular employee responsible for that specific task. This in turn causes a high cost of retaining knowledge due to the learning curve for successor employees. Alternatively, the company has to pay a higher cost for new employees with the same skillset. Also, the deployment is often poorly documented.
Complexities and Limitations Arising From Data Volume
Assigning types to data elements increases the overall amount of resulting data, as the information regarding type has to be retained. The simplest approach, assigning a type to each of the data elements, makes the difference significant. For example, 1GB worth of Apache common log shrinks to 200MB using gzip compression. When converting it to JSON (as happens when importing into Elasticsearch) the size inflates to 2GB. That is a staggering tenfold difference compared to the compressed original. This is generally disregarded as a problem since storage is nowadays considered cheap. However, in machine generated data analytics the size translates directly into processing cost: instead of being able to query a year's worth of data you are limited to only a few months.
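The direction of this effect can be reproduced with a small sketch. The log line, the JSON field names and the repetition count are assumptions made for this example. Because the same line is repeated, gzip compresses it far better than it would real logs, so the printed sizes only show the ordering of the three representations and are not comparable to the figures quoted above.

```python
import gzip
import json

# One made-up record in Apache common log format.
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')

# The same record as a JSON document; field names are hypothetical.
doc = {
    "host": "127.0.0.1", "ident": "-", "user": "frank",
    "timestamp": "2000-10-10T13:55:36-07:00",
    "method": "GET", "path": "/apache_pb.gif", "protocol": "HTTP/1.0",
    "status": 200, "size": 2326,
}

raw = "\n".join([line] * 1000).encode("utf-8")
as_json = "\n".join([json.dumps(doc)] * 1000).encode("utf-8")

sizes = {
    "raw": len(raw),
    "raw_gzip": len(gzip.compress(raw)),
    "json": len(as_json),
}
for name, size in sizes.items():
    print(f"{name:9s} {size:8d} bytes")
```

Every JSON record repeats the field names, which is where most of the inflation in this sketch comes from.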
Another example of the dilemma in making a compromise between data volume and query performance is given by Google BigQuery:
BigQuery can load uncompressed files significantly faster than compressed files due to parallel load operations, but because uncompressed files are larger in size, using them can lead to bandwidth limitations and higher Google Cloud Storage costs. For example, uncompressed files that live on third-party services can consume considerable bandwidth and time if uploaded to Google Cloud Storage for loading. It's important to weigh these tradeoffs depending on your use case.
In general, if bandwidth is limited, gzip compress files before uploading them to Google Cloud Storage. If loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed.
Finally, all commercially available analytics tools are priced by the volume of data. In our opinion this is one of the major reasons why log data is being used in minimal amounts and only for critical purposes.
Inertia of Machine Generated Data
When thinking about problems and complexities in processing machine generated big data, an analogy with inertia in Newtonian physics arises. Inertia is defined as the resistance of a physical object to any change in its state of motion, and is considered to be the primary manifestation of mass. Similarly, in the process of deriving information from logs we can observe inertia - i.e. the resistance of machine generated data to releasing information, arising from its properties and processing methods. In other words, inertia is a set of factors that make log processing hard:
- batch mode ETL causes i) slow onboarding of new data and slow implementation of changes; ii) low visibility of changes in data; iii) high cost of retaining knowledge of volatile and complexly structured data
- limitations on data volume cause reduced analysis scope
- large volume of data causes i) slow onboarding of new data due to slow data transfer and duplication of storage; ii) high cost of scalability
- geographical distribution of data causes high cost of scalability
In our opinion the first two are the main reasons for not using logs for extracting information according to their potential. It is simply not worth it as the batch-mode ETL makes the whole process slow, complex and expensive to implement and maintain. Even when using analytics tools free of batch mode ETL, the amount of data is still limited, therefore providing inadequate value for investment. When counting the difficulties created by the great volume of data it becomes apparent that extracting new information
How To Fight Inertia?
If we could reduce the resisting effect of factors contributing to inertia, we could probably increase the likelihood of machine generated big data usage. How cool would that be?
A Different Kind of ETL
We cannot get rid of ETL completely - data elements must be extracted and types assigned for performing any computational analysis. However, performing ETL at query time and retaining source data in its original form makes a huge difference:
- source data is no longer modified in the course of ETL, since extraction and transformation are performed at query time by instructions captured in a pattern
- the impact of changes in the pattern (mistakes, errors, new data elements, etc.) is reduced close to zero, as the impact surface shrinks to the query and its results instead of all the underlying data
- the impact of changes in data is significantly reduced, as correcting the pattern does not imply expensive re-loading procedures: just re-run the query with a corrected pattern to get correct results
Therefore, as the impact of changes in data or the format is significantly reduced, the need for change control procedures when loading data to the analytics tool is eliminated (recall that most time is spent on verifying the transform and extract patterns). Data can be collected as is and all changes in it are handled by the end users themselves. This could reduce the change implementation time dramatically, in some cases even by orders of magnitude.
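The idea of query-time ETL can be sketched in a few lines. This is an illustration under our own assumptions, not a description of any particular product: the raw lines, the `count_by_level` helper and both patterns are made up for this example, and regular expressions stand in for the pattern language.

```python
import re
from collections import Counter

# Raw log lines are kept exactly as collected (sample data for this sketch).
RAW_LINES = [
    "2024-01-01 10:00:00 INFO login user=alice",
    "2024-01-01 10:00:05 ERROR login user=bob",
    "2024-01-01 10:00:09 INFO logout user=alice",
]

def count_by_level(lines, pattern):
    """Extract the level at query time and count records per level."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        match = regex.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts

# First attempt: a mistaken pattern that only recognises INFO records.
first = count_by_level(RAW_LINES, r"\S+ \S+ (?P<level>INFO) ")

# Corrected pattern: nothing is reloaded, the query is simply re-run.
second = count_by_level(RAW_LINES, r"\S+ \S+ (?P<level>[A-Z]+) ")

print(dict(first), dict(second))
```

The mistake in the first pattern affects only that query's result. The raw lines are untouched, so the fix costs one more query run instead of a repeated load.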
Next, retaining knowledge of data structures becomes more uniform, therefore reducing costs. Using a pattern matching language (such as regular expressions) makes the technology more uniform and narrows the skillset required of employees. Also, pattern scripts can be managed (in terms of storage and deployment) in a uniform way compared to the multitude of parsing scripts or custom written programs.
ETL performed at query time also allows reporting data accuracy directly to the user in the scope of the query. Adopting the assumption that data does contain errors/changes, the analyst needs to know how reliable the results of a query are. For instance, if fewer than 3% of records failed while parsing, the results of a query can still be considered valid. A rate over 10% could be an indicator of something having changed in the structure of data.
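Reporting accuracy alongside the result might look like the sketch below. The function names, the sample lines and the pattern are hypothetical. The 3% and 10% thresholds come from the text above, while the treatment of the range in between is our own assumption, since the text does not cover it.

```python
import re

def parse_with_accuracy(lines, pattern):
    """Parse lines at query time and return (records, failure_rate)."""
    regex = re.compile(pattern)
    records, failed = [], 0
    for line in lines:
        match = regex.match(line)
        if match:
            records.append(match.groupdict())
        else:
            failed += 1
    total = len(records) + failed
    failure_rate = failed / total if total else 0.0
    return records, failure_rate

def assess(failure_rate, ok_below=0.03, suspect_above=0.10):
    """Turn a failure rate into a hint for the analyst."""
    if failure_rate < ok_below:
        return "results usable"
    if failure_rate > suspect_above:
        return "structure may have changed"
    return "review failed records"  # in-between range: assumed behaviour

SAMPLE = [
    "10:00:00 INFO started",
    "10:00:01 INFO ready",
    "10:00:02 level=ERROR msg=failed",  # a format the pattern does not know
    "10:00:03 INFO stopped",
]
records, rate = parse_with_accuracy(
    SAMPLE, r"(?P<time>\S+) (?P<level>[A-Z]+) (?P<msg>.*)"
)
verdict = assess(rate)
print(len(records), rate, verdict)
```

The failed records are counted rather than dropped silently, so the analyst sees how much of the input the result actually covers.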
Getting Around Data Volume Issues
By now it should be quite apparent that machine generated big data is ``heavy'' - there is lots of it in the first place, it gets easily inflated and in the end you might even end up with duplicates. This is why the scope of queries is usually quite small.
All these concerns would disappear if the analytics tool could run queries directly on the original compressed data. Effectively, we would bring the analytics tool to the data as opposed to moving the data.
However, the implementation of this seemingly simple idea needs to overcome several problems:
- As elsewhere, scalability must be addressed. Parallelisation with distributed computing should support processing in a way that would not require specific knowledge and experience for setup and maintenance. (According to Gartner's research on SIEM Critical Capabilities, the simplicity of deployment and support is of primary importance in Basic Security Monitoring.)
- Obviously, files are likely to be too large to process in one piece and must be split into smaller fragments. How to implement random access and parallelisation when decompressing compressed streams?
- Files are subject to change. They are compressed, rotated, renamed, etc. During this, their content can change, and their size can grow and also shrink. How to handle this?
- How to parse data more efficiently, so as not to spend 50% of CPU time on extracting and transforming data elements?
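One possible angle on the random access question above is sketched here. It relies on an assumption that often does not hold in practice, namely that we control how the file is compressed: the data is written as a series of independent gzip members with an index of their offsets. An ordinary single-stream gzip file cannot be split this way. The helper names and the sample data are hypothetical, and this is not a description of how any particular product solves the problem.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def compress_in_members(lines, lines_per_member):
    """Compress lines as concatenated gzip members and index their offsets."""
    blob = bytearray()
    index = []  # (offset, length) of each member inside blob
    for start in range(0, len(lines), lines_per_member):
        chunk = "".join(lines[start:start + lines_per_member]).encode("utf-8")
        member = gzip.compress(chunk)
        index.append((len(blob), len(member)))
        blob += member
    return bytes(blob), index

def count_matches(blob, span, needle):
    """Decompress one member on its own and count lines containing needle."""
    offset, length = span
    text = gzip.decompress(blob[offset:offset + length]).decode("utf-8")
    return sum(1 for line in text.splitlines() if needle in line)

# Sample data: every tenth line is an ERROR record.
lines = [
    f"2024-01-01 host{i % 3} {'ERROR' if i % 10 == 0 else 'INFO'} event {i}\n"
    for i in range(1000)
]
blob, index = compress_in_members(lines, 100)

# Each member is processed independently, so the work can be spread out.
with ThreadPoolExecutor() as pool:
    partial = list(pool.map(lambda span: count_matches(blob, span, "ERROR"), index))
total_errors = sum(partial)
print(len(index), total_errors)
```

The concatenated members still form a valid gzip stream, so tools that read the file sequentially keep working, while the index lets a query jump straight to any fragment.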
Naturally, the list does not stop here. Admittedly, these problems are hard to solve. But they can be solved. SpectX is an analytics platform for machine generated big data. It implements all these ideas and many more.