Preparing unstructured data for analytics
How rich is your source data?
A data warehouse is an important tool for gaining valuable insights into your business. It allows you to analyse end-user behaviour, causal relations, etc. using traditional statistical methods as well as emerging machine learning techniques. However, the quality of the analysis results depends heavily on how detailed and contextual (i.e. rich) the data is. Data from traditional transactional databases is very restricted in this respect. Therefore, modern techniques concentrate on importing data directly from application APIs (here’s a good overview of traditional vs modern ETL process). There is a third option for enriching the data: including application logs in the analysis scope. Unfortunately, this happens very rarely. Why?
Dealing with change (when there are rules…)
It is very difficult to manage changes in the structure of source data. Extraction, Transformation and Loading (ETL) processes are quite complex because the goal is to avoid loading misinterpreted data into the warehouse. Mistakes happen, and recovering from them is expensive: the faulty data has to be identified and corrected, and the entire dataset reloaded. One way to avoid this is to apply rigid rules for input data change management.
Compared to exports from databases, application APIs are loosely regulated. Even so, mitigating their changes takes significant time and effort. That means adapting the importing application to the API change or, in the worst-case scenario, correcting erroneous data already imported into the data warehouse.
However, the volatility of APIs cannot be compared to that of logs. Log formats are usually based on the needs, ideas, (secret) desires and passions of application developers. Additionally, as logs represent unstructured or at best semi-structured data, detecting changes in them is even more difficult. As a rule, this results in erroneous data being imported into the warehouse. In other words, the worst-case scenario.
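As an illustration of the rigid input rules mentioned above, here is a minimal Python sketch of a check that could run on each record before it is loaded. It is only a sketch under stated assumptions: the field names, the types, and the `schema_violations` helper are hypothetical and not taken from any particular ETL tool.

```python
# Hypothetical expected schema for one imported record: field name -> type.
EXPECTED_FIELDS = {"order_id": int, "customer": str, "amount": float}


def schema_violations(record, expected=EXPECTED_FIELDS):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    # Every expected field must be present and have the expected type.
    for name, expected_type in expected.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    # Fields the schema does not know about signal a change in the source.
    for name in record:
        if name not in expected:
            problems.append(f"unexpected field: {name}")
    return problems
```

A loader could refuse to import any batch for which this list is non-empty, which trades flexibility for protection against misinterpreted data.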
Using the richest data of them all
SpectX is designed for instant structured analysis of unstructured data. It extracts and transforms data elements defined by a virtual structure at query time. This is complemented by built-in support for detecting changes in the unstructured source data. Both of these can be run independently over the SpectX RESTful API, which makes it easy to implement different workflows for handling errors. For example, you can detect changes before executing extract and transform. Fixing these errors is a matter of minutes: adjust the virtual structure or add another one for the source data.
All these features make a huge difference for organisations aiming to improve the quality of the data they base their insights on. With SpectX, even the most volatile application logs can be added to the analysis scope.
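The "detect changes before executing extract and transform" workflow can be sketched in plain Python. This sketch does not use SpectX or its API. It assumes a hypothetical log format and stands in for a virtual structure with an ordinary regular expression; the names `LINE_PATTERN`, `split_by_structure` and `extract_if_stable` are illustrative only.

```python
import re

# Hypothetical stand-in for a virtual structure: one regex with named groups
# describing the expected shape of a single log line.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) user=(?P<user>\S+) action=(?P<action>\S+)$"
)


def split_by_structure(lines, pattern=LINE_PATTERN):
    """Return (parsed_records, unmatched_lines) for the given log lines."""
    parsed, unmatched = [], []
    for line in lines:
        match = pattern.match(line)
        if match:
            parsed.append(match.groupdict())
        else:
            unmatched.append(line)
    return parsed, unmatched


def extract_if_stable(lines, max_unmatched_ratio=0.0):
    """Return parsed records only if few enough lines deviate from the structure."""
    parsed, unmatched = split_by_structure(lines)
    ratio = len(unmatched) / len(lines) if lines else 0.0
    if ratio > max_unmatched_ratio:
        # Stop before anything is loaded, so no misread data reaches the warehouse.
        raise ValueError(
            f"{len(unmatched)} of {len(lines)} lines do not match the expected structure"
        )
    return parsed
```

When the check fails, the unmatched lines returned by `split_by_structure` show what the new format looks like, so the pattern can be adjusted, or a second one added, before the extraction is retried.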