Machine Learning Over Profiling

Discover how machine learning techniques can be applied to data profiling metrics. Explore the challenges of the process, including the choice among different model families: statistical, machine learning, and deep learning. See how anomaly detection, a crucial aspect of data profiling, enables proactive management of data quality and data health.


Overview

With the increasing complexity of data stacks and the sheer volume, velocity, and variety of data generated and collected, data observability has become a necessity for businesses to proactively identify and address data issues, ensuring accuracy and reliability. One of the key aspects of data observability is data profiling, because it provides insights into the contents and shape of data, which is essential for monitoring and analyzing data health.

Blindata’s profiling offers a framework for deploying effective SQL-based profiling, built from default or custom metrics that fit every scenario. Through automated executions, each profiling metric stores its historical values as a time series. This article explores how Blindata detects data anomalies by analyzing the time series generated by its profiling metrics.

Data Observability Profiling Screenshots

What is an anomaly?

In broad terms, an anomaly can be described as “a significant variation from the norm”. In other words, an anomaly is a data point that diverges from the behavior of a modeled system. Detecting anomalies requires specific algorithms, whose choice depends on both the structure of the data and its statistical characteristics. In the case of time series, one approach is to use forecasting techniques to estimate the expected values of future data points and then compare them with the actual values to identify anomalies, i.e., deviations from the expected pattern.
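As a toy illustration of this idea, a rolling mean and standard deviation can stand in for a forecast: each new point is compared against a band built from its recent history. This is a simplified sketch, not Blindata’s detection logic; the function and parameter names are illustrative.

```python
import numpy as np

def rolling_band_anomalies(series, window=5, k=3.0):
    """Flag points falling outside a rolling mean +/- k*std band.

    The band built from the previous `window` points plays the role
    of the forecast's expected range (hypothetical helper).
    """
    flags = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = history.mean(), history.std()
        lower, upper = mu - k * sigma, mu + k * sigma
        flags.append(not (lower <= series[i] <= upper))
    return flags

# A stable metric with one sudden spike
values = np.array([10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 55.0, 10.0])
flags = rolling_band_anomalies(values)
```

Only the spike at 55.0 falls outside its band; note how the spike then inflates the band for the following point, which is why real pipelines exclude detected anomalies from subsequent training.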

Time Series Forecasting

Time series forecasting rests on the assumption that the future of a time series is influenced by its past; in other words, prediction is possible only if the underlying process is non-random. Fortunately, data profiling metrics derive from processes that can be defined and modeled. Even so, time series forecasting remains a challenging task that requires careful consideration of several factors.

Training Data

Time series can have missing values, caused for example by an error during data profiling. They can also be noisy, contain outliers (anomalies), or have irregular time intervals between data points. Blindata manages all of this through automated data preparation and the massive action features offered to the end user, which allow easy and intuitive exclusion of portions of data from model training.
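The two preparation steps described above can be sketched as follows. This is a minimal illustration, assuming linear interpolation for missing points; the function name and the `excluded` parameter are hypothetical, not Blindata’s actual API.

```python
import numpy as np

def prepare_training_series(values, excluded=()):
    """Fill missing points by linear interpolation, then drop
    user-excluded indices (e.g. known anomalies) before training.
    Illustrative only; Blindata's preparation pipeline is internal.
    """
    arr = np.asarray(values, dtype=float)
    idx = np.arange(len(arr))
    # interpolate gaps left by failed profiling runs
    missing = np.isnan(arr)
    arr[missing] = np.interp(idx[missing], idx[~missing], arr[~missing])
    # exclude portions of data the user flagged (massive actions)
    keep = np.ones(len(arr), dtype=bool)
    keep[list(excluded)] = False
    return arr[keep]
```

For instance, a series with one failed run (`NaN`) and one excluded outlier at index 3 comes out gap-free and outlier-free, ready for model fitting.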

Model Selection

There are many models to choose from, and they can be grouped into three main categories: statistical, machine learning, and deep learning models.

Several statistical models can be used for time series forecasting, including the moving average (MA), autoregressive moving average (ARMA), autoregressive integrated moving average (ARIMA), seasonal ARIMA (SARIMA), vector autoregressive (VAR), and vector error correction (VECM) models. These models are mathematical representations of the underlying relationships between the variables of a time series and are used to make predictions about future values.
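To make the autoregressive idea concrete, the simplest member of this family, an AR(1) model, can be fitted by ordinary least squares in a few lines. This is a didactic sketch (function names are illustrative), not the estimator Blindata uses.

```python
import numpy as np

def fit_ar1(series):
    """Fit an AR(1) model x_t = c + phi * x_{t-1} by least squares."""
    x_prev, x_next = series[:-1], series[1:]
    A = np.column_stack([np.ones_like(x_prev), x_prev])
    (c, phi), *_ = np.linalg.lstsq(A, x_next, rcond=None)
    return c, phi

def forecast_ar1(series, c, phi, steps=3):
    """Iterate the fitted recurrence to produce a multi-step forecast."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # a simple upward trend
c, phi = fit_ar1(series)
future = forecast_ar1(series, c, phi)
```

On this trend the fit recovers `c = 1, phi = 1` exactly, so the forecast simply continues the +1 increments; richer models like SARIMA add differencing and seasonal terms on top of this same recurrence idea.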

Machine learning models used for time series prediction include Support Vector Machines (SVM), Random Forests, Decision Trees, and Gaussian Process Regression. These models can capture complex patterns in time series data and make accurate predictions about future events, but they require a large amount of data and substantial computational resources to train and optimize.
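Because these are general-purpose regressors, a time series must first be reframed as a supervised learning problem: each row holds the previous few values (lags), and the target is the next one. The sketch below uses scikit-learn’s Random Forest as an example; the helper name and window size are illustrative, not Blindata’s implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_supervised(series, n_lags=3):
    """Turn a 1-D series into (lag-feature, next-value) training pairs."""
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])
        y.append(series[i])
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 6 * np.pi, 60))  # a seasonal-looking metric
X, y = make_supervised(series)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# one-step-ahead prediction from the last observed window
next_value = model.predict(series[-3:].reshape(1, -1))[0]
```

Note one design limitation of tree ensembles: their predictions are averages of training targets, so they cannot extrapolate beyond the value range seen during training, which matters for trending metrics.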

Deep learning models are machine learning models that use neural networks to analyze historical data and make predictions about future events. They include the Multi-Layer Perceptron (MLP), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and Transformers. They are particularly good at capturing long-term dependencies in the data and can handle large and complex datasets.

In general, statistical models tend to do well on short-term forecasts, while deep learning models are better at capturing the long-term characteristics of the data. Machine learning models sit somewhere in between and can be used for both short-term and long-term forecasts.
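The simplest neural architecture in the list above, the MLP, can be trained on the same lag-feature framing used for the other regressors. This is an illustrative scikit-learn setup, with hypothetical hyperparameters, not Blindata’s internal configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

series = np.sin(np.linspace(0, 6 * np.pi, 120))
n_lags = 5
# lag features: each row is a sliding window, the target is the next value
X = np.array([series[i - n_lags:i] for i in range(n_lags, len(series))])
y = series[n_lags:]

mlp = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=3000,
                   random_state=0).fit(X, y)
next_value = mlp.predict(series[-n_lags:].reshape(1, -1))[0]
```

Unlike tree ensembles, a trained network can extrapolate beyond the observed value range; recurrent (LSTM) and attention-based (Transformer) models extend this idea by modeling the window’s ordering explicitly rather than treating lags as independent features.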

Choosing a model from these categories requires expert-level analysis of each time series, a tedious task that does not scale. Blindata removes this burden from the user by applying AutoML techniques: it automatically selects the best forecasting model using genetic programming optimization, and the chosen model is then used to predict the time series’ future values.
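The core of any such selection is scoring candidate models on held-out history and keeping the best one. The sketch below shows that principle with three toy forecasters; it is a simplified stand-in for Blindata’s genetic-programming search, and all names are illustrative.

```python
import numpy as np

# Candidate forecasters: each maps a training prefix to a one-step prediction.
candidates = {
    "naive": lambda s: s[-1],        # repeat the last value
    "mean":  lambda s: s.mean(),     # historical mean
    "drift": lambda s: s[-1] + (s[-1] - s[0]) / (len(s) - 1),  # linear drift
}

def select_model(series, holdout=5):
    """Return the candidate with the lowest one-step-ahead error
    over the last `holdout` points (simplified model selection)."""
    errors = {}
    for name, forecast in candidates.items():
        errs = [abs(forecast(series[:i]) - series[i])
                for i in range(len(series) - holdout, len(series))]
        errors[name] = np.mean(errs)
    return min(errors, key=errors.get)

trend = np.arange(20, dtype=float)  # a steadily growing metric (e.g. row count)
best = select_model(trend)
```

On a steadily growing series the drift forecaster wins, since it is the only candidate that follows the trend; a genetic-programming search generalizes this loop by also evolving model structure and hyperparameters instead of scoring a fixed list.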

Furthermore, Blindata optimizes the training and forecasting phases, tailoring them to the characteristics of the time series. For instance, if a profiling metric is a constant value, the forecast will be extended, and training will occur less frequently compared to a metric that is highly variable.
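A heuristic of this kind can be sketched as a function of the metric’s variability. This is purely illustrative, with made-up thresholds; Blindata’s actual scheduling logic is internal.

```python
import numpy as np

def retrain_interval(series, base_days=1, max_days=30):
    """Days until the next training run: near-constant metrics are
    retrained rarely, volatile ones often (hypothetical heuristic)."""
    if np.ptp(series) == 0:          # constant metric: extend the forecast
        return max_days
    # coefficient of variation as a simple volatility measure
    cv = series.std() / (abs(series.mean()) + 1e-9)
    return max(base_days, int(max_days * np.exp(-10 * cv)))

constant = np.full(10, 7.0)                      # e.g. a fixed column count
volatile = np.array([1.0, 100.0, 3.0, 90.0, 5.0])  # a wildly swinging metric
```

The constant metric gets the maximum interval, while the volatile one is retrained every day, concentrating compute where forecasts go stale fastest.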

Data Observability Time Series Forecasting Screenshots

Anomaly Detection

The forecast of a data profiling metric is composed of an estimated value and a range that expresses the uncertainty of the prediction. When a new data point is collected, it is compared against the range of expected values for the metric. If the data point falls outside of the expected range, then it indicates an anomaly that needs to be further investigated.
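The comparison itself reduces to a range check against the forecast’s uncertainty bounds, as in this minimal sketch (function name and example numbers are illustrative):

```python
def is_anomaly(actual, lower, upper):
    """True when a newly collected profiling value falls outside
    the forecast's expected range [lower, upper]."""
    return not (lower <= actual <= upper)

# e.g. a table's row count was forecast in the 900-1100 range
spike_flagged = is_anomaly(1450, 900, 1100)   # unexpected spike
normal_passed = is_anomaly(1020, 900, 1100)   # within expectations
```

In practice the width of the range controls the sensitivity/false-positive trade-off: a band derived from a high-confidence prediction interval flags only strong deviations.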

Data Observability Anomaly Detection Screenshots

Conclusion

Blindata’s capabilities for time series forecasting and anomaly detection through automated data profiling provide a powerful and scalable solution for monitoring data quality and health over time. Blindata takes the complexity out by automatically selecting the optimal forecasting model, preprocessing data, and detecting deviations from expected patterns. This allows businesses to more easily monitor their data for anomalies or issues without requiring deep expertise in time series or machine learning.

With Blindata, data observability becomes an effortless process that ensures organizations always have accurate insights into their data and are promptly alerted to any problems, giving them the ability to proactively address data quality issues before they impact key business objectives.