Isn’t it ironic that we expect applications and infrastructure to have service level agreements (SLAs) but don’t have the same expectations of data-intensive applications? Developers and site reliability engineers diligently monitor applications and infrastructure for performance, faults, uptime and latency, but the same is not true for data.
In the meantime, data goes through a complex journey across multiple interconnected systems. It originates in dozens or even hundreds of data sources, gets staged in an analytical data store and goes through a myriad of data transformation pipelines before it is ready for consumption. Many large organizations now run several thousand pipelines every day.
An SLA for data would only work if the data is monitored and measured across several dimensions:
- Data quality. Data must be correct, trustworthy, timely, easily discoverable, accessible, usable and fit for purpose. Data veracity is the most obvious and most pressing need of data consumers. These consumers are not just the business analysts or C-suite executives running reports and dashboards but also the data engineers tasked with remediating data quality issues.
- Data behavior. As data is always changing, identifying the patterns, anomalies and unexpected changes is less obvious and harder to attain. Data scientists know their ML models degrade over time but they also know the culprit is often subtle changes in the data makeup which necessitate retraining.
- Data privacy. Data privacy regulations such as the EU GDPR, along with a bevy of regional and industry-specific ones, require that sensitive data be handled according to the applicable policies. Privacy-enhancing techniques are used to de-identify personally identifiable information (PII), but as the data is transformed and joined, the risk of re-identification goes up.
- Data ROI. It is hard to track how well data is aligned with the business strategy and whether it is being used in the most efficient manner. The previous dimensions are tactical in nature, while this one is a combination of strategic and tactical. CDOs are interested in measuring how well the data meets business goals, while the CIO is tasked with scaling the data infrastructure efficiently, which may require a cloud migration.
As can be seen, data engineers, data scientists, CDOs, CIOs/CTOs, data privacy officers (DPOs) and chief information security officers (CISOs) can all benefit from a richer view of their data assets.
How did we get into this mess?
For the past few decades, organizations have been bolting on data silos to an already overburdened and fragmented stack. Every time new users or use cases are introduced, the easiest thing to do is to replicate the data required by the users.
After a while, we lose track of the plot. It is no longer clear what the source of truth is, and we are left with an ecosystem littered with duplicate and inconsistent data. It is no surprise then that data scientists, by an often-cited estimate, spend 80% of their time wrangling or munging the data rather than putting their hard-earned PhDs to use building predictive models.
Data pipelines move data from the point of origin to the point of consumption using underlying compute frameworks, which need to be monitored for performance, scale and efficiency. The data that flows through these pipelines needs to be monitored for loss of quality. The pipelines, which in effect represent the business processes, need to be monitored against their SLAs.
Enter data observability.
What is Data Observability?
Observability needs to be understood from a broader perspective. This term came into prominence as an evolution of the work that the application performance monitoring (APM) vendors had started. The focus of the APM tools was on performance and reliability. Observability added a deeper level of analysis by aggregating the constant stream of performance data from a highly distributed architecture.
Data observability takes the lessons learned from APM and infrastructure observability and applies them to data.
While it is true that the single biggest problem in making data and analytics pervasive is the lack of trusted data, data observability is far bigger than simply data quality monitoring. Data observability shifts the data quality paradigm from a focus on the data alone to a focus on the entire pipeline. With this shift, data quality is as much a responsibility of the data engineer as it is of the business analyst.
Data observability diagnoses the internal state of the entire data value chain by observing its outputs, with the goal of proactively rectifying issues.
When a failure occurs, data observability gives data engineers the ability to trace its path, walk upstream and determine what is really broken at the processing, computation, infrastructure or quality level.
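The idea of walking upstream from a failure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset names, the `UPSTREAM` lineage map, the `FAILING` set and the `root_causes` function are all hypothetical, and a real tool would derive lineage and check status from metadata rather than from hard-coded dictionaries.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it is built from.
UPSTREAM = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "fx_rates": [],
}

# Hypothetical set of datasets whose quality checks are currently failing.
FAILING = {"revenue_dashboard", "daily_revenue", "orders_clean", "orders_raw"}


def root_causes(start, upstream, failing):
    """Walk upstream from `start` through failing datasets only.

    Returns the failing datasets that have no failing parents, which are
    the most likely places where the problem was introduced.
    """
    roots, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        failing_parents = [p for p in upstream.get(node, []) if p in failing]
        if node in failing and not failing_parents:
            roots.append(node)
        for parent in failing_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots


print(root_causes("revenue_dashboard", UPSTREAM, FAILING))  # ['orders_raw']
```

In this toy lineage the broken dashboard is traced back through two intermediate tables to the raw orders feed, while the healthy `fx_rates` branch is never explored.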
How is Data Observability Different?
Application performance management (APM) tools like Datadog and New Relic have given developers transparency into infrastructure issues. Prior to the APM tools, only the administrators were responsible for handling performance issues.
Data observability focuses on developing a multi-dimensional view of data, including its performance, its quality and its impact on the other components of the stack. The overall goal of data observability is to see how well data supports business requirements and objectives.
One major difference between APM and data observability is that applications tend to change far less than data. Applications are therefore easier to automate, whereas data is much more dynamic. Observability is the glue that ties the dynamic nature of data to the business context and generates real-time insights.
How is data observability different from monitoring?
This is a burning question in the market because the differences between the two are not clear-cut. Monitoring usually works off a known set of metrics and a known set of failures. It is indispensable for visualizing system performance and uptime in a dashboard, analyzing trends and trying to predict potential issues.
Observability doesn’t replace monitoring but provides inputs to the latter. Monitoring is the visualization, alerting and exploration of what the observability system is collecting from the data pipeline. A monitoring system needs observability but the reverse is not necessarily true. An observability system without a monitoring component could still feed its results to disk storage, or into an ML model, or other applications that are not monitoring. While monitoring looks at MELT (metrics, events, logs and traces), observability enhances it through metadata and lineage of data.
Take the example of a monitoring system that tracks stock prices. The business context of why the stock prices are gyrating is not in the monitoring system. A sudden change in a stock price could be caused by a social media post. Observability is the glue that connects the performance data to that context.
How is data observability different from data profiling and quality?
Data profiling snapshots the shape of data at a certain point in time. It provides a statistical analysis of the data and of characteristics such as nulls and format inaccuracies. Data observability instead analyzes data continuously, which allows it to treat measurements as a time series, predict expected ranges and forecast values. By analyzing history, a pattern of seasonality can be inferred and used to discover anomalies. Data observability also leverages metadata to monitor context and intent and to develop a semantic understanding of the data.
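As a rough illustration of predicting an expected range from history while respecting seasonality, the sketch below compares a new value only against the same weekday in the past. The row counts, the function names `expected_range` and `is_anomaly`, and the three-standard-deviation threshold are assumptions made for this example, not a description of how any particular product works.

```python
from statistics import mean, stdev


def expected_range(history, k=3.0):
    """Return (low, high) bounds as mean +/- k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomaly(value, history, k=3.0):
    """True if `value` falls outside the range expected from `history`."""
    low, high = expected_range(history, k)
    return value < low or value > high


# Hypothetical daily row counts for one table over four weeks (Mon..Sun).
# Weekends carry much less data, which gives a weekly seasonal pattern.
daily_rows = [
    1000, 1020, 990, 1010, 1005, 300, 310,
    1015, 1000, 995, 1025, 1010, 305, 295,
    990, 1010, 1005, 1000, 1020, 290, 300,
    1005, 995, 1015, 1010, 1000, 310, 305,
]

# Use only past Saturdays as history so the weekly pattern is respected.
saturdays = daily_rows[5::7]  # [300, 305, 290, 310]

print(is_anomaly(1000, saturdays))  # True: a weekday-sized load on a Saturday
print(is_anomaly(298, saturdays))   # False: within the usual Saturday range
```

A single point-in-time profile would see nothing wrong with a load of 1,000 rows; it only stands out once it is compared with what Saturdays have looked like historically.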
Table 1 shows how data observability differs from traditional data quality.
| Parameters | Traditional Data Quality | Data Observability |
| --- | --- | --- |
| Persona | Business Analyst / Data Steward | Data Engineer / Analytics Engineer |
| Phase | Reporting | Ingestion / Transformation |
| Approach | Static rules and metrics | Static and dynamic rules, metrics, logs and metadata |
| Typical Use Case | Compliance (BCBS 239 / IFRS17), reports and dashboard, operational data quality | Data integration, pipeline automation |
| DevOps | Manual process that can be scheduled or is ad hoc | Automated, part of CI/CD and orchestration e.g. Apache Airflow |
| MLOps / AI | Data quality and MLOps are orthogonal and discrete | Data observability is a part of model / MLOps e.g. explainability |
Table 1. Difference between traditional data quality and data observability
Traditional data quality systems tend to be opaque, with built-in instrumentation, while the data observability approach puts more control in the hands of developers and other data leaders. Data quality tools focus on correcting errors in the data, while data observability tools are more concerned with ensuring that workflows succeed and with preventing errors.
This raises the question of what the scope of a data observability tool should be.
What to observe, monitor and measure?
The obvious starting point is to measure data based on historic patterns. The following list shows what should be observed:
- Data. Observation of data should identify issues pertaining to timely delivery, accuracy and deviations from historic patterns such as a sudden unexplained change in the expected volume of data.
- Metrics and validation rules. Checking that data adheres to data quality rules, such as whether the age column is within the accepted range, is an example of a static rule. Increasingly, AI-based techniques are also used to infer rules dynamically and look for anomalies.
- Environmental configurations. Monitor changes to configuration parameters that can cause issues such as a security breach due to an inadvertent change in the security settings of a bucket. Monitoring various logs such as query logs can identify configuration issues.
- Query and model performance. Identifying unexpected changes in query or model performance can provide valuable insights into changes in data. Data scientists often evaluate their models, but data changes all the time, so the model itself may be fine and simply need to be retrained on new data.
No single product can provide the breadth of measurements mentioned above. We see multiple approaches to data observability emerging, which has led to a spurt in products.
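The first two items in the list above can be illustrated with a few simple checks. This is a minimal sketch: the function names, the age bounds of 0 to 120, the 50% volume tolerance and the 24-hour freshness window are all assumptions chosen for the example, and a real deployment would take them from its own rules or infer them from history.

```python
from datetime import datetime, timedelta, timezone


def check_age_range(rows, low=0, high=120):
    """Static rule: return rows whose 'age' is missing or outside [low, high]."""
    return [
        r for r in rows
        if r.get("age") is None or not (low <= r["age"] <= high)
    ]


def check_volume(row_count, expected, tolerance=0.5):
    """True if the row count is within `tolerance` (a fraction) of expected."""
    return abs(row_count - expected) <= tolerance * expected


def check_freshness(last_loaded_at, now, max_age=timedelta(hours=24)):
    """True if the data was loaded recently enough to count as timely."""
    return now - last_loaded_at <= max_age


# Hypothetical batch of records and load metadata.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},
    {"id": 3, "age": None},
    {"id": 4, "age": 130},
]
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_loaded_at = now - timedelta(hours=30)

print([r["id"] for r in check_age_range(rows)])        # [2, 3, 4]
print(check_volume(len(rows), expected=1000))           # False: far too few rows
print(check_freshness(last_loaded_at, now))             # False: 30 hours old
```

Static checks like these are easy to reason about but have to be maintained by hand, which is where dynamically inferred rules come in.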
Data Observability Products
This space is a hotbed of activity with many startups. These products have attracted a lot of attention (and money). Some of them are listed below in alphabetical order:
- Great Expectations
- Monte Carlo
- OwlDQ (acquired by Collibra)
The purpose of this blog is to clear the mist around the rapidly emerging topic of data observability. Future blogs will cover data observability approaches, the features and components of data observability products, common use cases and best practices.