Observability is about data, isn’t it? OpenTelemetry helps us gather tons of it, we store it in nice, scalable data storage, and all is well.

Sorry, that’s not how it works in Observability. Let’s take a look at data and information and see what we actually need.

What is data?

Data can be anything from a time series metric or a span tag to a log event. There are tons more options, but they all have one thing in common: on their own, they are meaningless. What does 95% CPU utilization of a server at 12:01:10 tell me? Or the event “ERROR: mandatory field postalcode empty” in the log? Nothing at first – those are just datapoints without connection.

In Observability we collect tons of data from various sources in various formats and store it in data sinks – but sheer volume is not a sign of quality. The power and challenge of o11y is to extract information from this data.

Martin Thwaites describes this nicely in his LinkedIn post: Data into Activity

But – we need information

What we need for Observability is information that helps us answer questions – for example: “Are our services fast enough to fulfill the SLA/SLO?” To answer that question I need at least a latency metric AND the corresponding service name – in combination with a value that allows me to determine whether they are fast enough or too slow. Those are at minimum three different datapoints that need to be connected in a meaningful way.
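To make that concrete, here is a minimal sketch (in Python, with made-up service names and targets) of how those three datapoints – a latency value, the service name and a target – only become an answer in combination:

```python
# Hypothetical SLO targets per service. In a real setup these would come
# from the SLA/SLO definitions, not from a hard-coded dictionary.
SLO_TARGETS_MS = {"checkout": 300, "search": 1000}

def is_fast_enough(service: str, latency_ms: float) -> str:
    """Combine latency, service name and target into an answer."""
    target = SLO_TARGETS_MS[service]
    verdict = "fast enough" if latency_ms <= target else "too slow"
    return f"{service}: {latency_ms:.0f} ms against a {target} ms target, {verdict}"

print(is_fast_enough("checkout", 450))  # too slow for a 300 ms target
print(is_fast_enough("search", 450))    # fast enough for a 1000 ms target
```

The same 450 ms answers the question differently depending on the service and its target – exactly the context that turns a datapoint into information.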

Though data is not information, we can extract information from it – that needs expertise, purpose and context. A response time of 1 second may be unacceptable for an online store, but perfectly fine for a search request in the online vault of my insurance company.
What I have seen many, many times are dashboards with a lot of individual data sources (e.g. CPU consumption, memory usage, network saturation) shown as gauges, histograms or pie charts side by side. The first impression is, well, impressive, but then you ask yourself: What does this mean? Only an expert knows what this data is all about and adds the context in his or her head, distilling information on the fly. Without any context those are just numbers.

[Image: twelve old, tattered gauges in three rows of four, each with a different dial and value, on a blue background.]
12 gauges and only the expert knows what they mean.

We want answers

Therefore the next challenge after collecting the data is to enable us to use it to answer questions: Is this service slow in our context? And if yes, why? That may be an operator’s perspective. A developer’s perspective could be: Is this service resilient in any way? What happens when the DB it relies on is throttled or simply overloaded? Can we scale it infinitely? A security officer, meanwhile, may want to know who was using the service, as it is audit-related.
This is where we need means to extract that information from the data source fast and to the point. Of course I can go ahead and use the query language of the respective data storage – after I have learned the specifics of that particular language. That is nothing I want to do when I have other issues to solve or am under pressure to get the production system running again. My need is to get answers fast and to the point, not to learn an intermediary skill first. More insight into the difference between data and information here: Data and Information – What are the differences?

Expertise required

But I can use dashboards to display the data and, voilà, it is usable by everybody. Well, almost.

Sometimes an expert has painted a yellow or red line into the graph, gauge or histogram – to mark a special point of interest, a threshold. Red usually indicates danger, like on the pressure gauge of a steam engine – when the needle comes close to the danger zone, you need to take action. Isn’t that easy enough? Well, this threshold or indicator is different for every system, though, and it takes expertise to define those markers.
That is the real challenge of Observability – applying expertise to data to create information that you can use.

Why is that? Because this know-how is usually found only in individual humans. The longtime operator, the seasoned developer, the experienced tester and so forth. And their job is to operate, develop and test, not to create thresholds for dashboards and keep them up to date.
When µService architectures entered the stage, this escalated even further. Now change is the only constant and thresholds have a half-life shorter than a donut in Homer’s reach. Additionally, the complexity and the interdependencies of those services increased – setting the bar even higher than before.

Automation and flexibility needed

So besides an automated way to create dependency maps and high-cardinality metrics that let us monitor response times and error rates in context – for example calls to a customer account service in connection with the shopping-cart service and subsequent DB calls – we need an automated check that tells us when something goes the wrong way. We can’t put everything in the same graph and just look for outliers.
And once we have identified a problem, we – or rather the team in charge of the troublesome component – can dive deeper into it. Automated baselines are helpful but also very sensitive to changes. They need a certain amount of time to establish themselves. If response times change significantly, the baseline needs to be calculated from scratch. Also, normal behaviour is not necessarily good behaviour. Just because a service has always responded in 300 ms, it can still be 200% over target because the SLA is 100 ms.
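As a rough illustration, here is a minimal sketch (Python, with illustrative numbers and hypothetical helper names) of combining an automated baseline with an explicit SLO target, so that “normal but too slow” still gets flagged:

```python
# Minimal sketch of a latency check that combines an automated baseline
# with an explicit SLO target. All names and numbers are illustrative.
from statistics import mean, stdev

SLO_MS = 100  # contractual target, e.g. taken from the SLA

def check_latency(history_ms: list[float], current_ms: float) -> list[str]:
    """Return findings for the current latency sample."""
    findings = []

    # Baseline check: needs time to establish itself and only flags values
    # that clearly leave the usual range.
    if len(history_ms) >= 30:
        baseline = mean(history_ms)
        spread = stdev(history_ms)
        if current_ms > baseline + 3 * spread:
            findings.append(f"anomaly: {current_ms:.0f} ms vs. baseline {baseline:.0f} ms")

    # SLO check: "normal" behaviour is not necessarily good behaviour.
    if current_ms > SLO_MS:
        over = (current_ms - SLO_MS) / SLO_MS * 100
        findings.append(f"SLO breach: {current_ms:.0f} ms is {over:.0f}% over the {SLO_MS} ms target")

    return findings

# A service that has always answered in roughly 300 ms looks perfectly
# "normal" to the baseline, but is still 200% over the 100 ms target.
print(check_latency([295.0, 305.0] * 20, 300.0))
```

The baseline branch only reacts once enough history exists and a value clearly leaves the usual range; the SLO branch catches the “always 300 ms against a 100 ms target” case no matter how normal it looks.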

More resilient patterns

Other patterns like “sudden change of value”, which are independent of the value as such and flag when something changes relatively fast or suddenly, are more helpful as they don’t need baselines or vast amounts of historical data. When my error rate increases from 1% to 11% within seconds and stays that way, something is going wrong (see the sketch below).
Then I can alert, start collecting connected information and present a starting point for further root cause analysis. This is where I need the flexibility to slice and dice the data further. Of course those patterns also have their limitations, like a slow increase in response time which they don’t detect – but they are resilient to continuous change and ideally only need to be set up once.
Drill-downs and drill-ups are needed, as well as guiding information like log events that point us to the root of the problem so that we can solve it. Which system you use to achieve this is totally in your hands – as long as it is mainly automatic, flexible and easy to use for subject matter experts.
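A minimal sketch of such a “sudden change of value” check on the error rate could look like the following; the window size and jump threshold are invented for illustration, and the point is that it only needs a handful of recent samples, not a long-term baseline:

```python
# Minimal sketch of a "sudden change of value" check on the error rate.
# It needs no long-term baseline, only a few recent samples.
from collections import deque

JUMP = 0.05   # flag if the rate rises by more than 5 percentage points
WINDOW = 5    # compare against the value from a few samples ago

recent = deque(maxlen=WINDOW)

def sudden_change(error_rate: float) -> bool:
    """Return True if the error rate jumped sharply within the window."""
    jumped = bool(recent) and (error_rate - recent[0]) > JUMP
    recent.append(error_rate)
    return jumped

# Roughly 1% errors for a while, then a jump to 11% that stays.
for i, rate in enumerate([0.01, 0.01, 0.012, 0.011, 0.01, 0.11, 0.11]):
    if sudden_change(rate):
        print(f"sample {i}: error rate jumped to {rate:.0%}, start collecting context")
```

When the jump fires, that is the moment to alert and start collecting the connected information mentioned above as the starting point for root cause analysis.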

Summary

OpenTelemetry is a great starting point for Observability as it standardizes gathering data, but it is nothing more than that. We need other systems to dig through, sift, filter and distill the data to generate information that answers the stakeholders’ questions. To turn data into information we need flexible systems that allow us to work with the data without having to learn arbitrary query languages or similar first.
