What is Observability?

besides a mouthful of syllables? I’ll use O11y (pronounced “Olly”) from now on to make it a little easier. Here is my take on a short (!) intro to what it is.
O11y is a composite strategy for inspecting a service’s performance, availability, and quality, and how it affects other system components, based on data retrieved from inside the system. It is a property, an attribute. There are many ways O11y can be used and defined. The only clear thing is what it is NOT: a tool.

The key word for me here is strategy – depending on your task or your job you need different aspects that are part of O11y.

Many say O11y is just another word for monitoring, but that is not true. So how is O11y different from simply monitoring systems and applications? Let’s look at it.

Monitoring vs. O11y

In short: observability includes monitoring; it is a superset, so to speak. The boundary is somewhat blurry, but when you look at the history of application architecture the evolution becomes clear.

Monitoring in the traditional sense meant checking known metrics against thresholds; a violated threshold indicated a problem. It is all based on experience and doesn’t change a lot. With mainframes and client-server architectures, monitoring system metrics such as CPU and memory was enough, as most applications were monoliths running on a single server. As application architecture evolved from the client-server model to 3-tier applications, service-oriented architectures and finally microservices, a shift from monitoring systems (CPU and memory) to monitoring services (latency and error rate) was needed.
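To make the traditional approach concrete, here is a minimal sketch of threshold-based monitoring in Python. The metric names, limits and sample values are my own assumptions for illustration, not taken from any particular tool.

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    metric: str   # name of a metric we already know we care about
    limit: float  # value above which we consider the metric unhealthy

def check_thresholds(sample, thresholds):
    """Return one alert message per known metric that exceeds its limit."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is not None and value > t.limit:
            alerts.append(f"{t.metric}={value} exceeds {t.limit}")
    return alerts

# Hypothetical thresholds and one sample, chosen only for illustration.
THRESHOLDS = [Threshold("cpu_percent", 90.0), Threshold("memory_percent", 85.0)]
sample = {"cpu_percent": 97.5, "memory_percent": 40.0}

alerts = check_thresholds(sample, THRESHOLDS)
print(alerts)
```

Note that the check can only ever fire for metrics somebody thought of in advance, which is exactly the limitation described above.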

This is where a new term was coined: O11y. The new name allows us to distinguish the practices and methods in different spaces. Monitoring is still needed, but in a different context: e.g., to keep a service accessible, spin up a new cloud instance of it if the CPU usage rises instead of raising an alert. The main difference now is that we often don’t know in advance which metric or signal we need to determine the state of a system or the point in time when action is needed.

Another way to put it: a system needs to be observable in order to be monitorable. Monitoring usually only cares about metrics and thresholds to trigger alerts when something goes wrong. But then what? To solve a problem you need more data than simple metrics, and that is what O11y should provide.

When do I need O11y?

The primary reason is performance problems. I have seen many times in the past that monitoring or observability was a low priority until something happened. Then it was “all hands on deck” to make the system work again. This was usually the moment when my customers realized that they had insufficient data/information to resolve the issue fast. There are other uses for the data as well, though. This report by logz.io provides a rough overview of how O11y is used: 2024 Observability Pulse Report

Since the dawn of DevOps and SREs (Site Reliability Engineers), the priority has finally been raised, and O11y is now part of the planning of a system: making sure the right data is collected to guard a system proactively and to make sure you can rely on it.

But besides performance issues, O11y provides insights that are useful for security purposes (access audits, attack prevention, etc.), for cloud operations (e.g. resource optimization), for business units (e.g. revenue over time, user journey, etc.) and for many other purposes. And it always uses the same data; only the perspective on it is different. That is what a good O11y solution provides: a flexible way to present the information each team needs from the same data.

For example: the CPU consumption of a container can be a valuable metric in diagnosing slow-performing services and needs to be correlated with the response time of a service, while the same metric in the context of cloud operations helps you decide whether the instance a service is running on is over- or underprovisioned.
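The two perspectives on the same CPU metric can be sketched in a few lines of Python. The sample values and the 30%/80% sizing limits are assumptions I made up for illustration; real limits depend on your workload.

```python
from statistics import mean

# Hypothetical per-minute samples for one container: CPU usage in percent
# and the service's response time in milliseconds for the same minute.
cpu_percent = [20.0, 35.0, 50.0, 80.0, 95.0]
response_ms = [110.0, 120.0, 150.0, 300.0, 520.0]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def provisioning(cpu, low=30.0, high=80.0):
    """Classify an instance by its average CPU usage (limits are assumptions)."""
    avg = mean(cpu)
    if avg < low:
        return "overprovisioned"
    if avg > high:
        return "underprovisioned"
    return "ok"

# Troubleshooting view: do slow responses go along with high CPU usage?
cpu_latency_correlation = pearson(cpu_percent, response_ms)
# Cloud operations view: is the instance sized sensibly?
sizing = provisioning(cpu_percent)

print(round(cpu_latency_correlation, 2), sizing)
```

Both results are derived from the same `cpu_percent` series; only the question asked of the data differs.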

What is O11y made of?

To observe a system you mainly need 3 types of data:

Metrics
Traces
Logs

and context, but that is not data as such. We will dig into each of those in depth later on, no worries.

Metrics are, simply put, numeric values over time, like CPU consumption or memory usage of a server. In client-server times those infrastructure metrics were usually enough to tell if a system was performing well or not. In distributed, µService-based applications those metrics are not enough, and we typically use response time or latency as well as error rate as performance indicators.
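As a rough sketch of how such service-level indicators can be derived, the following Python snippet computes an error rate and latency percentiles from a handful of request records. The records are invented, and counting only 5xx status codes as errors is an assumption; your definition of an error may differ.

```python
import math

# Hypothetical request records: (latency in milliseconds, HTTP status code).
request_log = [(120, 200), (95, 200), (310, 500), (101, 200), (87, 200),
               (450, 503), (110, 200), (99, 200), (105, 404), (130, 200)]

def error_rate(records):
    """Share of requests that ended in a server error (status 5xx)."""
    errors = sum(1 for _, status in records if status >= 500)
    return errors / len(records)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

latencies = [latency for latency, _ in request_log]

print(error_rate(request_log))     # share of failed requests
print(percentile(latencies, 50))   # median latency
print(percentile(latencies, 95))   # tail latency
```

Percentiles are used here instead of an average because a few very slow requests can hide behind a healthy-looking mean.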

Traces are representations or recordings of requests/transactions as they flow through the various services of a system. In a trace you can see how long each step took, what other steps/services were called, and how long those took. Traces are invaluable for root cause analysis because they allow you to see exactly where a request got stuck, and probably why as well. They are also the basis for any kind of dependency map that shows which service is calling which other service.
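To illustrate the idea, here is a simplified sketch of a trace as a list of spans, from which both a dependency map and the slowest step can be derived. The `Span` structure, the service names and the durations are hypothetical, and the self-time calculation assumes that child calls run one after another rather than in parallel.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    duration_ms: float        # total time, including calls to other services

# Hypothetical trace of a single checkout request.
trace = [
    Span("a", None, "frontend", 480.0),
    Span("b", "a", "cart", 60.0),
    Span("c", "a", "payment", 390.0),
    Span("d", "c", "fraud-check", 350.0),
]

def dependency_map(spans):
    """Map each calling service to the set of services it called."""
    by_id = {s.span_id: s for s in spans}
    calls = defaultdict(set)
    for s in spans:
        parent = by_id.get(s.parent_id)
        if parent is not None and parent.service != s.service:
            calls[parent.service].add(s.service)
    return dict(calls)

def self_times(spans):
    """Time spent in each span itself, excluding its direct children."""
    child_total = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.duration_ms
    return {s.span_id: s.duration_ms - child_total[s.span_id] for s in spans}

times = self_times(trace)
slowest_id = max(times, key=times.get)
slowest_service = next(s.service for s in trace if s.span_id == slowest_id)

print(dependency_map(trace))
print(slowest_service)
```

In this made-up trace most of the time is spent in the innermost call, which is the kind of finding that a metric for the frontend alone would not reveal.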

Logs are mostly records of events that are written with a timestamp to a file. They are usually generated when errors or warnings occur, record information about states, and can be used for debugging as well. They are the most basic form of data/information used for troubleshooting and are really useful, though they usually come without context, so analysis needs an experienced troubleshooter who knows where to look for the data. You also need to know upfront what should be logged; otherwise it is simply not available when you need it.
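One possible way to reduce the missing-context problem is to emit structured log lines that carry context fields alongside the message. The sketch below uses Python's standard `logging` module; the field names `service` and `trace_id`, the logger name and the values are assumptions for illustration, and an in-memory stream stands in for the log file.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Context fields are passed via `extra=`; the names are assumptions.
            "service": getattr(record, "service", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)

stream = io.StringIO()  # stands in for a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("o11y_example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output out of the root logger

logger.warning("payment declined", extra={"service": "payment", "trace_id": "abc123"})

entry = json.loads(stream.getvalue())
print(entry["level"], entry["service"], entry["trace_id"])
```

With a shared identifier such as a trace ID in every line, a log entry can be connected to the request it belongs to instead of relying on someone knowing where to look.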

All this data is gathered, stored and analyzed in a variety of backend systems. Those differ heavily in features and functions depending on the team using them and their needs. SREs, developers, Ops, QA, business and many others rely heavily on O11y data or can at least do their job much more easily with measured data. An eCommerce manager, though, is not interested in the execution time of a SQL statement in the warehouse DB but in the overall response time and the turnaround rate for their diamond customers. Both can be provided by a good O11y system from the same data set.

Summary

Observability, abbreviated O11y, provides insight into the state of IT systems in real time for different stakeholders in the organisation and is needed to operate distributed systems reliably in production. The data is retrieved in the form of metrics, traces and logs, which are then visualized and analysed with various backend systems used by those stakeholders.
