Finding out whether your application or your system is performing well is not rocket science. It is actually really easy, because there are only two conditions: slow or failed.
It is too slow, or it doesn’t work at all and throws an error in the face of the user. Or – super bingo – it is both at the same time. As annoying as this is for the user, it is relatively simple to be alerted when it happens.

Image: a race track drawn in white chalk on a blackboard, with two snails just short of the finish line – depicting very slow performance.

Finding out whether something is too slow is done by comparing the response time or latency to your expected time or an SLA – in other words, it is simply metrics work. And for errors it is even easier: if an erroneous event pops up, an alert is thrown. A light turns red on the monitoring dashboard. Then it is all hands on deck to resolve the issue. And this is where the trouble starts.
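
To make the “metrics work” part concrete, here is a minimal sketch of such a check – the SLA value, the crude percentile calculation and the sample numbers are purely illustrative:

```python
SLA_MS = 500  # hypothetical SLA: the product page must respond within 500 ms

def check_latency(samples_ms: list[float]) -> None:
    # crude p95 estimate: good enough for the illustration, not a proper estimator
    p95 = sorted(samples_ms)[int(len(samples_ms) * 0.95) - 1]
    if p95 > SLA_MS:
        print(f"ALERT: p95 latency {p95:.0f} ms exceeds the SLA of {SLA_MS} ms")

check_latency([120, 180, 240, 950, 1100, 130, 160, 210, 890, 140])
```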

Diagnosing the root cause is hard

As easy as it is to be alerted, resolving the issue is just as hard. You need to find the root cause of the issue to be able to fix it. The first step is to determine: what is slowing down our site, what triggers the error condition? And if you don’t have some sort of Observability in place, the blame game starts. I have been there many times, helping with the respective APM solutions to get from a heated discussion and finger-pointing to a fact-based and effective troubleshooting process.

The situation is understandably tense. The business is suffering because of an IT issue, and management is breathing down the necks of the team leaders, who in turn push the team members to resolve the issue fast. But if you don’t have insight into your systems, a fact-based approach is hard to accomplish. And it is not just a question of the right tooling. The right tool can help to get the discussion onto a rational level, but you definitely need expertise, passion and the right information.

First step – Triage – find the right team to investigate further

In distributed systems there are a lot of dependencies and a lot of different technologies in play – and even in monolithic apps there are different players involved, e.g. database admins, each with the special expertise you need to solve a problem fast. So unless you have a firefighting brigade that has nothing else to do than troubleshoot, you need to identify the area where the root cause most likely sits and assign further diagnostics to the respective team. This is called triage, and done successfully it speeds up the mending process dramatically – it also takes out the blaming. Good triage is enabled by Observability, and especially by distributed tracing. Without any information about dependencies, the whole thing is just a game of chance.

Distributed services also distribute problems

The more services we have, the more dependencies there are.

Captain Obvious

What does this mean though? If a connection to a DB is not available and there is a chain of services between the user interaction and the DB, I need full coverage of the individual connections to be able to quickly determine whether the DB connection is the problem or not. If there isn’t at least an architectural diagram, I need to piece the information together based on hearsay. Ideally I have distributed tracing in place, which gives me accurate information about the connections and the timings between the services. And, to be honest, architectural diagrams usually depict the state as it should be, not necessarily the state as it is.
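
As a rough idea of how that tracing information comes into existence, here is a minimal sketch using the OpenTelemetry Python API – the service and span names are made up, and the SDK/exporter setup (e.g. sending spans to Jaeger) is omitted:

```python
from opentelemetry import trace

tracer = trace.get_tracer("shop.frontend")  # hypothetical service name

def render_product_page(product_id: str) -> None:
    # Parent span: the user-facing request
    with tracer.start_as_current_span("GET /product") as span:
        span.set_attribute("product.id", product_id)
        load_product(product_id)

def load_product(product_id: str) -> None:
    # Child span: the downstream DB call – the trace records this dependency,
    # so a broken DB connection shows up exactly where it happens.
    with tracer.start_as_current_span("db.query products") as span:
        span.set_attribute("db.system", "postgresql")
        ...  # the actual query would go here
```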

So without a map based on real measurements there is still a lot of guesswork. Below is the map generated by Jaeger, based on traces from the OpenTelemetry Demo. A good start, but static – a snapshot over a period of time. Nevertheless, it helps me to confirm whether the problem with the frontend can stem from a problem with a service further down the chain. In turn we can then hand the problem solving to the team responsible for the component that looks like the root cause.

Figure: dependency map generated by Jaeger for the OTel demo system – blue dots represent the individual services, connected by arrows indicating direct connections and dependencies.

Diagnose the problem – 50 shades of slow

The team is now in charge of finding the root cause. This already starts with the problem description. The alert warns us that the speed of our product page – let’s say we run an online shop – has degraded way beyond the threshold. So it is slow. But there are different types of slow, and knowing which kind helps us determine where to look further. Some examples:

  • Suddenly very slow for all requests
  • Slow over time following a linear pattern
  • Randomly slow for certain periods of time
  • Predictably slow every x minutes
  • Constantly slow at regular intervals for a specific service but not for the rest
  • and many more flavours

Do we have this information, or do we just have an alert? This matters for the diagnosis, because the type of slowness already tells us something about the most likely root cause, and we can go and check those things first. E.g. if all of a sudden the system grinds to a halt, we are most likely facing an outage of a central component, while a slow linear increase points towards something like a memory leak or a resource leak. But to confirm any kind of suspicion we need data. Telemetry data, preferably.
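
Just to illustrate the idea, here is a sketch of how the “flavour of slow” could be guessed from a window of latency samples – the thresholds and rules are invented for illustration, not a recommendation:

```python
import statistics

def classify_slowness(latencies_ms: list[float]) -> str:
    median = statistics.median(latencies_ms)
    # crude trend: average change per sample across the window
    slope = (latencies_ms[-1] - latencies_ms[0]) / (len(latencies_ms) - 1)
    if latencies_ms[-1] > 5 * median:
        return "sudden spike - suspect an outage of a central component"
    if slope > 0 and latencies_ms[-1] > 2 * latencies_ms[0]:
        return "steady increase - suspect a memory or resource leak"
    return "no obvious pattern - dig into traces and logs"

print(classify_slowness([100, 110, 105, 115, 120, 900]))   # sudden spike
print(classify_slowness([100, 130, 160, 200, 230, 260]))   # linear growth
```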

What data do we need?

Data is the foundation for everything, and depending on the job to be done we extract the necessary information from it. But how do we know which data we need? Well: you don’t.

At least not in the beginning. And it very much depends on your application and its architecture. There are a few staples though – data that you will always need, e.g. response time per service, error rate, connection info through distributed tracing, status information like HTTP status codes, and a few others. But your optimal data set will reveal itself over time. The catch is that you usually realize which attribute you’d need to solve an issue when you don’t have it. I remember thinking many times during production post mortem analysis: “If only I knew which branch was taken or what value was set here.”
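
A small, hypothetical sketch of what that looks like in practice: recording the branch that was taken and the values that were set as span attributes, again using the OpenTelemetry Python API (the attribute names and the discount example are made up):

```python
from opentelemetry import trace

tracer = trace.get_tracer("shop.checkout")  # hypothetical service name

def charge(order_total: float, customer_tier: str) -> None:
    with tracer.start_as_current_span("charge") as span:
        span.set_attribute("order.total", order_total)
        span.set_attribute("customer.tier", customer_tier)
        if customer_tier == "gold":
            # record which branch was taken, so the post mortem doesn't have to guess
            span.set_attribute("discount.branch", "gold_discount")
            order_total *= 0.9
        else:
            span.set_attribute("discount.branch", "no_discount")
        span.set_attribute("order.total.charged", order_total)
```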

That means it is a learning process aided by educated guessing. Past experience will also help you make sure you get the logging info that tells you what state a request or a transaction was in when it went south.

There is a tendency to collect everything there is, just to be sure not to miss anything. That is OK in the beginning, and best practice is to look over your collected data from time to time and weed out the excess data that just eats up storage space.
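
One illustrative way to do that weeding, assuming your telemetry attributes end up as simple key/value pairs – the allow-list is of course hypothetical and should reflect what has actually been useful in past investigations:

```python
# Attributes that have proven their worth; everything else gets dropped.
KEEP = {"http.status_code", "http.route", "db.system", "service.name"}

def prune_attributes(attributes: dict) -> dict:
    return {key: value for key, value in attributes.items() if key in KEEP}

print(prune_attributes({
    "http.status_code": 500,
    "http.route": "/product",
    "debug.raw_headers": "…",  # excess data that just eats up storage space
}))
```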

Reproduce the condition and test the fix

Ideally you have just the right set of data and information that lets you recreate the troublesome condition. That way you can test a possible fix and check whether the same issue could occur again or whether you are safe now. And the rest of the team or the organisation learns from it. I have also seen companies store erroneous configurations and redeploy them, e.g. to test new employees or to perform resilience tests.

With the help of metrics, log events and tracing, especially with the right attributes, you can make sure that each issue is a singular event and doesn’t happen again. The downside is that the next issue that raises an alert is going to be more challenging.

But now it is time to

Fix it and go on with life

Once tested, the fix is deployed and the systems are back to normal. As is your life. With the right level of Observability, the time from alerting to repairing a system – the so-called Mean Time To Repair (MTTR) – can be just minutes where it was days or sometimes weeks before.

The fabled “Fix-It” button on top of your monitor screen unfortunately doesn’t work yet, though a lot of hope is being put into “AI” as we speak. Until it is implemented we still need to troubleshoot with our own brains. Problems happen in big systems; there is nothing wrong with that. Blaming and finger-pointing, though, are wrong – and not helpful. Nobody really likes troubleshooting, at least not on a daily basis, and there is probably nothing worse than debugging code you haven’t written yourself. A fact-based troubleshooting approach helps everyone in the organisation get back to their regular job fast – with the benefit of being able to tell a hero story now. It just needs the right data and the mindset to look at it.

Easy – isn’t it 😉

