Logging is one of the staples of observability: it is easy to implement, yields a lot of data, saves the day when troubleshooting problems, and has been around since the dawn of time.
When I started my life in IT in the 90s of the last century – kinda creepy to write that – logs were all we had for troubleshooting. And there was always a bug or some unexpected behaviour. Since then, performance analysis, performance monitoring and root cause analysis have evolved quite a bit. And with OpenTelemetry – OTel for short – we finally have a good chance to create a solid foundation that helps clear the jungle of proprietary solutions out there.
One of the pillars of observability is logging, no doubt about that – and it can be a lot: every application log, container logs, Kubernetes logs and what not. I see a big chance here to reduce the amount of data being stored by checking, at the earliest possible point, whether it is worthwhile to store that data at all. Sure, most log management solutions offer a filter that will do this for us, but they charge for the amount of logs ingested before that filter is applied. We can do it earlier and save quite some money.
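As a toy illustration of that early filtering, here is a minimal Python sketch. The record layout and level names are made up; it just shows the principle of deciding whether a record is worth paying for before it leaves the node.

```python
# Minimal sketch: drop records we would never query anyway, before they
# reach the (paid) ingestion endpoint. Record layout and level names are
# illustrative, not taken from any specific log management product.
DROP_LEVELS = {"DEBUG", "TRACE"}

def worth_storing(record: dict) -> bool:
    """Return True if this record should be forwarded to the backend."""
    return record.get("level") not in DROP_LEVELS

def pre_filter(batch: list) -> list:
    # We pay for what is ingested, not for what we discard here.
    return [r for r in batch if worth_storing(r)]

batch = [
    {"level": "DEBUG", "msg": "cache miss for key 42"},
    {"level": "ERROR", "msg": "payment failed with 401-02"},
]
print(pre_filter(batch))  # only the ERROR record survives
```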
In production, an OTel collector/processor approach can extract the information we need upfront before sending it to the backend – or avoid unnecessary logs completely. Let's take a look at what data we are dealing with.
Types of data in logs
When I started my career there were only logs available – unless you worked with a mainframe. All kinds of data were in there:
- Timestamps and time series data (aka metrics) – though not consolidated, just scattered line by line
- State information (e.g. health info)
- Error details (e.g. exceptions, etc.)
- Debug info
- Context info (e.g. IDs, business service name)
- Other stuff that the developers deemed interesting or needed
Extracting information
It was your job then to unclutter the whole thing and make sense of it – aka retrieving information from data. On UNIX/Linux systems 'awk', 'sed' and 'grep' were your friends, and Perl was great for writing parsing scripts – but your eyes and experience were the primary tools for finding patterns or interesting events. I once created a metric to troubleshoot a system by counting active transactions in a log – it raised an alert when the number kept increasing, as this indicated a jam in the cluster sync mechanism. As there was no real documentation, I needed to read countless lines of logs to realize that there were entries like "session_ID xyz123 created" and "session_ID xyz123 removed". The script read them, added and removed session IDs and counted the ones still active – something like the sketch below. It worked, but it took quite some time to set it up. And all we needed was a metric we could base an alert on. The log entries were just the raw data store, and we had to extract the information we needed from them.
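A hypothetical reconstruction of that script in Python – the two entry types are the ones quoted above, everything else is made up:

```python
import re

# Pair up "session_ID <id> created" / "session_ID <id> removed" entries
# and derive the one metric we actually cared about: active sessions.
PATTERN = re.compile(r"session_ID (\S+) (created|removed)")

def active_sessions(lines):
    active = set()
    for line in lines:
        match = PATTERN.search(line)
        if not match:
            continue
        session_id, event = match.groups()
        if event == "created":
            active.add(session_id)
        else:
            active.discard(session_id)
    return len(active)

log = [
    "10:00:01 session_ID xyz123 created",
    "10:00:02 session_ID abc456 created",
    "10:00:05 session_ID xyz123 removed",
]
print(active_sessions(log))  # 1 -- alert when this number keeps climbing
```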
Remove the unnecessary
Getting the info is easy, right? It is in there, just remove what is not information and voilà. Well, that's what Michelangelo said when he created his famous statue of David: "David is already in this block of marble – all I have to do is remove everything that doesn't belong to David". It took him three years. Logs are string based and hence use a lot of space when not compressed or normalized. But once you compress or serialize them, you need to decompress or deserialize them again before you can use them. Ideally you do not log data that you will never need again, or you process it early in the flow to retrieve the information and dump the ballast.
Don’t log transient information
If you want metrics to work with, e.g. for alerting, do not log them first only to extract them again later (serializing and deserializing). You can send the metrics directly via a much leaner mechanism, using Prometheus or OTel. If you have to, you can use an OTel collector with a processor to extract the info and turn it into a metric before sending it to storage. That is also an excellent intermediate step on the way to metric-less logs: that way you can find out what you really need before investing time and effort into remodeling the metric generation.
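A minimal sketch with the official Prometheus Python client (the metric name and port are illustrative): the session count from the earlier example becomes a gauge that is updated directly, with no log line and no later parsing.

```python
import time
from prometheus_client import Gauge, start_http_server

# The gauge replaces the "session_ID ... created/removed" log entries.
ACTIVE_SESSIONS = Gauge(
    "app_active_sessions",
    "Sessions currently active in the cluster",
)

def on_session_created(session_id: str) -> None:
    ACTIVE_SESSIONS.inc()   # no log line, no serialize/deserialize cycle

def on_session_removed(session_id: str) -> None:
    ACTIVE_SESSIONS.dec()

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes /metrics on port 8000
    on_session_created("xyz123")
    time.sleep(60)            # keep the endpoint up long enough to scrape
```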
State information belongs in traces
In a µService world many services are involved in a single request. If each of them logs its state information (started, completed, OK, error, etc.) locally, I only know whether that particular service is in good shape, but I have no context or information about the other connected services. Ideally this information lives in a span as part of a trace. That way I save the energy that would otherwise go into correlating this type of information later in the log manager. Additionally, you know which other services are influenced by a misbehaving one. "Wide events" – I can hear you thinking while reading this – we will come to that later.
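A sketch using the OpenTelemetry Python API (it assumes a configured tracer provider; the span, event and attribute names are illustrative): the state transitions become span events instead of local log lines, so they travel with the trace.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # illustrative service name

def process_order(order_id: str) -> None:
    # State lives in the span, correlated with every other service
    # that takes part in this request - no log correlation needed.
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order_id)
        span.add_event("started")
        # ... the actual work happens here ...
        span.add_event("completed")
        span.set_attribute("service.state", "OK")
```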
Not all errors are created equal
But first: errors. Not every error log event is worth being stored completely. There are those that you only want to count (-> metric), and then there are the ones giving you valuable info about the reason for a request to fail. The latter belong in the trace of this request. Why? Because they are useless without the context of the transaction. For example, "Service XYZ returned error 401-02 – wrong format" is only helpful when you know which service call it occurred in. And that is typically something you collect in traces – which we can also create with OTel – because in log files you do not have that context. Usually there is some intermediary service in between, and you would need an architect to understand where this call sits in the system. Finding the corresponding log entry of the other service, for that request, at the same time (timekeeping in distributed systems is another topic), to determine the content, is tricky at best – if it is logged at all. I'd opt for error/exception attributes in spans, which give you that information and correlation directly, without any rule-based processing later.
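With the OpenTelemetry Python API this looks roughly like the following (the error code attribute is illustrative, not an official semantic convention):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("order-service")  # illustrative name

def call_service_xyz() -> None:
    with tracer.start_as_current_span("call-service-xyz") as span:
        try:
            # ... the actual downstream call would happen here ...
            raise RuntimeError("Service XYZ returned error 401-02 - wrong format")
        except RuntimeError as exc:
            span.record_exception(exc)             # full exception as a span event
            span.set_status(Status(StatusCode.ERROR))
            span.set_attribute("error.code", "401-02")  # illustrative attribute
            raise
```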
What does the ideal system look like?
For me the optimal system creates infrastructure metrics directly, e.g. using Prometheus, as well as distributed traces where a lot of the information that you typically log is stored in spans. Those spans are processed in a collector/aggregator to create the necessary metrics and to decide what to send on to storage. Keep logging to a minimum, as it costs I/O and additional resources to extract information that is more or less already present in traces and spans. And traces take care of the timing offset problem for you: logs carry the timestamps of the local machine, and sometimes the machines are not time-synced. That means there can be an offset of milliseconds or even seconds, which makes it hard to correlate services with logs alone.
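A sketch of that aggregation step (in a real deployment something like the Collector's spanmetrics connector does this; the classes here are made up for illustration): finished spans go in, call and error counts come out, and only what is needed travels on.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class FinishedSpan:          # simplified stand-in for a real span
    name: str
    duration_ms: float
    is_error: bool

def aggregate(spans):
    """Derive metrics from spans before deciding what to forward."""
    calls = Counter(s.name for s in spans)
    errors = Counter(s.name for s in spans if s.is_error)
    return {"calls": calls, "errors": errors}

spans = [
    FinishedSpan("GET /orders", 12.4, False),
    FinishedSpan("GET /orders", 250.1, True),
]
print(aggregate(spans))  # 2 calls, 1 error for "GET /orders"
```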
Wide events
In recent months the topic of wide log events has come onto the radar and is being talked about a lot. Some argue that you don't need anything else, because with a structured log event you can extract all the other information, like metrics, later. I'd argue that this is already done by spans and traces – with the benefit that I can instrument code that I have not written myself and don't have to write log statements in my own code. Additionally, the processing can happen early in the pipeline and save data from being unnecessarily sent over the network. Plus, the log events are also part of the traces/spans, which means I get wide events with trace correlation without the effort of creating and propagating trace IDs myself.
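That instrumentation of foreign code can be a one-liner, e.g. with the opentelemetry-instrumentation-requests package (the URL is illustrative):

```python
import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# One line, and a library I did not write emits spans for every call -
# method, URL, status code and trace context propagation included.
RequestsInstrumentor().instrument()

response = requests.get("https://example.com/api/orders")  # illustrative URL
```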
The article https://isburmistrov.substack.com/p/all-you-need-is-wide-events-not-metrics advocates for Meta's Scuba system – which looks a lot like Unbounded Analytics from Instana – and is based mainly on wide events. Their advantage is that they write all the code themselves and hence can log whatever they want. If you use the frameworks and libraries out there, this is not always the case.
From my point of view, traces and wide events have a lot in common and help tremendously, given that you have a backend system that allows you to search, slice and dice the data as you need it. But that is worth an article of its own.
Summary
Logging is important, and it is rightfully one of the pillars of observability. But with the development of the other signals and the vast number of potential log sources, rethinking the logging strategy and adjusting the tools reaps benefits: less time and money spent, with better results.
If you want to share your approach to logging, and maybe some tips and tricks, join our community at https://observability.mn.co.