This is the story of someone – me, to be precise – who decided to learn more about OpenTelemetry after working 20 years for four commercial APM vendors, so I am very curious about which standards and technical solutions have evolved over that time.
OTel Log – Day 1 – 8.3.2024 – Deciding how to start
After reading some blog posts and some documentation I decided to dive in and set up the official OTel demo from this site: https://opentelemetry.io/docs/demo/
My first challenge was updating Docker Desktop on my Mac … Turns out it is not really straightforward to upgrade from an old version; I needed to reinstall the thing.
Once it is running and the CLI works as well, setting the demo up is really easy. Clone the GitHub repo, then copy, paste and run the Docker commands.
$ docker compose up --force-recreate --remove-orphans --detach
[+] Running 134/23
✔ quoteservice 14 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 34.4s
⠹ redis-cart 6 layers [⠀⠀⠀⠀⠀⠀] 0B/0B Pulling 241.7s
⠹ jaeger 5 layers [⠀⠀⠀⠀⠀] 0B/0B Pulling 241.7s
✔ cartservice 5 layers [⣿⣿⣿⣿⣿] 0B/0B Pulled 223.5s
⠹ currencyservice 2 layers [⠀⠀] 0B/0B Pulling 241.7s
✔ flagd 1 layers [⣿] 0B/0B Pulled 214.7s
✔ emailservice 9 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 179.1s
⠹ paymentservice 7 layers [⣿⣿⣿⣿⣿⣿⠀] 0B/0B Pulling 241.7s
⠹ grafana 10 layers [⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀] 0B/0B Pulling 241.7s
✔ frontendproxy 10 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 134.3s
✔ accountingservice 2 layers [⣿⣿] 0B/0B Pulled 76.3s
✔ recommendationservice 5 layers [⣿⣿⣿⣿⣿] 0B/0B Pulled 220.0s
⠹ prometheus 12 layers [⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀] 0B/0B Pulling 241.7s
✔ checkoutservice 1 layers [⣿] 0B/0B Pulled 158.9s
✔ frauddetectionservice 34 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 211.5s
⠹ kafka 13 layers [⣿⠀⣿⣿⣷⠀⠀⠀⠀⠀⠀⠀⠀] 33.91MB/36.68MB Pulling 241.7s
✔ adservice 8 layers [⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 163.7s
✔ loadgenerator 9 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 192.5s
⠹ opensearch 6 layers [⠀⠀⠀⠀⠀⠀] 0B/0B Pulling 241.7s
⠹ shippingservice 3 layers [⠀⠀⠀] 0B/0B Pulling 241.7s
⠹ otelcol 3 layers [⠀⠀⠀] 0B/0B Pulling 241.7s
✔ frontend 15 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 96.2s
⠹ productcatalogservice 3 layers [⠀⠀⠀] 0B/0B Pulling 241.7s
After 386 seconds the system was up and running 😀
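A quick sanity check that everything is actually up, plus the entry points I use from here on – the paths are the ones from the demo documentation as far as I remember them, everything is served through the frontendproxy on port 8080:

$ docker compose ps                      # all services should show up as running
$ open http://localhost:8080/            # the web store itself
$ open http://localhost:8080/jaeger/ui   # Jaeger UI
$ open http://localhost:8080/grafana     # Grafana
$ open http://localhost:8080/loadgen/    # load generator UI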
And because the load generator was also started, there were traces in Jaeger immediately – nice. As this happened on a late Friday afternoon, I decided that the setup works and that I would continue on Monday.
OTel Log – Day 2 – 11.3.2024 – A view on Jaeger
After the weekend I checked out the demo a little more thoroughly and tried to squeeze Jaeger a bit. Besides it being really fast – probably because it is running locally with light load – I missed some convenience features that I am used to from other tools like Instana. Little things like a button that would show only the traces with an error in the list; I need to actually search for a tag to do that. One can see the erroneous traces in the graph though – which works in the demo, but I doubt it works as nicely in a production environment with thousands of traces.

It is a great free tool for troubleshooting issues, though I am not sure the basic setup is really much fun in production. The search capabilities lack regex support for tags, and in general, finding an erroneous trace in a service required some idea of the overall architecture before I could locate it.
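One possible workaround for the missing comfort functions: Jaeger’s query service exposes the same search the UI uses as an HTTP API, so a bit of curl and jq could stand in for the missing buttons. This is only a sketch – it is the internal API the UI calls (it may change between versions), and in the demo it should sit behind the frontendproxy under the Jaeger base path:

$ curl -s 'http://localhost:8080/jaeger/ui/api/traces?service=frontend&tags=%7B%22error%22%3A%22true%22%7D&limit=20' \
    | jq -r '.data[].traceID'
# lists trace IDs of erroneous frontend traces; if the API complains about missing times,
# add explicit start/end parameters (microseconds since epoch)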
I’ll upload a video with my struggles and questions shortly.
OTel Log – Day 3 – 12.3.2024 – All that Jazz
Digging into the demo setup reveals a couple of interesting things. For instance, logging doesn’t really happen in the demo, but metrics and traces do. I already dipped my toe into the Jaeger pond and was overwhelmed by some of the complexity and the query options, which need more work on my side.
The metrics can be queried via the Prometheus/PromQL UI, which also takes some getting used to. A query builder like PromLens is desperately needed.
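For reference, this is roughly what I ended up running. The demo’s Prometheus listens on port 9090 for me, and I am assuming here that the span metrics arrive as a counter called calls_total with service_name and status_code labels – depending on the spanmetrics connector configuration the metric may carry a different name or prefix:

$ curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=sum by (service_name) (rate(calls_total{status_code="STATUS_CODE_ERROR"}[5m]))' \
    | jq '.data.result'
# erroneous spans per second, per service, over the last 5 minutes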
The Grafana setup contains 4 pre-configured dashboards, which give at least some color 😄 and also provide access to the Jaeger traces and queries – though it is a little squeezed.

Reading up on some basic OTel concepts (https://ivan-kurchenko.medium.com/building-decoupled-monitoring-with-opentelemetry-5d2755e15922) and realizing that the demo has some potential to be extended. Maybe I can extract the logs if I set up a Loki exporter 🤔
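A first, untested sketch of what that could look like: add a Loki container to the compose file and wire the contrib collector’s Loki exporter into a logs pipeline. The container name, endpoint and pipeline layout below are my assumptions, not something that ships with the demo:

$ cat loki-exporter-sketch.yaml   # fragment I would merge into the demo's collector config
exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push   # assumes a container named "loki" on the default port
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]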
OTel Log – Day 4 & 5 – 13./14.3.2024 – Troubleshooting test
O11y is mostly needed when trouble occurs and you need to find the root cause of a problem to resolve it. I turned on the cartservice and the adservice errors in the demo and started.
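For anyone following along: in this version of the demo the failure scenarios are feature flags evaluated by flagd and defined in a JSON file in the repo. From memory the file is src/flagd/demo.flagd.json and the flags have names like adServiceFailure and productCatalogFailure – check the file for the exact names. Switching a flag’s defaultVariant from "off" to "on" activates the scenario:

$ grep -n 'Failure' src/flagd/demo.flagd.json   # list the failure flags and their current variants
# set "defaultVariant": "on" for the flags you want; flagd should pick up the file change,
# otherwise recreate the container:
$ docker compose up -d --force-recreate flagd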
So my use case that I want to check out with the OTel demo is a typical one:
Hey Rainer, our application is acting weird. Users report errors in the checkout. Can you take a look and repair this?
– an unnamed sys op 😉
How fast can I do that when I have no other info than that there is a faulty app? There are two approaches I usually prefer:
- Look at some metrics about the error rates of your services and see if there is a significant increase to find a starting point.
- Jump into the traces and just dig around – take a look and see what you find. This is what I do when I have no dashboards or graphs to use as a starting point.
I tried #2 first – dig into Jaeger with the traces. Let’s see. I have no info about which service is in trouble and I can’t search across all services, I have to pick one. So I go from top to bottom through the list and search for “tag: error=true”. The short lifespan of a trace in the demo makes this hard though. I have to stop the load generator to do some sniffing around, as the traces are purged after 1 (!) minute while the demo is running.
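My assumption about why they disappear so quickly: the demo runs Jaeger with in-memory storage, which only keeps a limited number of traces (the --memory.max-traces flag of the all-in-one image), so under the load generator’s traffic old traces get evicted within about a minute. If that assumption holds, raising the limit in docker-compose.yml buys more time for sniffing around:

$ grep -n 'max-traces' docker-compose.yml   # find where (and whether) the limit is set
# raise the value, e.g. --memory.max-traces=50000, then recreate the container:
$ docker compose up -d --force-recreate jaeger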


Here we can see that 5 spans of the trace are marked as erroneous with the little red exclamation mark. Opening up the checkoutservice span we find the following:

OK – we found the root cause: Error: ProductCatalogService Fail Feature Flag Enabled – that is the error description in the span. Hooray … that was fast. This is a tiny system though, with just a handful of services, so that approach works. In real production environments we need to do it differently if time is pressing. Hence use case #1: check the metrics and drill down from there. A job for Grafana and the dashboards.

This is (part of) the ready-made span-metrics dashboard in Grafana. As we want to find the root cause of an error, a look at the error rate is helpful. At first sight, most errors occur on the frontend service, which is not the root cause but where we see the outcome. The second-highest error rate is on the ProductCatalogService. So where would we start to dig? On the frontend, of course, as its error rate is so much higher than that of the other services.
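Side note: when the dashboard is not at hand, the same comparison can be pulled straight from Prometheus – with the same caveat as above that the exact metric name depends on the spanmetrics connector configuration. Ranking services by their error ratio (errors divided by all calls) is another angle besides absolute error counts:

$ curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=topk(5, sum by (service_name) (rate(calls_total{status_code="STATUS_CODE_ERROR"}[5m])) / sum by (service_name) (rate(calls_total[5m])))' \
    | jq '.data.result'
# top 5 services by share of erroneous spans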
In the Jaeger UI I then search for the frontend service with tag: error=true, find lots of traces, and when digging into the spans and their logs, I find my root cause as well:

The exception was propagated from the productcatalogservice up to the frontend, and we can nicely read where the problem resides. Nice.
Summary of my little test week
Of course I am extremely biased when it comes to frontends and analytics capabilities since my former companies were really good at that, but OpenTelemetry is about data collection and not about visualization or storage, etc.
The data that I can see in the demo is great, the context is propagated nicely in the traces, and the spans are rich in details and context. There is nothing OTel can do about missing search capabilities on the backend. The demo works flawlessly and provides a fast positive experience. Great work. Now the next step is to fiddle with the knobs and extend the data, maybe check out the instrumentation and add something to it to see how that works. I am also tempted to point the OTel collector to a different backend to see how that works out.
OTel looks to me like a really strong foundation for O11y, and then we can focus on the really hard part: analysing the data that we get so easily from our applications thanks to OTel.
I am looking forward to that.





