My OTel Journey – Part 2 - Observability Heroes

The second part of my journey into OTel is about logging: how to retrieve log data from the OTel store. Logs matter because they carry detailed information, error statements and much more, but the really important information can easily drown in a vast amount of data.
The OpenTelemetry demo uses OpenSearch as the data store and with the help of Grafana Explore I looked into it.

OpenSearch and YAQL

To make good use of OpenSearch I needed to learn YAQL, yet another query language: PPL, which stands for Piped Processing Language. It feels a lot like SQL, except that commands are chained with pipes. Another option is Lucene, which stems from the origins of OpenSearch in Elasticsearch. Both can be used in the Grafana integration.
What should I look for? What is a typical use case for log analysis?
A pretty common one is to look for IDs like transaction IDs or User IDs or for error messages with a certain keyword.
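
To make the two use cases a bit more concrete, here is a minimal Go sketch that only builds the query strings for an ID lookup and a keyword search, once in PPL and once in Lucene. The index name "otel" and the field name "body" are assumptions for illustration, and the helper names are made up; check your own index mapping before using them.

```go
package main

import (
	"fmt"
	"strings"
)

// pplWhereEquals builds a PPL query that returns documents where field equals value.
// Assumption: value contains no single quotes, so no escaping is attempted here.
func pplWhereEquals(index, field, value string) string {
	return fmt.Sprintf("source = %s | where %s = '%s'", index, field, value)
}

// pplContains builds a PPL query that uses like() with % wildcards
// to find documents whose field contains the keyword.
func pplContains(index, field, keyword string) string {
	return fmt.Sprintf("source = %s | where like(%s, '%%%s%%')", index, field, keyword)
}

// luceneMatch builds a Lucene phrase query for a field.
// Backslashes and double quotes inside the value are escaped.
func luceneMatch(field, value string) string {
	escaped := strings.NewReplacer(`\`, `\\`, `"`, `\"`).Replace(value)
	return fmt.Sprintf(`%s:"%s"`, field, escaped)
}

func main() {
	// "otel" and "body" are assumed names, "somevalue" is a placeholder ID.
	fmt.Println(pplWhereEquals("otel", "attributes.userId", "somevalue"))
	fmt.Println(pplContains("otel", "body", "failed"))
	fmt.Println(luceneMatch("attributes.userId", "somevalue"))
}
```

The strings can be pasted into the Grafana Explore query field after switching the query type to PPL or Lucene.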

Now let’s see how we can find trace.span.events from the checkoutservice in the OTel demo app. Is it possible?

In Jaeger I can actually look for a span.event that is defined in Go (checkoutservice):

txID, err := cs.chargeCard(ctx, total, req.CreditCard)
if err != nil {
    return nil, status.Errorf(codes.Internal, "failed to charge card: %+v", err)
}
log.Infof("payment went through (transaction_id: %s)", txID)
span.AddEvent("charged",
    trace.WithAttributes(attribute.String("app.payment.transaction.id", txID)))

So, hopefully the trace will give me the transaction ID … and maybe the amount in a different event? Well, that works for Jaeger but not for OpenSearch, as the logs from the Go components are not yet forwarded to the backend 😕

Check OpenSearch Logsearch vs. Jaeger Trace logs

I tried the ID lookup first: can I find a trace in Jaeger for a given user ID that I found in OpenSearch in an info-level log? I used the cartservice as an example. In OpenSearch the field “attributes.userId” was present and filled.
I ran a query in OpenSearch, found results, copied one of the IDs, searched for a trace in Jaeger, and it failed. 🤔 A minor pitfall is that the attribute has a different name on each side. It is named “attributes.userId” in OpenSearch, while the span tag that I need to search for in Jaeger is called “app.user.id”. Naming conventions matter.
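
One way to avoid tripping over the name change is to keep the mapping in a single place. The following Go sketch is only an illustration: it contains the one pair mentioned above, and the function name is made up. It turns a log field and a value into the key=value form that goes into the Tags field of the Jaeger search.

```go
package main

import "fmt"

// logFieldToSpanTag maps OpenSearch log field names to Jaeger span tag names.
// Only the pair from the text is included; extend it for your own attributes.
var logFieldToSpanTag = map[string]string{
	"attributes.userId": "app.user.id",
}

// jaegerTagQuery returns the key=value string for the Tags field of the
// Jaeger search, or an error if no span tag is known for the log field.
func jaegerTagQuery(logField, value string) (string, error) {
	tag, ok := logFieldToSpanTag[logField]
	if !ok {
		return "", fmt.Errorf("no span tag known for log field %q", logField)
	}
	return tag + "=" + value, nil
}

func main() {
	// "somevalue" is a placeholder for an ID copied from OpenSearch.
	query, err := jaegerTagQuery("attributes.userId", "somevalue")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(query)
}
```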

One example in this screenshot:

A table with userids from a PPL query in OpenSearch

There was no result when we used one of the IDs in Jaeger:

A search screen for Jaeger with the cartservice selected, the tag "app.user.id=somevalue" filled and the message "No trace results. Try another query"

I did not dig deeper into this, but it may have to do with trace sampling. I haven’t checked yet whether the collector does any tail sampling of traces, but that would explain the symptoms. The other way round, taking an ID that I found in Jaeger traces and searching for it in OpenSearch, works fine, which supports the sampling assumption.

Success – a dashboard with log info

I experimented a bit with Grafana, PPL and Lucene and came up with this dashboard that contains 3 panels which are fed from OpenSearch. Sparing you the details, this is the result.

A Grafana dashboard with 3 panels and a black background. Upper left is a table with the timestamp, the message body text and the ID. To the right is the number 19 in green letters (big) and on the bottom a graph depicting the sum of the amounts quoted and the max amount in a line.

We see the total number of failed ad requests of the adservice in the upper right corner, a list of log messages with text and id as a table and as a graph the quoted amount in our store from the cart service. All retrieved via PPL and Lucene and calculated in Grafana.
It was astonishingly hard to get the single number on the screen, as the calculation for the data retrieved with Lucene (count, i.e. the number of lines/documents) just didn’t work. A PPL query with a count did the trick in the end.
Harder than expected but success 😀
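
For anyone who wants to double-check such a count outside Grafana, here is a rough Go sketch against the PPL REST endpoint of OpenSearch (POST /_plugins/_ppl). It is not the query from my dashboard: the index name "otel", the filter condition, the base URL and all function names are assumptions for illustration. The sketch only builds the request and shows how the single count value could be read from the "datarows" part of a response; it does not send anything.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// countQuery builds a PPL query that counts the documents matching a condition.
func countQuery(index, condition string) string {
	return fmt.Sprintf("source = %s | where %s | stats count()", index, condition)
}

// newPPLRequest prepares (but does not send) a POST to the PPL endpoint.
func newPPLRequest(baseURL, query string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"query": query})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/_plugins/_ppl", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// pplResponse holds the part of a PPL response that carries the result rows.
type pplResponse struct {
	Datarows [][]json.RawMessage `json:"datarows"`
}

// parseCount extracts the number from a response with one row and one column.
func parseCount(raw []byte) (int64, error) {
	var resp pplResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return 0, err
	}
	if len(resp.Datarows) != 1 || len(resp.Datarows[0]) != 1 {
		return 0, errors.New("expected exactly one row with one column")
	}
	var n int64
	if err := json.Unmarshal(resp.Datarows[0][0], &n); err != nil {
		return 0, err
	}
	return n, nil
}

func main() {
	// Hypothetical filter; replace it with whatever identifies your log lines.
	query := countQuery("otel", "like(body, '%failed%')")
	req, err := newPPLRequest("http://localhost:9200", query)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(query)
}
```

To actually run the query, the prepared request can be passed to http.DefaultClient.Do and the response body handed to parseCount.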

What about syslog? Let’s try

Now I made a bold move and tried to connect 2 local servers that use syslog, so that they send their data to the collector, which forwards it to OpenSearch. It failed miserably and so far I have no idea why. I think I messed up the collector config, as there were no logs whatsoever in OpenSearch from that moment on. After reverting the config, at least the standard logs were there again.

That needs some more digging. And probably another blog post.

Summary

As easy as it is to set up the OTel demo and get data flowing, getting the data out again takes some work. Query languages and naming conventions need to be learned or experienced before you get results. Coming from a commercial APM background, I was used to a UI that was always easy to use and gave fast results, because you were more or less guided along your use case, e.g. root cause analysis or alerting.
In OTel I need to do all of this myself, which takes some work and/or experience. And I haven’t even talked about instrumentation yet.
