In my 30 years in IT I have seen many things – good ones and bad ones. The bad ones usually had to do with the performance of systems, or rather the lack of it. A few of them stuck in my mind and keep coming back to the surface time and time again because they were significant.
In the light of OpenTelemetry and the mindset shift that is taking place around observability, my “Poor man’s metric creation” story came back with a vengeance. It reminded me of some basics of data gathering: you don’t need a ton of data, you need the right data.
Our cluster got stuck – It’s your fault
It all started with me being in the office doing expenses sometime in 2000. As a Professional Services consultant I was on the road a lot, and office work usually was relaxing and stress-free. Until the phone rang. A colleague called and asked if I could take a look at an issue with a high-profile customer. Their service had stopped working, it was based on our application server, and as we (the vendor) were not in the room when the emergency troubleshooting meeting aka Blame Game happened, we ended up with the blame.
You know how those meetings work, right? A lot of people from all departments gather at short notice in a room and explain why it is not their fault that something went wrong. That means that if you are not in the room at the time, you can’t defend yourself – and that’s why you end up with the blame in your hands.
Troubleshooter by chance
As I had some expertise with the application server setup and I was in the office, it was now my task to help with the troubleshooting. Great, always a pleasure to debug systems that you know nothing about in the first place 🙄. But IT is a team sport, and the office of the customer’s hosting partner was actually on my way home, so I checked in there.
I met some very nice, knowledgeable and practical guys who knew where stuff was running and what the symptoms were. So I got a seat in front of a PC with a connection to the data center and started digging.
Whack-a-mole
Debugging this system was like a game of whack-a-mole. Occasionally, but never predictably, the response times would start to increase, and 5 minutes later you got timeouts as the whole system was stuck. When this happened, all we could do was “whack the mole” by restarting the cluster. Unfortunately we had no metric on a dashboard that would warn us – there were no performance metrics at all. It was all manual work – alerting was done by the support hotline.
The first task at hand was to set up semi-automatic alerting to warn us when the mole started digging its way up. And here some terrible flaws of the architecture were revealed, like the lack of single sign-on for the services incorporated in this portal. This was a portal with lots of different options that could be individually configured via so-called portlets, but there was no underlying user management layer – each of the portlets needed a separate login. A nightmare to monitor and debug, as one user login could trigger multiple further logins, each with different demands on CPU and memory.
Give me a sign
The remedy for the problem so far was a restart of the cluster of application servers, which had all kinds of negative side effects like loss of session information, etc. Once we started troubleshooting, we found out that if we acted early enough, restarting just one of the cluster members was sufficient to remediate the problem. But how could we find out when “early enough” was? When the mole woke up, we had 5 minutes to do something if we were lucky.
We looked for other telltale signs that could warn us earlier and also point us toward the root cause of the issue.
So we turned to the application and system logs to see if there was anything useful in there. There was definitely a lot of stuff, but all mixed up: lots of transaction IDs and state changes, and, as this portal was used by a couple of hundred thousand people, the logs filled up quite fast. The log rotation worked, but it made analysis even harder – the filenames kept changing.
Perl to the rescue
Finally we found one data point that looked as if it could be helpful: the request ID (or something similar, I don’t remember the exact name). There was a “request started for ID: xyz” and a “request completed for ID: xyz” event in the logs. We noticed that shortly before the system got stuck, the number of “started” events outweighed the “completed” events by far. Actually, nothing got completed at all. That filled up the ID sync mechanism in the cluster, which finally led to a stuck cluster. Why it stopped working we didn’t know yet, but we finally had a chance to get an early warning – the number of active IDs 😀
But … there was no metric in the application server that we could look at on a dashboard. Not even a count of started and completed events. We had to build it ourselves.
This was in the year 2000 … no OpenTelemetry, no Prometheus, no bytecode instrumentation – only logs. And Perl. This language had two assets that helped us big time.
First: it has great parsing capabilities and a nice map implementation (the hash). Second: it was preinstalled on every UNIX system we deployed, as it was a staple for CGI scripts in web servers – no, I am not going to explain what CGI scripts were. OK, I’ll play nice: What is CGI?
So I created a script that parsed the log files for “request started” and “request completed” events, put them in a hash indexed by the ID, and removed them when the “completed” event for an existing ID appeared. Then I counted the remaining entries – basically calling map.size(), with the downside that this method did not exist in Perl then; you count the hash keys instead.
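To give you an idea, here is a minimal sketch of what such a script could have looked like – the exact log line patterns and wording are assumptions from memory, not the original code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Track requests that have started but not yet completed.
# Assumed log format: "request started for ID: xyz" /
# "request completed for ID: xyz" – the real wording is long forgotten.
my %active;

while (my $line = <>) {
    if ($line =~ /request started for ID:\s*(\S+)/) {
        $active{$1} = 1;        # new request, remember its ID
    }
    elsif ($line =~ /request completed for ID:\s*(\S+)/) {
        delete $active{$1};     # request finished, forget it again
    }
}

# The poor man's map.size(): count the remaining hash keys.
print scalar(keys %active), " active requests\n";
```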
Finally we had a metric – yes, one!
Running the script periodically, say once a minute, finally gave us the number of active requests. And if this number suddenly increased, we knew the mole had woken up, and we put it back to sleep by restarting one of the cluster nodes.
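Scheduling something like this every minute is a one-liner in cron – the script name and paths here are, of course, made up for illustration:

```
* * * * * perl /opt/monitoring/active_requests.pl /var/log/portal/server.log >> /var/log/portal/active_requests.out
```

In reality you would also have to cope with the log rotation mentioned above, since the filename you parse keeps changing.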
At the same time we passed this information to the developers, who were in a frenzy, as the issue was on the desk of Scott McNealy, CEO of Sun Microsystems at that time, because the customer was a high-profile German company. All the info we could share was helpful in identifying the root cause – which, of course, turned out to be many causes, not just one.
The metric, though, helped us mitigate the impact of the issue on users. Only a few, if any at all, were affected by a node restart.
Why am I telling you this?
The right metric can save a lot of time and nerves – if you have it. Today we have all the possibilities of OpenTelemetry, Prometheus, Dropwizard, Micrometer and whatnot to extract the information that we need. There is still the problem of relevance to solve, though. Which of all the collected data points delivers useful information for my case? In 2000 we had a dry riverbed and were looking for a few drops of water; in 2024 we have a flood and are looking for the same drops as before, only now they are hiding within streams of water ¯\_(ツ)_/¯
BTW – the root cause
We never really found THE root cause. It was a mix of a bug in the cluster mechanism, resource overuse caused by a faulty architecture, and misuse of APIs in combination with a memory leak and massive traffic. It also was not tested very well. After two months we had a pretty stable system, though, with tons of learnings and a couple of sleepless nights spent deploying hotfixes live on the production machines in the data center.
Guess why I joined APM vendors after this experience 😄