Observability, at its core, revolves around data. Think of it like this: raw telemetry data streams in from your OpenTelemetry agents and SDKs, and our mission is to refine that raw material into valuable information. This information is the lifeblood of our work, helping us diagnose issues, pinpoint root causes, and ultimately, do our jobs better, faster, and with fewer late-night heroics.
But once we’ve got this torrent of data, the next crucial question arises: where do we actually store it? We’ve successfully retrieved it – our agents are diligently collecting – but now it needs a permanent home, a place where we can search, analyze, and extract the insights we crave.
The initial resting place: Relational databases and SQL
The immediate answer that springs to mind? A database! Or perhaps more broadly, a data storage solution – even a data lake could theoretically fit the bill. We need somewhere to park all this valuable telemetry.
Historically, the go-to solution has been the realm of relational databases. The magic word here is SQL. In my early days with pioneering APM solutions like Wily, Dynatrace (the original!), and AppDynamics, the underlying data was invariably housed in relational databases. Initially, you had the freedom (and the responsibility!) of choosing your own database – think Oracle, Informix, DB2, and later, the rise of MySQL, MariaDB, and PostgreSQL. With AppDynamics, they streamlined the process by embedding a dedicated MySQL instance, shielding users from the database wrangling.
The beauty of SQL databases lies in their familiarity. SQL is practically a lingua franca in the IT world; most professionals have at least a passing acquaintance with its syntax and can conjure up a basic SELECT * FROM something query. This approachability was a significant advantage.
A shifting landscape: The challenges of microservices and high cardinality
However, the landscape of technology has dramatically shifted since the early 2000s. We’ve witnessed an explosion of services – smaller, more agile microservices, and the ephemeral nature of serverless functions. This evolution has brought with it a challenge that traditional SQL databases struggle to handle gracefully: high cardinality.
Let’s delve into that. In the early days of Wily and Dynatrace, the data was largely uniform: primarily time-series data with a small, predictable set of dimensions. This structured nature allowed for efficient storage in relational tables with well-defined keys and relationships. It wasn’t always lightning-fast, but it got the job done.
Microservices, however, introduced a new dynamic. Consider a scenario where “Service A” calls “Service B” and “Service C,” and then “Service B” also calls “Service D,” while “Service C” might also interact with a “Billing Service.” Now, we want to track the overall response time of “Service A.” That’s manageable. But we also need to dissect the individual response times for each downstream call within the context of Service A. How long did the call to Service B take specifically when initiated by Service A? And then, how long did the subsequent call from B to D take?
As the number of services and their interconnections grows, the number of unique combinations of these contextual data points explodes exponentially. This is the high cardinality problem, and it’s a tough nut for relational databases to crack.
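To make that combinatorial growth concrete, here is a rough SQL sketch. The table and column names (service_call_metrics, caller_service, and so on) are purely illustrative assumptions, not the schema of any particular product; the point is simply that in a pre-aggregated model, every label combination becomes its own time series.

-- Count how many distinct series a pre-aggregated metrics table would
-- have to maintain; every additional label multiplies this number.
SELECT count(*) AS distinct_series
FROM (
    SELECT DISTINCT caller_service, callee_service, endpoint, http_status, region
    FROM service_call_metrics
) AS series;

-- With 20 services calling each other across 50 endpoints, 10 status
-- codes and 5 regions, that is potentially 20 * 20 * 50 * 10 * 5 =
-- 1,000,000 series, before anyone adds a tenant or version label.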
Beyond structure: The limitations of traditional databases
Furthermore, individual time-series data points in isolation often lack inherent meaning. Their true value emerges when viewed in connection, within a specific context. Traditionally, this context was often inferred through experience and pre-defined relationships.
But what happens when the “unknown unknowns” rear their heads? Traditional monitoring excels when you have a solid understanding of your environment and changes are infrequent. You know what to look for, and you can meticulously craft dashboards based on your experience. However, when an unforeseen issue arises – something you didn’t anticipate collecting data for – you’re left scrambling. You lack the flexibility to retroactively analyze your existing data from a new perspective to understand the root cause. Relational databases, with their rigid, upfront schema definitions, offer limited help in such scenarios. You can’t easily ask new questions of your data if the underlying relationships weren’t defined from the outset.
A new paradigm: The power of columnar and non-relational databases
This is where alternative database paradigms truly shine, particularly columnar databases. Think of solutions like ClickHouse, but also consider the power of Elasticsearch and other NoSQL (non-relational) databases. Columnar databases are exceptionally well-suited for observability data because they allow us to efficiently query and aggregate specific data points – like duration, response time, latency, or error rate – directly from our traces.
Yes, you heard that right. The modern approach often involves not directly storing pre-aggregated application metrics as they’re generated. Instead, we calculate them on the fly, as needed, from the rich data contained within our traces.
Consider a single span within a trace. It has a start time, an end time, and therefore, an inherent duration – the response time for that specific component. A complete trace, composed of multiple interconnected spans, provides a wealth of individual performance data.
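To illustrate what “calculating metrics on the fly” can look like, here is a ClickHouse-flavoured SQL sketch. It assumes a hypothetical spans table with service_name, start_time, duration_ms, and status_code columns; that schema is an assumption for illustration, not any product’s actual data model.

-- Derive latency and error-rate metrics at query time from raw spans,
-- instead of storing them up front as pre-aggregated time series.
SELECT
    service_name,
    toStartOfMinute(start_time)               AS minute,
    quantile(0.95)(duration_ms)               AS p95_latency_ms,
    countIf(status_code = 'ERROR') / count(*) AS error_rate
FROM spans
WHERE start_time >= now() - INTERVAL 1 HOUR
GROUP BY service_name, minute
ORDER BY service_name, minute;

Because nothing was aggregated away at ingestion time, the same raw spans can answer tomorrow’s question just as easily as today’s.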
Flexibility is key: Answering the unknown questions
By adopting this approach, we can effectively tackle the high cardinality challenge. Instead of endlessly storing every possible time-series combination, we retain the granular trace data and then dynamically generate the metrics we need at the moment of analysis. This provides the flexibility to slice and dice our data from countless perspectives, even those we didn’t foresee.
This flexibility is paramount because we often don’t know in advance what kind of information will be crucial for troubleshooting. A flexible data schema and storage empower us to ask new questions of our existing data, to create custom metrics and data points on demand, rather than being limited to the pre-defined metrics our systems happen to expose.
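As a hedged example of asking a question nobody planned for, suppose span attributes are stored as a key-value map; the spans table, the attributes column, and the tenant.id key below are illustrative assumptions rather than a fixed schema.

-- Which tenant is driving the latency on the checkout service? The
-- tenant ID was never defined as a metric label; it only exists as a
-- span attribute, yet we can group by it on demand.
SELECT
    attributes['tenant.id']       AS tenant,
    count(*)                      AS calls,
    quantile(0.99)(duration_ms)   AS p99_latency_ms
FROM spans
WHERE service_name = 'checkout'
  AND start_time >= now() - INTERVAL 30 MINUTE
GROUP BY tenant
ORDER BY p99_latency_ms DESC
LIMIT 10;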
The modern toolkit: Platforms built for dynamic data analysis
Of course, having this flexible data storage is only half the battle. We also need tools that are designed to leverage its capabilities. This is where the new wave of observability platforms, such as Dash0, Honeycomb, and others, are leading the charge. They are architected around the power of trace and span data, enabling users to dynamically derive insights and create monitoring, alerting, and dashboards on the fly. The days of endless “metric hunting” are fading: you no longer have to predict, up front, every metric you might one day need.
Unlocking AI potential: The value of granular data
This evolution is a significant step forward, paving the way for the effective application of AI in observability. AI thrives on vast amounts of raw, undigested data. Traditional time-series databases, while efficient for storing aggregated metrics, often make assumptions and compress data in ways that can limit the potential for deeper, AI-driven analysis. By contrast, retaining granular trace data allows AI algorithms to identify patterns and anomalies that might be obscured in pre-aggregated metrics.
The power of storing detailed trace data lies in its interpretability. We can combine individual data points – events, durations, attributes, and labels – to create precisely the metrics we need at a specific point in time. We can persist these derived metrics for ongoing monitoring or discard them after our immediate troubleshooting is complete.
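If one of those derived metrics proves worth keeping, it can be persisted. Here is a ClickHouse-flavoured sketch using a materialized view that keeps aggregating new spans as they arrive; the view, table, and column names are assumptions for illustration only.

-- Persist a derived error-rate metric for the checkout service; if it
-- stops being useful, the view can simply be dropped again.
CREATE MATERIALIZED VIEW checkout_errors_per_minute
ENGINE = SummingMergeTree()
ORDER BY minute
AS SELECT
    toStartOfMinute(start_time)        AS minute,
    countIf(status_code = 'ERROR')     AS errors,
    count(*)                           AS total
FROM spans
WHERE service_name = 'checkout'
GROUP BY minute;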
Balancing insight and cost: Intelligent data management
However, storing vast quantities of raw trace data can be costly. This necessitates intelligent tooling that helps us make sense of the data and efficiently filter out what’s truly valuable versus what can be discarded. Modern observability tools are evolving to address this, allowing users to sift through data, identify key information, and define rules for data retention and filtering, even at the OpenTelemetry collector level, preventing unnecessary data from ever reaching the backend.
Filtering the noise: Focusing on what truly matters
Think about health checks. Often, a significant portion of your traffic consists of these essential but generally uninteresting requests. We don’t necessarily need to store every trace of every health check. However, identifying when health checks fail is crucial. The ideal tooling helps us distinguish between routine health checks and those that signal a potential issue, allowing us to retain the valuable signals while discarding the noise.
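Whether such a rule runs as a retention policy in the backend or as an equivalent filter in the OpenTelemetry Collector, the underlying condition is the same. A ClickHouse-flavoured sketch of the backend variant might look like this (the spans table, the http_route column, and the /healthz route are illustrative assumptions):

-- Age out routine, successful health-check spans quickly, but keep
-- everything else, including the failing probes that actually matter.
ALTER TABLE spans
    MODIFY TTL start_time + INTERVAL 1 DAY
    DELETE WHERE http_route = '/healthz' AND status_code != 'ERROR';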
Ultimately, the right data storage solution for observability is one that offers the flexibility to slice and dice your data on demand, enabling you to extract the information you need, when you need it, without being constrained by rigid, pre-defined relationships.
Summary
In the dynamic realm of observability, the journey of our data from collection to insight hinges critically on its storage. While relational databases served as the initial stronghold, the complexities of modern, high-cardinality environments necessitate a shift towards more flexible solutions like columnar and non-relational databases. By embracing the richness of trace data and empowering on-demand metric generation, we gain the agility to answer unforeseen questions and unlock the potential for advanced analytics, all while navigating the crucial balance between comprehensive insight and efficient cost management through intelligent data filtering and retention strategies.