The Evolution of Observability
What is Observability?
Observability's origin story starts in control theory, a field of engineering and mathematics concerned with understanding and influencing dynamic systems to achieve desired outcomes. In control theory, observability is a measure of how well a system's internal state can be determined from its external outputs. This concept parallels modern Observability as defined by Charity Majors, co-author of Observability Engineering and CEO of Honeycomb, who describes Observability as "… the ability to understand any inner system state just by asking questions from the outside of the system."
Observability provides the ability to gain a deep understanding of system behaviour, enabling you to ask new and unexpected questions about your system. This means that when unexpected issues arise, you can dynamically investigate and analyze the system's state in real-time to identify the root causes. Observability helps address unknown-unknowns—those issues that you didn't anticipate or prepare for.
Observability is not optional. In today's ever-changing technology landscape, systems fail in new and unexpected ways, making it vital that their behaviour can be analysed and understood. But if Observability is that important, why have we not been doing it all along? For that, we need to understand the history and constraints of monitoring and Observability.
The History of Observability
When it comes to applications, traditional monitoring is a relic from the days when systems were simple monoliths that were easy to reason about. In those times, everything happened within a single process, which made debugging straightforward and localized to one place. Additionally, storage costs were high, so storing compacted and aggregated data points for historical analysis was considered sufficient. This approach worked well because the systems were predictable, and there was a limited need to drill down into detailed data for troubleshooting.
Then came Time-Series Databases (TSDBs), which made it more cost-effective to store pre-aggregated metrics and build dashboards on top of them. TSDBs allowed for efficient storage and querying of time-stamped data, enabling engineers to visualise trends and understand their systems' historical performance more easily.
"A time series is a sequence of data points where each point is a pair: a timestamp and a numeric value. A time series database stores a separate time series for each metric, allowing you to then query and graph the values over time." 1
This was a significant step forward in managing and analysing metrics, allowing for more powerful insights into system health over time. However, the approach had an important shortcoming: it required you to know in advance which questions you might want to answer in production so that you could create the necessary metrics. Unforeseen issues and new questions were therefore difficult to address, because the data was aggregated at write time and the underlying raw data, crucial for root cause analysis, was no longer accessible.
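To make the trade-off concrete, here is a minimal sketch of write-time aggregation, assuming a hypothetical checkout service that only records a pre-defined counter; the metric and field names are illustrative, not taken from any particular TSDB.

```python
import time
from collections import defaultdict

# Each time series is a list of (timestamp, numeric value) pairs,
# keyed by metric name.
time_series = defaultdict(list)

def record_counter(metric_name: str, value: float) -> None:
    """Append one data point to the named time series."""
    time_series[metric_name].append((time.time(), value))

# The question "how many checkouts failed?" was anticipated, so a metric exists.
record_counter("checkout.failures", 1)

# The question "which customer's checkouts failed?" was not anticipated.
# The raw event (customer ID, order contents, error details) was discarded
# at write time, so this data can no longer answer it.
```

The aggregation that makes the data cheap to store is precisely what removes the detail needed to answer questions you did not plan for.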
Fortunately, around this time, structured logging emerged, shifting the focus from attempting to predict specific system failures to capturing detailed logs that could be queried in production. Engineers could now answer far more questions than metrics alone allowed and investigate production issues more effectively, a significant leap forward in system visibility and troubleshooting capability.
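As a rough illustration, the sketch below emits structured log entries as JSON, assuming a hypothetical payment service; the event and field names are invented for the example.

```python
import json
import logging
import sys
import time

# Each log entry is a JSON object, so fields can be queried later rather
# than parsed out of free-form text.
logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

def log_event(event: str, **fields) -> None:
    """Emit one machine-readable log entry with a timestamp and named fields."""
    entry = {"timestamp": time.time(), "event": event, **fields}
    logger.info(json.dumps(entry))

# Because the fields are structured, a log store can answer questions such as
# "show all failed payments over 100" without any pre-defined metric.
log_event("payment_processed", amount=250.0, currency="GBP", status="failed")
```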
However, logs had their limitations. Developers had to write additional code to calculate durations, often had to guess the execution order when steps ran in parallel, and had to ensure that IDs were passed consistently to tie logs together within a service. With the shift towards microservices, these issues became more pronounced. A correlation ID now had to be passed across services, and every service had to handle that ID and use it consistently in all log messages - for example, under the same property name across services and teams. Discrepancies in server clocks also became an issue, as logs relied on timestamps to establish the order of events across distributed microservices. Finally, logs were not inherently machine-readable; they were designed for human consumption, making them challenging to analyse and query systematically.
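The following sketch shows how much of this correlation work fell on developers. It assumes two hypothetical services that have agreed on a header name ("X-Correlation-ID") and a log property name ("correlation_id"); every service in the chain has to repeat this plumbing by hand.

```python
import json
import time
import uuid

def log(event: str, correlation_id: str) -> None:
    # The property name must match exactly across services and teams.
    print(json.dumps({"timestamp": time.time(),
                      "event": event,
                      "correlation_id": correlation_id}))

def handle_request_in_service_a() -> dict:
    correlation_id = str(uuid.uuid4())  # start of the request chain
    log("order_received", correlation_id)
    # The outgoing call must carry the ID, or the chain of logs is broken.
    return {"X-Correlation-ID": correlation_id}

def handle_request_in_service_b(headers: dict) -> None:
    correlation_id = headers.get("X-Correlation-ID", "missing")
    log("payment_charged", correlation_id)

handle_request_in_service_b(handle_request_in_service_a())
```

If any service forgets to forward the header, renames the property, or logs with a skewed clock, the ability to reconstruct the request is lost.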
The Rise of Microservices and Its Impact on Observability
With the rise of microservices architecture, software systems began to evolve from large monolithic applications to smaller, independent services. This shift brought many benefits, such as improved scalability, resilience, and the ability to develop and deploy services independently. However, it also significantly increased the complexity of understanding and managing these systems.
In a microservices environment, a single user request might traverse multiple services, each of which could run in a different environment, use different technologies, or even be maintained by a separate team. This distributed nature made it challenging to gain a holistic view of system health, detect issues, and determine their root causes. Traditional monitoring techniques, which worked well for monolithic applications, fell short when faced with the intricate web of interactions within microservices.
The industry started focusing on distributed tracing standards to address these challenges, allowing engineers to trace requests as they travelled through multiple services. This led to the emergence of two key projects: OpenCensus and OpenTracing. OpenCensus, initially developed by Google, provided a set of libraries for capturing traces and metrics, while OpenTracing aimed to offer a vendor-neutral API for distributed tracing across different systems.
Despite their usefulness, having two separate projects created fragmentation in the community, making it difficult for organizations to choose the right solution and achieve interoperability. To solve this problem, the OpenTelemetry project was born as a merger of OpenCensus and OpenTracing, providing a single set of APIs, libraries, and tools to collect distributed traces, metrics, and logs, regardless of the technologies in use. Today, OpenTelemetry has become the de facto standard for achieving Observability in modern distributed systems, making it far simpler to gain insight into complex microservice environments.
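As a small taste of what this looks like in practice, here is a minimal tracing sketch using the OpenTelemetry Python SDK (the opentelemetry-api and opentelemetry-sdk packages); the service and span names are hypothetical, and a real deployment would export spans to a tracing backend rather than the console.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Configure a tracer provider that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

# Spans carry timing, parent/child relationships, and attributes, so the
# ordering, duration, and correlation that logs required by hand come built in.
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "12345")
    with tracer.start_as_current_span("charge_payment"):
        pass  # call the payment service here
```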
Conclusion
Observability is essential in managing modern software systems, especially with the complexity introduced by microservices and distributed architectures. Unlike traditional monitoring, which relies on pre-defined metrics and known failure conditions, Observability enables a deeper understanding of system behaviour by analysing telemetry data in real time. This empowers teams to explore unknowns, diagnose root causes efficiently, and maintain system reliability even in unpredictable environments. The rise of standards like OpenTelemetry has further streamlined the process, making it easier to instrument, collect, and analyse data across diverse systems. Ultimately, Observability provides the insights needed to proactively manage and stabilise complex software systems rather than merely react to failures.