Understanding the True Causes of High Logging Costs

In my previous blog post, I discussed the evolution of observability, exploring its history and the rise of OpenTelemetry as a powerful tool for tackling modern observability challenges. Today, I want to focus on a pressing issue faced by many organisations, including the one I’m currently working with: the high cost of logging. This problem is complex, driven by a mix of tooling, architecture, and operational practices. By unpacking these contributing factors, we can better understand what needs to be done to rein in those costs and make our logging strategies more efficient.

The Tooling Jungle

One of the primary contributors to high logging costs is the fragmented nature of logging tools. Our current setup has systems deployed across multiple environments—Azure, AWS, and on-premises—each using a different logging backend. Each environment relies on the logging solution most appropriate for that particular hosting venue, which has led to a proliferation of tools.

When an issue arises in production, we must jump between multiple tools, often with uncorrelated data, to gather the information we need. And that is just for logging. Metrics are similarly spread across different tools depending on where the application is hosted. Hosting venue is not the only dividing line; signal type is another. On-premises, for example, we use Elastic and Kibana for logs but Jaeger for distributed traces.

This kind of scattered tooling isn’t just inefficient—it’s costly. Licensing fees, storage costs, and engineers' time investigating issues all add up. More than anything, it highlights the need for a unified approach to logging and observability to streamline operations and reduce complexity.

The Monolithic Log: One Size Does Not Fit All

Logs are often treated as a monolithic entity—one bucket that captures every event, error, and activity—but this approach is inefficient. Different types of logs serve different purposes, and failing to recognise these differences can lead to bloated storage and excessive costs.

  • Compliance, Audit, or Security Logs: These logs are essential for audit and regulatory purposes. They must be retained for extended periods, sometimes years, but they are rarely accessed except during specific audits or investigations. As such, they require reliable long-term storage but don’t need to be optimised for frequent, fast access. Hence, cheap, cold storage can be used to cut unnecessary costs.
  • Debugging Logs: Logs used for debugging, in contrast, serve short-term troubleshooting and are accessed far more frequently. They require a system that supports efficient querying and low-latency retrieval to ensure rapid access during critical moments. Unlike compliance logs, debugging logs typically do not need to be retained for long (7 to 30 days, depending on how mature the observability practice is), since their primary purpose is to assist with immediate problem resolution. By optimising these logs for short-term retention and fast access, organisations can ensure efficient troubleshooting while minimising unnecessary storage costs.

By treating all logs the same, organisations often store everything in expensive, high-performance resources—even data that could be stored more cheaply in long-term archives. Optimising storage and retention policies for each log type can go a long way towards cutting costs without sacrificing functionality.
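To make this concrete, here is a minimal sketch of how logs might be routed to different storage tiers based on their purpose. The categories, destinations, and retention windows are illustrative assumptions, not a prescription for any particular stack.

```python
from dataclasses import dataclass

# Illustrative policies only: the tiers, names, and retention windows
# below are assumptions for the sake of the example.
RETENTION_POLICIES = {
    "audit": {"destination": "cold-archive", "retention_days": 365 * 7},
    "debug": {"destination": "hot-index", "retention_days": 14},
}

@dataclass
class LogRecord:
    category: str  # e.g. "audit" or "debug"
    message: str

def route(record: LogRecord) -> dict:
    """Pick a storage tier and retention window based on the log's purpose."""
    policy = RETENTION_POLICIES.get(record.category, RETENTION_POLICIES["debug"])
    return {"message": record.message, **policy}

print(route(LogRecord("audit", "user role changed")))
# {'message': 'user role changed', 'destination': 'cold-archive', 'retention_days': 2555}
```

The point is not the code itself but the separation it encodes: the moment a log's purpose is known, its storage and retention can be chosen to match.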

The Hidden Developer Cost

High logging costs are not just about storage and compute fees—they also come in the form of developer and opportunity costs. Debugging production issues often falls to the most experienced engineers, who know the systems best and have the battle scars to prove it. These are usually also the most expensive engineers, the ones organisations would rather have building new features or optimising existing systems.

Some studies suggest that developers spend around 20% of their time debugging production issues. Let's do a quick back-of-the-envelope calculation: for an organisation with 200 developers (roughly the headcount at this particular organisation) earning an average of, say, £50,000 per year, that 20% equates to £2,000,000 spent annually on debugging alone. Modern observability tools can reduce this time by 90-95%, representing substantial potential savings in terms of opportunity cost and developer morale.
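Here is the same arithmetic spelled out; the figures simply mirror the assumptions above.

```python
# Back-of-the-envelope figures mirroring the assumptions in the text.
developers = 200        # headcount
avg_salary = 50_000     # GBP per year
debug_share = 0.20      # ~20% of time spent debugging production issues

debugging_cost = developers * avg_salary * debug_share
print(f"Annual cost of debugging: £{debugging_cost:,.0f}")        # £2,000,000

# Hypothetical saving if better tooling cut that time by 90%:
print(f"Potential annual saving:  £{debugging_cost * 0.9:,.0f}")  # £1,800,000
```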

Admittedly, this is an oversimplified view, as observability itself isn’t free. Building and maintaining observability tooling requires time and resources, especially initially, and this is often underestimated or overlooked. The key is to adopt tools with a low barrier to entry and minimal operational overhead, so you can focus on mining the telemetry you collect to better understand your systems and, as Hazel Weakly often puts it, answer meaningful questions from the outside. Done well, the cost of implementing observability is outweighed by the value of recovering that otherwise lost opportunity cost.

The Cost of Attrition

Developers enjoy using new technology and solving interesting problems; dealing with production issues is rarely on that list. It becomes even harder when the tools and the telemetry they produce add friction and complexity to the already stressful task of managing production systems. I have worked in such environments and seen the effect of morale-sapping production alerts firsthand. Watching developers wade through out-of-date runbooks and unhelpful dashboards, trying to decipher a plethora of red herrings and false-positive alerts before reaching a root cause, is soul-destroying. This often leads to good developers giving up or, worse, quitting. "The direct cost of replacing an employee ranges from 50% to 60% of the employee's salary. However, that range goes up to 90% to 200% of the separated employee's salary, when the lost productivity, employee engagement, and other soft costs are taken into account." (source: Praisidio)

Breaking the Cycle

To truly address the high cost of logging, we need a different approach—one that moves beyond the fragmented toolset and the monolithic logging mentality. By rethinking how we handle observability (moving away from siloed telemetry signals, choosing the right tools, and ensuring each signal serves its intended purpose efficiently), we can reduce costs while improving our ability to observe our systems. A trace-first observability strategy, supported by OpenTelemetry and wide, structured events, offers a promising way forward. I’ll explore this further in subsequent posts, as I am currently in the midst of introducing observability at an organisation.
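As a small taste of what a wide, structured event can look like, here is a minimal sketch using the OpenTelemetry Python API. The span name, attribute keys, and values are purely illustrative, and the SDK/exporter configuration is omitted; without it, the calls below fall back to OpenTelemetry's no-op tracer.

```python
from opentelemetry import trace

# Without SDK/exporter configuration this resolves to a no-op tracer,
# so the sketch runs but nothing is exported anywhere.
tracer = trace.get_tracer("checkout-service")  # illustrative service name

def process_order(order_id: str, item_count: int, customer_tier: str) -> None:
    # One wide event per unit of work: a single span carrying many
    # high-cardinality attributes, rather than dozens of scattered log lines.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.item_count", item_count)
        span.set_attribute("customer.tier", customer_tier)
        # ... do the actual work, recording outcomes as further attributes ...
        span.set_attribute("order.outcome", "fulfilled")

process_order("ord-42", 3, "premium")
```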