Is Observability Worth the Cost?

Observability can be expensive for data at scale, so how do you ensure it’s worth it for large distributed systems?

Marty Kagan

Published: Apr 22, 2024

6 minute read

In 2009 when I co-founded a company named Cedexis, the concept of software observability didn’t exist yet. Splunk had been around for six years, focused on infrastructure monitoring, and engineering teams often had their own homegrown logging systems. But at the time, we didn’t know of anyone besides Cedexis that was inferring the internal state specifically for CDNs and media delivery infrastructure by observing external outputs (the widely accepted definition of observability). In fact, observability didn’t really become a widely considered concept until a 2016 blog post by the observability engineering team at the company formerly known as Twitter.

As more enterprises built large distributed systems, the need for observability grew tremendously. With observability, enterprises had fresh insights into performance issues and bottlenecks. This is part of the reason we’ve seen huge leaps forward in both the scale and performance of infrastructure, including the cloud platforms that most of us rely on. Without observability, we wouldn’t have a good understanding of how our systems are performing—and we’d have to write off many performance problems as part of the cost of doing business in the cloud.

I see regular hot takes that suggest “observability is a great shiny lie” or “real-time monitoring is a waste of time” or “observability is a fancy rebrand of monitoring.” For smaller applications, it may be possible to get by with in-house logging systems and solid testing and QE practices, but in my experience that isn’t the case with large distributed systems. Once you are dealing with terabyte- and petabyte-scale volumes, systems interact in complex and sometimes unpredictable ways. Observability may not fix all the chaos, but without it, you’ll truly be in the dark.

So is observability important for modern ops and engineering teams? My answer is a strong yes.

But there’s another question worth asking. Is observability in its current form worth the expense and effort? You may be surprised to hear that too often, my answer is no.

So Is Observability Worth the Expense and Effort?

Unfortunately, many incumbent observability solutions cost too much and aren’t providing enough value. (Spoiler alert: Hydrolix is built to solve this problem and maximize the value of observability while keeping costs down.)

The high costs are mainly related to tightly coupled, expensive underlying architecture that often isn’t performant at scale. This leads customers to sacrifice data through observability anti-practices such as shortening retention windows, sampling, and aggregating, all in order to reduce costs and hit performance targets. All this compromising is then sold back to users as “best practices.” For big data use cases, it is not uncommon for Splunk users to keep only 30 days of data in hot storage, while Elasticsearch users may keep only 7. These highly restrictive retention windows lead companies to move data into cold storage, which is hard to access in most operational contexts and typically becomes dark data. People want as much of their data as possible, preferably served hot, and yet this simply isn’t affordable with many solutions.
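
As a toy illustration of why sampling in particular is risky, here’s a minimal Python sketch. All of the numbers and field names are hypothetical and exist only to show the failure mode: 1-in-100 sampling can all but erase a rare error burst that full-fidelity raw logs would surface immediately.

```python
import random

random.seed(7)

# Hypothetical workload: 100,000 requests, of which 50 are 500s from a
# single misbehaving node. Illustrative numbers, not from a real system.
logs = [{"status": 200, "node": f"node-{i % 20}"} for i in range(100_000)]
for i in range(50):
    logs[20_000 + i] = {"status": 500, "node": "node-7"}

# Full-fidelity retention: the burst and its single source are obvious.
errors = [r for r in logs if r["status"] == 500]
print(f"raw logs: {len(errors)} errors, all from {errors[0]['node']}")

# 1-in-100 head sampling: the burst shrinks to a handful of events,
# or disappears entirely on an unlucky run.
sampled = [r for r in logs if random.random() < 0.01]
sampled_errors = [r for r in sampled if r["status"] == 500]
print(f"1% sample: {len(sampled_errors)} errors out of {len(sampled)} records kept")
```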

The need for long-term, cost-effective hot data at scale led to the founding of Hydrolix and the creation of fast-growing observability and SIEM products powered by Hydrolix such as TrafficPeak. Hydrolix uses stateless architecture and object storage to radically reduce costs while solving key performance issues. That reduction in cost for data at scale is essential for observability to deliver value and remain relevant.

Cut Costs, Retain More Data, Make It Searchable

Cutting costs by 70-80% and extending hot data retention windows make observability much more useful. Cost-effective, long-term retention allows teams to keep data available longer, even if the value of the data isn’t yet clear. As one of our solutions engineers, Dan Sullivan, said recently:

With complicated systems, we don’t know all the ways systems can break until they do break. An important log message or fact in one scenario may be useless in another. Log messages useful to a security engineer may not be useful to a DevOps team focused on performance or may be useful at a different point in time. 

Capturing, analyzing, and retaining raw logs—facts about the system—are essential when a new problem arises. An operator doesn’t always know in advance the questions they need to ask. And raw logs—so long as they are available and haven’t been discarded because of cost or performance issues—have many stories to tell, even if it sometimes takes additional digging and analysis to uncover them.

The teams that use Hydrolix make important decisions impacting core business systems every day using raw data retained for months and even years. They are not “drowning in data.” They are using observability to make data-driven business decisions.

Most Enterprises Need to Keep Their Data

This notion of “drowning in data” is in too much marketing copy these days. The people we talk to have a different technical problem. They want to keep their data at scale, but limitations in current systems force them to dispose of data or pay exorbitant fees and put up with poor query performance. Observability platforms impose tradeoffs on customers that affect not just observability but also end-user reporting, billing, compliance-driven retention of raw logs, and machine learning. These tradeoffs even affect the platforms’ own internal systems. I wrote for InsideBigData.com about Cisco’s acquisition of Splunk, which was driven in part by the value of Splunk’s data for training AI models. For Cisco to maximize the value of those models, it will need to keep far more data long-term, or the models won’t be as accurate.

To keep data hot (and readily available) for the long term, you need a system with decoupled storage and independently scalable ingest, query, and control subsystems that allows you to store data at scale in a cost-effective manner. The query system must be optimized for both real-time and historical analysis using indices, predicate pushdown, and independently scalable, per-workload query pools. The ingest system needs to be able to process streaming data, pull directly from popular tools like Kafka and Kinesis, and ideally even enrich and transform data on the fly before it goes into storage. The system should use commodity object storage, which is already extremely cost-effective, and also include high-density compression to reduce overall storage footprint (and costs) much further. This is exactly what Hydrolix is designed to do.
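
As a rough sketch of the ingest side of such a system, here’s a minimal Python example. It is not Hydrolix’s actual pipeline: a generator stands in for a stream pulled from Kafka or Kinesis, a local directory stands in for object storage, stdlib gzip stands in for high-density columnar compression, and the field names and enrichment step are invented for illustration.

```python
import gzip
import json
import time
from pathlib import Path

def event_stream():
    """Stand-in for a streaming source such as Kafka or Kinesis."""
    for i in range(10_000):
        yield {"ts": time.time(), "path": f"/api/item/{i % 100}",
               "status": 200 if i % 97 else 503}

def enrich(event):
    """Transform and enrich each event in flight, before it reaches storage."""
    event["is_error"] = event["status"] >= 500
    event["service"] = "catalog" if event["path"].startswith("/api/item") else "other"
    return event

def write_batch(batch, out_dir, batch_id):
    """Compress a batch and write it as an immutable object.
    A local directory stands in for commodity object storage here."""
    payload = "\n".join(json.dumps(e) for e in batch).encode()
    path = Path(out_dir) / f"batch-{batch_id:06d}.ndjson.gz"
    path.write_bytes(gzip.compress(payload))
    return path

def ingest(out_dir="./objects", batch_size=1_000):
    Path(out_dir).mkdir(exist_ok=True)
    batch, batch_id = [], 0
    for event in event_stream():
        batch.append(enrich(event))
        if len(batch) >= batch_size:
            write_batch(batch, out_dir, batch_id)
            batch, batch_id = [], batch_id + 1
    if batch:
        write_batch(batch, out_dir, batch_id)

if __name__ == "__main__":
    ingest()
```

The shape is what matters: enrich in flight, batch, compress, and store immutable objects, so that storage and the query tier can scale independently of ingest.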

With the right system at the right price, observability is absolutely worth it. However, traditional observability platforms weren’t built with today’s big data use cases in mind. They have built impressive, feature-rich interfaces, but rebuilding their underlying infrastructure and systems is a huge lift, and they’re simply not cost-effective enough at scale.

Observability’s Business Value

Observability is valuable to a business when it’s more profitable to answer a question about your infrastructure than to ignore it.

By lowering the cost of observability, we make it profitable to solve more problems than before. 
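
To make that concrete, here’s a back-of-the-envelope sketch in Python. Every number is hypothetical and exists only to show the shape of the decision: answer the question when the expected cost of staying ignorant exceeds the cost of retaining and querying the data needed to answer it.

```python
# Hypothetical, illustrative numbers only; not benchmarks or pricing.
incident_cost = 50_000        # loss if the issue goes undiagnosed ($)
incident_probability = 0.10   # chance the issue actually bites this quarter

cost_to_answer = {
    "legacy platform": 8_000,    # retain + query the relevant data ($)
    "low-cost platform": 1_500,  # same question, cheaper retention ($)
}

expected_loss = incident_cost * incident_probability  # $5,000

for label, cost in cost_to_answer.items():
    worth_it = expected_loss > cost
    print(f"{label}: expected loss ${expected_loss:,.0f} vs "
          f"cost ${cost:,.0f} -> worth answering: {worth_it}")
```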

Observability platforms need to:

  • Cost much less when retaining data at scale.
  • Make it easy to get the raw logs to solve any problem, either real-time or historical, whether the problem happened a minute ago or a year ago.
  • Make it easy to combine data and create tables or views that reflect business needs. Too many complex JOINs drive up costs and complexity while reducing query performance.
  • Make it cost-effective to keep data for as long as possible on the chance it might be useful. This is a tough criterion to meet but an important one. Retaining data for known problems is easy to manage. But what about problems that emerge from the system over time? What about novel advanced persistent threats? What about having data on hand to train models that don’t exist yet that could provide breakthroughs in QoE, anti-fraud and anti-piracy efforts, or predictive routing?
  • Engineer for our ignorance. We are still at the beginning of the observability era, and we still don’t know all of the use cases for the data we collect. If we don’t build systems that allow us to keep that data, we are foreclosing future opportunities to understand the internal workings of what will undoubtedly be far more complex systems generating even greater volumes of external outputs.

So let’s revisit the question: is observability worth the expense and effort? If your platform provides all of the above at scale while remaining cost-effective, then it’s one of the best investments you can make to ensure that your infrastructure and services are reliable and performant. If your platform doesn’t provide these things, you might want to reach out and set up a POC.

Next Steps

Read Transforming the Economics of Log Management to learn about the issues facing many of today’s observability platforms, and how next-generation cloud data platforms must maximize the benefits of object storage to make log data cost-effective.

Learn more about how Hydrolix offers cost-effective data at terabyte scale and contact us for a POC.
