
Why All Your Data Storage Should Be Hot

Learn how Hydrolix offers hot storage performance at the cost of cold storage.

Franz Knupfer

Published:

Oct 09, 2023

9 minute read

If you’re working with large volumes of log data, the thought of runaway costs might make you break into a cold sweat. Your hot data costs too much to store while your cold storage is too difficult to access. You could try to fix the cost issue with warm storage, which costs less than hot storage but comes with increased query latency. You’ll need to manage tradeoffs such as decreased performance—and your warm storage still might cost too much. Or maybe you’re working with so much data that the only cost-effective solution is to put all your data on ice. 

But what if you didn’t have to make that tradeoff between cost and performance? What if you could get hot storage access at the cost of cold storage? With Hydrolix, you get ultra-fast SSD-like query performance from inexpensive cloud storage, eliminating the tradeoff between hot and cold data and making all of your data both easy to access and cost-effective.

This post covers:

  • Why hot storage usually costs so much
  • The problems with cold storage
  • Downsides of tiered storage
  • Why all your storage should be hot
  • How Hydrolix offers hot storage at cold storage costs

Why Hot Storage Usually Costs So Much

Hot storage gives you efficient query performance, low latency ingest, and high scalability. However, these benefits have traditionally come with a cost. To achieve high performance, most platforms use expensive storage solutions like solid-state drives (SSDs) and storage arrays. These solutions offer fast read and write times for your data, but the extra cost makes your hot storage expensive—and this architecture makes it prohibitively expensive to keep all of your storage hot.

These costs are even higher when you consider the sheer amount of data that many companies ingest daily. To reduce costs, it’s typical to move real-time data to warm, cool, or cold storage after a set amount of time. These tiers come with tradeoffs such as lower query performance. For many log management solutions and observability platforms, a typical default for hot data retention is somewhere between three and thirty days. Some companies ingest so much log data that it’s necessary to move data to cold storage after just a few days to stay cost-effective. And some solutions won’t retain your log data beyond the hot storage period at all, which means you have to export it to cold storage yourself, creating additional headaches and complexity.

The problem with this model is that hot storage retention periods simply aren’t long enough for complete data analysis. Data-intensive applications like e-commerce personalization, ad targeting, XDR, ML model training, and observability are just a few use cases where limits on data retention can negatively impact your business.

The Problems With Cold Storage

When your data goes into cold storage, it becomes much harder to compare, understand, and analyze. As the saying goes, “Those who cannot remember the past are condemned to repeat it.” The same could easily be said of your data. What happens when your team has a major incident that looks similar to something that happened six months or a year ago—but you can’t query data from that time period to compare and see what went wrong? Or what if you are trying to understand patterns in your data that suggest periodic slowdowns, but you can’t go back far enough to better understand those patterns?

Another typical use case where cold storage is problematic is for large scale, cyclical events. For example, you might have a yearly Black Friday sale or manage other large annual events such as holiday specials, sporting events, and more. When all your storage is hot, you can make comparisons across large events in near real time that can support your business. When the data from the last major event is in cold storage, though, you can’t make those connections fast enough to act on them. You lose valuable contextual information—such as what a customer bought or looked at last year—to connect to your most recent data. And the customer that almost made a purchase last year might be a near-miss this year, too. Even worse, you’ll never connect those dots and understand what happened.

Ultimately, having more data over a longer time period increases its accuracy and precision, makes it easier to detect patterns, provides valuable context, and gives you richer insights. By moving data into cold storage where it’s hard to query and access, you risk losing the information your organization needs to thrive.

Downsides of Tiered Storage

Here are some of the downsides of tiered storage:

  • Slow queries: Cold storage typically pairs less RAM with many more rows of data than hot storage. That low RAM-to-data ratio leads to much slower queries, which makes it difficult to get actionable insights in a timely fashion.
  • Query bottlenecks: Your queries are only as fast as the coldest tier they touch. If a query pulls from both hot and cold storage, you’ll be waiting on the slowest part of your system. Even if your solution returns query results asynchronously (giving you the hot-storage data faster), you won’t have a complete picture until the full response arrives.
  • Resource-intensive migrations: Moving data between tiers takes considerable compute, and the data must stay backed up while it’s being written to the next tier. That duplicated data means extra storage overhead for the duration of every migration.
  • Extra maintenance: With tiered storage, you have to manage multiple storage solutions, not just one. Depending on your implementation, each solution might work differently.

The main benefit of tiered storage is to cut down on costs—but if you can keep all your storage hot at no additional cost, there is minimal benefit to tiered storage for most use cases.

Why All Your Storage Should Be Hot

The traditional cost differences between hot and cold storage have created a false dichotomy, one where your most recent data is valued while older data—sometimes data that is only days or weeks old—is devalued. But that’s rarely an effective way to value your data. You don’t know which data points you’ll need until you make a query. And the efficiency of that query will be limited by the coldest tier of data that you are pulling data from. If some of that data is hot, some warm, and some cold, you won’t be getting any of the benefits of hot storage in the first place. You’ll just be footing the extra cost.

When all of your storage is hot, you get the following benefits:

  • Make historical comparisons for deeper data insights. Compare today’s data against last week, last month, or last year—with no penalty for querying cold storage and no lost data.
  • Real-time analytics for all your data. When all your storage is hot, your queries will always be high performance and low latency. You can add historical context to your real-time data, whether that’s historical data about a customer that’s on your site now or real-time security insights on an IP address associated with a device that last visited the site six months ago.
  • Eliminate the complexity of storage tiers. Having multiple storage tiers injects complexity into your system. You’ll either need to manage the migration of data between tiers or rely on a platform that handles that migration for you. You’ll need to back up your data before the migration to ensure data integrity in case there are issues with migrating your data, and you’ll need extra storage overhead to account for duplicate data during migrations. When all of your storage is hot, you don’t need to worry about migrating data between tiers.
  • Eliminate difficult decisions about data management. When you’re working with tiered storage, you have to make challenging decisions. When should your data be moved from hot to warm and finally to cold? Cost isn’t the only consideration—you also need to understand which data is essential for operations. These decisions become even more complicated as your applications scale. As your data volume and costs go up, do you try to save money by shortening the retention period for hot storage? What if your engineering teams need the data while your CFO says costs need to be reduced? These decisions can be anxiety-inducing and cause friction between teams. When all your storage is hot (and cost effective like cold storage), you have one less problem to worry about—and more time to spend elsewhere.

So if you can eliminate the extra cost traditionally associated with hot storage, why would you ever put your data in cold storage again?

How Hydrolix Offers Hot Storage at Cold Storage Costs

Hydrolix is built to solve the problem of runaway data costs while still maintaining the high efficiency and low latency needed for ingesting and querying big data in near real time.

Here’s the TL;DR:

  • Hydrolix offers the cost-effectiveness of cold storage by writing your data to inexpensive cloud storage and using a patented compression technique to reduce the size of your data by more than 90%. Generally, Hydrolix users get 75% or more savings on total cost of ownership (TCO) compared to solutions they previously used.
  • Hydrolix decouples storage from compute, giving you the efficiency of hot storage by utilizing massive parallelism, autoscaling stream and query peers, and using advanced querying techniques such as predicate pushdown. You get near-real-time ingest and ad hoc query performance up to 500% faster than other cloud data platforms.

Let’s dig a little deeper into how Hydrolix offers these benefits.

Low-Cost Cloud Storage and High-Density Compression

Instead of using expensive storage solutions like SSDs and storage arrays, Hydrolix data is written to inexpensive cloud storage (generally Amazon, Google, or Azure) in a virtual private cloud (VPC) of your choice. This approach gives you full control over your data at the cost of cold storage. Hydrolix also uses patented high-density compression technology to reduce the footprint of your data. 

Stateless Architecture, Parallelism, and Autoscaling for High Efficiency Ingest

To achieve high-efficiency ingest, Hydrolix transforms and writes your streaming data in parallel across a pool of stream peers. A unique stateless architecture decouples storage read and write operations, making it possible to autoscale your stream peers up as needed or even down to zero. This flexibility allows you to handle high-ingest events by scaling up and off-peak periods by scaling down.

At ingest time, stream peers process your data on the fly, transforming, indexing, and compressing it row by row. Each stream peer writes data to a partition that by default contains at most sixty seconds of time series data. Partitions start small to ensure they’re written to storage as quickly as possible, so you can query new data for actionable insights almost immediately. As partitions age, they’re merged into larger partitions that use fewer resources. Merging also sorts out-of-order data.
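The partition lifecycle described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Hydrolix’s actual implementation; the `Partition` class and `merge_partitions` helper are hypothetical names invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Partition:
    min_ts: int   # earliest row timestamp (epoch seconds), inclusive
    max_ts: int   # latest row timestamp, inclusive
    rows: list    # list of (timestamp, payload) tuples

def merge_partitions(parts):
    """Merge several small, aging partitions into one larger partition.

    Sorting during the merge also puts out-of-order rows back into
    timestamp order, as the post describes.
    """
    merged_rows = sorted((r for p in parts for r in p.rows), key=lambda r: r[0])
    return Partition(
        min_ts=min(p.min_ts for p in parts),
        max_ts=max(p.max_ts for p in parts),
        rows=merged_rows,
    )
```

For example, merging two sixty-second partitions, one of which received a late row, yields a single two-minute partition whose rows are back in timestamp order.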

Scalable Query Pools, Predicate Pushdown, and Cataloging for Low Latency Queries

The biggest benefit of hot storage is the ability to query your data on demand and in near real time. With Hydrolix, you can scale the size of your query pools to balance cost against latency: large pools with more query peers give you consistently low-latency queries, while small pools keep query costs down.

Hydrolix leverages a number of advanced techniques to reduce query latency, including a catalog that stores metadata about storage partitions (including minimum and maximum timestamps). When you query your data by time, the query first references the catalog to determine which partitions include that time range, narrowing down the amount of data that needs to be searched and lowering the latency of your queries.
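As a rough illustration of this time-based pruning, here is a minimal sketch. It is not Hydrolix’s real catalog API; the `prune_partitions` function and the dictionary layout are invented for the example. The idea is simply that a partition can be skipped entirely when its min/max timestamps don’t overlap the query’s time range.

```python
def prune_partitions(catalog, query_start, query_end):
    """Return only the partitions whose [min_ts, max_ts] range
    overlaps the query's time range; all others are never read."""
    return [
        p for p in catalog
        if p["min_ts"] <= query_end and p["max_ts"] >= query_start
    ]
```

A query for a 20-second window would consult the catalog first and touch only the one or two partitions that cover that window, regardless of how many partitions exist in total.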

Hydrolix also uses predicate pushdown techniques to improve query performance. By default, all columns are indexed, and Hydrolix partitions include information about where data is stored at the block level. Because all columns are indexed, a query peer can quickly determine which blocks have the requested data and only return those blocks to the query head, making your queries even more efficient.
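The same pruning idea applies one level down. The sketch below shows how per-block min/max statistics for an indexed column let a query peer skip blocks that cannot contain a requested value; this is a generic illustration of predicate pushdown, with the `blocks_matching` function and stats layout invented for the example rather than taken from Hydrolix.

```python
def blocks_matching(block_index, column, value):
    """Use per-block min/max stats for an indexed column to decide
    which blocks could possibly contain the requested value.
    Only those blocks need to be read and returned to the query head."""
    return [
        blk for blk in block_index
        if blk["stats"][column]["min"] <= value <= blk["stats"][column]["max"]
    ]
```

Searching for HTTP status 404, for instance, skips every block whose status codes all fall in the 2xx range without reading any of its rows.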

The end result: hot storage efficiency with cold storage prices.

Come and get your data while it’s hot!

Next Steps

With Hydrolix, all your data is hot—at cold storage prices. If you don’t have an account yet, sign up for a free trial today. You’ll get industry-leading data compression, lightning-fast queries, and unmatched scalability for your observability data. Learn more about Hydrolix.

