
Reduce Your Cloud Spend with Precision Query Scaling

Learn about precision query scaling, a feature unique to Hydrolix that allows you to scale compute resources for each query workload.

Franz Knupfer

Published:

Feb 07, 2024

9 minute read

Sub-second query performance is essential for many big data use cases ranging from observability to adtech, but the more you scale up compute resources to improve performance, the higher your compute costs will be. Those costs are worth it when you need data in real time, such as when you’re trying to resolve a P1 major outage. However, if you focus on high performance at all times, regardless of the use case, you can end up with wasted cloud spend.

It’s like leaving on all the lights in your house 24/7, even when you don’t need them. Cloud spend is the equivalent of a lot of lights. Worldwide cloud spend is estimated at $600 billion, with about 28% of it wasted. That amounts to roughly $168 billion in wasted costs. Not surprisingly, 82% of cloud professionals in one survey said managing cloud costs was their top challenge.

One way to manage cloud spend is through precision query scaling.

With precision query scaling, you can independently scale each query workload up or down. Create new query workloads at any time, automatically scale workloads to zero for evenings and weekends, and assign less compute for less urgent workloads to save on compute costs. Or quickly scale up a team’s query workload when the need is urgent.

By adjusting a workload’s memory, CPU, and number of peers, you can optimize cost versus performance for each query workload. For instance, an incident response team might need to quickly scale resources for high performance and parallelism during a P1 incident, then scale down after the incident is resolved. Meanwhile, a data science team may need regular, long-term access to data, but not in real time, so you can use less compute to save on costs while still getting performant queries.

It’s the equivalent of using only the lights you need for each team. You get cost-effective performance without sacrificing either the bottom line or the needs of each of your stakeholders.

Unfortunately, most platforms that offer separate workloads for teams don’t actually let you scale compute resources independently. Only Hydrolix allows you to do so with precision. On other platforms, simply preventing resource contention often comes with major trade-offs, such as duplicating data, which creates the very problem big enterprises need to solve: wasted cloud spend.

Hydrolix lets you apply precision scaling to each query workload, preventing resource contention and fine-tuning your compute spend. In this post, you’ll learn about the benefits of precision scaling, why most platforms don’t offer this functionality, and how Hydrolix offers precision scaling for each workload.

The Benefits of Precision Query Scaling

For large enterprises working with data at petabyte scale, many stakeholders may need to access that data. The larger the dataset, the more likely it is to attract more users and services, or to exert data gravity. If too many stakeholders are using the same compute, there can be issues with resource contention. This is also called the noisy neighbor effect.

By separating the compute of each workload, you can eliminate issues with resource contention. But it’s not enough to separate compute. You also need precision scaling for each workload. This allows you to quantify the value and urgency of each stakeholder’s data needs and then take steps to balance performance versus cost-effectiveness. You’ll avoid overprovisioning and follow FinOps best practices, which include dialing back cloud services that aren’t in use.

Here are a few potential use cases that you can tune for, each with varying degrees of value, urgency, and timing. The attributed values are hypothetical. Each organization will value data use cases differently. With Hydrolix, you can have separate workloads for each of these use cases, all in the same cluster. You can also add as many workloads as you need. 

Incident Response

Value: High

Urgency: High during major incidents, low during other times

For a big business, a major outage typically costs thousands of dollars per minute, and a large-scale outage can cost tens of millions of dollars in total. Data related to a P1 severity outage is both high value and high urgency, so an incident response team should be able to scale up query compute quickly for sub-second latency, even on datasets numbering in the billions of rows. On the other hand, data related to a P5 incident, while still useful, is likely of medium value and low urgency. When you aren’t dealing with major incidents, you can scale down query compute.

Alerts and Dashboard Visualizations

Value: High

Urgency: Dependent on alert and dashboard

Typically, you’ll devote more compute to real-time use cases (data from the last five minutes). Alerts usually focus on real-time data and need to be low latency, but they also tend to query only narrow, recent ranges of data (smaller datasets). Some dashboards need to be real time (such as dashboards tracking request performance), while others are used for reporting and only need to be consulted daily or weekly (such as reports on MTTR and MTTD).

Data Science

Value: Variable

Urgency: Low, can scale to zero during off hours

Data science teams may use datasets of varying sizes to test hypotheses. Some findings may be business critical or lead to new products and features, while others may be less impactful. Regardless of the hypothesis, you probably don’t need sub-second query performance. Instead, you can use less compute for queries and scale down to zero during off hours. You might also reserve off hours, or even a separate workload, for training AI models.

Finally, you may not know how valuable your data is until after analysis. For use cases where the value is still unknown, you probably don’t want to use too much query compute. Once you’ve established the value of your data, you can scale up compute as needed.

Threat Hunting and Forensic Analysis

Value: Highly variable for threat hunting and high for forensic analysis

Urgency: Variable, may need to scale up quickly if a threat is detected or if you need forensic analysis

Threat hunting is a proactive cybersecurity approach that’s essential for large enterprises, but its exact value is challenging to quantify, at least until a breach or threat is revealed. And threat hunting is just one of many cybersecurity use cases, each with its own level of urgency. Meanwhile, forensic analysis can be extremely urgent, such as when teams are trying to determine the severity of a breach, so you can scale up a workload quickly. Similar to other types of incidents, there will also be times when query compute needs are lower.

Cyclical Events and Peak Traffic

During major planned events, some stakeholders will need more query resources. You may want to scale down workloads that aren’t directly related to the event (an “all hands on deck” approach) while scaling up workloads that are essential for monitoring and remediation. Or other teams can continue to use their own workloads since resources are isolated. Then, once the event is over, you can easily scale down workloads as needed.

These are just a few examples of how a large enterprise can have many stakeholders, all with different data needs. Treating them as a monolith will lead to overprovisioning or poor query performance, or more likely, both.

Other Platforms Don’t Offer Precision Query Scaling

Many data platforms offer separate workloads for teams, but they do not offer true resource isolation with independent scaling. In some cases, this is because storage and compute are tightly coupled. This means you must scale the entire system up or down, and doing so is often complex and time-consuming. Scaling each part of the system together results in tough choices between overprovisioning, poor performance, and managing resource contention.

This tightly coupled legacy approach also can’t isolate resources for different workloads, because all queries contend for the same read resources. The only workaround is to duplicate the data for each workload. That creates a potential data management nightmare, as the copies can quickly become inconsistent. Even worse, you’ll pay additional storage costs, and the approach simply isn’t feasible for large datasets.

Other solutions attempt to get around the issue of resource contention and noisy neighbors with resource throttling. While this may stop teams from using a disproportionate share of resources, it doesn’t solve the underlying problem: you still need to scale the entire system up or down, and scaling may be a long, awkward process, or it may not be feasible at all.

In order to provide precision query scaling for workloads, a data platform must have a decoupled architecture. Decoupled architectures typically use cloud-native tools like Kubernetes for compute along with cloud object storage. Object storage partitions support thousands of concurrent reads per second, so a data platform that partitions its data and maximizes parallelism won’t have issues with resource contention.
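To make the parallelism point concrete, here’s a minimal sketch (in Python, assuming S3-compatible object storage accessed through boto3, with a hypothetical bucket and partition naming scheme) that fans independent partition reads out across a thread pool. Each read goes straight to object storage, so adding more readers doesn’t create contention on a shared read path.

```python
# Minimal sketch: parallel reads of independent partitions from object storage.
# Assumes S3-compatible storage via boto3; the bucket and keys are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "example-hydrolix-data"  # hypothetical bucket name
PARTITION_KEYS = [f"db/table/partition-{i}.hdx" for i in range(64)]  # hypothetical keys

def read_partition(key: str) -> int:
    """Fetch one partition object and return its size in bytes."""
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return len(obj["Body"].read())

# Each read is independent, so adding workers (peers) increases parallelism
# without creating contention on a shared read path.
with ThreadPoolExecutor(max_workers=16) as pool:
    sizes = list(pool.map(read_partition, PARTITION_KEYS))

print(f"Read {len(sizes)} partitions, {sum(sizes)} bytes total")
```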

Hydrolix is not the only data platform that offers a decoupled approach, but other platforms do not offer the same fine-grained control. For example, with Snowflake, you can scale compute up or down by resizing virtual warehouses. However, each step up in warehouse size doubles the compute resources, which is potentially a huge jump. If you only need to add a small amount of compute to a workload, this leads to overprovisioning, and the issue compounds if you are running many workloads.
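As a rough back-of-the-envelope illustration (hypothetical capacity units, not Snowflake’s actual sizing or pricing), here’s how much capacity sits idle when your only option is to round up to the next doubling:

```python
# Back-of-the-envelope sketch: waste from doubling-only scaling steps.
# The numbers are hypothetical and only illustrate the shape of the problem.
import math

def provisioned_capacity(required_units: float) -> float:
    """Smallest power-of-two capacity that covers the requirement."""
    return 2 ** math.ceil(math.log2(required_units))

for required in (1.2, 2.5, 5.0, 9.0):
    provisioned = provisioned_capacity(required)
    waste = (provisioned - required) / provisioned
    print(f"need {required:>4} units -> provision {provisioned:>4.0f} ({waste:.0%} idle)")
```

With fine-grained scaling, you can provision close to what the workload actually needs instead of rounding up to the next doubling.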

Solutions like Google BigQuery offer many ways to impose quotas, such as limiting the amount of data a workload can query over a time period or setting a maximum amount of CPU a query can use. These quotas can help prevent shock bills, but they don’t give you the ability to precisely scale each workload.

Balance Cost and Performance with Precision Query Scaling

Hydrolix is a cloud data platform optimized for big data use cases and time series data. One of the major challenges (and opportunities) of working with big data is data gravity: the more data you have, the more use cases, services, and additional data it attracts. To prevent resource contention and balance performance against budget, you can use resource pooling to create a query pool for each workload. You can also use resource pooling for ingest, merge, and summary tables as well as querying, giving you granular control over each part of your system.

You can easily create, update, or delete query pools using the Hydrolix UI, a configuration file, or the API. With the API, you can automate scaling: for example, you could automate some query pools to scale down at 5 PM on Friday and back up at 8 AM on Monday (see the sketch after the list below). Scaling takes a matter of minutes, so you can quickly scale up or down whether you’re doing so manually or through an automated process.

When you edit a pool, you can tune:

  • Number of replicas: This is a range that determines how many peers the query pool can use. More query peers means more parallelism, which is particularly useful when queries have to search many data partitions.
  • CPU: The amount of CPU to allocate to the query pool.
  • Memory: The amount of memory to allocate to the query pool.
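
As a concrete illustration of the automation described above, here’s a minimal sketch that adjusts a query pool’s replicas, CPU, and memory over HTTP and could be wired to a Friday-evening and Monday-morning scheduler such as cron. The endpoint path, payload field names, and environment variables are assumptions for illustration only; consult the Hydrolix API documentation for the actual resource shape.

```python
# Minimal sketch: scale a query pool up or down through an HTTP API.
# The endpoint path and payload field names are assumptions for illustration;
# check the Hydrolix API documentation for the actual resource shape.
import os

import requests

HDX_HOST = os.environ["HDX_HOST"]          # e.g. "https://my-cluster.example.com"
HDX_TOKEN = os.environ["HDX_API_TOKEN"]    # API token with config permissions
POOL_URL = f"{HDX_HOST}/config/v1/pools/data-science-query"  # hypothetical pool path

def scale_pool(replicas: int, cpu: str, memory: str) -> None:
    """Patch the pool's replica count, CPU, and memory settings."""
    payload = {
        "replicas": replicas,   # number of query peers (more peers = more parallelism)
        "cpu": cpu,             # CPU allocated to each peer
        "memory": memory,       # memory allocated to each peer
    }
    resp = requests.patch(
        POOL_URL,
        json=payload,
        headers={"Authorization": f"Bearer {HDX_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()

# Friday 5 PM job: scale the pool down to zero for the weekend.
# scale_pool(replicas=0, cpu="1", memory="2Gi")

# Monday 8 AM job: bring it back up for the work week.
# scale_pool(replicas=4, cpu="4", memory="8Gi")
```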

Hydrolix has predefined prod (1-4 terabytes of data ingest per day) and mega (10-50 terabytes of data ingest per day) scales designed to fit the majority of use cases. But every part of the system is tunable (not just query) and can scale up or down in minutes, giving you fine-grained control over compute resources so you can get the most out of FinOps best practices without sacrificing performance.

Next Steps

Interested in learning more about using a next-generation cloud data platform like Hydrolix? Read our whitepaper on Powering Big Data with Next-Gen Cloud Data Platforms.

Learn more about Hydrolix and contact us for a POC.
