Best Practices for Reducing Query Compute Costs

Learn best practices for reducing query compute costs, including how to use circuit breakers, custom views, query pools, and summary tables.

Franz Knupfer

Published: Mar 26, 2024

5 minute read

Hydrolix is designed for cost-effective data processing and retention at scale, and it’s not uncommon to see Hydrolix tables with 100+ billion rows. Because Hydrolix uses high-density compression and cloud-based object storage, you don’t have to worry about discarding data or transferring data to tiered storage. You can keep all of your data ready and available in the hot storage tier without additional data management or data loss because Hydrolix is optimized for affordable, long-term data retention.

Hydrolix also has many optimizations for querying data, including extreme predicate pushdown, partition and column pruning, and sparse indexing. However, as with any database, there is a big difference between a well-written query and a poorly written one: inefficient queries take longer to run and cost more. Just about anyone with deep experience in big data has at least one horror story about a huge, unexpected cloud bill, and there are a lot of ways it can happen.

As your data accumulates gravity, you’ll want to apply FinOps best practices, which means maximizing the value of your data while keeping costs down. More teams will likely need access to that data, leading to additional business value, which is great. However, more access comes with some tradeoffs, including the potential for wasted compute costs—and even big surprise bills if you aren’t careful.

This post covers four basic steps you can take in Hydrolix to keep query costs down, including safeguards to ensure that your teams don’t break the bank with inefficient queries. These steps include:

  • Circuit breakers
  • Custom views
  • Precision query scaling
  • Summary tables

Circuit Breakers

In electrical terms, a circuit breaker protects a circuit from receiving too much current and taking damage. Query circuit breakers work in a similar manner—they protect your business from compute-heavy queries that damage your bottom line. Hydrolix has multiple circuit breakers you can enable, either at the system level or at the query level.

Hydrolix partitions data by timestamp, making it extremely efficient when filtering by time. However, if you don’t include a time filter in a query, Hydrolix opens every partition in the table. In a very large dataset, that’s a lot of partitions—and a lot of wasted query compute. So this is the very first circuit breaker you should enable:

hdx_query_timerange_required

With this circuit breaker turned on, all queries must have a time filter. If you really need to override it (and generally you shouldn’t), you can do so at the query level. You can also set up circuit breakers to limit the time range queried, total number of rows retrieved, and more.
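
As a minimal sketch, a compliant query just includes a time filter. The database, table, and column names below are placeholders, and the commented-out query-level override assumes circuit breakers can be adjusted with a SETTINGS clause, so treat that mechanism as an assumption:

-- A time filter lets Hydrolix prune partitions instead of opening all of them.
SELECT status, count() AS requests
FROM production.http_logs
WHERE timestamp BETWEEN now() - INTERVAL 1 DAY AND now()
GROUP BY status;

-- If you truly must bypass the breaker for a single query (generally you shouldn't):
-- SELECT count() FROM production.http_logs
-- SETTINGS hdx_query_timerange_required = 0;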

Read this post to set up query circuit breakers via the API and learn about circuit breaker best practices.

Custom Views

Columnar databases like Hydrolix excel at querying tables with high dimensionality data. You can query just the columns you need instead of scanning every column, which is great if you only need five columns out of hundreds. Columnar databases use column pruning to efficiently retrieve just the data you need.
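
For example (the table and column names are placeholders), a query that names only the columns it needs lets Hydrolix skip the rest of a wide table:

-- Only the five referenced columns are read; the rest are pruned.
SELECT timestamp, status, url, user_agent, response_bytes
FROM production.http_logs
WHERE timestamp >= now() - INTERVAL 1 HOUR;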

However, to maximize the benefits of column pruning, you need to name specific columns in your query. Too often, we see users running SELECT * queries, which retrieve every column in the table. With custom views, you can give specific users and teams access to tailored views that include only a subset of a table’s columns.

The following example shows how you can specify a custom view in a query so that it retrieves only the columns associated with custom_view.
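
This is a minimal sketch: the hdx_query_view option name and the table name are assumptions about how a view is selected at query time, so check the custom views post linked below for the exact syntax in your deployment.

-- 'hdx_query_view' is an assumed option name and 'production.http_logs' is a
-- placeholder table; the intent is that only columns in custom_view are read.
SELECT timestamp, status, url
FROM production.http_logs
WHERE timestamp >= now() - INTERVAL 1 HOUR
SETTINGS hdx_query_view = 'custom_view';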

By default, Hydrolix has an auto_view that includes every column in the table. When you create custom views for teams, change the default view as well—it shouldn’t be the auto_view.

To set up custom views, read Creating Custom Views for High Dimensionality Data.

Precision Query Scaling

Each part of the Hydrolix system is stateless and independently scalable, including ingest, merge, and query resources. One benefit of this architecture is the ability to set up separate query pools for different stakeholders to prevent resource contention. Even better, you can fine-tune the resources for each query pool to balance performance against cost as needed. This capability is unique to Hydrolix, which is why we call it precision query scaling—you can even tune compute resources on a query-by-query basis by specifying which pool a query should use.
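
As a rough sketch, routing a single query to a dedicated pool can look like the following. The pool name is a placeholder and hdx_query_pool_name is an assumed option name rather than confirmed syntax, so check the query pool docs for the exact mechanism:

-- Route this query to a pool sized for urgent, ad hoc investigation.
SELECT url, count() AS errors
FROM production.http_logs
WHERE timestamp >= now() - INTERVAL 15 MINUTE
  AND status >= 500
GROUP BY url
ORDER BY errors DESC
LIMIT 20
SETTINGS hdx_query_pool_name = 'incident-response';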

Whether you need a lot of compute resources to quickly find the cause of a major outage or you want to maximize cost-effectiveness instead (such as with AI model training use cases), you can create separate query pools and assign them to different teams and use cases as needed.

Learn more about the benefits and use cases of precision query scaling and then read the docs to set up a query pool.

Summary Tables

If you need to retrieve aggregated data from a table on a regular basis (which is very common with real-time monitoring dashboards), you can use summary tables to reduce query latency and compute costs. If a real-time dashboard updates every five seconds, that’s more than 17,000 queries per day (86,400 seconds ÷ 5 = 17,280). It’s much better to query a summary table with aggregated data than to query the raw data.

With Hydrolix, each data source has a transform that specifies how incoming data should be ingested, normalized, and aggregated. In addition to ingesting data to a larger table, you can also use a summary SQL transform at ingest time to pre-aggregate data for summary tables. Whether you want to calculate sums, percentages, or quantiles, summary tables are perfect for high-level dashboards.
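
As an illustrative sketch, a summary SQL transform might aggregate rows as they arrive. The column names, the one-minute rollup, and the way the incoming stream is referenced are assumptions about your schema and transform configuration, not confirmed Hydrolix syntax:

-- Pre-aggregate request metrics into one-minute buckets at ingest time.
-- 'incoming_stream' is a placeholder for however your summary transform
-- references the raw incoming data.
SELECT
    toStartOfMinute(timestamp) AS minute,
    status,
    count() AS requests,
    sum(response_bytes) AS total_bytes,
    quantile(0.95)(duration_ms) AS p95_duration_ms
FROM incoming_stream
GROUP BY minute, status

A dashboard that refreshes every few seconds can then query this much smaller summary table instead of the raw data.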

You can use summary tables to identify larger issues that need investigation and then drill down into the raw data to find the root causes of those issues.

Many databases do not have the capacity to handle late-arriving and out-of-order data, but Hydrolix is optimized for this specific use case, so your summary tables will account for data regardless of when it arrives.

To learn more about creating summary tables, see the docs on aggregating data.

Next Steps

Apply these best practices to reduce query compute costs and cloud spend.

Learn more about Hydrolix and contact us for a POC.
