Migrating from Rockset? Find out if Hydrolix is right for you >>

RSS

Caching Queries in Hydrolix

Learn how to implement query caching in Hydrolix and determine which queries will most benefit from query caching.

Franz Knupfer

Published:

Jul 09, 2024

4 minute read
,

Note: Query caching is an experimental feature in beta. It may not deliver expected results at scale. Use at your own risk.

Query caching can make your queries more efficient, especially for use cases like dashboards where multiple users are making the same query at regular intervals. For example, if you have many engineers and teams all viewing a dashboard that uses the same query for a major event, each on their own screens, it’s inefficient and wasteful for every dashboard to make the same query to the database. Hydrolix achieves sub-second query latency even on trillion row datasets without query caching, but all those repeated queries can still add up.

Query caching is now available in Hydrolix. To enable it, you just need to update the Hydrolix operator to 4.14 and then add SETTINGS use_query_cache = true to queries you want to cache.

Hydrolix maintains and updates its own fork of ClickHouse in query head components. We’re big fans of ClickHouse for many reasons, and it’s always exciting when ClickHouse releases new features that we can merge into our own fork and provide to customers. For version 4.14 of Hydrolix, we added ClickHouse functionality for query caching, parameterized views, and new SQL functions.

This post will focus on query caching in Hydrolix and how to determine which queries you should cache to maximize performance.

Implementing Query Caching in Hydrolix

Even with Hydrolix’s exceptional query performance, it’s still a best practice to avoid unnecessary queries to the data store. The fastest query is the one you never make, and query caches can help you avoid excessive requests to storage.

For real-time analytics use cases, the biggest benefit of query caching is typically for dashboards since many teams may be using the same query, typically at regular intervals. However, properly caching dashboard queries comes with some important caveats.

  • Using now() or other functions that calculate the timestamp at the time of the dashboard query will not deliver intended results. Otherwise, you’ll end up with queries that have different time filters. For example, query 1 might have the following time filter BETWEEN 10:00:00.01 am and 11:00:00.01 while query 2 might have the time filter BETWEEN 10:00:00.39 am and 11:00:00.39. Query 1 will be cached, but query 2 will not used the cached query.
  • Instead, you need to specify specific queries across dashboard instances. Instead of now(), the filter would be BETWEEN 10:00:00.00 and 11:00:00.00. Dashboard tools like Grafana and Superset are two dashboard visualization tools that allow you to easily rollup to the nearest granularity you are comfortable with (such as second, minute, and hour).
  • Finally, keep in mind that query caching typically involves some queries retrieving stale data. This won’t be acceptable for some real-time analytics use cases. In this situation, you can rely on Hydrolix’s ad hoc query performance to deliver real-time analytics quickly and cost-effectively.

To use query caching in Hydrolix, you need to be on version 4.14 or later.

Query caching includes some configurable defaults as well as some sensible limits.

Query cache limits

  • Cache maximum: 1 GB
  • Max number of cached queries: 1024
  • Max size per cached query: 1 MB
  • Max number of rows per cached query: 3 million

Note that the query cache stores uncompressed data. When a Hydrolix query retrieves data from object storage, it is immediately transformed from Hydrolix blocks (which typically include 20x-50x compression) to uncompressed Clickhouse blocks.

Adjustable settings (and defaults):

  • use_query_cache: Defaults to false.
  • query_cache_ttl: This is the “time to live” before the query cache expires. The default is 60 seconds.
  • query_cache_min_query_duration: The default minimum is 100 milliseconds. If a query has lower latency than the default, it won’t be cached even if use_query_cache is true. If a query already has extremely low latency, there is potentially less benefit to caching it. However, it can be useful to adjust if you regularly run dashboard queries that are below the default minimum.

Which Queries Should Be Cached?

Query caching is most helpful for frequent queries as well as queries where the underlying data rarely (if ever) changes. In the case of real-time analytics, data is frequently changing, but it can still make sense to cache data for a short period of time. For example, a 60-second window is often tolerable for real-time use cases.

The query cache is transactionally inconsistent by design because OLAP databases often prioritize performance and cost efficiency. If your query results absolutely need to be accurate, you shouldn’t use query caching.

You may already have a good sense of which queries need caching for your use case, but you can also query hydro.logs to determine which queries you’re using most frequently.

This query returns a descending count of the most frequent Hydrolix queries in your cluster over the past week. This can be a helpful starting point to determine which queries you should cache.

Next Steps

Looking for other ways to increase query performance and reduce costs? Read these best practices for reducing query compute costs.

Learn more about Hydrolix and contact us for a trial.

Share this post…

Ready to Start?

Cut data retention costs by 75%

Give Hydrolix a try or get in touch with us to learn more

Note: Query caching is an experimental feature in beta. It may not deliver expected results at scale. Use at your own risk.

Query caching can make your queries more efficient, especially for use cases like dashboards where multiple users are making the same query at regular intervals. For example, if you have many engineers and teams all viewing a dashboard that uses the same query for a major event, each on their own screens, it’s inefficient and wasteful for every dashboard to make the same query to the database. Hydrolix achieves sub-second query latency even on trillion row datasets without query caching, but all those repeated queries can still add up.

Query caching is now available in Hydrolix. To enable it, you just need to update the Hydrolix operator to 4.14 and then add SETTINGS use_query_cache = true to queries you want to cache.

Hydrolix maintains and updates its own fork of ClickHouse in query head components. We’re big fans of ClickHouse for many reasons, and it’s always exciting when ClickHouse releases new features that we can merge into our own fork and provide to customers. For version 4.14 of Hydrolix, we added ClickHouse functionality for query caching, parameterized views, and new SQL functions.

This post will focus on query caching in Hydrolix and how to determine which queries you should cache to maximize performance.

Implementing Query Caching in Hydrolix

Even with Hydrolix’s exceptional query performance, it’s still a best practice to avoid unnecessary queries to the data store. The fastest query is the one you never make, and query caches can help you avoid excessive requests to storage.

For real-time analytics use cases, the biggest benefit of query caching is typically for dashboards since many teams may be using the same query, typically at regular intervals. However, properly caching dashboard queries comes with some important caveats.

  • Using now() or other functions that calculate the timestamp at the time of the dashboard query will not deliver intended results. Otherwise, you’ll end up with queries that have different time filters. For example, query 1 might have the following time filter BETWEEN 10:00:00.01 am and 11:00:00.01 while query 2 might have the time filter BETWEEN 10:00:00.39 am and 11:00:00.39. Query 1 will be cached, but query 2 will not used the cached query.
  • Instead, you need to specify specific queries across dashboard instances. Instead of now(), the filter would be BETWEEN 10:00:00.00 and 11:00:00.00. Dashboard tools like Grafana and Superset are two dashboard visualization tools that allow you to easily rollup to the nearest granularity you are comfortable with (such as second, minute, and hour).
  • Finally, keep in mind that query caching typically involves some queries retrieving stale data. This won’t be acceptable for some real-time analytics use cases. In this situation, you can rely on Hydrolix’s ad hoc query performance to deliver real-time analytics quickly and cost-effectively.

To use query caching in Hydrolix, you need to be on version 4.14 or later.

Query caching includes some configurable defaults as well as some sensible limits.

Query cache limits

  • Cache maximum: 1 GB
  • Max number of cached queries: 1024
  • Max size per cached query: 1 MB
  • Max number of rows per cached query: 3 million

Note that the query cache stores uncompressed data. When a Hydrolix query retrieves data from object storage, it is immediately transformed from Hydrolix blocks (which typically include 20x-50x compression) to uncompressed Clickhouse blocks.

Adjustable settings (and defaults):

  • use_query_cache: Defaults to false.
  • query_cache_ttl: This is the “time to live” before the query cache expires. The default is 60 seconds.
  • query_cache_min_query_duration: The default minimum is 100 milliseconds. If a query has lower latency than the default, it won’t be cached even if use_query_cache is true. If a query already has extremely low latency, there is potentially less benefit to caching it. However, it can be useful to adjust if you regularly run dashboard queries that are below the default minimum.

Which Queries Should Be Cached?

Query caching is most helpful for frequent queries as well as queries where the underlying data rarely (if ever) changes. In the case of real-time analytics, data is frequently changing, but it can still make sense to cache data for a short period of time. For example, a 60-second window is often tolerable for real-time use cases.

The query cache is transactionally inconsistent by design because OLAP databases often prioritize performance and cost efficiency. If your query results absolutely need to be accurate, you shouldn’t use query caching.

You may already have a good sense of which queries need caching for your use case, but you can also query hydro.logs to determine which queries you’re using most frequently.

This query returns a descending count of the most frequent Hydrolix queries in your cluster over the past week. This can be a helpful starting point to determine which queries you should cache.

Next Steps

Looking for other ways to increase query performance and reduce costs? Read these best practices for reducing query compute costs.

Learn more about Hydrolix and contact us for a trial.