Ponderings, insights and industry updates

Picture of dam spilling over

Scaling Query Performance With Big Data

Published: July 3, 2023

Updated: September 12, 2023

Author: Dan Sullivan

Tags: , , , ,

Achieving consistent, low latency query performance in analytic systems can be challenging when working with a mix of workloads and varying numbers of concurrent queries. Hydrolix’s distributed architecture allows for multiple ways to tune and optimize query performance. This post outlines a few strategies for improving performance in a variety of scenarios.

How Hydrolix Executes a Query

Query tuning starts with understanding the architecture of the query service and how it executes queries. Hydrolix divides query processing between two types of nodes, query heads and query peers.

Figure 1. Architecture of query processing

Query heads parse and compile SQL statements. This produces an execution plan that drives the rest of query processing. Next, the query head identifies partitions of data that may be needed to respond to the query.  Candidate partitions are selected based on the time range specified in the query as well as shard keys when they are used. Narrowing the time range specified in a query is one of the most effective ways of reducing the amount of data scanned during query execution.  After creating a set of candidate partitions, the query head distributes work by assigning partitions to one or more query peers. 

Query peers receive a query and a set of candidate partitions from the query head. The query peers then download a filtered set of column indexes and data blocks from object storage. Since Hydrolix uses a columnar storage format, it is optimal to specify only the columns needed in a query. Like other databases that use columnar data storage, SELECT * queries are considered an anti-pattern that should be avoided.  Once data is retrieved and filtered by the query peer, results are partially aggregated if needed and then returned to the query head.

The query head collects results from all peers working on the query and performs global operations, such as sorting and grouping; after that results are returned to the client. Note, that a LIMIT clause will introduce a global operation that will be applied after other operations. LIMIT may reduce the number of rows returned to a client but it will not reduce the total amount of work done by query peers when executing a query. Using a more restrictive WHERE clause is the best way to reduce the amount of work done to process a query.

Hydrolix query peer infrastructure is stateless so adding peers is straightforward and incurs little overhead. For example, since data is persisted in object storage, there is no need to rebalance data between peers as is needed in some databases. Also, this architecture allows for linear performance improvement as nodes are added.

Understanding How Your Query is Performing

Hydrolix exposes details of query execution to help you understand and troubleshoot your queries. To have these details returned from a query, you can specify:

  • SETTINGS hdx_query_debug=true at the end of your SELECT statement
  • Include hdx_query_debug=true in an HTTP request string, for example, …/query/?hdx_query_debug=true

When hdx_query_debug is enabled, the following information will be provided:

Query MetricsDefinition
exec_timeTime to execute the query (in ms) and get the response to the query-head, doesn’t include response time to end user.
rows_readNumber of rows read by the different query-peer
bytes_readNumber of bytes read by the different query-peer
num_partitionsNumber of partitions read by the different query-peer
num_peersNumber of query peers used for that query
query_attemptsNumber of query attempts by default we retry up to 3x if we had some failures
memory_usageMemory used in bytes at the query-head level.
For example GROUP BY is done at the head and requires RAM to do the aggregation.
index_statsSpecify the table used for the query, and for each table which column is read and also which index is used.
List 1. Metrics and statistics returned when hdx_query_debug is enabled.

For example, a query on the sample.taxi_trips table in Hydrolix, returns data such as:


These results show metrics such as the number of row read and memory usage, which can help when comparing the performance of different versions of a query. It provides a list of columns read and index used. The more the query execution plan can use indexes, the less data will be scanned.

Roots of Poor Query Performance

Several different factors can lead to poor query performance, including:

  • Insufficient query peer resources
  • Insufficient query head resources
  • Sub-optimal partition merging
  • Noisy neighbors

Fortunately, there are ways of addressing each of these.

Insufficient Query Peer Resources

The amount of memory a query peer needs will depend on the number and size of partitions processed on the peer. A partition is the smallest unit of work allocated to a query peer. Peers may have multiple partitions opened at any point in time. By default, Hydrolix allows for 3 concurrently open partitions per query peer process. This maximum number of open partitions can be adjusted using the hdx_query_max_concurrent_partitions query setting.

The amount of data in partition varies, with recently ingested data typically stored in partitions with relatively small amounts of data. Data over 24 hours old, are merged into larger partitions. This strategy provides a balance between persisting recently ingested data quickly and taking advantage of higher compression ratios realized with larger partitions.  The number of partitions that will be scanned for any query depends on the time range and shard keys specified in the WHERE clause of the query as well as the merge state of partitions. 

Tuning Tip: For details on the merge state of a table, run the View Merge Health query available in the table details section of the Hydrolix portal.

Query peers can also experience long latencies if there are insufficient number of query peers to process partitions. For example, if 2 query peers are available and each can have 3 open partitions, then 6 partitions can be processed concurrently. If a total of 12 partitions need to be processed, then adding another 2 query peers could reduce the total query peer processing time by half.   

Tuning Tip: You can see the number of partitions processed by a query by specifying the hdx_query_debug settings parameter in the query. See the documentation for details.

Insufficient Query Head Resources

Like query peers, query head scale should be suitable for the workload. Vertical scaling such as increasing pod memory or horizontal scaling by increasing the number of query heads can all ensure sufficient performance is achieved for workloads.

Query heads are responsible for performing operations that cannot be distributed to query peers. For example, GROUP BY and ORDER BY operations must be performed on the query head because those operations require data from all query peers. If the total volume of data returned by query peers to the query head exceeds the amount of available memory, then the pod may be terminated with an OOMKilled status.  This problem can be addressed by increasing the amount of memory allocated to the pod.

An insufficient number of query heads can also result in longer query latencies. In this case, query processing is queued until a query head is available to process additional queries. In this case, adding more query heads will reduce latency.

Suboptimal Sharding and Partition Merging

Long latencies can also occur if partitions are not optimally merged.  Typically, partitions are merged based on primary field timestamps. In some cases, other fields may be used to further shard data. Sharding should be used only with low cardinality columns since high cardinality columns can lead to a large number of small shards that can degrade query performance.

Noisy Neighbors

Noisy neighbors are a problem common to shared resource platforms, such as databases. A noisy neighbor is a process that consumes inordinate amounts of resources and adversely affects the performance of other processes. In Hydrolix, a noisy neighbor may consume large amounts of memory on query heads for sorting and grouping operations or use a large number of query peers to process queries.

Query pools is a Hydrolix feature designed to isolate workloads and avoid the impact of noisy neighbors. Query pools can be created for different types of workload. For example, large batch jobs that have predictable resource requirements can be run in a query pool tuned specifically for that workload while another query pool can be created for ad hoc queries that may have difficult to predict resource requirements. 

Queries can be assigned to specific query pools by specifying the query pool name in the hdx_query_pool_name setting. For example:

The Hydrolix platform is designed to scale to meet the needs of a wide range of query patterns and workloads. The variety of those patterns and workloads, however, may require some tuning of resources and resource allocation for optimal query performance.

Next steps

If you don’t have a Hydrolix account yet, sign up for a free trial today. With Hydrolix, you get industry-leading data compression, lightning-fast queries, and unmatched scalability for your observability data. Learn more about Hydrolix.

Dam photo by American Public Power Association on Unsplash

Share Now