
Dissecting a Hydrolix Query

Walk through a Hydrolix query and learn how Hydrolix achieves low-latency performance using massive parallelism and sparse indexing.

David Sztykman

Published: May 07, 2024


In a previous post, we wrote about how Hydrolix achieves sub-second query latency even on very large datasets of 100 billion rows or more. In this post, we’ll dissect what happens in a specific query and share some actual numbers around storage and query time. These numbers are anecdotal, not benchmarks, but they reflect a typical query use case.

Here’s the query:
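In sketch form it looks roughly like the following. The table and column names (sample.logs, timestamp, level, message) are placeholders for illustration rather than a real schema, and exact interval syntax varies by SQL dialect:

```sql
-- Rough sketch of the query described below; the table and column names
-- (sample.logs, timestamp, level, message) are placeholders, not a real schema.
SELECT
    message,
    count(*) AS occurrences
FROM sample.logs
WHERE timestamp >= now() - INTERVAL 1 DAY
  AND level = 'error'
GROUP BY message
ORDER BY occurrences DESC;
```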

This query retrieves all messages that have a level of 'error' from a specific table over the last day, then groups them by message in order of message frequency. It’s a common and useful query that gives you a high-level view of error messages. For this demo cluster, the response looks like this:

[Image: list of error messages returned by the query]

As you can see, it’s a great way to see which errors are occurring most frequently over the last 24 hours and then drill down and mitigate those errors as needed.

Let’s take a look at what’s happening under the hood when this query runs.

Partition Pruning and the Hydrolix Catalog

At query time, the Hydrolix query head first parses the query to extract the time filter, which in this case is the last 24 hours. Because Hydrolix is optimized and partitioned by time, adding a time filter is an essential part of making queries performant.

To determine which partitions contain data from the last 24 hours, the query head consults a catalog that contains basic information about all of the partitions in storage. Hydrolix uses decoupled commodity object storage, which is distributed and accessible via HTTP requests. By first consulting the catalog, the query head removes all partitions that don’t contain the specified time range from consideration, a process known as partition pruning. Partition pruning significantly reduces the amount of data that needs to be transferred from object storage.

The catalog itself is a PostgreSQL database that stores the following information about each partition:

  • partition: Path of the partition within the cloud storage bucket.
  • min_timestamp: Time of the first row in the partition.
  • max_timestamp: Time of the last row in the partition.
  • manifest_size: Size of the manifest file in bytes. Each partition includes a manifest with information on how to read and extract data from the partition.
  • data_size: Size of the data file in bytes. Each partition includes a data file, which is highly compressed.
  • index_size: Size of the index file in bytes.
  • rows: Number of rows in the partition.
  • mem_size: Memory needed to hold a dictionary-encoded column, in bytes.
  • uncompressed_mem_size: Memory needed to decompress the data if it’s retrieved.
  • root_path: Root path of the partition in cloud storage, usually made up of the project_id and table_id.
  • shard_key: Shard key ID if sharding is enabled.
  • raw_size: Raw size of the ingested data in bytes.
  • storage_id: ID of the cloud storage bucket.
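Conceptually, partition pruning is just a range lookup over min_timestamp and max_timestamp in this catalog. As a rough sketch (the table name catalog.partitions is a placeholder, not the actual catalog schema):

```sql
-- Rough sketch: keep only partitions whose time range overlaps the last 24 hours.
-- The table name (catalog.partitions) is a placeholder for illustration.
SELECT partition, min_timestamp, max_timestamp, data_size
FROM catalog.partitions
WHERE max_timestamp >= now() - interval '1 day'
  AND min_timestamp <= now();
```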

Massive Parallelism and Sparse Indexing

After the query head determines which partitions include the specified time filter, it then assigns those partitions to query peers in a query pool, which will work in parallel to open and read them. For example, if 1000 partitions match the time range, and the query pool has ten query peers, each query peer will be responsible for handling 100 partitions. Query peers are highly configurable and isolated from other query peers, making precision query scaling possible and preventing resource contention.

By default, each peer will open and read three partitions per CPU, so if a query peer has ten CPUs, it will be able to open thirty partitions at once. With this configuration of ten query peers each using ten CPUs, query peers will open and read 300 partitions in parallel across the query pool.

But what exactly does it mean to open and read a partition? A partition consists of three files: manifest.hdx, index.hdx, and data.hdx, and opening a partition means retrieving the manifest and index files. The manifest file is the smallest of the three and provides a set of instructions on how to extract data from the partition. The index file is also quite small, and it records not just the location of each column but the specific byte ranges within each column. This is possible because Hydrolix indexes every column using sparse indexing.

A query peer downloads the manifest and index for each partition it is responsible for. While Hydrolix doesn’t rely on caching for performance, a consistent hashing algorithm facilitates caching manifest and index files for future queries, so it’s not always necessary to retrieve the full manifest and index from object storage.

Next, the query peer analyzes the SQL query and checks the index to see if the partition includes any data that matches the user’s request. This particular query checks the index of the level column for the string 'error'.

If there is no match in the level column, the query peer moves on to the next partition. However, if there is a match with the query predicates, the query peer uses the sparse index to determine the location of the specific block of data in the partition. The peer then issues byte range requests to retrieve the smallest amount of data possible. This process of pushing predicates as close to storage as possible is called predicate pushdown. Because Hydrolix uses a columnar storage format, only the block of data for the “message” column needs to be retrieved, while all other columns can be pruned. And because the data is highly compressed, the amount of data retrieved shrinks even further.

The peer decompresses any retrieved blocks in memory, then sends the response back to the query head. The query head aggregates the responses from all the query peers, sorts the data if needed, and sends the final result to the user.

Query Efficiency and Total Resources Used

To determine how efficiently the query executed, we can query the catalog to break down the total amount of data from the last 24 hours and also use the query debug settings for the sample query to determine its latency.

By querying the catalog, you can get an overview of the number of partitions and rows as well as both the raw and compressed size of the data. Here’s an example query that retrieves this data from a Hydrolix catalog:
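As before, the table name catalog.partitions is a placeholder for the actual catalog schema; since the catalog is a PostgreSQL database, the sketch below uses PostgreSQL functions such as pg_size_pretty:

```sql
-- Sketch of a catalog summary for the last 24 hours; catalog.partitions is a
-- placeholder table name. "rows" is quoted in case it collides with a SQL keyword.
SELECT
    count(*)                           AS partitions,
    round(sum("rows") / 1000000.0, 1)  AS rows_millions,
    pg_size_pretty(sum(raw_size))      AS raw_size,
    pg_size_pretty(sum(manifest_size)) AS manifest_size,
    pg_size_pretty(sum(index_size))    AS index_size,
    pg_size_pretty(sum(data_size))     AS data_size
FROM catalog.partitions
WHERE max_timestamp >= now() - interval '1 day';
```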

This counts the total number of partitions and rows (in millions), plus the raw size of the data and the sizes of the manifest, index, and data files (which are heavily compressed). Like the original query, it filters to the last day.

Running this query showed that, over this 24-hour period, there were:

  • 365 partitions
  • 1.7 million rows
  • 4.6GB of raw logs
  • 12MB of manifest data (manifest.hdx files)
  • 26MB of index data (index.hdx files)
  • 96MB total compressed data (data.hdx files)

While these numbers are anecdotal, they reveal some interesting details about how Hydrolix handles and compresses data. First, you can see how much smaller the manifest and index files are than the data files. This provides significant performance benefits because in many cases (where a partition matches the time range but contains no data matching the other predicates), the query peer only checks the manifest and index and never needs to read or download anything from data.hdx.

The uncompressed size of the raw data is 4.6GB, but with Hydrolix’s high-density compression, only 134MB lands in cloud storage (12MB of manifests + 26MB of indexes + 96MB of data files), a compression ratio of roughly 34x. Users typically see compression ratios of 20x-50x, so 34x is right in the middle of that range.

Next, let’s take a look at how long the original query took using the query debug settings:

The total execution time is 63 milliseconds, of which the catalog response took 7 milliseconds. Querying the catalog takes a trivial amount of time and typically prunes the majority of partitions from consideration. This query ran on a demo cluster with fewer resources, so each query peer used only 2 CPUs, and 4GB of RAM was assigned to the entire query pool. The query pool can autoscale from 1 to 10 peers, with more peers being used for bigger queries so more partitions can be opened at the same time. All of these settings are customizable, and you can create separate resource pools (each with their own resources) for different queries, allowing you to balance performance against cost.

Because of the way partition pruning works, and because Hydrolix includes long-term data retention, you could query a 24-hour time range from a year ago and get the same level of query efficiency—even if the table contains hundreds of billions or even trillions of rows.

Next Steps
