Migrating from Rockset? Find out if Hydrolix is right for you >>

RSS

Why Hydrolix Indexes All Columns

Learn how and why Hydrolix indexes all columns to improve performance, even in high dimensionality tables.

Franz Knupfer

Published:

Jun 18, 2024

7 minute read
,

Indexing is a powerful tool for maximizing query performance in databases, but one of the traditional maxims of indexing is that it should be done sparingly, and only on columns that are used frequently. Because of the way that the majority of databases are designed, too many indexes can have a negative impact on database performance, including both read and write speed.

That’s not how Hydrolix works. Hydrolix takes a novel approach to indexing where every column is indexed at the partition level—even if a table has thousands of columns. Hydrolix’s indexing approach is part of the reason users see sub-second query performance on trillion row datasets—all while using commodity object storage.

In this post, you’ll learn how and why Hydrolix indexes every column by default. This post will cover:

  • How Hydrolix partitions and prunes data by time
  • Writing and reading indexes at the partition level
  • The benefits of indexing everything

Hydrolix Partitions and Prunes Data by Time

Hydrolix is a streaming data lake optimized for log-intensive use cases that partitions data by time. When a user makes a time-filtered query, the query head first consults a catalog with information about each partition. This information includes a partition’s minimum and maximum timestamps as well as its exact location in object storage. The query head then eliminates partitions that don’t have the included range from consideration, a process known as partition pruning.

The next graphic shows a query based on an incident that occurred between 9:07 AM and 9:09 AM. For a query using this time filter, the query head can prune two of the three partitions because they don’t include the specified time range. Partition pruning can eliminate thousands of partitions from consideration in a large data set.

That’s not column indexing, though, and in fact, there are no column indexes at the table level. Instead, each partition has its own index, a process we refer to as micro-indexing. This is in contrast to other big data solutions that use partitioning but not indexing, including both Snowflake and Google BigQuery.

Snowflake uses clustering to cluster contiguous data together, but it doesn’t include indexing. In BigQuery, you can query on any column, but without indexes, this can lead to large table scans and increased charges for scanned data. To efficiently query data, it’s typical to duplicate data using a different key structure, but duplication also comes with increased costs. While Snowflake and BigQuery are powerful tools that use partition pruning, the fact that they both lack indexing strategies is a detriment in terms of overall query performance and can lead to higher costs.

Indexing at the Partition Level

A Hydrolix table can have hundreds of thousands of partitions, and each partition consists of three files:

  • manifest.hdx
  • index.hdx
  • data.hdx

The manifest.hdx file is the smallest of the three and acts as a table of contents for the partition. The index.hdx file contains the index in a list structure, and data.hdx contains the data itself. Using simple list structures takes less space and maintenance than b-trees.

index.hdx contains sparse indexes of every column by default. (Indexes for specific columns can be optionally suppressed.) A sparse index contains a pointer to a range of values instead of to each value in a column. Sparse indexing means smaller index sizes, which is useful for both reducing I/O and storage. Log data can also have many repeated values—an example is HTTP status codes. Sparse indexes are very effective at retrieving repeated value ranges.

Writing Indexes to Partitions

Hydrolix’s partitioning strategy makes both writing and reading indexes efficient. When log data is ingested into Hydrolix, intake heads work in parallel to create new partitions. Partitions are smallest at ingest time, with a maximum of one million rows by default. Creating an index for a small partition is much more efficient than inserting new values into a b-tree index and rebalancing it, especially for very large tables.

Hydrolix is immutable and append-only by design. An ALTER service is available, but it’s an added feature, not an architectural cornerstone of Hydrolix. As a result, indexes don’t need to be constantly rewritten as they do in a transactional database.

However, Hydrolix does include a compaction service that merges data over time, compressing and optimizing it further. The largest partitions are still only 4 GB in size, so updating a partition index is still a very efficient operation in comparison to updating indexes on large tables.

Hydrolix also uses high-density compression to dramatically reduce the size of both the raw data and the index. This compression typically reduces the storage footprint of log data by 20-50x.

In a recent blog that dissected a Hydrolix query, there’s a section that discusses the total amount of data in a Hydrolix partition over a 24-hour period. The data is anecdotal, but it provides an example of how much space an index uses in comparison to both the raw and compressed data.

Over a 24-hour period, the queried table included:

  • 4.6GB of raw logs
  • 12MB of manifest data (manifest.hdx files)
  • 26MB of index data (index.hdx files)
  • 96MB total compressed data (data.hdx files)

That’s approximately 33x compression of the raw logs. The indexes are about 19% of the total size of the compressed data—and only about .5% of the size of the total raw data.

In Hydrolix, enabling an index has the beneficial side effect of improving compression further, but this is part of the secret sauce.

Reading Indexes in Partitions

In the case of reading data, it’s helpful to consider an example query. The following query covers a common log use case: retrieving all messages from the last hour that have a level of 'error':

The query head will first determine which partitions include timestamps from the last hour, eliminating all other partitions from consideration. This is partition pruning.

Next, query peers work in parallel, opening multiple partitions concurrently and reading the contents of the manifest and index files. (By default, each peer opens three partitions per CPU). 

As the index is downloaded into memory, the query peer determines whether the partition contains any matches with a level of 'error'. If there’s no match, the query peer won’t retrieve any data from data.hdx. However, if there is a match, the peer will retrieve only narrow byte ranges from data.hdx, which are specified in the index. Row-based databases often do full table scans while other columnar databases do full column scans, which is much more efficient. Hydrolix is even more efficient than a typical columnar database, though. Because of sparse indexing for every column, Hydrolix only performs partial column scans, regardless of which column is queried.

The following image shows the difference. As you can see, Hydrolix needs to scan much less data than row-based databases. The difference can also be significant for columnar databases as well. This is only possible because Hydrolix indexes every column.

Image shows row-based db on left doing full table scan, columnar db in middle doing full column scans, and Hydrolix on the right doing partial column scans.

Indexing at the partition level has several benefits for query optimization:

  • Hydrolix’s indexing strategy requires less memory than other databases because each index is small and most partition indexes aren’t used for a query. This makes queries much more efficient.
  • Because indexes don’t take up much ingest or query overhead and are individually small, they can contain more information than traditional indexes—so Hydrolix can index everything. This makes indexes much more comprehensive, and in turn, also makes queries more efficient.

Why Index Every Column?

Let’s move onto the why. Why index every column, and what value does that provide to Hydrolix users? Hydrolix’s indexing strategy impacts and improves every part of the system—it’s a core aspect of what makes Hydrolix unique.

  • Query performance: Hydrolix uses commodity object storage. Because object storage is distributed and uses HTTP requests, minimizing data transfer is extremely important. With sparse indexing of all columns, query peers can quickly determine whether a partition includes relevant data, ignore it if not, and only transfer narrow byte ranges of data if so. Indexing every column is a crucial part of achieving this performance.
  • Long-term “hot” retention: Hydrolix offers long-term retention with full access to all of your data at a fraction of the cost of other data solutions. Commodity object storage makes this efficiency possible, and indexing is a critical part of what makes object storage a viable solution from a performance perspective.
  • High-density compression: Hydrolix’s indexing strategy has the side effect of improving compression further, reducing your storage footprint and costs.

Indexing everything is also part of our strategy of never compromising on your data. You shouldn’t have to make tough choices between paying too much for data or throwing it away—nor should you have to make tough decisions about which columns to index. Indexing everything simplifies maintenance and decision-making:

  • No additional decision-making at startup. With traditional solutions, deciding which columns to index can be a tough choice, especially in high dimensionality datasets that contain hundreds or even thousands of fields.
  • No need to create indexes later. With traditional solutions, you may not realize until later that a column should be indexed. Adding a new index to a large table is a compute-intensive and painstaking process that you can avoid altogether by indexing all columns from the start. And until you add indexes, teams will have to deal with performance degradation and higher compute costs.
  • No painfully slow, expensive queries. Even if you make wise decisions about which columns to index, indexing just a handful of columns out of potentially hundreds (especially in high dimensionality tables) makes it inevitable that some teams will have to query unindexed columns. This can have a cascading effect on compute, budget, and team management.

With Hydrolix’s approach to indexing, the question shouldn’t be why index everything, but rather, why not index everything? Traditional systems aren’t designed for comprehensive indexing, and it’s just one of the ways Hydrolix is unique: by making indexing an architectural cornerstone and combining it with other features such as massive parallelism, Hydrolix maximizes the performance of object storage and provides sub second query latency even on trillion row datasets.

Next Steps

Share this post…

Ready to Start?

Cut data retention costs by 75%

Give Hydrolix a try or get in touch with us to learn more

Indexing is a powerful tool for maximizing query performance in databases, but one of the traditional maxims of indexing is that it should be done sparingly, and only on columns that are used frequently. Because of the way that the majority of databases are designed, too many indexes can have a negative impact on database performance, including both read and write speed.

That’s not how Hydrolix works. Hydrolix takes a novel approach to indexing where every column is indexed at the partition level—even if a table has thousands of columns. Hydrolix’s indexing approach is part of the reason users see sub-second query performance on trillion row datasets—all while using commodity object storage.

In this post, you’ll learn how and why Hydrolix indexes every column by default. This post will cover:

  • How Hydrolix partitions and prunes data by time
  • Writing and reading indexes at the partition level
  • The benefits of indexing everything

Hydrolix Partitions and Prunes Data by Time

Hydrolix is a streaming data lake optimized for log-intensive use cases that partitions data by time. When a user makes a time-filtered query, the query head first consults a catalog with information about each partition. This information includes a partition’s minimum and maximum timestamps as well as its exact location in object storage. The query head then eliminates partitions that don’t have the included range from consideration, a process known as partition pruning.

The next graphic shows a query based on an incident that occurred between 9:07 AM and 9:09 AM. For a query using this time filter, the query head can prune two of the three partitions because they don’t include the specified time range. Partition pruning can eliminate thousands of partitions from consideration in a large data set.

That’s not column indexing, though, and in fact, there are no column indexes at the table level. Instead, each partition has its own index, a process we refer to as micro-indexing. This is in contrast to other big data solutions that use partitioning but not indexing, including both Snowflake and Google BigQuery.

Snowflake uses clustering to cluster contiguous data together, but it doesn’t include indexing. In BigQuery, you can query on any column, but without indexes, this can lead to large table scans and increased charges for scanned data. To efficiently query data, it’s typical to duplicate data using a different key structure, but duplication also comes with increased costs. While Snowflake and BigQuery are powerful tools that use partition pruning, the fact that they both lack indexing strategies is a detriment in terms of overall query performance and can lead to higher costs.

Indexing at the Partition Level

A Hydrolix table can have hundreds of thousands of partitions, and each partition consists of three files:

  • manifest.hdx
  • index.hdx
  • data.hdx

The manifest.hdx file is the smallest of the three and acts as a table of contents for the partition. The index.hdx file contains the index in a list structure, and data.hdx contains the data itself. Using simple list structures takes less space and maintenance than b-trees.

index.hdx contains sparse indexes of every column by default. (Indexes for specific columns can be optionally suppressed.) A sparse index contains a pointer to a range of values instead of to each value in a column. Sparse indexing means smaller index sizes, which is useful for both reducing I/O and storage. Log data can also have many repeated values—an example is HTTP status codes. Sparse indexes are very effective at retrieving repeated value ranges.

Writing Indexes to Partitions

Hydrolix’s partitioning strategy makes both writing and reading indexes efficient. When log data is ingested into Hydrolix, intake heads work in parallel to create new partitions. Partitions are smallest at ingest time, with a maximum of one million rows by default. Creating an index for a small partition is much more efficient than inserting new values into a b-tree index and rebalancing it, especially for very large tables.

Hydrolix is immutable and append-only by design. An ALTER service is available, but it’s an added feature, not an architectural cornerstone of Hydrolix. As a result, indexes don’t need to be constantly rewritten as they do in a transactional database.

However, Hydrolix does include a compaction service that merges data over time, compressing and optimizing it further. The largest partitions are still only 4 GB in size, so updating a partition index is still a very efficient operation in comparison to updating indexes on large tables.

Hydrolix also uses high-density compression to dramatically reduce the size of both the raw data and the index. This compression typically reduces the storage footprint of log data by 20-50x.

In a recent blog that dissected a Hydrolix query, there’s a section that discusses the total amount of data in a Hydrolix partition over a 24-hour period. The data is anecdotal, but it provides an example of how much space an index uses in comparison to both the raw and compressed data.

Over a 24-hour period, the queried table included:

  • 4.6GB of raw logs
  • 12MB of manifest data (manifest.hdx files)
  • 26MB of index data (index.hdx files)
  • 96MB total compressed data (data.hdx files)

That’s approximately 33x compression of the raw logs. The indexes are about 19% of the total size of the compressed data—and only about .5% of the size of the total raw data.

In Hydrolix, enabling an index has the beneficial side effect of improving compression further, but this is part of the secret sauce.

Reading Indexes in Partitions

In the case of reading data, it’s helpful to consider an example query. The following query covers a common log use case: retrieving all messages from the last hour that have a level of 'error':

The query head will first determine which partitions include timestamps from the last hour, eliminating all other partitions from consideration. This is partition pruning.

Next, query peers work in parallel, opening multiple partitions concurrently and reading the contents of the manifest and index files. (By default, each peer opens three partitions per CPU). 

As the index is downloaded into memory, the query peer determines whether the partition contains any matches with a level of 'error'. If there’s no match, the query peer won’t retrieve any data from data.hdx. However, if there is a match, the peer will retrieve only narrow byte ranges from data.hdx, which are specified in the index. Row-based databases often do full table scans while other columnar databases do full column scans, which is much more efficient. Hydrolix is even more efficient than a typical columnar database, though. Because of sparse indexing for every column, Hydrolix only performs partial column scans, regardless of which column is queried.

The following image shows the difference. As you can see, Hydrolix needs to scan much less data than row-based databases. The difference can also be significant for columnar databases as well. This is only possible because Hydrolix indexes every column.

Image shows row-based db on left doing full table scan, columnar db in middle doing full column scans, and Hydrolix on the right doing partial column scans.

Indexing at the partition level has several benefits for query optimization:

  • Hydrolix’s indexing strategy requires less memory than other databases because each index is small and most partition indexes aren’t used for a query. This makes queries much more efficient.
  • Because indexes don’t take up much ingest or query overhead and are individually small, they can contain more information than traditional indexes—so Hydrolix can index everything. This makes indexes much more comprehensive, and in turn, also makes queries more efficient.

Why Index Every Column?

Let’s move onto the why. Why index every column, and what value does that provide to Hydrolix users? Hydrolix’s indexing strategy impacts and improves every part of the system—it’s a core aspect of what makes Hydrolix unique.

  • Query performance: Hydrolix uses commodity object storage. Because object storage is distributed and uses HTTP requests, minimizing data transfer is extremely important. With sparse indexing of all columns, query peers can quickly determine whether a partition includes relevant data, ignore it if not, and only transfer narrow byte ranges of data if so. Indexing every column is a crucial part of achieving this performance.
  • Long-term “hot” retention: Hydrolix offers long-term retention with full access to all of your data at a fraction of the cost of other data solutions. Commodity object storage makes this efficiency possible, and indexing is a critical part of what makes object storage a viable solution from a performance perspective.
  • High-density compression: Hydrolix’s indexing strategy has the side effect of improving compression further, reducing your storage footprint and costs.

Indexing everything is also part of our strategy of never compromising on your data. You shouldn’t have to make tough choices between paying too much for data or throwing it away—nor should you have to make tough decisions about which columns to index. Indexing everything simplifies maintenance and decision-making:

  • No additional decision-making at startup. With traditional solutions, deciding which columns to index can be a tough choice, especially in high dimensionality datasets that contain hundreds or even thousands of fields.
  • No need to create indexes later. With traditional solutions, you may not realize until later that a column should be indexed. Adding a new index to a large table is a compute-intensive and painstaking process that you can avoid altogether by indexing all columns from the start. And until you add indexes, teams will have to deal with performance degradation and higher compute costs.
  • No painfully slow, expensive queries. Even if you make wise decisions about which columns to index, indexing just a handful of columns out of potentially hundreds (especially in high dimensionality tables) makes it inevitable that some teams will have to query unindexed columns. This can have a cascading effect on compute, budget, and team management.

With Hydrolix’s approach to indexing, the question shouldn’t be why index everything, but rather, why not index everything? Traditional systems aren’t designed for comprehensive indexing, and it’s just one of the ways Hydrolix is unique: by making indexing an architectural cornerstone and combining it with other features such as massive parallelism, Hydrolix maximizes the performance of object storage and provides sub second query latency even on trillion row datasets.

Next Steps