RSS

Summary Tables Simplified

The latest version of Hydrolix Summary Tables significantly reduces the complexity of writing SQL aggregates while preserving the efficiency of querying summary tables. A common limitation of aggregate tables is that they do not preserve enough information to allow for applying aggregate functions to the results of aggregate functions.(The average of several averages will likely…

Dan Sullivan

Published:

Apr 20, 2023

3 minute read
, , ,

The latest version of Hydrolix Summary Tables significantly reduces the complexity of writing SQL aggregates while preserving the efficiency of querying summary tables.

A common limitation of aggregate tables is that they do not preserve enough information to allow for applying aggregate functions to the results of aggregate functions.(The average of several averages will likely produce unwanted results). In earlier versions of Hydrolix Summary Tables, users had a couple of options for addressing those limitations. They could write queries to preserve all the data needed to calculate aggregate at different levels of aggregation. For example, you might keep both the count and sum of a metric to calculate the average.  But for most cases beyond sums, averages, and counts, a user would need to be familiar with ClickHouse Aggregate Intermediate States and Function Combinators. These have a steep learning curve along with additional complexities when using in conjunction with Hydrolix.

The latest version of Summary Tables relieves SQL developers from having to choose between explicitly preserving enough information to compute aggregates over aggregates or wrangling aggregate intermediate states.

To take advantage of Summary Tables, users create a SELECT query with aggregates and dimension columns for grouping (See Listing 1).  Summary Tables support a range of features, including arrays, aggregate suffixes (e.g. -If, -OrNull, etc), and the ability to store multiple related aggregates in a single column (e.g. average, sum, and count).

Listing 1. Example Summary SQL

    

There are a few restrictions on summary SQL. HAVING, JOINs, and LIMIT clauses are not allowed and window functions are not allowed either. Also, a SETTINGS clause is required to specify the primary key. 

Summary Tables store intermediate states so a user can define custom groupings at query time and still leverage the efficiency of querying pre-summarized data. For example, Listing 2 shows a query that returns hourly averages by content delivery network (CDN) over the last month.

Listing 2.  Example query using a Summary Table

Note a few things. First, the aggregate “server_ttlb_avg” was defined in Summary SQL query, the group by clause uses new columns created from functions on existing columns in the source table, and the group by can use fewer columns than used in the summary SQL.

For those who understand how to work with ClickHouse Aggregate Intermediate States and want a peek under the hood, running DESCRIBE on a Summary Table gives meaningful insight.

Listing 4. DESCRIBE of summary table

You can also run EXPLAIN SYNTAX on any summary table query to understand how your query is being translated. For example, the explain syntax of query over a demonstration summary table demo.new_summary is:

Listing 3. Explain plan of example summary query.

For Grafana users, it’s easy to write queries that can easily switch between raw data tables and summary tables. Simply create two common table expressions (CTEs), one based on the raw table and one based on the summary table and then use a Grafana dashboard variable to choose which of the two CTEs to query from.

Figure 1. Grafana dashboard panel that displays data from raw or summary table.

An example pair of CTEs are shown in Listing 4.

Listing 4. Common Table Expressions used for querying from both raw and summary tables.

In addition to the common table expressions, we use Grafana dashboard variables to determine which table to query. If we want to view the average object size at the level of individual sessions, we could specify a session ID and query from the raw table. When we want to query the average object size in aggregate we would use the summary table and not specify a session ID. The dashboard variable for session_id is defined as:

The latest version of Hydrolix Summary Tables significantly reduces the challenges of creating aggregate tables that can be used in a wide range of use cases.

Share this post…

Ready to Start?

Cut data retention costs by 75%

Give Hydrolix a try or get in touch with us to learn more

The latest version of Hydrolix Summary Tables significantly reduces the complexity of writing SQL aggregates while preserving the efficiency of querying summary tables.

A common limitation of aggregate tables is that they do not preserve enough information to allow for applying aggregate functions to the results of aggregate functions.(The average of several averages will likely produce unwanted results). In earlier versions of Hydrolix Summary Tables, users had a couple of options for addressing those limitations. They could write queries to preserve all the data needed to calculate aggregate at different levels of aggregation. For example, you might keep both the count and sum of a metric to calculate the average.  But for most cases beyond sums, averages, and counts, a user would need to be familiar with ClickHouse Aggregate Intermediate States and Function Combinators. These have a steep learning curve along with additional complexities when using in conjunction with Hydrolix.

The latest version of Summary Tables relieves SQL developers from having to choose between explicitly preserving enough information to compute aggregates over aggregates or wrangling aggregate intermediate states.

To take advantage of Summary Tables, users create a SELECT query with aggregates and dimension columns for grouping (See Listing 1).  Summary Tables support a range of features, including arrays, aggregate suffixes (e.g. -If, -OrNull, etc), and the ability to store multiple related aggregates in a single column (e.g. average, sum, and count).

Listing 1. Example Summary SQL

    

There are a few restrictions on summary SQL. HAVING, JOINs, and LIMIT clauses are not allowed and window functions are not allowed either. Also, a SETTINGS clause is required to specify the primary key. 

Summary Tables store intermediate states so a user can define custom groupings at query time and still leverage the efficiency of querying pre-summarized data. For example, Listing 2 shows a query that returns hourly averages by content delivery network (CDN) over the last month.

Listing 2.  Example query using a Summary Table

Note a few things. First, the aggregate “server_ttlb_avg” was defined in Summary SQL query, the group by clause uses new columns created from functions on existing columns in the source table, and the group by can use fewer columns than used in the summary SQL.

For those who understand how to work with ClickHouse Aggregate Intermediate States and want a peek under the hood, running DESCRIBE on a Summary Table gives meaningful insight.

Listing 4. DESCRIBE of summary table

You can also run EXPLAIN SYNTAX on any summary table query to understand how your query is being translated. For example, the explain syntax of query over a demonstration summary table demo.new_summary is:

Listing 3. Explain plan of example summary query.

For Grafana users, it’s easy to write queries that can easily switch between raw data tables and summary tables. Simply create two common table expressions (CTEs), one based on the raw table and one based on the summary table and then use a Grafana dashboard variable to choose which of the two CTEs to query from.

Figure 1. Grafana dashboard panel that displays data from raw or summary table.

An example pair of CTEs are shown in Listing 4.

Listing 4. Common Table Expressions used for querying from both raw and summary tables.

In addition to the common table expressions, we use Grafana dashboard variables to determine which table to query. If we want to view the average object size at the level of individual sessions, we could specify a session ID and query from the raw table. When we want to query the average object size in aggregate we would use the summary table and not specify a session ID. The dashboard variable for session_id is defined as:

The latest version of Hydrolix Summary Tables significantly reduces the challenges of creating aggregate tables that can be used in a wide range of use cases.