The latest version of Hydrolix Summary Tables significantly reduces the complexity of writing SQL aggregates while preserving the efficiency of querying summary tables.
A common limitation of aggregate tables is that they do not preserve enough information to apply aggregate functions to the results of other aggregate functions (the average of several averages will likely produce unwanted results). In earlier versions of Hydrolix Summary Tables, users had a couple of options for addressing this limitation. They could write queries that preserve all the data needed to compute aggregates at different levels of aggregation; for example, keeping both the count and sum of a metric in order to calculate its average. But for most cases beyond sums, averages, and counts, a user needed to be familiar with ClickHouse aggregate function intermediate states and combinators, which have a steep learning curve and add further complexity when used in conjunction with Hydrolix.
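The sum-and-count workaround can be sketched as follows. This is a minimal illustration, assuming a hypothetical raw table named requests with a day column and a latency column; it is not taken from the Hydrolix documentation.

```sql
-- Naive approach: averaging per-day averages weights every day equally,
-- regardless of how many rows each day actually contains.
SELECT avg(daily_avg)
FROM (
    SELECT day, avg(latency) AS daily_avg
    FROM requests
    GROUP BY day
);  -- usually NOT the true overall average

-- Workaround: preserve sum and count per day, then divide at the outer level.
SELECT sum(daily_sum) / sum(daily_count)
FROM (
    SELECT day, sum(latency) AS daily_sum, count() AS daily_count
    FROM requests
    GROUP BY day
);  -- equal to avg(latency) over the whole table
```

This trick works for averages, but most other aggregates (percentiles, distinct counts) cannot be decomposed this simply, which is where intermediate states come in.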
The latest version of Summary Tables relieves SQL developers from having to choose between explicitly preserving enough information to compute aggregates over aggregates or wrangling aggregate intermediate states.
To take advantage of Summary Tables, users create a SELECT query with aggregates and dimension columns for grouping (see Listing 1). Summary Tables support a range of features, including arrays, aggregate function suffixes (e.g., -If, -OrNull), and the ability to store multiple related aggregates in a single column (e.g., average, sum, and count).
Listing 1. Example Summary SQL
```sql
SELECT
    toStartOfMinute(timestamp) AS min,  -- Shortest rollup that remains performant
    quantiles(0.5, 0.75, 0.95)(server_ttlb) AS server_ttlb_percentiles,
    quantiles(0.5, 0.75, 0.95)(server_ttlb)[1] AS server_ttlb_median,  -- Array elements allowed; agg only calculated once
    quantiles(0.5, 0.75, 0.95)(server_ttlb)[3] AS server_ttlb_95th,
    max(server_ttlb) AS server_ttlb_max,
    avg(server_ttlb) AS server_ttlb_avg,
    server_ttlb_median + 10 > avg(server_ttlb) AS high_median,  -- Multiple agg aliases in a single column; aliases reusable
    sum(server_ttlb) AS server_ttlb_sum,
    count(server_ttlb) AS server_ttlb_count,  -- avg, sum, and count internally stored as 1 column
    uniqExact(user_agent) AS user_agent_uniq,
    countIf(server_ttlb > 5) AS greater_5,  -- Aggregate func suffixes allowed (-If, -OrNull, etc.)
    count() AS count,
    cdn,
    asn
FROM project.raw_table  -- Query is on the raw table
WHERE server_ttlb != 0  -- Filter is allowed, but should be used with care
GROUP BY  -- Table can be over-grouped, with fewer keys chosen later
    min,
    cdn,
    asn
SETTINGS hdx_primary_key = 'min'
```
There are a few restrictions on summary SQL: HAVING, JOIN, and LIMIT clauses are not allowed, nor are window functions. Also, a SETTINGS clause is required to specify the primary key.
Summary Tables store intermediate states so a user can define custom groupings at query time and still leverage the efficiency of querying pre-summarized data. For example, Listing 2 shows a query that returns hourly averages by content delivery network (CDN) over the last month.
Listing 2. Example query using a Summary Table
```sql
SELECT
    toStartOfHour(min) AS hour,
    server_ttlb_avg,
    cdn
FROM project.summary_table
WHERE min > toStartOfMonth(now())
    AND asn NOT IN (1, 2, 9, 99)
GROUP BY
    hour,
    cdn
```
Note a few things: the aggregate "server_ttlb_avg" was defined in the summary SQL query; the GROUP BY clause uses new columns created by applying functions to existing columns of the source table; and the GROUP BY can use fewer columns than the summary SQL did.
For those who understand how to work with ClickHouse Aggregate Intermediate States and want a peek under the hood, running DESCRIBE on a Summary Table gives meaningful insight.
Listing 3. DESCRIBE of a summary table
```
DESCRIBE TABLE project.summary_table

Query id: 9d394b9e-ff64-4eee-abbf-eba299a22968

name                                      type                                                             default_type  default_expression
toStartOfHour(timestamp)                  DateTime
cdn                                       Nullable(String)
quantiles(0.5, 0.75, 0.95)(server_ttlb)   AggregateFunction(quantiles(0.5, 0.75, 0.95), Nullable(UInt64))
max(server_ttlb)                          AggregateFunction(max, Nullable(UInt64))
sumCount(server_ttlb)                     AggregateFunction(sumCount, Nullable(UInt64))
uniqExact(user_agent)                     AggregateFunction(uniqExact, Nullable(String))
countIf(greater(server_ttlb, 5))          AggregateFunction(countIf, Nullable(UInt8))
count()                                   AggregateFunction(count)
hour                                      DateTime                                                         ALIAS         `toStartOfHour(timestamp)`
server_ttlb_percentiles                   Array(Float64)                                                   ALIAS         quantilesMerge(0.5, 0.75, 0.95)(`quantiles(0.5, 0.75, 0.95)(server_ttlb)`)
server_ttlb_median                        Float64                                                          ALIAS         quantilesMerge(0.5, 0.75, 0.95)(`quantiles(0.5, 0.75, 0.95)(server_ttlb)`)[1]
server_ttlb_95th                          Float64                                                          ALIAS         quantilesMerge(0.5, 0.75, 0.95)(`quantiles(0.5, 0.75, 0.95)(server_ttlb)`)[3]
max_server_ttlb                           Nullable(UInt64)                                                 ALIAS         maxMerge(`max(server_ttlb)`)
server_ttlb_avg                           Float64                                                          ALIAS         (sumCountMerge(`sumCount(server_ttlb)`).1) / (sumCountMerge(`sumCount(server_ttlb)`).2)
high_median                               UInt8                                                            ALIAS         ((quantilesMerge(0.5, 0.75, 0.95)(`quantiles(0.5, 0.75, 0.95)(server_ttlb)`)[1]) + 10) > ((sumCountMerge(`sumCount(server_ttlb)`).1) / (sumCountMerge(`sumCount(server_ttlb)`).2))
server_ttlb_sum                           UInt64                                                           ALIAS         sumCountMerge(`sumCount(server_ttlb)`).1
server_ttlb_count                         UInt64                                                           ALIAS         sumCountMerge(`sumCount(server_ttlb)`).2
user_agent_uniq                           UInt64                                                           ALIAS         uniqExactMerge(`uniqExact(user_agent)`)
greater_5                                 UInt64                                                           ALIAS         countIfMerge(`countIf(greater(server_ttlb, 5))`)
count                                     UInt64                                                           ALIAS         countMerge(`count()`)
```
You can also run EXPLAIN SYNTAX on any summary table query to understand how your query is being translated. For example, the EXPLAIN SYNTAX of a query over a demonstration summary table, demo.new_summary, is:
Listing 4. EXPLAIN SYNTAX of an example summary query
```sql
SELECT
    `toStartOfHour(datetime_second_second_unindexed)`,
    boolean_indexed,
    `avg(uint64_indexed)`,
    `max(uint32_indexed)`,
    `sum(uint8_indexed)`,
    `quantiles(0.1, 0.5, 0.95)(uint64_indexed)`,
    `uniqCombined(uint8_indexed)`,
    `countIf(less(uint8_indexed, 5))`
FROM demo.new_summary
```
For Grafana users, it's easy to write queries that switch between raw data tables and summary tables. Simply create two common table expressions (CTEs), one based on the raw table and one based on the summary table, and then use a Grafana dashboard variable to choose which of the two CTEs to query from.
Figure 1. Grafana dashboard panel that displays data from raw or summary table.

An example pair of CTEs is shown in Listing 5.
Listing 5. Common table expressions used for querying from both raw and summary tables
```sql
WITH Q1 AS (
    SELECT
        toStartOfMinute(timestamp) AS min_timestamp,
        cdn,
        avg(bytes_out) AS bytes_out_avg
    FROM cdn.logs
    WHERE $__timeFilter(timestamp)
    GROUP BY
        min_timestamp,
        cdn
),
Q2 AS (
    SELECT
        toStartOfMinute(timestamp_min) AS min_timestamp,
        cdn,
        bytes_out_avg
    FROM cdn.logs_summary
    WHERE $__timeFilter(timestamp_min)
    GROUP BY
        min_timestamp,
        cdn
)
SELECT
    min_timestamp,
    cdn
FROM ${cte:text}
ORDER BY min_timestamp ASC
```
In addition to the common table expressions, we use Grafana dashboard variables to determine which table to query. If we want to view the average object size at the level of individual sessions, we could specify a session ID and query from the raw table. When we want to query the average object size in aggregate we would use the summary table and not specify a session ID. The dashboard variable for session_id is defined as:
```sql
SELECT if('${session_id:text}' = '', 'Q2', 'Q1')
```
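To make the mechanism concrete, here is a sketch of how the panel query resolves under each condition. The CTE and column names follow the listings above; the exact rendered SQL depends on your Grafana data source and variable settings.

```sql
-- session_id is empty: the cte variable evaluates to 'Q2',
-- so the panel reads from the summary-table CTE:
SELECT min_timestamp, cdn FROM Q2 ORDER BY min_timestamp ASC;

-- session_id is set: the cte variable evaluates to 'Q1',
-- so the panel reads from the raw-table CTE:
SELECT min_timestamp, cdn FROM Q1 ORDER BY min_timestamp ASC;
```

Because only the FROM target changes, the rest of the panel query (and any downstream transformations) can stay identical across the raw and summarized paths.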
The latest version of Hydrolix Summary Tables significantly reduces the challenge of creating aggregate tables that serve a wide range of use cases.