HYDROLIX BLOG

Ponderings, insights and industry updates

Handling Multiple Groupings in a Single Pass

June 20, 2022

Author: Federico Rodriguez |

Tags: , ,

The Hydrolix platform has been specifically designed using a number of characteristics that optimize storage for VERY large, high cardinality datasets all while using a distributed storage medium (such as S3, Google Cloud Storage etc). The ability to store more data for longer means that some traditional assumptions made when considering OLTP databases (Postgres, MySQL) can lead to sub-optimal query execution and there can be significantly faster and more effective ways to retrieve information.

As an example GROUP BY s are often used in building queries within OLTP databases with multiple executed to filter and aggregate down a set of rows to the data required. This can be a good ploy with smaller datasets, however there are ways that can be employed within the Hydrolix platform to optimize these queries, so they are executed in a single pass, making them much faster and significantly more efficient. To illustrate this the following tutorial is provided:

The Dataset

We’ll start by building a Temporary Table to build a sample dataset we can play around with.

In this tutorial we will:

  • Build a query on this sample table that groups elements in two different ways, and delivers a final report.
  • Tune this query for significant speed and memory usage gains when run on Hydrolix.

Feel free to run these queries yourself on your own Hydrolix cluster (I recommend using the clickhouse-client, quick installation instructions here).

NOTE: The Temporary table will be removed as soon as your session is completed.

Each event in this table is a page hit to website. Every hit has a certain session id, ISP (Internet Service Provider), user agent, country and availability region associated with it.

TimestampCountryuser_agentsession_idispregion
2022-06-16 20:34:29US…Chrome/65.0.3325.181…123VerizonEast
2022-06-16 20:34:30US…Chrome/65.0.3325.181…123VerizonEast
2022-06-16 20:34:31US…Chrome/65.0.3325.181…123VerizonEast
2022-06-16 20:34:32US…Chrome/65.0.3325.181…234VerizonEast
2022-06-16 20:34:33US…Safari/537.36…345AT&TEast
2022-06-16 20:34:29FR…Chrome/65.0.3325.181…890AT&TWest
2022-06-16 20:34:34FR…Firefox/54.0…456SpectrumWest
2022-06-16 20:34:32FR…Firefox/54.0…456SpectrumWest
2022-06-16 20:34:34UK…Safari/537.36…567SpectrumWest
2022-06-16 20:35:31US…Chrome/65.0.3325.181…231VerizonEast
2022-06-16 20:35:32US…Chrome/65.0.3325.181…214VerizonEast
2022-06-16 20:35:33US…Safari/537.36…341AT&TWest
2022-06-16 20:35:34FR…Firefox/54.0…461SpectrumWest
2022-06-16 20:34:29US…Chrome/65.0.3325.181…012SpectrumWest
2022-06-16 20:35:34UK…Safari/537.36…157XfinityWest
2022-06-16 20:35:34US…Safari/537.36…177XfinityWest
2022-06-16 20:35:40LB…Safari/537.36…159UncommonISPWest
2022-06-16 20:35:34CH…Safari/537.36…159VeryUncommonISPWest
2022-06-16 20:35:34CH…Safari/537.36…159VeryUncommonISPWest

The problem

Imagine you’ve been asked to come up with the following report:

  • For every ISP, report the percentage of page hits from unique session ids that occur outside of the US
  • Report the popularity (% page hits) of the browsers Safari, Chrome, and Firefox, per ISP, availability region.
  • Only return results for ISPs that contain more than 1 unique session

Notice from the table that:

  1. Multiple hits can come from the same session (session is not unique).
  2. Page hits can exist in one of two availability regions East, or West.
  3. UncommonISP and VeryUncommonISP are the only two ISPs with a single unique session id, and should be filtered out from our final result set.

The First Approach

We will try to build a pipeline, with a separate query for each part of the report.

Get list of ISPs w/ more than 1 unique session

Get Percentage of Unique, Outside-US Sessions Per-ISP

outside_us_hits_pctisp
1VeryUncommonISP
0Verizon
0.5Xfinity
1UncommonISP
0.3333333333333AT&T
0.75Spectrum

This query does the following:

  1. Starting at the innermost query, get unique sessions for every country and isp
  2. Group countries and unique-sessions-per-country into arrays, for each isp
  3. Use the multi-argument version of arrayFilter to filter one array by another. Every time an element in the countries array is != US, the corresponding unique session count is omitted from the uniq_session array.
  4. In the outermost query, sum up the unique sessions array values to get total for non-us countries, and divide by all unique for ISP to find percentage.

Get Browser Popularity by ISP, Region

chrome_pctsafari_pctfirefox_pctispregion
0.50.50AT&TWest
010XfinityWest
0.20.20.6SpectrumWest
100VerizonEast
010UncommonISPWest
010AT&TEast
010VeryUncommonISPWest

Uses countIf function to only count sessions for an ISP and region that contain specific browser strings in their user_agent

Putting it together using CTE

chrome_pctsafari_pctfirefox_pctispregionoutside_us_hits_pct
100VerizonEast0
010XfinityWest0.5
0.50.50AT&TWest0.333333333
010AT&TEast0.333333333
0.20.20.6SpectrumWest0.75

The Hydrolix-Friendly Approach

We will rewrite that above query in a way that

  • Avoids JOINs
  • Avoids multiple trips to grab data
  • Applies functions on columns

Percentage of Unique, Non-US Sessions Per-ISP Using uniqIf

Here we’ve rewritten the previous 23-line version of this query to just a few lines, using very handy ClickHouse feature: Aggregation Combinators

Many of you may have used countIf, but might not know that any ClickHouse aggregation function can include an -If suffix to only perform the function if the passed-in predicate is true.

We are thus able to both filter and aggregate the column session_id, in a single function call. Notice that this cannot be done by adding WHERE country != 'US' — we then lose the ability to grab all unique sessions for isp in the same pass.

tip:

Whenever you find yourself using a ClickHouse aggregation function inside a subquery, ask yourself if adding -If would get rid of complexity, and perhaps the subquery altogether.

Browser Popularity + Percent-Outside-US Queries in the Same Pass with ARRAY JOIN

In some respects, I find ARRAY JOIN unfortunately named. While JOIN (INNER, OUTER, etc) should be avoided in ClickHouse, ARRAY JOIN should be sought out!

Previously, two separate queries that performed aggregations on different GROUP BYs were presented in the same dataset using INNER JOIN. This is a very common pattern, however in Hydrolix, it can slow down queries.

Instead, the Percent-Outside-US query is run as it was before, but now keeps track of other columns (user_agent, and region) that we’ll need in the outer query.

The ARRAY JOIN “plops” the user_agents and regions arrays as new columns in the table, right next to the others. Because the table is columnar this is a trivial operation.ARRAY JOIN does not need to keep track of any key that is the same in both tables, and is an entirely different operation (and much faster) than the INNER JOIN used to combine the result set in the previous query.

The query above is 28% faster than the first approach

Further Improvements

While groupArray can offer significant improvements to a query, it can also blow up memory. The user_agents included in the page_hits table are surrounded by ... to denote they can be of arbitrary length. Because groupArray holds the array of values it has grouped in memory for some period of time (until it is ARRAY JOINed back into the table) this could cause Out-of-Memory issues in certain scenarios.

Here’s nice trick to use as workaround:
Notice we are only using the user_agent string in boolean expressions in the outer query. We do not care about its entire contents, just whether or not it contains a certain string.

groupArray accepts expressions as well as columns. Instead of holding an entire user_agents String Array in memory, we could hold an array of the boolean expressions we need. Because the array of booleans will take up many order of magnitudes less memory than the array of potentially-long strings, we can save a significant amount of memory this way.
So:

Turns into:

in the inner query.

In practice, with long user_agent strings the following query:

Took 150x less memory than using groupArray(user_agent)!

Conclusion

It is common and in some databases beneficial to think in terms of calculating results of a large query separately and then joining them together to deliver the final result set.

When using Hydrolix and ClickHouse, you should always be on the lookout for ways of combining multiple passes into one. The tools we used to do this were:

  • Aggregation Function Combinators (-If)
  • groupArray and ARRAY JOIN
  • Using expressions inside of groupArray for memory saving

Share Now