The Hydrolix platform has been specifically designed using a number of characteristics that optimize storage for VERY large, high cardinality datasets all while using a distributed storage medium (such as S3, Google Cloud Storage etc). The ability to store more data for longer means that some traditional assumptions made when considering OLTP databases (Postgres, MySQL) can lead to sub-optimal query execution and there can be significantly faster and more effective ways to retrieve information.
As an example GROUP BY s are often used in building queries within OLTP databases with multiple executed to filter and aggregate down a set of rows to the data required. This can be a good ploy with smaller datasets, however there are ways that can be employed within the Hydrolix platform to optimize these queries, so they are executed in a single pass, making them much faster and significantly more efficient. To illustrate this the following tutorial is provided:
The Dataset
We’ll start by building a Temporary Table to build a sample dataset we can play around with.
In this tutorial we will:
- Build a query on this sample table that groups elements in two different ways, and delivers a final report.
- Tune this query for significant speed and memory usage gains when run on Hydrolix.
Feel free to run these queries yourself on your own Hydrolix cluster (I recommend using the clickhouse-client, quick installation instructions here).
NOTE: The Temporary table will be removed as soon as your session is completed.
1 2 3 4 5 6 7 8 9 |
CREATE TEMPORARY TABLE page_hits ( timestamp DateTime, country Nullable(String), user_agent Nullable(String), session_id Nullable(String), isp Nullable(String), region Nullable(String) ); INSERT INTO page_hits VALUES ('2022-06-16 20:34:29','US','...Chrome/65.0.3325.181...','123','Verizon','Region-East'),('2022-06-16 20:34:30','US','...Chrome/65.0.3325.181...','123','Verizon','Region-East'),('2022-06-16 20:34:31','US','...Chrome/65.0.3325.181...','123','Verizon','Region-East'),('2022-06-16 20:34:32','US','...Chrome/65.0.3325.181...','234','Verizon','Region-East'),('2022-06-16 20:34:33','US','...Safari/537.36...','345','AT&T','Region-East'),('2022-06-16 20:34:29','FR','...Chrome/65.0.3325.181...','890','AT&T','Region-West')('2022-06-16 20:34:34','FR','...Firefox/54.0...','456','Spectrum','Region-West'),('2022-06-16 20:34:32','FR','...Firefox/54.0...','456','Spectrum','Region-West'),('2022-06-16 20:34:34','UK','...Safari/537.36...','567','Spectrum','Region-West'), ('2022-06-16 20:35:31','US','...Chrome/65.0.3325.181...','231','Verizon','Region-East'),('2022-06-16 20:35:32','US','...Chrome/65.0.3325.181...','214','Verizon', 'Region-East'),('2022-06-16 20:35:33','US','...Safari/537.36...','341','AT&T','Region-West'),('2022-06-16 20:35:34','FR','...Firefox/54.0...','461','Spectrum','Region-West'),('2022-06-16 20:34:29','US','...Chrome/65.0.3325.181...','012','Spectrum','Region-West'),('2022-06-16 20:35:34','UK','...Safari/537.36...','157','Xfinity','Region-West'),('2022-06-16 20:35:34','US','...Safari/537.36...','177','Xfinity','Region-West'),('2022-06-16 20:35:40','LB','...Safari/537.36...','159','UncommonISP','Region-West'), ('2022-06-16 20:35:34','CH','...Safari/537.36...','159','VeryUncommonISP','Region-West'), ('2022-06-16 20:35:34','CH','...Safari/537.36...','159','VeryUncommonISP','Region-West'); |
Each event in this table is a page hit to website. Every hit has a certain session id, ISP (Internet Service Provider), user agent, country and availability region associated with it.
Timestamp | Country | user_agent | session_id | isp | region |
---|---|---|---|---|---|
2022-06-16 20:34:29 | US | …Chrome/65.0.3325.181… | 123 | Verizon | East |
2022-06-16 20:34:30 | US | …Chrome/65.0.3325.181… | 123 | Verizon | East |
2022-06-16 20:34:31 | US | …Chrome/65.0.3325.181… | 123 | Verizon | East |
2022-06-16 20:34:32 | US | …Chrome/65.0.3325.181… | 234 | Verizon | East |
2022-06-16 20:34:33 | US | …Safari/537.36… | 345 | AT&T | East |
2022-06-16 20:34:29 | FR | …Chrome/65.0.3325.181… | 890 | AT&T | West |
2022-06-16 20:34:34 | FR | …Firefox/54.0… | 456 | Spectrum | West |
2022-06-16 20:34:32 | FR | …Firefox/54.0… | 456 | Spectrum | West |
2022-06-16 20:34:34 | UK | …Safari/537.36… | 567 | Spectrum | West |
2022-06-16 20:35:31 | US | …Chrome/65.0.3325.181… | 231 | Verizon | East |
2022-06-16 20:35:32 | US | …Chrome/65.0.3325.181… | 214 | Verizon | East |
2022-06-16 20:35:33 | US | …Safari/537.36… | 341 | AT&T | West |
2022-06-16 20:35:34 | FR | …Firefox/54.0… | 461 | Spectrum | West |
2022-06-16 20:34:29 | US | …Chrome/65.0.3325.181… | 012 | Spectrum | West |
2022-06-16 20:35:34 | UK | …Safari/537.36… | 157 | Xfinity | West |
2022-06-16 20:35:34 | US | …Safari/537.36… | 177 | Xfinity | West |
2022-06-16 20:35:40 | LB | …Safari/537.36… | 159 | UncommonISP | West |
2022-06-16 20:35:34 | CH | …Safari/537.36… | 159 | VeryUncommonISP | West |
2022-06-16 20:35:34 | CH | …Safari/537.36… | 159 | VeryUncommonISP | West |
The problem
Imagine you’ve been asked to come up with the following report:
- For every ISP, report the percentage of page hits from unique session ids that occur outside of the US
- Report the popularity (% page hits) of the browsers
Safari
,Chrome
, andFirefox
, per ISP, availability region. - Only return results for ISPs that contain more than 1 unique session
Notice from the table that:
- Multiple hits can come from the same session (session is not unique).
- Page hits can exist in one of two availability regions
East
, orWest
. UncommonISP
andVeryUncommonISP
are the only two ISPs with a single unique session id, and should be filtered out from our final result set.
The First Approach
We will try to build a pipeline, with a separate query for each part of the report.
Get list of ISPs w/ more than 1 unique session
1 2 3 4 |
SELECT isp FROM page_hits GROUP BY isp HAVING uniq(session_id) > 1 |
Get Percentage of Unique, Outside-US Sessions Per-ISP
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
SELECT arraySum(non_us_uniq_sessions) / total_uniq_per_isp AS outside_us_hits_pct, isp FROM ( SELECT sum(per_country_uniq) AS total_uniq_per_isp, groupArray(country) AS countries, groupArray(per_country_uniq) AS uniq_sessions, arrayFilter((s,c) -> c != 'US', uniq_sessions, countries) AS non_us_uniq_sessions, isp FROM ( SELECT uniq(session_id) AS per_country_uniq, country, isp FROM page_hits GROUP BY isp, country ) GROUP BY isp ) |
outside_us_hits_pct | isp |
---|---|
1 | VeryUncommonISP |
0 | Verizon |
0.5 | Xfinity |
1 | UncommonISP |
0.3333333333333 | AT&T |
0.75 | Spectrum |
This query does the following:
- Starting at the innermost query, get unique sessions for every country and isp
- Group countries and unique-sessions-per-country into arrays, for each isp
- Use the multi-argument version of arrayFilter to filter one array by another. Every time an element in the
countries
array is !=US
, the corresponding unique session count is omitted from theuniq_session
array. - In the outermost query, sum up the unique sessions array values to get total for non-us countries, and divide by all unique for ISP to find percentage.
Get Browser Popularity by ISP, Region
1 2 3 4 5 6 7 8 |
SELECT countIf(position(user_agent, 'Chrome') != 0) / count() as chrome_pct, countIf(position(user_agent, 'Safari') != 0) / count() as safari_pct, countIf(position(user_agent, 'Firefox') != 0) / count() as firefox_pct, isp, region FROM page_hits GROUP BY isp, region |
chrome_pct | safari_pct | firefox_pct | isp | region |
---|---|---|---|---|
0.5 | 0.5 | 0 | AT&T | West |
0 | 1 | 0 | Xfinity | West |
0.2 | 0.2 | 0.6 | Spectrum | West |
1 | 0 | 0 | Verizon | East |
0 | 1 | 0 | UncommonISP | West |
0 | 1 | 0 | AT&T | East |
0 | 1 | 0 | VeryUncommonISP | West |
Uses countIf
function to only count sessions for an ISP and region that contain specific browser strings in their user_agent
Putting it together using CTE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 |
WITH filtered_isps AS ( SELECT isp FROM page_hits GROUP BY isp HAVING uniq(session_id) > 1 ) , outside_us_hits AS ( SELECT arraySum(non_us_uniq_sessions) / total_uniq_per_isp AS outside_us_hits_pct, isp FROM ( SELECT sum(per_country_uniq) AS total_uniq_per_isp, groupArray(country) AS countries, groupArray(per_country_uniq) AS uniq_sessions, arrayFilter((s,c) -> c != 'US', uniq_sessions, countries) AS non_us_uniq_sessions, isp FROM ( SELECT uniq(session_id) AS per_country_uniq, country, isp FROM page_hits GROUP BY isp, country ) GROUP BY isp ) ) , browser_popularity AS ( SELECT countIf(position(user_agent, 'Chrome') != 0) / count() as chrome_pct, countIf(position(user_agent, 'Safari') != 0) / count() as safari_pct, countIf(position(user_agent, 'Firefox') != 0) / count() as firefox_pct, isp, region FROM page_hits GROUP BY isp, region ) SELECT chrome_pct, safari_pct, firefox_pct, isp, region, outside_us_hits_pct FROM ( SELECT * FROM ( SELECT * FROM outside_us_hits INNER JOIN (select * from browser_popularity) USING (isp) ) INNER JOIN (select isp from filtered_isps) USING (isp) ) |
chrome_pct | safari_pct | firefox_pct | isp | region | outside_us_hits_pct |
---|---|---|---|---|---|
1 | 0 | 0 | Verizon | East | 0 |
0 | 1 | 0 | Xfinity | West | 0.5 |
0.5 | 0.5 | 0 | AT&T | West | 0.333333333 |
0 | 1 | 0 | AT&T | East | 0.333333333 |
0.2 | 0.2 | 0.6 | Spectrum | West | 0.75 |
The Hydrolix-Friendly Approach
We will rewrite that above query in a way that
- Avoids JOINs
- Avoids multiple trips to grab data
- Applies functions on columns
Percentage of Unique, Non-US Sessions Per-ISP Using uniqIf
1 2 3 4 5 |
SELECT uniqIf(session_id, country != 'US') / uniq(session_id) AS outside_us_hits_pct, isp FROM page_hits GROUP BY isp |
Here we’ve rewritten the previous 23-line version of this query to just a few lines, using very handy ClickHouse feature: Aggregation Combinators
Many of you may have used countIf
, but might not know that any ClickHouse aggregation function can include an -If
suffix to only perform the function if the passed-in predicate is true.
We are thus able to both filter and aggregate the column session_id
, in a single function call. Notice that this cannot be done by adding WHERE country != 'US'
— we then lose the ability to grab all unique sessions for isp in the same pass.
tip:
Whenever you find yourself using a ClickHouse aggregation function inside a subquery, ask yourself if adding -If
would get rid of complexity, and perhaps the subquery altogether.
Browser Popularity + Percent-Outside-US Queries in the Same Pass with ARRAY JOIN
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
WITH outside_us_hits AS ( SELECT uniqIf(session_id, country != 'US') / uniq(session_id) AS outside_us_hits_pct, groupArray(user_agent) as user_agents, groupArray(region) as regions, isp FROM page_hits GROUP BY isp HAVING uniq(session_id) > 1 ) SELECT any(outside_us_hits_pct) as outside_us_hits_pct, countIf(position(user_agent, 'Chrome') != 0) / count() as chrome_pct, countIf(position(user_agent, 'Safari') != 0) / count() as safari_pct, countIf(position(user_agent, 'Firefox') != 0) / count() as firefox_pct, isp, region FROM outside_us_hits ARRAY JOIN user_agents as user_agent, regions as region GROUP BY isp, region |
In some respects, I find ARRAY JOIN
unfortunately named. While JOIN
(INNER
, OUTER
, etc) should be avoided in ClickHouse, ARRAY JOIN
should be sought out!
Previously, two separate queries that performed aggregations on different GROUP BY
s were presented in the same dataset using INNER JOIN
. This is a very common pattern, however in Hydrolix, it can slow down queries.
Instead, the Percent-Outside-US query is run as it was before, but now keeps track of other columns (user_agent
, and region
) that we’ll need in the outer query.
The ARRAY JOIN
“plops” the user_agents
and regions
arrays as new columns in the table, right next to the others. Because the table is columnar this is a trivial operation.ARRAY JOIN
does not need to keep track of any key that is the same in both tables, and is an entirely different operation (and much faster) than the INNER JOIN
used to combine the result set in the previous query.
Further Improvements
While groupArray
can offer significant improvements to a query, it can also blow up memory. The user_agent
s included in the page_hits
table are surrounded by ...
to denote they can be of arbitrary length. Because groupArray
holds the array of values it has grouped in memory for some period of time (until it is ARRAY JOIN
ed back into the table) this could cause Out-of-Memory issues in certain scenarios.
Here’s nice trick to use as workaround:
Notice we are only using the user_agent
string in boolean expressions in the outer query. We do not care about its entire contents, just whether or not it contains a certain string.
groupArray
accepts expressions as well as columns. Instead of holding an entire user_agents
String Array in memory, we could hold an array of the boolean expressions we need. Because the array of booleans will take up many order of magnitudes less memory than the array of potentially-long strings, we can save a significant amount of memory this way.
So:
1 |
groupArray(user_agent) as user_agents, |
Turns into:
1 2 3 |
groupArray(position(user_agent, 'Chrome') != 0) as chrome_user_agents, groupArray(position(user_agent, 'Safari') != 0) as safari_user_agents, groupArray(position(user_agent, 'Firefox') != 0) as firefox_user_agents, |
in the inner query.
In practice, with long user_agent
strings the following query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
WITH outside_us_hits AS ( SELECT uniqIf(session_id, country != 'US') / uniq(session_id) AS outside_us_hits_pct, groupArray(position(user_agent, 'Chrome') != 0) as chrome_user_agents, groupArray(position(user_agent, 'Safari') != 0) as safari_user_agents, groupArray(position(user_agent, 'Firefox') != 0) as firefox_user_agents, groupArray(region) as regions, isp FROM page_hits GROUP BY isp HAVING uniq(session_id) > 1 ) SELECT any(outside_us_hits_pct) as outside_us_hits_pct, countIf(chrome_user_agent) / count() as chrome_pct, countIf(safari_user_agent) / count() as safari_pct, countIf(firefox_user_agent) / count() as firefox_pct, isp, region FROM outside_us_hits ARRAY JOIN chrome_user_agents as chrome_user_agent, safari_user_agents as safari_user_agent, firefox_user_agents as firefox_user_agent, regions as region GROUP BY isp, region |
Took 150x less memory than using groupArray(user_agent)
!
Conclusion
It is common and in some databases beneficial to think in terms of calculating results of a large query separately and then joining them together to deliver the final result set.
When using Hydrolix and ClickHouse, you should always be on the lookout for ways of combining multiple passes into one. The tools we used to do this were:
- Aggregation Function Combinators (
-If
) groupArray
andARRAY JOIN
- Using expressions inside of
groupArray
for memory saving