Query optimization is an essential part of working with big data—otherwise, you’ll run into issues with long-running queries, potential timeouts, and additional compute costs. Some optimizations will make your queries more efficient, while others can help you get more data with fewer passes. In this post, you’ll learn how to use WITH ROLLUP
to calculate the percentage of each item in a data set in a single pass. When using WITH ROLLUP
in Hydrolix, which is built to handle log data at scale, you get up to 5x better performance than using a CROSS JOIN
to get the same data.
Use Cases for WITH ROLLUP
You can use WITH ROLLUP
to get a sum of all the values in a grouping. A common use case is summing total sales or sales by region, but there are many other use cases as well. With log data, you might want to group requests by region or provider while also getting a total summed value of requests.
WITH ROLLUP
can help you calculate percentages, too. You can use the sum that WITH ROLLUP
returns to calculate the overall percentage of each item in a column, all in one query.
By calculating percentages, you can see how requests are distributed across services—is there an even distribution or do resources need to be reallocated? Or perhaps you simply want to see, by percentage, what products your customers are using.
Regardless of your use case, you’ll be able to minimize the number of queries you need to make for calculating sums and percentages by using WITH ROLLUP
. This is important when you’re working with big data, and even more so if you’re regularly compiling the data for dashboards or other visualizations. In these situations, you want to minimize the number of queries you need to make, saving you both time and compute.
Calculate the Percentage of Each Item Using WITH ROLLUP
Let’s take a look at an example that returns the percentage of each item in a data set. This use case is typical for a company that wants to better understand customer behavior. Specifically, a media company might want to break down streaming content usage by percentage. This is a simple example using the following dataset of contentId
s:
contentId |
Video_1 |
Video_2 |
Video_2 |
Video_2 |
Video_3 |
Video_3 |
For a dashboard, it might be helpful to have not just the total number of downloads for each video but also the overall percentage of downloads for each video. Having a percentage value can better help you compare the popularity of each item.
To try out the query, you’ll first need a temporary table that includes sample data. You can easily create a temporary table and test the queries in this post using Clickhouse Fiddle. Note that the query examples mostly use plain SQL but that the Nullable(String)
type is specific to Clickhouse.
To follow along, create a temporary table with the following data:
1 2 | CREATE TEMPORARY TABLE ContentTable (contentId Nullable(String)); INSERT INTO ContentTable VALUES (‘Video_1’), (‘Video_2’), (‘Video_2’), (‘Video_2’), (‘Video_3’), (‘Video_3’); |
Real world datasets will be much larger, but this is all you’ll need for the example.
Next, let’s take a look at a basic WITH_ROLLUP
query that sums the content count of all the items.
1 2 3 4 5 6 | SELECT contentId, count() AS content_count FROM ContentTable GROUP BY contentId WITH ROLLUP |
This returns the following output:
contentId | content_count |
Video_1 | 1 |
Video_2 | 3 |
Video_3 | 2 |
NULL | 6 |
In contrast, a simple count query without WITH_ROLLUP
would return the following:
contentId | content_count |
Video_1 | 1 |
Video_2 | 3 |
Video_3 | 2 |
WITH_ROLLUP
adds an additional row with the sum of the content_count
. Note it has a NULL
value for its contentId
, which makes sense because it’s an aggregation of all the contentId
s and doesn’t have its own ID. This NULL
value is important for the full query—more on that in a moment.
Sometimes a simple sum is all you need, but what if you want to calculate the percentage for each item in a result set? For instance, you might want to know what percentage of users watched video_1
versus video_2
. Fortunately, all of the data for calculating the percentage of each item is already included in the WITH_ROLLUP
query. Each contentId
has its own content_count
and there’s an additional row with a NULL
contentId
and a content_count
that has the total sum of all the items. That means you can make use of the NULL
value to do the following calculation in a query:
1 | ROUND(content_count*100 / (SELECT content_count FROM CountPerContent WHERE content IS NULL), 2) as pct_total |
You simply need to divide the content_count
value of each contentId
by the content_count
of the row that has a NULL
contentId
(the sum), and multiply that by 100, giving you the percentage for each item.
However, there is one gotcha that needs to be accounted for: ensuring that the only NULL
contentId
is for the row that WITH_ROLLUP
generates. Fortunately, SQL provides COALESCE()
to handle NULL
values.
Let’s take a look at the full common table expression, which uses both COALESCE()
and WITH_ROLLUP
, to get both the count of each item as well as its overall percentage in a single pass.
1 2 3 4 5 6 7 8 9 10 11 12 13 | WITH CountPerContent AS ( SELECT COALESCE(contentId, ‘__missing__’)::Nullable(String) as content, count() as content_count FROM ContentTable GROUP BY content WITH ROLLUP ) SELECT content, content_count, ROUND(content_count*100 / (SELECT content_count FROM CountPerContent WHERE content IS NULL), 2) as pct_total FROM CountPerContent WHERE content IS NOT NULL |
The first part of the common table expression uses COALESCE()
to change any NULL
contentId
values in the temporary result set to __missing__
. This ensures that the row that includes the sum will be the only NULL
contentId
value.
Here’s the output of the query:
content | content_count | pct_total |
Video_1 | 1 | 16.67 |
Video_3 | 1 | 50 |
Video_2 | 2 | 33.33 |
Note that this query doesn’t include the aggregated row from WITH_ROLLUP
. You can easily add the aggregated value which includes the sum (and a pct_total
of 100) by changing FROM CountPerContent WHERE content IS NOT NULL
to FROM CountPerContent
.
Summary
By using WITH ROLLUP
, you can get the percentage of each item in a data set in a single pass, saving you both time and computational resources. You’ll also see much better performance than you would with a CROSS JOIN
.
Next Steps
WITH ROLLUP
queries can help you more efficiently work with data at scale, including your log data. With Hydrolix, you get industry-leading data compression, lightning-fast queries, and unmatched scalability for your log data—all at a fraction of the cost of other observability platforms. Learn more about Hydrolix.
Photo by Bozhin Karaivanov