RSS

Using WITH ROLLUP to Efficiently Calculate SQL Percentages

Learn how to use SQL WITH ROLLUP to most efficiently calculate percentages, saving you time and compute resources.

Franz Knupfer

Antoine Thoumelin

Published:

Oct 30, 2023

5 minute read
,

Query Level Up

Query optimization is an essential part of working with big data—otherwise, you’ll run into issues with long-running queries, potential timeouts, and additional compute costs. Some optimizations will make your queries more efficient, while others can help you get more data with fewer passes. In this post, you’ll learn how to use WITH ROLLUP to calculate the percentage of each item in a data set in a single pass. When using WITH ROLLUP in Hydrolix, which is built to handle log data at scale, you get up to 5x better performance than using a CROSS JOIN to get the same data. 

Use Cases for WITH ROLLUP

You can use WITH ROLLUP to get a sum of all the values in a grouping. A common use case is summing total sales or sales by region, but there are many other use cases as well. With log data, you might want to group requests by region or provider while also getting a total summed value of requests.

WITH ROLLUP can help you calculate percentages, too. You can use the sum that WITH ROLLUP returns to calculate the overall percentage of each item in a column, all in one query.

By calculating percentages, you can see how requests are distributed across services—is there an even distribution or do resources need to be reallocated? Or perhaps you simply want to see, by percentage, what products your customers are using.

Regardless of your use case, you’ll be able to minimize the number of queries you need to make for calculating sums and percentages by using WITH ROLLUP. This is important when you’re working with big data, and even more so if you’re regularly compiling the data for dashboards or other visualizations. In these situations, you want to minimize the number of queries you need to make, saving you both time and compute.

Calculate the Percentage of Each Item Using WITH ROLLUP

Let’s take a look at an example that returns the percentage of each item in a data set. This use case is typical for a company that wants to better understand customer behavior. Specifically, a media company might want to break down streaming content usage by percentage. This is a simple example using the following dataset of contentIds:

contentId
Video_1
Video_2
Video_2
Video_2
Video_3
Video_3

For a dashboard, it might be helpful to have not just the total number of downloads for each video but also the overall percentage of downloads for each video. Having a percentage value can better help you compare the popularity of each item.

To try out the query, you’ll first need a temporary table that includes sample data. You can easily create a temporary table and test the queries in this post using Clickhouse Fiddle. Note that the query examples mostly use plain SQL but that the Nullable(String) type is specific to Clickhouse.

To follow along, create a temporary table with the following data:

Real world datasets will be much larger, but this is all you’ll need for the example.

Next, let’s take a look at a basic WITH_ROLLUP query that sums the content count of all the items.

This returns the following output:

contentIdcontent_count
Video_11
Video_23
Video_32
NULL6

In contrast, a simple count query without WITH_ROLLUP would return the following:

contentIdcontent_count
Video_11
Video_23
Video_32

WITH_ROLLUP adds an additional row with the sum of the content_count. Note it has a NULL value for its contentId, which makes sense because it’s an aggregation of all the contentIds and doesn’t have its own ID. This NULL value is important for the full query—more on that in a moment.

Sometimes a simple sum is all you need, but what if you want to calculate the percentage for each item in a result set? For instance, you might want to know what percentage of users watched video_1 versus video_2. Fortunately, all of the data for calculating the percentage of each item is already included in the WITH_ROLLUP query. Each contentId has its own content_count and there’s an additional row with a NULL contentId and a content_count that has the total sum of all the items. That means you can make use of the NULL value to do the following calculation in a query:

You simply need to divide the content_count value of each contentId by the content_count of the row that has a NULL contentId (the sum), and multiply that by 100, giving you the percentage for each item.

However, there is one gotcha that needs to be accounted for: ensuring that the only NULL contentId is for the row that WITH_ROLLUP generates. Fortunately, SQL provides COALESCE() to handle NULL values.

Let’s take a look at the full common table expression, which uses both COALESCE() and WITH_ROLLUP, to get both the count of each item as well as its overall percentage in a single pass.

The first part of the common table expression uses COALESCE() to change any NULL contentId values in the temporary result set to __missing__. This ensures that the row that includes the sum will be the only NULL contentId value.

Here’s the output of the query:

contentcontent_countpct_total
Video_1116.67
Video_3150
Video_2233.33

Note that this query doesn’t include the aggregated row from WITH_ROLLUP. You can easily add the aggregated value which includes the sum (and a pct_total of 100) by changing FROM CountPerContent WHERE content IS NOT NULL to FROM CountPerContent

Summary

By using WITH ROLLUP, you can get the percentage of each item in a data set in a single pass, saving you both time and computational resources. You’ll also see much better performance than you would with a CROSS JOIN.

Next Steps

WITH ROLLUP queries can help you more efficiently work with data at scale, including your log data. With Hydrolix, you get industry-leading data compression, lightning-fast queries, and unmatched scalability for your log data—all at a fraction of the cost of other observability platforms. Learn more about Hydrolix.

Photo by Bozhin Karaivanov

Share this post…

Ready to Start?

Cut data retention costs by 75%

Give Hydrolix a try or get in touch with us to learn more

Query optimization is an essential part of working with big data—otherwise, you’ll run into issues with long-running queries, potential timeouts, and additional compute costs. Some optimizations will make your queries more efficient, while others can help you get more data with fewer passes. In this post, you’ll learn how to use WITH ROLLUP to calculate the percentage of each item in a data set in a single pass. When using WITH ROLLUP in Hydrolix, which is built to handle log data at scale, you get up to 5x better performance than using a CROSS JOIN to get the same data. 

Use Cases for WITH ROLLUP

You can use WITH ROLLUP to get a sum of all the values in a grouping. A common use case is summing total sales or sales by region, but there are many other use cases as well. With log data, you might want to group requests by region or provider while also getting a total summed value of requests.

WITH ROLLUP can help you calculate percentages, too. You can use the sum that WITH ROLLUP returns to calculate the overall percentage of each item in a column, all in one query.

By calculating percentages, you can see how requests are distributed across services—is there an even distribution or do resources need to be reallocated? Or perhaps you simply want to see, by percentage, what products your customers are using.

Regardless of your use case, you’ll be able to minimize the number of queries you need to make for calculating sums and percentages by using WITH ROLLUP. This is important when you’re working with big data, and even more so if you’re regularly compiling the data for dashboards or other visualizations. In these situations, you want to minimize the number of queries you need to make, saving you both time and compute.

Calculate the Percentage of Each Item Using WITH ROLLUP

Let’s take a look at an example that returns the percentage of each item in a data set. This use case is typical for a company that wants to better understand customer behavior. Specifically, a media company might want to break down streaming content usage by percentage. This is a simple example using the following dataset of contentIds:

contentId
Video_1
Video_2
Video_2
Video_2
Video_3
Video_3

For a dashboard, it might be helpful to have not just the total number of downloads for each video but also the overall percentage of downloads for each video. Having a percentage value can better help you compare the popularity of each item.

To try out the query, you’ll first need a temporary table that includes sample data. You can easily create a temporary table and test the queries in this post using Clickhouse Fiddle. Note that the query examples mostly use plain SQL but that the Nullable(String) type is specific to Clickhouse.

To follow along, create a temporary table with the following data:

Real world datasets will be much larger, but this is all you’ll need for the example.

Next, let’s take a look at a basic WITH_ROLLUP query that sums the content count of all the items.

This returns the following output:

contentIdcontent_count
Video_11
Video_23
Video_32
NULL6

In contrast, a simple count query without WITH_ROLLUP would return the following:

contentIdcontent_count
Video_11
Video_23
Video_32

WITH_ROLLUP adds an additional row with the sum of the content_count. Note it has a NULL value for its contentId, which makes sense because it’s an aggregation of all the contentIds and doesn’t have its own ID. This NULL value is important for the full query—more on that in a moment.

Sometimes a simple sum is all you need, but what if you want to calculate the percentage for each item in a result set? For instance, you might want to know what percentage of users watched video_1 versus video_2. Fortunately, all of the data for calculating the percentage of each item is already included in the WITH_ROLLUP query. Each contentId has its own content_count and there’s an additional row with a NULL contentId and a content_count that has the total sum of all the items. That means you can make use of the NULL value to do the following calculation in a query:

You simply need to divide the content_count value of each contentId by the content_count of the row that has a NULL contentId (the sum), and multiply that by 100, giving you the percentage for each item.

However, there is one gotcha that needs to be accounted for: ensuring that the only NULL contentId is for the row that WITH_ROLLUP generates. Fortunately, SQL provides COALESCE() to handle NULL values.

Let’s take a look at the full common table expression, which uses both COALESCE() and WITH_ROLLUP, to get both the count of each item as well as its overall percentage in a single pass.

The first part of the common table expression uses COALESCE() to change any NULL contentId values in the temporary result set to __missing__. This ensures that the row that includes the sum will be the only NULL contentId value.

Here’s the output of the query:

contentcontent_countpct_total
Video_1116.67
Video_3150
Video_2233.33

Note that this query doesn’t include the aggregated row from WITH_ROLLUP. You can easily add the aggregated value which includes the sum (and a pct_total of 100) by changing FROM CountPerContent WHERE content IS NOT NULL to FROM CountPerContent

Summary

By using WITH ROLLUP, you can get the percentage of each item in a data set in a single pass, saving you both time and computational resources. You’ll also see much better performance than you would with a CROSS JOIN.

Next Steps

WITH ROLLUP queries can help you more efficiently work with data at scale, including your log data. With Hydrolix, you get industry-leading data compression, lightning-fast queries, and unmatched scalability for your log data—all at a fraction of the cost of other observability platforms. Learn more about Hydrolix.

Photo by Bozhin Karaivanov