In this blog post we’ll see how to leverage Clickhouse plugin and Grafana to create really performant dashboard on Hydrolix data platform.
Dashboard Variables
Grafana has a concept of variable built-into the dashboard, those variables can then be used to filter the data. We recommend limiting those variables to either fixed value pre-defined in advance, or limit the value selection to the timeframe of the dashboard.
Lets use HTTP status codes as an example, we know in advance all the status code that exists, we don’t necessarily need to query the data in Hydrolix to get those. What we can do is the following query:
1 |
SELECT arrayJoin([200,201,202,204,205,206,302,304,307,308,400,401,403,404,405,406,408,409,410,411,412,413,414,415,416,417,421,425,428,429,431,451,500,501,502,503,504,505,506,511]) |
This will generate a list of HTTP status codes directly in the variable:
This concept is really important and interesting, as it avoids queries on cloud storage to get those values.
Indeed if we were doing query on the column status_code it would requires downloading part of the partition to list the different status code. With that example we are leveraging only local data to generate the list of http status code.
If you need to have some variable values defined by querying the Hydrolix data (for example User-Agent string) you have to limit the selection of those values to the time range of your dashboard. Otherwise this will generate a full column scan which requires opening every partition and is very expensive (not to mention offer users values outside of the time bounds).
1 |
SELECT user_agent FROM project.table WHERE $__timeFilter_ms(timestamp) GROUP BY user_agent ORDER BY COUNT() DESC LIMIT 50 |
Here we are leveraging the built-in macro $__timeFilter
(for seconds resolution), $__timeFilter_ms
(for milliseconds resolution). This built-in macro will automatically add the time range filtering using the column specified.
In my example we are using a timestamp column name and it’s stored in milliseconds.
This example also avoids using SELECT DISTINCT but uses GROUP BY instead, it’s more efficient and less memory consuming.
Last example is using a dictionary file to SELECT values. If you are using dictionary with Hydrolix you can specify the filename uploaded to your cluster, you can then leverage the following statement to open the dictionary file and use it to select value:
1 |
SELECT DISTINCT country from file('country.csv', 'CSVWithNames') |
In this example it means we have uploaded a dictionary file called country.csv and it’s a CSVWithNames that Hydrolix is reading locally. This approach is very useful when you have a lot of values, as we don’t open data from cloud storage and dictionary are in memory.
Improve your SQL statement
Do not include variable not used in your SQL Statement, otherwise it’ll impact the performance of your charts.
If you use a drop down filter variable with multiple elements you can use the built-in variable $__conditionalAll
It’ll check that the variable is All
, if all is selected it’ll then replace the statement with 1=1.
If you don’t leverage this function then the SQL statement generated will automatically include all the possible value from the filter, which makes it very difficult to read and expensive.
Example of usage:
1 |
AND $__conditionalAll(status_code IN (${http_status}), $http_status) |
This example means that if the users SELECT all the http status code I won’t have any filters, the SQL statement generated will be:
1 |
AND 1=1 |
If the variable is selecting 200 and 304 status code then the statement will be rewritten as:
1 |
AND status_code IN (200, 304) |
Unfortunately text-box don’t work with $__conditionalAll
, so if you want to use those a free text search and filter with a regex or LIKE
function for example you need to use the following in your query:
1 |
AND if ('${session_id:text}' = '', true, session_id LIKE '%${session_id:text}%') |
With the following text box:
If the value of the text box is empty we are not applying the SQL statement session_id LIKE ‘%${session_id:text}%’.
If the value is different than empty then we’ll apply the SQL statement.
A full query example would be something like:
1 2 3 4 5 6 |
SELECT count() FROM table.column WHERE $__timeFilter_ms(timestamp) AND $__conditionalAll(status_code IN (${http_status}), $http_status) AND $__conditionalAll(user_agent IN (${user_agent}), $user_agent) AND $__conditionalAll(country IN (${country}), $country) AND if ('${session_id:text}' = '', true, session_id LIKE '%${session_id:text}%') |
This leverage all the different variables define above.
Add username in query settings
Hydrolix provides a real time view of the query running in the cluster, we can add extra information to the query running by specifying extra information in the SETTINGS of the query.
In this view you can view the query running and also the username in the admin_comment section. To add the username automatically you can modify your SQL statement and append the following:
1 |
SETTINGS hdx_query_admin_comment='${__user.login} |
This will get the login used by the user and add that information as an admin comment which is then displayed in the active query section.
So to summarise your query for your chart should look like the following:
1 2 3 4 5 6 7 |
SELECT count() FROM table.column WHERE $__timeFilter_ms(timestamp) AND $__conditionalAll(status_code IN (${http_status}), $http_status) AND $__conditionalAll(user_agent IN (${user_agent}), $user_agent) AND $__conditionalAll(country IN (${country}), $country) AND if ('${session_id:text}' = '', true, session_id LIKE '%${session_id:text}%') SETTINGS hdx_query_admin_comment='${__user.login}' |