In the previous blog post, Eating Our Own Logs, we saw how a Hydrolix cluster indexes its own logs.
Our logs contain lots of very useful information; in this blog post we'll see how we can use those logs to create a query performance monitoring dashboard.
Define KPIs
Before we start, we need to think a little about which KPIs to monitor. Some are pretty obvious:
- Query rate
- Error rate
- Response time
But what impacts response time? What other information do we have in our logs to help us investigate performance?
Monitoring the number of partitions used by a query is important, and looking at the bytes read, rows read, and memory usage for a given query is also critical.
Let's look at an example of a query log indexed into Hydrolix:
```json
{
  "app": "query-head",
  "bytes_read": "92666655",
  "component": "query_executor",
  "container": "query-head",
  "exec_time": "1435",
  "file": "/var/log/pods/hdxcli-linodepd_query-head-5f78476fb4-grzdt_a53d0bf4-8a06-4364-9a17-59b295da8337/query-head/0.log",
  "hdx_query_admin_comment": "akamai",
  "hydrolix_version": "v3.21.1",
  "index_stats": "[{\"akamai.logs\":{\"columns_read\":[\"apikey\",\"reqTimeSec\"],\"distributed_join_or_in\":false,\"indexes_used\":[\"reqTimeSec\"],\"shard_key_values_used\":[]}}]",
  "interface": "tcp",
  "kubernetes.container_id": "docker://957c6d1764a4b795cb7ce08920024e703b7ebce24294179d082630d0c52e9557",
  "kubernetes.container_image": "public.ecr.aws/l2i3s2a2/turbine:007524b1-90",
  "kubernetes.container_name": "query-head",
  "kubernetes.namespace_labels": {"kubernetes.io/metadata.name": "hdxcli-linodepd"},
  "kubernetes.pod_annotations": {"cluster-autoscaler.kubernetes.io/safe-to-evict": "true", "cni.projectcalico.org/containerID": "5bcbffe51e82caf81d36d9ac5dbed71b71d05c998e05cc006d3ad34525654732", "cni.projectcalico.org/podIP": "10.2.12.6/32", "cni.projectcalico.org/podIPs": "10.2.12.6/32", "hydrolix.io/metrics-info": "query-head:8088:/metrics", "hydrolix.io/service": "query-head"},
  "kubernetes.pod_ip": "10.2.12.6",
  "kubernetes.pod_ips": ["10.2.12.6"],
  "kubernetes.pod_labels": {"app": "query-head", "pod-template-hash": "5f78476fb4"},
  "kubernetes.pod_name": "query-head-5f78476fb4-grzdt",
  "kubernetes.pod_namespace": "hdxcli-linodepd",
  "kubernetes.pod_node_name": "lke67272-104387-634d7dbb276e",
  "kubernetes.pod_owner": "ReplicaSet/query-head-5f78476fb4",
  "kubernetes.pod_uid": "a53d0bf4-8a06-4364-9a17-59b295da8337",
  "level": "info",
  "memory_usage_bytes": "24413128",
  "num_partitions": "58",
  "num_peers": "6",
  "original_timestamp": "2022-10-20T10:39:35.299+00:00",
  "query": "SELECT toStartOfInterval(toDateTime(reqTimeSec), INTERVAL 2 second) as time, count(), apikey FROM akamai.logs WHERE reqTimeSec >= '1666260572' AND reqTimeSec <= '1666262372' \nAND 1=1\nAND 1=1\nAND 1=1\nAND 1=1\nAND 1=1\nAND 1=1\nGROUP BY time, apikey ORDER BY time ASC\nSETTINGS hdx_query_admin_comment='akamai'",
  "query_attempts": 0,
  "query_id": "d32f6dad-6744-4071-8b54-ed677c2c09d3",
  "query_start_time": "2022-10-20 10:39:33",
  "result_bytes": "1567669",
  "result_rows": "19758",
  "rows_read": "2030639",
  "source_type": "kubernetes_logs",
  "stream": "stdout",
  "timestamp": "2022-10-20 10:39:35.299",
  "turbine_revision": "007524b1",
  "user": "services@hydrolix.io"
}
```
As you can see, our logs contain lots of detailed information about query performance.
Now let's dive into Grafana and start building!
Creating the dashboard
Define dashboard filters
To identify bad query performance we want to be able to filter through our logs very quickly, for example to find the queries that use the most memory, read the most partitions, or have the worst response times.
Grafana lets you define variables; in our query performance monitoring dashboard we are using the following:

Let's take a look at the memory_usage_percentile variable:

It's a custom-type variable that lists the different percentiles we want to be able to filter on.
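For illustration, the custom variable's value list might look something like this (these percentiles are a hypothetical example, not the exact list from the dashboard):

```
0.5, 0.75, 0.9, 0.95, 0.99
```

Grafana turns a comma-separated list like this into the options shown in the dropdown.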
Now let's see how we can use that variable in a SQL statement to actually find the value in question.
To do that, we create a second Grafana variable, hidden from the user, which runs a SQL query to get the value:

Here we are creating a new variable called memory_usage_bytes, which runs the following SQL statement:
```sql
SELECT arrayJoin(quantiles(${memory_usage_percentile})(memory_usage_bytes))
FROM hydro.logs
WHERE component = 'query_executor'
AND app = 'query-head'
AND $__timeFilter_ms(timestamp)
SETTINGS hdx_query_max_timerange_sec=604800, hdx_query_max_execution_time=60, hdx_query_admin_comment='${__user.login}'
```
This SQL statement leverages our previous variable ${memory_usage_percentile}: when the user selects the 0.95 value, for example, it automatically calculates the 95th percentile of memory usage.
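As a sketch of the substitution, once Grafana replaces the variable with a selection of 0.95, the hidden variable effectively runs (SETTINGS omitted here for brevity):

```sql
SELECT arrayJoin(quantiles(0.95)(memory_usage_bytes))
FROM hydro.logs
WHERE component = 'query_executor'
AND app = 'query-head'
AND $__timeFilter_ms(timestamp)
```

The single value it returns becomes the memory threshold the dashboard filters on.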
The same principle is applied for num_partitions_percentile, but this time using the num_partitions column.
Response time is another example, where we define the intervals we want to filter on:

Here we define intervals where the first part is what users will select and the second part is the actual value assigned to the response_time variable:

```
<1s : exec_time < 1000,
1s–2s : exec_time BETWEEN 1000 AND 2000,
2s–5s : exec_time BETWEEN 2000 AND 5000,
5s–10s : exec_time BETWEEN 5000 AND 10000,
10s–20s : exec_time BETWEEN 10000 AND 20000,
20s–30s : exec_time BETWEEN 20000 AND 30000,
30s–1m : exec_time BETWEEN 30000 AND 60000,
>1m : exec_time > 60000
```
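To sketch how the substitution plays out: exec_time is in milliseconds, so if a user selects the 1s–2s option, the $response_time variable resolves to its right-hand side and the resulting filter in any query that references it effectively becomes:

```sql
AND exec_time BETWEEN 1000 AND 2000
```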
Hydrolix queries support lots of different options, such as an admin comment; Grafana and other tools can leverage those to include information about users (see more details in our docs). This allows our performance monitoring dashboard to filter on specific users too.

Here we are creating a new variable called user, which lists the hdx_query_admin_comment values using the following statement:
```sql
SELECT hdx_query_admin_comment
FROM hydro.logs
WHERE component = 'query_executor'
AND app = 'query-head'
AND $__timeFilter_ms(timestamp)
GROUP BY hdx_query_admin_comment
SETTINGS hdx_query_max_timerange_sec=604800, hdx_query_max_execution_time=60, hdx_query_admin_comment='${__user.login}'
```
And finally, we have a free-text search to look for specific patterns in queries or errors; these searches leverage our full-text search capability.
Create charts
Now that we have defined all the criteria we want to use to filter our data, let's create a chart leveraging all those variables.

The first chart counts the number of queries, broken down by whether each was a success or an error. It leverages the following statement:

```sql
SELECT $__timeInterval(timestamp) as time, count(), level
FROM hydro.logs
WHERE component = 'query_executor'
AND app = 'query-head'
AND $__timeFilter_ms(timestamp)
AND $__conditionalAll(memory_usage_bytes >= $memory_usage_bytes, $memory_usage_percentile)
AND $__conditionalAll($response_time, $response_time)
AND $__conditionalAll(num_partitions >= $num_partitions, $num_partitions_percentile)
AND $__conditionalAll(level IN ($query_status), $query_status)
AND $__conditionalAll(hdx_query_admin_comment IN ($user), $user)
AND if('${query_search:text}' = '', true, query LIKE '%${query_search}%')
AND if('${error_search:text}' = '', true, stack_trace LIKE '%${error_search}%')
GROUP BY time, level
ORDER BY time ASC
SETTINGS hdx_query_max_timerange_sec=604800, hdx_query_max_execution_time=60, hdx_query_admin_comment='${__user.login}'
```
This query leverages several built-in Grafana macros:
- $__timeFilter_ms adds the time range filter from the dashboard.
- $__timeInterval generates the time interval to group the data by.
- $__conditionalAll replaces the condition with 1=1 if the variable is set to All.
So in our example we are leveraging an interesting principle:

AND $__conditionalAll(memory_usage_bytes >= $memory_usage_bytes, $memory_usage_percentile)

This means that if the variable $memory_usage_percentile is set to All we do nothing, but if it is set to anything else we run the SQL condition:

AND memory_usage_bytes >= $memory_usage_bytes

where $memory_usage_bytes is our hidden variable, which automatically calculates the memory usage value for the selected percentile.
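To make the two cases concrete, here is how the macro expands in the final query sent to Hydrolix (the 24000000 threshold is a made-up value standing in for whatever the hidden variable computed):

```sql
-- $memory_usage_percentile = All: the condition becomes a no-op
AND 1=1

-- $memory_usage_percentile = 0.95 (for example): the condition is kept,
-- with the hidden $memory_usage_bytes variable already resolved
AND memory_usage_bytes >= 24000000
```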
Another interesting concept is the following:

AND if('${query_search:text}' = '', true, query LIKE '%${query_search}%')

This time we leverage SQL's if() function: if the variable ${query_search:text} is empty, the condition evaluates to true and the LIKE '%${query_search}%' filter is skipped. If the variable is not empty, the equality check is false and the LIKE '%${query_search}%' filter is evaluated instead.
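For example, with a hypothetical search term of timeout, the predicate expands to:

```sql
AND if('timeout' = '', true, query LIKE '%timeout%')
```

Since 'timeout' = '' is false, if() returns the result of the LIKE, so only log lines whose query text contains the search term pass the filter.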