In this post, we will show you how to analyze the Fastly CDN logs you have streamed to Hydrolix.
This post is part of a series showing how to use Hydrolix and an open source dashboard to maximize your Fastly CDN observability quickly, cheaply, and in your own VPC. Before following these quick instructions for analyzing Fastly transaction logs using Hydrolix, check out how to configure Fastly to stream logs into Hydrolix with an HTTPS endpoint and how to configure the Hydrolix streaming intake for those logs. Once you are querying the logs, you can also set up a dashboard to visualize your queries. You can also always refer to our Hydrolix documentation.
This short tutorial covers a few example queries designed to tell us a bit more about the Fastly log data we are collecting. They are meant to give you a sense of how to observe your Fastly CDN infrastructure with Hydrolix. You can certainly write your own ClickHouse queries better suited to your needs, and we encourage you to reach out to us if you need any help.
The data that we will be querying in these examples is stored in the fastly.xlogs table.
Defining the dataset
Let's start by understanding a bit more about the data we are working with. The extended log format adds quite a few more columns, so visual confirmation of the table structure is always useful. We can employ a simple DESCRIBE command to take a closer look at the table definition.
DESCRIBE fastly.xlogs
Returns the table definition:
cache_status String
client_ip String
content_type String
geo_city String
geo_continent_code String
geo_country_code String
geo_datacenter String
geo_region String
h2_stream_id String
host String
is_cacheable Nullable(UInt8)
is_h2 Nullable(UInt8)
is_h2_push Nullable(UInt8)
is_ipv6 Nullable(UInt8)
is_tls Nullable(UInt8)
origin_host String
protocol String
req_body_size Nullable(UInt32)
req_header_size Nullable(UInt32)
request String
request_accept_charset String
request_accept_content String
request_accept_encoding String
request_accept_language String
request_cache_control String
request_connection String
request_dnt String
request_forwarded String
request_referer String
request_user_agent String
request_via String
request_x_forwarded_for String
request_x_requested_with String
resp_body_size Nullable(UInt32)
resp_header_size Nullable(UInt32)
response_age Nullable(UInt32)
response_cache_control String
response_expires String
response_last_modified String
response_tsv String
service_id String
service_version String
socket_cwnd Nullable(UInt32)
socket_nexthop String
socket_ploss Nullable(Float64)
socket_tcpi_delta_retrans Nullable(UInt32)
socket_tcpi_last_data_sent Nullable(UInt32)
socket_tcpi_rcv_mss Nullable(UInt32)
socket_tcpi_rcv_rtt Nullable(UInt32)
socket_tcpi_rcv_space Nullable(UInt32)
socket_tcpi_rtt Nullable(UInt32)
socket_tcpi_rttvar Nullable(UInt32)
socket_tcpi_snd_mss Nullable(UInt32)
socket_tcpi_total_retrans Nullable(UInt32)
status String
time_elapsed Nullable(UInt32)
time_end Nullable(DateTime)
time_start DateTime
tls_client_cipher String
tls_client_cipher_sha String
tls_client_protocol String
tls_client_servername String
tls_client_tlsexts_sha String
url String
As background for this example: we recently moved to an extended log format for all of our Fastly real-time streaming logs. The following query establishes exactly when that cutover took place, and also provides the most recent record and the overall row count. The min() and max() aggregate functions, applied to our selected datetime column, time_start, give us the earliest and latest records in the data.
SELECT min(time_start), max(time_start), count()
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
Find the top N requested urls per week
The url column represents page requests, which can contain query string parameters and references to static assets. We can clean up the urls at query time with cutQueryString(url) to strip any query string params, filter out the static assets with match(string, pattern) to exclude unwanted results, and then group by toStartOfWeek(time_start) and apply the topK(N)(column) function to calculate the top values. A sub-query does the filtering, cleaning, and time grouping.
SELECT week, topK(10)(path)
FROM (
    SELECT toStartOfWeek(time_start) AS week,
        cutQueryString(url) AS path
    FROM fastly.xlogs
    WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
        AND match(url, '^.*\.(css$|js$|json$|jpg$|png$|svg$|xml$)') != 1
    GROUP BY week, path)
GROUP BY week
Find the top 3 slowest urls per host
While it's certainly useful to know the top requested urls for the week, data becomes significantly more actionable with context. What would happen if our top requested urls were also our slowest? They probably wouldn't be the top N urls for long. As such, we may also want to keep an eye out for potentially problematic urls by monitoring response times, so that we can identify issues proactively and resolve them before they become a problem.
This query is a bit different in that it looks at the full urls with the highest time_elapsed values since the start of the week (time_elapsed is a raw microsecond value, which the query divides by 1000 to get milliseconds). In this case, it's useful to have full visibility into the entire url, since we are looking to identify areas of concern from a latency perspective; hence, instead of cutQueryString(url) we use concat(host, url) to give us the full url path.
If you didn't notice, we are using a relatively little-known clause, LIMIT BY, which is unique to ClickHouse. Here it limits the results returned in the subquery to 3 urls per host, and it's a great way to gain finer control over your data. The LIMIT n BY expressions clause selects the first n rows for each distinct value of the expressions; the key for LIMIT BY can contain any number of expressions.
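As a minimal illustration of the semantics, independent of our Fastly data (using ClickHouse's built-in numbers() table function):

SELECT number % 3 AS bucket, number
FROM numbers(12)
ORDER BY bucket, number DESC
LIMIT 2 BY bucket
-- returns the two largest values of number for each of the three buckets

Applied to our question, note that the subquery orders by time_elapsed so that LIMIT 3 BY host keeps the slowest rows per host: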
SELECT week, concat(host, url) AS full_url,
    (time_elapsed/1000) AS time_elapsed_msec
FROM (
    SELECT toStartOfWeek(time_start) AS week, host, url, time_elapsed
    FROM fastly.xlogs
    WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
    GROUP BY week, host, url, time_elapsed
    ORDER BY time_elapsed DESC
    LIMIT 3 BY host)
GROUP BY week, full_url, time_elapsed_msec
ORDER BY time_elapsed_msec DESC
Identify the top error causing urls using HTTP 4xx/5xx response codes
Keeping with the theme of leveraging the data to better understand the health of our website, HTTP response codes offer an option for monitoring client- and server-side errors. HTTP 4xx status codes typically represent client-side errors, while HTTP 5xx status codes are reserved for cases where the server has encountered an error and is incapable of fulfilling the request. In this example, instead of bucketing the timeframe with toStartOfWeek(), we use interval 7 day with now() to provide a trailing time window, and match(status, '^[45]\d\d') to pick up any status string that starts with '4' or '5'.
SELECT time_start, status, host, url, count(), cache_status
FROM fastly.xlogs
WHERE (time_start >= now() - interval 7 day)
    AND match(status, '^[45]\d\d')
GROUP BY time_start, host, url, status, cache_status
ORDER BY count() DESC
LIMIT 50
Find out what % of requests go to docs.hydrolix.io
Our website hosts both our documentation (docs.hydrolix.io) and the live query API (try.hydrolix.io) that executes these SQL examples. Previously, we found our top N urls, but what percentage of that traffic is directed at our docs site?
The host column represents which domain the traffic is targeting. It's a simple task at query time to compare a foreground data set to the full data set. We will use countIf(cond), the -If combinator form of count(), with a condition based on startsWith(string, prefix). This lets us conditionally count a foreground subset of the data. The time will be grouped with toStartOfDay(datetime); since that returns a time component by default, we format the output using formatDateTime(datetime, format) to remove the time.
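As a quick standalone illustration of the combinator (again on the numbers() table function rather than our logs):

SELECT count() AS total,
    countIf(number % 2 = 0) AS evens,
    countIf(number % 2 = 0) / count() * 100 AS percent_even
FROM numbers(10)
-- returns total = 10, evens = 5, percent_even = 50

The same pattern against our Fastly data: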
SELECT formatDateTime(toStartOfDay(time_start), '%D') AS day,
    countIf(startsWith(host, 'docs')) / count() * 100 AS percent_docs
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY day
ORDER BY day
Top 10 weekly visitor locations
Now that we have a pretty good handle on our top urls and are on the lookout for potential issues that may impact the user experience, a natural next question is: where are all these visitors coming from?
Everyone likes a good geolocation-based stat, so let's use a simple query to find out. Note that there are some '(null)' values present in the dataset, so we've added city != '(null)' to avoid returning results that don't provide us with a city.
SELECT toStartOfWeek(time_start) AS week,
    geo_city AS city,
    geo_country_code AS country,
    count() AS count
FROM fastly.xlogs
WHERE ((time_start BETWEEN '2021-01-01 00:00:00' AND now())
    AND city != '(null)')
GROUP BY week, city, country
ORDER BY count DESC
LIMIT 10
While the query itself is simple, it may serve as a starting point for better understanding who your clients are and where they are coming from, and ultimately for using that insight to ensure you are best serving your audience. Alternatively, if you are more of a security-minded individual, you could combine a geolocation-based query with uniq(client_ip), request_referer, request_user_agent, and request_x_forwarded_for and look for non-standard request patterns, as in the sketch below.
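Here is one possible shape such a query could take; note that the 1000-request and 5-agent thresholds are arbitrary illustrative values, not recommendations:

SELECT toStartOfWeek(time_start) AS week,
    geo_city AS city,
    uniq(client_ip) AS unique_ips,
    uniq(request_user_agent) AS unique_agents,
    count() AS requests
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
    AND city != '(null)'
GROUP BY week, city
HAVING requests > 1000 AND unique_agents < 5
ORDER BY requests DESC
-- a city producing heavy traffic from only a handful of user agents may indicate bot activity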
Find the impact of response time vs cache status
The time_elapsed column provides an indicator of the responsiveness of our website. In theory, this should be impacted by cache_status.
It's worth mentioning that you could use the quantiles(l1, l2, ...)(column) function to grab multiple percentile levels in a single pass (a variant of this is sketched after the query below), but we've chosen separate quantileExact(level)(expr) functions for exact values and clearer formatting of the results.
SELECT toDate(time_start) AS day, host, cache_status,
    (quantileExact(0.50)(time_elapsed)/1000) AS `50_percentile`,
    (quantileExact(0.75)(time_elapsed)/1000) AS `75_percentile`,
    (quantileExact(0.99)(time_elapsed)/1000) AS `99_percentile`
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY day, host, cache_status
ORDER BY day, cache_status
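For comparison, the single-pass quantiles() variant mentioned above might look like the following; it returns the levels as a single array rather than as separate columns, and uses an approximate algorithm by default:

SELECT toDate(time_start) AS day, host, cache_status,
    arrayMap(q -> q / 1000, quantiles(0.50, 0.75, 0.99)(time_elapsed)) AS percentiles_msec
FROM fastly.xlogs
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
GROUP BY day, host, cache_status
ORDER BY day, cache_status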
As expected, a cache MISS has a big impact on response times, although the worst response times seem to occur with a cache ERROR. That sounds like a great opportunity to modify the top N query from above by adding a filter on cache_status = 'ERROR'!
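A sketch of that variation, reusing the slowest-urls query from earlier with the extra cache_status predicate:

SELECT week, concat(host, url) AS full_url,
    (time_elapsed/1000) AS time_elapsed_msec
FROM (
    SELECT toStartOfWeek(time_start) AS week, host, url, time_elapsed
    FROM fastly.xlogs
    WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
        AND cache_status = 'ERROR'
    GROUP BY week, host, url, time_elapsed
    ORDER BY time_elapsed DESC
    LIMIT 3 BY host)
GROUP BY week, full_url, time_elapsed_msec
ORDER BY time_elapsed_msec DESC
-- the slowest cache-ERROR urls per host, by week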
Monitor the cache error rate
Based on what we just discovered about cache ERROR, a high-level metric that monitors the overall cache error rate sounds like it would be of value. Similar to the approach we used to determine what percentage of visitors hit docs.hydrolix.io versus try.hydrolix.io, we leverage countIf(cache_status = 'ERROR') to count only the rows where cache_status = 'ERROR'. That value is then divided by the total number of requests to produce a percentage rate.
SELECT toStartOfMonth(day) AS month,
    sum(error) AS error,
    sum(total) AS sumtotal,
    round((error/sumtotal)*100, 2) AS error_rate
FROM (
    SELECT toDate(time_start) AS day,
        count(request) AS total,
        countIf(cache_status = 'ERROR') AS error
    FROM fastly.xlogs
    WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
    GROUP BY day
    ORDER BY day ASC)
GROUP BY month
ORDER BY month
Monthly HTTP response code metrics
For our final example, we produce an extremely useful suite of monthly metrics based on HTTP response code status. countIf is used in conjunction with match(haystack, pattern) to produce selective and running totals based on response codes in the subquery. We then take those totals, summarize them with sum(), and produce monthly totals and rates.
SELECT toStartOfMonth(day) AS month,
    sum(reqtotal) AS total_requests,
    sum(`1XX_2XX`) AS success_total,
    round((success_total/total_requests)*100, 2) AS success_rate,
    sum(`3XX`) AS redirect_total,
    round((redirect_total/total_requests)*100, 2) AS redirect_rate,
    sum(`4XX_5XX`) AS error_total,
    round((error_total/total_requests)*100, 2) AS error_rate
FROM (
    SELECT toDate(time_start) AS day,
        count(request) AS reqtotal,
        countIf(match(status, '^[12]\d\d')) AS `1XX_2XX`,
        countIf(match(status, '^3\d\d')) AS `3XX`,
        countIf(match(status, '^[45]\d\d')) AS `4XX_5XX`
    FROM fastly.xlogs
    WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())
    GROUP BY day
    ORDER BY day ASC)
GROUP BY month
ORDER BY month
While all of this data is undoubtedly useful in the right context, deriving value from metrics-oriented datasets can be difficult without the right visualization strategy. Hydrolix is queried through ClickHouse, so we automatically support any visualization tool that ClickHouse supports. Next in this series, we'll show the Grafana dashboards we use for Fastly CDN observability.