Analyzing Fastly CDN transaction logs with Hydrolix

Alan

Published: Feb 24, 2021

In this post, we will show you how to analyze the Fastly CDN logs you have streamed to Hydrolix.

This post is part of a series showing how to use Hydrolix and an open source dashboard to maximize your Fastly CDN observability quickly, cheaply, and in your own VPC. Before following these quick instructions for analyzing Fastly transaction logs using Hydrolix, check out how to configure Fastly to stream logs into Hydrolix with an HTTPS endpoint and how to configure the Hydrolix streaming intake for those logs. Once you are querying the logs, you can also set up a dashboard to visualize your queries. You can also always refer to our Hydrolix documentation.

This short tutorial covers a few example queries designed to tell us a bit more about the Fastly log data that we are collecting. They are meant to give you a sense of what observing your Fastly CDN infrastructure with Hydrolix looks like. You can certainly write your own ClickHouse queries better suited to your own needs, and we encourage you to reach out to us if you need any help.

The data that we will be querying in these examples is stored in the fastly.xlogs table.

Defining the dataset

Let’s start with understanding a bit more about the data that we are working with. The extended log format adds quite a few more columns so visual confirmation of the table structure is always useful. We can employ a simple DESCRIBE command to take a closer look at the table definition.
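As a minimal sketch of that command, assuming only the fastly.xlogs table named above (the exact statement used in the original post is not reproduced here):

```sql
-- List every column of the table together with its data type.
DESCRIBE TABLE fastly.xlogs;
```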

This returns the table definition, one row per column with its name and type.

For this example, we recently moved to an extended log format for all of our Fastly real-time streaming logs. A quick query can establish exactly when that cutover took place, along with the most recent record and the overall row count. Applying the min() and max() aggregate functions to our datetime column, time_start, gives us the earliest and latest records in the data.
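A query along these lines would return all three values in a single pass. This is a sketch: the column aliases are illustrative names of our choosing, and count() is assumed as the way to get the row count.

```sql
SELECT
    min(time_start) AS earliest_record,  -- first row in the table (the cutover point)
    max(time_start) AS latest_record,    -- most recent row received
    count()         AS total_rows        -- overall row count
FROM fastly.xlogs;
```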

Find the top N requested urls per week

The url column represents page requests which can contain query string parameters and references to static assets.

We can clean up the urls at query time using cutQueryString(url) to strip any query string params, and filter out the static assets using match(string, pattern) to exclude unwanted results. Finally, we can group by toStartOfWeek(time_start) and use the topK(N)(column) function to calculate the top values. A sub-query will do the filtering, cleaning, and time grouping.
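Putting those pieces together might look something like the sketch below. The value N = 5, the 28-day window, the list of static file extensions, and the aliases are all assumptions made for illustration, so adjust them to your own site.

```sql
SELECT
    week,
    topK(5)(path) AS top_urls   -- approximate 5 most frequent paths for the week
FROM
(
    SELECT
        toStartOfWeek(time_start) AS week,   -- bucket each request into its week
        cutQueryString(url)       AS path    -- drop everything from '?' onwards
    FROM fastly.xlogs
    WHERE time_start >= now() - INTERVAL 28 DAY
      -- exclude static assets by file extension (illustrative list)
      AND NOT match(cutQueryString(url), '[.](css|js|png|jpg|gif|ico|svg)$')
)
GROUP BY week
ORDER BY week;
```

Note that topK is an approximate aggregate, so on very large, high-cardinality data the returned values are not guaranteed to be the exact top N.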

Find the top 3 slowest urls per host

While it’s certainly useful to know what the top requested urls for the week are, data becomes significantly more actionable with the application of context. What would happen if our top requested urls were also our slowest? It goes without saying that they probably wouldn’t be the top N urls for long. As such, we may also want to keep an eye out for potentially problematic urls by monitoring response times, so that we can proactively identify issues and resolve them before they become a problem.

This query is a bit different in that it looks at the full urls that have the highest time_elapsed values (in milliseconds) since the start of the week. In this case, it’s useful to have full visibility into the entire url since we are looking to identify areas of concern from a latency perspective; hence, we won’t be using cutQueryString(url) but will instead use concat(host, url) to give us the full url path.
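One way to express this is sketched below. The aliases are hypothetical, and the structure is simplified compared with the original post: it keeps the slowest observed time per url and then keeps three urls per host.

```sql
SELECT
    host,
    concat(host, url)  AS full_url,       -- keep the query string for full context
    max(time_elapsed)  AS slowest_elapsed -- worst response time seen for this url
FROM fastly.xlogs
WHERE time_start >= toStartOfWeek(now())  -- since the start of the current week
GROUP BY host, full_url
ORDER BY host, slowest_elapsed DESC
LIMIT 3 BY host;                          -- keep only the 3 slowest urls per host
```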

If you didn’t notice, we are using a relatively little-known clause, LIMIT BY, which is a ClickHouse extension to standard SQL. It limits the results returned in the subquery to 3 urls per host and is a great way to gain finer control of your data. The LIMIT n BY expressions clause selects the first n rows for each distinct value of expressions. The key for LIMIT BY can contain any number of expressions.

Identify the top error causing urls using HTTP 4xx/5xx response codes

Keeping with the theme of leveraging the data to better understand the health of our website, HTTP response codes offer an option for monitoring client and server-side errors. HTTP 4xx status codes typically represent client-side errors, and HTTP 5xx status codes are reserved for cases where the server has encountered an error and is incapable of performing the request. In this example, instead of setting the timeframe with toStartOfWeek(), we use interval 7 day with now() to provide a trailing time window, and match(status, '^[45]\d\d') to pick up any status string that starts with ‘4’ or ‘5’.
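A sketch of such a query follows. The aliases and the LIMIT of 10 are illustrative assumptions, and the pattern is written as '^[45][0-9][0-9]', which matches the same strings as '^[45]\d\d' while sidestepping any question of how backslashes are escaped inside a string literal.

```sql
SELECT
    concat(host, url) AS full_url,
    status,
    count()           AS error_count
FROM fastly.xlogs
WHERE time_start >= now() - INTERVAL 7 DAY      -- trailing 7-day window
  AND match(status, '^[45][0-9][0-9]')          -- 4xx and 5xx responses only
GROUP BY full_url, status
ORDER BY error_count DESC
LIMIT 10;
```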

Find out what % of requests go to docs.hydrolix.io

Our website hosts both our documentation (docs.hydrolix.io) and the live query API (try.hydrolix.io) that executes these SQL examples. Previously, we found our top N urls, but what percentage of that traffic is directed at our docs site?

The host column represents which domain the traffic is targeting. It’s a simple task at query time to compare a foreground data set to the full data set. We will use countIf(condition), which is count() with the -If combinator applied. The condition will be based on startsWith(string, prefix). This allows us to conditionally count a foreground subset of the data. The time will be grouped on toStartOfDay(datetime). As this returns a time component by default, we format the output using formatDateTime(datetime, format) to remove the time.
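As a rough sketch of that approach (the 7-day window, the aliases, and the rounding are assumptions, not taken from the original query):

```sql
SELECT
    formatDateTime(toStartOfDay(time_start), '%Y-%m-%d') AS day,  -- date only, no time part
    countIf(startsWith(host, 'docs.hydrolix.io'))        AS docs_requests,
    count()                                              AS total_requests,
    round(100 * docs_requests / total_requests, 2)       AS docs_pct
FROM fastly.xlogs
WHERE time_start >= now() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day;
```

ClickHouse lets a later expression in the same SELECT reuse an earlier alias, which is why docs_pct can refer to docs_requests and total_requests directly.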

Top 10 weekly visitor locations

Now that we have a pretty good handle on our top urls and are on the lookout for potential issues that may impact the user experience, it raises the question: where are all these visitors coming from?

Everyone likes a good geolocation-based stat so let’s use a simple query to find out. Note that there are some null values present in the dataset so we’ve added city != (null) to avoid returning results that don’t provide us with a city.
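A simple version might look like the sketch below. The aliases and the trailing 7-day window are assumptions, and the null filter is written with the standard IS NOT NULL predicate rather than the city != (null) form mentioned above.

```sql
SELECT
    city,
    count() AS requests
FROM fastly.xlogs
WHERE time_start >= now() - INTERVAL 7 DAY
  AND city IS NOT NULL          -- skip rows with no geolocation result
GROUP BY city
ORDER BY requests DESC
LIMIT 10;                       -- top 10 locations
```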

While the query itself is simple, this may serve as a starting point for better understanding who your clients are, where they are coming from, and ultimately using the additional insight to ensure you are best serving your audience. Alternatively, if you are more of a security-minded individual, you could combine a geolocation-based query with uniq(client_ip), request_referer, request_user_agent, and request_x_forwarded_for to look for non-standard request types, as an example.

Find the impact of response time vs cache status

The time_elapsed column provides an indicator of the responsiveness of our website. In theory, this should be affected by the cache_status.

It’s worth mentioning that you could use the quantiles(l1, l2, ...)(column) function to quickly grab multiple percentiles in a single scan, but we’ve chosen to use separate quantileExact(level)(expr) functions for higher resolution and better clarity in the results formatting.
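A sketch of the quantileExact variant is shown below. The chosen levels (50th, 90th, 99th), the 7-day window, and the aliases are illustrative assumptions.

```sql
SELECT
    cache_status,
    count()                           AS requests,
    quantileExact(0.50)(time_elapsed) AS p50,   -- median, same unit as time_elapsed
    quantileExact(0.90)(time_elapsed) AS p90,
    quantileExact(0.99)(time_elapsed) AS p99
FROM fastly.xlogs
WHERE time_start >= now() - INTERVAL 7 DAY
GROUP BY cache_status
ORDER BY p50 DESC;
```

quantileExact computes the exact value at the cost of more memory, whereas quantiles(...) is approximate but cheaper, so the trade-off depends on how much data the window covers.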

As expected, a cache MISS has a big impact on response times, although the worst response times seem to occur with a cache ERROR. Hey, that sounds like a great opportunity to modify the top N query from above by adding a filter on cache_status = 'ERROR'!

Monitor the cache error rate

Based on what we just discovered above with cache ERROR, a high-level metric that helps to monitor the overall cache error rate sounds like it would be of value. Similar to the approach we used in the query to determine what percent of visitors are visiting docs.hydrolix.io versus try.hydrolix.io, we leverage countIf(cache_status = 'ERROR') to count only the rows where cache_status = 'ERROR'. That value is then divided by the total number of requests to produce a percentage rate.
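For instance, a daily error rate could be sketched like this (the daily grouping, the 30-day window, and the aliases are assumptions made for illustration):

```sql
SELECT
    toStartOfDay(time_start)                        AS day,
    countIf(cache_status = 'ERROR')                 AS cache_errors,
    count()                                         AS total_requests,
    round(100 * cache_errors / total_requests, 3)   AS error_rate_pct
FROM fastly.xlogs
WHERE time_start >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
```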

Monthly HTTP response code metrics

For our final example, we produce an extremely useful suite of monthly metrics based on HTTP response code status. countIf is used in conjunction with match(haystack, pattern) to produce selective and running totals based on response codes in the subquery. We subsequently take those totals, summarize them using sum(), and produce monthly totals and rates.
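The overall shape of such a query is sketched below. The daily pre-aggregation in the subquery, the six-month window, and every alias are assumptions chosen for illustration; the original query may break the response codes down differently.

```sql
SELECT
    month,
    sum(total)                                  AS requests,
    sum(hits_2xx)                               AS total_2xx,
    sum(hits_4xx)                               AS total_4xx,
    sum(hits_5xx)                               AS total_5xx,
    round(100 * sum(hits_4xx) / sum(total), 2)  AS rate_4xx_pct,
    round(100 * sum(hits_5xx) / sum(total), 2)  AS rate_5xx_pct
FROM
(
    -- Per-day selective counts by response code class.
    SELECT
        toStartOfMonth(time_start)    AS month,
        toStartOfDay(time_start)      AS day,
        count()                       AS total,
        countIf(match(status, '^2'))  AS hits_2xx,
        countIf(match(status, '^4'))  AS hits_4xx,
        countIf(match(status, '^5'))  AS hits_5xx
    FROM fastly.xlogs
    WHERE time_start >= now() - INTERVAL 6 MONTH
    GROUP BY month, day
)
GROUP BY month
ORDER BY month;
```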

While all of this data is undoubtedly useful in the right context, deriving value from metrics-oriented datasets can be difficult without the right visualization strategy. Hydrolix is queried through ClickHouse, so we automatically support any visualization tool that ClickHouse supports. For the next example, we’ll show the Grafana dashboards we use for Fastly CDN observability.
