
Transform Your Data With Hydrolix

Learn how to write transforms that give you maximum flexibility for processing, enriching, and standardizing your log data.

Franz Knupfer

Published:

Sep 25, 2023

7 minute read

Hydrolix uses transform schemas to determine how your data will be mapped to a table in a Hydrolix datastore. While a transform schema is simple to create and publish, it’s also a powerful tool to help you standardize and normalize your data. Hydrolix tables are extremely flexible—they have unlimited dimensionality and cardinality, and they are multi-schema, which means a table can ingest many data sources and also have many transforms. Transform schemas can help you follow logging best practices, including standardizing your data, and you’ll automatically benefit from all the optimizations that Hydrolix performs on your data at ingest.

In this blog post, you’ll learn how to:

  • Write a transform
  • Add a transform using the Hydrolix UI
  • Validate a transform
  • Publish and use a transform

Writing a Transform

Let’s create a basic transform that uses a real-world example: CDN log data. For brevity, this example shows only a few of the fields from a CDN log. In addition to showing several different data types you can work with, you’ll see how to apply some logging best practices in Hydrolix, including standardizing naming conventions and adding metadata. Note that each table in your datastore will have at least one transform, but can have many to handle data from different sources.

After defining a name, description, and data type (JSON in this case) for the data being ingested, you define the transform’s settings. "is_default": true means that this is the default transform for the table. You can have as many transforms associated with a table as you need, but only one can be the default. When you send data to Hydrolix (such as via the streaming API), you can specify which transform to use for mapping, and the default is used if you don’t. To specify a transform for ingest, you’ll use the transform name, so giving each transform a concise, descriptive name is important.
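As a sketch, the top of such a transform might look like the following. The name and description are illustrative (the post names the transform "fastly" later on), and the exact field layout should be checked against the Hydrolix transform reference:

```json
{
  "name": "fastly",
  "description": "Fastly CDN access logs",
  "type": "json",
  "settings": {
    "is_default": true,
    "output_columns": []
  }
}
```

The output_columns array, left empty here, holds the field mappings described next.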

Let’s take a look at the settings for each field to be mapped to the datastore.

  • The timestamp is an example of a "datetime" field and uses Go-format time. Note that "primary" is set to true. Because Hydrolix stores time series data, the primary index needs to be a timestamp. Hydrolix automatically stores your data in smaller partitions that are organized by this primary index. This has a number of advantages, including the ability to quickly scale your partitions up or down and improved querying speed. For example, if you query log data from a three-day period four months ago, the query head will instruct the query peers to look into the specific partitions related to that time period, narrowing down the search significantly and making the query much faster.
  • Next, there’s a "source" field with a datatype of "string". This is a virtual field, which means it isn’t being mapped from the CDN data itself. It’s enriched metadata that’s added at ingest time to make sure that the original source context isn’t lost. This is an example of the data enrichment best practice at work. Always make sure you add context as needed!
  • The next field is "client_ip" and it’s a string. It will go into the datastore as "client_ip" and there’s nothing to map other than the datatype. If no additional processing is needed, mapping fields is very simple.
  • In contrast, the "status_code" field mapping uses another logging best practice: standardizing a naming convention. Status codes from different sources often have different names; Fastly, for example, uses "status". To standardize the name to "status_code" in your datastore, you specify the name of the input field from the original source with "from_input_field": "status", and Hydrolix will automatically map that field to the name you’ve specified.
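Putting the four fields together, the output_columns array might be sketched like this. The Go reference-time format, the uint32 type for "status_code", the "fastly" default for the virtual field, and the exact nesting of "from_input_field" are assumptions where the post doesn’t give exact values, so verify against the Hydrolix transform reference:

```json
[
  {
    "name": "timestamp",
    "datatype": {
      "type": "datetime",
      "format": "2006-01-02T15:04:05Z",
      "primary": true
    }
  },
  {
    "name": "source",
    "datatype": {
      "type": "string",
      "virtual": true,
      "default": "fastly"
    }
  },
  {
    "name": "client_ip",
    "datatype": { "type": "string" }
  },
  {
    "name": "status_code",
    "datatype": {
      "type": "uint32",
      "from_input_field": "status"
    }
  }
]
```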

Learn about writing transforms.

Adding a Transform in the Hydrolix UI

Once you write a transform, you need to add it to a Hydrolix table before it can be used for mapping data. You can do so through either the Hydrolix API or the user interface. Following the documentation, you can make a create transform API call with your favorite tool, such as curl or Postman.

To apply a transform through the Hydrolix UI, log in to Hydrolix and select the + Add new button in the upper right corner of the screen, then select Table Transform from the dropdown menu.

Select Table Transform from the Add new dropdown menu.

On the New ingest transform screen, you’ll specify the table, transform name, description, and data type. Remember that a transform can only be used with its associated table, so make sure that the table has already been created and that you’ve selected the right one. You should also have a unique, concise name for the transform since the name (not ID) is specified as a parameter for batch and stream ingests. For data type, you can select either CSV or JSON.

After you select the data type, you’ll have the option to clone, create, generate, or upload a transform file.

  • Clone: You can create a copy of an existing transform, which is especially useful when you want to use an existing transform as a template for a new one. For instance, you might want to create a transform schema for a different service that has many fields in common with an existing transform but still needs some modifications.
  • Create: Create a new transform.
  • Generate: This option allows you to input sample data from a service, and a transform schema will automatically be generated based on that data. Generating a transform schema can eliminate most of the boilerplate of creating one, with one big caveat: you need to verify that the schema is correct, and you will likely have to make some modifications manually.
  • Upload: Finally, you can upload a schema file (either JSON or CSV).

This example uses Create. Select Create and then Validate transform. You’re ready to validate that the maps in your transform schema correctly transform a sample dataset.

Validating Your Transform Schema

Validating that your data is being mapped as expected is another logging best practice. While you can skip the validation process in the UI, it’s recommended that you test your mappings. You can input Transform Output Columns, Transform SQL, and Sample data to validate your transform. Adding transform SQL is a more advanced use case that isn’t covered in this post, but Hydrolix gives you the option to use (and validate) SQL transforms for use cases such as dictionary lookups and other standardization tasks like splitting strings.

Add your transform output columns as an array of JSON objects, with each object being a field in your transform schema. Here’s what that looks like for the sample transform schema created in this post:
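Based on the fields described earlier, that array might look like the following sketch. Types, the time format, and the virtual field’s default are assumptions where the post doesn’t specify them:

```json
[
  { "name": "timestamp", "datatype": { "type": "datetime", "format": "2006-01-02T15:04:05Z", "primary": true } },
  { "name": "source", "datatype": { "type": "string", "virtual": true, "default": "fastly" } },
  { "name": "client_ip", "datatype": { "type": "string" } },
  { "name": "status_code", "datatype": { "type": "uint32", "from_input_field": "status" } }
]
```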

Next, you can input sample data to see how the transform schema will map it to your database. Here’s an example with two log lines:
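For instance, two illustrative CDN-style log lines might look like this (the values are made up for the example):

```json
[
  { "timestamp": "2023-09-25T16:10:01Z", "client_ip": "203.0.113.10", "status": 200 },
  { "timestamp": "2023-09-25T16:10:02Z", "client_ip": "203.0.113.11", "status": 404 }
]
```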

Note that the virtual "source" field isn’t included because it’s not part of the sample data—the transform automatically enriches the output with it. Also note that the sample data uses the attribute name "status", which the transform maps to "status_code".

The validator outputs an array of arrays, with each log line having its own array:
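For sample input like the two log lines above, the output might look like this sketch (values are illustrative; the virtual "source" column is filled in with the transform’s default, and each inner array follows the column order of the schema):

```json
[
  ["2023-09-25T16:10:01Z", "fastly", "203.0.113.10", 200],
  ["2023-09-25T16:10:02Z", "fastly", "203.0.113.11", 404]
]
```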

The next image shows how this example looks in the Hydrolix UI.

Make sure you check for null values, as they often mean that the expected field for mapping wasn’t found. Learn more about how to test and validate transforms.

Publishing and Using a Transform

After you’ve validated your transform, you can select Publish transform in the UI. You can also use the API to POST a transform. If "is_default": true, your transform will be the default for the table when data is ingested. However, you’ll often specify which transform to use when you set up a data source. For many data sources, such as Kafka and Kinesis, the name of the transform is a required parameter. For instance, to use the transform created in this post for a data source, you’d pass the parameter value "fastly", the name given to the transform.
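As an illustration, a Kafka source configuration naming the transform might be sketched like this. The surrounding fields (the server, topic, and overall layout) are assumptions for illustration only; just the transform name, "fastly", comes from this post, so check the Hydrolix Kafka source reference for the exact parameters:

```json
{
  "name": "cdn-logs-kafka",
  "type": "pull",
  "subtype": "kafka",
  "transform": "fastly",
  "settings": {
    "bootstrap_servers": ["kafka.example.com:9092"],
    "topics": ["cdn-logs"]
  }
}
```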

Summary

In this post, you learned how to create, validate, publish, and use transforms. Transforms allow you to determine how Hydrolix tables ingest your data. And because Hydrolix tables have both unlimited dimensionality and cardinality, you have tremendous power and flexibility to capture all of your observability data for analysis and storage.

Next Steps

To learn more about Hydrolix transforms, start with the Write Transforms documentation.

Ready for more advanced methods of processing and enriching your data such as custom functions and dictionary lookups? Learn how to enrich your data with SQL statements.

If you don’t have Hydrolix yet, sign up for a free trial today. With Hydrolix, you get industry-leading data compression, lightning-fast queries, and unmatched scalability for your observability data. Learn more about Hydrolix.

Featured image by Hunter Harritt on Unsplash.
