RSS

Best Practices for Data Transformation

Learn best practices for data transformation, including tips on standardizing, enriching and optimizing your data.

Franz Knupfer

Published:

Oct 02, 2023

10 minute read

To unlock the transformative power of your data, you first need to transform it yourself. This process, also known as data transformation, includes standardizing, enriching, and optimizing data either before or after it enters the database. Data transformation is an essential aspect of log management. Without transformation, your data will get increasingly messy and complex and you’ll run into issues analyzing, querying, and understanding it. The ripple effects will affect all of your tooling that relies on data, from dashboards to alerting systems. And if your organization takes a data-driven approach (which it should), the impact of poorly-managed data won’t be limited to the ops and site reliability teams tasked with monitoring your services for incidents. Everything from compliance and security to decision-making at the leadership level will suffer.

On the other hand, when your data is organized, well-structured, and optimized, you can build robust monitoring and alerting systems, perform data analysis and compliance audits, drive decision-making, and much more. By applying data transformation best practices from the start, you can ensure that you’re getting the most out of your data over the long term. In this post, you’ll learn some data transformation best practices. You’ll also learn what Hydrolix does under the hood to optimize and transform your data before it’s stored.

Read more about transforming your data with Hydrolix.

The Basics of Data Transformation

Data transformation is the process of converting your data when it’s transferred from one source to another. This process can happen either before or after data is added to your database. In complex data pipelines, transformations may occur in several different places. For example, you might transform data after it leaves a Kafka service and then transform it again before it enters your database.

This process is often referred to as either ETL (extract—transform—load) or ELT (extract—load—transform) depending on when the transformation process happens. With ETL, the transformation process occurs before data is loaded into your database. With ELT, transformation happens after the data is loaded into your database. With modern streaming paradigms, the term streaming ETL has also become more common, and it’s no longer accurate to refer to it as an extraction process because streaming data is usually pushed, not extracted, into a datastore. 

Some services allow you to manually configure how your data is transformed. In the case of Hydrolix, you can define a transform file for each data source, giving you maximum flexibility to shape your data. Hydrolix tables are multi-schema, which means you can ingest as many data sources as you want into a table and customize which transform file each source should use.

Sometimes transformation happens automatically—for example, Hydrolix compresses and indexes your data before it’s loaded, making your queries much more efficient and cutting down on storage costs.

There are many ways you can transform your data. These transformations can be constructive, meaning they add and enrich your incoming data. They can also modify or filter your data through computations, aggregation, encrypting, standardizing, and renaming fields. Finally, they can optimize your data through indexing, compression, adding summary tables, and more. Ultimately, the goal of transformation is to shape your data to fit the needs of your business. To that end, it should be complete (no data gaps), easy for humans to analyze and query, and optimized so that you get timely insights even when working with very large datasets.

Data Transformation Best Practices

Let’s take a look at some transformation best practices you can apply to get the most out of your data. You should apply these practices any time you add a new ingest source.

Standardize data types and naming conventions.

When you ingest data from many sources, you’ll need to make sure that you standardize data types and naming conventions across sources so that you can accurately analyze and compare your data. For example, you might need to standardize HTTP status code fields to use the same naming convention. One service might use status, another might use status_code, and a third might use statusCode. This data should all be stored in one table column using the naming convention of your choice.

Naming conventions make it much easier for users to understand and compare data from different sources. Without standardization, other users will find it much more difficult to analyze your data, resulting in potential gaps and misunderstandings. Querying large datasets is also much more challenging if you have to account for multiple differently-named columns representing the same data instead of just one. And some queries, like calculating uniques, will be incredibly inefficient or even impossible if the data is spread over many columns.

In addition to consistent naming conventions, you also need to standardize data types and units of measure. One of the most common examples is making sure that all timestamps use a similar format.

Add metadata to provide context.

Often, log data won’t provide all of the context you need. This could include lost context as data flows through different systems, or you might need to enrich logs with additional mission-critical context.

Let’s take a look at examples of each. One common issue with lost context is the source itself. For example, if you’re ingesting logs from a third-party provider or cloud service, it’s very likely that information about the source isn’t included. That context isn’t needed while the data is stored in the provider, but as soon as it’s sent to a database and mixed with other sources, that source context is lost. Without that source information, you won’t be able to compare different sources (for instance, you might want to see which service has more 400 errors), and you won’t be able to do root cause analysis to determine which source might be causing an issue.

There are also many use cases for enriching data with business-specific context. For instance, a data field might contain a URL with parameters that you want to extract and store in separate fields. Or you might want to do a lookup on an in-memory dictionary to provide context. As a potential use case, an ecommerce site could extract specific information about a promotion (“fall2023special”) from a URL parameter, do a dictionary lookup to get an internal code for the promotion, and then save that information to a field called promoCode.

With Hydrolix, you can add virtual attributes. These virtual attributes aren’t mapped to incoming data—instead, they can be constants (such as the name of a source) or can be derived from scripts. For advanced use cases like dictionary lookups, you can also create custom SQL statements.

Determine if any data should be left out of the database.

You’ll only have true observability into your system if you collect all of your data, but there are valid reasons you might not want to store some data. For instance, you might want to exclude sensitive data for compliance and security reasons, or a service might include metadata in its logs that simply isn’t relevant to your use case—or can’t be standardized with other service data you’re collecting. If storage costs are a consideration for you, keep in mind that you get unlimited dimensionality and powerful data compression with Hydrolix, allowing you to store all your data while keeping costs low. You can ensure that all your data is captured with a catchall, which captures any JSON fields in string form. You can also exclude data attributes from Hydrolix by not using a catchall and leaving the excluded fields out of the transform schema.

Optimize your data before it’s stored.

To handle large amounts of data, you need to optimize data to maximize the efficiency of both compute and storage. That includes indexing data so your queries are more efficient and compressing data to reduce your storage costs. If you’re using Hydrolix, then all of your log data is automatically indexed (on all columns) and then compressed. 

Because even the logs from just a handful of computers can quickly become terabytes worth of data, you should always be prepared to work with big data sets. For some databases, indexing can lead to significant increases in memory and storage, adding to your costs. However, Hydrolix provides high-density compression and indexes all columns on ingest by default. If you know you aren’t ever going to search a specific field, you can turn indexing off—just keep in mind that you’ll run into issues later if you discover a new use case for querying unindexed data. Learn more about indexing in the docs.

You may also want to manually optimize your data further. With Hydrolix, you can use full text search to index longer strings and make them searchable by query. Or you may want to extract data from a larger string and save it to a separate field (such as a parameter from a URL) to make it more efficient to query.

Put a process in place to handle malformed or incomplete data.

Part of ensuring that your data remains clean is rejecting or fixing data that is incorrect or malformed. That could mean putting a data limiter in place for certain fields that rejects certain data types, string lengths, or some other required parameter for your data.

Data rejection can be scary—for example, what if your engineering team accidentally alters the outputted log data from a service in such a way that it’s slated for rejection? You could potentially have a gap in your data collection. Ideally, you should have a process in place that allows you to review data that is rejected, giving you a chance to fix the issue so that you don’t have gaps in your data.

With Hydrolix, you can specify when data should be rejected, whether that’s because it doesn’t have expected fields, uses incorrect formats, or other reasons. Hydrolix uses datatype limiters, which allow you to reject data based on certain parameters such as if a string exceeds a certain length. If data is rejected, it’s sent to cloud storage so you can inspect it and determine if you need to update your transform schema or the data itself so that it can be properly ingested and stored. Learn how to reject incorrect data in Hydrolix.

Validate that your transformations are working as intended.

Before you send large amounts of data to a database, you should make sure that you’ve set up your data transformation correctly. This should be done any time there is a new ingest source. Otherwise, you might be sending malformed or incomplete data to your database, and if you’re already ingesting large amounts of data, you’ll have to deal with computationally expensive updates to your data store later—or deal with incorrect information and data gaps. To do so, you can ingest a small amount of test log data and then review what that data looks like when you read it from your datastore. You’ll want to ensure the following:

  • All expected fields are included and have the correct names.
  • Data is being stored as the correct datatype.
  • If there are more advanced computations and transformations, verify that the data is being transformed correctly.

To optimize compute, Hydrolix is set up to be append-only, but alter operations are available if they are absolutely needed. Learn about validating your data transforms.

Configure your transform to apply your specific use case.

Many—perhaps even most—attributes from your log data won’t need any transformation at all, meaning that they can be saved to your datastore as a string using the same column name. However, you should think carefully about use cases specific to your applications and services. Is all the data you need included on ingest, or is there additional metadata necessary to provide context? Do you need to perform computations or more complex transformations? Are there further optimizations you can perform to make your queries more efficient? You should consider ways to future-proof your log data as it scales.

How Hydrolix Helps You Transform Your Data

Hydrolix is built to give you ingest at terabyte scale, sub-second query time (even on very large datasets), and high-density compression that ensures you minimize your storage costs. There are optimizations throughout the product to make these things possible. In this section, you’ll get an overview of how Hydrolix helps you transform your data through both automatic optimizations and fine-grained control to customize your data.

  • High-density compression: Hydrolix’s patented high-density compression reduces the total size of data before it’s ingested into your system. This minimizes the long-term storage costs of your data—plus you get 15 months of data retention standard with Hydrolix.
  • By default, all data is indexed on ingest. Hydrolix uses a columnar datastore for your data, and all data is automatically indexed at ingest time. Indexing your data is an important factor in providing sub-second query speed, even for large data sets. And by indexing on ingest, you’re also breaking up the overall compute time of indexing a column into very small, manageable chunks. Contrast that with the compute needed to index a column all at once—especially when that column already has millions or even billions of rows. If needed, you can manually turn off indexing for columns.
  • Full control over data transformation. With Hydrolix, you create data transforms for each data source, giving you granular control and tremendous flexibility to shape your data. Learn how to write data transformations.
  • Functionality to index full text. You can also optionally index data with full text search, which allows you to split and sub-index longer strings in your data. For example, you can sub-index a "message" field in your data so that you can quickly find messages that include the word "error".
  • Automatically merge data for efficiency. At ingest time, Hydrolix stores your data in small partitions and assigns a query peer to each partition—giving you near real-time querying of your data when it’s ingested. Older data is gradually merged into larger partitions to achieve optimal sizing. Learn more about merging. You can also customize merge pools to fit your needs.

Next Steps

With Hydrolix, you can use transforms to shape your data—all while ingesting, querying, your data with extremely low latency and flexible scalability. If you’re ingesting a terabyte of data or more every day and you don’t have Hydrolix yet, sign up for a free trial. You’ll get industry-leading data compression, lightning-fast queries, and unmatched scalability for your observability data. Learn more about Hydrolix.

Share this post…

Ready to Start?

Cut data retention costs by 75%

Give Hydrolix a try or get in touch with us to learn more

To unlock the transformative power of your data, you first need to transform it yourself. This process, also known as data transformation, includes standardizing, enriching, and optimizing data either before or after it enters the database. Data transformation is an essential aspect of log management. Without transformation, your data will get increasingly messy and complex and you’ll run into issues analyzing, querying, and understanding it. The ripple effects will affect all of your tooling that relies on data, from dashboards to alerting systems. And if your organization takes a data-driven approach (which it should), the impact of poorly-managed data won’t be limited to the ops and site reliability teams tasked with monitoring your services for incidents. Everything from compliance and security to decision-making at the leadership level will suffer.

On the other hand, when your data is organized, well-structured, and optimized, you can build robust monitoring and alerting systems, perform data analysis and compliance audits, drive decision-making, and much more. By applying data transformation best practices from the start, you can ensure that you’re getting the most out of your data over the long term. In this post, you’ll learn some data transformation best practices. You’ll also learn what Hydrolix does under the hood to optimize and transform your data before it’s stored.

Read more about transforming your data with Hydrolix.

The Basics of Data Transformation

Data transformation is the process of converting your data when it’s transferred from one source to another. This process can happen either before or after data is added to your database. In complex data pipelines, transformations may occur in several different places. For example, you might transform data after it leaves a Kafka service and then transform it again before it enters your database.

This process is often referred to as either ETL (extract—transform—load) or ELT (extract—load—transform) depending on when the transformation process happens. With ETL, the transformation process occurs before data is loaded into your database. With ELT, transformation happens after the data is loaded into your database. With modern streaming paradigms, the term streaming ETL has also become more common, and it’s no longer accurate to refer to it as an extraction process because streaming data is usually pushed, not extracted, into a datastore. 

Some services allow you to manually configure how your data is transformed. In the case of Hydrolix, you can define a transform file for each data source, giving you maximum flexibility to shape your data. Hydrolix tables are multi-schema, which means you can ingest as many data sources as you want into a table and customize which transform file each source should use.

Sometimes transformation happens automatically—for example, Hydrolix compresses and indexes your data before it’s loaded, making your queries much more efficient and cutting down on storage costs.

There are many ways you can transform your data. These transformations can be constructive, meaning they add and enrich your incoming data. They can also modify or filter your data through computations, aggregation, encrypting, standardizing, and renaming fields. Finally, they can optimize your data through indexing, compression, adding summary tables, and more. Ultimately, the goal of transformation is to shape your data to fit the needs of your business. To that end, it should be complete (no data gaps), easy for humans to analyze and query, and optimized so that you get timely insights even when working with very large datasets.

Data Transformation Best Practices

Let’s take a look at some transformation best practices you can apply to get the most out of your data. You should apply these practices any time you add a new ingest source.

Standardize data types and naming conventions.

When you ingest data from many sources, you’ll need to make sure that you standardize data types and naming conventions across sources so that you can accurately analyze and compare your data. For example, you might need to standardize HTTP status code fields to use the same naming convention. One service might use status, another might use status_code, and a third might use statusCode. This data should all be stored in one table column using the naming convention of your choice.

Naming conventions make it much easier for users to understand and compare data from different sources. Without standardization, other users will find it much more difficult to analyze your data, resulting in potential gaps and misunderstandings. Querying large datasets is also much more challenging if you have to account for multiple differently-named columns representing the same data instead of just one. And some queries, like calculating uniques, will be incredibly inefficient or even impossible if the data is spread over many columns.

In addition to consistent naming conventions, you also need to standardize data types and units of measure. One of the most common examples is making sure that all timestamps use a similar format.

Add metadata to provide context.

Often, log data won’t provide all of the context you need. This could include lost context as data flows through different systems, or you might need to enrich logs with additional mission-critical context.

Let’s take a look at examples of each. One common issue with lost context is the source itself. For example, if you’re ingesting logs from a third-party provider or cloud service, it’s very likely that information about the source isn’t included. That context isn’t needed while the data is stored in the provider, but as soon as it’s sent to a database and mixed with other sources, that source context is lost. Without that source information, you won’t be able to compare different sources (for instance, you might want to see which service has more 400 errors), and you won’t be able to do root cause analysis to determine which source might be causing an issue.

There are also many use cases for enriching data with business-specific context. For instance, a data field might contain a URL with parameters that you want to extract and store in separate fields. Or you might want to do a lookup on an in-memory dictionary to provide context. As a potential use case, an ecommerce site could extract specific information about a promotion (“fall2023special”) from a URL parameter, do a dictionary lookup to get an internal code for the promotion, and then save that information to a field called promoCode.

With Hydrolix, you can add virtual attributes. These virtual attributes aren’t mapped to incoming data—instead, they can be constants (such as the name of a source) or can be derived from scripts. For advanced use cases like dictionary lookups, you can also create custom SQL statements.

Determine if any data should be left out of the database.

You’ll only have true observability into your system if you collect all of your data, but there are valid reasons you might not want to store some data. For instance, you might want to exclude sensitive data for compliance and security reasons, or a service might include metadata in its logs that simply isn’t relevant to your use case—or can’t be standardized with other service data you’re collecting. If storage costs are a consideration for you, keep in mind that you get unlimited dimensionality and powerful data compression with Hydrolix, allowing you to store all your data while keeping costs low. You can ensure that all your data is captured with a catchall, which captures any JSON fields in string form. You can also exclude data attributes from Hydrolix by not using a catchall and leaving the excluded fields out of the transform schema.

Optimize your data before it’s stored.

To handle large amounts of data, you need to optimize data to maximize the efficiency of both compute and storage. That includes indexing data so your queries are more efficient and compressing data to reduce your storage costs. If you’re using Hydrolix, then all of your log data is automatically indexed (on all columns) and then compressed. 

Because even the logs from just a handful of computers can quickly become terabytes worth of data, you should always be prepared to work with big data sets. For some databases, indexing can lead to significant increases in memory and storage, adding to your costs. However, Hydrolix provides high-density compression and indexes all columns on ingest by default. If you know you aren’t ever going to search a specific field, you can turn indexing off—just keep in mind that you’ll run into issues later if you discover a new use case for querying unindexed data. Learn more about indexing in the docs.

You may also want to manually optimize your data further. With Hydrolix, you can use full text search to index longer strings and make them searchable by query. Or you may want to extract data from a larger string and save it to a separate field (such as a parameter from a URL) to make it more efficient to query.

Put a process in place to handle malformed or incomplete data.

Part of ensuring that your data remains clean is rejecting or fixing data that is incorrect or malformed. That could mean putting a data limiter in place for certain fields that rejects certain data types, string lengths, or some other required parameter for your data.

Data rejection can be scary—for example, what if your engineering team accidentally alters the outputted log data from a service in such a way that it’s slated for rejection? You could potentially have a gap in your data collection. Ideally, you should have a process in place that allows you to review data that is rejected, giving you a chance to fix the issue so that you don’t have gaps in your data.

With Hydrolix, you can specify when data should be rejected, whether that’s because it doesn’t have expected fields, uses incorrect formats, or other reasons. Hydrolix uses datatype limiters, which allow you to reject data based on certain parameters such as if a string exceeds a certain length. If data is rejected, it’s sent to cloud storage so you can inspect it and determine if you need to update your transform schema or the data itself so that it can be properly ingested and stored. Learn how to reject incorrect data in Hydrolix.

Validate that your transformations are working as intended.

Before you send large amounts of data to a database, you should make sure that you’ve set up your data transformation correctly. This should be done any time there is a new ingest source. Otherwise, you might be sending malformed or incomplete data to your database, and if you’re already ingesting large amounts of data, you’ll have to deal with computationally expensive updates to your data store later—or deal with incorrect information and data gaps. To do so, you can ingest a small amount of test log data and then review what that data looks like when you read it from your datastore. You’ll want to ensure the following:

  • All expected fields are included and have the correct names.
  • Data is being stored as the correct datatype.
  • If there are more advanced computations and transformations, verify that the data is being transformed correctly.

To optimize compute, Hydrolix is set up to be append-only, but alter operations are available if they are absolutely needed. Learn about validating your data transforms.

Configure your transform to apply your specific use case.

Many—perhaps even most—attributes from your log data won’t need any transformation at all, meaning that they can be saved to your datastore as a string using the same column name. However, you should think carefully about use cases specific to your applications and services. Is all the data you need included on ingest, or is there additional metadata necessary to provide context? Do you need to perform computations or more complex transformations? Are there further optimizations you can perform to make your queries more efficient? You should consider ways to future-proof your log data as it scales.

How Hydrolix Helps You Transform Your Data

Hydrolix is built to give you ingest at terabyte scale, sub-second query time (even on very large datasets), and high-density compression that ensures you minimize your storage costs. There are optimizations throughout the product to make these things possible. In this section, you’ll get an overview of how Hydrolix helps you transform your data through both automatic optimizations and fine-grained control to customize your data.

  • High-density compression: Hydrolix’s patented high-density compression reduces the total size of data before it’s ingested into your system. This minimizes the long-term storage costs of your data—plus you get 15 months of data retention standard with Hydrolix.
  • By default, all data is indexed on ingest. Hydrolix uses a columnar datastore for your data, and all data is automatically indexed at ingest time. Indexing your data is an important factor in providing sub-second query speed, even for large data sets. And by indexing on ingest, you’re also breaking up the overall compute time of indexing a column into very small, manageable chunks. Contrast that with the compute needed to index a column all at once—especially when that column already has millions or even billions of rows. If needed, you can manually turn off indexing for columns.
  • Full control over data transformation. With Hydrolix, you create data transforms for each data source, giving you granular control and tremendous flexibility to shape your data. Learn how to write data transformations.
  • Functionality to index full text. You can also optionally index data with full text search, which allows you to split and sub-index longer strings in your data. For example, you can sub-index a "message" field in your data so that you can quickly find messages that include the word "error".
  • Automatically merge data for efficiency. At ingest time, Hydrolix stores your data in small partitions and assigns a query peer to each partition—giving you near real-time querying of your data when it’s ingested. Older data is gradually merged into larger partitions to achieve optimal sizing. Learn more about merging. You can also customize merge pools to fit your needs.

Next Steps

With Hydrolix, you can use transforms to shape your data—all while ingesting, querying, your data with extremely low latency and flexible scalability. If you’re ingesting a terabyte of data or more every day and you don’t have Hydrolix yet, sign up for a free trial. You’ll get industry-leading data compression, lightning-fast queries, and unmatched scalability for your observability data. Learn more about Hydrolix.