Are you dealing with spiraling data costs, complex data pipelines, or issues managing and standardizing your log data? Even worse, are you dealing with all of the above? As businesses face increasing pressure to ingest more data faster, these issues are becoming more common. Too often, the proposed solution is to use ELT (extract-load-transform) processes to ingest large volumes of real-time data, but ELT just pushes the problem further down your data pipeline and adds complexity.
With ELT, you load all of your unstructured data into storage, typically a data lake, and then extract and transform it later, often at query time. ELT is designed to help you ingest more data faster by dumping data as quickly as possible into storage. But with ELT, faster ingest comes with a major tradeoff. Because your data is unorganized, you’ll likely have more issues ordering, standardizing, and enriching it later. And that same lack of structure means near-real-time queries have higher latency and require more compute, so it takes longer to get actionable insights.
With Hydrolix, you can stream and transform data at scale—all before it’s written to a datastore. By transforming your data first, you save on the total cost of ownership (TCO) of your data management solution and simplify your data pipeline. And with Hydrolix, you can do so without any compromises on ingesting real-time streaming data at scale.
In this post, you’ll learn about:
- The problem with ELT
- Why you should transform your data before storing it
- How Hydrolix ingests and transforms your data at scale
The Problem With ELT
ELT focuses on solving a specific problem: ingesting massive amounts of data at very low latency. The biggest advantage of ELT is that there are minimal data ingestion bottlenecks. Data is extracted and then loaded immediately.
The downside of ELT is the cost and complexity of the mess you have to clean up later. While the upfront compute costs of writing data to a data lake may be cheaper, the compute costs of querying that unstructured data later, or structuring it all and sending it to a datastore, are much more expensive. When you are dealing with vast amounts of data, that quickly leads to runaway costs. You might find yourself sampling or significantly pruning your data to save money. At that point, if it’s going to cost too much to store and compute all your data, you’ve lost the main benefit of ELT in the first place. Why ingest it all so quickly only to throw it away?
On top of that, you’ll have increased complexity in your data pipeline. That means more maintenance and overhead from your teams. And while ELT architecture optimizes for cost and performance at ingest time, that doesn’t necessarily mean that you’ll get actionable insights any faster. To make your data actionable, you’ll either need to perform less efficient queries on unstructured data or transform your data to make it easier to query. For most use cases, it’s not ingest throughput that matters most—it’s the speed to actionable insights.
Downsides of ELT
ELT provides efficient ingest and can be beneficial if you need to store both raw and processed data. However, it comes with downsides.
- ELT adds complexity to your data pipeline. With ELT, you typically need more steps in your data pipeline. You might need to handle multiple data storage solutions, connectors, and transformers, leading to more maintenance and overhead for your teams.
- Higher compute = higher costs. With ELT, you may need to write or alter data multiple times, leading to higher compute costs.
- Queries are less efficient. It’s computationally expensive to query large, unstructured datasets. It’ll take longer to get actionable insights, if you can get them at all. Keep your fingers crossed that queries don’t time out.
- More complex storage and higher storage costs. With ELT, you might use multiple storage solutions for your data. For instance, unprocessed data might be sent to a data lake, then extracted, transformed, and sent to a data warehouse. You’ll have more storage solutions to manage and more baseline storage costs.
Why You Should Transform Your Data First
For most forward-thinking businesses looking to ingest real-time data while also saving on costs, the best option is to use a solution like Hydrolix to transform your data before storing it. There are specific use cases for ELT: for instance, financial applications that need data to be available as quickly as possible, or data science workloads where you want to keep data in its raw, unstructured form. But for most use cases, including real-world applications like streaming media where data needs to be ingested and analyzed in near real time, you’ll minimize headaches by transforming your data before it’s stored.
Let’s take a look at the benefits of transforming your data first.
Benefits of Data Transformation Before Storage
With this approach, you can:
- Simplify your pipeline. Transform your data and write to your datastore once. No complex pipeline needed.
- Save on compute costs. Because you only need to write once, not multiple times, you’ll save on compute costs. By optimizing and transforming your data when it’s ingested, you can distribute the compute into smaller workloads and scale as needed.
- Make your data easier to query, analyze, and understand. Unstructured data is difficult to query efficiently, and it’s often challenging for humans to analyze and understand.
- Retain important context by enriching your data. At ingest time, you may lose valuable context about your data such as its source if you don’t enrich it with metadata before it’s stored. By transforming your data first, you can ensure that your data includes important context.
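To make the enrichment point concrete, here is a minimal sketch in ordinary Python (not Hydrolix-specific; the field names are illustrative) of attaching source metadata to a raw log record at ingest time, before it ever reaches storage:

```python
import json
from datetime import datetime, timezone

def enrich(raw_line: str, source: str) -> dict:
    """Parse a raw log line and attach ingest-time context.
    The field names here are illustrative, not a Hydrolix schema."""
    record = json.loads(raw_line)
    record["source"] = source  # where the data came from
    record["ingested_at"] = datetime.now(timezone.utc).isoformat()
    return record

raw = '{"status": 200, "path": "/index.html"}'
enriched = enrich(raw, source="edge-cdn-logs")
```

Once the record is written with this context attached, there is no later reconstruction step to figure out where an event originated.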
How Hydrolix Ingests and Transforms Data at Scale
Hydrolix streams and transforms terabyte volumes of data in near real time. So how does Hydrolix transform data at scale while still providing high-performance ingest? Let’s start with the TL;DR and then go deeper.
- Hydrolix stream peers write and transform incoming data to storage partitions in parallel. You can scale stream peers up as needed for peak events or even down to zero for off-peak times.
- Each partition has a default width of 60 seconds. Because partitions are small, they’re quickly flushed to your datastore, minimizing the time to actionable insights.
- Before adding ingest sources, you’ll specify how your data should be transformed using JSON for most configurations and SQL for more advanced use cases. You get tremendous flexibility and control.
Now let’s take a closer look at how Hydrolix offers high-performance ingest and transformation.
Stream Peer Parallelism
Stream peers concurrently handle incoming messages. As previously mentioned, these stream peers can scale up or down, giving you flexibility to manage peak events with low latency or off-peak times with low costs. Stream peers are responsible for both transforming data and writing it to object storage. Transformation happens on a row-by-row basis and occurs while data is buffering to object storage, adding no additional latency to the ingest process.
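The parallel-transform-then-buffered-write pattern can be sketched in ordinary Python. This is an illustration of the concept only, not Hydrolix internals, and the function names are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def transform_row(row: dict) -> dict:
    # Row-by-row transformation: normalize field names and value types.
    return {k.lower(): str(v) for k, v in row.items()}

def ingest(rows, num_peers=4):
    """Transform rows in parallel, then flush the buffer in one write.
    num_peers stands in for the number of stream peers, which can be
    scaled up for peak events or down to zero for off-peak times."""
    with ThreadPoolExecutor(max_workers=num_peers) as pool:
        buffer = list(pool.map(transform_row, rows))
    # In a real pipeline, this buffer would be written to object storage.
    return buffer

out = ingest([{"Status": 200}, {"Status": 404}])
```

Because each row is transformed while the batch is still being assembled for the write, the transformation work overlaps with buffering rather than adding a serial step.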
Small Partitions Lead to Faster Actionable Insights
At ingest time, partitions are small—by default, they contain just 60 seconds worth of data and have several other maximum parameters such as number of rows. The defaults have been fine-tuned for the widest number of use cases, but the configurations are also fully customizable. With small partitions, your data is frequently flushed to the data store, ensuring minimal lag between ingesting and querying data.
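To make the windowing concrete, here is a minimal sketch in ordinary Python of bucketing events into 60-second partitions by timestamp. Hydrolix handles this internally; the code is purely illustrative:

```python
from collections import defaultdict

PARTITION_WIDTH_S = 60  # default partition width of 60 seconds

def partition_key(epoch_seconds: int) -> int:
    # Bucket each event into a 60-second window by its timestamp.
    return epoch_seconds - (epoch_seconds % PARTITION_WIDTH_S)

events = [{"ts": 0}, {"ts": 59}, {"ts": 60}, {"ts": 125}]
partitions = defaultdict(list)
for e in events:
    partitions[partition_key(e["ts"])].append(e)

# Events land in three windows: [0, 60), [60, 120), [120, 180)
```

Each small window can be flushed as soon as it closes, which is why newly ingested data becomes queryable so quickly.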
Flexible Transforms With JSON and SQL
Before you ingest data (either streaming or batch), you’ll specify how that data should be transformed using transform files and sql_transforms. This can range from very simple transformations (such as saving data properties as strings) to complex transformations that enrich, standardize, and otherwise process your data.
You can have separate transform files for each data source, giving you more granular control over your data, as well as multiple transform files for each table, allowing you to mix multiple sources in a single table.
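As a rough sketch of what a simple transform file can look like, the fragment below maps incoming fields to typed output columns. The field names and values here are illustrative only; consult the Hydrolix documentation for the exact transform schema:

```json
{
  "name": "cdn_logs_transform",
  "type": "json",
  "settings": {
    "output_columns": [
      { "name": "timestamp", "datatype": { "type": "datetime", "primary": true } },
      { "name": "status",    "datatype": { "type": "uint32" } },
      { "name": "path",      "datatype": { "type": "string" } }
    ]
  }
}
```

Because the transform is declarative configuration rather than pipeline code, changing how a source is parsed or typed doesn’t require redeploying a processing job.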
High-Density Compression Reduces Write Time
Hydrolix also performs automatic compression and indexing before data is written to cloud storage. High-density compression reduces your data footprint and storage costs and minimizes the amount of data that needs to be written to object storage. Meanwhile, all columns are indexed, giving you low-latency queries.
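As a generic illustration of why compressing before writing pays off (standard zlib here, not Hydrolix’s own codecs), note how well repetitive log data compresses:

```python
import zlib

# Log data is highly repetitive, so it compresses extremely well.
rows = b"\n".join(b'{"status": 200, "path": "/index.html"}' for _ in range(1000))
compressed = zlib.compress(rows, level=9)

ratio = len(rows) / len(compressed)
# Far fewer bytes need to be shipped to object storage.
```

Fewer bytes written means both lower storage costs and less time spent in the write path itself.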
Transform Your Data With Hydrolix and Simplify Your Data Pipeline
With Hydrolix, you get all of the benefits of transforming data before you store it with none of the problems of ELT. You can simplify your data pipeline by standardizing, organizing, and enriching your data before it goes into storage. All you need to specify is your transform configuration, and there are no additional steps for post-processing data. You’ll get highly performant data ingest, efficient queries, and lower total cost of ownership (TCO). Most Hydrolix customers save 75% or more over their previous solution.
What would you do with your time and cost savings if you didn’t have to worry about cleaning up your data lake anymore? Get a POC with Hydrolix to find out.
Photo by Maxim Berg