HYDROLIX BLOG

Clean your data quickly

December 5, 2022

Author: David Sztykman

We know data pipelines are complicated, especially when you rely on end-user information such as RUM and other client-side metrics.

Some users will try to modify data to bypass certain functionality or to avoid being blacklisted.

Limiting, clamping, or rejecting bad data is critical to giving your analysts accurate information.

Doing it at scale without adding extra latency is even more complex, and that’s why we did it!

Limits on strings

There’s nothing more annoying than receiving bogus data.
For example, let’s assume your application sends a uuid that is always 36 characters long, such as 123e4567-e89b-12d3-a456-426655440000.

Hydrolix ingest can now enforce that the uuid column is exactly 36 characters long.

The action taken when a value breaks the limit can be either:

  • reject – the row containing the wrong data is rejected
  • clamp – the row is modified to fit your limit.

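For example, suppose our data contains a field called message, limited to a minimum of 10 and a maximum of 20 characters, with the clamp action and foo as the padding text. The following inputs then produce these outputs:

Input | Output | Description
Hello | Hellofoofo | padded with foo to min length 10
Hello World! | Hello World! | kept at length 12 (between min and max)
Lorem ipsum dolor sit amet | Lorem ipsum dolor si | truncated to max length 20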
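
To make the reject and clamp behaviour concrete, here is a minimal Python sketch that reproduces the table above. It is an illustration only, not Hydrolix configuration or Hydrolix code: the function limit_string and its parameters (min_len, max_len, action, pad) are hypothetical names chosen for this example, and padding on the right by repeating the pad text is an assumption inferred from the Hello row.

```python
from typing import Optional


def limit_string(value: str, min_len: int, max_len: int,
                 action: str = "clamp", pad: str = " ") -> Optional[str]:
    """Return the limited string, or None when the row should be rejected.

    Hypothetical helper: models the behaviour described in the post,
    not the actual Hydrolix implementation.
    """
    if not pad:
        raise ValueError("pad must be a non-empty string")
    if min_len <= len(value) <= max_len:
        return value                       # already within the limits
    if action == "reject":
        return None                        # caller drops the whole row
    if len(value) > max_len:
        return value[:max_len]             # clamp: truncate to the maximum
    # clamp: repeat the pad text until the minimum length is reached
    missing = min_len - len(value)
    repeats = -(-missing // len(pad))      # ceiling division
    return value + (pad * repeats)[:missing]


print(limit_string("Hello", 10, 20, pad="foo"))             # Hellofoofo
print(limit_string("Lorem ipsum dolor sit amet", 10, 20))   # Lorem ipsum dolor si
print(limit_string("not-a-uuid", 36, 36, action="reject"))  # None
```

The last call mirrors the uuid case: with the minimum and maximum both set to 36 and the reject action, anything that is not exactly 36 characters long causes the row to be dropped.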

Limits on numerical values

Let’s assume your application calculates a response time as the difference between two values. You want to avoid impossible results such as negative values.

As another example, if you know that response_time can never exceed 3600s because you have set a timeout, you can reject any value higher than 3600, since you know for certain it is wrong.

Here too, you can use limiters to either reject or clamp the value and ensure the quality of your data.

With a minimum of 0, a maximum of 3600 and the reject action, any value below 0 or above 3600 is rejected by Hydrolix automatically.
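
The same idea for numbers can be sketched in a few lines of Python. As before, this only models the behaviour described here; limit_number and its parameters are hypothetical names, not part of Hydrolix.

```python
from typing import Optional


def limit_number(value: float, minimum: float, maximum: float,
                 action: str = "reject") -> Optional[float]:
    """Return the limited value, or None when the row should be rejected.

    Hypothetical helper illustrating reject vs. clamp on a numeric column.
    """
    if minimum <= value <= maximum:
        return value                             # within the limits
    if action == "reject":
        return None                              # caller drops the whole row
    return max(minimum, min(value, maximum))     # clamp to the nearest bound


# response_time in seconds, limited to the range 0..3600
print(limit_number(1.8, 0, 3600))                     # 1.8
print(limit_number(-12.5, 0, 3600))                   # None
print(limit_number(4200, 0, 3600, action="clamp"))    # 3600
```

Rejecting suits values that can only come from a bug or tampering, while clamping keeps the row at the cost of replacing the bad value with the nearest bound.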

Limits on array elements

Forget about limitless arrays!
You can now limit how many elements an array may contain.

Setting a maximum of 5, for example, removes all the extra elements and limits your array to at most 5 values.

You can also combine the limit on array length with limits on the datatype of the elements in your array.

For example, we can limit the array to 5 values, where each value is a string limited, as in our previous example, to a minimum of 10 and a maximum of 20 characters.
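
Combining the two limits can be sketched as follows. This is a hedged Python illustration, not Hydrolix code: limit_array and clamp_string are hypothetical helpers, and truncating the array before limiting each element is an assumption about ordering (with clamp, the result is the same either way).

```python
from typing import List


def clamp_string(value: str, min_len: int, max_len: int, pad: str) -> str:
    """Truncate to max_len, or pad on the right up to min_len."""
    if not pad:
        raise ValueError("pad must be a non-empty string")
    if len(value) > max_len:
        return value[:max_len]
    while len(value) < min_len:
        value += pad[:min_len - len(value)]   # add as much pad text as still fits
    return value


def limit_array(values: List[str], max_items: int = 5, min_len: int = 10,
                max_len: int = 20, pad: str = "foo") -> List[str]:
    """Keep at most max_items elements, then clamp each remaining string."""
    kept = values[:max_items]                 # drop the extra elements
    return [clamp_string(v, min_len, max_len, pad) for v in kept]


tags = ["Hello", "Hello World!", "Lorem ipsum dolor sit amet", "a", "b", "c", "d"]
print(limit_array(tags))
# ['Hellofoofo', 'Hello World!', 'Lorem ipsum dolor si', 'afoofoofoo', 'bfoofoofoo']
```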

Limits on dates

Imposing limits on dates could have saved the heroes of several movies a lot of time. If Doc in Back to the Future had imposed limits on the space-time continuum, Marty would never have made such a mess!

Here we limit timestamps to 5m in the past and 10m in the future.

This would have prevented Marty from meeting his mother as a teenager and from getting information about future sporting events!
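
A date limit is a window around the current time. The Python sketch below shows one way to model the 5m/10m window; limit_timestamp and its parameters are hypothetical names, and comparing against the time of ingest is an assumption rather than a statement about how Hydrolix evaluates it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def limit_timestamp(ts: datetime, now: datetime,
                    past: timedelta = timedelta(minutes=5),
                    future: timedelta = timedelta(minutes=10),
                    action: str = "reject") -> Optional[datetime]:
    """Return the limited timestamp, or None when the row should be rejected."""
    earliest = now - past
    latest = now + future
    if earliest <= ts <= latest:
        return ts                             # inside the allowed window
    if action == "reject":
        return None                           # caller drops the whole row
    return min(max(ts, earliest), latest)     # clamp to the nearest edge


now = datetime(2022, 12, 5, 12, 0, tzinfo=timezone.utc)
marty = datetime(1955, 11, 5, 6, 0, tzinfo=timezone.utc)

print(limit_timestamp(now - timedelta(minutes=3), now))   # kept as-is
print(limit_timestamp(marty, now))                        # None: too far in the past
print(limit_timestamp(marty, now, action="clamp"))        # 2022-12-05 11:55:00+00:00
```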

Conclusion

Hydrolix’s streaming data capability, together with our on-the-fly enrichment using SQL, lets users build a data pipeline easily. Now, with the limit features, you can make sure that everything you index makes sense, and you no longer have to spend time cleaning up your data!
