RSS

Creating Custom Views for High Dimensionality Data

With custom views, you can simplify and share datasets and reduce wasteful cloud spend.

Franz Knupfer

Published:

Mar 12, 2024

7 minute read
,

High dimensionality datasets in Hydrolix can have hundreds or even thousands of columns. However, most teams will only need to query a smaller slice of that data in their day-to-day work. With Hydrolix, you can use custom views to make it easy for your teams and stakeholders to access only the columns they need for analysis.

Custom views allow you to curate different views for each team or even share views between teams. You can remove the distraction of unnecessary columns, making query results easier to work with. In addition to providing convenience and productivity benefits, custom views can help you apply FinOps best practices, which include maximizing the value of your data and eliminating wasteful cloud spend. A SELECT * statement using a custom view will retrieve only the columns related to that view instead of retrieving all the columns in a table. It’s an easy step you can take to save money on your cloud bill.

Why Use a Custom View?

To understand the benefits of custom views, let’s take a look at the difference between a basic query and a custom view. Here’s a basic query that a security analyst could use as a basis for making a hypothesis about unauthorized attempts to access application services.

Hydrolix is a columnar database, and a SELECT * statement will retrieve all of the columns in a table by default. If the services table has hundreds of columns and the security analyst only needs five of them to test a hypothesis, this query would be unnecessarily wasteful. It’ll also be much harder to sift through all that data.

Columnar databases like Hydrolix are optimized to retrieve only the columns you need. They can do partial table scans instead of a full table scan like traditional row-based databases, but a SELECT * query negates that benefit.

If you have a lot of database experience, you might be thinking, “I’d never use SELECT *.” But a surprising number of users do run SELECT * queries. Some users have experience with row-based databases, not columnar databases, or don’t typically work with high dimensionality data. And many users aren’t accustomed to working with terabyte-scale data, where small tweaks in a query can make a huge difference in terms of performance and compute costs. As your datasets accumulate data gravity (the tendency of users to gravitate towards large volumes of data), you can maximize the value of that data by providing data to more teams. However, a side effect of doing so is that you’ll have more users that may be inexperienced with big data.

A more curated query would look like this:

Now the query specifies six columns. Like other columnar databases, Hydrolix uses column pruning to disregard columns that aren’t specified in the query, making the query more performant and efficient. The cost savings for one query might be minimal, but if your teams are making thousands of queries, the savings will add up. The query results are also much easier for humans to sift through.

This query is still simplistic, though. A security analyst might want to retrieve dozens of columns. And what if you wanted to share this data slice between teams, or train a new member of your team to use it? While you may generally want to retrieve the same columns for many queries, you’ll likely want to filter the data in many different ways. The SELECT clause should be reusable while other parts of the query are flexible.

With a custom view, the same query might look like this:

Here, SELECT * retrieves all the columns that are specified in security_team_view. You can curate exactly which columns the security team needs and edit the view at any time.

You can (and should) create a default custom view so that a SELECT * doesn’t query every column in a table by default. This is a best practice for high dimensionality data, and one way that custom views can support your FinOps practices.

Creating Custom Views In Hydrolix

You can use the Hydrolix API to create a custom view. If this is the first time you’re using the API, start with the authentication documentation. Once you’re authenticated, you can retrieve the automatically generated auto_view and then use it to create a view. Let’s walk through the process.

1. Retrieve the table auto_view.

All Hydrolix tables have an automatically generated auto_view which includes all of a table’s columns. To create a custom view, your first step is to retrieve this auto_view. You can make a GET request to the following Hydrolix endpoint to see all the views on a table: https://{hostname}/config/v1/orgs/{org_id}/projects/{project_id}/tables/{table_id}/views/. If you haven’t created a custom view yet, the only view will be the auto_view. Here’s an example auto_view for a table that only has four columns:

You’ll use the JSON for your table’s auto_view as a basis for your custom view.

Note that "is_default" is set to true. Because the auto_view is the default view for the table, a SELECT * query will retrieve every column in the table. If your table has low dimensionality, that’s not necessarily a problem. However, if it’s high dimensionality and you aren’t regularly using all of the columns in the table, you should change the default. You can still query the auto_view table by specifying it in the query.

2. Create your custom view using JSON.

Now that you have the auto_view, the next step is to determine which columns should be in your custom view. The JSON for the custom view will look very similar to the auto_view. Here’s an example custom view that has two of the four columns of the example auto_view from the last section.

Let’s run through the relevant parts:

  • name: The name of your custom view. It should be concise and memorable, especially if it’s not the default view, because users will need to specify it in their queries.
  • description: Add an optional description. The value will be null otherwise.
  • is_default: If set to true, the custom view will be the default. Only one view can be the default for a table. The auto_view is the default view unless you set is_default to true for one of your custom views. If your table has high dimensionality, you should set a custom view to be the default.
  • output_columns: This is an array of JSON objects, and each object represents a column in your database. You can simply copy/paste columns from your auto_view into your custom view.

3. Publish the custom view using the API.

Once you’ve built the JSON for your custom view, you can publish it with a POST call to https://{hostname}/config/v1/orgs/{org_id}/projects/{project_id}/tables/{table_id}/views/. Simply paste the JSON for your custom view in the body of the API call. You can then verify that the view has been created with a GET call to https://{hostname}/config/v1/orgs/{org_id}/projects/{project_id}/tables/{table_id}/views/.

4. Using custom views in queries.

If the custom view isn’t the default, you need to specify the view you’re using in the query:

Note that the format is project_name.table_name#view_name. You can make the query more selective by specifying columns in the view instead of using *.

Important note: The ... in this query example represents the rest of the query. Custom views will reduce the dimensionality of a query response, but not the cardinality. You should never make queries on large datasets without predicates. In the case of Hydrolix, your query should at a minimum include a time filter for optimal performance.

If the custom view is the default, you don’t have to specify the view. All queries that don’t specify a view will automatically use the default. If you do need access to all of the columns on the table, you can specify the auto_view like this:

Summary

Using custom views is a best practice, especially if your tables have high dimensionality. As your data sets accumulate data gravity, more teams will likely rely on that data, making it even more important to curate data slices for each team. By doing so, you can maximize the value of your data and reduce wasteful full table scans. And remember, for high dimensionality tables, the auto_view shouldn’t be the default. Even if all of your teams are using the same data, you should still create a default custom view that only queries data that’s used on a daily basis.

Next Steps

Custom views are just one way to reduce cloud costs and apply FinOps best practices. Learn how to use precision query scaling and query circuit breakers to reduce wasteful query compute.

Learn more about Hydrolix and contact us for a POC.

Share this post…

Ready to Start?

Cut data retention costs by 75%

Give Hydrolix a try or get in touch with us to learn more

High dimensionality datasets in Hydrolix can have hundreds or even thousands of columns. However, most teams will only need to query a smaller slice of that data in their day-to-day work. With Hydrolix, you can use custom views to make it easy for your teams and stakeholders to access only the columns they need for analysis.

Custom views allow you to curate different views for each team or even share views between teams. You can remove the distraction of unnecessary columns, making query results easier to work with. In addition to providing convenience and productivity benefits, custom views can help you apply FinOps best practices, which include maximizing the value of your data and eliminating wasteful cloud spend. A SELECT * statement using a custom view will retrieve only the columns related to that view instead of retrieving all the columns in a table. It’s an easy step you can take to save money on your cloud bill.

Why Use a Custom View?

To understand the benefits of custom views, let’s take a look at the difference between a basic query and a custom view. Here’s a basic query that a security analyst could use as a basis for making a hypothesis about unauthorized attempts to access application services.

Hydrolix is a columnar database, and a SELECT * statement will retrieve all of the columns in a table by default. If the services table has hundreds of columns and the security analyst only needs five of them to test a hypothesis, this query would be unnecessarily wasteful. It’ll also be much harder to sift through all that data.

Columnar databases like Hydrolix are optimized to retrieve only the columns you need. They can do partial table scans instead of a full table scan like traditional row-based databases, but a SELECT * query negates that benefit.

If you have a lot of database experience, you might be thinking, “I’d never use SELECT *.” But a surprising number of users do run SELECT * queries. Some users have experience with row-based databases, not columnar databases, or don’t typically work with high dimensionality data. And many users aren’t accustomed to working with terabyte-scale data, where small tweaks in a query can make a huge difference in terms of performance and compute costs. As your datasets accumulate data gravity (the tendency of users to gravitate towards large volumes of data), you can maximize the value of that data by providing data to more teams. However, a side effect of doing so is that you’ll have more users that may be inexperienced with big data.

A more curated query would look like this:

Now the query specifies six columns. Like other columnar databases, Hydrolix uses column pruning to disregard columns that aren’t specified in the query, making the query more performant and efficient. The cost savings for one query might be minimal, but if your teams are making thousands of queries, the savings will add up. The query results are also much easier for humans to sift through.

This query is still simplistic, though. A security analyst might want to retrieve dozens of columns. And what if you wanted to share this data slice between teams, or train a new member of your team to use it? While you may generally want to retrieve the same columns for many queries, you’ll likely want to filter the data in many different ways. The SELECT clause should be reusable while other parts of the query are flexible.

With a custom view, the same query might look like this:

Here, SELECT * retrieves all the columns that are specified in security_team_view. You can curate exactly which columns the security team needs and edit the view at any time.

You can (and should) create a default custom view so that a SELECT * doesn’t query every column in a table by default. This is a best practice for high dimensionality data, and one way that custom views can support your FinOps practices.

Creating Custom Views In Hydrolix

You can use the Hydrolix API to create a custom view. If this is the first time you’re using the API, start with the authentication documentation. Once you’re authenticated, you can retrieve the automatically generated auto_view and then use it to create a view. Let’s walk through the process.

1. Retrieve the table auto_view.

All Hydrolix tables have an automatically generated auto_view which includes all of a table’s columns. To create a custom view, your first step is to retrieve this auto_view. You can make a GET request to the following Hydrolix endpoint to see all the views on a table: https://{hostname}/config/v1/orgs/{org_id}/projects/{project_id}/tables/{table_id}/views/. If you haven’t created a custom view yet, the only view will be the auto_view. Here’s an example auto_view for a table that only has four columns:

You’ll use the JSON for your table’s auto_view as a basis for your custom view.

Note that "is_default" is set to true. Because the auto_view is the default view for the table, a SELECT * query will retrieve every column in the table. If your table has low dimensionality, that’s not necessarily a problem. However, if it’s high dimensionality and you aren’t regularly using all of the columns in the table, you should change the default. You can still query the auto_view table by specifying it in the query.

2. Create your custom view using JSON.

Now that you have the auto_view, the next step is to determine which columns should be in your custom view. The JSON for the custom view will look very similar to the auto_view. Here’s an example custom view that has two of the four columns of the example auto_view from the last section.

Let’s run through the relevant parts:

  • name: The name of your custom view. It should be concise and memorable, especially if it’s not the default view, because users will need to specify it in their queries.
  • description: Add an optional description. The value will be null otherwise.
  • is_default: If set to true, the custom view will be the default. Only one view can be the default for a table. The auto_view is the default view unless you set is_default to true for one of your custom views. If your table has high dimensionality, you should set a custom view to be the default.
  • output_columns: This is an array of JSON objects, and each object represents a column in your database. You can simply copy/paste columns from your auto_view into your custom view.

3. Publish the custom view using the API.

Once you’ve built the JSON for your custom view, you can publish it with a POST call to https://{hostname}/config/v1/orgs/{org_id}/projects/{project_id}/tables/{table_id}/views/. Simply paste the JSON for your custom view in the body of the API call. You can then verify that the view has been created with a GET call to https://{hostname}/config/v1/orgs/{org_id}/projects/{project_id}/tables/{table_id}/views/.

4. Using custom views in queries.

If the custom view isn’t the default, you need to specify the view you’re using in the query:

Note that the format is project_name.table_name#view_name. You can make the query more selective by specifying columns in the view instead of using *.

Important note: The ... in this query example represents the rest of the query. Custom views will reduce the dimensionality of a query response, but not the cardinality. You should never make queries on large datasets without predicates. In the case of Hydrolix, your query should at a minimum include a time filter for optimal performance.

If the custom view is the default, you don’t have to specify the view. All queries that don’t specify a view will automatically use the default. If you do need access to all of the columns on the table, you can specify the auto_view like this:

Summary

Using custom views is a best practice, especially if your tables have high dimensionality. As your data sets accumulate data gravity, more teams will likely rely on that data, making it even more important to curate data slices for each team. By doing so, you can maximize the value of your data and reduce wasteful full table scans. And remember, for high dimensionality tables, the auto_view shouldn’t be the default. Even if all of your teams are using the same data, you should still create a default custom view that only queries data that’s used on a daily basis.

Next Steps

Custom views are just one way to reduce cloud costs and apply FinOps best practices. Learn how to use precision query scaling and query circuit breakers to reduce wasteful query compute.

Learn more about Hydrolix and contact us for a POC.