Big data can be challenging to scale and even more challenging to query. Without the right tools, you’ll likely run into performance issues like query timeouts, making it difficult or even impossible to analyze your data. Even if you have the right tools in place, it can be overwhelming to analyze your data for patterns and anomalies—and the bigger your data gets, the more overwhelming the challenge.
Hydrolix is purpose-built for storing large volumes of data at minimal cost and with extremely fast querying no matter how large your dataset. And now, with the Hydrolix Spark Connector, you can create Apache Spark jobs to directly query your Hydrolix data with Spark SQL, Scala, and Python, allowing you to use powerful advanced analytics tools like MLlib and GraphX to better understand your data.
In this blog post, you’ll learn how to build, configure, and run the Hydrolix Spark Connector using Databricks. If you aren’t using Hydrolix yet, sign up now for a free trial. If you’d like to take a look at the source code for the connector, which is open source, see the Hydrolix Spark Connector on GitHub.
Building the Hydrolix Spark Connector
To install the Hydrolix Spark Connector, you’ll first need to make sure you have the following installed:
- Java Development Kit (JDK) 11 or later
- SBT
Run the following commands to download and build the Hydrolix Spark Connector:
```bash
git clone git@github.com:hydrolix/spark-connector.git hydrolix-spark-connector
cd hydrolix-spark-connector
sbt -J-Xmx4G assembly
```
Because the sbt command will often need more memory than the default, the -J-Xmx4G flag tells the JVM to allot additional memory (4GB) to run the command.
This will produce the following file: ./target/scala-2.12/hydrolix-spark-connector-assembly-1.1.1-SNAPSHOT.jar. You’ll specify this file as one of the inputs when starting the Spark shell or, on Databricks, upload it to your cluster as described below.
Configuring the Hydrolix Spark Connector
The Hydrolix Spark Connector requires the following configuration parameters, which you can specify in the Databricks UI when creating a cluster. Note that the ‘…’ prefix in the option names below should be replaced with spark.sql.catalog.hydrolix (a full example follows the table).
| Option name | Value | Description |
| --- | --- | --- |
| … | io.hydrolix.spark.connector.HdxTableCatalog | Fully qualified name of the Scala class to instantiate when the hydrolix catalog is selected |
| …api_url | https://<hdx-cluster>/config/v1/ | API URL of your Hydrolix cluster |
| …jdbc_url | jdbc:clickhouse:tcp://<hdx-cluster>:9440/_local?ssl=true (if this URL doesn’t work, try jdbc:clickhouse://<hdx-cluster>:8088/_local?ssl=true) | JDBC URL of your Hydrolix cluster |
| …username | <user name> | Hydrolix cluster username |
| …password | <password> | Hydrolix cluster password |
| …cloud_cred_1 | <base64 or AWS access key ID> | First cloud credential. For AWS, this is an AWS access key ID. For Google Cloud, this is the contents of the Google Cloud service account key file, compressed with gzip and then encoded as base64. |
| …cloud_cred_2 | <AWS secret> | Second cloud credential. This is only needed for AWS, not Google Cloud. |
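To see how the ‘…’ prefix expands, here’s a rough Python sketch of a standalone Spark session configured with these options and the assembly JAR built earlier. Every value is a placeholder you’d replace with your own, and since Databricks is the only validated deployment target (see the next section), treat this purely as an illustration:

```python
from pyspark.sql import SparkSession

# Sketch only: all values below are placeholders. On Databricks you would set
# these same options in the cluster UI instead of in code.
spark = (
    SparkSession.builder
    .appName("hydrolix-spark-example")
    .config("spark.jars", "target/scala-2.12/hydrolix-spark-connector-assembly-1.1.1-SNAPSHOT.jar")
    .config("spark.sql.catalog.hydrolix", "io.hydrolix.spark.connector.HdxTableCatalog")
    .config("spark.sql.catalog.hydrolix.api_url", "https://my-hdx-cluster.example.com/config/v1/")
    .config("spark.sql.catalog.hydrolix.jdbc_url",
            "jdbc:clickhouse:tcp://my-hdx-cluster.example.com:9440/_local?ssl=true")
    .config("spark.sql.catalog.hydrolix.username", "me@example.com")
    .config("spark.sql.catalog.hydrolix.password", "my-password")
    .config("spark.sql.catalog.hydrolix.cloud_cred_1", "AKIA...")        # AWS access key ID (or base64 GCP key)
    .config("spark.sql.catalog.hydrolix.cloud_cred_2", "my-aws-secret")  # AWS only
    .getOrCreate()
)
```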
Deploying on Databricks
The Hydrolix Spark Connector is tested and supported on Databricks Cloud. Other cloud deployment options like AWS Elastic MapReduce (EMR) or Google Cloud Dataproc may work without code changes, but haven’t been validated yet.
To deploy on Databricks, you’ll first need to create or choose a Databricks workspace for the deployment.
Configuring Your Spark Cluster
Next, you’ll create a Spark cluster in your Databricks workspace and set the following configuration:
- Policy: Unrestricted
- Access Mode: No Isolation Shared
- Databricks Runtime Version: Should be version 13 or later
The next image shows how the configuration should look.

Next, you’ll need to set some additional configuration parameters. In the Advanced Options section, open the Spark tab as shown in the following image.

You’ll need to set the following Spark configuration parameters:
- spark.sql.catalog.hydrolix
- spark.sql.catalog.hydrolix.api_url
- spark.sql.catalog.hydrolix.jdbc_url
- spark.sql.catalog.hydrolix.username
- spark.sql.catalog.hydrolix.password
- spark.sql.catalog.hydrolix.cloud_cred_1
- spark.sql.catalog.hydrolix.cloud_cred_2 (only for AWS)
You can set the Hydrolix Spark Connector’s configuration parameters in the Advanced Options section as name-value pairs delimited by whitespace, or, if you prefer, you can configure them in a notebook using spark.conf.set(<key>, <value>), which allows you to use Databricks Secrets.
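For example, the following notebook cell reads the credentials from a Databricks secret scope instead of pasting them into the cluster configuration. This is a sketch: the scope and key names are assumptions, and spark and dbutils are provided automatically by Databricks notebooks.

```python
# Sketch: pull Hydrolix credentials from a Databricks secret scope.
# "hydrolix", "username", and "password" are placeholder scope/key names.
username = dbutils.secrets.get(scope="hydrolix", key="username")
password = dbutils.secrets.get(scope="hydrolix", key="password")

spark.conf.set("spark.sql.catalog.hydrolix.username", username)
spark.conf.set("spark.sql.catalog.hydrolix.password", password)
```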
Credentials With Google Cloud Platform
If you’re using Google Cloud with Databricks, set the hydrolix.cloud_cred_1 parameter to the base64-encoded, gzipped key value:
spark.sql.catalog.hydrolix.cloud_cred_1 <gcpKeyBase64>
You do not need spark.sql.catalog.hydrolix.cloud_cred_2 with Google Cloud Storage.
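If you need to produce that value, one way is a short Python snippet like the following sketch, which gzips the service account key file and then base64-encodes it (the file name is a placeholder):

```python
import base64
import gzip

# Compress the Google Cloud service account key file with gzip, then
# base64-encode it to get the value for cloud_cred_1.
# "service-account.json" is a placeholder path.
with open("service-account.json", "rb") as key_file:
    key_bytes = key_file.read()

gcp_key_base64 = base64.b64encode(gzip.compress(key_bytes)).decode("ascii")
print(gcp_key_base64)
```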
Credentials With AWS
When using Databricks in AWS, your cloud credentials should be set to the following:
```
spark.sql.catalog.hydrolix.cloud_cred_1 <AWS_ACCESS_KEY_ID>
spark.sql.catalog.hydrolix.cloud_cred_2 <AWS_SECRET_KEY>
```
Setting the JNAME Environment Variable
Next, to enable the use of JDK 11, set the JNAME environment variable to zulu11-ca-amd64, as shown in the following image.

Uploading and Installing the Hydrolix Spark Connector
In the Libraries tab for the Spark cluster, select Install new as shown in the next image.

Make sure the Upload and JAR options are selected as shown in the following image.

In your local file manager, navigate to the target/scala-2.12 subfolder of the hydrolix-spark-connector source tree, and drag hydrolix-spark-connector-assembly-1.1.1-SNAPSHOT.jar into the Drop JAR here window shown in the previous image. Don’t select anything while it’s uploading, or you’ll have to upload the file again. After the upload is finished, you’ll see a green checkmark next to the file, as shown in the next image.

Once the upload is finished, select Install and wait a few minutes while the cluster restarts. You can now start analyzing your Hydrolix data in Spark!
Using the Hydrolix Spark Connector With a Spark Notebook
After you have configured your cluster, you can use the Hydrolix Spark Connector in a Spark notebook.
To begin using the connector with a Spark notebook, you’ll use one of the two commands depending on your use case:
- SQL fragment: use hydrolix
- Python or Scala fragment: sql("use hydrolix") (see the example below)
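For example, a Python notebook cell that switches to the catalog and then looks around might look like the following sketch. Here, hydrolix_project is a placeholder, and the cell assumes the catalog exposes your Hydrolix projects and tables for listing:

```python
# Sketch: select the hydrolix catalog, then list what it exposes.
# "hydrolix_project" stands in for one of your own Hydrolix projects.
spark.sql("use hydrolix")
spark.sql("show schemas").show()
spark.sql("show tables in hydrolix_project").show()
```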
You can then use your Spark notebook to make a SQL query or use the DataFrame API. Here’s an example SQL query:
```sql
SELECT user_agent,
       count(*) AS count
  FROM hydrolix_project.hydrolix_table
 WHERE my_timestamp_column BETWEEN '2023-05-10T00:00:00.000Z' AND '2023-05-10T00:00:01.000Z'
 GROUP BY user_agent
 ORDER BY count DESC
```
And here’s the same query using Python and the DataFrame API.
1 2 3 4 5 6 7 8 9 10 |
from pyspark.sql.functions import col, desc, count my_table = table("hydrolix_project.hydrolix_table") ts = col("my_timestamp_column") sample_df = my_table.filter(ts.between('2023-05-10T00:00:00.000Z', '2023-05-10T00:00:01.000Z')) \ .groupBy(my_table.user_agent) \ .agg(count("*").alias("count")) \ .orderBy(desc("count")) \ .cache() |
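Because the DataFrame example ends with cache(), nothing runs until you ask for output; for instance, you might follow it with:

```python
# Trigger the query and display the most common user agents
sample_df.show(10, truncate=False)
```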
Conclusion
After you’ve set up the connection between Apache Spark and Hydrolix, you’ll be able to take advantage of all of the data processing tools in the Spark framework and ecosystem to analyze your data. Now you can easily use tools like GraphX, MLlib, and others to better understand how your systems are working.
Next Steps
If you don’t have Hydrolix yet, sign up for a free trial today. With Hydrolix, you get industry-leading data compression, lightning-fast queries, and unmatched scalability for your observability data. Learn more about Hydrolix.