Big data can be challenging to scale and even more challenging to query. Without the right tools, you’ll likely run into performance issues like query timeouts, making it difficult or even impossible to analyze your data. Even if you have the right tools in place, it can be overwhelming to analyze your data for patterns and anomalies—and the bigger your data gets, the more overwhelming the challenge.
Hydrolix is purpose-built for storing large volumes of data at minimal cost, with extremely fast querying no matter how large your dataset is. And now, with the Hydrolix Spark Connector, you can create Apache Spark jobs that directly query your Hydrolix data with Spark SQL, Scala, and Python, allowing you to use advanced analytics tools like MLlib and GraphX to better understand your data.
In this blog post, you’ll learn how to build, configure, and run the Hydrolix Spark Connector using Databricks. If you aren’t using Hydrolix yet, sign up now for a free trial. If you’d like to take a look at the source code for the connector, which is open source, see the Hydrolix Spark Connector on GitHub.
Table of Contents
- Building the Hydrolix Spark Connector
- Configuring the Hydrolix Spark Connector
- Deploying on Databricks
- Uploading and Installing the Hydrolix Spark Connector
- Using the Hydrolix Spark Connector With a Spark Notebook
- Next Steps
Building the Hydrolix Spark Connector
To install the Hydrolix Spark Connector, you’ll first need to make sure you have the following installed:
Run the following commands to download and build the Hydrolix Spark Connector:
git clone git@github.com:hydrolix/spark-connector.git hydrolix-spark-connector
cd hydrolix-spark-connector
sbt -J-Xmx4G assembly
The sbt command often needs more memory than the default. The -J-Xmx4G flag raises the JVM’s maximum heap size to 4 GB for this command.
This will produce the following file:
./target/scala-2.12/hydrolix-spark-connector-assembly-1.1.1-SNAPSHOT.jar. This file is specified as one of the inputs when starting the Spark shell.
Configuring the Hydrolix Spark Connector
The Hydrolix Spark Connector requires the following configuration parameters. These parameters can be specified in the Databricks UI when creating a cluster. Note that the ‘…’ prefix in these option names should be replaced with spark.sql.catalog.hydrolix.
| Option name | Value | Description |
|---|---|---|
| … | io.hydrolix.spark.connector.HdxTableCatalog | Fully-qualified name of the Scala class to instantiate when the hydrolix catalog is selected |
| …api_url | https://<hdx-cluster>/config/v1/ | API URL of your Hydrolix cluster |
| …jdbc_url | jdbc:clickhouse:tcp://<hdx-cluster>:9440/_local?ssl=true (if this URL doesn’t work, you can also try jdbc:clickhouse://<hdx-cluster>:8088/_local?ssl=true) | JDBC URL of your Hydrolix cluster |
| …username | <user name> | Hydrolix cluster username |
| …password | <password> | Hydrolix cluster password |
| …cloud_cred_1 | <base64 or AWS access key ID> | First cloud credential. For AWS, this is an AWS access key ID. For Google Cloud, this is the contents of the Google Cloud service account key file, compressed with gzip and then encoded as base64. |
| …cloud_cred_2 | <AWS secret> | Second cloud credential. Only needed for AWS, not Google Cloud. |
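As a rough illustration of how these options fit together, the following Python sketch builds the full set of keys for one cluster and renders them as whitespace-delimited name-value pairs. The helper names `hydrolix_spark_conf` and `as_spark_config_text`, and the example host and credentials, are hypothetical and not part of the connector. The key prefix and URL patterns are taken from this section.

```python
from typing import Dict, Optional

# Prefix used by the connector's option names (it replaces the '…' placeholder).
PREFIX = "spark.sql.catalog.hydrolix"


def hydrolix_spark_conf(
    hdx_host: str,
    username: str,
    password: str,
    cloud_cred_1: str,
    cloud_cred_2: Optional[str] = None,  # AWS only; leave as None on Google Cloud
) -> Dict[str, str]:
    """Build the connector's Spark settings for one Hydrolix cluster."""
    conf = {
        PREFIX: "io.hydrolix.spark.connector.HdxTableCatalog",
        f"{PREFIX}.api_url": f"https://{hdx_host}/config/v1/",
        f"{PREFIX}.jdbc_url": f"jdbc:clickhouse:tcp://{hdx_host}:9440/_local?ssl=true",
        f"{PREFIX}.username": username,
        f"{PREFIX}.password": password,
        f"{PREFIX}.cloud_cred_1": cloud_cred_1,
    }
    if cloud_cred_2 is not None:
        conf[f"{PREFIX}.cloud_cred_2"] = cloud_cred_2
    return conf


def as_spark_config_text(conf: Dict[str, str]) -> str:
    """Render one whitespace-delimited 'name value' pair per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = hydrolix_spark_conf(
        "hdx.example.com", "user@example.com", "example-password",
        "EXAMPLE_ACCESS_KEY_ID", "example-aws-secret",
    )
    print(as_spark_config_text(example))
```

Keeping the prefix in one constant makes it harder to mistype a key, and the optional last argument mirrors the fact that the second credential is only needed on AWS.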
Deploying on Databricks
The Hydrolix Spark Connector is tested and supported on Databricks Cloud. Other cloud deployment options like AWS Elastic MapReduce (EMR) or Google Cloud Dataproc may work without code changes, but haven’t been validated yet.
To deploy on Databricks, you’ll first need to create or choose a Databricks workspace for the deployment.
Configuring Your Spark Cluster
Next, you’ll create a Spark cluster in your Databricks workspace and set the following configuration:
- Policy: Unrestricted
- Access Mode: No Isolation Shared
- Databricks Runtime Version: Should be version 13 or later
The next image shows how the configuration should look.
Next, you’ll need to set some additional configuration parameters. In the Advanced Options section, open the Spark tab as shown in the following image.
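You’ll need to set the following Spark configuration parameters:
- spark.sql.catalog.hydrolix
- spark.sql.catalog.hydrolix.api_url
- spark.sql.catalog.hydrolix.jdbc_url
- spark.sql.catalog.hydrolix.username
- spark.sql.catalog.hydrolix.password
- spark.sql.catalog.hydrolix.cloud_cred_1
- spark.sql.catalog.hydrolix.cloud_cred_2 (only for AWS)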
You can set the Hydrolix Spark Connector’s configuration parameters in the Advanced Options section as name-value pairs delimited by whitespace. If you prefer, you can configure them in a notebook using spark.conf.set(<key>, <value>), which allows you to use Databricks Secrets.
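The notebook approach might look something like the sketch below, which copies secrets into the matching connector options. The helper `apply_secret_settings`, the secret key names, and the `hydrolix` scope are all hypothetical. The setter and secret lookup are passed in as callables so the helper can be tried outside Databricks. In a notebook they would be `spark.conf.set` and `dbutils.secrets.get`.

```python
from typing import Callable, Dict, List

PREFIX = "spark.sql.catalog.hydrolix"

# Hypothetical secret key names mapped to the connector option they fill.
SECRET_KEYS: Dict[str, str] = {
    "hdx-username": "username",
    "hdx-password": "password",
    "hdx-cloud-cred-1": "cloud_cred_1",
}


def apply_secret_settings(
    conf_set: Callable[[str, str], None],
    get_secret: Callable[[str, str], str],
    scope: str,
) -> List[str]:
    """Copy each secret into its connector option and return the keys that were set."""
    applied = []
    for secret_key, option in SECRET_KEYS.items():
        key = f"{PREFIX}.{option}"
        conf_set(key, get_secret(scope, secret_key))
        applied.append(key)
    return applied


# In a Databricks notebook, where `spark` and `dbutils` are predefined:
#
# apply_secret_settings(
#     spark.conf.set,
#     lambda scope, key: dbutils.secrets.get(scope=scope, key=key),
#     scope="hydrolix",  # hypothetical secret scope name
# )
```

Reading credentials from a secret scope keeps them out of the cluster configuration text and out of the notebook source.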
Credentials With Google Cloud Platform
If you’re using Google Cloud with Databricks, set the spark.sql.catalog.hydrolix.cloud_cred_1 parameter to the contents of your Google Cloud service account key file, compressed with gzip and then encoded as base64.
You do not need spark.sql.catalog.hydrolix.cloud_cred_2 with Google Cloud Storage.
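One way to produce the gzipped, base64-encoded value is with Python’s standard library, as in this minimal sketch. The function names and the key file path are illustrative assumptions. The decode helper is included only to check that the encoding round-trips.

```python
import base64
import gzip


def encode_gcp_key(key_json: bytes) -> str:
    """Gzip the service account key bytes, then base64-encode the result."""
    return base64.b64encode(gzip.compress(key_json)).decode("ascii")


def decode_gcp_key(encoded: str) -> bytes:
    """Reverse encode_gcp_key; handy for verifying the value before using it."""
    return gzip.decompress(base64.b64decode(encoded))


# Example usage with a hypothetical key file path:
#
# with open("service-account-key.json", "rb") as f:
#     print(encode_gcp_key(f.read()))
```

The printed string is the value to use for the first cloud credential.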
Credentials With AWS
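When using Databricks in AWS, set your cloud credentials as follows:
- spark.sql.catalog.hydrolix.cloud_cred_1: your AWS access key ID
- spark.sql.catalog.hydrolix.cloud_cred_2: your AWS secret access key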
Setting the JNAME Environment Variable
Next, to enable the use of JDK 11, set the JNAME environment variable to zulu11-ca-amd64, as shown in the following image.
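Once the cluster is running, a quick check from a notebook cell can confirm the variable took effect. This sketch assumes that cluster environment variables are visible to the notebook’s Python process. The helper name `jname_selects_jdk11` is hypothetical.

```python
import os
from typing import Mapping

# Value of JNAME that selects JDK 11, as described above.
EXPECTED_JNAME = "zulu11-ca-amd64"


def jname_selects_jdk11(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when JNAME is set to the JDK 11 value."""
    return env.get("JNAME") == EXPECTED_JNAME


if __name__ == "__main__":
    print("JNAME selects JDK 11:", jname_selects_jdk11())
```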
Uploading and Installing the Hydrolix Spark Connector
In the Libraries tab for the Spark cluster, select Install new as shown in the next image.
Make sure the Upload and JAR options are selected as shown in the following image.
In your local file manager, navigate to the target/scala-2.12 subfolder of the hydrolix-spark-connector source tree, and drag hydrolix-spark-connector-assembly-1.1.1-SNAPSHOT.jar into the Drop JAR here window that’s shown in the previous image. Don’t select anything while it’s uploading or you’ll have to upload the file again. After the upload is finished, you’ll see a green checkmark next to the file as shown in the next image.
Once the upload is finished, select Install and wait a few minutes while the cluster restarts. You can now start analyzing your Hydrolix data in Spark!
Using the Hydrolix Spark Connector With a Spark Notebook
After you have configured your cluster, you can use the Hydrolix Spark Connector in a Spark notebook.
To begin using the connector with a Spark notebook, you’ll use one of the two commands depending on your use case:
- SQL fragment:
- Python or Scala fragment:
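As a sketch of the Python case, assuming the catalog is registered under the name hydrolix (as the spark.sql.catalog.hydrolix configuration keys imply), selecting it might look like the following. The helper `use_hydrolix_catalog` is hypothetical. It only wraps Spark SQL’s USE statement, which in SQL would be written directly as USE hydrolix.

```python
def use_hydrolix_catalog(spark, catalog: str = "hydrolix"):
    """Make `catalog` the current catalog so later table names resolve against it."""
    return spark.sql(f"USE {catalog}")


# In a Databricks notebook, where `spark` is predefined:
#
# use_hydrolix_catalog(spark)
```

After this call, a name such as hydrolix_project.hydrolix_table can be referenced without repeating the catalog name.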
You can then use your Spark notebook to run a SQL query or use the DataFrame API. Here’s an example SQL query:
SELECT
  count(*) AS count
FROM hydrolix_project.hydrolix_table
WHERE
  my_timestamp_column BETWEEN '2023-05-10T00:00:00.000Z' AND '2023-05-10T00:00:01.000Z'
And here’s the same query with Python and the DataFrame API.
from pyspark.sql.functions import col, desc, count
my_table = table("hydrolix_project.hydrolix_table")
ts = col("my_timestamp_column")
# Keep only rows in the one-second window, then count them
sample_df = my_table.filter(ts.between('2023-05-10T00:00:00.000Z', '2023-05-10T00:00:01.000Z')) \
    .agg(count("*").alias("count"))
Next Steps
After you’ve set up the connection between Apache Spark and Hydrolix, you’ll be able to take advantage of all of the data processing tools in the Spark framework and ecosystem to analyze your data. Now you can easily use tools like GraphX, MLlib, and others to better understand how your systems are working.
If you don’t have Hydrolix yet, sign up for a free trial today. With Hydrolix, you get industry-leading data compression, lightning-fast queries, and unmatched scalability for your observability data. Learn more about Hydrolix.