HYDROLIX BLOG

Ponderings, insights and industry updates

Picture of sparkler

Analyze your Hydrolix Data with Apache Spark

Published: September 11, 2023

Updated: September 13, 2023

Author: Alex Cruise

Big data can be challenging to scale and even more challenging to query. Without the right tools, you’ll likely run into performance issues like query timeouts, making it difficult or even impossible to analyze your data. Even if you have the right tools in place, it can be overwhelming to analyze your data for patterns and anomalies—and the bigger your data gets, the more overwhelming the challenge.

Hydrolix is purpose-built to store large volumes of data at minimal cost and to query that data quickly no matter how large your dataset. And now, with the Hydrolix Spark Connector, you can create Apache Spark jobs that query your Hydrolix data directly with Spark SQL, Scala, and Python, so you can use advanced analytics tools like MLlib and GraphX to better understand your data.

In this blog post, you’ll learn how to build, configure, and run the Hydrolix Spark Connector using Databricks. If you aren’t using Hydrolix yet, sign up now for a free trial. The connector is open source; to look at its source code, see the Hydrolix Spark Connector on GitHub.

Building the Hydrolix Spark Connector

To install the Hydrolix Spark Connector, you’ll first need to make sure you have the following installed:

Run the following commands to download and build the Hydrolix Spark Connector:

Because the sbt command often needs more memory than the default, the -J-Xmx4G flag passes -Xmx4G to the JVM, which raises its maximum heap size to 4 GB for the build.

This will produce the following file: ./target/scala-2.12/hydrolix-spark-connector-assembly-1.1.1-SNAPSHOT.jar. This file is specified as one of the inputs when starting the Spark shell.

Configuring the Hydrolix Spark Connector

The Hydrolix Spark Connector requires the following configuration parameters. These parameters can be specified in the Databricks UI when creating a cluster. Note that the ‘…’ prefix in these option names stands for spark.sql.catalog.hydrolix followed by a dot, so …api_url is written in full as spark.sql.catalog.hydrolix.api_url.

Option name | Value | Description
spark.sql.catalog.hydrolix | io.hydrolix.spark.connector.HdxTableCatalog | Fully-qualified name of the Scala class to instantiate when the hydrolix catalog is selected
…api_url | https://<hdx-cluster>/config/v1/ | API URL of your Hydrolix cluster
…jdbc_url | jdbc:clickhouse:tcp://<hdx-cluster>:9440/_local?ssl=true | JDBC URL of your Hydrolix cluster. If this URL doesn’t work, you can also try jdbc:clickhouse://<hdx-cluster>:8088/_local?ssl=true
…username | <user name> | Hydrolix cluster username
…password | <password> | Hydrolix cluster password
…cloud_cred_1 | <base64 or AWS access key ID> | First cloud credential. For AWS, this is an AWS access key ID. For Google Cloud, this is the contents of the Google Cloud service account key file, compressed with gzip and then encoded as base64.
…cloud_cred_2 | <AWS secret> | Second cloud credential. This is only needed for AWS, not Google Cloud.
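
To make the prefix rule concrete, here is a minimal Python sketch that expands the short option names from the table into their full Spark names. The helper name hydrolix_conf and the host hdx.example.com are made up for illustration; only the option names and the catalog class come from the table above.

```python
# Prefix shared by every Hydrolix catalog option (from the table above).
CATALOG = "spark.sql.catalog.hydrolix"


def hydrolix_conf(api_url, jdbc_url, username, password,
                  cloud_cred_1, cloud_cred_2=None):
    """Return the connector options keyed by their full Spark names."""
    conf = {
        CATALOG: "io.hydrolix.spark.connector.HdxTableCatalog",
        f"{CATALOG}.api_url": api_url,
        f"{CATALOG}.jdbc_url": jdbc_url,
        f"{CATALOG}.username": username,
        f"{CATALOG}.password": password,
        f"{CATALOG}.cloud_cred_1": cloud_cred_1,
    }
    # cloud_cred_2 is only needed on AWS, so it is omitted when not given.
    if cloud_cred_2 is not None:
        conf[f"{CATALOG}.cloud_cred_2"] = cloud_cred_2
    return conf


# Example with placeholder values; "hdx.example.com" is a made-up host name.
example = hydrolix_conf(
    api_url="https://hdx.example.com/config/v1/",
    jdbc_url="jdbc:clickhouse:tcp://hdx.example.com:9440/_local?ssl=true",
    username="<user name>",
    password="<password>",
    cloud_cred_1="<gcpKeyBase64>",
)
for key, value in example.items():
    print(key, value)
```

Passing a sixth argument adds the cloud_cred_2 entry for AWS deployments.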

Deploying on Databricks

The Hydrolix Spark Connector is tested and supported on Databricks Cloud. Other cloud deployment options like AWS Elastic MapReduce (EMR) or Google Cloud Dataproc may work without code changes, but haven’t been validated yet.

To deploy on Databricks, you’ll first need to create or choose a Databricks workspace for the deployment.

Configuring Your Spark Cluster

Next, you’ll create a Spark cluster in your Databricks workspace and set the following configuration:

  • Policy: Unrestricted
  • Access Mode: No Isolation Shared
  • Databricks Runtime Version: 13 or later

The next image shows how the configuration should look.

Specify policy, Access mode, and Databricks Runtime Version when creating a cluster.

Next, you’ll need to set some additional configuration parameters. In the Advanced Options section, open the Spark tab as shown in the following image.

Specify Spark config and environment variables in the Advanced Options section.

You’ll need to set the following Spark configuration parameters:

  • spark.sql.catalog.hydrolix 
  • spark.sql.catalog.hydrolix.api_url 
  • spark.sql.catalog.hydrolix.jdbc_url
  • spark.sql.catalog.hydrolix.username 
  • spark.sql.catalog.hydrolix.password 
  • spark.sql.catalog.hydrolix.cloud_cred_1 
  • spark.sql.catalog.hydrolix.cloud_cred_2 (only for AWS)

You can set the Hydrolix Spark Connector’s configuration parameters in the Advanced Options section as name-value pairs, one per line, with the name and value separated by whitespace. If you prefer, you can configure them in a notebook using spark.conf.set(<key>, <value>) instead, which lets you read credentials from Databricks Secrets.
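
The two approaches can be sketched in Python as follows. The helper names as_spark_config_text and apply_conf, the host hdx.example.com, and the secret scope and key names in the comments are all hypothetical; spark.conf.set and dbutils.secrets.get are the standard calls available in a Databricks notebook.

```python
def as_spark_config_text(conf):
    """Format options as 'name value' lines for the Spark config box."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


def apply_conf(spark, conf):
    """Set each option on a live session, for example from a notebook cell."""
    for key, value in conf.items():
        spark.conf.set(key, value)


# Placeholder values; "hdx.example.com" is a made-up host name.
conf = {
    "spark.sql.catalog.hydrolix.api_url": "https://hdx.example.com/config/v1/",
    "spark.sql.catalog.hydrolix.password": "<password>",
}
print(as_spark_config_text(conf))

# In a Databricks notebook, `spark` and `dbutils` are predefined, so the
# password could come from a secret instead of a literal. The scope and key
# names below are hypothetical:
#
#   conf["spark.sql.catalog.hydrolix.password"] = dbutils.secrets.get(
#       scope="hydrolix", key="password")
#   apply_conf(spark, conf)
```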

Credentials With Google Cloud Platform

If you’re using Google Cloud with Databricks, set the spark.sql.catalog.hydrolix.cloud_cred_1 parameter to the contents of your service account key file, gzipped and then base64-encoded:

spark.sql.catalog.hydrolix.cloud_cred_1 <gcpKeyBase64>

You do not need spark.sql.catalog.hydrolix.cloud_cred_2 with Google Cloud Storage.
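
One way to produce the <gcpKeyBase64> value is sketched below in Python using only the standard library. It assumes the connector expects standard (not URL-safe) base64 with no line breaks, which the post does not specify; the function names are made up for illustration.

```python
import base64
import gzip


def encode_gcp_key(key_bytes: bytes) -> str:
    """Gzip the key file contents, then base64-encode the result."""
    return base64.b64encode(gzip.compress(key_bytes)).decode("ascii")


def encode_gcp_key_file(path: str) -> str:
    """Read a service account key file and encode it for cloud_cred_1."""
    with open(path, "rb") as f:
        return encode_gcp_key(f.read())


# Round-trip check on a stand-in payload (not a real key).
sample = b'{"type": "service_account"}'
encoded = encode_gcp_key(sample)
assert gzip.decompress(base64.b64decode(encoded)) == sample
```

Calling encode_gcp_key_file with the path to your downloaded key file returns a single-line string that can be pasted as the parameter value.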

Credentials With AWS

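When using Databricks in AWS, set the first cloud credential to your AWS access key ID and the second to your AWS secret:

spark.sql.catalog.hydrolix.cloud_cred_1 <AWS access key ID>
spark.sql.catalog.hydrolix.cloud_cred_2 <AWS secret>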

Setting the JNAME Environment Variable

Next, to enable the use of JDK 11, set the JNAME environment variable to zulu11-ca-amd64, as shown in the following image.

Example Spark configuration settings in the Databricks UI.

Uploading and Installing the Hydrolix Spark Connector

In the Libraries tab for the Spark cluster, select Install new as shown in the next image.

Libraries tab shows Install new button.

Make sure the Upload and JAR options are selected as shown in the following image.

In Install library window, Upload and JAR radio buttons are selected.

In your local file manager, navigate to the target/scala-2.12 subfolder of the hydrolix-spark-connector source tree, and move hydrolix-spark-connector-assembly-1.1.1-SNAPSHOT.jar into the Drop JAR here window that’s shown in the previous image. Don’t select anything while it’s uploading or you’ll have to upload the file again. After the upload is finished, you’ll see a green checkmark next to the file as shown in the next image.

File has a green checkmark to show that it has been successfully uploaded.

Once the upload is finished, select Install and wait a few minutes while the cluster restarts. You can now start analyzing your Hydrolix data in Spark!

Using the Hydrolix Spark Connector With a Spark Notebook

After you have configured your cluster, you can use the Hydrolix Spark Connector in a Spark notebook.

To begin using the connector with a Spark notebook, first select the hydrolix catalog with one of the following two commands, depending on your use case:

  • In a SQL fragment: use hydrolix
  • In a Python or Scala fragment: sql("use hydrolix")

You can then use your Spark notebook to make a SQL query or use the DataFrame API. Here’s an example SQL query:
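
As a stand-in sketch, the function below issues a SQL query from a Python cell with spark.sql. The table name my_project.my_table and the status column are hypothetical; replace them with a table and column from your own Hydrolix cluster. The function takes the session as an argument, so in a Databricks notebook you would pass the predefined spark object.

```python
def count_by_status(spark, table="my_project.my_table"):
    """Count rows per status value with Spark SQL (hypothetical schema)."""
    # Select the Hydrolix catalog before referring to its tables.
    spark.sql("use hydrolix")
    query = (
        "SELECT status, count(*) AS n "
        f"FROM {table} "
        "GROUP BY status "
        "ORDER BY n DESC"
    )
    return spark.sql(query)


# In a notebook cell:
#   count_by_status(spark).show()
```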

And here’s the same query with Python and the DataFrame API.
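
A DataFrame API sketch of a per-status row count might look like the following. As before, my_project.my_table and the status column are hypothetical names, and the function expects the notebook's predefined spark session as its argument.

```python
def count_by_status_df(spark, table="my_project.my_table"):
    """Count rows per status value with the DataFrame API (hypothetical schema)."""
    # Select the Hydrolix catalog so the table name resolves against it.
    spark.sql("use hydrolix")
    return (
        spark.table(table)
        .groupBy("status")
        .count()                          # adds a column named "count"
        .withColumnRenamed("count", "n")
        .orderBy("n", ascending=False)
    )


# In a notebook cell:
#   count_by_status_df(spark).show()
```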

Conclusion

After you’ve set up the connection between Apache Spark and Hydrolix, you’ll be able to take advantage of the data processing tools in the Spark framework and ecosystem to analyze your data. Tools like GraphX and MLlib can then help you better understand how your systems are working.

Next Steps

If you don’t have Hydrolix yet, sign up for a free trial today. With Hydrolix, you get industry-leading data compression, lightning-fast queries, and unmatched scalability for your observability data. Learn more about Hydrolix.

Sparkler photo by Jez Timms on Unsplash
