Topic Overview

Databricks Connect lets you run Spark code from your local IDE (like VS Code, PyCharm, or IntelliJ) against a remote Databricks cluster. Instead of writing and testing everything inside a Databricks notebook, you can develop locally using your preferred tools, libraries, and debugging workflows while the actual computation happens on a cluster in your Databricks workspace.

This is particularly useful for teams that want to integrate Databricks into existing software engineering workflows. You get the full power of Spark without leaving your local development environment. Databricks Connect v2 (the current version) is built on top of Spark Connect and uses a thin client that sends queries to the cluster for execution. The local machine does not need Spark installed.

For the exam, the key thing to understand is when and why you would use Databricks Connect instead of working directly in notebooks. The decision comes down to three factors: preference for a local IDE, integration with CI/CD pipelines, and access to local debugging tools such as breakpoints and step-through debuggers.
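The CI/CD angle above can be sketched as a pattern: keep transformation logic in plain functions that accept and return DataFrames, separate from session construction, so the same code can be unit-tested in CI and run unchanged through Databricks Connect. The function names and the fallback logic below are illustrative assumptions, not a Databricks API.

```python
# Sketch: transformation logic kept separate from session creation,
# so it can be tested locally and run remotely via Databricks Connect.

def active_counts_by_region(df):
    # Defined locally; when df comes from a Databricks Connect session,
    # this executes on the remote cluster.
    return df.filter(df.status == "active").groupBy("region").count()

def get_session():
    # Hypothetical fallback: use Databricks Connect when the package is
    # installed (e.g. in production), else a local Spark session (e.g. in CI).
    try:
        from databricks.connect import DatabricksSession
        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        from pyspark.sql import SparkSession
        return SparkSession.builder.master("local[1]").getOrCreate()
```

Because `active_counts_by_region` depends only on the DataFrame passed in, a CI job can exercise it against a small local DataFrame before the same function runs against production tables on the cluster.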


Key Concepts

- Databricks Connect v2 is built on Spark Connect: a thin client on the local machine sends DataFrame and SQL operations to a remote cluster for execution.
- Code is written and debugged locally in an IDE; all heavy computation happens on the Databricks cluster.
- The local machine does not need Spark installed, only the databricks-connect client package.
- A session can be configured directly (host, token, cluster ID), via a configuration profile, or via environment variables.
- Typical motivations: local IDE workflows, CI/CD integration, and local debugging tools.

Code Examples

Setting up a Databricks Connect session (Python)

from databricks.connect import DatabricksSession

# Option 1: Configure directly
spark = DatabricksSession.builder.remote(
    host="https://<workspace-url>",
    token="<your-token>",
    cluster_id="<cluster-id>"
).getOrCreate()

# Option 2: Use a Databricks configuration profile
spark = DatabricksSession.builder.profile("DEFAULT").getOrCreate()

# Option 3: Use environment variables (DATABRICKS_HOST, DATABRICKS_TOKEN, etc.)
spark = DatabricksSession.builder.getOrCreate()

# Now use spark just like you would in a notebook
df = spark.read.table("my_catalog.my_schema.my_table")
df.show()
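Option 2 above reads a named profile from the Databricks configuration file (by default `~/.databrickscfg`). A minimal profile matching the values in Option 1 might look like this (placeholder values):

```ini
[DEFAULT]
host       = https://<workspace-url>
token      = <your-token>
cluster_id = <cluster-id>
```

Keeping credentials in the profile (or in environment variables, as in Option 3) avoids hard-coding tokens in source code.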

Reading and writing data through Databricks Connect

# Read from a Unity Catalog table
df = spark.read.table("catalog.schema.table_name")

# Transformations are defined locally but executed on the cluster
result = df.filter(df.status == "active").groupBy("region").count()

# Collect results back to the local machine as a pandas DataFrame
local_df = result.toPandas()
print(local_df)

# Write back to a table
result.write.mode("overwrite").saveAsTable("catalog.schema.aggregated_table")

Common Exam Scenarios

Scenario 1: Local IDE development