Topic Overview

Databricks provides multiple cluster types and compute options tailored to different workloads. Understanding when to use each is critical both for the exam and for building cost-efficient data pipelines. The main distinction is between All Purpose Clusters (interactive, long-running), Job Clusters (ephemeral, scoped to a single job), and SQL Warehouses (optimized for SQL and BI workloads). Beyond these, Serverless Compute removes the need for manual cluster management entirely.

All Purpose Clusters are created manually through the UI or via the Clusters API and persist until they are explicitly terminated (or until an auto-termination timeout fires, if one is configured). They support multiple users and multiple notebooks running concurrently, making them ideal for interactive development, ad hoc analysis, and exploratory work. You pay for compute for as long as the cluster is up, whether or not code is actively running.
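As a sketch, a minimal all-purpose cluster definition for the Clusters API `clusters/create` endpoint could be assembled like this in Python. The field names follow the public Clusters API; the concrete values (node type, Spark version, autotermination timeout) are illustrative choices, not requirements:

```python
def build_all_purpose_cluster(name: str, min_workers: int = 1, max_workers: int = 4) -> dict:
    """Assemble a Clusters API 2.1 create payload for an interactive cluster.

    Field names follow the public Clusters API; the concrete values
    (node type, Spark version, autotermination) are illustrative.
    """
    return {
        "cluster_name": name,
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        # autoscale and num_workers are mutually exclusive; autoscaling
        # lets the cluster grow and shrink with interactive demand.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Shut the cluster down after 60 idle minutes so you stop paying
        # for compute nobody is using.
        "autotermination_minutes": 60,
    }

payload = build_all_purpose_cluster("dev-cluster")
```

Because an all-purpose cluster bills until it is stopped, setting `autotermination_minutes` is the main lever for limiting idle cost.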

Job Clusters, by contrast, are created automatically when a job starts and terminated as soon as the job finishes. They are cost-effective for production workloads because you pay only for the compute time the job actually runs. A Job Cluster's configuration is defined inside the job specification, not independently, and a Job Cluster cannot be shared across multiple jobs: each job run gets its own ephemeral cluster.
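To see why this matters for cost, here is a back-of-envelope comparison in Python. The DBU rates below are purely illustrative placeholders, not official Databricks pricing; the point is only that a job cluster is billed for the job's runtime, while an all-purpose cluster keeps accruing cost for as long as it stays up:

```python
def daily_compute_cost(hours_running: float, dbus_per_hour: float, dollars_per_dbu: float) -> float:
    """Cost of one day of cluster uptime. Rates are illustrative, not Databricks pricing."""
    return hours_running * dbus_per_hour * dollars_per_dbu

# Hypothetical cluster consuming 8 DBUs/hour. Compare an all-purpose
# cluster left up for a 10-hour workday against a job cluster that
# exists only for the 2-hour job run.
all_purpose_cost = daily_compute_cost(10, 8, 0.55)  # illustrative all-purpose rate
job_cluster_cost = daily_compute_cost(2, 8, 0.15)   # illustrative jobs-compute rate
```

Even with made-up rates, the shape of the result holds: ephemeral job clusters avoid paying for idle hours, and jobs compute is typically billed at a lower DBU rate than all-purpose compute.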

SQL Warehouses are specialized compute endpoints optimized for SQL queries and BI tools. They come in three types (Classic, Pro, and Serverless) and use Photon, Databricks' native vectorized query engine, to accelerate SQL workloads. Serverless Compute is a fully managed option in which Databricks provisions and scales the compute automatically, providing near-instant startup with no configuration overhead.
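As a hedged sketch of how a SQL Warehouse is driven programmatically: the SQL Statement Execution API takes a SQL statement plus the ID of the warehouse to run it on. The warehouse ID and table name below are made-up placeholders:

```python
def build_statement_request(warehouse_id: str, statement: str) -> dict:
    """Payload for POST /api/2.0/sql/statements (SQL Statement Execution API).

    warehouse_id is a placeholder; wait_timeout asks the API to block
    briefly for short queries before falling back to asynchronous polling.
    """
    return {
        "warehouse_id": warehouse_id,
        "statement": statement,
        "wait_timeout": "30s",
    }

# Placeholder warehouse ID and an example query.
request_body = build_statement_request(
    "abc123def456",
    "SELECT COUNT(*) FROM samples.nyctaxi.trips",
)
```

BI tools typically connect through JDBC/ODBC instead, but they target the same warehouse endpoint.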

Key Concepts

Code Examples

All Purpose Cluster Configuration (JSON):

{
  "cluster_name": "my-interactive-cluster",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "driver_node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 1,
    "max_workers": 4
  },
  "autotermination_minutes": 60,
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "first_on_demand": 1,
    "spot_bid_price_percent": 70
  },
  "init_scripts": [
    {
      "s3": {
        "destination": "s3://my-bucket/scripts/install-packages.sh"
      }
    }
  ],
  "data_security_mode": "USER_ISOLATION",
  "runtime_engine": "PHOTON"
}
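A configuration like the one above is submitted to the Clusters API with an authenticated POST. The sketch below builds the HTTP request with the Python standard library but does not actually send it; the workspace URL and token are placeholders you would supply (send with `urllib.request.urlopen(req)`):

```python
import json
import urllib.request

def clusters_create_request(host: str, token: str, spec: dict) -> urllib.request.Request:
    """Build a POST request for the Clusters API 2.1 create endpoint.

    host and token are placeholders; pass the resulting Request object
    to urllib.request.urlopen() to actually create the cluster.
    """
    return urllib.request.Request(
        url=f"{host}/api/2.1/clusters/create",
        data=json.dumps(spec).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = clusters_create_request(
    "https://example.cloud.databricks.com",  # placeholder workspace URL
    "dapi-REDACTED",                         # placeholder personal access token
    {
        "cluster_name": "my-interactive-cluster",
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
)
```

In practice the Databricks SDK or CLI wraps this call, but the underlying request has exactly this shape.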

Job Cluster Configuration (task block within a Jobs API job specification):

{
  "task_key": "process_data",
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 3,
    "aws_attributes": {
      "availability": "SPOT_WITH_FALLBACK",
      "first_on_demand": 1
    }
  },
  "spark_python_task": {
    "python_file": "dbfs:/scripts/etl.py"
  }
}
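A task block like the one above lives inside a full job specification. A minimal Jobs API 2.1 `jobs/create` payload wrapping one such task might look like this sketch (the job name, node type, and script path are illustrative):

```python
def build_job_spec(job_name: str, task: dict) -> dict:
    """Wrap a single task in a Jobs API 2.1 create payload.

    Each task that carries a new_cluster block gets its own ephemeral
    job cluster, created when the task starts and terminated when it ends.
    """
    return {
        "name": job_name,
        "max_concurrent_runs": 1,
        "tasks": [task],
    }

# Illustrative task: values mirror a typical Python ETL task definition.
task = {
    "task_key": "process_data",
    "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 3,
    },
    "spark_python_task": {"python_file": "dbfs:/scripts/etl.py"},
}
job_spec = build_job_spec("nightly-etl", task)
```

Because the cluster definition is nested under the task via `new_cluster`, there is no standalone cluster object to manage: the job owns its compute for exactly the lifetime of the run.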