Auto Loader is a Databricks feature that automatically detects and ingests new data files as they arrive in cloud storage. It works with S3, ADLS Gen2, GCS, and Unity Catalog Volumes, making it well suited to scalable pipelines that must handle thousands or millions of files without manual intervention. Auto Loader uses checkpointing to track which files it has already processed and provides exactly-once ingestion guarantees, which is critical for pipelines where duplicate records would corrupt downstream tables and reports.
What makes Auto Loader powerful is its ability to work in two distinct modes for discovering new files: directory listing (the default, which simply scans the input directory) and file notification (which subscribes to cloud storage events for real-time discovery at scale). The choice between these modes depends on your data volume, latency requirements, and the performance characteristics of your cloud storage. Auto Loader also handles schema inference and evolution automatically, so your pipelines can adapt as the structure of incoming data changes over time.
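Switching to file notification mode is a single option. A minimal sketch, assuming hypothetical bucket and schema paths; with `cloudFiles.useNotifications` set to `true`, Auto Loader sets up cloud event services (for example, S3 event notifications delivered through SQS) instead of re-listing the directory on every micro-batch:

```python
# Sketch: file notification mode (paths are placeholders).
# useNotifications=true makes Auto Loader consume cloud storage
# events rather than repeatedly listing the input directory,
# which scales better for high file-arrival rates.
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.useNotifications", "true") \
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/") \
    .load("s3://my-bucket/events/")
```

Directory listing mode needs no extra cloud permissions, which is why it is the default; notification mode pays off once a single listing of the directory becomes expensive.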
On the exam, you'll need to know which file formats Auto Loader supports (JSON, CSV, Parquet, Avro, ORC, Text, Binary, XML), when to use Auto Loader versus other ingestion methods like COPY INTO, and how to configure it for both batch and streaming scenarios. Auto Loader typically feeds into the Bronze layer of a Medallion Architecture pipeline, serving as the entry point for raw data.
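One exam-relevant pattern is running Auto Loader as an incremental batch job rather than a continuous stream. A minimal sketch, assuming hypothetical paths and table names; `trigger(availableNow=True)` processes every file that has arrived since the last run and then stops, giving COPY INTO-like scheduled behavior while keeping Auto Loader's checkpointed file tracking:

```python
# Sketch: Auto Loader in incremental batch mode (paths are placeholders).
# availableNow=True drains all newly arrived files, then shuts the
# stream down -- suitable for jobs triggered on a schedule.
spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/") \
    .load("s3://my-bucket/orders/") \
    .writeStream \
    .trigger(availableNow=True) \
    .option("checkpointLocation", "/Volumes/catalog/checkpoint/") \
    .table("bronze_orders")
```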
# Basic Auto Loader read in streaming mode
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/") \
    .load("s3://my-bucket/data/")

df.writeStream \
    .outputMode("append") \
    .option("checkpointLocation", "/Volumes/catalog/checkpoint/") \
    .table("bronze_events")
The cloudFiles.schemaLocation parameter tells Auto Loader where to store the inferred schema. The checkpoint location tells Spark where to maintain state for the streaming job.
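Schema evolution behavior is controlled through a couple of related options. A minimal sketch, assuming hypothetical paths; `addNewColumns` (the default when the schema is inferred) fails the stream when a new column appears, records the updated schema in the schema location, and picks the column up on restart, while malformed values are captured in a rescued data column rather than dropped:

```python
# Sketch: schema evolution and rescued data (paths are placeholders).
# addNewColumns: new columns stop the stream once, are persisted to
# the schema location, and are included after restart.
# rescuedDataColumn: data that doesn't match the schema is preserved
# in this column instead of being silently discarded.
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .option("cloudFiles.rescuedDataColumn", "_rescued_data") \
    .load("s3://my-bucket/data/")
```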
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

# Define an explicit schema
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("timestamp", TimestampType())
])

# Read with the explicit schema (no inference needed)
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .schema(schema) \
    .load("s3://my-bucket/raw-data/")

df.writeStream \
    .outputMode("append") \
    .option("checkpointLocation", "/tmp/checkpoint/") \
    .table("bronze_customers")
Providing an explicit schema is a best practice in production: it makes the pipeline predictable and surfaces schema mismatches at read time, rather than letting malformed data slip silently into the Bronze table.
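A middle ground between full inference and a full explicit schema is schema hints. A minimal sketch, assuming hypothetical paths; you pin only the columns whose types matter and let Auto Loader infer the rest:

```python
# Sketch: schema hints (paths are placeholders).
# Columns named in schemaHints get the declared types; all other
# columns are still inferred, so the pipeline tolerates new fields
# while keeping the critical ones strongly typed.
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/") \
    .option("cloudFiles.schemaHints", "age INT, timestamp TIMESTAMP") \
    .load("s3://my-bucket/raw-data/")
```

This is useful when most of the payload is loosely structured but a few fields feed downstream joins or time-based logic and must not be inferred as strings.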