Topic Overview

The Medallion Architecture (also called the multi-hop architecture) is a data design pattern that organizes a lakehouse into three distinct layers, each serving a specific purpose in the data pipeline. The Bronze layer captures raw, unprocessed data exactly as it arrives from source systems. The Silver layer contains cleaned, deduplicated, and conformed data that has been validated and standardized. The Gold layer holds business-level aggregations, dimensional models, and analytics-ready datasets prepared for BI tools and reporting.

This architecture is a foundational concept for the Databricks Certified Data Engineer Associate exam because it reflects how real-world data pipelines are built in enterprise lakehouse environments. Each layer enforces data quality checks appropriate to its stage, making it easier to pinpoint where issues occur and to reprocess data when needed. The Medallion pattern works seamlessly with Databricks tools such as Auto Loader for ingestion, Lakeflow Spark Declarative Pipelines for transformation, and Unity Catalog for governance.

The beauty of Medallion is simplicity combined with power. You don't need complex orchestration to implement it. Start with a raw data ingestion layer, add cleanup logic in the middle, and expose clean aggregates at the end. Each layer becomes a Delta table in your Unity Catalog, creating a clear lineage for governance and auditing. You can support both batch and streaming workloads within the same architecture, adapting your pipeline to how data actually arrives.
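Since each layer becomes a table registered in Unity Catalog, one common convention is a dedicated schema per layer under a single catalog. A minimal sketch of that layout (the catalog and schema names below are illustrative, not from the source):

```sql
-- Illustrative Unity Catalog layout: one schema per Medallion layer
CREATE CATALOG IF NOT EXISTS sales;      -- hypothetical catalog name
CREATE SCHEMA IF NOT EXISTS sales.bronze;
CREATE SCHEMA IF NOT EXISTS sales.silver;
CREATE SCHEMA IF NOT EXISTS sales.gold;

-- Tables are then referenced with three-level names, e.g.:
-- sales.bronze.orders, sales.silver.orders, sales.gold.daily_revenue
```

Grouping by layer this way makes lineage and access control easy to reason about: analysts can be granted read access on the gold schema without ever touching raw bronze data.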


Key Concepts


Code Examples

Bronze Layer: Auto Loader Ingestion (Python)

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema for incoming JSON events
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("order_id", StringType()),
    StructField("amount", IntegerType()),
    StructField("timestamp", StringType())
])

# Auto Loader ingests raw JSON into bronze table
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schema_location")
    .schema(schema)
    .load("/mnt/raw/orders/")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoint")
    .option("mergeSchema", "true")
    .table("bronze_orders")
)

This code uses Auto Loader (the cloudFiles source) to stream JSON files from cloud storage into the bronze_orders table. Auto Loader automatically detects new files as they arrive, parses them, and appends their rows to the Delta table. The checkpointLocation tracks which files have already been processed, so restarts do not re-ingest them. The mergeSchema option lets the target table's schema widen if the stream's schema changes; note that because an explicit schema is supplied here, new source fields are only captured if Auto Loader's schema evolution is also configured.
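Because the bronze table is a Delta table, ingestion is auditable and replayable. Two quick sanity checks (assuming the bronze_orders table from the example above):

```sql
-- Count rows ingested so far
SELECT COUNT(*) AS ingested_rows FROM bronze_orders;

-- Delta transaction log: one entry per streaming micro-batch write
DESCRIBE HISTORY bronze_orders;
```

DESCRIBE HISTORY is useful on the exam and in practice: it shows when each write happened and with what operation, which is exactly the kind of layer-level auditability the Medallion pattern is meant to provide.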

Silver Layer: Cleaning and Deduplication (SQL)

CREATE OR REPLACE TABLE silver_orders AS
WITH deduplicated AS (
  SELECT
    customer_id,
    order_id,
    amount,
    CAST(timestamp AS TIMESTAMP) AS order_timestamp,
    -- keep only the most recent record per order_id;
    -- cast before ordering so comparison is chronological, not lexical
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY CAST(timestamp AS TIMESTAMP) DESC
    ) AS rn
  FROM bronze_orders
  WHERE customer_id IS NOT NULL
    AND order_id IS NOT NULL
    AND amount > 0
)
SELECT
  customer_id,
  order_id,
  amount,
  order_timestamp,
  CURRENT_TIMESTAMP() AS processed_at
FROM deduplicated
WHERE rn = 1;

This SQL transformation reads from bronze_orders and applies data quality rules. It filters out rows with null keys or invalid amounts, casts timestamp to the correct type, deduplicates using a window function (keeping the most recent record per order_id), and adds a processed_at column for audit trails. The result is a clean, conformed silver table.
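To confirm the deduplication behaved as intended, one quick check (assuming the silver_orders table above) is that order_id is now unique:

```sql
-- Should return zero rows if deduplication succeeded
SELECT order_id, COUNT(*) AS copies
FROM silver_orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```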

Gold Layer: Business Aggregation (SQL)
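The gold example is not shown here; a minimal sketch of what an aggregation over the silver table might look like (the daily-revenue metric and the gold_daily_revenue table name are illustrative, not from the source):

```sql
-- Hypothetical gold table: daily order counts and revenue per customer
CREATE OR REPLACE TABLE gold_daily_revenue AS
SELECT
  customer_id,
  DATE(order_timestamp) AS order_date,
  COUNT(order_id) AS order_count,
  SUM(amount) AS total_revenue
FROM silver_orders
GROUP BY customer_id, DATE(order_timestamp);
```

A table like this is what BI tools and dashboards would query directly, keeping heavy scans off the silver layer.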