Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables) provide a declarative way to build data pipelines in Databricks. Instead of writing imperative code that describes how to transform data step by step, you declare which tables you want and which transformations produce them. The pipeline engine handles orchestration, fault tolerance, and incremental processing automatically, which makes it easier to build reliable, maintainable pipelines that scale.
A Lakeflow pipeline consists of one or more notebooks that contain SQL or Python code defining tables. Each notebook contains table definitions using statements like CREATE OR REFRESH STREAMING TABLE or CREATE OR REFRESH MATERIALIZED VIEW. The pipeline orchestrates these notebooks, manages dependencies between tables, and handles data quality checks. Pipelines integrate directly with Unity Catalog, storing results in a specified catalog and schema with full governance and lineage tracking.
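As a minimal sketch of what such a notebook looks like (the paths, table names, and columns here are illustrative, not from the source), a pipeline might define a streaming table fed by Auto Loader and a materialized view aggregated from it:

```sql
-- Hypothetical pipeline notebook; names and paths are illustrative
CREATE OR REFRESH STREAMING TABLE events_raw AS
SELECT * FROM STREAM read_files('/Volumes/main/raw/events', format => 'json');

CREATE OR REFRESH MATERIALIZED VIEW daily_event_counts AS
SELECT date(event_time) AS event_date, count(*) AS events
FROM events_raw
GROUP BY date(event_time);
```

The engine infers that daily_event_counts depends on events_raw from the FROM clause; no explicit ordering is configured.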
For the exam, you need to understand how to configure pipelines, implement change data capture with APPLY CHANGES, choose between SCD Type 1 and Type 2, understand refresh modes, monitor pipelines through event logs, and handle errors gracefully. You should also be able to work with multi-notebook pipelines and understand how they integrate with Unity Catalog.
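For monitoring, pipeline events for a Unity Catalog-managed table can be queried with the event_log table-valued function. A sketch (the catalog, schema, and table names are placeholders):

```sql
-- Inspect recent flow progress events for a pipeline-managed table
-- (main.sales.orders_bronze is an illustrative name)
SELECT timestamp, event_type, message
FROM event_log(TABLE(main.sales.orders_bronze))
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC;
```

Filtering on event_type values such as 'flow_progress' or 'user_action' is a common way to narrow the log to what a run actually did.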
APPLY CHANGES with SCD Type 1 (Overwrite)
-- In a Lakeflow pipeline notebook
-- First declare the target streaming table, then apply CDC changes into it.
-- APPLY CHANGES is its own statement, not part of the CREATE ... AS query,
-- and it replaces MERGE-style WHEN MATCHED clauses with APPLY AS DELETE WHEN.
CREATE OR REFRESH STREAMING TABLE customers;

APPLY CHANGES INTO live.customers
FROM stream(raw_customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY event_timestamp
COLUMNS * EXCEPT (operation)
STORED AS SCD TYPE 1;
APPLY CHANGES with SCD Type 2 (Track History)
-- In a Lakeflow pipeline notebook
-- Same pattern as SCD Type 1; STORED AS SCD TYPE 2 belongs at the end of
-- the APPLY CHANGES statement, not in the CREATE TABLE definition.
CREATE OR REFRESH STREAMING TABLE customers_history;

APPLY CHANGES INTO live.customers_history
FROM stream(raw_customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY event_timestamp
COLUMNS * EXCEPT (operation)
STORED AS SCD TYPE 2;
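With SCD Type 2, the engine adds generated __START_AT and __END_AT columns to the target table to track each row's validity window; the current version of a row has a NULL __END_AT. That makes it easy to separate current state from history, for example:

```sql
-- Current version of each customer (table name from the example above)
SELECT * FROM customers_history
WHERE __END_AT IS NULL;
```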
Multi-Notebook Pipeline: Bronze Layer
-- Notebook: 01_bronze_ingest.sql
-- Ingests raw data from cloud storage
-- Inside a pipeline, Auto Loader schema and checkpoint locations are
-- managed automatically, so cloudFiles.schemaLocation is not needed.
CREATE OR REFRESH STREAMING TABLE orders_bronze AS
SELECT
  *,
  _metadata.file_path AS source_file,
  _metadata.file_modification_time AS file_modified_at
FROM cloud_files('/mnt/raw/orders', 'parquet')
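A second notebook in the same pipeline can refine the bronze table; the pipeline resolves the cross-notebook dependency automatically. A sketch, where the notebook name, silver columns, and expectation are illustrative assumptions rather than part of the original example:

```sql
-- Notebook: 02_silver_orders.sql (hypothetical)
-- Reads the bronze table defined in 01_bronze_ingest.sql and applies a
-- data quality expectation that drops rows with a missing order_id.
CREATE OR REFRESH STREAMING TABLE orders_silver (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
) AS
SELECT
  order_id,
  customer_id,
  CAST(amount AS DECIMAL(10, 2)) AS amount
FROM STREAM(live.orders_bronze);
```

Because orders_silver reads FROM STREAM(live.orders_bronze), the pipeline runs the bronze ingest first regardless of which notebook the definition lives in.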