
Understanding dbt Docs: Auto-Generated Documentation for Your Data Warehouse

Michael Oswald
8 min read
#dbt#documentation#data-engineering#sql#analytics


If you've ever inherited a data warehouse with dozens of tables and no documentation, you know the pain. Which table has the customer data? What does this column mean? Where does this data come from?

This is where dbt docs shines. It automatically generates beautiful, interactive documentation for your entire data warehouse - and keeps it in sync with your code.

What is dbt?

dbt (data build tool) is a transformation framework that helps data teams build reliable data pipelines using SQL. Instead of writing complex ETL scripts, you write simple SELECT statements, and dbt handles the rest - building tables, managing dependencies, and testing data quality.

What Are dbt Docs?

dbt docs is the documentation layer that comes with dbt. It generates an interactive website that shows:

  • All your data models with descriptions for tables and columns
  • Data lineage graphs showing how data flows through your warehouse
  • The SQL code that creates each model
  • Data quality tests applied to your data
  • Search functionality to quickly find models and columns

The best part? It's auto-generated from your code and comments. No separate documentation tool needed.

Getting Started: A Simple Example

I've created a simple dbt project using the classic Jaffle Shop dataset (a fictional coffee shop) to demonstrate dbt docs. You can follow along using this GitHub repository.

Project Setup

The project uses:

  • PostgreSQL running in Docker as our data warehouse
  • dbt to transform raw data into analytics-ready tables
  • Sample data from a coffee shop (customers, orders, payments)

Quick Start Commands

Here's how to get the project running:

1. Start the database:

docker-compose up -d

This spins up a local Postgres database with all the connection details pre-configured.

2. Install dbt:

pip install dbt-postgres

3. Load sample data:

dbt seed --profiles-dir .

This command loads CSV files from the seeds/ directory into your database as tables. Think of seeds as a way to get starter data into your warehouse.
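A seed is just a CSV file checked into your project. As an illustration (these columns and values are made up, not copied from the repo), a file like seeds/raw_customers.csv might look like:

```csv
id,first_name,last_name
1,Ada,Lovelace
2,Grace,Hopper
3,Alan,Turing
```

Running `dbt seed` turns each such file into a database table named after the file (here, raw_customers).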

4. Build your models:

dbt run --profiles-dir .

This executes all your SQL transformations in the correct order. dbt reads your models, figures out dependencies, and creates views and tables in your database.
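To make that concrete, here's a minimal sketch of what a staging model in a project like this could look like (the column names are illustrative, not copied from the repo):

```sql
-- models/staging/stg_customers.sql (illustrative sketch)
-- A staging model: light cleanup and renaming of the raw seed data.
select
    id as customer_id,
    first_name,
    last_name
from {{ ref('raw_customers') }}
```

Because each model is just a SELECT statement, dbt handles the CREATE TABLE / CREATE VIEW boilerplate for you based on your materialization settings.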

5. Test your data:

dbt test --profiles-dir .

This runs data quality tests to ensure your data meets expectations (no nulls in key columns, unique IDs, valid status values, etc.).
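Those tests live alongside your model descriptions in a schema.yml file. A minimal sketch (the model and column names here are assumed for illustration):

```yaml
# models/staging/schema.yml (illustrative sketch)
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        description: Primary key for customers
        tests:
          - unique
          - not_null
```

`dbt test` reads this file, generates a query for each test, and fails the run if any query returns offending rows.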

6. Generate and serve documentation:

dbt docs generate --profiles-dir .
dbt docs serve --profiles-dir .

This creates the interactive documentation site and opens it in your browser at http://localhost:8080.

Understanding Data Lineage

One of the most powerful features of dbt docs is data lineage visualization.

Data lineage is the visual map showing where data comes from, how it's transformed, and where it flows to.

In our Jaffle Shop example, the lineage looks like this:

raw_customers → stg_customers → customer_orders
raw_orders    → stg_orders    → customer_orders, order_summary
raw_payments  → stg_payments  → customer_orders, order_summary

This graph answers critical questions:

  • "Which source tables does customer_orders depend on?" (Answer: all three staging models)
  • "If I change stg_orders, what's impacted?" (Answer: both customer_orders and order_summary)
  • "Where does our payment data flow?" (Answer: from raw → staging → marts)

Why Lineage Matters

Imagine a colleague asks: "Can I drop this old table? Is anything using it?"

Without lineage: You'd have to manually search through hundreds of SQL files, hoping you don't miss anything.

With dbt lineage: Click the table in the graph. Instantly see every downstream model that depends on it.

Exploring the dbt Docs Site

When you run dbt docs serve, you get an interactive website with several key sections:

1. The Project Overview

Shows your entire project structure with counts of models, tests, and sources. You can see at a glance how your project is organized.

2. The Lineage Graph (DAG)

The centerpiece of dbt docs. This interactive graph shows:

  • Nodes representing each model (table or view)
  • Edges showing dependencies between models
  • Color coding for different layers (staging vs. marts)

You can:

  • Click any node to see its documentation
  • Filter to show only models you care about
  • Trace the path from raw data to final analytics tables

3. Model Documentation Pages

Click any model to see:

Description: Business context about what this model represents

customer_orders: This mart table aggregates customer order information
to provide a comprehensive view of customer purchase behavior including
lifetime value, order frequency, and key dates.

Columns: Every column with its data type and description

• customer_id (integer) - Primary key for customers
  Tests: ✓ unique, ✓ not_null
• lifetime_value (numeric) - Total amount spent by the customer across all orders
  Tests: ✓ not_null

Code: The actual SQL that creates this model, with two views:

  • Source code (what you wrote)
  • Compiled code (with {{ ref() }} replaced with actual table names)

Tests: Which data quality tests are applied and their status

4. Search

Type any table name or column name to quickly jump to its documentation. This is incredibly useful in large projects with hundreds of models.

How dbt Knows About Lineage

You might wonder: "How does dbt automatically know which models depend on each other?"

The magic is in the {{ ref() }} function. Instead of hardcoding table names in your SQL, you reference other models:

-- ❌ Don't do this
select * from public.stg_customers

-- ✓ Do this
select * from {{ ref('stg_customers') }}

When dbt sees {{ ref('stg_customers') }}, it:

  1. Knows this model depends on stg_customers
  2. Runs stg_customers first before running this model
  3. Draws a line in the lineage graph connecting them
  4. Replaces {{ ref() }} with the actual database table name when it compiles your SQL

This simple pattern gives you automatic dependency management and beautiful lineage graphs.
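For instance, the compiled version of the ref example above might look like this (the database and schema names are illustrative):

```sql
-- Source code (what you wrote)
select * from {{ ref('stg_customers') }}

-- Compiled code (what dbt actually sends to the database)
select * from "jaffle_shop"."public"."stg_customers"
```

This compiled version is exactly what appears under the "Compiled code" tab on each model's documentation page.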

Real-World Benefits

For Data Engineers

  • Onboard new team members faster: "Here's the docs site, explore!"
  • Impact analysis: See what breaks before making changes
  • Code reviews: Reviewers can see exactly what your PR changes in the lineage

For Analytics Engineers

  • Self-service documentation: Answer your own questions about data
  • Trust in data: See what tests are passing/failing
  • Understand transformations: Click through the lineage to see how metrics are calculated

For Data Analysts

  • Find the right table: Search for what you need
  • Understand business logic: Read descriptions in plain English
  • See data freshness: Know when models were last updated

Important Note: Docs vs. Data

One clarification: dbt docs shows your data structure, not your actual data.

It shows:

  • Table and column names
  • Descriptions and documentation
  • SQL transformation code
  • Data types
  • Lineage relationships

It does NOT show:

  • Actual rows of data
  • Query results
  • Data previews

For browsing actual data, you still need a SQL editor like DBeaver, pgAdmin, or TablePlus. Think of dbt docs as the blueprint for your data warehouse, while SQL editors let you explore the actual contents.
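For example, to eyeball the output of the mart model from this project, you'd run something like this in your SQL editor rather than in dbt docs:

```sql
-- Run in DBeaver/pgAdmin/TablePlus, not in dbt docs
select customer_id, lifetime_value
from customer_orders
order by lifetime_value desc
limit 10;
```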

Best Practices for Great Documentation

1. Write Descriptions in schema.yml

Don't just list columns - explain what they mean:

- name: lifetime_value
  description: >
    Total amount spent by the customer across all orders.
    This is calculated by summing all payment amounts.

2. Document Your Business Logic

Explain the "why" behind transformations:

- name: customer_orders
  description: >
    This mart table aggregates customer order information to provide a
    comprehensive view of customer purchase behavior. Used by the marketing
    team for cohort analysis and customer segmentation.

3. Add Tests

Tests serve double duty - they validate your data AND document your expectations:

- name: status
  tests:
    - accepted_values:
        values: ['completed', 'returned', 'placed', 'shipped']

This tells readers: "These are the only valid status values."

4. Use Consistent Naming

Follow a pattern like:

  • raw_* for source data
  • stg_* for staging models
  • Clear business names for marts (customer_orders, not table_1)

This makes your lineage graph much easier to understand.

Try It Yourself

Clone the dbt-practice repository and run through the setup commands. Within minutes, you'll have a fully documented data warehouse with interactive lineage graphs.

The project includes:

  • A simple two-layer architecture (staging → marts)
  • Sample CSV data (no need to connect to a real data source)
  • Pre-written documentation in schema.yml files
  • Data quality tests
  • A local Postgres database via Docker

Everything you need to see dbt docs in action.

Conclusion

dbt docs transforms documentation from a chore into a natural part of your workflow. By documenting your models alongside your code, the documentation stays in sync and actually gets used.

The lineage graph alone is worth the price of admission (which is free, by the way). Being able to visualize data flow, understand dependencies, and trace transformations is invaluable for any data team.

If you work with data warehouses and aren't using dbt yet, give it a try. Your future self - and your teammates - will thank you.


Have questions about dbt docs or want to share your documentation setup? Reach out on LinkedIn!

Enjoyed this post? Get more like it.

Subscribe to get my latest posts about data engineering, AI, and the modern data stack delivered to your inbox.