data-engineering

dbt Tutorial: Beginner's Guide for Data Engineers

Michael Oswald
18 min readBy Michael Oswald
#dbt#data-engineering#sql#tutorial#postgres

Intro

If you're a software engineer moving into data engineering, dbt will feel like coming home. It brings the engineering practices you already know like version control, testing, modularity, and CI/CD to the world of SQL transformations.

This comprehensive dbt tutorial will take you from zero to building production-ready data pipelines.

Let's Start!

Table of Contents

  1. What is dbt?
  2. Where does dbt fit in the stack?
  3. Super Simple Example of a dbt Use Case
  4. Hands on dbt tutorial: NBA Analytics Example
  5. Architecture Overview
  6. Key Takeaways
  7. Troubleshooting
  8. Resources

What is dbt?

dbt (data build tool) handles the "T" in "ETL" by transforming data with SQL.

Analogy: 🌲→🪑 Raw data is the tree; dbt is the milling, sanding, and craftsmanship that turns it into polished furniture (gold layer data).

dbt does one thing really well: It transforms raw data in your warehouse into clean, tested, documented datasets that end users can trust.

Where does dbt fit in the stack?

As you can see in the diagram below, dbt acts on the data already loaded into your warehouse (via tools like Fivetran or Airbyte) and transforms it into analytics-ready tables for BI tools like Looker, Tableau, or Mode. dbt high-level architecture showing where dbt fits into an ETL process

One thing I like about dbt is that it doesn't try to be everything to everyone. It focuses on doing SQL transformations really well, leaving data extraction and loading to other tools.

Super Simple Example of a dbt Use Case

Let's say you want to clean and standardize some customer raw customer data you have in your warehouse.

  1. You have raw data in your warehouse (e.g., a raw.customers table).
  2. You create a dbt model (models/staging/stg_customers.sql) to clean and standardize that data (remove duplicates, rename/remove columns)
  3. Define tests to ensure data quality (no null emails, unique customer IDs)
  4. You document the model with descriptions for each column
  5. You run dbt run, which creates a new table/view (stg_customers) in your warehouse with the cleaned data
  6. You run dbt test to validate the data quality
  7. You generate documentation with dbt docs generate and serve it
  8. Your data analysts now query stg_customers for reliable customer data

⚠️ Now you could totally do this with raw python or SQL scripts, but dbt adds structure, testing, documentation, and version control that make it manageable as your data grows.

Next, let's build a real-world dbt project from scratch!


Hands on dbt tutorial: NBA Analytics Example

We'll create an NBA analytics project with dbt and Postgres.

Since I'm a Nuggets fan here in Colorado, we'll focus on answering this question:

"Which Nuggets players are the most consistent scorers?" 🤔

Building an NBA Player Stats Pipeline with dbt: A Complete Tutorial

💡 Want the complete code? Get the full working project on GitHub: dbt-nuggets-nba

What we'll build:

  • A Postgres data warehouse (Docker)
  • Python ingestion scripts using the NBA API
  • A dbt project with staging → intermediate → marts layers
  • Player-level fact table for consistency analysis

Time commitment: ~1 hour

Prerequisites

Install these before starting:

  • Python 3.11+
  • Docker Desktop
  • Git
  • SQL client Optional - (DBeaver, pgAdmin, or DataGrip)

Part 1: Environment Setup

We'll set up an isolated Python environment with all the tools needed for this tutorial.

1.1 Create Project Directory

Create a dedicated folder to keep all tutorial files organized.

mkdir -p nuggets_dbt_tutorial
cd nuggets_dbt_tutorial

1.2 Create Python Virtual Environment

Isolate Python packages in a virtual environment to avoid conflicts with your system Python.

# Create virtual environment
python3.11 -m venv dbt-env

# Activate it
source dbt-env/bin/activate  # macOS/Linux
# OR
dbt-env\Scripts\activate     # Windows

1.3 Install Python Dependencies

Install dbt, NBA API client, and database tools needed for the pipeline.

pip install \
  dbt-postgres \
  nba_api \
  pandas \
  sqlalchemy \
  psycopg2-binary

Expected output:

Successfully installed dbt-core-1.x.x dbt-postgres-1.x.x ...

Part 2: Spin Up Postgres Database

We'll launch a Postgres database in Docker to serve as our data warehouse for storing NBA stats.

2.1 Start Postgres in Docker

Run the following command to start your Postgres 16 container with user nuggets, password nuggets, and database nuggets_db.

docker run --name nuggets-postgres \
  -e POSTGRES_USER=nuggets \
  -e POSTGRES_PASSWORD=nuggets \
  -e POSTGRES_DB=nuggets_db \
  -p 5432:5432 \
  -d postgres:16

What this does:

  • Creates a Postgres 16 container named nuggets-postgres
  • Sets up user nuggets with password nuggets
  • Creates database nuggets_db
  • Maps port 5432 (container) → 5432 (localhost)
  • Runs in detached mode (-d)

2.2 Verify Postgres is Running

Confirm the Postgres container is running and accessible on port 5432.

docker ps

Expected output:

CONTAINER ID   IMAGE         STATUS         PORTS                    NAMES
abc123...      postgres:16   Up 10 seconds  0.0.0.0:5432->5432/tcp   nuggets-postgres

2.3 Test Connection

Verify we can connect to the database using the credentials we set.

docker exec -it nuggets-postgres psql -U nuggets -d nuggets_db

You should see:

psql (16.x)
nuggets_db=#

Type \q to exit.


Part 3: Ingest NBA Data

We'll fetch real NBA game and player data from the NBA API and load it into our Postgres warehouse as raw tables.

We will use the nba_api Python library to fetch Denver Nuggets game logs and player box scores for the 2023-24 season. We'll store the raw data in two Postgres tables:

  • public.raw_nba_games
  • public.raw_nba_boxscores

3.1 Create Scripts Directory

Organize ingestion scripts in a dedicated folder for better project structure.

mkdir -p nuggets_nba/scripts
cd nuggets_nba

3.2 Create scripts/ingest_games.py

Build a script to fetch all Nuggets games for the 2023-24 season from the NBA API.

Create this file with the following content:

from nba_api.stats.endpoints import TeamGameLog
import pandas as pd
from sqlalchemy import create_engine

# Configuration
NUGGETS_TEAM_ID = "1610612743"
SEASON = "2023-24"
DB_URL = "postgresql+psycopg2://nuggets:nuggets@localhost:5432/nuggets_db"

# Fetch Nuggets game logs
print(f"Fetching Nuggets games for {SEASON} season...")
game_log = TeamGameLog(team_id=NUGGETS_TEAM_ID, season=SEASON)
df = game_log.get_data_frames()[0]

# Clean column names
df_clean = df.rename(columns={
    'Game_ID': 'game_id',
    'GAME_DATE': 'game_date',
    'MATCHUP': 'matchup',
    'WL': 'win_loss',
    'PTS': 'team_points'
})

# Add season column
df_clean['season'] = SEASON

# Select relevant columns
df_clean = df_clean[[
    'game_id', 'season', 'game_date',
    'matchup', 'win_loss', 'team_points'
]]

# Write to Postgres
print(f"Writing {len(df_clean)} games to database...")
engine = create_engine(DB_URL)
df_clean.to_sql(
    'raw_nba_games',
    engine,
    schema='public',
    if_exists='replace',
    index=False
)

print("✅ Done! Games loaded into public.raw_nba_games")
print(f"Sample: {df_clean.head(3).to_dict('records')}")

3.3 Run Game Ingestion

Execute the script to populate the raw_nba_games table with 82 games.

python scripts/ingest_games.py

Expected output:

Fetching Nuggets games for 2023-24 season...
Writing 82 games to database...
✅ Done! Games loaded into public.raw_nba_games

Runtime: ~5-10 seconds


3.4 Create scripts/ingest_boxscores.py

Build a script to fetch detailed player statistics (box scores) for each game.

from nba_api.stats.endpoints import BoxScoreTraditionalV3
import pandas as pd
from sqlalchemy import create_engine
import time

# Configuration
DB_URL = "postgresql+psycopg2://nuggets:nuggets@localhost:5432/nuggets_db"

# Read games from database
print("Reading game IDs from database...")
engine = create_engine(DB_URL)
games_df = pd.read_sql("SELECT game_id FROM public.raw_nba_games", engine)
game_ids = games_df['game_id'].tolist()

print(f"Found {len(game_ids)} games to process...")

# Fetch box scores for each game
all_boxscores = []
for i, game_id in enumerate(game_ids, 1):
    print(f"[{i}/{len(game_ids)}] Fetching game {game_id}...")

    try:
        boxscore = BoxScoreTraditionalV3(game_id=game_id)
        df = boxscore.get_data_frames()[0]  # Player stats
        df['game_id'] = game_id
        all_boxscores.append(df)

        # Be nice to the API
        time.sleep(0.6)

    except Exception as e:
        print(f"   ⚠️  Error: {e}")
        continue

# Combine all box scores
print("\nCombining all box scores...")
df_all = pd.concat(all_boxscores, ignore_index=True)

# Rename columns to snake_case
column_mapping = {
    'gameId': 'gameid',
    'teamId': 'teamid',
    'teamCity': 'teamcity',
    'teamName': 'teamname',
    'teamTricode': 'teamtricode',
    'teamSlug': 'teamslug',
    'personId': 'personid',
    'firstName': 'firstname',
    'familyName': 'familyname',
    'nameI': 'namei',
    'playerSlug': 'playerslug',
    'position': 'position',
    'comment': 'comment',
    'jerseyNum': 'jerseynum',
    'minutes': 'minutes',
    'fieldGoalsMade': 'fieldgoalsmade',
    'fieldGoalsAttempted': 'fieldgoalsattempted',
    'fieldGoalsPercentage': 'fieldgoalspercentage',
    'threePointersMade': 'threepointersmade',
    'threePointersAttempted': 'threepointersattempted',
    'threePointersPercentage': 'threepointerspercentage',
    'freeThrowsMade': 'freethrowsmade',
    'freeThrowsAttempted': 'freethrowsattempted',
    'freeThrowsPercentage': 'freethrowspercentage',
    'reboundsOffensive': 'reboundsoffensive',
    'reboundsDefensive': 'reboundsdefensive',
    'reboundsTotal': 'reboundstotal',
    'assists': 'assists',
    'steals': 'steals',
    'blocks': 'blocks',
    'turnovers': 'turnovers',
    'foulsPersonal': 'foulspersonal',
    'points': 'points',
    'plusMinusPoints': 'plusminuspoints'
}

df_all = df_all.rename(columns=column_mapping)

# Write to database
print(f"Writing {len(df_all)} player-game records to database...")
df_all.to_sql(
    'raw_nba_boxscores',
    engine,
    schema='public',
    if_exists='replace',
    index=False
)

print("✅ Done! Box scores loaded into public.raw_nba_boxscores")
print(f"Total records: {len(df_all)}")

3.5 Run Box Score Ingestion

Run the script to load player-level stats into the raw_nba_boxscores table (this takes ~1 minute).

python scripts/ingest_boxscores.py

Expected output:

Reading game IDs from database...
Found 82 games to process...
[1/82] Fetching game 0022300001...
[2/82] Fetching game 0022300015...
...
✅ Done! Box scores loaded into public.raw_nba_boxscores
Total records: ~2,050

Runtime: ~1-2 minutes (0.6s per game × 82 games)

Pause here! Now you should have raw data in Postgres:

  • public.raw_nba_games (82 rows)
  • public.raw_nba_boxscores (~2,050 rows)

Let's double check by connecting with a SQL client or using psql.

SELECT count(*) FROM public.raw_nba_games;
SELECT count(*) FROM public.raw_nba_boxscores;

Expected results:

 count

-------
    82
(1 row)

 count
-------
  2050
(1 row)

Part 4: Initialize dbt Project

We'll create a dbt project and configure it to connect to our Postgres database.

4.1 Create dbt Project

Generate a new dbt project with the standard folder structure for models, tests, and docs.

dbt init nuggets_nba

Prompts:

  • Database: postgres
  • Host: localhost
  • Port: 5432
  • User: nuggets
  • Password: nuggets
  • Database: nuggets_db
  • Schema: analytics
  • Threads: 4

OR skip prompts and configure manually (next step).

4.2 Configure ~/.dbt/profiles.yml

Tell dbt how to connect to our Postgres database by setting up connection credentials.

Create or edit ~/.dbt/profiles.yml:

nuggets_nba:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: nuggets
      password: nuggets
      dbname: nuggets_db
      schema: analytics
      threads: 4

4.3 Test dbt Connection

Verify dbt can successfully communicate with Postgres before building models.

cd nuggets_nba
dbt debug

Expected output:

Running with dbt=1.10.x
Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]
Connection:
  host: localhost
  port: 5432
  user: nuggets
  database: nuggets_db
  schema: analytics
  Connection test: [OK connection ok]

All checks passed!

Part 5: Build dbt Models

We'll transform the raw NBA data through multiple layers (staging → intermediate → marts) to calculate player scoring consistency.

5.1 Define Sources

Define source tables so dbt knows where to find our raw data and can track freshness.

Create models/staging/nba/_sources.yml:

version: 2

sources:
  - name: nba_raw
    description: "Raw NBA data ingested from nba_api"
    schema: public
    tables:
      - name: raw_nba_games
        description: "Nuggets team game logs from NBA API"
        columns:
          - name: game_id
            description: "Unique game identifier"

      - name: raw_nba_boxscores
        description: "Player box scores for each Nuggets game"
        columns:
          - name: gameid
            description: "Game identifier (links to raw_nba_games)"
          - name: personid
            description: "Unique player identifier"

Test sources:

dbt run-operation source --help  # Just to verify syntax

5.2 Create Staging Model: stg_nba_games

Clean and standardize game data (rename columns, convert dates, add metadata).

Create models/staging/nba/stg_nba_games.sql:

{{ config(
    materialized = 'view',
    tags = ['staging', 'nba']
) }}

select
    game_id::text                    as game_id,
    season::text                     as season,
    to_date(game_date, 'MM/DD/YYYY') as game_date,
    matchup::text                    as matchup,
    win_loss::text                   as win_loss,
    team_points::int                 as team_points
from {{ source('nba_raw', 'raw_nba_games') }}

Build it:

dbt run --select stg_nba_games

Expected output:

1 of 1 START sql view model analytics.stg_nba_games ........ [RUN]
1 of 1 OK created sql view model analytics.stg_nba_games ... [CREATE VIEW in 0.05s]

5.3 Create Staging Model: stg_nba_boxscores

Clean and standardize player box score data with proper data types and column names.

Create models/staging/nba/stg_nba_boxscores.sql:

{{ config(
    materialized = 'view',
    tags = ['staging', 'nba']
) }}

select
    gameid::text        as game_id,
    gameid::text        as nba_game_id,
    teamid::bigint      as team_id,
    teamcity::text      as team_city,
    teamname::text      as team_name,
    teamtricode::text   as team_tricode,
    teamslug::text      as team_slug,

    personid::bigint    as player_id,
    firstname::text     as first_name,
    familyname::text    as family_name,
    namei::text         as name_i,
    playerslug::text    as player_slug,
    position::text      as position,
    comment::text       as comment,
    jerseynum::text     as jersey_num,

    minutes::text       as minutes_raw,

    fieldgoalsmade::int               as fgm,
    fieldgoalsattempted::int          as fga,
    fieldgoalspercentage::float       as fg_pct,
    threepointersmade::int            as tpm,
    threepointersattempted::int       as tpa,
    threepointerspercentage::float    as tp_pct,
    freethrowsmade::int               as ftm,
    freethrowsattempted::int          as fta,
    freethrowspercentage::float       as ft_pct,

    reboundsoffensive::int  as oreb,
    reboundsdefensive::int  as dreb,
    reboundstotal::int      as treb,
    assists::int            as ast,
    steals::int             as stl,
    blocks::int             as blk,
    turnovers::int          as tov,
    foulspersonal::int      as pf,
    points::int             as pts,
    plusminuspoints::int    as plus_minus

from {{ source('nba_raw', 'raw_nba_boxscores') }}

Build it:

dbt run --select stg_nba_boxscores

5.4 Create Intermediate Model: int_nuggets_player_game

Filter to only Nuggets players and join games with box scores to create player-game records.

Create models/intermediate/nba/int_nuggets_player_game.sql:

{{ config(
    materialized = 'view',
    tags = ['intermediate', 'nba', 'nuggets']
) }}

with box as (
    select *
    from {{ ref('stg_nba_boxscores') }}
),

-- Keep only Nuggets players
nuggets_only as (
    select
        game_id,
        nba_game_id,
        team_id,
        team_city,
        team_name,
        team_tricode,
        team_slug,
        player_id,
        first_name,
        family_name,
        name_i,
        player_slug,
        position,
        comment,
        jersey_num,
        minutes_raw,

        -- convert 'MM:SS' to numeric minutes (e.g. 32.5)
        case
            when minutes_raw is null or minutes_raw = '' or minutes_raw = '00:00' then 0
            else
                split_part(minutes_raw, ':', 1)::int
                + split_part(minutes_raw, ':', 2)::int / 60.0
        end as minutes,

        fgm,
        fga,
        fg_pct,
        tpm,
        tpa,
        tp_pct,
        ftm,
        fta,
        ft_pct,
        oreb,
        dreb,
        treb,
        ast,
        stl,
        blk,
        tov,
        pf,
        pts,
        plus_minus
    from box
    where team_tricode = 'DEN'  -- only Denver Nuggets
)

select * from nuggets_only

Build it:

dbt run --select int_nuggets_player_game

5.5 Create Mart Model: fct_player_game

Build the final analytics table with scoring consistency metrics (average, standard deviation, coefficient of variation).

Create models/marts/nba_nuggets/fct_player_game.sql:

{{ config(
    materialized = 'table',
    unique_key = 'player_game_key',
    tags = ['mart', 'fact', 'nba', 'nuggets']
) }}

with base as (
    select
        game_id,
        player_id,
        concat(game_id, '_', player_id) as player_game_key,

        -- player identity
        first_name,
        family_name,

        -- stats
        minutes,
        pts,
        ast,
        treb as rebounds,
        fgm,
        fga,
        tpm,
        tpa,
        ftm,
        fta,
        plus_minus
    from {{ ref('int_nuggets_player_game') }}
),

with_efficiency as (
    select
        *,
        case when fga > 0
            then (fgm::float / fga)
            else null
        end as fg_pct,

        case when tpa > 0
            then (tpm::float / tpa)
            else null
        end as three_pct,

        case when fta > 0
            then (ftm::float / fta)
            else null
        end as ft_pct
    from base
)

select * from with_efficiency

Build the entire lineage:

dbt run --select +fct_player_game

Expected output:

1 of 3 START sql view model analytics.stg_nba_boxscores ........... [RUN]
1 of 3 OK created sql view model analytics.stg_nba_boxscores ...... [CREATE VIEW in 0.06s]
2 of 3 START sql view model analytics.int_nuggets_player_game ..... [RUN]
2 of 3 OK created sql view model analytics.int_nuggets_player_game  [CREATE VIEW in 0.02s]
3 of 3 START sql table model analytics.fct_player_game ............ [RUN]
3 of 3 OK created sql table model analytics.fct_player_game ....... [SELECT 1847 in 0.15s]

🎉 You now have a player-game fact table!


5.6 Add Tests

Add data quality tests to catch issues like null values, duplicates, or invalid data.

Create models/marts/nba_nuggets/schema.yml:

version: 2

models:
  - name: fct_player_game
    description: "Player-level fact table with one row per player per game"
    columns:
      - name: player_game_key
        description: "Unique key: game_id + player_id"
        tests:
          - not_null
          - unique

      - name: game_id
        tests:
          - not_null

      - name: player_id
        tests:
          - not_null

      - name: pts
        description: "Points scored"
        tests:
          - not_null

Run tests:

dbt test --select fct_player_game

Expected output:

1 of 4 START test not_null_fct_player_game_player_game_key .... [RUN]
1 of 4 PASS not_null_fct_player_game_player_game_key .......... [PASS in 0.05s]
2 of 4 START test unique_fct_player_game_player_game_key ...... [RUN]
2 of 4 PASS unique_fct_player_game_player_game_key ............ [PASS in 0.06s]
...

Part 6: Query Your Data!

Now that our dbt pipeline has transformed the data, let's query it to answer our original question about scoring consistency.

6.1 Connect SQL Client to Postgres

Connect a SQL client to Postgres so you can explore the transformed tables visually.

Connection details:

  • Host: localhost
  • Port: 5432
  • Database: nuggets_db
  • User: nuggets
  • Password: nuggets
  • Schema: analytics

6.2 Basic Query: Player Averages

Query the fact table to see each Nuggets player's average points per game.

select
    first_name || ' ' || family_name as player_name,
    count(*) as games_played,
    round(avg(pts)::numeric, 1) as avg_points,
    round(avg(ast)::numeric, 1) as avg_assists,
    round(avg(rebounds)::numeric, 1) as avg_rebounds,
    round(avg(minutes)::numeric, 1) as avg_minutes
from analytics.fct_player_game
group by 1
having count(*) >= 10
order by avg_points desc;

Sample results:

player_name         | games | avg_pts | avg_ast | avg_reb | avg_min
--------------------|-------|---------|---------|---------|--------
Nikola Jokić        | 79    | 26.4    | 9.0     | 12.4    | 34.6
Jamal Murray        | 59    | 21.2    | 6.5     | 4.1     | 31.5
Michael Porter Jr.  | 82    | 16.5    | 1.5     | 6.9     | 31.3
Aaron Gordon        | 73    | 13.9    | 3.5     | 6.5     | 31.5

6.3 Consistency Analysis: Coefficient of Variation

Calculate and rank players by scoring consistency using coefficient of variation (lower = more consistent).

Question: Who is the most consistent scorer?

Why Coefficient of Variation (CV)?

When comparing consistency between players with different scoring averages, standard deviation alone can be misleading. A player averaging 25 points with a standard deviation of 8 points is more consistent than a player averaging 10 points with the same 8-point standard deviation.

Coefficient of Variation (CV) solves this by expressing standard deviation as a percentage of the mean:

CV = (Standard Deviation / Mean) × 100

Why it works for this use case:

  • Normalized comparison: Compare variability across players with different scoring levels
  • Scale-independent: A 20% CV means the same level of consistency whether you score 10 or 30 points per game
  • Lower is better: Lower CV = more predictable/consistent performance

🔥 Next run this query to find the most consistent Nuggets scorers:

select
    first_name || ' ' || family_name as player_name,
    count(*) as games_played,
    round(avg(pts)::numeric, 1) as avg_points,
    round(stddev(pts)::numeric, 2) as stddev_points,
    round(((stddev(pts) / nullif(avg(pts), 0)) * 100)::numeric, 1) as cv_points
from analytics.fct_player_game
group by 1
having
    count(*) >= 20
    and avg(minutes) >= 15
    and avg(pts) >= 5
order by cv_points asc;

Results:

player_name         | games | avg_pts | stddev | cv_points
--------------------|-------|---------|--------|----------
Nikola Jokić        | 79    | 26.4    | 8.16   | 30.9%    ← Most consistent!
Jamal Murray        | 59    | 21.2    | 7.83   | 36.9%
Aaron Gordon        | 73    | 13.9    | 5.27   | 38.0%
Michael Porter Jr.  | 82    | 16.5    | 6.97   | 42.2%

Interpretation:

  • Lower CV = more consistent
  • Jokić has a 30.9% coefficient of variation - remarkably consistent for 26.4 PPG
  • Murray and Gordon are also very steady
  • MPJ is more "feast or famine" at 42.2% CV

Architecture Overview

Below is a high-level architecture diagram of the dbt pipeline we built in this tutorial.

dbt high-level architecture of what we built here


Key Takeaways

Hope you enjoyed this dbt tutorial! Here are the main takeaways:

Key benefits of dbt

  • Modularity: Each model has a single responsibility
  • Lineage: Dependencies flow cleanly (staging → intermediate → marts)
  • Testing: Built-in data quality checks
  • Documentation: Self-documenting with schema.yml files
  • Incremental development: Build and test one layer at a time

What we learned about the Nuggets' 2023-24 season

🏀 Nikola Jokić is absurdly consistent - lowest variance among starters

Top 4 players (Jokić, Murray, Gordon, MPJ) all have <43% CV

🏀 Bench players have higher variance (50-72% CV) - expected for role players

What You Learned

  • Real-world data pipeline architecture
  • Python → Database ingestion
  • dbt's medallion architecture (bronze/silver/gold or raw/staging/marts)
  • Statistical analysis of sports data
  • How to answer business questions with data

Troubleshooting

Common Issues

Problem: dbt debug fails with "could not connect to server"

  • Solution: Ensure Docker container is running: docker ps

Problem: Ingestion script fails with rate limit error

  • Solution: Increase sleep time in ingest_boxscores.py to 1-2 seconds

Problem: fct_player_game has duplicate player_game_keys

  • Solution: Check for duplicate games in raw_nba_games - deduplicate at ingestion

Problem: Player minutes show as 0 or null

  • Solution: Verify minutes_raw format in raw table - should be "MM:SS"

Resources


Acknowledgments

  • Data provided by NBA.com via the unofficial nba_api library
  • This tutorial is for educational purposes only, demonstrating dbt and analytics engineering concepts
  • Not affiliated with or endorsed by the NBA

You did it! 🎉

You've built a production-quality analytics pipeline from scratch. This same pattern (ingest → stage → transform → mart) scales to enterprise data warehouses.


Ready to level up your data engineering skills? Subscribe below for in-depth tutorials on dbt, data architecture, and modern data stack tools.

Enjoyed this post? Get more like it.

Subscribe to get my latest posts about data engineering, AI, and modern data stack delivered to your inbox.