Skip to the content.

Table of Contents


Analytics Engineering

Analytics engineering sits at the intersection of data engineering and data analytics. It emerged because modern data tools (cloud warehouses, self-service BI, ELT pipelines) created a gap between the people who load data and the people who analyse it.

The Data Team Roles

Role Focus
Data Engineer Prepares and maintains the data infrastructure. Builds pipelines (ingestion, orchestration), manages warehouses, ensures data availability. Strong software engineering skills
Data Analyst Uses data to answer business questions. Builds dashboards, reports, ad-hoc queries. Strong SQL and BI tool skills, limited programming
Analytics Engineer Bridges the gap. Introduces software engineering practices (version control, CI/CD, testing, documentation) to the data transformation layer. Takes raw data and builds clean, tested, documented datasets that analysts can trust

The analytics engineer owns the transformation layer — the T in ELT. They write the SQL that turns raw loaded data into analytical models.

Tooling

The analytics engineering workflow typically involves:


Data Modelling Concepts

ETL vs ELT

  ETL ELT
Stands for Extract, Transform, Load Extract, Load, Transform
When transforms happen Before loading into the warehouse After loading into the warehouse
Compute Transformations run on a separate processing system Transformations run inside the warehouse itself
Schema approach Schema-on-write (define schema before loading) Schema-on-read (store raw, define structure at query time)
Cost Higher upfront compute costs Higher storage costs (raw data stored first)
Flexibility Less flexible — changes require rebuilding the pipeline More flexible — raw data is always available to re-transform

Modern cloud warehouses (BigQuery, Snowflake) are cheap to store data in and powerful enough to run transformations at scale, making ELT the dominant approach today. dbt is an ELT tool — it assumes the data is already loaded and focuses entirely on the T.

Dimensional Modelling (Kimball)

Ralph Kimball’s dimensional modelling is the most widely used approach for structuring analytical data. The goal is to make data easy to understand and fast to query for business users.

Core principles:

Two types of tables:

  Fact Table: trips                Dimension: zones
 ┌────────────────────────┐      ┌─────────────────────────┐
 │ trip_id                │      │ location_id             │
 │ pickup_location_id  ───┼─────>│ borough                 │
 │ dropoff_location_id ───┼─────>│ zone_name               │
 │ pickup_datetime        │      │ service_zone            │
 │ fare_amount            │      └─────────────────────────┘
 │ trip_distance          │
 └────────────────────────┘       Dimension: vendors
                          │      ┌─────────────────────────┐
                          └─────>│ vendor_id               │
                                 │ vendor_name             │
                                 └─────────────────────────┘

Other approaches:


What is dbt?

dbt (data build tool) is a transformation framework that lets you write SQL SELECT statements and turns them into tables and views in your warehouse. It brings software engineering best practices to SQL-based data transformation.

You write models (SQL files), and dbt handles:

How dbt Works

                    dbt
 ┌─────────────────────────────────────┐
 │                                     │
 │  .sql model files                   │
 │       │                             │
 │       v                             │
 │  Compile (resolve refs, macros)     │
 │       │                             │
 │       v                             │
 │  Build DAG (dependency order)       │
 │       │                             │
 │       v                             │
 │  Execute SQL against warehouse  ────┼──> BigQuery / Snowflake / DuckDB
 │       │                             │
 │       v                             │
 │  Test & Document                    │
 │                                     │
 └─────────────────────────────────────┘

dbt does not extract or load data. It only transforms data that is already in your warehouse. The modelling layer sits between raw loaded data and the final tables that analysts query.

dbt Core vs dbt Cloud

  dbt Core dbt Cloud
What it is Open-source CLI tool Managed SaaS platform built on dbt Core
Interface Command line Web-based IDE + CLI
Execution Runs locally or on your own infrastructure Runs on dbt Labs’ infrastructure
Scheduling You manage scheduling (cron, Airflow, etc.) Built-in job scheduler
Cost Free Free tier available, paid plans for teams
Version control You set up Git integration Built-in Git integration
Documentation dbt docs generate + self-host Hosted documentation site

For this course, both options work. The local setup uses dbt Core + DuckDB (free, no cloud needed). The cloud setup uses dbt Cloud + BigQuery.


dbt Project Structure

A typical dbt project looks like this:

my_dbt_project/
├── dbt_project.yml          # Project configuration
├── models/
│   ├── staging/             # Staging models (1:1 with sources)
│   │   ├── schema.yml       # Source definitions + tests
│   │   ├── stg_green_tripdata.sql
│   │   └── stg_yellow_tripdata.sql
│   └── core/                # Core business models
│       ├── schema.yml
│       ├── dim_zones.sql
│       └── fact_trips.sql
├── macros/                  # Reusable SQL snippets (Jinja)
│   └── get_payment_type_description.sql
├── seeds/                   # Static CSV data
│   └── taxi_zone_lookup.csv
├── tests/                   # Custom data tests
├── packages.yml             # External package dependencies
└── profiles.yml             # Connection configuration (dbt Core only)

Models

Models are the core of dbt. Each model is a .sql file containing a single SELECT statement. dbt compiles and runs these statements to create tables or views in your warehouse.

-- models/staging/stg_green_tripdata.sql
SELECT
    vendorid,
    lpep_pickup_datetime as pickup_datetime,
    lpep_dropoff_datetime as dropoff_datetime,
    passenger_count,
    trip_distance,
    payment_type,
    fare_amount,
    total_amount
FROM {{ source('staging', 'green_tripdata') }}

The {{ source() }} function references a declared data source (defined in a YAML file). The {{ ref() }} function references another dbt model, which tells dbt about the dependency so it can build the DAG.

Key principle: Models only contain SELECT statements. You never write CREATE TABLE or INSERT INTO — dbt generates that DDL/DML based on the materialisation strategy.

Sources

Sources represent the raw data already loaded into your warehouse. You define them in YAML files:

# models/staging/schema.yml
sources:
  - name: staging
    database: my_project       # BigQuery project / DuckDB database
    schema: trips_data_all     # Dataset / schema name
    tables:
      - name: green_tripdata
      - name: yellow_tripdata

Benefits of defining sources:

Seeds

Seeds are small CSV files that dbt loads into your warehouse as tables. Use them for static reference data that rarely changes:

-- seeds/taxi_zone_lookup.csv
locationid,borough,zone,service_zone
1,EWR,Newark Airport,EWR
2,Queens,Jamaica Bay,Boro Zone
3,Bronx,Allerton/Pelham Gardens,Boro Zone

Run dbt seed to load them. They are version-controlled alongside your models. Do not use seeds for large or frequently changing data — use proper data loading tools for that.

Macros

Macros are reusable snippets written in Jinja (a Python templating language). They work like functions in programming — they accept arguments and return SQL.

-- macros/get_payment_type_description.sql
{% macro get_payment_type_description(payment_type) %}
    CASE {{ payment_type }}
        WHEN 1 THEN 'Credit card'
        WHEN 2 THEN 'Cash'
        WHEN 3 THEN 'No charge'
        WHEN 4 THEN 'Dispute'
        WHEN 5 THEN 'Unknown'
        WHEN 6 THEN 'Voided trip'
        ELSE 'EMPTY'
    END
{% endmacro %}

Use it in a model:

SELECT
    payment_type,
    {{ get_payment_type_description('payment_type') }} as payment_type_description
FROM {{ source('staging', 'green_tripdata') }}

dbt also includes built-in macros and you can install third-party macros via packages.

Packages

Packages are reusable dbt projects that you can import. They provide macros, models, and tests written by the community.

-- packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: dbt-labs/codegen
    version: 0.12.1

Run dbt deps to install them. Popular packages:

Tests

dbt tests are assertions about your data. There are two types:

Generic tests (declared in YAML):

# models/staging/schema.yml
models:
  - name: stg_green_tripdata
    columns:
      - name: tripid
        tests:
          - unique
          - not_null
      - name: payment_type
        tests:
          - accepted_values:
              values: [1, 2, 3, 4, 5, 6]

Built-in generic tests:

Singular tests (custom SQL in the tests/ directory):

-- tests/assert_positive_fare.sql
SELECT *
FROM {{ ref('fact_trips') }}
WHERE fare_amount < 0

A test passes if the query returns zero rows. Any rows returned are failures.

Run tests with dbt test.

Documentation

dbt can auto-generate a documentation website from your project:

# models/staging/schema.yml
models:
  - name: stg_green_tripdata
    description: "Cleaned and standardised green taxi trip data"
    columns:
      - name: tripid
        description: "Unique trip identifier (surrogate key)"
      - name: pickup_datetime
        description: "Timestamp when the meter was engaged"

Generate and serve docs:

dbt docs generate
dbt docs serve     # Opens a local web server with the documentation site

The docs site includes:


Building a dbt Project

Materialisation Strategies

Materialisation controls how dbt persists a model in the warehouse:

Strategy What it creates When to use
View A SQL view (CREATE VIEW) Lightweight models, not queried directly by end users. Default materialisation
Table A physical table (CREATE TABLE) Models queried frequently by BI tools or analysts. Faster reads, more storage
Incremental Appends/merges new rows to an existing table Large datasets where rebuilding the full table is too expensive. Only processes new or changed data
Ephemeral Nothing — inlined as a CTE Intermediate logic you don’t want to persist. Used only as a dependency by other models

Set materialisation in the model file or in dbt_project.yml:

-- In the model file
{{ config(materialized='table') }}

SELECT ...
# In dbt_project.yml (applies to all models in the path)
models:
  my_project:
    staging:
      materialized: view
    core:
      materialized: table

The Staging Layer

Staging models sit between raw sources and business logic. They:

-- models/staging/stg_green_tripdata.sql
{{ config(materialized='view') }}

SELECT
    -- identifiers
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    CAST(vendorid AS INTEGER) as vendorid,
    CAST(ratecodeid AS INTEGER) as ratecodeid,
    CAST(pulocationid AS INTEGER) as pickup_locationid,
    CAST(dolocationid AS INTEGER) as dropoff_locationid,

    -- timestamps
    CAST(lpep_pickup_datetime AS TIMESTAMP) as pickup_datetime,
    CAST(lpep_dropoff_datetime AS TIMESTAMP) as dropoff_datetime,

    -- trip info
    passenger_count,
    trip_distance,
    trip_type,

    -- payment
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    CAST(payment_type AS INTEGER) as payment_type,
    {{ get_payment_type_description('payment_type') }} as payment_type_description

FROM {{ source('staging', 'green_tripdata') }}
WHERE vendorid IS NOT NULL

The Core Layer

Core models contain the actual business logic and produce the final tables that analysts query. They reference staging models (and other core models) using {{ ref() }}.

-- models/core/fact_trips.sql
{{ config(materialized='table') }}

WITH green_data AS (
    SELECT *, 'Green' as service_type
    FROM {{ ref('stg_green_tripdata') }}
),
yellow_data AS (
    SELECT *, 'Yellow' as service_type
    FROM {{ ref('stg_yellow_tripdata') }}
),
trips_unioned AS (
    SELECT * FROM green_data
    UNION ALL
    SELECT * FROM yellow_data
)
SELECT
    t.tripid,
    t.vendorid,
    t.service_type,
    t.pickup_locationid,
    t.dropoff_locationid,
    t.pickup_datetime,
    t.dropoff_datetime,
    t.passenger_count,
    t.trip_distance,
    t.fare_amount,
    t.total_amount,
    t.payment_type_description,
    z_pickup.borough as pickup_borough,
    z_pickup.zone as pickup_zone,
    z_dropoff.borough as dropoff_borough,
    z_dropoff.zone as dropoff_zone
FROM trips_unioned t
INNER JOIN {{ ref('dim_zones') }} z_pickup
    ON t.pickup_locationid = z_pickup.locationid
INNER JOIN {{ ref('dim_zones') }} z_dropoff
    ON t.dropoff_locationid = z_dropoff.locationid

Variables

Variables let you parameterise your dbt project. Define defaults in dbt_project.yml and override them at runtime:

# dbt_project.yml
vars:
  is_test_run: true

Use in models:

{% if var('is_test_run', default=true) %}
    LIMIT 100
{% endif %}

Override at runtime:

dbt build --vars '{"is_test_run": false}'

This is useful for running quick test builds during development (with a row limit) and full builds in production.


dbt Commands

Command Description
dbt init Initialise a new dbt project
dbt deps Install packages from packages.yml
dbt seed Load CSV seed files into the warehouse
dbt run Execute all models
dbt test Run all tests
dbt build Run models, tests, seeds, and snapshots in DAG order
dbt compile Compile models to SQL without executing
dbt docs generate Generate documentation
dbt docs serve Serve documentation locally
dbt source freshness Check source data freshness
dbt run --select model_name Run a specific model
dbt run --select +model_name Run a model and all its upstream dependencies
dbt run --select model_name+ Run a model and all its downstream dependents

Deployment

Development vs Production

dbt projects typically have two environments:

  Development Production
Who runs it Individual data engineers / analytics engineers Automated scheduler (dbt Cloud, Airflow, cron)
Frequency On-demand, during development Scheduled (e.g., daily, hourly)
Target schema Personal dev schema (e.g., dbt_paolo) Production schema (e.g., production)
Data Often a subset (using LIMIT via variables) Full dataset
Branch Feature branches Main branch

In dbt Cloud, you configure a production environment that runs on a schedule against the main branch. Development happens in the IDE against a personal schema so you don’t break production models while iterating.

CI/CD with dbt

dbt Cloud supports Continuous Integration by running builds on pull requests:

  1. Developer opens a PR with model changes
  2. dbt Cloud automatically triggers a build of the changed models (and downstream dependents) into a temporary schema
  3. All tests run against the temporary schema
  4. If tests pass, the PR can be merged
  5. Merging to main triggers the production build

This ensures that model changes are tested before they affect production data.


SQL Refresher

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing rows like aggregate functions do.

Syntax:

FUNCTION() OVER (PARTITION BY column ORDER BY column)

ROW_NUMBER, RANK, DENSE_RANK

These functions assign a ranking to each row:

SELECT
    total_amount,
    ROW_NUMBER() OVER (ORDER BY total_amount DESC) as row_num,
    RANK()       OVER (ORDER BY total_amount DESC) as rank,
    DENSE_RANK() OVER (ORDER BY total_amount DESC) as dense_rank
FROM green_tripdata

How they differ with ties (e.g., two rows tied for 2nd place):

Function Result
ROW_NUMBER() 1, 2, 3, 4 (always unique, arbitrary tiebreak)
RANK() 1, 2, 2, 4 (ties get same rank, next rank skipped)
DENSE_RANK() 1, 2, 2, 3 (ties get same rank, no gap)

LAG and LEAD

Retrieve values from previous or subsequent rows without a self-join:

SELECT
    lpep_pickup_datetime,
    total_amount,
    LAG(total_amount)  OVER (ORDER BY lpep_pickup_datetime) as prev_amount,
    LEAD(total_amount) OVER (ORDER BY lpep_pickup_datetime) as next_amount
FROM green_tripdata

PERCENTILE_CONT

Computes a percentile value with linear interpolation:

SELECT
    PULocationID,
    total_amount,
    PERCENTILE_CONT(total_amount, 0.9) OVER (PARTITION BY PULocationID) as p90
FROM green_tripdata

This returns the 90th percentile of total_amount for each pickup location.

Common Table Expressions (CTEs)

CTEs create temporary named result sets using WITH. They make complex queries more readable and reusable:

WITH ranked_trips AS (
    SELECT
        lpep_pickup_datetime,
        total_amount,
        RANK() OVER (ORDER BY total_amount DESC) as rank
    FROM green_tripdata
)
SELECT *
FROM ranked_trips
WHERE rank = 2

CTEs are heavily used in dbt models to break down transformations into logical steps:

WITH trip_duration AS (
    SELECT *,
        TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, SECOND) as duration_seconds
    FROM {{ source('staging', 'fhv_trips') }}
),
with_percentiles AS (
    SELECT
        PULocationID,
        duration_seconds,
        PERCENTILE_CONT(duration_seconds, 0.90) OVER (PARTITION BY PULocationID) as p90
    FROM trip_duration
)
SELECT * FROM with_percentiles

Note: In BigQuery, CTEs are inlined at execution time — they are not cached. If a CTE is referenced multiple times, it executes multiple times. For reusable intermediate results, materialise them as separate dbt models instead.


Resources