
OLAP vs OLTP

There are two fundamental database approaches: OLAP and OLTP.

| | OLAP | OLTP |
|---|---|---|
| Stands for | Online Analytical Processing | Online Transaction Processing |
| Purpose | Analysing large data quantities | Processing real-time transactions |
| Operations | Optimised for read-only queries | Supports all CRUD operations (read, insert, update, delete) |
| Query complexity | Complex aggregations on large datasets | Simple, fast queries on individual records |
| Users | Data analysts, business intelligence | End users, applications |
| Examples | Data warehouses, reporting systems | Banking systems, e-commerce, CRMs |

OLAP is the foundational approach used in Data Warehouses. It is used by companies, universities, or any entity that needs to analyse their data at scale.

OLTP systems handle day-to-day operations. The term “transaction” has a dual meaning: a database transaction (an atomic change of state) and a business transaction (an exchange of economic entities). OLTP systems use the first type to record the second.
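The dual meaning is easiest to see in code: a business transaction (a payment) is recorded by a database transaction, so that either every part of the change lands or none of it does. A generic SQL sketch (table and column names are invented for illustration):

```sql
-- A business transaction (customer 42 pays 19.99) recorded as a
-- database transaction: both statements commit together, or neither does.
BEGIN;

UPDATE accounts
SET balance = balance - 19.99
WHERE customer_id = 42;

INSERT INTO payments (customer_id, amount, paid_at)
VALUES (42, 19.99, CURRENT_TIMESTAMP);

COMMIT;
```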

Star Schema and Snowflake Schema

OLAP databases typically use a star schema or snowflake schema to organise data.

                              ┌──────────────────────┐
                              │   Time Dimension     │
                              ├──────────┬───────────┤
                              │ time_id  │ timestamp │
                       ┌─────>│ 1234     │ 2008-09-02│
                       │      └──────────┴───────────┘
  ┌────────────────────┤
  │   Sales Fact Table │      ┌──────────────────────────┐
  ├────────┬───────┬───┤      │   Product Dimension      │
  │ amount │time_id│prod_id   ├──────────┬───────────────┤
  │ 930.10 │ 1234  │ 42 ─────>│ prod_id  │ name          │
  │ 120.50 │ 1235  │ 17 │     │ 42       │ Widget Pro    │
  └────────┴───────┴────┘     └──────────┴───────────────┘

In a star schema, dimension tables connect directly to the fact table (forming a star shape). In a snowflake schema, dimension tables are further normalised into sub-dimensions (e.g., a location dimension split into city, country tables).
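A typical OLAP query fans out from the fact table to its dimensions and then aggregates the measure. A sketch against the star schema above (table names follow the diagram but are illustrative):

```sql
-- Total sales per product per day: join the fact table to both
-- dimensions, then aggregate the amount measure.
SELECT
  p.name,
  DATE(t.timestamp) AS sale_date,
  SUM(f.amount)     AS total_sales
FROM sales_fact AS f
JOIN time_dimension    AS t ON f.time_id = t.time_id
JOIN product_dimension AS p ON f.prod_id = p.prod_id
GROUP BY p.name, sale_date;
```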

Data Warehouse Architecture

A data warehouse collects data from many sources (OLTP databases, CRMs, flat files, APIs) and consolidates it for analytical queries.

  Data Sources          Staging Area         Data Warehouse         Data Marts
 ┌───────────┐        ┌────────────┐       ┌───────────────┐     ┌───────────┐
 │  OLTP DBs │───┐    │            │       │               │  ┌─>│ Marketing │
 ├───────────┤   ├───>│  Cleaning  │──────>│  Integrated   │──┤  ├───────────┤
 │   CRMs    │───┤    │  Validating│       │  Historical   │  ├─>│  Finance  │
 ├───────────┤   │    │  Conforming│       │  Read-Only    │  │  ├───────────┤
 │   Files   │───┘    │            │       │               │  └─>│   Sales   │
 └───────────┘        └────────────┘       └───────────────┘     └───────────┘
                                                                       │
                                                                       v
                                                                   End Users
                                                              (BI tools, reports)

Some users, such as data scientists, may prefer to consume raw data directly from the warehouse or even the staging area, bypassing data marts.


BigQuery

BigQuery is a serverless data warehouse provided by Google Cloud. “Serverless” means the infrastructure (machines, databases, servers) is fully managed — you don’t provision or maintain anything.

Key features:

- Scales from gigabytes to petabytes with no capacity planning
- Separates compute from storage, so each scales independently
- Flexible pricing: on-demand (per TB of data scanned) or flat-rate slot reservations
- Built-in machine learning (BigQuery ML), geospatial analysis, and BI Engine
- Automatic caching of query results

External Tables

Documentation

An external table allows you to query data stored outside of BigQuery (e.g., in Google Cloud Storage) without loading it into BigQuery’s own storage.

Why use external tables?

- Query data in place (e.g. in a GCS bucket) without a load/ingestion step
- Avoid paying twice for storage when the data already lives elsewhere
- Explore raw files before committing to a schema or a load pipeline

Creating an external table:

CREATE OR REPLACE EXTERNAL TABLE my_dataset.my_external_table
OPTIONS (
  format = 'JSON',
  uris = ['gs://my_bucket/my_file.json']
);

This avoids defining a schema explicitly — BigQuery infers it from the data.
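Once created, the external table is queried like any native table; only the storage location differs. For example (using the dataset and table names assumed above):

```sql
-- The data stays in the GCS bucket; BigQuery reads it on the fly.
SELECT COUNT(*) AS row_count
FROM my_dataset.my_external_table;
```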

Limitations of external tables:

- BigQuery cannot estimate how many bytes a query will scan before running it, because the data is not in its own storage
- Query performance is generally lower than for native tables
- No table preview in the console, and consistency is not guaranteed if the underlying files change while a query runs


Partitioning

When queries frequently filter on a specific column (e.g., a date), we can partition the table on that column. Partitioning physically splits the table into segments based on the partition key, so queries that filter on that key only scan the relevant partitions instead of the entire table.

  Unpartitioned Table                  Partitioned by date
 ┌──────────────────────┐          ┌──────────────────────┐
 │ date       │ amount  │          │ 2024-01-01           │
 │ 2024-01-01 │  50.00  │          │  ┌────────┬────────┐ │
 │ 2024-01-02 │  75.00  │          │  │  50.00 │  20.00 │ │
 │ 2024-01-01 │  20.00  │   ──>    │  └────────┴────────┘ │
 │ 2024-01-03 │ 100.00  │          ├──────────────────────┤
 │ 2024-01-02 │  30.00  │          │ 2024-01-02           │
 │ 2024-01-03 │  60.00  │          │  ┌────────┬────────┐ │
 └──────────────────────┘          │  │  75.00 │  30.00 │ │
                                   │  └────────┴────────┘ │
  Query scans ALL rows             ├──────────────────────┤
                                   │ 2024-01-03           │
                                   │  ┌─────────┬───────┐ │
                                   │  │ 100.00  │ 60.00 │ │
                                   │  └─────────┴───────┘ │
                                   └──────────────────────┘

                                   Query for 2024-01-02
                                   scans ONLY that partition

Key constraint: A table can have a maximum of 4,000 partitions.
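Creating a time-partitioned table is a small addition to the DDL. A sketch using illustrative names that mirror the `date`/`amount` columns in the diagram:

```sql
-- Rows are physically grouped by sale_date; a query that filters on
-- sale_date scans only the matching partitions.
CREATE OR REPLACE TABLE my_dataset.sales_partitioned
PARTITION BY sale_date AS
SELECT DATE '2024-01-01' AS sale_date, 50.00 AS amount;

SELECT SUM(amount)
FROM my_dataset.sales_partitioned
WHERE sale_date = '2024-01-02';  -- prunes every other partition
```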

You can partition by:

- A time-unit column (DATE, TIMESTAMP, or DATETIME), with daily, hourly, monthly, or yearly granularity
- Ingestion time (the _PARTITIONTIME pseudocolumn)
- An integer-range column

Ingestion Time Partitioning

When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on when BigQuery ingested the data, not based on any column value in the data itself.

This is useful when:

- The data has no suitable date or timestamp column to partition on
- You care about when the data arrived rather than any timestamp inside it (e.g. append-only event feeds)

BigQuery creates a pseudocolumn called _PARTITIONTIME that stores the ingestion time, truncated to the partition boundary. For example, with hourly partitioning, data ingested at 14:23 gets a _PARTITIONTIME of 14:00.

You can query it like a regular column:

SELECT *
FROM my_dataset.my_table
WHERE _PARTITIONTIME = TIMESTAMP('2024-01-15 14:00:00')

Choosing Partition Granularity

| Granularity | Best for |
|---|---|
| Daily (default) | Data spread over a wide date range, or continuously added over time |
| Hourly | High-volume data spanning a short date range (< 6 months). Watch the 4,000-partition limit |
| Monthly / Yearly | Small daily data volume but a wide date range, or workflows that frequently update rows across many dates |

For monthly/yearly partitions, combine with clustering on the partition column for best performance.
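That combination looks like this in DDL (table and column names are illustrative):

```sql
-- Monthly partitions keep the partition count low across a wide date
-- range; clustering on the same date column restores fine-grained
-- pruning inside each month.
CREATE OR REPLACE TABLE my_dataset.sales_monthly
PARTITION BY DATE_TRUNC(sale_date, MONTH)
CLUSTER BY sale_date AS
SELECT DATE '2024-01-01' AS sale_date, 50.00 AS amount;
```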


Clustering

Documentation

After partitioning, you can further organise data within each partition by clustering. Clustering sorts the data based on the values of one or more columns, so that rows with similar values are stored together in the same storage blocks.

When you query with a filter on a clustered column, BigQuery can skip entire blocks of data that don’t match, dramatically reducing the amount of data scanned.

-- Creating a partitioned AND clustered table
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;

Rules for clustering:

- You can specify up to four clustering columns
- Column order matters: it determines the sort order of the data, so filter on the columns in the order they are listed
- Clustering columns must be top-level, non-repeated columns of types such as DATE, TIMESTAMP, DATETIME, BOOL, INT64, NUMERIC, STRING, or GEOGRAPHY

Clustering vs Partitioning

| | Clustering | Partitioning |
|---|---|---|
| Cost estimation | Cost benefit unknown until the query runs | Cost known upfront (partition pruning) |
| Granularity | Fine-grained, block-level pruning | Coarse-grained, partition-level pruning |
| Management | Automatic | Partition-level management possible (delete/expire partitions) |
| Filter pattern | Queries filter or aggregate on multiple columns | Queries filter on a single column |
| Cardinality | Works well with high-cardinality columns | Works best with low-to-medium cardinality |

When to Use Clustering Over Partitioning

Prefer clustering (instead of or in addition to partitioning) when:

- Partitioning would produce many small partitions (roughly under 1 GB of data each)
- Partitioning would exceed the 4,000-partition limit
- Your writes would modify most partitions frequently (e.g. every few minutes)

Automatic Reclustering

As new data is added to a clustered table, incoming rows may not be physically co-located with existing rows that share the same cluster values. Over time, this can degrade query performance.

BigQuery automatically performs reclustering in the background to re-sort and re-organise data blocks. For partitioned tables, reclustering is scoped to each individual partition. This is free of charge and requires no user action.


Best Practices

Cost Reduction

- Avoid SELECT *; query only the columns you need (BigQuery is columnar, so this directly reduces the bytes scanned)
- Check the estimated bytes processed before running a query
- Use partitioned and clustered tables
- Avoid streaming inserts unless you really need them
- Materialise intermediate query results in stages instead of repeating expensive subqueries

Query Performance

- Filter on partitioned and clustered columns
- Denormalise data, using nested and repeated fields instead of joins where possible
- Avoid external data sources in performance-critical queries
- Reduce the data as early as possible in the query, before any JOIN
- When joining, place the table with the most rows first, then the one with the fewest, then the rest in decreasing size
- Avoid oversharding tables


Internals of BigQuery

Video: BigQuery Internals

BigQuery is built on top of several Google technologies:

- Dremel: the query execution engine, which turns a SQL query into an execution tree
- Colossus: Google's distributed file system, which stores the data in columnar format
- Jupiter: the datacentre network connecting compute and storage at high bandwidth
- Borg: Google's cluster orchestration system (a precursor of Kubernetes), which allocates hardware to Dremel jobs

  Query
    │
    v
 ┌──────┐
 │ Root │  Dremel execution tree
 └──┬───┘
   ┌┴─────────┐
   v           v
┌──────┐  ┌──────┐
│Mixer │  │Mixer │  Intermediate servers
└──┬───┘  └──┬───┘
  ┌┴──┐     ┌┴──┐
  v   v     v   v
┌────┐┌────┐┌────┐┌────┐
│Leaf││Leaf││Leaf││Leaf│  Read from Colossus (columnar storage)
└────┘└────┘└────┘└────┘

The columnar storage format is key to performance: when you query only a few columns, BigQuery reads only those columns from disk, skipping all others entirely. This is fundamentally different from row-oriented databases that must read entire rows.
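A concrete consequence (table name illustrative): the two queries below touch the same rows, but the first reads only two columns from Colossus while the second reads every column, so their scan costs can differ by an order of magnitude.

```sql
-- Columnar storage: only the two referenced columns are read.
SELECT passenger_count, tip_amount
FROM my_dataset.trips;

-- Reads every column of every row, even those that are never used.
SELECT *
FROM my_dataset.trips;
```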

Further reading:


BigQuery Machine Learning

Video: BigQuery ML

BigQuery ML lets you create and execute machine learning models directly in BigQuery using SQL. This removes the need to export data to a separate ML framework (Python/TensorFlow/scikit-learn), which saves time and avoids data movement overhead.

ML Workflow in BigQuery

The traditional ML workflow requires moving data out of the warehouse:

Traditional:  BigQuery → Export → Python/Jupyter → Train → Deploy → Serve

BigQuery ML:  BigQuery → CREATE MODEL → EVALUATE → PREDICT → EXPORT (optional)

With BigQuery ML, the entire workflow stays inside BigQuery. This is particularly valuable when:

- The training data already lives in BigQuery, so there is nothing to export
- The team is stronger in SQL than in Python-based ML frameworks
- You want to avoid the cost and latency of moving large datasets out of the warehouse

Supported model types:

- Linear regression and logistic regression
- K-means clustering
- Time-series forecasting (ARIMA_PLUS)
- Matrix factorisation (recommendations)
- Boosted trees and deep neural networks
- Imported TensorFlow models

Creating a Model

-- Example: Predict tip amount based on trip features
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['tip_amount'],
  data_split_method = 'auto_split'
) AS
SELECT
  passenger_count,
  trip_distance,
  PULocationID,
  DOLocationID,
  payment_type,
  fare_amount,
  tolls_amount,
  tip_amount
FROM
  `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
WHERE
  tip_amount IS NOT NULL;

Model Evaluation and Prediction

Evaluate the model:

SELECT *
FROM ML.EVALUATE(
  MODEL `taxi-rides-ny.nytaxi.tip_model`,
  (
    SELECT passenger_count, trip_distance, PULocationID,
           DOLocationID, payment_type, fare_amount,
           tolls_amount, tip_amount
    FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
    WHERE tip_amount IS NOT NULL
  )
);

This returns metrics such as mean_absolute_error, mean_squared_error, r2_score, and explained_variance.

Make predictions:

SELECT *
FROM ML.PREDICT(
  MODEL `taxi-rides-ny.nytaxi.tip_model`,
  (
    SELECT passenger_count, trip_distance, PULocationID,
           DOLocationID, payment_type, fare_amount,
           tolls_amount
    FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
    LIMIT 10
  )
);

The result includes a predicted_tip_amount column alongside the input features.

Model Deployment

BigQuery ML models can be exported and deployed to external serving infrastructure:

Step 1: Export the model to GCS

bq extract -m taxi-rides-ny:nytaxi.tip_model gs://my-bucket/tip_model

Step 2: Download the model locally

mkdir /tmp/model
gsutil cp -r gs://my-bucket/tip_model /tmp/model

Step 3: Serve with a Docker container (TensorFlow Serving)

docker pull tensorflow/serving

docker run -p 8501:8501 \
  --mount type=bind,source=/tmp/model/tip_model,target=/models/tip_model \
  -e MODEL_NAME=tip_model \
  -t tensorflow/serving

Step 4: Send prediction requests

curl -d '{"instances": [{"passenger_count": 1, "trip_distance": 3.5, ...}]}' \
  -X POST http://localhost:8501/v1/models/tip_model:predict

This workflow lets you train models with SQL in BigQuery and deploy them as REST APIs for real-time inference.


Resources