Pre-Aggregations
Materialized rollup tables that store pre-computed aggregations for significant query performance gains, with automatic query routing and refresh strategies.
Pre-aggregations store pre-computed aggregates in materialized rollup tables. The query engine can automatically route queries to a matching pre-aggregation instead of scanning base tables, and rollups can be refreshed on a schedule or when source data changes.
Automatic Recommendations
Sidemantic can automatically analyze your query history and recommend optimal pre-aggregations. This is unique to Sidemantic—no manual analysis of query patterns required.
Fetch query history from BigQuery, Snowflake, Databricks, or ClickHouse and get instant recommendations with benefit scores:
sidemantic preagg recommend --connection "bigquery://project/dataset"
sidemantic preagg apply models/ --connection "bigquery://project/dataset" --top 5
See the Pre-Aggregation Recommendations guide for complete CLI documentation.
Important: Routing Disabled by Default
Pre-aggregation routing is disabled by default and must be explicitly enabled. See Enabling Routing for details.
Why Use Pre-Aggregations?
Performance Benefits
Pre-aggregations provide dramatic performance improvements by reducing the amount of data scanned:
Example: E-commerce orders table with 100M rows
-- Without pre-aggregation: scan 100M rows
SELECT status, SUM(amount)
FROM orders
GROUP BY status
-- With daily pre-aggregation: scan ~36K rows (10 statuses × 365 days × 10 years)
SELECT status, SUM(revenue_raw)
FROM orders_preagg_daily
GROUP BY status
Performance gain: ~2,777x fewer rows scanned
When to Use Pre-Aggregations
Use pre-aggregations when:
- Base tables are large (millions+ rows)
- Queries aggregate data frequently
- Query patterns are predictable (same dimensions/metrics)
- Real-time data freshness isn't critical (can refresh periodically)
Common use cases:
- Dashboard queries that run frequently
- Time-series analytics (daily/monthly trends)
- High-cardinality dimension reductions
- Cross-model aggregations
Defining Pre-Aggregations
Pre-aggregations are defined in the model's YAML configuration:
Basic Example
models:
- name: orders
table: public.orders
primary_key: order_id
dimensions:
- name: status
type: categorical
sql: status
- name: region
type: categorical
sql: region
- name: created_at
type: time
sql: created_at
granularity: day
metrics:
- name: count
agg: count
- name: revenue
agg: sum
sql: amount
pre_aggregations:
- name: daily_summary
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
This creates a pre-aggregation that:
- Groups by the `status` and `region` dimensions
- Pre-computes the `count` and `revenue` metrics
- Aggregates to daily granularity on `created_at`
Configuration Options
Required Fields
| Field | Description | Example |
|---|---|---|
| `name` | Unique pre-aggregation identifier | `daily_summary` |
| `measures` | List of metrics to pre-compute | `[count, revenue]` |
| `dimensions` | List of dimensions to group by | `[status, region]` |
Optional Fields
| Field | Description | Example |
|---|---|---|
| `time_dimension` | Time-based dimension for temporal grouping | `created_at` |
| `granularity` | Time aggregation level | `day` |
| `partition_granularity` | Partition size for incremental refresh | `month` |
| `refresh` | Refresh strategy (see Refresh Configuration) | See examples below |
| `indexes` | Index definitions for performance | See Indexes |
Advanced Configuration
Partitioning
Split large pre-aggregations into partitions for faster incremental refresh:
pre_aggregations:
- name: monthly_partitioned
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
partition_granularity: month # Partition by month
refresh:
every: "1 hour"
incremental: true
update_window: "7 day" # Only refresh last 7 days
Benefits:
- Only refresh changed partitions
- Faster incremental updates
- Reduced rebuild time
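To make the mechanics concrete, an incremental refresh over a 7-day window conceptually replaces only the recent partitions. A rough sketch of the kind of SQL this boils down to (illustrative only; the table and column names follow the conventions described under Generated Pre-Aggregation Tables, and the statements sidemantic actually generates may differ):

```sql
-- Illustrative: rebuild only the partitions inside the update window,
-- leaving older partitions untouched
DELETE FROM orders_preagg_monthly_partitioned
WHERE created_at_day >= CURRENT_DATE - INTERVAL 7 DAY;

INSERT INTO orders_preagg_monthly_partitioned
SELECT
    DATE(created_at) AS created_at_day,
    status,
    region,
    COUNT(*) AS count_raw,
    SUM(amount) AS revenue_raw
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY 1, 2, 3;
```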
Refresh Configuration
Pre-aggregations can be configured to refresh automatically using different strategies. The `refresh` field controls how and when the materialized table is updated.
Time-Based Refresh
Refresh on a fixed schedule:
pre_aggregations:
- name: hourly_refresh
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
refresh:
every: "1 hour" # or "30 minutes", "1 day", "1 week", etc.
Common intervals:
"30 minutes"- High-frequency dashboards"1 hour"- Standard real-time analytics"1 day"- Daily batch processing"1 week"- Historical analysis
SQL-Triggered Refresh
Refresh when source data changes by checking a SQL query result:
pre_aggregations:
- name: smart_refresh
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
refresh:
sql: "SELECT MAX(updated_at) FROM orders"
The pre-aggregation refreshes when the SQL query returns a different value. Common patterns:
- `SELECT MAX(updated_at) FROM table` - Detects new/updated rows
- `SELECT COUNT(*) FROM table` - Detects row count changes
- `SELECT MAX(id) FROM table` - Detects new inserts
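Under the hood this is a change-detection poll: run the watch query, compare the result to the last value seen, and refresh on change. A minimal sketch of that loop, assuming DuckDB (this is not sidemantic's internal implementation; `refresh_preaggregation` is a hypothetical placeholder for the materialization step):

```python
import time
import duckdb

def refresh_preaggregation():
    """Hypothetical placeholder for rebuilding the rollup table."""
    ...

con = duckdb.connect("data.db")
last_seen = None

while True:
    # Run the refresh-key query and compare to the previous result
    current = con.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
    if current != last_seen:
        last_seen = current
        refresh_preaggregation()
    time.sleep(60)  # poll interval
```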
Incremental Refresh
Only update changed partitions instead of rebuilding the entire table:
pre_aggregations:
- name: incremental_update
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
partition_granularity: month
refresh:
every: "1 hour"
incremental: true
update_window: "7 day" # Only refresh last 7 days of partitions
Configuration options:
- `incremental: true` - Enable incremental refresh
- `update_window` - How far back to refresh (e.g., `"7 day"`, `"30 day"`)
- Requires `partition_granularity` to be set
Benefits:
- Much faster refresh times (only rebuilds recent partitions)
- Efficient for append-only data
- Ideal for large historical datasets
When to use:
- Large pre-aggregations with millions+ rows
- Time-series data with limited updates to historical data
- Queries that mostly focus on recent data
Combined Refresh Strategies
Mix time-based and incremental refresh:
pre_aggregations:
- name: combined_strategy
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
partition_granularity: month
refresh:
every: "1 hour" # Check every hour
incremental: true # Only update recent partitions
update_window: "30 day" # Refresh last 30 days
Indexes
Add indexes for query performance:
pre_aggregations:
- name: daily_summary
measures: [count, revenue]
dimensions: [status, region, customer_id]
time_dimension: created_at
granularity: day
indexes:
- name: status_idx
columns: [status]
- name: composite_idx
columns: [status, region, customer_id]
Index ordering: Put high-selectivity columns first.
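On engines with secondary indexes, these definitions correspond to ordinary index DDL against the materialized table (named per the convention under Generated Pre-Aggregation Tables). Roughly:

```sql
-- Illustrative DDL equivalent of the index definitions above
CREATE INDEX status_idx ON orders_preagg_daily_summary (status);
CREATE INDEX composite_idx ON orders_preagg_daily_summary (status, region, customer_id);
```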
Enabling Routing
Warning: Pre-aggregation routing is disabled by default
Queries will use base tables unless routing is explicitly enabled.
Global Enable
Enable for all queries in the semantic layer:
from sidemantic import SemanticLayer
# Enable pre-aggregation routing globally
sl = SemanticLayer(use_preaggregations=True)
Per-Query Override
Override the global setting for specific queries:
# Disabled globally, enable for one query
sl = SemanticLayer(use_preaggregations=False)
sql = sl.compile(
metrics=["orders.revenue"],
dimensions=["orders.status"],
use_preaggregations=True # Enable for this query only
)
# Enabled globally, disable for one query
sl = SemanticLayer(use_preaggregations=True)
sql = sl.query(
metrics=["orders.revenue"],
dimensions=["orders.customer_id"],
use_preaggregations=False # Disable for this query
)
Why Disabled by Default?
- Pre-aggregation tables must exist - Routing fails if materialized tables aren't created in your database
- Safer default - Prevents query errors when pre-aggs aren't materialized
- Explicit opt-in - Users control when optimization is active
Query Matching
The query engine automatically finds the best matching pre-aggregation based on:
Matching Rules
- Dimension Subset: Query dimensions must be a subset of pre-aggregation dimensions
- Measure Compatibility: All query metrics must be derivable from pre-aggregated measures
- Granularity Compatibility: Query time granularity must be coarser or equal to pre-aggregation granularity
- Best Match Selection: Chooses the smallest/most specific matching pre-aggregation
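As a conceptual sketch (simplified, and not sidemantic's actual implementation), the first three rules reduce to set and ordering checks:

```python
# Coarseness order: a query can roll day up to month, never down to hour
GRANULARITY_ORDER = ["hour", "day", "week", "month", "quarter", "year"]

def can_answer(query_dims, query_measures, query_grain, preagg):
    """True if the pre-aggregation can serve the query (simplified)."""
    if not set(query_dims) <= set(preagg["dimensions"]):
        return False  # rule 1: query dimensions must be a subset
    if not set(query_measures) <= set(preagg["measures"]):
        return False  # rule 2: simplified; real derivability is richer (e.g. AVG from SUM + COUNT)
    if query_grain is not None and (
        GRANULARITY_ORDER.index(query_grain) < GRANULARITY_ORDER.index(preagg["granularity"])
    ):
        return False  # rule 3: query granularity must be coarser or equal
    return True
```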
Matching Examples
Given this pre-aggregation:
pre_aggregations:
- name: daily_rollup
measures: [count, revenue]
dimensions: [status, region, customer_id]
time_dimension: created_at
granularity: day
Queries That Match ✓
# Subset of dimensions (just status)
sl.query(metrics=["orders.revenue"], dimensions=["orders.status"])
# → Uses daily_rollup
# All dimensions present
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.status", "orders.region", "orders.customer_id"]
)
# → Uses daily_rollup
# Coarser granularity (month > day) - rolls up
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.created_at__month"]
)
# → Uses daily_rollup, converts day → month
# With filters - applied on top
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.status"],
filters=["orders.region = 'US'"]
)
# → Uses daily_rollup with WHERE region = 'US'
Queries That Don't Match ✗
# Dimension not in pre-agg
sl.query(metrics=["orders.revenue"], dimensions=["orders.product_id"])
# → Falls back to base table
# Finer granularity (hour < day) - can't drill down
sl.query(metrics=["orders.revenue"], dimensions=["orders.created_at__hour"])
# → Falls back to base table
# Metric not in pre-agg
sl.query(metrics=["orders.profit"], dimensions=["orders.status"])
# → Falls back to base table
Measure Derivability
Different aggregation types have different derivability rules:
| Query Metric | Pre-Agg Requires | Derivable? | Notes |
|---|---|---|---|
| `SUM` | `SUM` | ✓ Yes | Direct re-aggregation |
| `COUNT` | `COUNT` | ✓ Yes | Sum counts from partitions |
| `AVG` | `SUM` + `COUNT` | ✓ Yes | `SUM(sum_raw) / SUM(count_raw)` |
| `MIN`/`MAX` | `MIN`/`MAX` | ⚠️ Limited | Only at same granularity |
| `COUNT_DISTINCT` | `COUNT_DISTINCT` | ✗ No | Can't re-aggregate distinct counts |
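The AVG row is worth spelling out: because the rollup stores the raw sum and raw count as separate columns, an exact average can be recomputed at any coarser grouping. Assuming the `_raw` column convention described below:

```sql
-- Exact AVG re-derived from pre-aggregated SUM and COUNT
SELECT
    status,
    SUM(revenue_raw) / SUM(count_raw) AS avg_order_value
FROM orders_preagg_daily_rollup
GROUP BY 1
```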
Generated Pre-Aggregation Tables
When materialized, pre-aggregations are stored as physical tables:
Table Naming
Format: `{model_name}_preagg_{preagg_name}`
Example:
orders_preagg_daily_rollup
Table Structure
For this pre-aggregation:
pre_aggregations:
- name: daily_rollup
measures: [count, revenue]
dimensions: [status]
time_dimension: created_at
granularity: day
Generated table schema:
CREATE TABLE orders_preagg_daily_rollup AS
SELECT
DATE(created_at) as created_at_day,
status,
COUNT(*) as count_raw,
SUM(amount) as revenue_raw
FROM orders
GROUP BY 1, 2
Column naming convention:
- Time dimensions: `{dimension_name}_{granularity}` (e.g., `created_at_day`)
- Regular dimensions: Same name as source
- Measures: `{metric_name}_raw` (e.g., `revenue_raw`)
Generated Queries
When routing to pre-aggregation:
# User query
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.status"],
use_preaggregations=True
)
# Generated SQL
SELECT
status,
SUM(revenue_raw) as revenue
FROM orders_preagg_daily_rollup
GROUP BY 1
Best Practices
1. Start with High-Impact Rollups
Define pre-aggregations for your most frequent query patterns:
# Dashboard query: revenue by region, daily
pre_aggregations:
- name: dashboard_main
measures: [count, revenue, avg_order_value]
dimensions: [region, status]
time_dimension: created_at
granularity: day
2. Layer Pre-Aggregations by Granularity
Create multiple levels for different use cases:
pre_aggregations:
# High-level summary (fast, small)
- name: monthly_summary
measures: [count, revenue]
dimensions: [region]
time_dimension: created_at
granularity: month
# Mid-level detail
- name: daily_by_region
measures: [count, revenue]
dimensions: [region, status]
time_dimension: created_at
granularity: day
# Full detail (large, comprehensive)
- name: daily_full
measures: [count, revenue, avg_order_value]
dimensions: [region, status, customer_segment, product_category]
time_dimension: created_at
granularity: day
partition_granularity: month
refresh:
every: "1 hour"
incremental: true
3. Balance Size vs. Coverage
More dimensions = larger rollup but matches more queries
Strategy:
- Include dimensions used in 80% of queries
- Avoid very high-cardinality dimensions (like customer_id) unless necessary
- Use partitioning for large pre-aggregations
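A back-of-the-envelope size estimate makes the trade-off concrete: a rollup's row count is bounded by the product of its dimension cardinalities and time buckets. With illustrative cardinalities:

```python
# Upper-bound row counts for a daily rollup (real tables are usually
# smaller, since not every combination actually occurs)
statuses, regions, days = 10, 5, 365

low_cardinality = statuses * regions * days      # 18,250 rows per year
with_customer_id = low_cardinality * 1_000_000   # ~18 billion upper bound: likely no better than the base table
```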
4. Use Incremental Refresh for Large Tables
pre_aggregations:
- name: large_rollup
measures: [count, revenue]
dimensions: [status, region, product_id]
time_dimension: created_at
granularity: day
partition_granularity: month # Partition by month
refresh:
every: "1 hour"
incremental: true
update_window: "30 day" # Only refresh last 30 days
5. Add Indexes for Filtered Queries
If queries frequently filter by specific dimensions, add indexes:
pre_aggregations:
- name: daily_rollup
measures: [revenue]
dimensions: [status, region, customer_id]
time_dimension: created_at
granularity: day
indexes:
- name: region_idx
columns: [region] # Fast filtering by region
- name: composite_idx
columns: [region, status] # Fast filtering by both
Limitations
Single-Model Queries Only
Pre-aggregations only work for queries against a single model:
# ✓ Works - single model
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.status"]
)
# ✗ Doesn't work - joins multiple models
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.status", "customers.segment"]
)
# Falls back to base tables
No Support for Window Functions
Queries using cumulative or time-comparison metrics don't use pre-aggregations:
# ✗ Window functions bypass pre-aggregations
sl.query(
metrics=["orders.cumulative_revenue"], # Uses window functions
dimensions=["orders.created_at__day"]
)
COUNT_DISTINCT Not Derivable
Pre-aggregated COUNT_DISTINCT values can't be re-aggregated:
# This won't work well
pre_aggregations:
- name: rollup
measures: [unique_customers] # COUNT_DISTINCT
dimensions: [region]
Alternative: Use HyperLogLog approximations or store exact values.
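For example, BigQuery's HLL functions store a mergeable sketch per group instead of a plain number, which is what makes later re-aggregation possible. A hand-written sketch of that approach (plain BigQuery SQL, not something sidemantic generates for you):

```sql
-- Rollup stores a mergeable HLL sketch per (region, day)
CREATE TABLE orders_customers_rollup AS
SELECT
    region,
    DATE(created_at) AS day,
    HLL_COUNT.INIT(customer_id) AS customers_sketch
FROM orders
GROUP BY region, day;

-- Later queries merge sketches across days and extract the estimate
SELECT
    region,
    HLL_COUNT.MERGE(customers_sketch) AS unique_customers
FROM orders_customers_rollup
GROUP BY region;
```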
Materialization and Refresh
Sidemantic provides a CLI command to materialize and refresh pre-aggregation tables.
Basic Refresh
Refresh all pre-aggregations in your models:
# DuckDB
sidemantic preagg refresh models/ --db data.db
# PostgreSQL, BigQuery, Snowflake, etc.
sidemantic preagg refresh models/ --connection "postgres://localhost:5432/db"
sidemantic preagg refresh models/ --connection "bigquery://project/dataset"
The refresh command:
- Generates materialization SQL for all pre-aggregations
- Creates tables if they don't exist
- Refreshes existing tables based on their `refresh` configuration
- Stateless: derives watermarks from existing table data
- Supports incremental, full, and merge refresh modes
Refresh Modes
Control how pre-aggregations are refreshed:
# Incremental refresh (default) - only update changed partitions
sidemantic preagg refresh models/ --db data.db --mode incremental
# Full refresh - rebuild entire table
sidemantic preagg refresh models/ --db data.db --mode full
# Merge mode - upsert based on partition keys
sidemantic preagg refresh models/ --db data.db --mode merge
# Engine mode - use database-native materialized views
sidemantic preagg refresh models/ --connection "snowflake://..." --mode engine
sidemantic preagg refresh models/ --connection "bigquery://project/dataset" --mode engine
Incremental mode:
- Only refreshes partitions within the `update_window`
- Checks existing table to determine last refresh timestamp
- Efficient for large tables with time-based partitions
- Requires `partition_granularity` and `time_dimension`
- Supported: DuckDB only
Full mode:
- Drops and recreates the entire table
- Use when schema changes or data has significant updates
- Slower but ensures complete accuracy
- Supported: DuckDB only
Merge mode:
- Updates existing partitions and inserts new ones
- Good for slowly-changing dimensions
- Requires database support for MERGE/UPSERT
- Supported: DuckDB only
Engine mode:
- Uses database-native materialized views
- Snowflake: DYNAMIC TABLES with automatic refresh
- ClickHouse: MATERIALIZED VIEWS
- BigQuery: MATERIALIZED VIEWS with scheduled refresh
- Database manages refresh automatically
- Validates SQL compatibility (no window functions, etc.)
- Supported: Snowflake, ClickHouse, BigQuery
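For reference, the kind of object engine mode delegates to on Snowflake is a dynamic table whose target lag mirrors the `every` interval. A hand-written equivalent might look like this (illustrative, not sidemantic's exact output; the warehouse name is a placeholder):

```sql
CREATE OR REPLACE DYNAMIC TABLE orders_preagg_daily_rollup
    TARGET_LAG = '1 hour'        -- corresponds to refresh.every
    WAREHOUSE = transform_wh     -- placeholder warehouse
AS
SELECT
    DATE_TRUNC('day', created_at) AS created_at_day,
    status,
    COUNT(*) AS count_raw,
    SUM(amount) AS revenue_raw
FROM orders
GROUP BY 1, 2;
```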
Selective Refresh
Refresh specific models or pre-aggregations:
# Only refresh pre-aggregations for the 'orders' model
sidemantic preagg refresh models/ --db data.db --model orders
# Only refresh a specific pre-aggregation
sidemantic preagg refresh models/ --db data.db --model orders --preagg daily_rollup
Scheduling Refreshes
The refresh command is stateless and designed to run on a schedule:
Cron example:
# Refresh every hour
0 * * * * sidemantic preagg refresh /path/to/models --db /path/to/data.db
Airflow example:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime
dag = DAG(
'refresh_preaggregations',
schedule_interval='@hourly',
start_date=datetime(2024, 1, 1),
)
refresh_task = BashOperator(
task_id='refresh_preaggs',
bash_command='sidemantic preagg refresh models/ --connection "postgres://host/db"',
dag=dag,
)
GitHub Actions example:
name: Refresh Pre-Aggregations
on:
schedule:
- cron: '0 * * * *' # Every hour
jobs:
refresh:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.11'
- run: pip install sidemantic
- run: sidemantic preagg refresh models/ --connection "${{ secrets.DB_CONNECTION }}"
How Refresh Works
The refresh command:
- Discovers pre-aggregations - Parses model YAML files and finds all pre-aggregations
- Checks existing tables - Queries database to see which tables exist and their current watermarks
- Determines refresh strategy - Based on `refresh` config and `--mode` flag
- Generates SQL - Creates materialization/refresh SQL appropriate for your database
- Executes queries - Runs the SQL to create/update tables
For incremental refresh:
- Reads the `time_dimension` column from the existing table to find the max timestamp
- Only processes partitions after that timestamp (within `update_window`)
- Inserts new data into the partitioned table
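The watermark step is just a query against the rollup itself, which is what makes the command stateless; no external metadata store is needed. A simplified sketch:

```sql
-- Stateless watermark: read the high-water mark from the rollup itself
SELECT MAX(created_at_day) AS watermark
FROM orders_preagg_daily_rollup;
```

Rows at or after the watermark (capped by `update_window`) are then re-aggregated from the base table and inserted, along the lines of the partition-refresh sketch shown earlier under Partitioning.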
Cube Format Import
Sidemantic can import pre-aggregations from Cube.js files:
# Cube format
cubes:
- name: Orders
sql_table: orders
dimensions:
- name: status
sql: status
type: string
- name: createdAt
sql: created_at
type: time
measures:
- name: count
type: count
- name: revenue
sql: amount
type: sum
pre_aggregations:
- name: dailyRollup
type: rollup
measures:
- CUBE.count
- CUBE.revenue
dimensions:
- CUBE.status
time_dimension: CUBE.createdAt
granularity: day
partition_granularity: month
refresh_key:
every: 1 hour
incremental: true
update_window: 7 day
Import with CubeAdapter:
from sidemantic.adapters.cube import CubeAdapter
adapter = CubeAdapter()
graph = adapter.parse("path/to/cube/files")
# Pre-aggregations are automatically imported
model = graph.get_model("Orders")
print(model.pre_aggregations)
Summary
Key Takeaways:
- Pre-aggregations provide 100-10,000x speedups for large datasets
- Disabled by default - must enable with `use_preaggregations=True`
- Automatic routing - queries transparently use matching rollups
- Define in YAML - declarative configuration alongside models
- Use CLI for refresh - `sidemantic preagg refresh` handles materialization
- Best for predictable patterns - dashboard queries, time-series analytics
Next steps:
- Define pre-aggregations for your busiest queries
- Enable routing: `SemanticLayer(use_preaggregations=True)`
- Materialize and refresh tables: `sidemantic preagg refresh models/ --db data.db`
- Monitor performance improvements