Models YAML

This file is used to define models in YAML. For more information on our SQL models, see the SQL models documentation.

Tip

Both regular models and source models can use the Model YAML specification described on this page. While SQL models are perfect for simple transformations, Model YAML files provide advanced capabilities for complex data processing scenarios.

When to use Model YAML:

  • Partitions - Optimize performance with data partitioning strategies
  • Incremental models - Process only new or changed data efficiently
  • Pre/post execution hooks - Run custom logic before or after model execution
  • Staging - Create intermediate tables for complex transformations
  • Output configuration - Define specific output formats and destinations

Model YAML files give you fine-grained control over how your data is processed and transformed, making them ideal for production workloads and complex analytics pipelines.

Properties

type

[string] - Refers to the resource type and must be model (required)

refresh

[object] - Specifies the refresh schedule that Rill should follow to re-ingest and update the underlying model data

  • cron - [string] - A cron expression that defines the execution schedule

  • time_zone - [string] - Time zone to interpret the schedule in (e.g., 'UTC', 'America/Los_Angeles').

  • disable - [boolean] - If true, disables the resource without deleting it.

  • ref_update - [boolean] - If true, allows the resource to run when a dependency updates.

  • run_in_dev - [boolean] - If true, allows the schedule to run in development mode.

refresh:
  cron: "* * * * *"
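
A fuller sketch showing the other refresh options described above; the schedule and time zone values are illustrative:

refresh:
  # illustrative values
  cron: "0 * * * *"
  time_zone: America/Los_Angeles
  run_in_dev: true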

connector

[string] - Specifies the connector to use for the model; needed when setting an explicit OLAP engine (e.g., clickhouse)

sql

[string] - Raw SQL query to run against the source (required)
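
A minimal model file that ties type, connector, and sql together; the connector choice and table name are illustrative:

type: model
connector: duckdb
sql: SELECT * FROM my_source  # my_source is a placeholder table/model name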

pre_exec

[string] - SQL queries to run before the main query; available for DuckDB-based models (optional). Ensure pre_exec queries are idempotent. Use IF NOT EXISTS statements when applicable.

pre_exec: ATTACH IF NOT EXISTS 'dbname=postgres host=localhost port=5432 user=postgres password=postgres' AS postgres_db (TYPE POSTGRES)

post_exec

[string] - A SQL query that is run after the main query; available for DuckDB-based models (optional). Ensure post_exec queries are idempotent. Use IF EXISTS statements when applicable.

post_exec: DETACH DATABASE IF EXISTS postgres_db

retry

[object] - Refers to the retry configuration for the model (optional)

  • attempts - [integer] - The number of attempts to retry the model.

  • delay - [string] - The delay between attempts.

  • exponential_backoff - [boolean] - Whether to use exponential backoff.

  • if_error_matches - [array of string] - The error messages to match.

retry:
  attempts: 5
  delay: 10s
  exponential_backoff: true
  if_error_matches:
    - ".*OvercommitTracker.*"
    - ".*Timeout.*"
    - ".*Bad Gateway.*"

timeout

[string] - The maximum time to wait for model ingestion
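
For example, to cap ingestion time (the duration format shown is an assumption; adjust to your setup):

timeout: 30m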

incremental

[boolean] - Whether incremental modeling is required (optional)

change_mode

[string] - Configure how changes to the model specifications are applied (optional). 'reset' will drop and recreate the model automatically, 'manual' will require a manual full or incremental refresh to apply changes, and 'patch' will switch to the new logic without re-processing historical data (only applies for incremental models).
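
For example, an incremental model that applies spec changes by patching the logic rather than reprocessing history (a sketch using the values described above):

incremental: true
change_mode: patch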

state

[oneOf] - Refers to the explicitly defined state of your model, cannot be used with partitions (optional)

  • option 1 - [object] - Executes a raw SQL query against the project's data models.

    • sql - [string] - Raw SQL query to run against existing models in the project. (required)

    • connector - [string] - specifies the connector to use when running SQL or glob queries.

  • option 2 - [object] - Executes a SQL query that targets a defined metrics view.

    • metrics_sql - [string] - SQL query that targets a metrics view in the project (required)
  • option 3 - [object] - Calls a custom API defined in the project to compute data.

    • api - [string] - Name of a custom API defined in the project. (required)

    • args - [object] - Arguments to pass to the custom API.

  • option 4 - [object] - Uses a file-matching pattern (glob) to query data from a connector.

    • glob - [anyOf] - Defines the file path or pattern to query from the specified connector. (required)

      • option 1 - [string] - A simple file path/glob pattern as a string.

      • option 2 - [object] - An object-based configuration for specifying a file path/glob pattern with advanced options.

    • connector - [string] - Specifies the connector to use with the glob input.

  • option 5 - [object] - Uses the status of a resource as data.

    • resource_status - [object] - Based on resource status (required)

      • where_error - [boolean] - Indicates whether the condition should trigger when the resource is in an error state.

state:
  sql: SELECT MAX(date) as max_date

partitions

[oneOf] - Refers to how your data is partitioned; cannot be used with state (optional)

  • option 1 - [object] - Executes a raw SQL query against the project's data models.

    • sql - [string] - Raw SQL query to run against existing models in the project. (required)

    • connector - [string] - specifies the connector to use when running SQL or glob queries.

  • option 2 - [object] - Executes a SQL query that targets a defined metrics view.

    • metrics_sql - [string] - SQL query that targets a metrics view in the project (required)
  • option 3 - [object] - Calls a custom API defined in the project to compute data.

    • api - [string] - Name of a custom API defined in the project. (required)

    • args - [object] - Arguments to pass to the custom API.

  • option 4 - [object] - Uses a file-matching pattern (glob) to query data from a connector.

    • glob - [anyOf] - Defines the file path or pattern to query from the specified connector. (required)

      • option 1 - [string] - A simple file path/glob pattern as a string.

      • option 2 - [object] - An object-based configuration for specifying a file path/glob pattern with advanced options.

    • connector - [string] - Specifies the connector to use with the glob input.

  • option 5 - [object] - Uses the status of a resource as data.

    • resource_status - [object] - Based on resource status (required)

      • where_error - [boolean] - Indicates whether the condition should trigger when the resource is in an error state.

partitions:
  glob: gcs://my_bucket/y=*/m=*/d=*/*.parquet

partitions:
  connector: duckdb
  sql: SELECT range AS num FROM range(0,10)

materialize

[boolean] - Whether the model will be materialized in the OLAP engine

partitions_watermark

[string] - Refers to a customizable timestamp that can be set to check if an object has been updated (optional).

partitions_concurrency

[integer] - Refers to the number of concurrent partitions that can be read at the same time (optional).
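
A sketch combining the three properties above; the watermark column and concurrency value are illustrative:

materialize: true
partitions_watermark: updated_on   # illustrative column name
partitions_concurrency: 4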

stage

[object] - Used for staging models, where the input source does not support direct writes to the output and an intermediate staging table is required

  • connector - [string] - Refers to the connector type for the staging table (required)

  • path - [string] - Refers to the path to the staging table

stage:
  connector: s3
  path: s3://my_bucket/my_staging_table

output

[object] - Defines the properties of the model output

  • table - [string] - Name of the output table. If not specified, the model name is used.

  • materialize - [boolean] - Whether to materialize the model as a table or view

  • connector - [string] - Refers to the connector type for the output table. Can be clickhouse, duckdb, or a corresponding named connector

  • incremental_strategy - [string] - Strategy to use for incremental updates. Can be 'append', 'merge' or 'partition_overwrite'

  • unique_key - [array of string] - List of columns that uniquely identify a row for merge strategy

  • partition_by - [string] - Column or expression to partition the table by

Additional properties for output when connector is clickhouse

  • type - [string] - Type to materialize the model into. Can be 'TABLE', 'VIEW' or 'DICTIONARY'

  • columns - [string] - Column names and types. Can also include indexes. If unspecified, detected from the query.

  • engine_full - [string] - Full engine definition in SQL format. Can include partition keys, order, TTL, etc.

  • engine - [string] - Table engine to use. Default is MergeTree

  • order_by - [string] - ORDER BY clause.

  • partition_by - [string] - PARTITION BY clause.

  • primary_key - [string] - PRIMARY KEY clause.

  • sample_by - [string] - SAMPLE BY clause.

  • ttl - [string] - TTL settings for the table or columns.

  • table_settings - [string] - Table-specific settings.

  • query_settings - [string] - Settings used in insert/create table as select queries.

  • distributed_settings - [string] - Settings for distributed table.

  • distributed_sharding_key - [string] - Sharding key for distributed table.

  • dictionary_source_user - [string] - User for accessing the source dictionary table (used if type is DICTIONARY).

  • dictionary_source_password - [string] - Password for the dictionary source user.
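
A sketch of an output block targeting ClickHouse; the table name, sorting key, and partition expression are illustrative:

output:
  connector: clickhouse
  table: events_wide          # illustrative
  engine: MergeTree
  order_by: event_time
  partition_by: toYYYYMM(event_time)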

Common Properties

name

[string] - Name is usually inferred from the filename, but can be specified manually.

refs

[array of string] - List of resource references

dev

[object] - Overrides any properties in development environment.

prod

[object] - Overrides any properties in production environment.
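
For example, a sketch that sets an explicit name and limits the data scanned in development; the names and LIMIT value are illustrative:

type: model
name: orders_enriched          # illustrative
refs:
  - orders_raw                 # illustrative
sql: SELECT * FROM orders_raw
dev:
  sql: SELECT * FROM orders_raw LIMIT 10000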

Additional properties when connector is athena or a named connector for athena

output_location

[string] - Output location for query results in S3.

workgroup

[string] - AWS Athena workgroup to use for queries.

region

[string] - AWS region to connect to Athena and the output location.
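
A sketch of a model reading from Athena; the query, bucket, workgroup, and region are illustrative:

type: model
connector: athena
sql: SELECT * FROM my_table        # illustrative
output_location: s3://my-bucket/athena-results/
workgroup: primary
region: us-east-1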

Additional properties when connector is azure or a named connector for azure

path

[string] - Path to the source

account

[string] - Account identifier

uri

[string] - Source URI

extract

[object] - Arbitrary key-value pairs for extraction settings

glob

[object] - Settings related to glob file matching.

  • max_total_size - [integer] - Maximum total size (in bytes) matched by glob

  • max_objects_matched - [integer] - Maximum number of objects matched by glob

  • max_objects_listed - [integer] - Maximum number of objects listed in glob

  • page_size - [integer] - Page size for glob listing

batch_size

[string] - Size of a batch (e.g., '100MB')
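
A sketch of a model reading from Azure Blob Storage; the container, path, and glob limit are illustrative:

type: model
connector: azure
path: azure://my-container/events/*.parquet   # illustrative
glob:
  max_objects_matched: 1000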

Additional properties when connector is bigquery or a named connector for bigquery

project_id

[string] - ID of the BigQuery project.
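
A sketch of a BigQuery-backed model; the project ID and table are illustrative:

type: model
connector: bigquery
project_id: my-gcp-project        # illustrative
sql: SELECT * FROM dataset.events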

Additional properties when connector is duckdb or a named connector for duckdb

path

[string] - Path to the data source.

format

[string] - Format of the data source (e.g., csv, json, parquet).

pre_exec

[string] - SQL queries to run before the main query; available for DuckDB-based models (optional). Ensure pre_exec queries are idempotent. Use IF NOT EXISTS statements when applicable.

post_exec

[string] - A SQL query that is run after the main query; available for DuckDB-based models (optional). Ensure post_exec queries are idempotent. Use IF EXISTS statements when applicable.

pre_exec: ATTACH IF NOT EXISTS 'dbname=postgres host=localhost port=5432 user=postgres password=postgres' AS postgres_db (TYPE POSTGRES);
sql: SELECT * FROM postgres_query('postgres_db', 'SELECT * FROM USERS')
post_exec: DETACH DATABASE IF EXISTS postgres_db

Additional properties when connector is gcs or a named connector for gcs

path

[string] - Path to the source

uri

[string] - Source URI

extract

[object] - key-value pairs for extraction settings

glob

[object] - Settings related to glob file matching.

  • max_total_size - [integer] - Maximum total size (in bytes) matched by glob

  • max_objects_matched - [integer] - Maximum number of objects matched by glob

  • max_objects_listed - [integer] - Maximum number of objects listed in glob

  • page_size - [integer] - Page size for glob listing

batch_size

[string] - Size of a batch (e.g., '100MB')
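
A sketch of a model reading from GCS; the bucket and glob limit are illustrative:

type: model
connector: gcs
path: gs://my-bucket/data/*.parquet   # illustrative
glob:
  max_total_size: 10737418240         # 10 GB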

Additional properties when connector is local_file or a named connector for local_file

path

[string] - Path to the data source.

format

[string] - Format of the data source (e.g., csv, json, parquet).
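
A sketch of a model reading a local file; the path is illustrative:

type: model
connector: local_file
path: data/events.csv   # illustrative
format: csv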

Additional properties when connector is redshift or a named connector for redshift

output_location

[string] - S3 location where query results are stored.

workgroup

[string] - Redshift Serverless workgroup to use.

database

[string] - Name of the Redshift database.

cluster_identifier

[string] - Identifier of the Redshift cluster.

role_arn

[string] - ARN of the IAM role to assume for Redshift access.

region

[string] - AWS region of the Redshift deployment.
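
A sketch of a model reading from Redshift Serverless; all values are illustrative (use cluster_identifier instead of workgroup for a provisioned cluster):

type: model
connector: redshift
sql: SELECT * FROM public.events          # illustrative
database: dev
workgroup: my-workgroup
output_location: s3://my-bucket/redshift-unload/
region: us-east-1
role_arn: arn:aws:iam::123456789012:role/my-redshift-role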

Additional properties when connector is s3 or a named connector for s3

region

[string] - AWS region

endpoint

[string] - AWS Endpoint

path

[string] - Path to the source

uri

[string] - Source URI

extract

[object] - key-value pairs for extraction settings

glob

[object] - Settings related to glob file matching.

  • max_total_size - [integer] - Maximum total size (in bytes) matched by glob

  • max_objects_matched - [integer] - Maximum number of objects matched by glob

  • max_objects_listed - [integer] - Maximum number of objects listed in glob

  • page_size - [integer] - Page size for glob listing

batch_size

[string] - Size of a batch (e.g., '100MB')
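
A sketch of a model reading from S3; the bucket, region, and batch size are illustrative:

type: model
connector: s3
path: s3://my-bucket/data/*.parquet   # illustrative
region: us-east-1
batch_size: 100MB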

Additional properties when connector is salesforce or a named connector for salesforce

soql

[string] - SOQL query to execute against the Salesforce instance.

sobject

[string] - Salesforce object (e.g., Account, Contact) targeted by the query.

queryAll

[boolean] - Whether to include deleted and archived records in the query (uses queryAll API).
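
A sketch of a Salesforce-backed model; the SOQL query is illustrative:

type: model
connector: salesforce
sobject: Account
soql: SELECT Id, Name, CreatedDate FROM Account   # illustrative
queryAll: false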

Examples

Incremental model

type: model
incremental: true
connector: bigquery
state:
  sql: SELECT MAX(date) as max_date
sql: |
  SELECT ... FROM events
  {{ if incremental }}
  WHERE event_time > '{{.state.max_date}}'
  {{end}}
output:
  connector: duckdb

Partitioned model

type: model
partitions:
  glob:
    connector: gcs
    path: gs://rilldata-public/github-analytics/Clickhouse/2025/*/commits_*.parquet
sql: SELECT * FROM read_parquet('{{ .partition.uri }}')
output:
  connector: duckdb
  incremental_strategy: append

Partitioned incremental model

type: model
incremental: true
refresh:
  cron: "0 8 * * *"
partitions:
  glob:
    path: gs://rilldata-public/github-analytics/Clickhouse/2025/*/*
    partition: directory
sql: |
  SELECT *
  FROM read_parquet('gs://rilldata-public/{{ .partition.path }}/commits_*.parquet')
  WHERE '{{ .partition.path }}' IS NOT NULL
output:
  connector: duckdb
  incremental_strategy: append

Staging model

type: model
connector: snowflake
# Use DuckDB to generate a range of days from 1st Jan to today
partitions:
  connector: duckdb
  sql: SELECT range as day FROM range(TIMESTAMPTZ '2024-01-01', now(), INTERVAL 1 DAY)
# Don't reload previously ingested partitions on every refresh
incremental: true
# Query Snowflake for all events belonging to the current partition
sql: SELECT * FROM events WHERE date_trunc('day', event_time) = '{{ .partition.day }}'
# Since ClickHouse can't ingest from Snowflake or vice versa, we use S3 as a temporary staging connector
stage:
  connector: s3
  path: s3://bucket/temp-data
# Produce the final output into ClickHouse; requires a clickhouse.yaml connector defined
output:
  connector: clickhouse