Skip to main content

Model YAML

tip

Both regular models and source models can use the Model YAML specification described on this page. While SQL models are perfect for simple transformations, Model YAML files provide advanced capabilities for complex data processing scenarios.

When to use Model YAML:

  • Partitions - Optimize performance with data partitioning strategies
  • Incremental models - Process only new or changed data efficiently
  • Pre/post execution hooks - Run custom logic before or after model execution
  • Staging - Create intermediate tables for complex transformations
  • Output configuration - Define specific output formats and destinations

Model YAML files give you fine-grained control over how your data is processed and transformed, making them ideal for production workloads and complex analytics pipelines.

Properties

type - refers to the resource type and must be 'model'(required)

refresh - Specifies the refresh schedule that Rill should follow to re-ingest and update the underlying source data (optional).

  • cron - a cron schedule expression, which should be encapsulated in single quotes, e.g. '* * * * *' (optional)
  • every - a Go duration string, such as 24h (docs) (optional)
refresh:
cron: "0 8 * * *"

timeout — The maximum time to wait for model ingestion (optional).

incremental - set to true or false whether incremental modeling is required (optional)

state - refers to the explicitly defined state of your model, cannot be used with partitions (optional).

  • sql/glob - refers to the location of the data depending if the data is cloud storage or a data warehouse.

partitions - refers to the how your data is partitioned, cannot be used with state. (optional).

  • connector - refers to the connector that the partitions is using (optional).
  • sql - refers to the SQL query used to access the data in your data warehouse, use sql or glob (optional).
  • glob - refers to the location of the data in your cloud warehouse, use sql or glob (optional).
    • path - in the case glob is selected, you will need to set the path of your source (optional).
    • partition - in the case glob is selected, you can defined how to partition the table. directory or hive (optional).
partitions:
connector: duckdb
sql: SELECT range AS num FROM range(0,10)
partitions:
glob:
connector: [s3/gcs]
path: [s3/gs]://path/to/file/**/*.parquet[.csv]

pre_exec – refers to SQL queries to run before the main query, available for DuckDB-based models. (optional). Ensure pre_exec queries are idempotent. Use IF NOT EXISTS statements when applicable.

sql - refers to the SQL query for your model. (required).

post_exec – refers to a SQL query that is run after the main query, available for DuckDB-based models. (optional). Ensure post_exec queries are dempotent. Use IF EXISTS statements when applicable.

pre_exec: ATTACH IF NOT EXISTS 'dbname=postgres host=localhost port=5432 user=postgres password=postgres' AS postgres_db (TYPE POSTGRES)

sql: SELECT * FROM postgres_query('postgres_db', 'SELECT * FROM USERS')

post_exec: DETACH DATABASE IF EXISTS postgres_db

partitions_watermark - refers to a customizable timestamp that can be set to check if an object has been updated (optional).

partitions_concurrency - refers to the number of concurrent partitions that can be read at the same time (optional).

stage - in the case of staging models, where an input source does not support direct write to the output and a staging table is required (optional).

  • connector - refers to the connector type for the staging table
  • path - path of the temporary staging table

output - in the case of staging models, where the output needs to be defined where the staging table will write the temporary data (optional).

  • connector - refers to the connector type for the staging table (optional).
  • incremental_strategy - refers to how the incremental refresh will behave, (merge or append) (optional).
  • unique_key - required if incremental_strategy is defined, refers to the unique column to use to merge (optional).
  • materialize - refers to the output table being materialized (optional).
  • columns - refers to a list of columns if you required to manually define column name and types (optional).
  • engine_full - refers to the ClickHouse engine specifications, (ENGINE = ... PARTITION BY ... ORDER BY ... SETTINGS ...) (optional).

materialize - refers to the model being materialized as a table or not (optional).