
Splits with Data Warehouses


Another advanced concept in Rill is incremental models. To understand incremental models, we will also need to discuss splits.

Requirements

You will need to set up the connection to your data warehouse; depending on the connector, please refer to our documentation.

Your dataset will require an updated_on column, or an equivalent, to use for incremental processing.

Understanding Splits in Models

Here’s how it works at a high level:

  • Split Definition: Each row from the result set becomes one "split". The model processes each split separately.
  • Execution Strategy:
    • First Split: Runs without incremental processing.
    • Subsequent Splits: Run incrementally, following the output connector's incremental_strategy (either append or merge for SQL connectors).
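To make this concrete, below is a minimal, hypothetical split model that runs entirely on the default DuckDB engine. The splits: resolver and the {{ .split.num }} reference follow the pattern described above, but treat the exact property names as assumptions to verify against the model reference for your Rill version.

# Hypothetical sketch: each row returned by the splits query becomes one
# split, and the model SQL runs once per split.
type: model
splits:
  sql: SELECT range AS num FROM range(0, 10)
sql: SELECT {{ .split.num }} AS num, now() AS inserted_on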

Let's create a basic split model.

  1. Create a YAML file: SQL-incremental-tutorial.yaml

  2. Use the sql: resolver to select data from your data warehouse:

sql: >
  SELECT *
  FROM some_table_in_bq_SF_with_updated_on_column
  {{ if incremental }} WHERE updated_on > CAST(FORMAT_TIMESTAMP('%Y-%m-%d', '{{ .state.max_day }}') AS DATE) {{ end }}

Note that {{ if incremental }} is needed here: it injects the filter only on incremental runs, which is how we increment over your data. As stated in the requirements, you will need an updated_on column to calculate the increments.
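To illustrate how the template resolves, assume a hypothetical run where .state.max_day (defined in the state: section later in this tutorial) came back as 2024-06-01. The first run renders without the filter, while incremental runs include it:

-- First run (not incremental): the WHERE clause is omitted
SELECT *
FROM some_table_in_bq_SF_with_updated_on_column

-- Incremental runs: the filter is rendered with the current state value
SELECT *
FROM some_table_in_bq_SF_with_updated_on_column
WHERE updated_on > CAST(FORMAT_TIMESTAMP('%Y-%m-%d', '2024-06-01') AS DATE)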

Handling Errors in Splits

If you see any errors in the UI regarding splits, you may need to check their status. You can do this via the CLI by running:

rill project splits <model_name> --local

Refreshing Splits

Let's say a specific split in your model had some formatting issues. After fixing the data, you would need to find the key for that split by running rill project splits <model_name> --local. Once found, you can run the following command to refresh only the specific split, instead of the whole model.

rill project refresh --model <model_name> --split <split_key>
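For example, if the split listing surfaced an errored split with key c247a97f for the model SQL-incremental-tutorial (both values are hypothetical; substitute the model name and key from your own output), the command would be:

rill project refresh --model SQL-incremental-tutorial --split c247a97f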

What is Incremental Modeling?

You can use incremental modeling to load only new data when refreshing a dataset. This becomes important when your data is large and reloading everything just to ingest new rows does not make sense.

  1. Add the SQL to calculate the state max day.
incremental: true
state:
  sql: SELECT MAX(updated_on) as max_day FROM some_table_in_bq_SF_with_updated_on_column

This grabs the MAX value of updated_on from your table.

  2. Finally, you will need to define the output and incremental strategy.
output:
  connector: duckdb
  incremental_strategy: append

Please see below for the full YAML file for incremental modeling from a data warehouse to DuckDB.

materialize: true

connector: bigquery
#connector: snowflake

# refresh the model daily at 8 AM UTC
refresh:
  cron: "0 8 * * *"

sql: >
  SELECT *
  FROM some_table_in_bq_SF_with_updated_on_column
  {{ if incremental }} WHERE updated_on > CAST(FORMAT_TIMESTAMP('%Y-%m-%d', '{{ .state.max_day }}') AS DATE) {{ end }}

incremental: true
state:
  sql: SELECT MAX(updated_on) as max_day FROM some_table_in_bq_SF_with_updated_on_column

output:
  connector: duckdb
  incremental_strategy: append

You now have a working incremental model that refreshes new data based on the updated_on key at 8 AM UTC every day, per the refresh: block above. Along with writing to the default OLAP engine, DuckDB, we have also added support for staging tables for connectors that do not have direct read/write capabilities.
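As a rough sketch of that staging flow, an intermediate object store sits between a source and an output that cannot read from each other directly. The stage: block below, and the Snowflake-to-ClickHouse-via-GCS pairing, are illustrative assumptions; check the staging documentation for the connectors you actually use.

# Hypothetical sketch: data is first written to a GCS staging location,
# then loaded into ClickHouse, which cannot read from Snowflake directly.
connector: snowflake
sql: SELECT * FROM some_table_in_bq_SF_with_updated_on_column

stage:
  connector: gcs
  path: gs://my-temp-bucket/staging

output:
  connector: clickhouse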

