Skip to main content

Incremental Models

Incremental Models help with the ingestion of large datasets as it allows a dataset to be broken down into smaller sections to ingest rather than a single read of the entire dataset. Unlike standard SQL models that are created via .sql file, incremental models are defined in a YAML file and are used when a large dataset needs to be incrementally ingested to improve ingestion costs and ingestion time.

Take a look at the Reference!

If you are unsure what are the required parameters, please review the reference page for Advanced Models.

Rill supports incremental models on either cloud storage or data warehouses but the parameters to set these up will be different. Cloud storage requires the glob parameter while data warehouses will need to use sql.

See our reference documentation for more information.

Need help with getting Incremental Models set up?

Please reach out to us if you have any questions regarding incremental modeling!

Creating an Incremental Model

In order to enable incremental model, you will need to set the following: incremental: true.

type: model
incremental: true

sql: #some sql query from source_table
Duplicate Data

Incremental models default to an append strategy and with neither state nor partition defined, your data will append data per incremental refresh from the source table. This will result in duplicate data and is not recommended. Instead, using the merge_strategy, use a unique_key to ensure duplicate data is not ingested.

Late arriving Data

If you have late arriving data, you will need to keep this in mind when designing your incremental model. If you simply use max(date) from the source, you may risk leaving out late arriving data. Depending on your specific use-case, you might consider a larger time difference and use a merge as your incremental_strategy.

Incremental Models with State defined (Optional)

If your data is not partitioned, you can define the incremental model with a predefined state parameter. This is only useful for multi-connector incremental ingestion such as BigQuery to DuckDB.

type: model
incremental: true
connector: bigquery

state:
sql: SELECT MAX(date) as max_date

sql: |
SELECT ... FROM events
{{ if incremental }}
WHERE event_time > '{{.state.max_date}}'
{{end}}
output:
connector: duckdb

Once state is defined in an incremental model, its value can be used as a variable in your SQL statement. In the above example, the state gets the most recent date from the model and when incrementally refreshing ingest data for events that are more recent than the state's max_date.

tip

You can verify the current value of your state in the left hand panel under Incremental Processing.

Refreshing an Incremental Model

When you are testing with incremental models in Rill Developer, you will notice a change in the refresh functionality. Instead of a full refresh, you are given the option for incremental refresh.


What's the difference?

Once increments are enabled on a model, this grants you the ability to refresh the model in increments, instead of loading the full data each time. This is handy when your data is massive and re-ingesting the data may take time. For a project on production, this allows for less downtime when needing to update your dashboards when the source data is updated.

There are times when a full refresh may be required. In these cases, running the full refresh is equivalent to running a normal refresh with incremental disabled.

When selecting to refresh incrementally what is being run in the CLI is:

 rill project refresh --local --model <model_name> 

Kind in mind that if you select Full refresh this will start the ingestion of all of your data from scratch. Only use this when absolutely required. When running a full refresh, the CLI command is:

 rill project refresh --local --model <model_name> --full