Incremental Partitioned Models on Cloud Storage
Now that we understand what Incremental Models and [Partitions](/build/incremental-models/#what-are-partitions) are, let's apply them to our ClickHouse project.
Since our ClickHouse data is hosted in GCS, we will use glob-based partitions instead of the SQL SELECT statements shown in the example.
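For comparison, a SQL-defined partitions block looks roughly like the sketch below; the `range` query and the `num` column are illustrative, and each row the query returns becomes one partition:

```yaml
type: model
partitions:
  # Each row returned by this query becomes one partition (illustrative query)
  sql: SELECT range AS num FROM range(0, 10)
sql: SELECT {{ .partition.num }} AS partition_number
```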
Let's create a basic partitioned model.
In the previous courses, we used a GCS connection to import ClickHouse's repository commit history. Let's go ahead and assume we are using the same folder structure.
```
#gs://rilldata-public/github-analytics/Clickhouse/YYYY/MM/filename_YYYY_MM.parquet
gs://rilldata-public/github-analytics/Clickhouse/*/*/commits_*.parquet
gs://rilldata-public/github-analytics/Clickhouse/*/*/modified_files_*.parquet
```
- Create a YAML file: `partitions-tutorial.yaml`.

- Use the `glob:` resolver to load files from GCS:

  ```yaml
  type: model
  partitions:
    glob:
      connector: gcs
      path: gs://rilldata-public/github-analytics/Clickhouse/2024/*/commits_*.parquet
  ```

- Set the SQL statement to use the URI:

  ```yaml
  sql: SELECT * FROM read_parquet('{{ .partition.uri }}')
  ```
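Putting the two snippets together, the complete `partitions-tutorial.yaml` should look like this:

```yaml
type: model
partitions:
  glob:
    connector: gcs
    path: gs://rilldata-public/github-analytics/Clickhouse/2024/*/commits_*.parquet
sql: SELECT * FROM read_parquet('{{ .partition.uri }}')
```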
Once you save the file, Rill will start to ingest all the partitions from GCS. This may take a few minutes. You can see the progress of the ingestion from the CLI.
```
2024-11-12T13:41:43.355 INFO Executed model partition {"model": "partitions_tutorial", "key": "3c4cdfc819f8a64ecaeecbc9ae9702af", "data": {"path":"github-analytics/Clickhouse/2024/01/commits_2024_01.parquet","uri":"gs://rilldata-public/github-analytics/Clickhouse/2024/01/commits_2024_01.parquet"}, "elapsed": "903.89675ms"}
2024-11-12T13:41:44.158 INFO Executed model partition {"model": "partitions_tutorial", "key": "ecd933fe9b5089f940e592d500b168a0", "data": {"path":"github-analytics/Clickhouse/2024/02/commits_2024_02.parquet","uri":"gs://rilldata-public/github-analytics/Clickhouse/2024/02/commits_2024_02.parquet"}, "elapsed": "802.034542ms"}
2024-11-12T13:41:44.945 INFO Executed model partition {"model": "partitions_tutorial", "key": "0a5023cdd0a340aa95f387bb20c1a942", "data": {"path":"github-analytics/Clickhouse/2024/03/commits_2024_03.parquet","uri":"gs://rilldata-public/github-analytics/Clickhouse/2024/03/commits_2024_03.parquet"}, "elapsed": "786.159292ms"}
```
Once the ingestion completes, you should see all partitions ingested in the UI.
Viewing Partition Status in the UI
If you see any errors on your partitions in the UI, you may need to check their status by selecting "View partitions".
Or, you can check this via the CLI by running:

```bash
rill project partitions <model_name> --local
```
```
  KEY (50)                           DATA                                                                                EXECUTED ON            ELAPSED   ERROR
 ---------------------------------- ----------------------------------------------------------------------------------- ---------------------- --------- -------
  9a71c41f9c9b268e7ca3bedfe4c2774b   {"path":"github-analytics/Clickhouse/2014/01/commits_2014_01.parquet","uri":"gs://rilldata-public/github-analytics/Clickhouse/2014/01/commits_2014_01.parquet"}   2024-11-12T20:40:55Z   667ms
  ...
```
Refreshing Partitions
When issues arise in your model's partitions, you will need to fix the underlying issue and then refresh the specific partitions in Rill. In the UI, you can open the `Showing` dropdown and select `Errors`.
Or, if you prefer the CLI, you can run the following command to refresh all errored partitions:

```bash
rill project refresh --model partitions_tutorial --errored-partitions --project my-rill-tutorial-1 --local
```

```
Error: can't refresh partitions on model "partitions_tutorial" because it is not incremental
```
As we reviewed before, this is only possible when the model is an incremental one! Did you catch the issue before running the command?
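Once the model is incremental (see the next section), you can also target a single partition by its key; a sketch, assuming the `--partition` flag and reusing the partition key from the earlier output:

```bash
rill project refresh --model partitions_tutorial --partition 9a71c41f9c9b268e7ca3bedfe4c2774b --local
```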
What is Incremental Modeling?
Once partitions are set up, you can use incremental modeling to load only new data when refreshing a dataset. This becomes important when your data is large and reloading everything just to ingest the newest data is impractical.
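The linked docs express the same idea with SQL templating instead of glob partitions. A minimal sketch, assuming a `state:` query and the `{{ if incremental }}` template block, with illustrative table and column names:

```yaml
type: model
incremental: true
# Remember the newest ingested timestamp between refreshes (illustrative names)
state:
  sql: SELECT MAX(commit_time) AS max_commit_time FROM incremental_commits
sql: >
  SELECT * FROM read_parquet('gs://rilldata-public/github-analytics/Clickhouse/*/*/commits_*.parquet')
  {{ if incremental }} WHERE commit_time > '{{ .state.max_commit_time }}' {{ end }}
```

In this course, though, we'll stick with glob partitions.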
Let's create an Incremental model for our commits and modified files sources.
- Create the files `CH_incremental_commits.yaml` and `CH_incremental_modified_files.yaml`.

- After copying the previous YAML contents, set `incremental` to `true`.

- You can manually set up a `partitions_watermark`, but since our data uses the `glob` key, it is automatically set to the `updated_on` field.

- Let's set up a `refresh` based on `cron` that runs daily at 8 AM UTC:

  ```yaml
  refresh:
    cron: "0 8 * * *"
  ```
Once Rill ingests the data, you should see the new models in the UI.
Your YAML should look like the following:
```yaml
type: model
incremental: true

refresh:
  cron: "0 8 * * *"

partitions:
  glob:
    connector: gcs
    path: gs://rilldata-public/github-analytics/Clickhouse/*/*/commits_*.parquet # or modified_files_*.parquet
sql: SELECT * FROM read_parquet('{{ .partition.uri }}')
```
You now have a working incremental model that refreshes new data based on the `updated_on` key at 8 AM UTC every day. Along with writing to the default OLAP engine, DuckDB, we have also added support for staging tables for connectors that do not have direct read/write capabilities.
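For a taste of that, a staging model routes data through an intermediate object store before landing in the destination. This is a hypothetical sketch, assuming `stage:` and `output:` blocks along the lines of Rill's staging models; the bucket path and connectors are placeholders:

```yaml
type: model
incremental: true
partitions:
  glob:
    connector: gcs
    path: gs://rilldata-public/github-analytics/Clickhouse/*/*/commits_*.parquet
sql: SELECT * FROM read_parquet('{{ .partition.uri }}')
# Hypothetical temporary staging location for connectors without direct read/write support
stage:
  connector: s3
  path: s3://my-staging-bucket/tmp
# Final destination connector (placeholder)
output:
  connector: clickhouse
```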