Staging Connectors

Some connections, such as Snowflake to ClickHouse, are not natively supported. To ingest data between these systems, a staging table is sometimes required as an intermediate step.

Requirements

To follow this tutorial, you will need to create an account on Snowflake, AWS, and ClickHouse Cloud.

Snowflake: We will be using a sample dataset of horror movies, but you can use any dataset with a date column.

AWS: To use S3 as the staging location, you need an access key set up with read/write access to S3. In this tutorial, my AWS credentials have access to s3://rilldata-public.

ClickHouse: You will write the output dataset from Snowflake to ClickHouse Cloud (or a local ClickHouse instance), so you will need to set up a ClickHouse connector.

Getting the Connections ready

Please refer to our documentation on how to prepare the S3 and Snowflake connections.
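For the ClickHouse side, a minimal connector file (for example, connectors/clickhouse.yaml) looks roughly like the sketch below. The property names shown (host, port, username, password, ssl) are assumptions; check the ClickHouse connector reference in the documentation for the exact fields your setup needs.

# connectors/clickhouse.yaml -- minimal sketch; verify property names against the connector reference
type: connector
driver: clickhouse
host: "<your-clickhouse-host>"   # e.g. your ClickHouse Cloud hostname
port: 9440                       # assumption: the secure native port for ClickHouse Cloud
username: "default"
password: "<your-password>"      # better provided via .env than committed to the project
ssl: true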

Once these are set up, we can create the staging model file. Let's create one called staging_to_CH.yaml.

Creating the YAML components

First, let's define the model. We could add a refresh cron job here, but since the data in Snowflake is static, there is no real need to refresh it. If the dataset were being updated, you would add the incremental and refresh keys as shown below.

type: model

incremental: true
refresh:
  cron: 0 0 * * *

Next, we can define the SQL partitions based on a time frame. Since the data in the horror movies dataset is old, we will pick a range of dates between these bounds. Feel free to navigate to your Snowflake console and run some SQL to get familiar with the data, as you'll probably be using a different table:

select max(release_date) from RILLQA.PUBLIC.HORROR_MOVIES; -- 2022-12-31
select min(release_date) from RILLQA.PUBLIC.HORROR_MOVIES; -- 1950-01-01

Next, we use the range of dates generated for our partitions in the SQL query that reads data from Snowflake:

partitions_concurrency: 3 # process up to 3 partitions concurrently

partitions:
  connector: duckdb
  sql: SELECT range as day FROM range(TIMESTAMPTZ '2022-12-15', TIMESTAMPTZ '2022-12-31', INTERVAL 1 DAY)

connector: snowflake
sql: SELECT * FROM RILLQA.PUBLIC.HORROR_MOVIES where date_trunc('day', release_date) = '{{.partition.day}}' limit 10
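Rill runs the Snowflake query once per partition, substituting each generated day value into the {{.partition.day}} placeholder. For the partition covering 2022-12-20, for example, the query sent to Snowflake resolves to roughly the following (the exact formatting of the substituted timestamp may differ):

SELECT * FROM RILLQA.PUBLIC.HORROR_MOVIES where date_trunc('day', release_date) = '2022-12-20T00:00:00Z' limit 10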

Since Snowflake cannot write directly to ClickHouse and vice versa, we use an S3 staging connector, which both Snowflake and ClickHouse can read from and write to.

stage:
  connector: s3
  path: s3://rilldata-public/temp-data

Lastly, we define the connector to write the final table to.

output:
  connector: clickhouse

Your final output should look like:

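Assembled from the snippets above (your table name, date range, and S3 path will differ if you are using your own dataset), the complete staging_to_CH.yaml is roughly:

# staging_to_CH.yaml
type: model

incremental: true
refresh:
  cron: 0 0 * * *

partitions_concurrency: 3
partitions:
  connector: duckdb
  sql: SELECT range as day FROM range(TIMESTAMPTZ '2022-12-15', TIMESTAMPTZ '2022-12-31', INTERVAL 1 DAY)

connector: snowflake
sql: SELECT * FROM RILLQA.PUBLIC.HORROR_MOVIES where date_trunc('day', release_date) = '{{.partition.day}}' limit 10

stage:
  connector: s3
  path: s3://rilldata-public/temp-data

output:
  connector: clickhouse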

Note

Our team is continuously working to add additional features to staging connectors. If you are looking for a specific combination, please reach out and let us know!

