Source YAML
In your Rill project directory, create a `<source_name>.yaml` file in the `sources` directory containing a `type` and a location (`uri` or `path`). Rill will automatically detect and ingest the source next time you run `rill start`.
Files that are nested at any level under your native `sources` directory will be assumed to be sources (unless otherwise specified by the `type` property).
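As an illustrative sketch (the file name is hypothetical, and the URL is the placeholder used in the examples below), a minimal `sources/my_source.yaml` for a public file served over HTTPS might look like:

# sources/my_source.yaml (hypothetical file name)
type: source
connector: https
uri: https://data.example.org/path/to/file.parquet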
Properties
`type` — Refers to the resource type and must be `source` (required).
`connector` — Refers to the connector type for the source (required).
- `https` — public files accessible through the web via an http/https URL endpoint
- `s3` — a file available on Amazon S3
  - Note: Rill also supports ingesting data from other storage providers that support the S3 API. Refer to the `endpoint` property below.
- `gcs` — a file available on Google Cloud Storage
- `local_file` — a locally available file in a supported format (e.g. Parquet, CSV, etc.)
- `motherduck` — data stored in MotherDuck
- `athena` — a data store defined in Amazon Athena
- `redshift` — a data store in Amazon Redshift
- `postgres` — data stored in Postgres
- `sqlite` — data stored in SQLite
- `snowflake` — data stored in Snowflake
- `bigquery` — data stored in BigQuery
- `duckdb` — use the embedded DuckDB engine to submit a DuckDB-supported native SELECT query (should be used in conjunction with the `sql` property)
`type` — Deprecated legacy alias for `connector`. It can be used to specify the source connector instead of the resource type (see above), but only if the source YAML file lives in the `<RILL_PROJECT_DIRECTORY>/sources/` directory (preserved primarily for backwards compatibility).
`uri` — Refers to the URI of the remote connector you are using for the source. Rill also supports glob patterns as part of the URI for S3 and GCS (required for `type: http, s3, gcs`).
- `s3://your-org/bucket/file.parquet` — the S3 URI of your file
- `gs://your-org/bucket/file.parquet` — the gsutil URI of your file
- `https://data.example.org/path/to/file.parquet` — the web address of your file
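For instance, a minimal sketch of a source that globs over an S3 bucket (the bucket name and prefix are placeholders):

type: source
connector: s3
uri: s3://your-org/bucket/*.parquet  # glob pattern; matches all Parquet files in the bucket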
`path` — Refers to the local path of the file you are using for the source, relative to your project's root directory (required for `type: local_file`).
- `/path/to/file.csv` — the path to your file
`sql` — Sets the SQL query to extract data from a SQL source: DuckDB/MotherDuck/Athena/BigQuery/Postgres/SQLite/Snowflake (optional).
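As one hedged sketch, using the embedded DuckDB engine to query a hypothetical local CSV file:

type: source
connector: duckdb
sql: "SELECT * FROM read_csv_auto('data/example.csv')"  # file path is a placeholder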
`region` — Sets the cloud region of the S3 bucket or Athena you want to connect to, using the cloud region identifier (e.g. `us-east-1`). Only available for S3 and Athena (optional).
`endpoint` — Overrides the S3 endpoint to connect to. This should only be used to connect to S3-compatible services, such as Cloudflare R2 or MinIO (optional).
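A sketch of connecting to an S3-compatible service, assuming a hypothetical MinIO server running locally:

type: source
connector: s3
uri: s3://bucket/path/file.parquet
endpoint: http://localhost:9000  # hypothetical MinIO endpoint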
`output_location` — Sets the query output location for Athena result files. Please note that Rill will remove the result files, but setting an S3 file retention rule for the output location would ensure no orphaned files are left behind (optional).
`workgroup` — Sets a workgroup for the Athena connector. The workgroup is also used to determine an output location. A workgroup may override `output_location` if "Override client-side settings" is turned on for the workgroup (optional).
`project_id` — Sets a project ID to be used to run BigQuery jobs (required for `type: bigquery`).
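A minimal sketch of a BigQuery source, assuming hypothetical project, dataset, and table names:

type: source
connector: bigquery
project_id: my-gcp-project  # hypothetical project ID
sql: "SELECT * FROM my_dataset.my_table"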
`glob.max_total_size` — Applicable if the URI is a glob pattern. The max allowed total size (in bytes) of all objects matching the glob pattern (optional).
- Default value is `107374182400` (100GB)
`glob.max_objects_matched` — Applicable if the URI is a glob pattern. The max allowed number of objects matching the glob pattern (optional).
- Default value is `unlimited`
`glob.max_objects_listed` — Applicable if the URI is a glob pattern. The max number of objects to list and match against the glob pattern, not inclusive of files already excluded by the glob prefix (optional).
- Default value is `unlimited`
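For example, a hedged sketch capping a glob ingest (the bucket name and limits are placeholders):

type: source
connector: gcs
uri: gs://your-org/bucket/*.parquet
glob.max_total_size: 1073741824  # cap matched objects at 1GB total
glob.max_objects_matched: 100    # fail if more than 100 objects match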
`timeout` — The maximum time to wait for source ingestion (optional).
`refresh` — Specifies the refresh schedule that Rill should follow to re-ingest and update the underlying source data (optional).
- `cron` — a cron schedule expression, which should be encapsulated in single quotes, e.g. `'* * * * *'` (optional)
- `every` — a Go duration string, such as `24h` (see Go's duration documentation) (optional)
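A sketch of a source refreshed hourly (the schedule is illustrative):

refresh:
  cron: '0 * * * *'  # re-ingest at the top of every hour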
`extract` — Limits the data ingested from remote sources. Only available for S3 and GCS (optional).
- `rows` — limits the size of data fetched
  - `strategy` — strategy to fetch data (`head` or `tail`)
  - `size` — size of data to be fetched (like `100MB`, `1GB`, etc.). This is best-effort and may fetch more data than specified.
- `files` — limits the total number of files to be fetched as per the glob pattern
  - `strategy` — strategy to fetch files (`head` or `tail`)
  - `size` — number of files
- If both `rows` and `files` are specified, each file matching the `files` clause will be extracted according to the `rows` clause.
- If only `rows` is specified, no limit on the number of files is applied. For example, getting a 1 GB `head` extract will download as many files as necessary.
- If only `files` is specified, each file will be fully ingested.
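A hedged sketch combining both clauses (the bucket is a placeholder): fetch the first 10 files matched by the glob and, from each, roughly the first 100MB:

type: source
connector: s3
uri: s3://your-org/bucket/*.parquet
extract:
  files:
    strategy: head
    size: 10
  rows:
    strategy: head
    size: 100MB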
`db` — Sets the database for MotherDuck connections and/or the path to the DuckDB/SQLite `db` file (optional).
- For DuckDB/SQLite, if deploying to Rill Cloud, this `db` file will need to be accessible from the root directory of your project on GitHub.
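A sketch of a MotherDuck source, assuming hypothetical database and table names (the `md:` prefix is how MotherDuck databases are conventionally addressed):

type: source
connector: motherduck
db: "md:my_database"  # hypothetical MotherDuck database
sql: "SELECT * FROM my_table"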
`database_url` — Postgres connection string that should be used. Refer to the Postgres documentation for more details (optional).
- If not specified in the source YAML, the `connector.postgres.database_url` connection string will need to be set when deploying the project to Rill Cloud.
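A hedged sketch of a Postgres source (host, credentials, and table are placeholders; in practice, credentials are typically supplied via `connector.postgres.database_url` at deploy time rather than committed in YAML):

type: source
connector: postgres
sql: "SELECT * FROM public.my_table"
database_url: "postgres://user:password@localhost:5432/my_db"  # placeholder connection string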
`duckdb` — Specifies the raw parameters to inject into the `read_csv`, `read_json`, or `read_parquet` statement that Rill generates internally (optional). See the DuckDB docs for a full list of available parameters.
Example #1: Define all column data mappings
duckdb:
  header: True
  delim: "'|'"
  columns: "columns={'FlightDate': 'DATE', 'UniqueCarrier': 'VARCHAR', 'OriginCityName': 'VARCHAR', 'DestCityName': 'VARCHAR'}"
Example #2: Define a column type
duckdb:
  header: True
  delim: "'|'"
  columns: "types={'UniqueCarrier': 'VARCHAR'}"
`dsn` — Used to set the Snowflake connection string. For more information, refer to our Snowflake connector page and the official Go Snowflake Driver documentation (optional).
- If not specified in the source YAML, the `connector.snowflake.dsn` connection string will need to be set when deploying the project to Rill Cloud.
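A hedged sketch of a Snowflake source, following the Go Snowflake Driver DSN format (account, credentials, and object names are placeholders):

type: source
connector: snowflake
sql: "SELECT * FROM my_db.my_schema.my_table"
dsn: "my_user:my_password@my_account/my_db/my_schema?warehouse=my_warehouse"  # placeholder DSN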