Source YAML
In your Rill project directory, create a `<source_name>.yaml` file in the `sources` directory containing a `type` and location (`uri` or `path`). Rill will automatically detect and ingest the source next time you run `rill start`.
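For example, a minimal source definition for a public file over HTTPS might look like the sketch below, using the newer `connector` key described under Properties (the file name and URL are placeholders):

```yaml
# sources/my_source.yaml
connector: "https"
uri: "https://data.example.org/path/to/file.parquet"
```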
Properties
`connector` — the type of connector you are using for the source (required). Possible values include:
- `https` — public files available on the web.
- `s3` — a file available on Amazon S3.
  - Note: Rill also supports ingesting data from other storage providers that support the S3 API. Refer to the `endpoint` property below.
- `gcs` — a file available on Google Cloud Storage.
- `local_file` — a locally available file.
- `motherduck` — data stored in MotherDuck.
- `athena` — a data store defined in Amazon Athena.
- `redshift` — a data store in Amazon Redshift.
- `postgres` — data stored in Postgres.
- `sqlite` — data stored in SQLite.
- `snowflake` — data stored in Snowflake.
- `bigquery` — data stored in BigQuery.
- `duckdb` — use the embedded DuckDB engine to submit a DuckDB-supported native SELECT query (should be used in conjunction with the `sql` property).
`type` — deprecated, but preserved as a legacy alias for `connector`.
`uri` — the URI of the remote connector you are using for the source (required for `type: https`, `s3`, `gcs`). Rill also supports glob patterns as part of the URI for S3 and GCS.
- `s3://your-org/bucket/file.parquet` — the S3 URI of your file
- `gs://your-org/bucket/file.parquet` — the gsutil URI of your file
- `https://data.example.org/path/to/file.parquet` — the web address of your file
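For instance, a GCS source that ingests every Parquet file under a prefix via a glob pattern might look like this sketch (the bucket and prefix are hypothetical):

```yaml
connector: "gcs"
uri: "gs://your-org/bucket/path/*.parquet"
```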
`path` — the local path of the connector you are using for the source, relative to your project's root directory (required for `type: local_file`).
- `/path/to/file.csv` — the path to your file
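A sketch of a local file source, assuming a hypothetical `data/file.csv` inside the project directory:

```yaml
connector: "local_file"
path: "data/file.csv"
```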
`sql` — optionally sets the SQL query to extract data from a SQL source (DuckDB/MotherDuck/Athena/BigQuery/Postgres/SQLite/Snowflake).
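As an illustration, a source that pairs the embedded DuckDB engine with the `sql` property (the query and file path are hypothetical; `read_parquet` is a standard DuckDB table function):

```yaml
connector: "duckdb"
sql: "SELECT * FROM read_parquet('data/file.parquet')"
```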
`region` — optionally sets the cloud region of the bucket or Athena instance you want to connect to. Only available for S3 and Athena.
- `us-east-1` — the cloud region identifier
`endpoint` — optionally overrides the S3 endpoint to connect to. This should only be used to connect to S3-compatible services, such as Cloudflare R2 or MinIO.
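A sketch of an S3-compatible source pointing at a self-hosted MinIO deployment (the endpoint URL and bucket are hypothetical):

```yaml
connector: "s3"
uri: "s3://your-org/bucket/file.parquet"
endpoint: "https://minio.example.com"
```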
`output_location` — optionally sets the query output location for result files in Athena. (Rill removes the result files, but an S3 file-retention rule on the output location will ensure no orphaned files are left behind.)
`workgroup` — optionally sets a workgroup for the Athena connector. The workgroup is also used to determine an output location; a workgroup may override `output_location` if "Override client-side settings" is enabled for the workgroup.
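Putting the Athena properties together, a sketch with placeholder database, table, bucket, and workgroup names (`primary` is Athena's default workgroup):

```yaml
connector: "athena"
sql: "SELECT * FROM my_database.my_table"
region: "us-east-1"
output_location: "s3://your-org/athena-results/"
workgroup: "primary"
```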
`project_id` — sets the project ID used to run BigQuery jobs (mandatory for a BigQuery connection).
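A minimal BigQuery sketch, with a placeholder project ID, dataset, and table:

```yaml
connector: "bigquery"
sql: "SELECT * FROM my_dataset.my_table"
project_id: "my-gcp-project"
```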
`glob.max_total_size` — applicable if the URI is a glob pattern. The maximum allowed total size (in bytes) of all objects matching the glob pattern. Default: `10737418240` (10GB).
`glob.max_objects_matched` — applicable if the URI is a glob pattern. The maximum allowed number of objects matching the glob pattern. Default: `1,000`.
`glob.max_objects_listed` — applicable if the URI is a glob pattern. The maximum number of objects to list and match against the glob pattern (excluding files excluded by the glob prefix). Default: `1,000,000`.
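For example, a glob-based S3 source that tightens these limits below their defaults. This sketch writes the dotted property names exactly as listed above; the bucket and values are illustrative:

```yaml
connector: "s3"
uri: "s3://your-org/bucket/*.parquet"
glob.max_total_size: 1073741824   # 1GB instead of the 10GB default
glob.max_objects_matched: 100
```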
`timeout` — the maximum time to wait for source ingestion.
`refresh` — optionally specify a schedule after which Rill should re-ingest the source.
- `cron` — a cron schedule expression, which should be encapsulated in single quotes, e.g. `'* * * * *'` (optional)
- `every` — a Go duration string, such as `24h` (docs) (optional)
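For example, a sketch that re-ingests a source at the top of every hour (a simple interval could use `every: 24h` instead of `cron`):

```yaml
connector: "s3"
uri: "s3://your-org/bucket/file.parquet"
refresh:
  cron: '0 * * * *'
```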
`extract` — optionally limit the data ingested from remote sources (S3/GCS only).
- `rows` — limits the size of data fetched
  - `strategy` — strategy to fetch data (`head` or `tail`)
  - `size` — size of data to be fetched (like `100MB`, `1GB`, etc.). This is best-effort and may fetch more data than specified.
- `files` — limits the total number of files to be fetched as per the glob pattern
  - `strategy` — strategy to fetch files (`head` or `tail`)
  - `size` — number of files
- Semantics:
  - If both `rows` and `files` are specified, each file matching the `files` clause will be extracted according to the `rows` clause.
  - If only `rows` is specified, no limit on the number of files is applied. For example, getting a 1 GB `head` extract will download as many files as necessary.
  - If only `files` is specified, each file will be fully ingested.
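Combining both clauses, a sketch that pulls roughly the first 100MB from each of the first 10 files matching the glob (bucket and values are placeholders):

```yaml
connector: "gcs"
uri: "gs://your-org/bucket/*.parquet"
extract:
  files:
    strategy: head
    size: 10
  rows:
    strategy: head
    size: 100MB
```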
`db` — optionally sets the database for the MotherDuck connector, or the path to the SQLite DB file.
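A MotherDuck sketch, assuming a hypothetical `md:my_db` database name and table:

```yaml
connector: "motherduck"
db: "md:my_db"
sql: "SELECT * FROM my_table"
```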
`database_url` — the Postgres connection string. Refer to the Postgres docs for the format.
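A Postgres sketch with a placeholder connection string in the standard `postgresql://` URL form:

```yaml
connector: "postgres"
sql: "SELECT * FROM my_table"
database_url: "postgresql://user:password@localhost:5432/my_db"
```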
`duckdb` — optionally specify raw parameters to inject into the DuckDB `read_csv`, `read_json` or `read_parquet` statement that Rill generates internally. See the DuckDB docs for a full list of available parameters. Example usage:

```yaml
duckdb:
  header: True
  delim: "'|'"
  columns: "columns={'FlightDate': 'DATE', 'UniqueCarrier': 'VARCHAR', 'OriginCityName': 'VARCHAR', 'DestCityName': 'VARCHAR'}"
```
`dsn` — optionally sets the Snowflake connection string. For more information, refer to our Snowflake connector page and the official Go Snowflake Driver documentation for the correct syntax to use.
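A Snowflake sketch; the DSN below follows the Go Snowflake Driver's `user:password@account/database/schema?warehouse=wh` shape with placeholder values:

```yaml
connector: "snowflake"
sql: "SELECT * FROM my_table"
dsn: "user:password@my_account/my_db/my_schema?warehouse=my_warehouse"
```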