Source YAML

In your Rill project directory, create a <source_name>.yaml file in the sources directory containing a type and location (uri or path). Rill will automatically detect and ingest the source next time you run rill start.
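
For example, a minimal source definition that ingests a public file over HTTPS might look like the following sketch (the file name and URL are placeholders):

# sources/my_source.yaml (hypothetical file name)
type: source
connector: https
uri: https://data.example.org/path/to/file.parquet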

Did you know?

Files that are nested at any level under your native sources directory will be assumed to be sources (unless otherwise specified by the type property).

Properties

type - Refers to the resource type and must be source (required).

connector — Refers to the connector type for the source (required).

  • https — public files accessible through the web via a http/https URL endpoint
  • s3 — a file available on amazon s3
    • Note: Rill also supports ingesting data from other storage providers that support S3 API. Refer to the endpoint property below.
  • gcs — a file available on Google Cloud Storage
  • local_file — a locally available file in a supported format (e.g. parquet, csv, etc.)
  • motherduck - data stored in motherduck
  • athena - a data store defined in Amazon Athena
  • redshift - a data store in Amazon Redshift
  • postgres - data stored in Postgres
  • sqlite - data stored in SQLite
  • snowflake - data stored in Snowflake
  • bigquery - data stored in BigQuery
  • duckdb - use the embedded DuckDB engine to submit a DuckDB-supported native SELECT query (should be used in conjunction with the sql property)

type - Deprecated; preserved as a legacy alias to connector. It can be used instead to specify the source connector, rather than the resource type described above, but only if the source YAML file lives in the <RILL_PROJECT_DIRECTORY>/sources/ directory (kept primarily for backwards compatibility).

uri — Refers to the URI of the remote file you are ingesting for the source. Rill also supports glob patterns as part of the URI for S3 and GCS (required for type: https, s3, gcs).

  • s3://your-org/bucket/file.parquet — the s3 URI of your file
  • gs://your-org/bucket/file.parquet — the gsutil URI of your file
  • https://data.example.org/path/to/file.parquet — the web address of your file
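
For example, an s3 source whose uri uses a glob pattern might look like the following sketch (the bucket and path are placeholders):

type: source
connector: s3
uri: s3://my-bucket/data/*.parquet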

path — Refers to the local path of the file you are ingesting for the source, relative to your project's root directory (required for type: local_file).

  • /path/to/file.csv — the path to your file
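
A local_file source sketch, assuming a hypothetical CSV at data/example.csv relative to the project root:

type: source
connector: local_file
path: data/example.csv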

sql — Sets the SQL query to extract data from a SQL source: DuckDB/Motherduck/Athena/BigQuery/Postgres/SQLite/Snowflake (optional).

region — Sets the cloud region of the S3 bucket or Athena you want to connect to, using the cloud region identifier (e.g. us-east-1). Only available for S3 and Athena (optional).

endpoint — Overrides the S3 endpoint to connect to. This should only be used to connect to S3-compatible services, such as Cloudflare R2 or MinIO (optional).
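
For illustration, a sketch of an s3 source pointed at an S3-compatible service such as a local MinIO instance; the endpoint URL, bucket, and region values are assumptions:

type: source
connector: s3
uri: s3://my-bucket/data/file.parquet
endpoint: http://localhost:9000
region: us-east-1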

output_location — Sets the query output location for result files in Athena. Note that Rill removes the result files after ingestion, but setting an S3 file retention rule on the output location ensures no orphaned files are left behind (optional).

workgroup — Sets the workgroup for the Athena connector. The workgroup is also used to determine an output location; a workgroup may override output_location if "Override client-side settings" is turned on for the workgroup (optional).
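
Taken together, an Athena source sketch might combine these properties with a sql query; the database, table, bucket, and region values are placeholders:

type: source
connector: athena
sql: SELECT * FROM my_database.my_table
region: us-east-1
workgroup: primary
output_location: s3://my-bucket/athena-results/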

project_id — Sets a project id to be used to run BigQuery jobs (required for type: bigquery).
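
A BigQuery sketch combining project_id with a sql query (the project, dataset, and table names are placeholders):

type: source
connector: bigquery
project_id: my-gcp-project
sql: SELECT * FROM my_dataset.my_table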

glob.max_total_size — Applicable if the URI is a glob pattern. The max allowed total size (in bytes) of all objects matching the glob pattern (optional).

  • Default value is 107374182400 (100GB)

glob.max_objects_matched — Applicable if the URI is a glob pattern. The max allowed number of objects matching the glob pattern (optional).

  • Default value is unlimited

glob.max_objects_listed — Applicable if the URI is a glob pattern. The max number of objects to list and match against glob pattern, not inclusive of files already excluded by the glob prefix (optional).

  • Default value is unlimited
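
Assuming the dotted glob.* property names above are written verbatim as top-level YAML keys, a source that caps ingestion from a glob pattern might look like this sketch (the bucket, path, and limit values are placeholders):

type: source
connector: gcs
uri: gs://my-bucket/events/*.parquet
glob.max_total_size: 10737418240    # 10GB, expressed in bytes as documented above
glob.max_objects_matched: 1000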

timeout — The maximum time to wait for source ingestion (optional).

refresh - Specifies the refresh schedule that Rill should follow to re-ingest and update the underlying source data (optional).

  • cron - a cron schedule expression, which should be encapsulated in single quotes, e.g. '* * * * *' (optional)
  • every - a Go duration string, such as 24h (see Go's time.ParseDuration documentation) (optional)
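
For example, a source that re-ingests on a schedule could add a refresh block like the sketch below (the URI and cron expression are placeholders; every could be used in place of cron):

type: source
connector: gcs
uri: gs://my-bucket/data/file.parquet
refresh:
  cron: '0 * * * *'
  # every: 24h  (alternative to cron)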

extract - Limits the data ingested from remote sources. Only available for S3 and GCS (optional).

  • rows - limits the size of data fetched
    • strategy - strategy to fetch data (head or tail)
    • size - size of data to be fetched (like 100MB, 1GB, etc). This is best-effort and may fetch more data than specified.
  • files - limits the total number of files to be fetched as per glob pattern
    • strategy - strategy to fetch files (head or tail)
    • size - number of files

A note on semantics:
  • If both rows and files are specified, each file matching the files clause will be extracted according to the rows clause.
  • If only rows is specified, no limit on number of files is applied. For example, getting a 1 GB head extract will download as many files as necessary.
  • If only files is specified, each file will be fully ingested.
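
Putting these together, the sketch below asks for roughly the first 100MB from each of the last 10 files matching the glob (the bucket, path, and sizes are placeholders):

type: source
connector: s3
uri: s3://my-bucket/logs/*.parquet
extract:
  files:
    strategy: tail
    size: 10
  rows:
    strategy: head
    size: 100MB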

db — Sets the database for motherduck connections and/or the path to the DuckDB/SQLite db file (optional).

  • For DuckDB / SQLite, if deploying to Rill Cloud, this db file will need to be accessible from the root directory of your project on GitHub.
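
For illustration, a sqlite source referencing a db file committed inside the project might look like this sketch (the file path and table name are placeholders):

type: source
connector: sqlite
db: data/my_database.sqlite
sql: SELECT * FROM my_table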

database_url — The Postgres connection string to use. Refer to the Postgres documentation for more details (optional).
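
A minimal postgres sketch, assuming a placeholder connection string and table name (committing real credentials to a YAML file is generally discouraged):

type: source
connector: postgres
sql: SELECT * FROM public.my_table
database_url: postgres://postgres:postgres@localhost:5432/mydb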

duckdb – Specifies the raw parameters to inject into the DuckDB read_csv, read_json or read_parquet statement that Rill generates internally (optional).

See the DuckDB docs for a full list of available parameters.

Example #1: Define all column data mappings

duckdb:
  header: True
  delim: "'|'"
  columns: "columns={'FlightDate': 'DATE', 'UniqueCarrier': 'VARCHAR', 'OriginCityName': 'VARCHAR', 'DestCityName': 'VARCHAR'}"

Example #2: Define a column type

duckdb:
  header: True
  delim: "'|'"
  columns: "types={'UniqueCarrier': 'VARCHAR'}"

dsn - Used to set the Snowflake connection string. For more information, refer to our Snowflake connector page and the official Go Snowflake Driver documentation (optional).
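
As a sketch, a snowflake source might combine dsn with a sql query; every account, credential, and object name below is a placeholder following the Go Snowflake Driver's user:password@account/database/schema?warehouse=... DSN pattern:

type: source
connector: snowflake
sql: SELECT * FROM my_table
dsn: "myuser:mypassword@myaccount/mydb/myschema?warehouse=mywarehouse"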