Source YAML

In your Rill project directory, create a <source_name>.yaml file in the sources directory containing a type and location (uri or path). Rill will automatically detect and ingest the source next time you run rill start.

Properties

type — the type of connector you are using for the source (required). Possible values include:

  • https — public files available on the web.
  • s3 — a file available on Amazon S3.
    • Note: Rill also supports ingesting data from other storage providers that support the S3 API. Refer to the endpoint property below.
  • gcs — a file available on Google Cloud Storage.
  • local_file — a locally available file.

uri — the URI of the remote file or files you are using for the source (required for type: https, s3, gcs). Rill also supports glob patterns as part of the URI for S3 and GCS.

  • s3://your-org/bucket/file.parquet — the s3 URI of your file
  • gs://your-org/bucket/file.parquet — the gsutil URI of your file
  • https://data.example.org/path/to/file.parquet — the web address of your file
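As a minimal sketch, a remote source file could look like the following. The file name is hypothetical, the bucket and object names are the placeholders used above, and the commented-out glob line assumes standard `*` wildcard syntax.

```yaml
# sources/my_source.yaml (hypothetical file name)
type: s3
# A single object:
uri: s3://your-org/bucket/file.parquet
# Or, assuming standard glob syntax, every Parquet file under the same prefix:
# uri: s3://your-org/bucket/*.parquet
```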

path — the local path of the file you are using for the source, relative to your project's root directory (required for type: local_file).

  • /path/to/file.csv — the path to your file
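For a local file, a sketch of the same idea uses path in place of uri. The path below is the placeholder from the list above, not a real location.

```yaml
# sources/my_local_source.yaml (hypothetical file name)
type: local_file
path: /path/to/file.csv
```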

region — Optionally sets the cloud region of the bucket you want to connect to. Only available for S3.

  • us-east-1 — the cloud region identifier

endpoint — Optionally overrides the S3 endpoint to connect to. This should only be used to connect to S3-compatible services, such as Cloudflare R2 or MinIO.
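The two S3-specific properties might be used as sketched below. The endpoint URL is a made-up placeholder for an S3-compatible service, and whether region is also needed alongside endpoint depends on the provider.

```yaml
type: s3
uri: s3://your-org/bucket/file.parquet
# Pin the bucket's region (S3 only):
region: us-east-1
# For an S3-compatible service, override the endpoint instead.
# The URL below is a hypothetical placeholder:
# endpoint: https://s3.example.org
```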

glob.max_total_size — Applicable if the URI is a glob pattern. The max allowed total size (in bytes) of all objects matching the glob pattern.

  • default value is 10737418240 (10GB)

glob.max_objects_matched — Applicable if the URI is a glob pattern. The max allowed number of objects matching the glob pattern.

  • default value is 1,000

glob.max_objects_listed — Applicable if the URI is a glob pattern. The max number of objects to list and match against the glob pattern (not counting objects already excluded by the glob prefix).

  • default value is 1,000,000
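The glob limits could be tightened as in the sketch below. It assumes the dotted property names are written literally as top-level keys, and the values are arbitrary examples chosen to be lower than the defaults.

```yaml
type: gcs
uri: gs://your-org/bucket/*.parquet
# Allow at most 1 GiB in total across all matched objects (default: 10737418240)
glob.max_total_size: 1073741824
# Allow at most 100 matched objects (default: 1,000)
glob.max_objects_matched: 100
# List at most 10,000 objects while matching (default: 1,000,000)
glob.max_objects_listed: 10000
```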

timeout — The maximum time to wait for source ingestion.

extract - Optionally limits the data ingested from remote sources (S3 and GCS only).

  • rows - limits the size of data fetched
    • strategy - strategy to fetch data (head or tail)
    • size - size of data to be fetched (like 100MB, 1GB, etc). This is best-effort and may fetch more data than specified.
  • files - limits the total number of files to be fetched as per glob pattern
    • strategy - strategy to fetch files (head or tail)
    • size - number of files
  • Semantics
    • If both rows and files are specified, each file matching the files clause will be extracted according to the rows clause.
    • If only rows is specified, no limit on number of files is applied. For example, getting a 1 GB head extract will download as many files as necessary.
    • If only files is specified, each file will be fully ingested.
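The semantics above can be illustrated with a sketch that combines both clauses. The nesting of keys is inferred from the list structure above, and the sizes are arbitrary examples.

```yaml
type: s3
uri: s3://your-org/bucket/*.parquet
extract:
  # Limit the number of files fetched for the glob pattern
  files:
    strategy: head
    size: 10
  # Within each of those files, fetch roughly this much data (best-effort)
  rows:
    strategy: head
    size: 100MB
```

With both clauses present, each file selected by files is extracted according to rows. Removing the files clause would apply the rows limit with no cap on the number of files, and removing the rows clause would ingest each selected file in full.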