Source YAML
In your Rill project directory, create a <source_name>.yaml file in the sources directory containing a type and location (uri or path). Rill will automatically detect and ingest the source the next time you run rill start.
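As a minimal sketch of such a file, assuming a hypothetical source saved as sources/my_source.yaml and reusing the placeholder bucket from the uri examples on this page:

```yaml
# sources/my_source.yaml (hypothetical file name)
# type selects the connector; uri gives the location of the remote file.
type: s3
uri: s3://your-org/bucket/file.parquet
```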
Properties
type
— the type of connector you are using for the source (required). Possible values include:
https
— public files available on the web.
s3
— a file available on Amazon S3.
- Note: Rill also supports ingesting data from other storage providers that support the S3 API. Refer to the endpoint property below.
gcs
— a file available on Google Cloud Storage.
local_file
— a locally available file.
uri
— the URI of the remote connector you are using for the source (required for type: https, s3, gcs). Rill also supports glob patterns as part of the URI for S3 and GCS.
s3://your-org/bucket/file.parquet
— the S3 URI of your file
gs://your-org/bucket/file.parquet
— the gsutil URI of your file
https://data.example.org/path/to/file.parquet
— the web address of your file
path
— the local path of the file you are using for the source, relative to your project's root directory (required for type: local_file).
/path/to/file.csv
— the path to your file
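For a local file, a source definition might look like the sketch below. The path is the placeholder from this page; substitute the location of your own file.

```yaml
# A local source uses path instead of uri.
type: local_file
path: /path/to/file.csv
```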
region
— Optionally sets the cloud region of the bucket you want to connect to. Only available for S3.
us-east-1
— the cloud region identifier
endpoint
— Optionally overrides the S3 endpoint to connect to. This should only be used to connect to S3-compatible services, such as Cloudflare R2 or MinIO.
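The two S3-only properties above could be used as in the following sketch. The endpoint URL is a hypothetical placeholder for an S3-compatible service and is not a real address; the two definitions are shown as separate YAML documents and would live in separate source files.

```yaml
# Source 1: a bucket on Amazon S3 in an explicit region.
type: s3
uri: s3://your-org/bucket/file.parquet
region: us-east-1
---
# Source 2: a bucket on an S3-compatible service (for example MinIO).
# The endpoint value below is a made-up placeholder.
type: s3
uri: s3://your-org/bucket/file.parquet
endpoint: https://minio.example.org
```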
glob.max_total_size
— Applicable if the URI is a glob pattern. The max allowed total size (in bytes) of all objects matching the glob pattern.
- default value is 10737418240 (10GB)
glob.max_objects_matched
— Applicable if the URI is a glob pattern. The max allowed number of objects matching the glob pattern.
- default value is 1,000
glob.max_objects_listed
— Applicable if the URI is a glob pattern. The max number of objects to list and match against the glob pattern (excluding files excluded by the glob prefix).
- default value is 1,000,000
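The glob limits can be tightened per source. The sketch below assumes the dotted property names are written verbatim as top-level keys and that ** and * are accepted as glob wildcards; the bucket layout and the chosen limits are illustrative only.

```yaml
type: s3
# Glob pattern matching every Parquet file under the bucket (assumed syntax).
uri: s3://your-org/bucket/**/*.parquet
# Allow at most 1 GiB in total across all matched objects (value in bytes).
glob.max_total_size: 1073741824
# Allow at most 100 matched objects.
glob.max_objects_matched: 100
# List at most 10,000 objects while searching for matches.
glob.max_objects_listed: 10000
```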
timeout
— The maximum time to wait for source ingestion.
extract
- Optionally limit the data ingested from remote sources (S3/GCS only)
rows
- limits the size of data fetched
  strategy
  - strategy to fetch data (head or tail)
  size
  - size of data to be fetched (like 100MB, 1GB, etc). This is best-effort and may fetch more data than specified.
files
- limits the total number of files to be fetched as per the glob pattern
  strategy
  - strategy to fetch files (head or tail)
  size
  - number of files
- Semantics
  - If both rows and files are specified, each file matching the files clause will be extracted according to the rows clause.
  - If only rows is specified, no limit on the number of files is applied. For example, getting a 1 GB head extract will download as many files as necessary.
  - If only files is specified, each file will be fully ingested.
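To illustrate the case where both clauses are set, here is a sketch that assumes strategy and size nest under rows and files as the property hierarchy above suggests. The glob pattern and the chosen sizes are placeholders.

```yaml
type: s3
uri: s3://your-org/bucket/**/*.parquet
extract:
  # Fetch only the first 10 files matching the glob pattern.
  files:
    strategy: head
    size: 10
  # From each of those files, fetch roughly the last 100MB (best-effort).
  rows:
    strategy: tail
    size: 100MB
```

Dropping the rows block would ingest each of the 10 files in full; dropping the files block would apply the size limit without capping the number of files.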