Update 10_data_source_configuration.md #2082

Merged
merged 2 commits into from
May 15, 2024
52 changes: 27 additions & 25 deletions soda/contracts/adr/10_data_source_configuration.md
@@ -2,15 +2,15 @@

In order to verify (Soda) or push (Atlan) a contract, a data source is required.

A contract file will reference a data source by name. Data sources are configured in
data source YAML files.

Other configurations next to data sources are variables, Soda Cloud connection and Atlan
connections.

Different environments require different configurations. The configuration files are
grouped per environment in a directory. When performing a contract operation, next to
the contract files, the environment configuration is passed as a directory.

In an environment configuration directory, the file names are used to determine the file type:

@@ -25,7 +25,7 @@ In an environment configuration directory, the file names are used to determine
A CLI or API operation like Soda's `verify` or Atlan's `push` requires one or more contract files.
Each contract file will need to resolve the data source by name.

A `data_source` is **required** in a contract file and refers to the data source in the data sources
configuration file.

E.g. `./customers.yml`
@@ -45,28 +45,30 @@ For example: with following file structure...
+- contracts
+- customers.yml
```
With a command line prompt in the root directory, Soda can be invoked with:
```
> soda --configuration .soda --contract contracts/customers.yml
```

In this case `--configuration .soda` is optional, as it is one of the four default locations
from which the configuration is fetched:

```
> soda --contract contracts/customers.yml
```
will by default take the first configuration directory that exists and contains configuration files:
1. ./.soda
2. ./.atlan
3. ${user.home}/.soda
4. ${user.home}/.atlan

*(Note: for the Atlan CLI the precedence of (1) and (2) may be inverted, and the precedence of (3) and (4) may be inverted.)*
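
The lookup order above can be sketched in Python as follows. This is a minimal illustration, not Soda's actual implementation: the function name `resolve_configuration_dir` is hypothetical, and treating "contains configuration files" as "contains at least one `.yml` file" is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_configuration_dir(cwd: Path, home: Path) -> Optional[Path]:
    """Return the first default directory that exists and holds configuration files.

    Hypothetical helper: "contains configuration files" is approximated here
    as "contains at least one *.yml file", which is an assumption.
    """
    # Order follows the list above: working directory first, then user home.
    candidates = [
        cwd / ".soda",
        cwd / ".atlan",
        home / ".soda",
        home / ".atlan",
    ]
    for candidate in candidates:
        if candidate.is_dir() and any(candidate.glob("*.yml")):
            return candidate
    return None
```

A caller would typically pass `Path.cwd()` and `Path.home()`; a `None` result means no default location qualified and the directory has to be given explicitly.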

### Working with environments

Typically, engineers need to work in different environments such as (local) development, CI/CD and production.

All configuration files for an environment must be located in a single configuration directory. That
environment configuration directory is passed to the CLI or API command.

For example: with following file structure...
@@ -84,7 +86,7 @@ For example: with following file structure...
+- contracts
| +- customers.yml
```
Soda can be executed using the `cicd` environment configuration like this:
```shell
soda -cfg ./.soda/cicd ./contracts/customers.yml
```
@@ -98,7 +100,7 @@ data_source: snowflake_landing_zone

### Data source file format

(1) Example
```yaml
- type: snowflake
connection:
@@ -118,11 +120,11 @@ data_source: snowflake_landing_zone

A data source contains a list of connections. Each connection has a type.

A connection also has a list of named `data_sources`. These data source keys must match the names referenced
in the contract files with `data_source`.

Each data source can specify a specific schema, database or any other structural element of the specific SQL engine.
Data source properties like `database`, `schema` (and in the bigquery case `project` and `datasets`) are dependent
on the connection type. For each of these properties we use connection-specific terminology.

Each data source can also override certain connection properties.
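
To illustrate how a contract's `data_source` name could be resolved against this structure, here is a minimal Python sketch. It assumes the parsed YAML is a list of connection entries, each with `type`, `connection` and a `data_sources` mapping keyed by name. The function name `resolve_data_source`, the sample property names and values, and the rule that data source properties simply take precedence over connection properties are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Hypothetical parsed content of a data sources YAML file. Only `type`,
# `connection` and `data_sources` come from the text; the property names
# and values inside them are illustrative assumptions.
CONNECTIONS: List[Dict[str, Any]] = [
    {
        "type": "snowflake",
        "connection": {"account": "example_account", "role": "reader"},
        "data_sources": {
            "snowflake_landing_zone": {
                "database": "LANDING",
                "schema": "PUBLIC",
                "role": "landing_reader",  # replaces the connection-level role
            },
        },
    },
]


def resolve_data_source(connections: List[Dict[str, Any]], name: str) -> Dict[str, Any]:
    """Find the named data source and merge it with its connection properties."""
    for connection in connections:
        data_sources = connection.get("data_sources", {})
        if name in data_sources:
            # Start from the connection properties, then let the data
            # source's own properties take precedence (assumed merge rule).
            resolved = dict(connection.get("connection", {}))
            resolved.update(data_sources[name])
            resolved["type"] = connection["type"]
            return resolved
    raise KeyError(f"Data source '{name}' not found in configuration")
```

Calling `resolve_data_source(CONNECTIONS, "snowflake_landing_zone")` returns the connection properties with the data source's `database`, `schema` and `role` applied on top; an unknown name raises `KeyError`, mirroring the requirement that the contract's `data_source` must match a configured key.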
@@ -134,12 +136,12 @@ Variables must be upper case underscore: regex `[A-Z0-9_]+`

(no jinja templating, only variable substitution without spaces)
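
The naming rule and the "substitution only" restriction can be sketched in Python as below. The `${NAME}` reference syntax is an assumption (it is not shown in this excerpt), and `substitute_variables` is a hypothetical helper, not part of the Soda or Atlan API.

```python
import re
from typing import Mapping

# Assumption: variables are referenced as ${NAME}. The excerpt only states
# the naming rule [A-Z0-9_]+ and that spaces and templating are not allowed.
VARIABLE_PATTERN = re.compile(r"\$\{([A-Z0-9_]+)\}")


def substitute_variables(text: str, variables: Mapping[str, str]) -> str:
    """Replace each ${NAME} reference with its value; fail on undefined names."""

    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Variable '{name}' is not defined")
        return variables[name]

    # References with spaces or lower-case names do not match the pattern
    # and are left untouched.
    return VARIABLE_PATTERN.sub(replace, text)
```

Passing `os.environ` as `variables` would resolve references from environment variables.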

In the future we may define a `vault.yml` configuration file that specifies how to load environment variables from
a vault or secret store.

## Identifying datasets

TODO: describe how datasets are identified as a combination of
* configuration folder name
* data source name
* dataset name