Contracts v3 (#2067)
Contracts v3:
* Added spark session support
* Made all sql properties consistently end with ..._sql
* Introduced warehouse as terminology
* Made identity hash instead of long structured name
* Added support for quoting
* Added check level filter_sql support
tombaeyens authored Apr 30, 2024
1 parent 76159ca commit 062b1e2
Showing 58 changed files with 3,093 additions and 3,764 deletions.
6 changes: 3 additions & 3 deletions soda/contracts/adr/02_contract_api.md
@@ -20,16 +20,16 @@ contract verification parameter or some other way.

A wrapper around the DBAPI connection is needed to handle the SQL differences.
It's anticipated that initially the implementation will be based on the existing Soda Core
-DataSource and Scan. But that later there will be direct connection implementations
+Warehouse and Scan. But that later there will be direct connection implementations
for each database.

The returned connection is immediately open.

```python
import logging
-from soda.contracts.connection import Connection, SodaException
+from soda.contracts.impl.warehouse import Connection, SodaException
from soda.contracts.contract import Contract, ContractResult
-from soda.contracts.soda_cloud import SodaCloud
+from soda.contracts.impl.soda_cloud import SodaCloud

connection_file_path = 'postgres_localhost.scn.yml'
contract_file_path = 'customers.sdc.yml'
2 changes: 1 addition & 1 deletion soda/contracts/adr/03_exceptions_vs_error_logs.md
@@ -33,7 +33,7 @@ try:
    soda_cloud: SodaCloud = SodaCloud.from_environment_variables()
    with Connection.from_yaml_file(file_path=connection_file_path) as connection:
        contract: Contract = Contract.from_yaml_file(file_path=contract_file_path)
-        contract_result: ContractResult = contract.verify(connection=connection, soda_cloud=soda_cloud)
+        contract_result: ContractResult = contract.execute(connection=connection, soda_cloud=soda_cloud)
        # contract verification passed
except SodaException as e:
    # contract verification failed
7 changes: 4 additions & 3 deletions soda/contracts/adr/04_link_contract_schema.md
@@ -33,9 +33,10 @@ columns:
```

b) In the API

```python
contract: Contract = Contract.from_yaml_file(file_path=contract_file_path, schema="CSTMR_DATA_PROD")
-contract_result: ContractResult = contract.verify(connection=connection, soda_cloud=soda_cloud)
+contract_result: ContractResult = contract.execute(connection=connection, soda_cloud=soda_cloud)
```

c) (still in idea-stage) We can expand this basic API with a file naming convention that uses relative references to
@@ -58,10 +59,10 @@ then we can add a simpler API like
```python
import logging
from soda.contracts.contract import Contracts
-from soda.contracts.connection import SodaException
+from soda.contracts.impl.warehouse import SodaException

try:
-    Contracts.verify(["postgres_localhost_db/schemas/CSTMR_DATA_PROD/datasets/*.sdc.yml"])
+    Contracts.execute(["postgres_localhost_db/schemas/CSTMR_DATA_PROD/datasets/*.sdc.yml"])
except SodaException as e:
    logging.exception("Problems verifying contract")
    # TODO ensure you stop your orchestration job or pipeline & the right people are notified
4 changes: 2 additions & 2 deletions soda/contracts/adr/07_sql_yaml_keys.md
@@ -1,11 +1,11 @@
In order to make it easier for contract authors to know when they are putting in literal SQL vs Soda Contract interpreted values,
all the keys that are used literally in SQL queries should have `sql` in them.

-For example `sql_expression`, `invalid_sql_regex`, `valid_sql_regex` etc
+For example `sql_expression`, `invalid_regex_sql`, `valid_regex_sql` etc
```yaml
dataset: {table_name}
checks:
-  - type: metric_sql_expression
+  - type: metric_expression
    metric: us_count
    sql_expression: COUNT(CASE WHEN country = 'US' THEN 1 END)
    must_be: 0
59 changes: 50 additions & 9 deletions soda/contracts/adr/09_contract_check_identities.md
@@ -1,15 +1,56 @@
# Contract check identities

TODO explain a bit better:
### From the user perspective

Every check in a contract has an identity.
Check identity is used to correlate checks in files with Soda Cloud.

* why uniqueness required
* relation to schedule
* used in correlation for sodacl
* used in correlation on soda cloud
In contracts, we want to change the user interface regarding identities.

format:
`//{schedule}/{dataset}/{column}/{check_type}/{identity_suffix}`
The contracts parser ensures that every check in a contract has a unique identity.
An error is reported if multiple checks share the same identity. An identity
is generated automatically from a few check properties, including the name. If two
checks would end up with the same identity, users must set the name property to make them unique.

One-way: A check identity is a composite key, but we don't expect to ever need to decompose a check identity into its parts.
> IMPORTANT! All this means that users have to be aware of the Soda Cloud correlation impact when they
> change the name! Changing the name also changes the identity, and hence results in a new check and
> check history on Soda Cloud. In the future we envision a mechanism for renaming a check without losing
> the history by introducing a `name_was` property on a check. When users want to change the name, they
> will have to rename the existing `name` property to `name_was` and create a new `name` property with
> the new name.

Checks automatically get a unique identity as long as there is at most one check in each scope.
A scope is defined by:
* warehouse
* schema
* dataset
* column
* check type

So as long as each check type appears only once in a given list of checks in the YAML, the identities are unique.

In the case of dataset checks like `metric_query` or `metric_expression`, it is likely that
there are multiple checks with the same check type. To keep those unique, a `name` is mandatory.
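
The scope rule above can be sketched as a duplicate check. This is a minimal illustration, not the contracts parser itself: `CheckScope` and `find_duplicate_scopes` are hypothetical names, and treating `name` as one more field of the scope tuple is an assumption about how a name disambiguates two checks of the same type.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class CheckScope(NamedTuple):
    """Hypothetical scope tuple; the fields mirror the scope list above."""
    warehouse: str
    schema: str
    dataset: str
    column: Optional[str]  # None for dataset-level checks
    check_type: str
    name: Optional[str] = None  # assumed tie-breaker for same-type checks


def find_duplicate_scopes(checks: List[CheckScope]) -> List[CheckScope]:
    """Return every scope that occurs more than once, in first-seen order."""
    counts = Counter(checks)
    return [scope for scope, count in counts.items() if count > 1]
```

Two unnamed `metric_expression` checks on the same dataset would show up as a duplicate here, while giving each a distinct `name` would make the list come back empty.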

### Implementation docs

The contract check identity will be a consistent hash (soda/contracts/soda/contracts/impl/consistent_hash_builder.py) based on:

For schema checks:
* warehouse
* schema
* dataset
* check type (=schema)

For all other checks:
* warehouse
* schema
* dataset
* column
* check type
* check name
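
As a rough illustration of the hashing idea, the sketch below derives an identity from the parts listed above. It does not reproduce `consistent_hash_builder.py`: the function name `check_identity`, the use of SHA-256, the separator bytes, and the 8-character truncation are all assumptions made for the example.

```python
import hashlib
from typing import Optional


def check_identity(
    warehouse: str,
    schema: str,
    dataset: str,
    check_type: str,
    column: Optional[str] = None,
    name: Optional[str] = None,
    length: int = 8,  # assumed truncation length, for illustration only
) -> str:
    """Hash the identity parts into a short, stable hex string."""
    if check_type == "schema":
        # Schema checks ignore column and name, per the first list above.
        parts = [warehouse, schema, dataset, check_type]
    else:
        parts = [warehouse, schema, dataset, column, check_type, name]
    hasher = hashlib.sha256()
    for part in parts:
        # Mark None differently from a string and terminate each part,
        # so adjacent parts are not simply concatenated.
        hasher.update(b"\x00" if part is None else b"\x01" + part.encode("utf-8"))
        hasher.update(b"\x1f")
    return hasher.hexdigest()[:length]
```

Because the hash depends only on its inputs, the same check yields the same identity across runs, and changing the `name` yields a different one, which matches the renaming caveat described earlier.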

The check identity will be used as the explicit `identity` in the generated SodaCL.

Soda Core is updated so that it will pass the identity back as the property `source_identity` in the scan results.
The `source_identity` property in the scan results will also be used to correlate the Soda scan check results with
the contract checks for reporting and logging the results.
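
The correlation step described above might look like the following sketch. Only the `source_identity` property comes from the text; the `correlate` function, the plain-dict shape of a scan check result, and the identity-to-name mapping for contract checks are hypothetical stand-ins for whatever structures the implementation actually uses.

```python
from typing import Dict, List, Tuple


def correlate(
    contract_checks: Dict[str, str],
    scan_check_results: List[dict],
) -> Tuple[Dict[str, dict], List[dict]]:
    """Match scan results to contract checks by identity.

    Returns (matched, unmatched): matched maps identity -> scan result,
    unmatched lists results whose source_identity is missing or unknown.
    """
    matched: Dict[str, dict] = {}
    unmatched: List[dict] = []
    for result in scan_check_results:
        identity = result.get("source_identity")
        if identity in contract_checks:
            matched[identity] = result
        else:
            unmatched.append(result)
    return matched, unmatched
```

Keeping the unmatched results separate makes it easy to log scan results that cannot be traced back to any contract check.
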
91 changes: 0 additions & 91 deletions soda/contracts/docs/01_contract_basics.md

This file was deleted.
