Merge pull request #10 from kgmcquate/main
Added string length tests and sampling
kgmcquate authored Jan 3, 2024
2 parents 04f33b9 + a0a5c9a commit 6079d4b
Showing 21 changed files with 503 additions and 35 deletions.
14 changes: 14 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,14 @@
{
"sqltools.connections": [
{
"previewLimit": 50,
"server": "lake-freeze-db.cu0bcthnum69.us-east-1.rds.amazonaws.com",
"port": 5432,
"driver": "PostgreSQL",
"name": "weather",
"database": "postgres",
"username": "postgres",
"password": "8bF6G!Wy"
}
]
}
41 changes: 21 additions & 20 deletions README.md
@@ -1,8 +1,27 @@
# dbt-testgen

dbt-testgen autogenerates dbt test yaml based on real data.
## About
`dbt-testgen` is a [dbt](https://github.com/dbt-labs/dbt) package that autogenerates dbt test YAML based on real data.

Inspired by [dbt-codegen]() and [deequ Constraint Suggestion](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/constraint_suggestion_example.md)
Code documentation available at [here](https://kgmcquate.github.io/dbt-testgen/)

Inspired by [dbt-codegen](https://github.com/dbt-labs/dbt-codegen) and [deequ Constraint Suggestion](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/constraint_suggestion_example.md).

## Install
`dbt-testgen` currently supports `dbt 1.2.x` or higher.

Include in `packages.yml`:
```yaml
packages:
- git: https://github.com/kgmcquate/dbt-testgen
```
## Supported Databases
The following databases are supported:
- Snowflake
- RedShift
- Postgres
- DuckDB
## Usage
The DBT config YAML is generated by a Jinja macro, `get_test_suggestions`, which you can run like this:
@@ -23,12 +42,6 @@ models:
max_value: 30
```

## Supported Databases
The following databases are supported:
- Snowflake
- Postgres
- DuckDB
## Test types
dbt-testgen can generate these types of tests:
- [uniqueness](#uniqueness)
@@ -38,15 +51,3 @@ dbt-testgen can generate these types of tests:
- [accepted_values](#accepted-values)
- [freshness](#freshness)

### Uniqueness
### Not null
### String length
### Range
### Mean and stddev
###
4 changes: 1 addition & 3 deletions dbt_project.yml
@@ -1,8 +1,6 @@
name: 'testgen'
version: '0.0.1'

profile: postgres

require-dbt-version: [">=1.2.0", "<2.0.0"]
config-version: 2

target-path: "target"
1 change: 1 addition & 0 deletions docs/catalog.json
@@ -0,0 +1 @@
{"metadata": {"dbt_schema_version": "https://schemas.getdbt.com/dbt/catalog/v1.json", "dbt_version": "1.7.4", "generated_at": "2024-01-03T06:03:44.780433Z", "invocation_id": "40a950e0-7023-41e7-bb8d-8483b6e8e8c4", "env": {}}, "nodes": {"seed.testgen_integration_tests.colnames_with_spaces": {"metadata": {"type": "BASE TABLE", "schema": "main_integration_test_data", "name": "colnames_with_spaces", "database": "integration_test_data", "comment": null, "owner": null}, "columns": {"first name": {"type": "VARCHAR", "index": 1, "name": "first name", "comment": null}, "age (years)": {"type": "INTEGER", "index": 2, "name": "age (years)", "comment": null}, "current city": {"type": "VARCHAR", "index": 3, "name": "current city", "comment": null}}, "stats": {"has_stats": {"id": "has_stats", "label": "Has Stats?", "value": false, "include": false, "description": "Indicates whether there are statistics for this table"}}, "unique_id": "seed.testgen_integration_tests.colnames_with_spaces"}, "seed.testgen_integration_tests.users": {"metadata": {"type": "BASE TABLE", "schema": "main_integration_test_data", "name": "users", "database": "integration_test_data", "comment": null, "owner": null}, "columns": {"user_id": {"type": "INTEGER", "index": 1, "name": "user_id", "comment": null}, "username": {"type": "VARCHAR", "index": 2, "name": "username", "comment": null}, "email": {"type": "VARCHAR", "index": 3, "name": "email", "comment": null}, "age": {"type": "INTEGER", "index": 4, "name": "age", "comment": null}, "user_status": {"type": "VARCHAR", "index": 5, "name": "user_status", "comment": null}}, "stats": {"has_stats": {"id": "has_stats", "label": "Has Stats?", "value": false, "include": false, "description": "Indicates whether there are statistics for this table"}}, "unique_id": "seed.testgen_integration_tests.users"}}, "sources": {}, "errors": null}
102 changes: 102 additions & 0 deletions docs/index.html

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/manifest.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/run_results.json

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions integration_tests/generate_docs.sh
@@ -0,0 +1,7 @@
dbt docs generate

mkdir -p ../docs
cp target/catalog.json ../docs
cp target/index.html ../docs
cp target/manifest.json ../docs
cp target/run_results.json ../docs
4 changes: 3 additions & 1 deletion integration_tests/packages.yml
@@ -1,4 +1,6 @@
packages:
- local: ../
- package: dbt-labs/dbt_utils
version: 1.1.1
version: 1.1.1
- package: calogica/dbt_expectations
version: 0.10.1
11 changes: 9 additions & 2 deletions integration_tests/profiles.yml
@@ -23,7 +23,14 @@ integration_tests:
user: postgres
password: postgres
port: 5432
dbname: postgres # or database instead of dbname
dbname: postgres
schema: public
mysql:
type: mysql
host: mysql
username: mysql
password: mysql
port: 3306
schema: public
snowflake:
type: snowflake
@@ -43,7 +50,7 @@ integration_tests:
type: redshift
host: dbt-testgen.117819748843.us-east-1.redshift-serverless.amazonaws.com
user: dbt_testgen
password: mw*gXe9JMvp!0v%E #"{{ env_var('REDSHIFT_PASSWORD') }}"
password: "{{ env_var('REDSHIFT_PASSWORD') }}"
dbname: dbt_testgen
schema: dbt_testgen
port: 5439
@@ -0,0 +1,32 @@


{% set actual_yaml = testgen.to_yaml(
testgen.get_string_length_test_suggestions(
ref('colnames_with_spaces'),
sample=true,
limit=100
)
)
%}

{% set expected_yaml %}
models:
- name: colnames_with_spaces
columns:
- name: first name
description: String length test generated by dbt-testgen
tests:
- dbt_expectations.expect_column_value_lengths_to_be_between:
min_value: 3
max_value: 5
row_condition: '"first name" is not null'
- name: current city
description: String length test generated by dbt-testgen
tests:
- dbt_expectations.expect_column_value_lengths_to_be_between:
min_value: 7
max_value: 13
row_condition: '"current city" is not null'
{% endset %}

{{ assert_equal (actual_yaml | trim, expected_yaml | trim) }}
@@ -0,0 +1,39 @@


{% set actual_yaml = testgen.to_yaml(
testgen.get_string_length_test_suggestions(
ref('users'),
sample=true,
limit=100
)
)
%}

{% set expected_yaml %}
models:
- name: users
columns:
- name: username
description: String length test generated by dbt-testgen
tests:
- dbt_expectations.expect_column_value_lengths_to_be_between:
min_value: 8
max_value: 15
row_condition: '"username" is not null'
- name: email
description: String length test generated by dbt-testgen
tests:
- dbt_expectations.expect_column_value_lengths_to_be_between:
min_value: 18
max_value: 25
row_condition: '"email" is not null'
- name: user_status
description: String length test generated by dbt-testgen
tests:
- dbt_expectations.expect_column_value_lengths_to_be_between:
min_value: 6
max_value: 8
row_condition: '"user_status" is not null'
{% endset %}

{{ assert_equal (actual_yaml | trim, expected_yaml | trim) }}
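The body of `get_string_length_test_suggestions` is not shown in this diff, so the following is only a minimal Python sketch of what the expected YAML above implies: take the shortest and longest non-null string in a column and emit them as `min_value` and `max_value`. The function name `suggest_string_length_test` and the in-memory list of values are assumptions made for illustration; the real macro computes its bounds in SQL against the warehouse.

```python
from typing import Optional


def suggest_string_length_test(column_name: str, values: list) -> Optional[dict]:
    """Hypothetical helper: build a column entry shaped like the YAML in the tests above."""
    # Ignore NULLs, mirroring the `is not null` row_condition in the expected output.
    lengths = [len(v) for v in values if v is not None]
    if not lengths:
        return None  # nothing to measure, so suggest no test

    return {
        "name": column_name,
        "description": "String length test generated by dbt-testgen",
        "tests": [
            {
                "dbt_expectations.expect_column_value_lengths_to_be_between": {
                    "min_value": min(lengths),
                    "max_value": max(lengths),
                    "row_condition": f'"{column_name}" is not null',
                }
            }
        ],
    }


# Sample values chosen to match the `user_status` column in the expected YAML.
suggestion = suggest_string_length_test("user_status", ["active", "inactive", None])
```

With the two status values used here, the sketch yields bounds of 6 and 8, the same as the `user_status` entry in the expected YAML.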
@@ -21,22 +21,34 @@ models:
min_value: 1
max_value: 30
- name: username
description: Uniqueness test generated by dbt-testgen
description: String length test generated by dbt-testgen
tests:
- unique
- not_null
- dbt_expectations.expect_column_value_lengths_to_be_between:
min_value: 8
max_value: 15
row_condition: '"username" is not null'
- name: email
description: Uniqueness test generated by dbt-testgen
description: String length test generated by dbt-testgen
tests:
- unique
- not_null
- dbt_expectations.expect_column_value_lengths_to_be_between:
min_value: 18
max_value: 25
row_condition: '"email" is not null'
- name: user_status
description: Accepted values test generated by dbt-testgen
description: String length test generated by dbt-testgen
tests:
- accepted_values:
values:
- active
- inactive
- dbt_expectations.expect_column_value_lengths_to_be_between:
min_value: 6
max_value: 8
row_condition: '"user_status" is not null'
- name: age
description: Numeric range test generated by dbt-testgen
tests:
File renamed without changes.
8 changes: 8 additions & 0 deletions macros/helpers/sql_functions.sql
@@ -0,0 +1,8 @@

{% macro get_random_function() %}
{{ return(adapter.dispatch('get_random_function', 'testgen')()) }}
{% endmacro %}

{% macro default__get_random_function(colname) %}
{{ return("RANDOM") }}
{% endmacro %}
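The `adapter.dispatch` call above picks an adapter-specific implementation of the macro when one exists and falls back to the `default__` version otherwise. As a rough Python sketch of that lookup (simplified: dbt's real search order also walks parent adapters and package search paths), assuming a hypothetical `mysql__get_random_function` override that is not part of this commit:

```python
def default__get_random_function() -> str:
    # Matches the default macro in this commit.
    return "RANDOM"


def mysql__get_random_function() -> str:
    # Hypothetical override for illustration only; MySQL names this function RAND.
    return "RAND"


# Registry keyed the way dbt names dispatched macros: "<adapter>__<macro_name>".
IMPLEMENTATIONS = {
    "default__get_random_function": default__get_random_function,
    "mysql__get_random_function": mysql__get_random_function,
}


def get_random_function(adapter_type: str) -> str:
    """Return the adapter-specific function name, or the default if none is registered."""
    key = f"{adapter_type}__get_random_function"
    implementation = IMPLEMENTATIONS.get(key, IMPLEMENTATIONS["default__get_random_function"])
    return implementation()
```

Under these assumptions, `get_random_function("postgres")` falls through to the default, while `get_random_function("mysql")` would use the override.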
52 changes: 52 additions & 0 deletions macros/schema.yml
@@ -0,0 +1,52 @@
version: 2

macros:
- name: get_test_suggestions
description: Generates YAML schema file that includes tests for your data
arguments:
- name: column_name
type: string
description: The name of the column you want to convert
- name: precision
type: integer
description: Number of decimal places. Defaults to 2.

- name: table_relation
type: Relation
description: |
The [dbt Relation](https://docs.getdbt.com/reference/dbt-classes#relation)
you wish to generate tests for.
Example: ref("mymodel")
- name: sample
type: bool
description: Take a random sample when using the `limit` argument
- name: limit
type: integer
description: Use only this number of records to generate tests.
- name: resource_type
type: string
description: The type of resource that `table_relation` is - 'models', 'seeds', or 'sources'
- name: column_config
type: dict
description: "Configurations to set on columns. Example - {'quote': true}"
- name: exclude_types
type: list
description: Column types to exclude from tests.
- name: exclude_cols
type: list
description: Columns to exclude from tests.
- name: tags
type: list
description: Tags to put on the tests.
- name: tests
type: list
description: "Types of tests to generate. Example: ['uniqueness', 'accepted_values', 'range']"
- name: composite_key_length
type: integer
description: Max length of the composite key for uniqueness tests.
- name: dbt_config
type: dict
description: Existing parsed DBT Schema file to add tests onto.
- name: return_object
type: bool
description: Return the DBT Schema file as a dict object instead of printing YAML.
16 changes: 15 additions & 1 deletion macros/test_aggregation/get_test_suggestions.sql
@@ -8,7 +8,7 @@
exclude_types = [],
exclude_cols = [],
tags = [],
tests = ["uniqueness", "accepted_values", "range"],
tests = ["uniqueness", "accepted_values", "range", "string_length"],
composite_key_length = 1,
dbt_config = None,
return_object = false
@@ -58,6 +58,20 @@
) %}
{% endif %}

{% if "string_length" in tests %}
{% set dbt_config = testgen.get_string_length_test_suggestions(
table_relation=table_relation,
sample=sample,
limit=limit,
resource_type=resource_type,
column_config=column_config,
exclude_types=exclude_types,
exclude_cols=exclude_cols,
tags=tags,
dbt_config=dbt_config
) %}
{% endif %}

{% if return_object %}
{{ return(dbt_config) }}
{% else %}
16 changes: 15 additions & 1 deletion macros/test_generation/get_accepted_values_test_suggestions.sql
@@ -79,14 +79,28 @@
testgen.array_agg(column.column) ~ " AS UNIQUE_VALUES
from (
select " ~ adapter.quote(column.column) ~ "
from " ~ table_relation ~ "
from base
group by " ~ adapter.quote(column.column) ~ "
) t1
"
) %}
{% endfor %}

{% if limit != None %}
{% if sample == true %}
{% set limit_stmt = "ORDER BY " ~ testgen.get_random_function() ~ "() LIMIT " ~ limit %}
{% else %}
{% set limit_stmt = "LIMIT " ~ limit %}
{% endif %}
{% else %}
{% set limit_stmt = "" %}
{% endif %}

{% set count_distinct_sql %}
WITH base AS (
SELECT * FROM {{ table_relation }}
{{ limit_stmt }}
)
SELECT * FROM (
{{ count_distinct_exprs | join("\nUNION ALL\n") }}
) t2
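The hunk above wraps the source table in a `base` CTE so that `limit` and `sample` apply once, before any per-column query runs. A minimal Python sketch of the same branching, run against an in-memory SQLite database, is shown below. The helper names `build_limit_stmt` and `build_base_cte` and the `users` table contents are assumptions for illustration, and SQLite stands in for the supported warehouses only because it also accepts `ORDER BY RANDOM() LIMIT n`.

```python
import sqlite3


def build_limit_stmt(limit=None, sample=False, random_function="RANDOM") -> str:
    """Mirror the Jinja branches: no limit, plain LIMIT, or random-order LIMIT."""
    if limit is None:
        return ""
    if sample:
        return f"ORDER BY {random_function}() LIMIT {limit}"
    return f"LIMIT {limit}"


def build_base_cte(table: str, limit=None, sample=False) -> str:
    """Wrap the table in a `base` CTE so later queries only see the limited rows."""
    return f"WITH base AS (SELECT * FROM {table} {build_limit_stmt(limit, sample)})"


# Illustrative data: ten rows in a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_status TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("active",)] * 6 + [("inactive",)] * 4,
)

# Downstream queries read from `base`, so they see at most `limit` sampled rows.
sql = build_base_cte("users", limit=5, sample=True) + " SELECT COUNT(*) FROM base"
row_count = conn.execute(sql).fetchone()[0]
conn.close()
```

One consequence of this design is that test suggestions derived from a sample describe only the sampled rows, so values absent from the sample will not appear in the generated `accepted_values` list.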