clarifies percentile in agg field for mesures (#5408)
Updates the description for `agg` in the top table by adding
percentile. Also adds additional info on the SQL engines supported for `agg_params`.

Resolves  #5398


Added table based on [this PR](dbt-labs/metricflow#395) provided by @tlento.
mirnawong1 authored May 2, 2024
2 parents 4d96040 + 4c10c9a commit b7b8772
Showing 4 changed files with 71 additions and 36 deletions.
51 changes: 31 additions & 20 deletions website/docs/docs/build/dimensions.md
@@ -10,7 +10,6 @@ Dimensions is a way to group or filter information based on categories or time.

In a data platform, dimensions are part of a larger structure called a semantic model. They're created along with other elements like [entities](/docs/build/entities) and [measures](/docs/build/measures), and used to add more details to your data that can't be easily added up or combined. In SQL, dimensions are typically included in the `group by` clause of your SQL query.


<!--dimensions are non-aggregatable expressions that define the level of aggregation for a metric used to define how data is sliced or grouped in a metric. Since groups can't be aggregated, they're considered to be a property of the primary or unique entity of the table.
Groups are defined within semantic models, alongside entities and measures, and correspond to non-aggregatable columns in your dbt model that provides categorical or time-based context. In SQL, dimensions is typically included in the GROUP BY clause.-->
@@ -20,7 +19,7 @@ All dimensions require a `name`, `type` and in some cases, an `expr` parameter.
| Parameter | Description | Type |
| --------- | ----------- | ---- |
| `name` | Refers to the name of the group that will be visible to the user in downstream tools. It can also serve as an alias if the column name or SQL query reference is different and provided in the `expr` parameter. <br /><br /> Dimension names should be unique within a semantic model, but they can be non-unique across different models as MetricFlow uses [joins](/docs/build/join-logic) to identify the right dimension. | Required |
| `type` | Specifies the type of group created in the semantic model. There are three types:<br /><br />- **Categorical**: Group rows in a table by categories like geography, color, and so on. <br />- **Time**: Point to a date field in the data platform. Must be of type TIMESTAMP or equivalent in the data platform engine. <br />- **Slowly-changing dimensions**: Analyze metrics over time and slice them by groups that change over time, like sales trends by a customer's country. | Required |
| `type` | Specifies the type of group created in the semantic model. There are two types:<br /><br />- **Categorical**: Group rows in a table by categories like geography, color, and so on. <br />- **Time**: Point to a date field in the data platform. Must be of type TIMESTAMP or equivalent in the data platform engine. <br /> - You can also use time dimensions to specify time spans for [slowly changing dimensions](/docs/build/dimensions#scd-type-ii) tables. | Required |
| `type_params` | Type-specific parameters, such as whether the time dimension is primary or used as a partition. | Required |
| `description` | A clear description of the dimension. | Optional |
| `expr` | Defines the underlying column or SQL query for a dimension. If no `expr` is specified, MetricFlow will use the column with the same name as the group. You can provide either a column name or a SQL expression. | Optional |
@@ -55,7 +54,7 @@ semantic_models:
...
# --- dimensions ---
dimensions:
- name: metric_time
- name: order_date
type: time
label: "Date of transaction" # Recommend adding a label to define the value displayed in downstream tools
expr: date_trunc('day', ts)
@@ -84,11 +83,11 @@ semantic_model:
This section further explains the dimension definitions, along with examples. Dimensions have the following types:

- [Dimensions types](#dimensions-types)
- [Categorical](#categorical)
- [Time](#time)
- [Categorical](#categorical)
- [Time](#time)
- [SCD Type II](#scd-type-ii)

### Categorical
## Categorical

Categorical dimensions are used to group metrics by different categories such as product type, color, or geographical area. They can refer to existing columns in your dbt model or be calculated using a SQL expression with the `expr` parameter. An example of a categorical dimension is `is_bulk_transaction`, which is a group created by applying a case statement to the underlying column `quantity`. This allows users to group or filter the data based on bulk transactions.
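
As a rough illustration of how a categorical dimension like `is_bulk_transaction` ends up in the `group by` clause, here is a hand-written SQL sketch. The `transactions` table name and the `transaction_count` column are assumptions made for illustration, and this is not necessarily the query MetricFlow generates.

```sql
-- Hypothetical table name `transactions`; `quantity` is the underlying column.
select
    case when quantity > 10 then true else false end as is_bulk_transaction,
    count(*) as transaction_count  -- illustrative aggregate, not a documented measure
from transactions
group by 1  -- group by the derived categorical dimension
```

Because the dimension is an expression rather than a stored column, the grouping happens on the computed true/false value, which is what the `expr` parameter allows.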

@@ -99,7 +98,7 @@ dimensions:
expr: case when quantity > 10 then true else false end
```

### Time
## Time

:::tip use datetime data type if using BigQuery
To use BigQuery as your data platform, time dimension columns need to be in the datetime data type. If they are stored in another type, you can cast them to datetime using the `expr` property. Time dimensions are used to group metrics by different levels of time, such as day, week, month, quarter, and year. MetricFlow supports these granularities, which can be specified using the `time_granularity` parameter.
@@ -144,14 +143,14 @@ dimensions:
- name: created_at
type: time
label: "Date of creation"
expr: date_trunc('day', ts_created) #ts_created is the underlying column name from the table
expr: date_trunc('day', ts_created) # ts_created is the underlying column name from the table
is_partition: True
type_params:
time_granularity: day
- name: deleted_at
type: time
label: "Date of deletion"
expr: date_trunc('day', ts_deleted) #ts_deleted is the underlying column name from the table
expr: date_trunc('day', ts_deleted) # ts_deleted is the underlying column name from the table
is_partition: True
type_params:
time_granularity: day
@@ -181,14 +180,14 @@ dimensions:
- name: created_at
type: time
label: "Date of creation"
expr: date_trunc('day', ts_created) #ts_created is the underlying column name from the table
expr: date_trunc('day', ts_created) # ts_created is the underlying column name from the table
is_partition: True
type_params:
time_granularity: day
- name: deleted_at
type: time
label: "Date of deletion"
expr: date_trunc('day', ts_deleted) #ts_deleted is the underlying column name from the table
expr: date_trunc('day', ts_deleted) # ts_deleted is the underlying column name from the table
is_partition: True
type_params:
time_granularity: day
@@ -207,30 +206,41 @@ measures:

</Tabs>

### SCD Type II
### SCD Type II

:::caution
Currently, there are limitations in supporting SCD's.
:::caution
Currently, there are limitations in supporting SCD's.
:::

MetricFlow supports joins against dimensions values in a semantic model built on top of an SCD Type II table (slowly changing dimension) Type II table. This is useful when you need a particular metric sliced by a group that changes over time, such as the historical trends of sales by a customer's country.
MetricFlow supports joins against dimension values in a semantic model built on top of a slowly changing dimension (SCD) Type II table. This is useful when you need a particular metric organized by a group that changes over time, such as the historical trends of sales by a customer's country.


**Basic structure**

As their name suggests SCD Type II are groups that change values at a coarser time granularity. This results in a range of valid rows with different dimensions values for a given metric or measure. MetricFlow associates the metric with the first (minimum) available dimensions value within a coarser time window, such as month. By default, MetricFlow uses the group that is valid at the beginning of the time granularity.
SCD Type II dimensions are groups whose values change at a coarser time granularity. This results in a range of valid rows with different dimension values for a given metric or measure. MetricFlow associates the metric with the first (minimum) available dimension value within a coarser time window, such as a month. By default, MetricFlow uses the group that is valid at the beginning of the time granularity.

The following basic structure of an SCD Type II data platform table is supported:

| entity_key | dimensions_1 | dimensions_2 | ... | dimensions_x | valid_from | valid_to |
|------------|-------------|-------------|-----|-------------|------------|----------|

* `entity_key` (required): An entity_key (or some sort of identifier) must be present
* `entity_key` (required): An entity_key (or some sort of identifier) must be present.
* `valid_from` (required): A timestamp indicating the start of a changing dimension value must be present.
* `valid_to` (required): A timestamp indicating the end of a changing dimension value must be present.

**Note**: The SCD dimensions table must have `valid_to` and `valid_from` columns.
**Implementation**

This is an example of SQL code that shows how a sample metric called `num_events` is joined with versioned dimensions data (stored in a table called `scd_dimensions`) using a primary key made up of the `entity_key` and `timestamp` columns.
Here are some guidelines to follow when implementing SCD Type II tables:

- The SCD semantic model must have `valid_to` and `valid_from` time dimensions, which are logical constructs.
- The `valid_from` and `valid_to` properties must be specified exactly once per SCD semantic model.
- The `valid_from` and `valid_to` properties shouldn't be used or specified on the same time dimension.
- The `valid_from` and `valid_to` time dimensions must cover a non-overlapping period where one row matches each natural key value (meaning they must not overlap and should be distinct).
- We recommend defining the underlying dbt model with [dbt snapshots](/docs/build/snapshots). This supports the SCD Type II table layout and ensures that the table is updated with the latest data.
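
One way to sanity-check the non-overlap guideline above is a self-join over the SCD table. The query below is a minimal sketch, not part of MetricFlow: it reuses the `scd_dimensions`, `entity_key`, `valid_from`, and `valid_to` names from this section and assumes half-open validity windows, where one row's `valid_to` may equal the next row's `valid_from`.

```sql
-- Returns pairs of rows for the same entity whose validity windows overlap.
-- An empty result means no overlaps were found under the stated assumptions.
-- Not handled here: rows that share the same valid_from, and rows with a null valid_to.
select
    a.entity_key,
    a.valid_from as first_valid_from,
    a.valid_to as first_valid_to,
    b.valid_from as second_valid_from,
    b.valid_to as second_valid_to
from scd_dimensions a
join scd_dimensions b
    on a.entity_key = b.entity_key
    and a.valid_from < b.valid_from  -- compare each pair of rows once
    and b.valid_from < a.valid_to    -- the later row starts before the earlier one ends
```

A check like this could run as a test on the underlying model, so that overlapping windows are caught before they affect joins.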


This is an example of SQL code that shows how a sample metric called `num_events` is joined with versioned dimensions data (stored in a table called `scd_dimensions`) using a primary key made up of the `entity_key` and `timestamp` columns.

```sql
select metric_time, dimensions_1, sum(1) as num_events
from events a
@@ -242,6 +252,8 @@ on
group by 1, 2
```

**SCD example**

<Tabs>

<TabItem value="example" label="SCD table example 1">
@@ -256,7 +268,6 @@ This example shows how to create slowly changing dimensions (SCD) using a semant
| 333 | 2 | 2020-08-19 | 2021-10-22|
| 333 | 3 | 2021-10-22 | 2048-01-01|


The `validity_params` include two important arguments &mdash; `is_start` and `is_end`. These specify the columns in the SCD table that mark the start and end dates (or timestamps) for each tier or dimension. Additionally, the entity is tagged as `natural` to differentiate it from a `primary` entity. In a `primary` entity, each entity value has one row. In contrast, a `natural` entity has one row for each combination of entity value and its validity period.

```yaml
Expand Down
36 changes: 30 additions & 6 deletions website/docs/docs/build/measures.md
@@ -39,7 +39,7 @@ Measure names must be unique across all semantic models in a project and can not

The description describes the calculated measure. It's strongly recommended you create verbose and human-readable descriptions in this field.

### Aggregation
### Aggregation

The aggregation determines how the field will be aggregated. For example, a `sum` aggregation type over a granularity of `day` would sum the values across a given day.
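
To make the `sum`-over-`day` case concrete, the following hand-written SQL sketch shows roughly what such an aggregation computes. The `transactions` table name is an assumption; `ts` and `transaction_amount_usd` follow the column names used in the examples on this page. The SQL that MetricFlow actually generates may differ.

```sql
-- Hypothetical table name `transactions`.
select
    date_trunc('day', ts) as metric_time,                        -- day granularity
    sum(transaction_amount_usd) as total_transaction_amount_usd  -- `sum` aggregation
from transactions
group by 1
```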

@@ -54,8 +54,32 @@ Supported aggregations include:
| sum_boolean | A sum for a boolean type |
| count_distinct | Distinct count of values |
| median | Median (p50) calculation across the values |
| percentile | Percentile calculation across the values |
| percentile | Percentile calculation across the values. |

#### Percentile aggregation example
If you're using the `percentile` aggregation, you must use the `agg_params` field to specify details for the percentile aggregation (such as what percentile to calculate and whether to use discrete or continuous calculations).

```yaml
name: p99_transaction_value
description: The 99th percentile transaction value
expr: transaction_amount_usd
agg: percentile
agg_params:
percentile: .99
use_discrete_percentile: False # False calculates the continuous percentile, True calculates the discrete percentile.
```
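
To illustrate the difference that `use_discrete_percentile` controls, here is a small sketch using the standard ordered-set aggregates `percentile_cont` and `percentile_disc`, written in Postgres-style SQL. The inline sample values are made up, and the exact SQL MetricFlow emits depends on the engine, so treat this as an illustration of the two calculation modes only.

```sql
-- Hypothetical sample data, not taken from the dbt docs.
with transactions (transaction_amount_usd) as (
    values (10.0), (20.0), (30.0), (40.0)
)
select
    -- Continuous: interpolates between neighboring values (25 for this data).
    percentile_cont(0.5) within group (order by transaction_amount_usd) as p50_continuous,
    -- Discrete: returns a value that actually occurs in the data (20 for this data).
    percentile_disc(0.5) within group (order by transaction_amount_usd) as p50_discrete
from transactions
```

With `use_discrete_percentile: False`, the measure corresponds to the continuous calculation. With `True`, it corresponds to the discrete one.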

#### Percentile across supported engine types
The following table lists which SQL engines support continuous, discrete, approximate continuous, and approximate discrete percentiles.

| Engine | Continuous | Discrete | Approximate continuous | Approximate discrete |
| -- | -- | -- | -- | -- |
| Snowflake | [Yes](https://docs.snowflake.com/en/sql-reference/functions/percentile_cont.html) | [Yes](https://docs.snowflake.com/en/sql-reference/functions/percentile_disc.html) | [Yes](https://docs.snowflake.com/en/sql-reference/functions/approx_percentile.html) (t-digest) | No |
| BigQuery | No (window) | No (window) | [Yes](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#approx_quantiles) | No |
| Databricks | [Yes](https://docs.databricks.com/sql/language-manual/functions/percentile_cont.html) | [No](https://docs.databricks.com/sql/language-manual/functions/percentile_disc.html) | No | [Yes](https://docs.databricks.com/sql/language-manual/functions/approx_percentile.html) |
| Redshift | [Yes](https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html) | No (window) | No | [Yes](https://docs.aws.amazon.com/redshift/latest/dg/r_APPROXIMATE_PERCENTILE_DISC.html) |
| [Postgres](https://www.postgresql.org/docs/9.4/functions-aggregate.html) | Yes | Yes | No | No |
| [DuckDB](https://duckdb.org/docs/sql/aggregates.html) | Yes | Yes | Yes (t-digest) | No |

### Expr

@@ -121,7 +145,7 @@ semantic_models:
description: The average value of transactions
expr: transaction_amount_usd
agg: average
- name: transactions_amount_usd_valid #Notice here how we use expr to compute the aggregation based on a condition
- name: transactions_amount_usd_valid # Notice here how we use expr to compute the aggregation based on a condition
description: The total USD value of valid transactions only
expr: CASE WHEN is_valid = True then transaction_amount_usd else 0 end
agg: sum
@@ -135,7 +159,7 @@ semantic_models:
agg: percentile
agg_params:
percentile: .99
use_discrete_percentile: False #False will calculate the discrete percentile and True will calculate the continuous percentile
use_discrete_percentile: False # False calculates the continuous percentile, True calculates the discrete percentile.
- name: median_transaction_value
description: The median transaction value
expr: transaction_amount_usd
@@ -145,7 +169,7 @@ semantic_models:
dimensions:
- name: metric_time
type: time
expr: date_trunc('day', ts) #expr refers to underlying column ts
expr: date_trunc('day', ts) # expr refers to underlying column ts
type_params:
time_granularity: day
- name: is_bulk_transaction
@@ -161,7 +185,7 @@ Some measures cannot be aggregated over certain dimensions, like time, because i
To demonstrate the configuration for non-additive measures, consider a subscription table that includes one row per date of the registered user, the user's active subscription plan(s), and the plan's subscription value (revenue) with the following columns:

- `date_transaction`: The daily date-spine.
- `user_id`: The ID pertaining to the registered user.
- `user_id`: The ID of the registered user.
- `subscription_plan`: A column to indicate the subscription plan ID.
- `subscription_value`: A column to indicate the monthly subscription value (revenue) of a particular subscription plan ID.

16 changes: 8 additions & 8 deletions website/docs/docs/build/metrics-overview.md
@@ -185,19 +185,19 @@ metrics:
type_params:
numerator: cancellations
denominator: transaction_amount
filter: |
{{ Dimension('customer__country') }} = 'MX'
filter: |
{{ Dimension('customer__country') }} = 'MX'
- name: enterprise_cancellation_rate
owners:
- [email protected]
type: ratio
type_params:
numerator:
name: cancellations
filter: {{ Dimension('company__tier' )}} = 'enterprise'
filter: {{ Dimension('company__tier') }} = 'enterprise'
denominator: transaction_amount
filter: |
{{ Dimension('customer__country') }} = 'MX'
filter: |
{{ Dimension('customer__country') }} = 'MX'
```

### Simple metrics
@@ -218,9 +218,9 @@ metrics:
measure:
name: cancellations_usd # Specify the measure you are creating a proxy for.
fill_nulls_with: 0
filter: |
{{ Dimension('order__value')}} > 100 and {{Dimension('user__acquisition')}}
join_to_timespine: true
filter: |
{{ Dimension('order__value')}} > 100 and {{Dimension('user__acquisition')}} is not null
join_to_timespine: true
```

## Filters
4 changes: 2 additions & 2 deletions website/snippets/_sl-measures-parameters.md
@@ -2,9 +2,9 @@
| --- | --- | --- |
| [`name`](/docs/build/measures#name) | Provide a name for the measure, which must be unique and can't be repeated across all semantic models in your dbt project. | Required |
| [`description`](/docs/build/measures#description) | Describes the calculated measure. | Optional |
| [`agg`](/docs/build/measures#description) | dbt supports the following aggregations: `sum`, `max`, `min`, `avg`, `median`, `count_distinct`, and `sum_boolean`. | Required |
| [`agg`](/docs/build/measures#aggregation) | dbt supports the following aggregations: `sum`, `max`, `min`, `avg`, `median`, `count_distinct`, `percentile`, and `sum_boolean`. | Required |
| [`expr`](/docs/build/measures#expr) | Either reference an existing column in the table or use a SQL expression to create or derive a new one. | Optional |
| [`non_additive_dimension`](/docs/build/measures#non-additive-dimensions) | Non-additive dimensions can be specified for measures that cannot be aggregated over certain dimensions, such as bank account balances, to avoid producing incorrect results. | Optional |
| `agg_params` | Specific aggregation properties such as a percentile. | Optional |
| `agg_params` | Specific aggregation properties, such as a percentile. | Optional |
| `agg_time_dimension` | The time field. Defaults to the default agg time dimension for the semantic model. Available in dbt version 1.6 or higher. | Optional |
| `create_metric` | Create a `simple` metric from a measure by setting `create_metric: True`. Specify its display name with `create_metric_display_name`. Available in dbt version 1.7 or higher. | Optional |
