clarifies percentile in agg field for measures (#5408)

Updates the description for the agg in the top table by adding percentile. Also adds additional info on supported SQL engines for agg_params. Resolves #5398. Added table based on [PR here](dbt-labs/metricflow#395), provided by @tlento.
dbt-labs · May 2, 2024 · b7b8772
2 parents 4d96040 + 4c10c9a

Showing 4 changed files with 71 additions and 36 deletions.
diff --git a/website/docs/docs/build/dimensions.md b/website/docs/docs/build/dimensions.md
@@ -10,7 +10,6 @@ Dimensions is a way to group or filter information based on categories or time.
 
 In a data platform, dimensions is part of a larger structure called a semantic model. It's created along with other elements like [entities](/docs/build/entities) and [measures](/docs/build/measures), and used to add more details to your data that can't be easily added up or combined.  In SQL, dimensions is typically included in the `group by` clause of your SQL query.
 
-
 <!--dimensions are non-aggregatable expressions that define the level of aggregation for a metric used to define how data is sliced or grouped in a metric. Since groups can't be aggregated, they're considered to be a property of the primary or unique entity of the table.
 
 Groups are defined within semantic models, alongside entities and measures, and correspond to non-aggregatable columns in your dbt model that provides categorical or time-based context. In SQL, dimensions  is typically included in the GROUP BY clause.-->
@@ -20,7 +19,7 @@ All dimensions require a `name`, `type` and in some cases, an `expr` parameter.
 | Parameter | Description | Type |
 | --------- | ----------- | ---- |
 | `name` |  Refers to the name of the group that will be visible to the user in downstream tools. It can also serve as an alias if the column name or SQL query reference is different and provided in the `expr` parameter. <br /><br /> Dimension names should be unique within a semantic model, but they can be non-unique across different models as MetricFlow uses [joins](/docs/build/join-logic) to identify the right dimension. | Required |
-| `type` | Specifies the type of group created in the semantic model. There are three types:<br /><br />- **Categorical**: Group rows in a table by categories like geography, color, and so on. <br />- **Time**: Point to a date field in the data platform. Must be of type TIMESTAMP or equivalent in the data platform engine. <br />- **Slowly-changing dimensions**: Analyze metrics over time and slice them by groups that change over time, like sales trends by a customer's country. | Required |
+| `type` | Specifies the type of group created in the semantic model. There are two types:<br /><br />- **Categorical**: Group rows in a table by categories like geography, color, and so on. <br />- **Time**: Point to a date field in the data platform. Must be of type TIMESTAMP or equivalent in the data platform engine. <br />      - You can also use time dimensions to specify time spans for [slowly changing dimensions](/docs/build/dimensions#scd-type-ii) tables. | Required |
 | `type_params` | Specific type params such as if the time is primary or used as a partition | Required |
 | `description` | A clear description of the dimension | Optional |
 | `expr` | Defines the underlying column or SQL query for a dimension. If no `expr` is specified, MetricFlow will use the column with the same name as the group. You can use column name itself to input a SQL expression. | Optional |
@@ -55,7 +54,7 @@ semantic_models:
       ... 
 # --- dimensions ---
   dimensions:
-    - name: metric_time
+    - name: order_date
       type: time
       label: "Date of transaction" # Recommend adding a label to define the value displayed in downstream tools
       expr: date_trunc('day', ts)
@@ -84,11 +83,11 @@ semantic_model:
 This section further explains the dimension definitions, along with examples. Dimensions have the following types:
 
 - [Dimensions types](#dimensions-types)
-  - [Categorical](#categorical)
-  - [Time](#time)
+- [Categorical](#categorical)
+- [Time](#time)
   - [SCD Type II](#scd-type-ii)
 
-### Categorical
+## Categorical
 
 Categorical is used to group metrics by different categories such as product type, color, or geographical area. They can refer to existing columns in your dbt model or be calculated using a SQL expression with the `expr` parameter. An example of a category dimension is `is_bulk_transaction`, which is a group created by applying a case statement to the underlying column `quantity`. This allows users to group or filter the data based on bulk transactions.
 
@@ -99,7 +98,7 @@ dimensions:
     expr: case when quantity > 10 then true else false end
 ```
 
-### Time
+## Time
 
 :::tip use datetime data type if using BigQuery
 To use BigQuery as your data platform, time dimensions columns need to be in the datetime data type. If they are stored in another type, you can cast them to datetime using the `expr` property. Time dimensions are used to group metrics by different levels of time, such as day, week, month, quarter, and year. MetricFlow supports these granularities, which can be specified using the `time_granularity` parameter.
@@ -144,14 +143,14 @@ dimensions:
   - name: created_at
     type: time
     label: "Date of creation"
-    expr: date_trunc('day', ts_created) #ts_created is the underlying column name from the table 
+    expr: date_trunc('day', ts_created) # ts_created is the underlying column name from the table 
     is_partition: True 
     type_params:
       time_granularity: day
   - name: deleted_at
     type: time
     label: "Date of deletion"
-    expr: date_trunc('day', ts_deleted) #ts_deleted is the underlying column name from the table 
+    expr: date_trunc('day', ts_deleted) # ts_deleted is the underlying column name from the table 
     is_partition: True 
     type_params:
       time_granularity: day
@@ -181,14 +180,14 @@ dimensions:
   - name: created_at
     type: time
     label: "Date of creation"
-    expr: date_trunc('day', ts_created) #ts_created is the underlying column name from the table 
+    expr: date_trunc('day', ts_created) # ts_created is the underlying column name from the table 
     is_partition: True 
     type_params:
       time_granularity: day
   - name: deleted_at
     type: time
     label: "Date of deletion"
-    expr: date_trunc('day', ts_deleted) #ts_deleted is the underlying column name from the table 
+    expr: date_trunc('day', ts_deleted) # ts_deleted is the underlying column name from the table 
     is_partition: True 
     type_params:
       time_granularity: day
@@ -207,30 +206,41 @@ measures:
 
 </Tabs>
 
-### SCD Type II 
+### SCD Type II
 
-:::caution 
-Currently, there are limitations in supporting SCD's. 
+:::caution
+Currently, there are limitations in supporting SCD's.
 :::
 
-MetricFlow supports joins against dimensions values in a semantic model built on top of an SCD Type II table (slowly changing dimension) Type II table. This is useful when you need a particular metric sliced by a group that changes over time, such as the historical trends of sales by a customer's country. 
+MetricFlow supports joins against dimension values in a semantic model built on top of a slowly changing dimension (SCD) Type II table. This is useful when you need a particular metric organized by a group that changes over time, such as the historical trends of sales by a customer's country.
+
+
+**Basic structure**
 
-As their name suggests SCD Type II are groups that change values at a coarser time granularity. This results in a range of valid rows with different dimensions values for a given metric or measure. MetricFlow associates the metric with the first (minimum) available dimensions value within a coarser time window, such as month. By default, MetricFlow uses the group that is valid at the beginning of the time granularity.
+SCD Type II are groups that change values at a coarser time granularity. This results in a range of valid rows with different dimensions values for a given metric or measure. MetricFlow associates the metric with the first (minimum) available dimensions value within a coarser time window, such as month. By default, MetricFlow uses the group that is valid at the beginning of the time granularity.
 
 The following basic structure of an SCD Type II data platform table is supported:
 
 | entity_key | dimensions_1 | dimensions_2 | ... | dimensions_x | valid_from | valid_to |
 |------------|-------------|-------------|-----|-------------|------------|----------|  
 
-* `entity_key` (required): An entity_key (or some sort of identifier) must be present 
+* `entity_key` (required): An entity_key (or some sort of identifier) must be present.
 * `valid_from` (required): A timestamp indicating the start of a changing dimensions value must be present
 * `valid_to` (required): A timestamp indicating the end of a changing dimensions value must be present
 
-**Note**: The SCD dimensions table must have `valid_to` and `valid_from` columns.
+**Implementation**
 
-This is an example of SQL code that shows how a sample metric called `num_events` is joined with versioned dimensions data (stored in a table called `scd_dimensions`) using a primary key made up of the `entity_key` and `timestamp` columns. 
+Here are some guidelines to follow when implementing SCD Type II tables:
+
+- The SCD semantic model must have `valid_to` and `valid_from` time dimensions, which are logical constructs.
+- The `valid_from` and `valid_to` properties must be specified exactly once per SCD semantic model.
+- The `valid_from` and `valid_to` properties shouldn't be used or specified on the same time dimension.
+- The `valid_from` and `valid_to` time dimensions must cover a non-overlapping period where one row matches each natural key value (meaning they must not overlap and should be distinct).
+- We recommend defining the underlying dbt model with [dbt snapshots](/docs/build/snapshots). This supports the SCD Type II table layout and ensures that the table is updated with the latest data.
 
 
+This is an example of SQL code that shows how a sample metric called `num_events` is joined with versioned dimensions data (stored in a table called `scd_dimensions`) using a primary key made up of the `entity_key` and `timestamp` columns. 
+
 ```sql
 select metric_time, dimensions_1, sum(1) as num_events
 from events a
@@ -242,6 +252,8 @@ on
 group by 1, 2
 ```
 
+**SCD example**
+
 <Tabs>
 
 <TabItem value="example" label="SCD table example 1">
@@ -256,7 +268,6 @@ This example shows how to create slowly changing dimensions (SCD) using a semant
 | 333             | 2    | 2020-08-19 | 2021-10-22| 
 | 333             | 3    | 2021-10-22 | 2048-01-01|  
 
-
 The `validity_params` include two important arguments &mdash; `is_start` and `is_end`. These specify the columns in the SCD table that mark the start and end dates (or timestamps) for each tier or dimension. Additionally, the entity is tagged as `natural` to differentiate it from a `primary` entity. In a `primary` entity, each entity value has one row. In contrast, a `natural` entity has one row for each combination of entity value and its validity period.
 
 ```yaml 

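The SCD Type II guidelines added in this file (non-overlapping `valid_from`/`valid_to` windows, one matching row per natural key value) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not how MetricFlow implements the join: the helper names `has_overlaps` and `value_at` are hypothetical, the column layout `(entity_key, tier, valid_from, valid_to)` is assumed, and the window is treated as inclusive of `valid_from` and exclusive of `valid_to`. The two rows reuse the sample values shown in the table above.

```python
from datetime import date

# Hypothetical SCD Type II rows: (entity_key, tier, valid_from, valid_to).
# Assumption: valid_from is inclusive and valid_to is exclusive.
SCD_ROWS = [
    (333, 2, date(2020, 8, 19), date(2021, 10, 22)),
    (333, 3, date(2021, 10, 22), date(2048, 1, 1)),
]


def has_overlaps(rows):
    """Return True if any two rows for the same entity have overlapping windows."""
    by_entity = {}
    for row in rows:
        by_entity.setdefault(row[0], []).append(row)
    for entity_rows in by_entity.values():
        # After sorting by valid_from, any overlap shows up between neighbours.
        ordered = sorted(entity_rows, key=lambda r: r[2])
        for prev, curr in zip(ordered, ordered[1:]):
            if curr[2] < prev[3]:
                return True
    return False


def value_at(rows, entity_key, when):
    """Return the dimension value valid for entity_key at `when`, or None."""
    for key, value, valid_from, valid_to in rows:
        if key == entity_key and valid_from <= when < valid_to:
            return value
    return None
```

With non-overlapping windows, each event date resolves to at most one row, which is the property the guidelines ask for. For example, `value_at(SCD_ROWS, 333, date(2021, 10, 22))` picks the row whose window starts on that date.
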
diff --git a/website/docs/docs/build/measures.md b/website/docs/docs/build/measures.md
@@ -39,7 +39,7 @@ Measure names must be unique across all semantic models in a project and can not
 
 The description describes the calculated measure. It's strongly recommended you create verbose and human-readable descriptions in this field.
 
-### Aggregation 
+### Aggregation
 
 The aggregation determines how the field will be aggregated. For example, a `sum` aggregation type over a granularity of `day` would sum the values across a given day.
 
@@ -54,8 +54,32 @@ Supported aggregations include:
 | sum_boolean       | A sum for a boolean type |
 | count_distinct    | Distinct count of values |
 | median           | Median (p50) calculation across the values |
-| percentile        | Percentile calculation across the values  |
+| percentile        | Percentile calculation across the values. |
 
+#### Percentile aggregation example
+If you're using the `percentile` aggregation, you must use the `agg_params` field to specify details for the percentile aggregation (such as what percentile to calculate and whether to use discrete or continuous calculations).
+
+```yaml
+name: p99_transaction_value
+description: The 99th percentile transaction value
+expr: transaction_amount_usd
+agg: percentile
+agg_params:
+  percentile: .99
+  use_discrete_percentile: False  # False calculates the continuous percentile, True calculates the discrete percentile.
+```
+
+#### Percentile across supported engine types
+The following table lists which SQL engines support continuous, discrete, approximate continuous, and approximate discrete percentiles.
+
+|  | Cont. | Disc. | Approx. cont | Approx. disc |
+| -- | -- | -- | -- | -- |
+| Snowflake | [Yes](https://docs.snowflake.com/en/sql-reference/functions/percentile_cont.html) | [Yes](https://docs.snowflake.com/en/sql-reference/functions/percentile_disc.html) | [Yes](https://docs.snowflake.com/en/sql-reference/functions/approx_percentile.html) (t-digest) | No |
+| BigQuery | No (window) | No (window) | [Yes](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#approx_quantiles) | No |
+| Databricks | [Yes](https://docs.databricks.com/sql/language-manual/functions/percentile_cont.html) | [No](https://docs.databricks.com/sql/language-manual/functions/percentile_disc.html) | No | [Yes](https://docs.databricks.com/sql/language-manual/functions/approx_percentile.html) |
+| Redshift | [Yes](https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html) | No (window) | No | [Yes](https://docs.aws.amazon.com/redshift/latest/dg/r_APPROXIMATE_PERCENTILE_DISC.html) |
+| [Postgres](https://www.postgresql.org/docs/9.4/functions-aggregate.html) | Yes | Yes | No | No |
+| [DuckDB](https://duckdb.org/docs/sql/aggregates.html) | Yes | Yes | Yes (t-digest) | No |
 
 ### Expr
 
@@ -121,7 +145,7 @@ semantic_models:
         description: The average value of transactions 
         expr: transaction_amount_usd
         agg: average 
-      - name: transactions_amount_usd_valid #Notice here how we use expr to compute the aggregation based on a condition
+      - name: transactions_amount_usd_valid # Notice here how we use expr to compute the aggregation based on a condition
         description: The total USD value of valid transactions only
         expr: CASE WHEN is_valid = True then 1 else 0 end 
         agg: sum
@@ -135,7 +159,7 @@ semantic_models:
         agg: percentile
         agg_params:
           percentile: .99
-          use_discrete_percentile: False #False will calculate the discrete percentile and True will calculate the continuous percentile
+          use_discrete_percentile: False # False calculates the continuous percentile, True calculates the discrete percentile.
       - name: median_transaction_value
         description: The median transaction value
         expr: transaction_amount_usd
@@ -145,7 +169,7 @@ semantic_models:
     dimensions:
       - name: metric_time
         type: time
-        expr: date_trunc('day', ts) #expr refers to underlying column ts
+        expr: date_trunc('day', ts) # expr refers to underlying column ts
         type_params:
           time_granularity: day
       - name: is_bulk_transaction
@@ -161,7 +185,7 @@ Some measures cannot be aggregated over certain dimensions, like time, because i
 To demonstrate the configuration for non-additive measures, consider a subscription table that includes one row per date of the registered user, the user's active subscription plan(s), and the plan's subscription value (revenue) with the following columns:
 
 - `date_transaction`: The daily date-spine.
-- `user_id`: The ID pertaining to the registered user.
+- `user_id`: The ID of the registered user.
 - `subscription_plan`: A column to indicate the subscription plan ID.
 - `subscription_value`: A column to indicate the monthly subscription value (revenue) of a particular subscription plan ID.
 

diff --git a/website/docs/docs/build/metrics-overview.md b/website/docs/docs/build/metrics-overview.md
@@ -185,19 +185,19 @@ metrics:
     type_params:
       numerator: cancellations
       denominator: transaction_amount
-      filter: |   
-        {{ Dimension('customer__country') }} = 'MX'
+    filter: |   
+      {{ Dimension('customer__country') }} = 'MX'
   - name: enterprise_cancellation_rate
     owners:
       - [email protected]
     type: ratio
     type_params:
       numerator:
         name: cancellations
-        filter: {{ Dimension('company__tier' )}} = 'enterprise'  
+        filter: {{ Dimension('company__tier') }} = 'enterprise'  
       denominator: transaction_amount
-      filter: | 
-        {{ Dimension('customer__country') }} = 'MX'  
+    filter: | 
+      {{ Dimension('customer__country') }} = 'MX' 
 ```
 
 ### Simple metrics
@@ -218,9 +218,9 @@ metrics:
       measure:
         name: cancellations_usd  # Specify the measure you are creating a proxy for.
         fill_nulls_with: 0
-        filter: |
-        {{ Dimension('order__value')}} > 100 and {{Dimension('user__acquisition')}}
-        join_to_timespine: true
+    filter: |
+      {{ Dimension('order__value')}} > 100 and {{Dimension('user__acquisition')}} is not null
+    join_to_timespine: true
 ```
 
 ## Filters

diff --git a/website/snippets/_sl-measures-parameters.md b/website/snippets/_sl-measures-parameters.md
@@ -2,9 +2,9 @@
 | --- | --- | --- | 
 | [`name`](/docs/build/measures#name) | Provide a name for the measure, which must be unique and can't be repeated across all semantic models in your dbt project. | Required | 
 | [`description`](/docs/build/measures#description) | Describes the calculated measure. | Optional | 
-| [`agg`](/docs/build/measures#description) | dbt supports the following aggregations: `sum`, `max`, `min`, `avg`, `median`, `count_distinct`, and `sum_boolean`. | Required | 
+| [`agg`](/docs/build/measures#aggregation) | dbt supports the following aggregations: `sum`, `max`, `min`, `avg`, `median`, `count_distinct`, `percentile`, and `sum_boolean`. | Required |
 | [`expr`](/docs/build/measures#expr) | Either reference an existing column in the table or use a SQL expression to create or derive a new one. | Optional | 
 | [`non_additive_dimension`](/docs/build/measures#non-additive-dimensions) | Non-additive dimensions can be specified for measures that cannot be aggregated over certain dimensions, such as bank account balances, to avoid producing incorrect results. | Optional |
-| `agg_params` | Specific aggregation properties such as a percentile. | Optional |
+| `agg_params` | Specific aggregation properties, such as a percentile. | Optional | 
 | `agg_time_dimension` | The time field. Defaults to the default agg time dimension for the semantic model.  | Optional | 1.6 and higher |
 | `create_metric` | Create a `simple` metric from a measure by setting `create_metric: True`. Specify its display name with `create_metric_display_name`. Available in dbt version 1.7 or higher. | Optional |