diff --git a/docs/docs-beta/docs/integrations/libraries/deltalake.md b/docs/docs-beta/docs/integrations/libraries/deltalake/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/deltalake.md rename to docs/docs-beta/docs/integrations/libraries/deltalake/index.md diff --git a/docs/docs-beta/docs/integrations/libraries/deltalake/reference.md b/docs/docs-beta/docs/integrations/libraries/deltalake/reference.md new file mode 100644 index 0000000000000..5bdd4b2af1b2b --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/deltalake/reference.md @@ -0,0 +1,225 @@ +--- +title: "dagster-deltalake integration reference" +description: Store your Dagster assets in Delta Lake +sidebar_position: 200 +--- + +This reference page provides information for working with [`dagster-deltalake`](/api/python-api/libraries/dagster-deltalake) features that are not covered as part of the [Using Delta Lake with Dagster tutorial](using-deltalake-with-dagster). + +- [Selecting specific columns in a downstream asset](#selecting-specific-columns-in-a-downstream-asset) +- [Storing partitioned assets](#storing-partitioned-assets) +- [Storing tables in multiple schemas](#storing-tables-in-multiple-schemas) +- [Using the Delta Lake I/O manager with other I/O managers](#using-the-delta-lake-io-manager-with-other-io-managers) +- [Storing and loading PyArrow Tables or Polars DataFrames in Delta Lake](#storing-and-loading-pyarrow-tables-or-polars-dataframes-in-delta-lake) +- [Configuring storage backends](#configuring-storage-backends) + +## Selecting specific columns in a downstream asset + +Sometimes you may not want to fetch an entire table as the input to a downstream asset. With the Delta Lake I/O manager, you can select specific columns to load by supplying metadata on the downstream asset. 
+ + + +In this example, we only use the columns containing sepal data from the `iris_dataset` table created in [Step 2](using-deltalake-with-dagster#step-2-create-delta-lake-tables) of the [Using Dagster with Delta Lake tutorial](using-deltalake-with-dagster). To select specific columns, we can add metadata to the input asset. We do this in the `metadata` parameter of the `AssetIn` that loads the `iris_dataset` asset in the `ins` parameter. We supply the key `columns` with a list of names of the columns we want to fetch. + +When Dagster materializes `sepal_data` and loads the `iris_dataset` asset using the Delta Lake I/O manager, it will only fetch the `sepal_length_cm` and `sepal_width_cm` columns of the `iris/iris_dataset` table and pass them to `sepal_data` as a Pandas DataFrame. + +## Storing partitioned assets + +The Delta Lake I/O manager supports storing and loading partitioned data. To correctly store and load data from the Delta table, the Delta Lake I/O manager needs to know which column contains the data defining the partition bounds. The Delta Lake I/O manager uses this information to construct the correct queries to select or replace the data. + +In the following sections, we describe how the I/O manager constructs these queries for different types of partitions. + +::: + +For partitioning to work, the partition dimension needs to be one of the partition columns defined on the Delta table. Tables created via the I/O manager will be configured accordingly. + +::: + + + + +**Storing static partitioned assets** + +To store static partitioned assets in your Delta Lake, specify `partition_expr` metadata on the asset to tell the Delta Lake I/O manager which column contains the partition data: + + + +Dagster uses the `partition_expr` metadata to generate appropriate function parameters when loading the partition in the downstream asset. 
When loading a static partition this roughly corresponds to the following SQL statement: + +```sql +SELECT * + WHERE [partition_expr] in ([selected partitions]) +``` + +A partition must be selected when materializing the above assets, as described in the [Materializing partitioned assets](/guides/build/partitions-and-backfills/partitioning-assets) documentation. In this example, the query used when materializing the `Iris-setosa` partition of the above assets would be: + +```sql +SELECT * + WHERE species = 'Iris-setosa' +``` + + + + +**Storing time-partitioned assets** + +Like static partitioned assets, you can specify `partition_expr` metadata on the asset to tell the Delta Lake I/O manager which column contains the partition data: + + + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the correct partition in the downstream asset. When loading a dynamic partition, the following statement is used: + +```sql +SELECT * + WHERE [partition_expr] = [partition_start] +``` + +A partition must be selected when materializing the above assets, as described in the [Materializing partitioned assets](/guides/build/partitions-and-backfills/partitioning-assets) documentation. The `[partition_start]` and `[partition_end]` bounds are of the form `YYYY-MM-DD HH:MM:SS`. In this example, the query when materializing the `2023-01-02` partition of the above assets would be: + +```sql +SELECT * + WHERE time = '2023-01-02 00:00:00' +``` + + + + +**Storing multi-partitioned assets** + +The Delta Lake I/O manager can also store data partitioned on multiple dimensions. To do this, specify the column for each partition as a dictionary of `partition_expr` metadata: + + + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the correct partition in a downstream asset. For multi-partitions, Dagster concatenates the `WHERE` statements described in the above sections to craft the correct `SELECT` statement. 
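To make the concatenation concrete, here is a rough, standalone sketch of how one condition per partition dimension can be joined into a single statement. The function name `multi_partition_query` and its arguments are hypothetical, and this is not the I/O manager's actual implementation; it only mirrors the SQL shown in this section:

```python
from typing import Mapping


def multi_partition_query(
    partition_expr: Mapping[str, str], selected: Mapping[str, str]
) -> str:
    """Build an illustrative SELECT for one multi-partition key.

    partition_expr maps each partition dimension to its table column;
    selected maps each partition dimension to the chosen partition value.
    """
    # One equality condition per dimension, in the order the dimensions are declared.
    conditions = [
        f"{column} = '{selected[dimension]}'"
        for dimension, column in partition_expr.items()
    ]
    # Concatenate the per-dimension conditions with AND.
    return "SELECT * WHERE " + " AND ".join(conditions)


query = multi_partition_query(
    partition_expr={"species": "species", "date": "time"},
    selected={"species": "Iris-setosa", "date": "2023-01-02 00:00:00"},
)
print(query)
```

Each dimension contributes exactly one condition, so adding a dimension to `partition_expr` adds one more `AND` clause.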
+ +A partition must be selected when materializing the above assets, as described in the [Materializing partitioned assets](/guides/build/partitions-and-backfills/partitioning-assets) documentation. For example, when materializing the `2023-01-02|Iris-setosa` partition of the above assets, the following query will be used: + +```sql +SELECT * + WHERE species = 'Iris-setosa' + AND time = '2023-01-02 00:00:00' +``` + + + + + + +## Storing tables in multiple schemas + +You may want to have different assets stored in different Delta Lake schemas. The Delta Lake I/O manager allows you to specify the schema in two ways. + +If you want all of your assets to be stored in the same schema, you can specify the schema as configuration to the I/O manager, as we did in [Step 1](using-deltalake-with-dagster#step-1-configure-the-delta-lake-io-manager) of the [Using Delta Lake with Dagster tutorial](using-deltalake-with-dagster). + +If you want to store assets in different schemas, you can specify the schema as part of the asset's key: + + + +In this example, the `iris_dataset` asset will be stored in the `IRIS` schema, and the `daffodil_dataset` asset will be stored in the `DAFFODIL` schema. + +::: + +The two options for specifying schema are mutually exclusive. If you provide `schema` configuration to the I/O manager, you cannot also provide it via the asset key and vice versa. If no `schema` is provided, either from configuration or asset keys, the default schema `public` will be used. + +::: + +## Using the Delta Lake I/O manager with other I/O managers + +You may have assets that you don't want to store in Delta Lake. 
You can provide an I/O manager to each asset using the `io_manager_key` parameter in the decorator: + + + +In this example: + +- The `iris_dataset` asset uses the I/O manager bound to the key `warehouse_io_manager` and `iris_plots` uses the I/O manager bound to the key `blob_io_manager` +- In the object, we supply the I/O managers for those keys +- When the assets are materialized, the `iris_dataset` will be stored in Delta Lake, and `iris_plots` will be saved in Amazon S3 + +## Storing and loading PyArrow Tables or Polars DataFrames in Delta Lake + +The Delta Lake I/O manager also supports storing and loading PyArrow Tables and Polars DataFrames. + + + + +**Storing and loading PyArrow Tables with Delta Lake** + +The `deltalake` package relies heavily on Apache Arrow for efficient data transfer, so PyArrow is natively supported. + +You can use the `DeltaLakePyArrowIOManager` in a object as in [Step 1](using-deltalake-with-dagster#step-1-configure-the-delta-lake-io-manager) of the [Using Delta Lake with Dagster tutorial](using-deltalake-with-dagster). + + + + + + +## Configuring storage backends + +The `deltalake` library comes with support for many storage backends out of the box. The storage backend to use is derived from the URL of the storage location. + +### S3 compatible storages + +The S3 APIs are implemented by a number of providers, and it is possible to interact with many of them. However, most S3 implementations do not offer support for atomic operations, which is a requirement for multi-writer support. As such, some additional setup and configuration is required. + + + + +If a table will only ever have a single writer (meaning no concurrent Dagster jobs write to the same table), you can allow unsafe writes to the table. 
+ +```py +from dagster_deltalake import S3Config + +config = S3Config(allow_unsafe_rename=True) +``` + + + + + +To use DynamoDB, set the `AWS_S3_LOCKING_PROVIDER` variable to `dynamodb` and create a table named `delta_rs_lock_table` in DynamoDB. An example DynamoDB table creation snippet using the AWS CLI follows. Customize it for your environment’s needs (for example, read/write capacity modes): + +```bash +aws dynamodb create-table --table-name delta_rs_lock_table \ + --attribute-definitions \ + AttributeName=key,AttributeType=S \ + --key-schema \ + AttributeName=key,KeyType=HASH \ + --provisioned-throughput \ + ReadCapacityUnits=10,WriteCapacityUnits=10 +``` + +::: + +The delta-rs community is actively working on extending the available options for locking backends. This includes locking backends compatible with Databricks to allow concurrent writes from Databricks and external environments. + +::: + + + + + +Cloudflare R2 storage has built-in support for atomic copy operations. This can be leveraged by sending additional headers with the copy requests. + +```py +from dagster_deltalake import S3Config + +config = S3Config(copy_if_not_exists="header: cf-copy-destination-if-none-match: *") +``` + + + + + +When a non-AWS S3 implementation is used, the endpoint URL of the S3 service needs to be provided. + +```py +config = S3Config(endpoint="https://") +``` + +### Working with locally running storage (emulators) + +A common pattern, for example in integration tests, is to run a storage emulator such as Azurite or LocalStack. If the emulator is not configured to use TLS, the HTTP client must be configured to allow HTTP traffic. 
+ +```py +config = AzureConfig(use_emulator=True, client=ClientConfig(allow_http=True)) +``` diff --git a/docs/docs-beta/docs/integrations/libraries/deltalake/using-deltalake-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/deltalake/using-deltalake-with-dagster.md new file mode 100644 index 0000000000000..3e83c08e46c2f --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/deltalake/using-deltalake-with-dagster.md @@ -0,0 +1,95 @@ +--- +title: "Using Delta Lake with Dagster" +description: Store your Dagster assets in a Delta Lake +sidebar_position: 100 +--- + +This tutorial focuses on how to store and load Dagster [asset definitions](/guides/build/assets/defining-assets) in a Delta Lake. + +By the end of the tutorial, you will: + +- Configure a Delta Lake I/O manager +- Create a table in Delta Lake using a Dagster asset +- Make a Delta Lake table available in Dagster +- Load Delta tables in downstream assets + +While this guide focuses on storing and loading Pandas DataFrames in Delta Lakes, Dagster also supports using PyArrow Tables and Polars DataFrames. Learn more about setting up and using the Delta Lake I/O manager with PyArrow Tables and Polars DataFrames in the [Delta Lake reference](reference). + +## Prerequisites + +To complete this tutorial, you'll need to install the `dagster-deltalake` and `dagster-deltalake-pandas` libraries: + +```shell +pip install dagster-deltalake dagster-deltalake-pandas +``` + +## Step 1: Configure the Delta Lake I/O manager + +The Delta Lake I/O manager requires some configuration to set up your Delta Lake. You must provide a root path where your Delta tables will be created. Additionally, you can specify a `schema` where the Delta Lake I/O manager will create tables. + + + +With this configuration, if you materialized an asset called `iris_dataset`, the Delta Lake I/O manager would store the data within a folder `iris/iris_dataset` under the provided root directory `path/to/deltalake`. 
+ +Finally, in the object, we assign the to the `io_manager` key. `io_manager` is a reserved key to set the default I/O manager for your assets. + +## Step 2: Create Delta Lake tables + +The Delta Lake I/O manager can create and update tables for your Dagster-defined assets, but you can also make existing Delta Lake tables available to Dagster. + + + + + +**Store a Dagster asset as a table in Delta Lake** + +To store data in Delta Lake using the Delta Lake I/O manager, the definitions of your assets don't need to change. You can tell Dagster to use the Delta Lake I/O manager, like in [Step 1](#step-1-configure-the-delta-lake-io-manager), and Dagster will handle storing and loading your assets in Delta Lake. + + + +In this example, we first define an [asset](/guides/build/assets/defining-assets). Here, we fetch the Iris dataset as a Pandas DataFrame and rename the columns. The type signature of the function tells the I/O manager what data type it is working with, so it's important to include the return type `pd.DataFrame`. + +When Dagster materializes the `iris_dataset` asset using the configuration from [Step 1](#step-1-configure-the-delta-lake-io-manager), the Delta Lake I/O manager will create the table `iris/iris_dataset` if it doesn't exist and replace the contents of the table with the value returned from the `iris_dataset` asset. + + + + + +### Make an existing table available in Dagster + +If you already have tables in your Delta Lake, you may want to make them available to other Dagster assets. You can accomplish this by defining [external assets](/guides/build/assets/external-assets) for these tables. By creating an external asset for the existing table, you tell Dagster how to find the table so it can be fetched for downstream assets. + + + +In this example, we create a for an existing table containing iris harvest data. To make the data available to other Dagster assets, we need to tell the Delta Lake I/O manager how to find the data. 
+ +Because we already supplied the database and schema in the I/O manager configuration in [Step 1](#step-1-configure-the-delta-lake-io-manager), we only need to provide the table name. We do this with the `key` parameter in `AssetSpec`. When the I/O manager needs to load the `iris_harvest_data` in a downstream asset, it will select the data in the `iris/iris_harvest_data` folder as a Pandas DataFrame and provide it to the downstream asset. + + + + +## Step 3: Load Delta Lake tables in downstream assets + +Once you've created an asset that represents a table in your Delta Lake, you will likely want to create additional assets that work with the data. Dagster and the Delta Lake I/O manager allow you to load the data stored in Delta tables into downstream assets. + + + +In this example, we want to provide the `iris_dataset` asset to the `iris_cleaned` asset. Refer to the Store a Dagster asset as a table in Delta Lake example in [step 2](#step-2-create-delta-lake-tables) for a look at the `iris_dataset` asset. + +In `iris_cleaned`, the `iris_dataset` parameter tells Dagster that the value for the `iris_dataset` asset should be provided as input to `iris_cleaned`. If this feels too magical for you, refer to the docs for explicitly specifying dependencies. + +When materializing these assets, Dagster will use the `DeltaLakePandasIOManager` to fetch the `iris/iris_dataset` as a Pandas DataFrame and pass the DataFrame as the `iris_dataset` parameter to `iris_cleaned`. When `iris_cleaned` returns a Pandas DataFrame, Dagster will use the `DeltaLakePandasIOManager` to store the DataFrame as the `iris/iris_cleaned` table in your Delta Lake. + +## Completed code example + +When finished, your code should look like the following: + + + +## Related + +For more Delta Lake features, refer to the [Delta Lake reference](reference). + +For more information on asset definitions, see the [Assets documentation](/guides/build/assets/defining-assets). 
+ +For more information on I/O managers, refer to the [I/O manager documentation](/guides/build/io-managers/). diff --git a/docs/docs-beta/docs/integrations/libraries/dlt.md b/docs/docs-beta/docs/integrations/libraries/dlt/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/dlt.md rename to docs/docs-beta/docs/integrations/libraries/dlt/index.md diff --git a/docs/docs-beta/docs/integrations/libraries/dlt/using-dlt-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/dlt/using-dlt-with-dagster.md new file mode 100644 index 0000000000000..e52e1aff43885 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/dlt/using-dlt-with-dagster.md @@ -0,0 +1,364 @@ +--- +title: "Using dlt with Dagster" +description: Ingest data with ease using Dagster and dlt +--- + +::: + +This feature is considered **experimental** + +::: + +The [data load tool (dlt)](https://dlthub.com/) open-source library defines a standardized approach for creating data pipelines that load often messy data sources into well-structured data sets. It offers many advanced features, such as: + +- Handling connection secrets +- Converting data into the structure required for a destination +- Incremental updates and merges + +dlt also provides a large collection of [pre-built, verified sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) and [destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/), allowing you to write less code (if any!) by leveraging the work of the dlt community. + +In this guide, we'll explain how the dlt integration works, how to set up a Dagster project for dlt, and how to use a pre-defined dlt source. + +## How it works + +The Dagster dlt integration uses [multi-assets](/guides/build/assets/defining-assets#multi-asset), a single definition that results in multiple assets. These assets are derived from the `DltSource`. 
+ +The following is an example of a dlt source definition where a source is made up of two resources: + +```python +@dlt.source +def example(api_key: str = dlt.secrets.value): + @dlt.resource(primary_key="id", write_disposition="merge") + def courses(): + response = requests.get(url=BASE_URL + "courses") + response.raise_for_status() + yield response.json().get("items") + + @dlt.resource(primary_key="id", write_disposition="merge") + def users(): + for page in _paginate(BASE_URL + "users"): + yield page + + return courses, users +``` + +Each resource queries an API endpoint and yields the data that we wish to load into our data warehouse. The two resources defined on the source will map to Dagster assets. + +Next, we defined a dlt pipeline that specifies how we want the data to be loaded: + +```python +pipeline = dlt.pipeline( + pipeline_name="example_pipeline", + destination="snowflake", + dataset_name="example_data", + progress="log", +) +``` + +A dlt source and pipeline are the two components required to load data using dlt. These will be the parameters of our multi-asset, which will integrate dlt and Dagster. + +## Prerequisites + +To follow the steps in this guide, you'll need: + +- **To read the [dlt introduction](https://dlthub.com/docs/intro)**, if you've never worked with dlt before. +- **[To install](/getting-started/installation) the following libraries**: + + ```bash + pip install dagster dagster-dlt + ``` + + Installing `dagster-dlt` will also install the `dlt` package. + +## Step 1: Configure your Dagster project to support dlt + +The first step is to define a location for the `dlt` code used for ingesting data. We recommend creating a `dlt_sources` directory at the root of your Dagster project, but this code can reside anywhere within your Python project. 
+ +Run the following to create the `dlt_sources` directory: + +```bash +cd $DAGSTER_HOME && mkdir dlt_sources +``` + +## Step 2: Initialize dlt ingestion code + +In the `dlt_sources` directory, you can write ingestion code following the [dlt tutorial](https://dlthub.com/docs/tutorial/load-data-from-an-api) or you can use a verified source. + +In this example, we'll use the [GitHub source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/github) provided by dlt. + +1. Run the following to create a location for the dlt source code and initialize the GitHub source: + + ```bash + cd dlt_sources + + dlt init github snowflake + ``` + + You'll then see the following in the command line: + + ```bash + Looking up the init scripts in https://github.com/dlt-hub/verified-sources.git... + Cloning and configuring a verified source github (Source that load github issues, pull requests and reactions for a specific repository via customizable graphql query. Loads events incrementally.) + ``` + +2. When prompted to proceed, enter `y`. You should see the following confirming that the GitHub source was added to the project: + + ```bash + Verified source github was added to your project! + * See the usage examples and code snippets to copy from github_pipeline.py + * Add credentials for snowflake and other secrets in ./.dlt/secrets.toml + * requirements.txt was created. Install it with: + pip3 install -r requirements.txt + * Read https://dlthub.com/docs/walkthroughs/create-a-pipeline for more information + ``` + +This downloads the code required to collect data from the GitHub API. It also creates a `requirements.txt` file and a `.dlt/` configuration directory. These files can be removed because we will configure our pipelines through Dagster, but you may still find them useful as a reference. + +```bash +$ tree -a +. 
+├── .dlt # can be removed +│   ├── .sources +│   ├── config.toml +│   └── secrets.toml +├── .gitignore +├── github +│   ├── README.md +│   ├── __init__.py +│   ├── helpers.py +│   ├── queries.py +│   └── settings.py +├── github_pipeline.py +└── requirements.txt # can be removed +``` + +## Step 3: Define dlt environment variables + +This integration manages connections and secrets using environment variables as `dlt`. The `dlt` library can infer required environment variables used by its sources and resources. Refer to [dlt's Secrets and Configs](https://dlthub.com/docs/general-usage/credentials/configuration) documentation for more information. + +In the example we've been using: + +- The `github_reactions` source requires a GitHub access token +- The Snowflake destination requires database connection details + +This results in the following required environment variables: + +```bash +SOURCES__GITHUB__ACCESS_TOKEN="" +DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE="" +DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD="" +DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME="" +DESTINATION__SNOWFLAKE__CREDENTIALS__HOST="" +DESTINATION__SNOWFLAKE__CREDENTIALS__WAREHOUSE="" +DESTINATION__SNOWFLAKE__CREDENTIALS__ROLE="" +``` + +Ensure that these variables are defined in your environment, either in your `.env` file when running locally or in the [Dagster deployment's environment variables](/guides/deploy/using-environment-variables-and-secrets). + +## Step 4: Define a DagsterDltResource + +Next, we'll define a , which provides a wrapper of a dlt pipeline runner. Use the following to define the resource, which can be shared across all dlt pipelines: + +```python +from dagster_dlt import DagsterDltResource + +dlt_resource = DagsterDltResource() +``` + +We'll add the resource to our in a later step. + +## Step 5: Create a dlt_assets definition for GitHub + +The decorator takes a `dlt_source` and `dlt_pipeline` parameter. 
In this example, we used the `github_reactions` source and created a `dlt_pipeline` to ingest data from Github to Snowflake. + +In the same file containing your Dagster assets, you can create an instance of your by doing something like the following: + +::: + +If you are using the [sql_database](https://dlthub.com/docs/api_reference/sources/sql_database/__init__#sql_database) source, consider setting `defer_table_reflect=True` to reduce database reads. By default, the Dagster daemon will refresh definitions roughly every minute, which will query the database for resource definitions. + +::: + +```python +from dagster import AssetExecutionContext, Definitions +from dagster_dlt import DagsterDltResource, dlt_assets +from dlt import pipeline +from dlt_sources.github import github_reactions + + +@dlt_assets( + dlt_source=github_reactions( + "dagster-io", "dagster", max_items=250 + ), + dlt_pipeline=pipeline( + pipeline_name="github_issues", + dataset_name="github", + destination="snowflake", + progress="log", + ), + name="github", + group_name="github", +) +def dagster_github_assets(context: AssetExecutionContext, dlt: DagsterDltResource): + yield from dlt.run(context=context) +``` + +## Step 6: Create the Definitions object + +The last step is to include the assets and resource in a object. This enables Dagster tools to load everything we've defined: + +```python +defs = Definitions( + assets=[ + dagster_github_assets, + ], + resources={ + "dlt": dlt_resource, + }, +) +``` + +And that's it! You should now have two assets that load data to corresponding Snowflake tables: one for issues and the other for pull requests. + +## Advanced usage + +### Overriding the translator to customize dlt assets + +The object can be used to customize how dlt properties map to Dagster concepts. + +For example, to change how the name of the asset is derived, or if you would like to change the key of the upstream source asset, you can override the method. 
+ +{/* TODO convert to */} +```python file=/integrations/dlt/dlt_dagster_translator.py +import dlt +from dagster_dlt import DagsterDltResource, DagsterDltTranslator, dlt_assets +from dagster_dlt.translator import DltResourceTranslatorData + +from dagster import AssetExecutionContext, AssetKey, AssetSpec + + +@dlt.source +def example_dlt_source(): + def example_resource(): ... + + return example_resource + + +class CustomDagsterDltTranslator(DagsterDltTranslator): + def get_asset_spec(self, data: DltResourceTranslatorData) -> AssetSpec: + """Overrides asset spec to: + - Override asset key to be the dlt resource name, + - Override upstream asset key to be a single source asset. + """ + default_spec = super().get_asset_spec(data) + return default_spec.replace_attributes( + key=AssetKey(f"{data.resource.name}"), + deps=[AssetKey("common_upstream_dlt_dependency")], + ) + + +@dlt_assets( + name="example_dlt_assets", + dlt_source=example_dlt_source(), + dlt_pipeline=dlt.pipeline( + pipeline_name="example_pipeline_name", + dataset_name="example_dataset_name", + destination="snowflake", + progress="log", + ), + dagster_dlt_translator=CustomDagsterDltTranslator(), +) +def dlt_example_assets(context: AssetExecutionContext, dlt: DagsterDltResource): + yield from dlt.run(context=context) +``` + +In this example, we customized the translator to change how the dlt assets' names are defined. We also hard-coded the asset dependency upstream of our assets to provide a fan-out model from a single dependency to our dlt assets. + +### Assigning metadata to upstream external assets + +A common question is how to define metadata on the external assets upstream of the dlt assets. + +This can be accomplished by defining a with a key that matches the one defined in the method. + +For example, let's say we have defined a set of dlt assets named `thinkific_assets`, we can iterate over those assets and derive a with attributes like `group_name`. 
+ +{/* TODO convert to */} +```python file=/integrations/dlt/dlt_source_assets.py +import dlt +from dagster_dlt import DagsterDltResource, dlt_assets + +from dagster import AssetExecutionContext, AssetSpec + + +@dlt.source +def example_dlt_source(): + def example_resource(): ... + + return example_resource + + +@dlt_assets( + dlt_source=example_dlt_source(), + dlt_pipeline=dlt.pipeline( + pipeline_name="example_pipeline_name", + dataset_name="example_dataset_name", + destination="snowflake", + progress="log", + ), +) +def example_dlt_assets(context: AssetExecutionContext, dlt: DagsterDltResource): + yield from dlt.run(context=context) + + +thinkific_source_assets = [ + AssetSpec(key, group_name="thinkific") for key in example_dlt_assets.dependency_keys +] +``` + +### Using partitions in your dlt assets + +While still an experimental feature, it is possible to use partitions within your dlt assets. However, it should be noted that this may result in concurrency related issues as state is managed by dlt. For this reason, it is recommended to set concurrency limits for your partitioned dlt assets. See the [Limiting concurrency in data pipelines](/guides/operate/managing-concurrency) guide for more details. + +That said, here is an example of using static named partitions from a dlt source. + +{/* TODO convert to */} +```python file=/integrations/dlt/dlt_partitions.py +from typing import Optional + +import dlt +from dagster_dlt import DagsterDltResource, dlt_assets + +from dagster import AssetExecutionContext, StaticPartitionsDefinition + +color_partitions = StaticPartitionsDefinition(["red", "green", "blue"]) + + +@dlt.source +def example_dlt_source(color: Optional[str] = None): + def load_colors(): + if color: + # partition-specific processing + ... + else: + # non-partitioned processing + ... 
+ + +@dlt_assets( + dlt_source=example_dlt_source(), + name="example_dlt_assets", + dlt_pipeline=dlt.pipeline( + pipeline_name="example_pipeline_name", + dataset_name="example_dataset_name", + destination="snowflake", + ), + partitions_def=color_partitions, +) +def compute(context: AssetExecutionContext, dlt: DagsterDltResource): + color = context.partition_key + yield from dlt.run(context=context, dlt_source=example_dlt_source(color=color)) +``` + +## What's next? + +Want to see real-world examples of dlt in production? Check out how we use it internally at Dagster in the [Dagster Open Platform](https://github.com/dagster-io/dagster-open-platform) project. diff --git a/docs/docs-beta/docs/integrations/libraries/duckdb.md b/docs/docs-beta/docs/integrations/libraries/duckdb/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/duckdb.md rename to docs/docs-beta/docs/integrations/libraries/duckdb/index.md diff --git a/docs/docs-beta/docs/integrations/libraries/duckdb/reference.md b/docs/docs-beta/docs/integrations/libraries/duckdb/reference.md new file mode 100644 index 0000000000000..127de67758b18 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/duckdb/reference.md @@ -0,0 +1,590 @@ +--- +title: "dagster-duckdb integration reference" +description: Store your Dagster assets in DuckDB +sidebar_position: 200 +--- + +This reference page provides information for working with [`dagster-duckdb`](/api/python-api/libraries/dagster-duckdb) features that are not covered as part of the [Using Dagster with DuckDB tutorial](using-duckdb-with-dagster). 
+ +DuckDB resource: + +- [Executing custom SQL queries](#executing-custom-sql-queries) + +DuckDB I/O manager: + +- [Selecting specific columns in a downstream asset](#selecting-specific-columns-in-a-downstream-asset) +- [Storing partitioned assets](#storing-partitioned-assets) +- [Storing tables in multiple schemas](#storing-tables-in-multiple-schemas) +- [Using the DuckDB I/O manager with other I/O managers](#using-the-duckdb-io-manager-with-other-io-managers) +- [Storing and loading PySpark or Polars DataFrames in DuckDB](#storing-and-loading-pyspark-or-polars-dataframes-in-duckdb) +- [Storing multiple DataFrame types in DuckDB](#storing-multiple-dataframe-types-in-duckdb) + +## DuckDB resource + +The DuckDB resource provides access to a [`duckdb.DuckDBPyConnection`](https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyConnection) object. This allows you full control over how your data is stored and retrieved in your database. + +For further information on the DuckDB resource, see the [DuckDB resource API docs](/api/python-api/libraries/dagster-duckdb#dagster_duckdb.DuckDBResource). + +### Executing custom SQL queries + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/resource.py startafter=start endbefore=end +from dagster_duckdb import DuckDBResource + +from dagster import asset + +# this example executes a query against the iris_dataset table created in Step 2 of the +# Using Dagster with DuckDB tutorial + + +@asset(deps=[iris_dataset]) +def small_petals(duckdb: DuckDBResource) -> None: + with duckdb.get_connection() as conn: # conn is a DuckDBPyConnection + conn.execute( + "CREATE TABLE iris.small_petals AS SELECT * FROM iris.iris_dataset WHERE" + " 'petal_length_cm' < 1 AND 'petal_width_cm' < 1" + ) +``` + +In this example, we attach the DuckDB resource to the `small_petals` asset. 
In the body of the asset function, we use the `get_connection` context manager on the resource to get a [`duckdb.DuckDBPyConnection`](https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyConnection). We can use this connection to execute a custom SQL query against the `iris_dataset` table created in [Step 2: Create tables in DuckDB](using-duckdb-with-dagster#option-1-step-2) of the [Using Dagster with DuckDB tutorial](using-duckdb-with-dagster). When the `duckdb.get_connection` context is exited, the DuckDB connection will be closed. + +## DuckDB I/O manager + +The DuckDB I/O manager provides several ways to customize how your data is stored and loaded in DuckDB. However, if you find that these options do not provide enough customization for your use case, we recommend using the DuckDB resource to save and load your data. By using the resource, you will have more fine-grained control over how your data is handled, since you have full control over the SQL queries that are executed. + +### Selecting specific columns in a downstream asset + +Sometimes you may not want to fetch an entire table as the input to a downstream asset. With the DuckDB I/O manager, you can select specific columns to load by supplying metadata on the downstream asset. 
+ +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/downstream_columns.py +import pandas as pd + +from dagster import AssetIn, asset + +# this example uses the iris_dataset asset from Step 2 of the Using Dagster with DuckDB tutorial + + +@asset( + ins={ + "iris_sepal": AssetIn( + key="iris_dataset", + metadata={"columns": ["sepal_length_cm", "sepal_width_cm"]}, + ) + } +) +def sepal_data(iris_sepal: pd.DataFrame) -> pd.DataFrame: + iris_sepal["sepal_area_cm2"] = ( + iris_sepal["sepal_length_cm"] * iris_sepal["sepal_width_cm"] + ) + return iris_sepal +``` + +In this example, we only use the columns containing sepal data from the `IRIS_DATASET` table created in [Step 2: Create tables in DuckDB](using-duckdb-with-dagster#option-2-step-2) of the [Using Dagster with DuckDB tutorial](using-duckdb-with-dagster). To select specific columns, we can add metadata to the input asset. We do this in the `metadata` parameter of the `AssetIn` that loads the `iris_dataset` asset in the `ins` parameter. We supply the key `columns` with a list of names of the columns we want to fetch. + +When Dagster materializes `sepal_data` and loads the `iris_dataset` asset using the DuckDB I/O manager, it will only fetch the `sepal_length_cm` and `sepal_width_cm` columns of the `IRIS.IRIS_DATASET` table and pass them to `sepal_data` as a Pandas DataFrame. + +### Storing partitioned assets + +The DuckDB I/O manager supports storing and loading partitioned data. To correctly store and load data from the DuckDB table, the DuckDB I/O manager needs to know which column contains the data defining the partition bounds. The DuckDB I/O manager uses this information to construct the correct queries to select or replace the data. + +In the following sections, we describe how the I/O manager constructs these queries for different types of partitions. 
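
To make the role of the partition column concrete, here is a minimal sketch of how a column expression and a set of selected partition keys could be combined into a filter. It is an illustration only, not the I/O manager's actual implementation, and the function name `static_partition_filter` is hypothetical.

```python
def static_partition_filter(partition_expr: str, partition_keys: list[str]) -> str:
    """Build a WHERE clause that keeps only rows in the selected partitions.

    Hypothetical helper for illustration; not part of dagster-duckdb.
    """
    # Quote each key as a SQL string literal, doubling embedded single quotes.
    quoted = ", ".join("'" + key.replace("'", "''") + "'" for key in partition_keys)
    return f"WHERE {partition_expr} in ({quoted})"


# The partition_expr value is inserted verbatim, so it must be valid DuckDB SQL.
print(static_partition_filter("SPECIES", ["Iris-setosa"]))
```

Because the expression is substituted as-is, anything DuckDB accepts in a `WHERE` clause (a bare column name or a function call over a column) can serve as the `partition_expr`.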
+ + + + +To store static partitioned assets in DuckDB, specify `partition_expr` metadata on the asset to tell the DuckDB I/O manager which column contains the partition data: + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/static_partition.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import AssetExecutionContext, StaticPartitionsDefinition, asset + + +@asset( + partitions_def=StaticPartitionsDefinition( + ["Iris-setosa", "Iris-virginica", "Iris-versicolor"] + ), + metadata={"partition_expr": "SPECIES"}, +) +def iris_dataset_partitioned(context: AssetExecutionContext) -> pd.DataFrame: + species = context.partition_key + + full_df = pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + # the column is named "species" (lowercase) in the names list above + return full_df[full_df["species"] == species] + + +@asset +def iris_cleaned(iris_dataset_partitioned: pd.DataFrame): + return iris_dataset_partitioned.dropna().drop_duplicates() +``` + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the partition in the downstream asset. When loading a static partition (or multiple static partitions), the following statement is used: + +```sql +SELECT * + WHERE [partition_expr] in ([selected partitions]) +``` + +When the `partition_expr` value is injected into this statement, the resulting SQL query must follow DuckDB's SQL syntax. Refer to the [DuckDB documentation](https://duckdb.org/docs/sql/query_syntax/select) for more information. + +{/* TODO fix link A partition must be selected when materializing the above assets, as described in the [Materializing partitioned assets](/concepts/partitions-schedules-sensors/partitioning-assets#materializing-partitioned-assets) documentation. */} A partition must be selected when materializing the above assets. 
In this example, the query used when materializing the `Iris-setosa` partition of the above assets would be: + +```sql +SELECT * + WHERE SPECIES in ('Iris-setosa') +``` + + + + +As with static partitioned assets, you can specify `partition_expr` metadata on the asset to tell the DuckDB I/O manager which column contains the partition data: + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/time_partition.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset + + +@asset( + partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"), + metadata={"partition_expr": "TO_TIMESTAMP(TIME)"}, +) +def iris_data_per_day(context: AssetExecutionContext) -> pd.DataFrame: + partition = context.partition_key + + # get_iris_data_for_date fetches all of the iris data for a given date, + # the returned dataframe contains a column named 'time' that stores + # the time of the row as an integer of seconds since epoch + return get_iris_data_for_date(partition) + + +@asset +def iris_cleaned(iris_data_per_day: pd.DataFrame): + return iris_data_per_day.dropna().drop_duplicates() +``` + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the correct partition in the downstream asset. When loading a time-based partition, the following statement is used: + +```sql +SELECT * + WHERE [partition_expr] >= [partition_start] + AND [partition_expr] < [partition_end] +``` + +When the `partition_expr` value is injected into this statement, the resulting SQL query must follow DuckDB's SQL syntax. Refer to the [DuckDB documentation](https://duckdb.org/docs/sql/query_syntax/select) for more information. 
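
The `WHERE` template above can be sketched in plain Python for a daily partition. This is a rough illustration under stated assumptions, not the library's code: the helper name `daily_partition_filter` is hypothetical, and it assumes partition keys of the form `YYYY-MM-DD` with a window of exactly one day.

```python
from datetime import datetime, timedelta


def daily_partition_filter(partition_expr: str, partition_key: str) -> str:
    """Build the half-open [start, end) filter for one daily partition.

    Hypothetical helper for illustration; not part of dagster-duckdb.
    """
    start = datetime.strptime(partition_key, "%Y-%m-%d")
    end = start + timedelta(days=1)  # assumed window length: one day
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"WHERE {partition_expr} >= '{start.strftime(fmt)}'"
        f" AND {partition_expr} < '{end.strftime(fmt)}'"
    )


print(daily_partition_filter("TO_TIMESTAMP(TIME)", "2023-01-02"))
```

The lower bound is inclusive and the upper bound exclusive, so adjacent partitions never overlap and no row falls between them.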
+ +{/* TODO fix link: A partition must be selected when materializing the above assets, as described in the [Materializing partitioned assets](/concepts/partitions-schedules-sensors/partitioning-assets#materializing-partitioned-assets) documentation. */} A partition must be selected when materializing assets. The `[partition_start]` and `[partition_end]` bounds are of the form `YYYY-MM-DD HH:MM:SS`. In this example, the query when materializing the `2023-01-02` partition of the above assets would be: + +```sql +SELECT * + WHERE TO_TIMESTAMP(TIME) >= '2023-01-02 00:00:00' + AND TO_TIMESTAMP(TIME) < '2023-01-03 00:00:00' +``` + +In this example, the data in the `TIME` column are integers, so the `partition_expr` metadata includes a SQL statement to convert integers to timestamps. A full list of DuckDB functions can be found [here](https://duckdb.org/docs/sql/functions/overview). + + + + +The DuckDB I/O manager can also store data partitioned on multiple dimensions. To do this, specify the column for each partition as a dictionary of `partition_expr` metadata: + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/multi_partition.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import ( + AssetExecutionContext, + DailyPartitionsDefinition, + MultiPartitionsDefinition, + StaticPartitionsDefinition, + asset, +) + + +@asset( + partitions_def=MultiPartitionsDefinition( + { + "date": DailyPartitionsDefinition(start_date="2023-01-01"), + "species": StaticPartitionsDefinition( + ["Iris-setosa", "Iris-virginica", "Iris-versicolor"] + ), + } + ), + metadata={"partition_expr": {"date": "TO_TIMESTAMP(TIME)", "species": "SPECIES"}}, +) +def iris_dataset_partitioned(context: AssetExecutionContext) -> pd.DataFrame: + partition = context.partition_key.keys_by_dimension + species = partition["species"] + date = partition["date"] + + # get_iris_data_for_date fetches all of the iris data for a given date, + # the returned 
dataframe contains a column named 'time' that stores + # the time of the row as an integer of seconds since epoch + full_df = get_iris_data_for_date(date) + + return full_df[full_df["species"] == species] + + +@asset +def iris_cleaned(iris_dataset_partitioned: pd.DataFrame): + return iris_dataset_partitioned.dropna().drop_duplicates() +``` + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the correct partition in a downstream asset. For multi-partitions, Dagster combines the `WHERE` conditions described in the sections above with `AND` to craft the correct `SELECT` statement. + +{/* TODO fix link: A partition must be selected when materializing the above assets, as described in the [Materializing partitioned assets](/concepts/partitions-schedules-sensors/partitioning-assets#materializing-partitioned-assets) documentation. */} A partition must be selected when materializing assets. For example, when materializing the `2023-01-02|Iris-setosa` partition of the above assets, the following query will be used: + +```sql +SELECT * + WHERE SPECIES in ('Iris-setosa') + AND TO_TIMESTAMP(TIME) >= '2023-01-02 00:00:00' + AND TO_TIMESTAMP(TIME) < '2023-01-03 00:00:00' +``` + +In this example, the data in the `TIME` column are integers, so the `partition_expr` metadata includes a SQL function call to convert integers to timestamps. A full list of DuckDB functions can be found [here](https://duckdb.org/docs/sql/functions/overview). + + + + +### Storing tables in multiple schemas + +You may want to have different assets stored in different DuckDB schemas. The DuckDB I/O manager allows you to specify the schema in several ways. + +You can specify the default schema where data will be stored as configuration to the I/O manager, as we did in [Step 1: Configure the DuckDB I/O manager](using-duckdb-with-dagster#step-1-configure-the-duckdb-io-manager) of the [Using Dagster with DuckDB tutorial](using-duckdb-with-dagster). 
+ +If you want to store assets in different schemas, you can specify the schema as metadata: + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/schema.py startafter=start_metadata endbefore=end_metadata dedent=4 +daffodil_dataset = AssetSpec( + key=["daffodil_dataset"], metadata={"schema": "daffodil"} +) + +@asset(metadata={"schema": "iris"}) +def iris_dataset() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) +``` + +You can also specify the schema as part of the asset's key: + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/schema.py startafter=start_asset_key endbefore=end_asset_key dedent=4 +daffodil_dataset = AssetSpec(key=["daffodil", "daffodil_dataset"]) + +@asset(key_prefix=["iris"]) +def iris_dataset() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) +``` + +In this example, the `iris_dataset` asset will be stored in the `IRIS` schema, and the `daffodil_dataset` asset will be found in the `DAFFODIL` schema. + +::: + + The schema is determined in this order: +
  1. If the schema is set via metadata, that schema will be used.
+  2. Otherwise, the schema set as configuration on the I/O manager will be used.
+  3. Otherwise, if there is a `key_prefix`, that schema will be used.
+  4. If none of the above are provided, the default schema will be `PUBLIC`.
+ +::: + +### Using the DuckDB I/O manager with other I/O managers + +You may have assets that you don't want to store in DuckDB. You can provide an I/O manager to each asset using the `io_manager_key` parameter in the decorator: + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/multiple_io_managers.py startafter=start_example endbefore=end_example +import pandas as pd +from dagster_aws.s3.io_manager import s3_pickle_io_manager +from dagster_duckdb_pandas import DuckDBPandasIOManager + +from dagster import Definitions, asset + + +@asset(io_manager_key="warehouse_io_manager") +def iris_dataset() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + +@asset(io_manager_key="blob_io_manager") +def iris_plots(iris_dataset): + # plot_data is a function we've defined somewhere else + # that plots the data in a DataFrame + return plot_data(iris_dataset) + + +defs = Definitions( + assets=[iris_dataset, iris_plots], + resources={ + "warehouse_io_manager": DuckDBPandasIOManager( + database="path/to/my_duckdb_database.duckdb", + schema="IRIS", + ), + "blob_io_manager": s3_pickle_io_manager, + }, +) +``` + +In this example: + +- The `iris_dataset` asset uses the I/O manager bound to the key `warehouse_io_manager` and `iris_plots` uses the I/O manager bound to the key `blob_io_manager` +- In the `Definitions` object, we supply the I/O managers for those keys +- When the assets are materialized, the `iris_dataset` will be stored in DuckDB, and `iris_plots` will be saved in Amazon S3 + +### Storing and loading PySpark or Polars DataFrames in DuckDB + +The DuckDB I/O manager also supports storing and loading PySpark and Polars DataFrames. 
+ + + + +To use the `DuckDBPySparkIOManager`, first install the package: + +```shell +pip install dagster-duckdb-pyspark +``` + +Then you can use the `DuckDBPySparkIOManager` in your `Definitions` as in [Step 1: Configure the DuckDB I/O manager](using-duckdb-with-dagster#step-1-configure-the-duckdb-io-manager) of the [Using Dagster with DuckDB tutorial](using-duckdb-with-dagster). + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/pyspark_configuration.py startafter=start_configuration endbefore=end_configuration +from dagster_duckdb_pyspark import DuckDBPySparkIOManager + +from dagster import Definitions + +defs = Definitions( + assets=[iris_dataset], + resources={ + "io_manager": DuckDBPySparkIOManager( + database="path/to/my_duckdb_database.duckdb", # required + schema="IRIS", # optional, defaults to PUBLIC + ) + }, +) +``` + +The `DuckDBPySparkIOManager` requires an active `SparkSession`. You can either create your own `SparkSession` or use the `pyspark_resource`. + + + + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/pyspark_with_spark_resource.py +from dagster_duckdb_pyspark import DuckDBPySparkIOManager +from dagster_pyspark import pyspark_resource +from pyspark import SparkFiles +from pyspark.sql import DataFrame +from pyspark.sql.types import DoubleType, StringType, StructField, StructType + +from dagster import AssetExecutionContext, Definitions, asset + + +@asset(required_resource_keys={"pyspark"}) +def iris_dataset(context: AssetExecutionContext) -> DataFrame: + spark = context.resources.pyspark.spark_session + + schema = StructType( + [ + StructField("sepal_length_cm", DoubleType()), + StructField("sepal_width_cm", DoubleType()), + StructField("petal_length_cm", DoubleType()), + StructField("petal_width_cm", DoubleType()), + StructField("species", StringType()), + ] + ) + + url = "https://docs.dagster.io/assets/iris.csv" + spark.sparkContext.addFile(url) + + return spark.read.schema(schema).csv("file://" + SparkFiles.get("iris.csv")) + + +defs = Definitions( + 
 assets=[iris_dataset], + resources={ + "io_manager": DuckDBPySparkIOManager( + database="path/to/my_duckdb_database.duckdb", + schema="IRIS", + ), + "pyspark": pyspark_resource, + }, +) +``` + + + + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/pyspark_with_spark_session.py startafter=start endbefore=end +from dagster_duckdb_pyspark import DuckDBPySparkIOManager +from pyspark import SparkFiles +from pyspark.sql import DataFrame, SparkSession +from pyspark.sql.types import DoubleType, StringType, StructField, StructType + +from dagster import Definitions, asset + + +@asset +def iris_dataset() -> DataFrame: + spark = SparkSession.builder.getOrCreate() + + schema = StructType( + [ + StructField("sepal_length_cm", DoubleType()), + StructField("sepal_width_cm", DoubleType()), + StructField("petal_length_cm", DoubleType()), + StructField("petal_width_cm", DoubleType()), + StructField("species", StringType()), + ] + ) + + url = "https://docs.dagster.io/assets/iris.csv" + spark.sparkContext.addFile(url) + + return spark.read.schema(schema).csv("file://" + SparkFiles.get("iris.csv")) + + +defs = Definitions( + assets=[iris_dataset], + resources={ + "io_manager": DuckDBPySparkIOManager( + database="path/to/my_duckdb_database.duckdb", + schema="IRIS", + ) + }, +) +``` + + + + + + + +To use the `DuckDBPolarsIOManager`, first install the package: + +```shell +pip install dagster-duckdb-polars +``` + +Then you can use the `DuckDBPolarsIOManager` in your `Definitions` as in [Step 1: Configure the DuckDB I/O manager](using-duckdb-with-dagster#step-1-configure-the-duckdb-io-manager) of the [Using Dagster with DuckDB tutorial](using-duckdb-with-dagster). 
+ +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/polars_configuration.py startafter=start_configuration endbefore=end_configuration +from dagster_duckdb_polars import DuckDBPolarsIOManager + +from dagster import Definitions + +defs = Definitions( + assets=[iris_dataset], + resources={ + "io_manager": DuckDBPolarsIOManager( + database="path/to/my_duckdb_database.duckdb", # required + schema="IRIS", # optional, defaults to PUBLIC + ) + }, +) +``` + + + + +### Storing multiple DataFrame types in DuckDB + +If you work with several DataFrame libraries and want a single I/O manager to handle storing and loading these DataFrames in DuckDB, you can write a new I/O manager that handles the DataFrame types. + +To do this, inherit from the `DuckDBIOManager` base class and implement the `type_handlers` and `default_load_type` methods. The resulting I/O manager will inherit the configuration fields of the base `DuckDBIOManager`. + +{/* TODO convert to */} +```python file=/integrations/duckdb/reference/multiple_dataframe_types.py startafter=start_example endbefore=end_example +from typing import Optional, Type + +import pandas as pd +from dagster_duckdb import DuckDBIOManager +from dagster_duckdb_pandas import DuckDBPandasTypeHandler +from dagster_duckdb_polars import DuckDBPolarsTypeHandler +from dagster_duckdb_pyspark import DuckDBPySparkTypeHandler + +from dagster import Definitions + + +class DuckDBPandasPySparkPolarsIOManager(DuckDBIOManager): + @staticmethod + def type_handlers(): + """type_handlers should return a list of the TypeHandlers that the I/O manager can use. + Here we return the DuckDBPandasTypeHandler, DuckDBPySparkTypeHandler, and DuckDBPolarsTypeHandler so that the I/O + manager can store Pandas DataFrames, PySpark DataFrames, and Polars DataFrames. 
+ """ + return [ + DuckDBPandasTypeHandler(), + DuckDBPySparkTypeHandler(), + DuckDBPolarsTypeHandler(), + ] + + @staticmethod + def default_load_type() -> Optional[type]: + """If an asset is not annotated with a return type, default_load_type will be used to + determine which TypeHandler to use to store and load the output. + In this case, unannotated assets will be stored and loaded as Pandas DataFrames. + """ + return pd.DataFrame + + +defs = Definitions( + assets=[iris_dataset, rose_dataset], + resources={ + "io_manager": DuckDBPandasPySparkPolarsIOManager( + database="path/to/my_duckdb_database.duckdb", + schema="IRIS", + ) + }, +) +``` diff --git a/docs/docs-beta/docs/integrations/libraries/duckdb/using-duckdb-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/duckdb/using-duckdb-with-dagster.md new file mode 100644 index 0000000000000..2cace1a56e5a4 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/duckdb/using-duckdb-with-dagster.md @@ -0,0 +1,361 @@ +--- +title: "Using DuckDB with Dagster" +description: Store your Dagster assets in DuckDB +sidebar_position: 100 +--- + +This tutorial focuses on creating and interacting with DuckDB tables using Dagster's [asset definitions](/guides/build/assets/defining-assets). + +The `dagster-duckdb` library provides two ways to interact with DuckDB tables: + +- [Resource](/guides/build/external-resources/): The resource allows you to directly run SQL queries against tables within an asset's compute function. Available resources: `DuckDBResource`. +- [I/O manager](/guides/build/io-managers/): The I/O manager transfers the responsibility of storing and loading DataFrames as DuckDB tables to Dagster. Available I/O managers: `DuckDBPandasIOManager`, `DuckDBPySparkIOManager`, `DuckDBPolarsIOManager`. + +This tutorial is divided into two sections to demonstrate the differences between the DuckDB resource and the DuckDB I/O manager. 
Each section will create the same assets, but the first section will use the DuckDB resource to store data in DuckDB, whereas the second section will use the DuckDB I/O manager. When writing your own assets, you may choose either approach (or both) depending on your storage requirements. {/* TODO fix link See [When to use I/O managers](/guides/build/io-managers/#when-to-use-io-managers) to learn more about when to use I/O managers and when to use resources. */} + +In [Option 1](#option-1-using-the-duckdb-resource) you will: + +- Set up and configure the DuckDB resource. +- Use the DuckDB resource to execute a SQL query to create a table. +- Use the DuckDB resource to execute a SQL query to interact with the table. + +In [Option 2](#option-2-using-the-duckdb-io-manager) you will: + +- Set up and configure the DuckDB I/O manager. +- Use Pandas to create a DataFrame, then delegate responsibility for creating a table to the DuckDB I/O manager. +- Use the DuckDB I/O manager to load the table into memory so that you can interact with it using the Pandas library. + +By the end of the tutorial, you will: + +- Understand how to interact with a DuckDB database using the DuckDB resource. +- Understand how to use the DuckDB I/O manager to store and load DataFrames as DuckDB tables. +- Understand how to define dependencies between assets corresponding to tables in a DuckDB database. 
+ +## Prerequisites + +To complete this tutorial, you'll need: + +- **To install the `dagster-duckdb` and `dagster-duckdb-pandas` libraries**: + + ```shell + pip install dagster-duckdb dagster-duckdb-pandas + ``` + +## Option 1: Using the DuckDB resource + +### Step 1: Configure the DuckDB resource + +To use the DuckDB resource, you'll need to add it to your `Definitions` object. The DuckDB resource requires some configuration. You must set a path to a DuckDB database as the `database` configuration value. If the database does not already exist, it will be created for you: + +{/* TODO convert to */} +```python file=/integrations/duckdb/tutorial/resource/configuration.py startafter=start_example endbefore=end_example +from dagster_duckdb import DuckDBResource + +from dagster import Definitions + +defs = Definitions( + assets=[iris_dataset], + resources={ + "duckdb": DuckDBResource( + database="path/to/my_duckdb_database.duckdb", # required + ) + }, +) +``` + +### Step 2: Create tables in DuckDB \{#option-1-step-2} + + + + + +**Create DuckDB tables in Dagster** + +Using the DuckDB resource, you can create DuckDB tables using the DuckDB Python API: + +{/* TODO convert to */} +```python file=/integrations/duckdb/tutorial/resource/create_table.py startafter=start_example endbefore=end_example +import pandas as pd +from dagster_duckdb import DuckDBResource + +from dagster import asset + + +@asset +def iris_dataset(duckdb: DuckDBResource) -> None: + iris_df = pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + with duckdb.get_connection() as conn: + conn.execute("CREATE TABLE iris.iris_dataset AS SELECT * FROM iris_df") +``` + +In this example, you're defining an asset that fetches the Iris dataset as a Pandas DataFrame and renames the columns. Then, using the DuckDB resource, the DataFrame is stored in DuckDB as the `iris.iris_dataset` table. 
+ + + + + +**Making Dagster aware of existing tables** + +If you have existing tables in DuckDB and other assets defined in Dagster depend on those tables, you may want Dagster to be aware of those upstream dependencies. Making Dagster aware of these tables will allow you to track the full data lineage in Dagster. You can accomplish this by defining [external assets](/guides/build/assets/external-assets) for these tables. + +{/* TODO convert to */} +```python file=/integrations/duckdb/tutorial/io_manager/source_asset.py +from dagster import AssetSpec + +iris_harvest_data = AssetSpec(key="iris_harvest_data") +``` + +In this example, you're creating an `AssetSpec` for a pre-existing table called `iris_harvest_data`. + + + + + +Now you can run `dagster dev` and materialize the `iris_dataset` asset from the Dagster UI. + +### Step 3: Define downstream assets + +Once you have created an asset that represents a table in DuckDB, you will likely want to create additional assets that work with the data. + +{/* TODO convert to */} +```python file=/integrations/duckdb/tutorial/resource/downstream.py startafter=start_example endbefore=end_example +from dagster_duckdb import DuckDBResource + +from dagster import asset + +# this example uses the iris_dataset asset from Step 2 + + +@asset(deps=[iris_dataset]) +def iris_setosa(duckdb: DuckDBResource) -> None: + with duckdb.get_connection() as conn: + conn.execute( + "CREATE TABLE iris.iris_setosa AS SELECT * FROM iris.iris_dataset WHERE" + " species = 'Iris-setosa'" + ) +``` + +In this asset, you're creating a second table that only contains the data for the _Iris Setosa_ species. This asset has a dependency on the `iris_dataset` asset. To define this dependency, you provide the `iris_dataset` asset as the `deps` parameter to the `iris_setosa` asset. You can then run the SQL query to create the table of _Iris Setosa_ data. 
+ +### Completed code example + +When finished, your code should look like the following: + +{/* TODO convert to */} +```python file=/integrations/duckdb/tutorial/resource/full_example.py +import pandas as pd +from dagster_duckdb import DuckDBResource + +from dagster import AssetSpec, Definitions, asset + +iris_harvest_data = AssetSpec(key="iris_harvest_data") + + +@asset +def iris_dataset(duckdb: DuckDBResource) -> None: + iris_df = pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + with duckdb.get_connection() as conn: + conn.execute("CREATE TABLE iris.iris_dataset AS SELECT * FROM iris_df") + + +@asset(deps=[iris_dataset]) +def iris_setosa(duckdb: DuckDBResource) -> None: + with duckdb.get_connection() as conn: + conn.execute( + "CREATE TABLE iris.iris_setosa AS SELECT * FROM iris.iris_dataset WHERE" + " species = 'Iris-setosa'" + ) + + +defs = Definitions( + assets=[iris_dataset, iris_harvest_data, iris_setosa], + resources={ + "duckdb": DuckDBResource( + database="path/to/my_duckdb_database.duckdb", + ) + }, +) +``` + +## Option 2: Using the DuckDB I/O manager + +An I/O manager can handle storing DataFrames as tables in DuckDB and loading DuckDB tables as DataFrames in downstream assets. Consider using an I/O manager if: + +- You want your data to be loaded in memory so that you can interact with it using Python. +- You'd like to have Dagster manage how you store the data and load it as an input in downstream assets. + +{/* Using an I/O manager is not required, and you can reference [When to use I/O managers](/guides/build/io-managers/#when-to-use-io-managers) to learn more. */} + +This section of the guide focuses on storing and loading Pandas DataFrames in DuckDB, but Dagster also supports using PySpark and Polars DataFrames with DuckDB. 
The concepts from this guide apply to working with PySpark and Polars DataFrames, and you can learn more about setting up and using the DuckDB I/O manager with PySpark and Polars DataFrames in the [reference guide](reference). + +### Step 1: Configure the DuckDB I/O manager + +To use the DuckDB I/O manager, you'll need to add it to your `Definitions` object. The DuckDB I/O manager requires some configuration to connect to your database. You must provide a path where a DuckDB database will be created. Additionally, you can specify a `schema` where the DuckDB I/O manager will create tables. + +{/* TODO convert to */} +```python file=/integrations/duckdb/tutorial/io_manager/configuration.py startafter=start_example endbefore=end_example +from dagster_duckdb_pandas import DuckDBPandasIOManager + +from dagster import Definitions + +defs = Definitions( + assets=[iris_dataset], + resources={ + "io_manager": DuckDBPandasIOManager( + database="path/to/my_duckdb_database.duckdb", # required + schema="IRIS", # optional, defaults to PUBLIC + ) + }, +) +``` + +### Step 2: Create tables in DuckDB \{#option-2-step-2} + +The DuckDB I/O manager can create and update tables for your Dagster-defined assets, but you can also make existing DuckDB tables available to Dagster. + + + + + +#### Store a Dagster asset as a table in DuckDB + +To store data in DuckDB using the DuckDB I/O manager, you can simply return a Pandas DataFrame from your asset. Dagster will handle storing and loading your assets in DuckDB. 
+ +{/* TODO convert to */} +```python file=/integrations/duckdb/tutorial/io_manager/basic_example.py +import pandas as pd + +from dagster import asset + + +@asset +def iris_dataset() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) +``` + +In this example, you're defining an asset that fetches the Iris dataset as a Pandas DataFrame, renames the columns, then returns the DataFrame. The type signature of the function tells the I/O manager what data type it is working with, so it is important to include the return type `pd.DataFrame`. + +When Dagster materializes the `iris_dataset` asset using the configuration from [Step 1: Configure the DuckDB I/O manager](#step-1-configure-the-duckdb-io-manager), the DuckDB I/O manager will create the table `IRIS.IRIS_DATASET` if it does not exist and replace the contents of the table with the value returned from the `iris_dataset` asset. + + + + + +**Make an existing table available in Dagster** + +If you have existing tables in DuckDB and other assets defined in Dagster depend on those tables, you may want Dagster to be aware of those upstream dependencies. Making Dagster aware of these tables will allow you to track the full data lineage in Dagster. You can accomplish this by defining [external assets](/guides/build/assets/external-assets) for these tables. + +{/* TODO convert to */} +```python file=/integrations/duckdb/tutorial/io_manager/source_asset.py +from dagster import AssetSpec + +iris_harvest_data = AssetSpec(key="iris_harvest_data") +``` + +In this example, you're creating an `AssetSpec` for a pre-existing table containing iris harvest data. To make the data available to other Dagster assets, you need to tell the DuckDB I/O manager how to find the data. 
+ +Because you already supplied the database and schema in the I/O manager configuration in [Step 1: Configure the DuckDB I/O manager](#step-1-configure-the-duckdb-io-manager), you only need to provide the table name. This is done with the `key` parameter in `AssetSpec`. When the I/O manager needs to load the `iris_harvest_data` in a downstream asset, it will select the data in the `IRIS.IRIS_HARVEST_DATA` table as a Pandas DataFrame and provide it to the downstream asset. + + + + +### Step 3: Load DuckDB tables in downstream assets + +Once you have created an asset that represents a table in DuckDB, you will likely want to create additional assets that work with the data. Dagster and the DuckDB I/O manager allow you to load the data stored in DuckDB tables into downstream assets. + +{/* TODO convert to */} +```python file=/integrations/duckdb/tutorial/io_manager/load_downstream.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import asset + +# this example uses the iris_dataset asset from Step 2 + + +@asset +def iris_setosa(iris_dataset: pd.DataFrame) -> pd.DataFrame: + return iris_dataset[iris_dataset["species"] == "Iris-setosa"] +``` + +In this asset, you're providing the `iris_dataset` asset as a dependency to `iris_setosa`. By supplying `iris_dataset` as a parameter to `iris_setosa`, Dagster knows to use the `DuckDBPandasIOManager` to load this asset into memory as a Pandas DataFrame and pass it as an argument to `iris_setosa`. Next, a DataFrame that only contains the data for the _Iris Setosa_ species is created and returned. Then the `DuckDBPandasIOManager` will store the DataFrame as the `IRIS.IRIS_SETOSA` table in DuckDB. 
+ +### Completed code example + +When finished, your code should look like the following: + +{/* TODO convert to */} +```python file=/integrations/duckdb/tutorial/io_manager/full_example.py +import pandas as pd +from dagster_duckdb_pandas import DuckDBPandasIOManager + +from dagster import AssetSpec, Definitions, asset + +iris_harvest_data = AssetSpec(key="iris_harvest_data") + + +@asset +def iris_dataset() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + +@asset +def iris_setosa(iris_dataset: pd.DataFrame) -> pd.DataFrame: + return iris_dataset[iris_dataset["species"] == "Iris-setosa"] + + +defs = Definitions( + assets=[iris_dataset, iris_harvest_data, iris_setosa], + resources={ + "io_manager": DuckDBPandasIOManager( + database="path/to/my_duckdb_database.duckdb", + schema="IRIS", + ) + }, +) +``` + +## Related + +For more DuckDB features, refer to the [DuckDB reference](reference). + +For more information on asset definitions, see the [Assets documentation](/guides/build/assets/). + +For more information on I/O managers, see the [I/O manager documentation](/guides/build/io-managers/). 
diff --git a/docs/docs-beta/docs/integrations/libraries/gcp/bigquery.md b/docs/docs-beta/docs/integrations/libraries/gcp/bigquery/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/gcp/bigquery.md rename to docs/docs-beta/docs/integrations/libraries/gcp/bigquery/index.md diff --git a/docs/docs-beta/docs/integrations/libraries/gcp/bigquery/reference.md b/docs/docs-beta/docs/integrations/libraries/gcp/bigquery/reference.md new file mode 100644 index 0000000000000..821f289136a45 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/gcp/bigquery/reference.md @@ -0,0 +1,603 @@ +--- +title: "BigQuery integration reference" +description: Store your Dagster assets in BigQuery +sidebar_position: 200 +--- + +This reference page provides information for working with features that are not covered as part of the [Using Dagster with BigQuery tutorial](using-bigquery-with-dagster). + +- [Providing credentials as configuration](#providing-credentials-as-configuration) +- [Selecting specific columns in a downstream asset](#selecting-specific-columns-in-a-downstream-asset) +- [Storing partitioned assets](#storing-partitioned-assets) +- [Storing tables in multiple datasets](#storing-tables-in-multiple-datasets) +- [Using the BigQuery I/O manager with other I/O managers](#using-the-bigquery-io-manager-with-other-io-managers) +- [Storing and loading PySpark DataFrames in BigQuery](#storing-and-loading-pyspark-dataframes-in-bigquery) +- [Using Pandas and PySpark DataFrames with BigQuery](#using-pandas-and-pyspark-dataframes-with-bigquery) +- [Executing custom SQL commands with the BigQuery resource](#executing-custom-sql-commands-with-the-bigquery-resource) + +## Providing credentials as configuration + +In most cases, you will authenticate with Google Cloud Platform (GCP) using one of the methods outlined in the [GCP documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc). 
However, in some cases you may find that you need to provide authentication credentials directly to the BigQuery I/O manager. For example, if you are using [Dagster+ Serverless](/dagster-plus/deployment/deployment-types/serverless) you cannot upload a credential file, so must provide your credentials as an environment variable. + +You can provide credentials directly to the BigQuery I/O manager by using the `gcp_credentials` configuration value. The BigQuery I/O manager will create a temporary file to store the credential and will set `GOOGLE_APPLICATION_CREDENTIALS` to point to this file. When the Dagster run is completed, the temporary file is deleted and `GOOGLE_APPLICATION_CREDENTIALS` is unset. + +To avoid issues with newline characters in the GCP credential key, you must base64 encode the key. For example, if your GCP key is stored at `~/.gcp/key.json` you can base64 encode the key by using the following shell command: + +```shell +cat ~/.gcp/key.json | base64 +``` + +Then you can [set an environment variable](/guides/deploy/using-environment-variables-and-secrets) in your Dagster deployment (for example `GCP_CREDS`) to the encoded key and provide it to the BigQuery I/O manager: + +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/config_auth.py startafter=start_example endbefore=end_example +from dagster_gcp_pandas import BigQueryPandasIOManager + +from dagster import Definitions, EnvVar + +defs = Definitions( + assets=[iris_data], + resources={ + "io_manager": BigQueryPandasIOManager( + project="my-gcp-project", + location="us-east5", + dataset="IRIS", + timeout=15.0, + gcp_credentials=EnvVar("GCP_CREDS"), + ) + }, +) +``` + +## Selecting specific columns in a downstream asset + +Sometimes you may not want to fetch an entire table as the input to a downstream asset. With the BigQuery I/O manager, you can select specific columns to load by supplying metadata on the downstream asset. 
+ +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/downstream_columns.py +import pandas as pd + +from dagster import AssetIn, asset + +# this example uses the iris_data asset from Step 2 of the Using Dagster with BigQuery tutorial + + +@asset( + ins={ + "iris_sepal": AssetIn( + key="iris_data", + metadata={"columns": ["sepal_length_cm", "sepal_width_cm"]}, + ) + } +) +def sepal_data(iris_sepal: pd.DataFrame) -> pd.DataFrame: + iris_sepal["sepal_area_cm2"] = ( + iris_sepal["sepal_length_cm"] * iris_sepal["sepal_width_cm"] + ) + return iris_sepal +``` + +In this example, we only use the columns containing sepal data from the `IRIS_DATA` table created in [Step 2: Create tables in BigQuery](using-bigquery-with-dagster#step-2-create-tables-in-bigquery) of the [Using Dagster with BigQuery tutorial](using-bigquery-with-dagster). Fetching the entire table would be unnecessarily costly, so to select specific columns, we can add metadata to the input asset. We do this in the `metadata` parameter of the `AssetIn` that loads the `iris_data` asset in the `ins` parameter. We supply the key `columns` with a list of names of the columns we want to fetch. + +When Dagster materializes `sepal_data` and loads the `iris_data` asset using the BigQuery I/O manager, it will only fetch the `sepal_length_cm` and `sepal_width_cm` columns of the `IRIS.IRIS_DATA` table and pass them to `sepal_data` as a Pandas DataFrame. + +## Storing partitioned assets + +The BigQuery I/O manager supports storing and loading partitioned data. In order to correctly store and load data from the BigQuery table, the BigQuery I/O manager needs to know which column contains the data defining the partition bounds. The BigQuery I/O manager uses this information to construct the correct queries to select or replace the data. In the following sections, we describe how the I/O manager constructs these queries for different types of partitions. 
+ + + + +**Storing static partitioned assets** + +In order to store static partitioned assets in BigQuery, you must specify `partition_expr` metadata on the asset to tell the BigQuery I/O manager which column contains the partition data: + +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/static_partition.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import AssetExecutionContext, StaticPartitionsDefinition, asset + + +@asset( + partitions_def=StaticPartitionsDefinition( + ["Iris-setosa", "Iris-virginica", "Iris-versicolor"] + ), + metadata={"partition_expr": "SPECIES"}, +) +def iris_data_partitioned(context: AssetExecutionContext) -> pd.DataFrame: + species = context.partition_key + + full_df = pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + return full_df[full_df["species"] == species] + + +@asset +def iris_cleaned(iris_data_partitioned: pd.DataFrame): + return iris_data_partitioned.dropna().drop_duplicates() +``` + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the partition in the downstream asset. When loading a static partition, the following statement is used: + +```sql +SELECT * + WHERE [partition_expr] in ([selected partitions]) +``` + +When the `partition_expr` value is injected into this statement, the resulting SQL query must follow BigQuery's SQL syntax. Refer to the [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax) for more information. + +When materializing the above assets, a partition must be selected, as described in [Materializing partitioned assets](/guides/build/partitions-and-backfills/partitioning-assets). 
In this example, the query used when materializing the `Iris-setosa` partition of the above assets would be: + +```sql +SELECT * + WHERE SPECIES in ('Iris-setosa') +``` + + + + +**Storing time partitioned assets** + +Like static partitioned assets, you can specify `partition_expr` metadata on the asset to tell the BigQuery I/O manager which column contains the partition data: + +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/time_partition.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset + + +@asset( + partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"), + metadata={"partition_expr": "TIMESTAMP_SECONDS(TIME)"}, +) +def iris_data_per_day(context: AssetExecutionContext) -> pd.DataFrame: + partition = context.partition_key + + # get_iris_data_for_date fetches all of the iris data for a given date, + # the returned dataframe contains a column named 'TIME' that stores + # the time of the row as an integer of seconds since epoch + return get_iris_data_for_date(partition) + + +@asset +def iris_cleaned(iris_data_per_day: pd.DataFrame): + return iris_data_per_day.dropna().drop_duplicates() +``` + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the correct partition in the downstream asset. When loading a time-window partition, the following statement is used: + +```sql +SELECT * + WHERE [partition_expr] >= [partition_start] + AND [partition_expr] < [partition_end] +``` + +When the `partition_expr` value is injected into this statement, the resulting SQL query must follow BigQuery's SQL syntax. Refer to the [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax) for more information. 
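
To make the template concrete, here is a minimal sketch of how a time-window filter of this shape could be rendered as a string. It is an illustration only, not the I/O manager's actual implementation: `time_window_where_clause` is a hypothetical helper, and rendering the bounds as `YYYY-MM-DD HH:MM:SS` strings is an assumption of the sketch.

```python
from datetime import datetime, timedelta


def time_window_where_clause(
    partition_expr: str, start: datetime, end: datetime
) -> str:
    """Render a half-open [start, end) filter on partition_expr.

    Hypothetical helper for illustration; the real I/O manager may build
    its query differently.
    """
    fmt = "%Y-%m-%d %H:%M:%S"  # assumed bound format
    return (
        f"WHERE {partition_expr} >= '{start.strftime(fmt)}'\n"
        f"  AND {partition_expr} < '{end.strftime(fmt)}'"
    )


# A daily partition covers one day, starting at midnight.
start = datetime(2023, 1, 2)
clause = time_window_where_clause(
    "TIMESTAMP_SECONDS(TIME)", start, start + timedelta(days=1)
)
print(clause)
```

Because `partition_expr` is substituted verbatim, it can be a bare column name or any expression that BigQuery accepts on the left-hand side of a comparison.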
+ +When materializing the above assets, a partition must be selected, as described in [Materializing partitioned assets](/guides/build/partitions-and-backfills/partitioning-assets). The `[partition_start]` and `[partition_end]` bounds are of the form `YYYY-MM-DD HH:MM:SS`. In this example, the query when materializing the `2023-01-02` partition of the above assets would be: + +```sql +SELECT * + WHERE TIMESTAMP_SECONDS(TIME) >= '2023-01-02 00:00:00' + AND TIMESTAMP_SECONDS(TIME) < '2023-01-03 00:00:00' +``` + +In this example, the data in the `TIME` column are integers, so the `partition_expr` metadata includes a SQL statement to convert integers to timestamps. A full list of BigQuery functions can be found [here](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators). + + + + +**Storing multi-partitioned assets** + +The BigQuery I/O manager can also store data partitioned on multiple dimensions. To do this, you must specify the column for each partition as a dictionary of `partition_expr` metadata: + +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/multi_partition.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import ( + AssetExecutionContext, + DailyPartitionsDefinition, + MultiPartitionsDefinition, + StaticPartitionsDefinition, + asset, +) + + +@asset( + partitions_def=MultiPartitionsDefinition( + { + "date": DailyPartitionsDefinition(start_date="2023-01-01"), + "species": StaticPartitionsDefinition( + ["Iris-setosa", "Iris-virginica", "Iris-versicolor"] + ), + } + ), + metadata={ + "partition_expr": {"date": "TIMESTAMP_SECONDS(TIME)", "species": "SPECIES"} + }, +) +def iris_data_partitioned(context: AssetExecutionContext) -> pd.DataFrame: + partition = context.partition_key.keys_by_dimension + species = partition["species"] + date = partition["date"] + + # get_iris_data_for_date fetches all of the iris data for a given date, + # the returned dataframe contains a 
column named 'TIME' that stores + # the time of the row as an integer of seconds since epoch + full_df = get_iris_data_for_date(date) + + return full_df[full_df["species"] == species] + + +@asset +def iris_cleaned(iris_data_partitioned: pd.DataFrame): + return iris_data_partitioned.dropna().drop_duplicates() +``` + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the correct partition in a downstream asset. For multi-partitions, Dagster concatenates the `WHERE` statements described in the static partition and time-window partition sections to craft the correct `SELECT` statement. + +When materializing the above assets, a partition must be selected, as described in [Materializing partitioned assets](/guides/build/partitions-and-backfills/partitioning-assets). For example, when materializing the `2023-01-02|Iris-setosa` partition of the above assets, the following query will be used: + +```sql +SELECT * + WHERE SPECIES in ('Iris-setosa') + AND TIMESTAMP_SECONDS(TIME) >= '2023-01-02 00:00:00' + AND TIMESTAMP_SECONDS(TIME) < '2023-01-03 00:00:00' +``` + + + + +## Storing tables in multiple datasets + +You may want to have different assets stored in different BigQuery datasets. The BigQuery I/O manager allows you to specify the dataset in several ways. + +You can specify the default dataset where data will be stored as configuration to the I/O manager, like we did in [Step 1: Configure the BigQuery I/O manager](using-bigquery-with-dagster#step-1-configure-the-bigquery-io-manager) of the [Using Dagster with BigQuery tutorial](using-bigquery-with-dagster). 
+ +If you want to store assets in different datasets, you can specify the dataset as metadata: + +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/dataset.py startafter=start_metadata endbefore=end_metadata dedent=4 +daffodil_data = AssetSpec(key=["daffodil_data"], metadata={"schema": "daffodil"}) + +@asset(metadata={"schema": "iris"}) +def iris_data() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) +``` + +You can also specify the dataset as part of the asset's asset key: + +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/dataset.py startafter=start_asset_key endbefore=end_asset_key dedent=4 +daffodil_data = AssetSpec(key=["gcp", "bigquery", "daffodil", "daffodil_data"]) + +@asset(key_prefix=["gcp", "bigquery", "iris"]) +def iris_data() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) +``` + +The dataset will be the last prefix before the asset's name. In this example, the `iris_data` asset will be stored in the `IRIS` dataset, and the `daffodil_data` asset will be found in the `DAFFODIL` dataset. + +::: + + The dataset is determined in this order: +
+1. If the dataset is set via metadata, that dataset will be used.
+2. Otherwise, the dataset set as configuration on the I/O manager will be used.
+3. Otherwise, if there is a `key_prefix`, that dataset will be used.
+4. If none of the above are provided, the default dataset will be `PUBLIC`.
+ +::: + +## Using the BigQuery I/O manager with other I/O managers + +You may have assets that you don't want to store in BigQuery. You can provide an I/O manager to each asset using the `io_manager_key` parameter in the `asset` decorator: + +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/multiple_io_managers.py startafter=start_example endbefore=end_example +import pandas as pd +from dagster_aws.s3.io_manager import s3_pickle_io_manager +from dagster_gcp_pandas import BigQueryPandasIOManager + +from dagster import Definitions, asset + + +@asset(io_manager_key="warehouse_io_manager") +def iris_data() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + +@asset(io_manager_key="blob_io_manager") +def iris_plots(iris_data): + # plot_data is a function we've defined somewhere else + # that plots the data in a DataFrame + return plot_data(iris_data) + + +defs = Definitions( + assets=[iris_data, iris_plots], + resources={ + "warehouse_io_manager": BigQueryPandasIOManager( + project="my-gcp-project", + dataset="IRIS", + ), + "blob_io_manager": s3_pickle_io_manager, + }, +) +``` + +In this example, the `iris_data` asset uses the I/O manager bound to the key `warehouse_io_manager`, and `iris_plots` uses the I/O manager bound to the key `blob_io_manager`. In the `Definitions` object, we supply the I/O managers for those keys. When the assets are materialized, the `iris_data` asset will be stored in BigQuery, and `iris_plots` will be saved in Amazon S3. + +## Storing and loading PySpark DataFrames in BigQuery + +The BigQuery I/O manager also supports storing and loading PySpark DataFrames. 
To use the `BigQueryPySparkIOManager`, first install the package: + +```shell +pip install dagster-gcp-pyspark +``` + +Then you can use the `BigQueryPySparkIOManager` in your `Definitions` as in [Step 1: Configure the BigQuery I/O manager](using-bigquery-with-dagster#step-1-configure-the-bigquery-io-manager) of the [Using Dagster with BigQuery tutorial](using-bigquery-with-dagster). + +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/pyspark_configuration.py startafter=start_configuration endbefore=end_configuration +from dagster_gcp_pyspark import BigQueryPySparkIOManager + +from dagster import Definitions + +defs = Definitions( + assets=[iris_data], + resources={ + "io_manager": BigQueryPySparkIOManager( + project="my-gcp-project", # required + location="us-east5", # optional, defaults to the default location for the project - see https://cloud.google.com/bigquery/docs/locations for a list of locations + dataset="IRIS", # optional, defaults to PUBLIC + temporary_gcs_bucket="my-gcs-bucket", # optional, defaults to None, which will result in a direct write to BigQuery + ) + }, +) +``` + +::: + +When using the `BigQueryPySparkIOManager` you may provide the `temporary_gcs_bucket` configuration. This will store the data in a temporary GCS bucket, then load all of the data into BigQuery in one operation. If not provided, data will be directly written to BigQuery. If you choose to use a temporary GCS bucket, you must include the [GCS Hadoop connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/tree/master/gcs) in your Spark Session, in addition to the BigQuery connector (described below). + +::: + +The `BigQueryPySparkIOManager` requires that a `SparkSession` be active and configured with the [BigQuery connector for Spark](https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example). You can either create your own `SparkSession` or use the `pyspark_resource`. 
+ + + + +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/pyspark_with_spark_resource.py +from dagster_gcp_pyspark import BigQueryPySparkIOManager +from dagster_pyspark import pyspark_resource +from pyspark import SparkFiles +from pyspark.sql import DataFrame, SparkSession +from pyspark.sql.types import DoubleType, StringType, StructField, StructType + +from dagster import AssetExecutionContext, Definitions, asset + +BIGQUERY_JARS = "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.28.0" + + +@asset(required_resource_keys={"pyspark"}) +def iris_data(context: AssetExecutionContext) -> DataFrame: + spark = context.resources.pyspark.spark_session + + schema = StructType( + [ + StructField("sepal_length_cm", DoubleType()), + StructField("sepal_width_cm", DoubleType()), + StructField("petal_length_cm", DoubleType()), + StructField("petal_width_cm", DoubleType()), + StructField("species", StringType()), + ] + ) + + url = "https://docs.dagster.io/assets/iris.csv" + spark.sparkContext.addFile(url) + + return spark.read.schema(schema).csv("file://" + SparkFiles.get("iris.csv")) + + +defs = Definitions( + assets=[iris_data], + resources={ + "io_manager": BigQueryPySparkIOManager( + project="my-gcp-project", + location="us-east5", + ), + "pyspark": pyspark_resource.configured( + {"spark_conf": {"spark.jars.packages": BIGQUERY_JARS}} + ), + }, +) +``` + + + + +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/pyspark_with_spark_session.py +from dagster_gcp_pyspark import BigQueryPySparkIOManager +from pyspark import SparkFiles +from pyspark.sql import DataFrame, SparkSession +from pyspark.sql.types import DoubleType, StringType, StructField, StructType + +from dagster import Definitions, asset + +BIGQUERY_JARS = "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.28.0" + + +@asset +def iris_data() -> DataFrame: + spark = SparkSession.builder.config( + key="spark.jars.packages", + value=BIGQUERY_JARS, + 
).getOrCreate() + + schema = StructType( + [ + StructField("sepal_length_cm", DoubleType()), + StructField("sepal_width_cm", DoubleType()), + StructField("petal_length_cm", DoubleType()), + StructField("petal_width_cm", DoubleType()), + StructField("species", StringType()), + ] + ) + + url = "https://docs.dagster.io/assets/iris.csv" + spark.sparkContext.addFile(url) + + return spark.read.schema(schema).csv("file://" + SparkFiles.get("iris.csv")) + + +defs = Definitions( + assets=[iris_data], + resources={ + "io_manager": BigQueryPySparkIOManager( + project="my-gcp-project", + location="us-east5", + ), + }, +) +``` + + + + +::: + +**Note:** In order to load data from BigQuery as a PySpark DataFrame, the BigQuery PySpark connector will create a view containing the data. This will result in the creation of a temporary table in your BigQuery dataset. For more details, see the [BigQuery PySpark connector documentation](https://github.com/GoogleCloudDataproc/spark-bigquery-connector#reading-data-from-a-bigquery-query). + +::: + +## Using Pandas and PySpark DataFrames with BigQuery + +If you work with both Pandas and PySpark DataFrames and want a single I/O manager to handle storing and loading these DataFrames in BigQuery, you can write a new I/O manager that handles both types. To do this, inherit from the `BigQueryIOManager` base class and implement the `type_handlers` and `default_load_type` methods. The resulting I/O manager will inherit the configuration fields of the base `BigQueryIOManager`. 
+ +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/pandas_and_pyspark.py startafter=start_example endbefore=end_example +from collections.abc import Sequence +from typing import Optional, Type + +import pandas as pd +from dagster_gcp import BigQueryIOManager +from dagster_gcp_pandas import BigQueryPandasTypeHandler +from dagster_gcp_pyspark import BigQueryPySparkTypeHandler + +from dagster import DbTypeHandler, Definitions + + +class MyBigQueryIOManager(BigQueryIOManager): + @staticmethod + def type_handlers() -> Sequence[DbTypeHandler]: + """type_handlers should return a list of the TypeHandlers that the I/O manager can use. + + Here we return the BigQueryPandasTypeHandler and BigQueryPySparkTypeHandler so that the I/O + manager can store Pandas DataFrames and PySpark DataFrames. + """ + return [BigQueryPandasTypeHandler(), BigQueryPySparkTypeHandler()] + + @staticmethod + def default_load_type() -> Optional[type]: + """If an asset is not annotated with a return type, default_load_type will be used to + determine which TypeHandler to use to store and load the output. + + In this case, unannotated assets will be stored and loaded as Pandas DataFrames. + """ + return pd.DataFrame + + +defs = Definitions( + assets=[iris_data, rose_data], + resources={ + "io_manager": MyBigQueryIOManager(project="my-gcp-project", dataset="FLOWERS") + }, +) +``` + +## Executing custom SQL commands with the BigQuery resource + +In addition to the BigQuery I/O manager, Dagster also provides a BigQuery [resource](/guides/build/external-resources/) for executing custom SQL queries. 
+ +{/* TODO convert to */} +```python file=/integrations/bigquery/reference/resource.py +from dagster_gcp import BigQueryResource + +from dagster import Definitions, asset + +# this example executes a query against the IRIS.IRIS_DATA table created in Step 2 of the +# Using Dagster with BigQuery tutorial + + +@asset +def small_petals(bigquery: BigQueryResource): + with bigquery.get_client() as client: + # column names are left unquoted: in BigQuery SQL, double quotes denote string literals + return client.query( + 'SELECT * FROM IRIS.IRIS_DATA WHERE petal_length_cm < 1 AND' + ' petal_width_cm < 1', + ).result() + + +defs = Definitions( + assets=[small_petals], + resources={ + "bigquery": BigQueryResource( + project="my-gcp-project", + location="us-east5", + ) + }, +) +``` + +In this example, we attach the BigQuery resource to the `small_petals` asset. In the body of the asset function, we use the `get_client` context manager method of the resource to get a [`bigquery.client.Client`](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.client.Client). We can use the client to execute a custom SQL query against the `IRIS_DATA` table created in [Step 2: Create tables in BigQuery](using-bigquery-with-dagster#step-2-create-tables-in-bigquery) of the [Using Dagster with BigQuery tutorial](using-bigquery-with-dagster). diff --git a/docs/docs-beta/docs/integrations/libraries/gcp/bigquery/using-bigquery-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/gcp/bigquery/using-bigquery-with-dagster.md new file mode 100644 index 0000000000000..fe1f03e348483 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/gcp/bigquery/using-bigquery-with-dagster.md @@ -0,0 +1,404 @@ +--- +title: "Using Google BigQuery with Dagster" +description: Store your Dagster assets in BigQuery +sidebar_position: 100 +--- + +This tutorial focuses on creating and interacting with BigQuery tables using Dagster's [asset definitions](/guides/build/assets/defining-assets). 
+ +The `dagster-gcp` library provides two ways to interact with BigQuery tables: + +- [Resource](/guides/build/external-resources/): The resource allows you to directly run SQL queries against tables within an asset's compute function. Available resources: `BigQueryResource` +- [I/O manager](/guides/build/io-managers/): The I/O manager transfers the responsibility of storing and loading DataFrames as BigQuery tables to Dagster. Available I/O managers: `BigQueryPandasIOManager`, `BigQueryPySparkIOManager` + +This tutorial is divided into two sections to demonstrate the differences between the BigQuery resource and the BigQuery I/O manager. Each section will create the same assets, but the first section will use the BigQuery resource to store data in BigQuery, whereas the second section will use the BigQuery I/O manager. When writing your own assets, you may choose one or the other (or both) approaches depending on your storage requirements. {/* See [When to use I/O managers](/guides/build/io-managers/#when-to-use-io-managers) to learn more about when to use I/O managers and when to use resources. */} + +In [Option 1](#option-1-using-the-bigquery-resource) you will: + +- Set up and configure the BigQuery resource. +- Use the BigQuery resource to execute a SQL query to create a table. +- Use the BigQuery resource to execute a SQL query to interact with the table. + +In [Option 2](#option-2-using-the-bigquery-io-manager) you will: + +- Set up and configure the BigQuery I/O manager. +- Use Pandas to create a DataFrame, then delegate responsibility for creating a table to the BigQuery I/O manager. +- Use the BigQuery I/O manager to load the table into memory so that you can interact with it using the Pandas library. + +By the end of the tutorial, you will: + +- Understand how to interact with a BigQuery database using the BigQuery resource. +- Understand how to use the BigQuery I/O manager to store and load DataFrames as BigQuery tables. +- Understand how to define dependencies between assets corresponding to tables in a BigQuery database. 
+ +## Prerequisites + +To complete this tutorial, you'll need: + +- **To install the `dagster-gcp` and `dagster-gcp-pandas` libraries**: + + ```shell + pip install dagster-gcp dagster-gcp-pandas + ``` + +- **To gather the following information**: + + - **Google Cloud Platform (GCP) project name**: You can find this by logging into GCP and choosing one of the project names listed in the dropdown in the top left corner. + + - **GCP credentials**: You can authenticate with GCP in two ways: by following GCP authentication instructions [here](https://cloud.google.com/docs/authentication/provide-credentials-adc), or by providing credentials directly to the BigQuery I/O manager. + + In this guide, we assume that you have run one of the `gcloud auth` commands or have set `GOOGLE_APPLICATION_CREDENTIALS` as specified in the linked instructions. For more information on providing credentials directly to the BigQuery resource and I/O manager, see [Providing credentials as configuration](reference#providing-credentials-as-configuration) in the BigQuery reference guide. + +## Option 1: Using the BigQuery resource + +### Step 1: Configure the BigQuery resource + +To use the BigQuery resource, you'll need to add it to your `Definitions` object. The BigQuery resource requires some configuration: + +- A `project` +- One method of authentication. You can follow the GCP authentication instructions [here](https://cloud.google.com/docs/authentication/provide-credentials-adc), or see [Providing credentials as configuration](reference#providing-credentials-as-configuration) in the BigQuery reference guide. + +You can also specify a `location` where computation should take place. 
+ +{/* TODO convert to */} +```python file=/integrations/bigquery/tutorial/resource/configuration.py startafter=start_example endbefore=end_example +from dagster_gcp import BigQueryResource + +from dagster import Definitions + +defs = Definitions( + assets=[iris_data], + resources={ + "bigquery": BigQueryResource( + project="my-gcp-project", # required + location="us-east5", # optional, defaults to the default location for the project - see https://cloud.google.com/bigquery/docs/locations for a list of locations + ) + }, +) +``` + +### Step 2: Create tables in BigQuery + + + + + +**Create BigQuery tables in Dagster** + +Using the BigQuery resource, you can create BigQuery tables using the BigQuery Python API: + +{/* TODO convert to */} +```python file=/integrations/bigquery/tutorial/resource/create_table.py startafter=start_example endbefore=end_example +import pandas as pd +from dagster_gcp import BigQueryResource + +from dagster import asset + + +@asset +def iris_data(bigquery: BigQueryResource) -> None: + iris_df = pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + with bigquery.get_client() as client: + job = client.load_table_from_dataframe( + dataframe=iris_df, + destination="iris.iris_data", + ) + job.result() +``` + +In this example, you're defining an asset that fetches the Iris dataset as a Pandas DataFrame and renames the columns. Then, using the BigQuery resource, the DataFrame is stored in BigQuery as the `iris.iris_data` table. + +Now you can run `dagster dev` and materialize the `iris_data` asset from the Dagster UI. + + + + + +**Making Dagster aware of existing tables** + +If you already have existing tables in BigQuery and other assets defined in Dagster depend on those tables, you may want Dagster to be aware of those upstream dependencies. 
Making Dagster aware of these tables will allow you to track the full data lineage in Dagster. You can accomplish this by defining [external assets](/guides/build/assets/external-assets) for these tables. + +{/* TODO convert to */} +```python file=/integrations/bigquery/tutorial/resource/source_asset.py +from dagster import AssetSpec + +iris_harvest_data = AssetSpec(key="iris_harvest_data") +``` + +In this example, you're creating an `AssetSpec` for a pre-existing table called `iris_harvest_data`. + + + + + +### Step 3: Define downstream assets + +Once you have created an asset that represents a table in BigQuery, you will likely want to create additional assets that work with the data. + +{/* TODO convert to */} +```python file=/integrations/bigquery/tutorial/resource/downstream.py startafter=start_example endbefore=end_example +from dagster_gcp import BigQueryResource +from google.cloud import bigquery as bq + +from dagster import asset + +from .create_table import iris_data + +# this example uses the iris_data asset from Step 2 + + +@asset(deps=[iris_data]) +def iris_setosa(bigquery: BigQueryResource) -> None: + job_config = bq.QueryJobConfig(destination="iris.iris_setosa") + sql = "SELECT * FROM iris.iris_data WHERE species = 'Iris-setosa'" + + with bigquery.get_client() as client: + job = client.query(sql, job_config=job_config) + job.result() +``` + +In this asset, you're creating a second table that only contains the data for the _Iris Setosa_ species. This asset has a dependency on the `iris_data` asset. To define this dependency, you provide the `iris_data` asset as the `deps` parameter to the `iris_setosa` asset. You can then run the SQL query to create the table of _Iris Setosa_ data.
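
Note that `deps` only declares ordering and lineage: no data is passed between the two functions, and each asset talks to BigQuery itself. As a rough, conceptual sketch of what that declaration implies (this is not Dagster's scheduler, and the `graph` dictionary is a hypothetical stand-in for the asset graph), the standard-library `graphlib` module can derive a run order from the same dependency information:

```python
from graphlib import TopologicalSorter

# Hypothetical stand-in for the asset graph: each key maps to the set of
# assets it lists in `deps`. `iris_data` has no upstream dependencies.
graph = {
    "iris_data": set(),
    "iris_setosa": {"iris_data"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Here `iris_data` is always ordered before `iris_setosa`, which is why the SQL query in `iris_setosa` can assume the `iris.iris_data` table already exists.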
+ +### Completed code example + +When finished, your code should look like the following: + +{/* TODO convert to */} +```python file=/integrations/bigquery/tutorial/resource/full_example.py +import pandas as pd +from dagster_gcp import BigQueryResource +from google.cloud import bigquery as bq + +from dagster import AssetSpec, Definitions, asset + +iris_harvest_data = AssetSpec(key="iris_harvest_data") + + +@asset +def iris_data(bigquery: BigQueryResource) -> None: + iris_df = pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + with bigquery.get_client() as client: + job = client.load_table_from_dataframe( + dataframe=iris_df, + destination="iris.iris_data", + ) + job.result() + + +@asset(deps=[iris_data]) +def iris_setosa(bigquery: BigQueryResource) -> None: + job_config = bq.QueryJobConfig(destination="iris.iris_setosa") + sql = "SELECT * FROM iris.iris_data WHERE species = 'Iris-setosa'" + + with bigquery.get_client() as client: + job = client.query(sql, job_config=job_config) + job.result() + + +defs = Definitions( + assets=[iris_data, iris_setosa, iris_harvest_data], + resources={ + "bigquery": BigQueryResource( + project="my-gcp-project", + location="us-east5", + ) + }, +) +``` + +## Option 2: Using the BigQuery I/O manager + +While using an I/O manager is not required, you may want to use an I/O manager to handle storing DataFrames as tables in BigQuery and loading BigQuery tables as DataFrames in downstream assets. You may want to use an I/O manager if: + +- You want your data to be loaded in memory so that you can interact with it using Python. +- You'd like to have Dagster manage how you store the data and load it as an input in downstream assets. + +{/* TODO fix link: Using an I/O manager is not required, and you can reference [When to use I/O managers](/guides/build/io-managers/#when-to-use-io-managers) to learn more. 
*/} + +This section of the guide focuses on storing and loading Pandas DataFrames in BigQuery, but Dagster also supports using PySpark DataFrames with BigQuery. The concepts from this guide apply to working with PySpark DataFrames, and you can learn more about setting up and using the BigQuery I/O manager with PySpark DataFrames in the [reference guide](reference). + +### Step 1: Configure the BigQuery I/O manager + +To use the BigQuery I/O manager, you'll need to add it to your `Definitions` object. The BigQuery I/O manager requires some configuration to connect to your BigQuery instance: + +- A `project` +- One method of authentication. You can follow the GCP authentication instructions [here](https://cloud.google.com/docs/authentication/provide-credentials-adc), or see [Providing credentials as configuration](reference#providing-credentials-as-configuration) in the BigQuery reference guide. + +You can also specify a `location` where data should be stored and processed, and a `dataset` that should hold the created tables. When working with Pandas DataFrames, you can also set a `timeout`. + +{/* TODO convert to */} +```python file=/integrations/bigquery/tutorial/io_manager/configuration.py startafter=start_example endbefore=end_example +from dagster_gcp_pandas import BigQueryPandasIOManager + +from dagster import Definitions + +defs = Definitions( + assets=[iris_data], + resources={ + "io_manager": BigQueryPandasIOManager( + project="my-gcp-project", # required + location="us-east5", # optional, defaults to the default location for the project - see https://cloud.google.com/bigquery/docs/locations for a list of locations + dataset="IRIS", # optional, defaults to PUBLIC + timeout=15.0, # optional, defaults to None + ) + }, +) +``` + +With this configuration, if you materialized an asset called `iris_data`, the BigQuery I/O manager would store the data in the `IRIS.IRIS_DATA` table in the `my-gcp-project` project. The BigQuery instance would be located in `us-east5`.
+ +Finally, in the `Definitions` object, we assign the `BigQueryPandasIOManager` to the `io_manager` key. `io_manager` is a reserved key to set the default I/O manager for your assets. + +For more info about each of the configuration values, refer to the `BigQueryPandasIOManager` API documentation. + +### Step 2: Create tables in BigQuery \{#option-2-step-2} + +The BigQuery I/O manager can create and update tables for your Dagster-defined assets, but you can also make existing BigQuery tables available to Dagster. + + + + + +**Store a Dagster asset as a table in BigQuery** + +To store data in BigQuery using the BigQuery I/O manager, you can simply return a Pandas DataFrame from your asset. Dagster will handle storing and loading your assets in BigQuery. + +{/* TODO convert to */} +```python file=/integrations/bigquery/tutorial/io_manager/basic_example.py +import pandas as pd + +from dagster import asset + + +@asset +def iris_data() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) +``` + +In this example, you're defining an [asset](/guides/build/assets/defining-assets) that fetches the Iris dataset as a Pandas DataFrame, renames the columns, then returns the DataFrame. The type signature of the function tells the I/O manager what data type it is working with, so it is important to include the return type `pd.DataFrame`. + +When Dagster materializes the `iris_data` asset using the configuration from [Step 1: Configure the BigQuery I/O manager](#step-1-configure-the-bigquery-io-manager), the BigQuery I/O manager will create the table `IRIS.IRIS_DATA` if it does not exist and replace the contents of the table with the value returned from the `iris_data` asset. + + + + + +**Making Dagster aware of existing tables** + +If you already have existing tables in BigQuery and other assets defined in Dagster depend on those tables, you may want Dagster to be aware of those upstream dependencies.
Making Dagster aware of these tables will allow you to track the full data lineage in Dagster. You can define [external assets](/guides/build/assets/external-assets) for these tables. When using an I/O manager, defining an external asset for an existing table also allows you to tell Dagster how to find the table so it can be fetched for downstream assets. + +{/* TODO convert to */} +```python file=/integrations/bigquery/tutorial/io_manager/source_asset.py +from dagster import AssetSpec + +iris_harvest_data = AssetSpec(key="iris_harvest_data") +``` + +In this example, you're creating an `AssetSpec` for a pre-existing table - perhaps created by an external data ingestion tool - that contains data about iris harvests. To make the data available to other Dagster assets, you need to tell the BigQuery I/O manager how to find the data, so that the I/O manager can load the data into memory. + +Because you already supplied the project and dataset in the I/O manager configuration in [Step 1: Configure the BigQuery I/O manager](#step-1-configure-the-bigquery-io-manager), you only need to provide the table name. This is done with the `key` parameter in `AssetSpec`. When the I/O manager needs to load the `iris_harvest_data` in a downstream asset, it will select the data in the `IRIS.IRIS_HARVEST_DATA` table as a Pandas DataFrame and provide it to the downstream asset. + + + + +### Step 3: Load BigQuery tables in downstream assets + +Once you have created an asset that represents a table in BigQuery, you will likely want to create additional assets that work with the data. Dagster and the BigQuery I/O manager allow you to load the data stored in BigQuery tables into downstream assets.
+ +{/* TODO convert to */} +```python file=/integrations/bigquery/tutorial/io_manager/load_downstream.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import asset + +# this example uses the iris_data asset from Step 2 + + +@asset +def iris_setosa(iris_data: pd.DataFrame) -> pd.DataFrame: + return iris_data[iris_data["species"] == "Iris-setosa"] +``` + +In this asset, you're providing the `iris_data` asset from the [Store a Dagster asset as a table in BigQuery](#option-2-step-2) example as a dependency to `iris_setosa`. By supplying `iris_data` as a parameter to `iris_setosa`, Dagster knows to use the `BigQueryPandasIOManager` to load this asset into memory as a Pandas DataFrame and pass it as an argument to `iris_setosa`. Next, a DataFrame that only contains the data for the _Iris Setosa_ species is created and returned. Then the `BigQueryPandasIOManager` stores the DataFrame as the `IRIS.IRIS_SETOSA` table in BigQuery.
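
Because the I/O manager handles all reads and writes, the body of `iris_setosa` is an ordinary function of a DataFrame, so its filtering logic can be checked without a BigQuery connection. The following is a minimal sketch under that assumption: `iris_setosa_logic` is a hypothetical undecorated copy of the asset body, and `sample` is a made-up three-row stand-in for the `IRIS.IRIS_DATA` table.

```python
import pandas as pd


def iris_setosa_logic(iris_data: pd.DataFrame) -> pd.DataFrame:
    # Same filter as the iris_setosa asset body, without the @asset decorator.
    return iris_data[iris_data["species"] == "Iris-setosa"]


# Hypothetical sample rows standing in for the table the I/O manager would load.
sample = pd.DataFrame(
    {
        "sepal_length_cm": [5.1, 4.9, 7.0],
        "species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],
    }
)

result = iris_setosa_logic(sample)
print(result)
```

Only the two _Iris Setosa_ rows remain in `result`; in the real asset, that filtered DataFrame is what the I/O manager would write back to BigQuery.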
+ +### Completed code example + +When finished, your code should look like the following: + +{/* TODO convert to */} +```python file=/integrations/bigquery/tutorial/io_manager/full_example.py +import pandas as pd +from dagster_gcp_pandas import BigQueryPandasIOManager + +from dagster import AssetSpec, Definitions, asset + +iris_harvest_data = AssetSpec(key="iris_harvest_data") + + +@asset +def iris_data() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + +@asset +def iris_setosa(iris_data: pd.DataFrame) -> pd.DataFrame: + return iris_data[iris_data["species"] == "Iris-setosa"] + + +defs = Definitions( + assets=[iris_data, iris_harvest_data, iris_setosa], + resources={ + "io_manager": BigQueryPandasIOManager( + project="my-gcp-project", + location="us-east5", + dataset="IRIS", + timeout=15.0, + ) + }, +) +``` + +## Related + +For more BigQuery features, refer to the [BigQuery reference](reference). + +For more information on asset definitions, see the [Assets documentation](/guides/build/assets/). + +For more information on I/O managers, see the [I/O manager documentation](/guides/build/io-managers/). diff --git a/docs/docs-beta/docs/integrations/libraries/jupyter.md b/docs/docs-beta/docs/integrations/libraries/jupyter/index.md similarity index 61% rename from docs/docs-beta/docs/integrations/libraries/jupyter.md rename to docs/docs-beta/docs/integrations/libraries/jupyter/index.md index 74dd9c7c9080c..3747445f5131b 100644 --- a/docs/docs-beta/docs/integrations/libraries/jupyter.md +++ b/docs/docs-beta/docs/integrations/libraries/jupyter/index.md @@ -19,6 +19,16 @@ sidebar_custom_props: logo: images/integrations/jupyter.svg --- +Dagstermill eliminates the tedious "productionization" of Jupyter notebooks. 
+ +Using the Dagstermill library enables you to: + +- View notebooks directly in the Dagster UI without needing to set up a Jupyter kernel +- Define data dependencies to flow inputs and outputs from assets/ops to notebooks, between notebooks, and from notebooks to other assets/ops +- Use Dagster resources and the Dagster config system inside notebooks +- Aggregate notebook logs with logs from other Dagster assets and ops +- Yield custom materializations and other Dagster events from your notebook code + ### About Jupyter Fast iteration, the literate combination of arbitrary code with markdown blocks, and inline plotting make notebooks an indispensable tool for data science. The **Dagstermill** package makes it easy to run notebooks using the Dagster tools and to integrate them into data jobs with heterogeneous ops: for instance, Spark jobs, SQL statements run against a data warehouse, or arbitrary Python code. diff --git a/docs/docs-beta/docs/integrations/libraries/jupyter/reference.md b/docs/docs-beta/docs/integrations/libraries/jupyter/reference.md new file mode 100644 index 0000000000000..c4ceccd3259ef --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/jupyter/reference.md @@ -0,0 +1,163 @@ +--- +title: "dagstermill integration reference" +description: The Dagstermill package lets you run notebooks using the Dagster tools and integrate them into your data pipelines. +sidebar_position: 200 +--- + +This reference provides a high-level look at working with Jupyter notebooks using the [`dagstermill` integration library](/api/python-api/libraries/dagstermill). + +For a step-by-step implementation walkthrough, refer to the [Using notebooks with Dagster tutorial](using-notebooks-with-dagster). 
+ +## Notebooks as assets + +To load a Jupyter notebook as a Dagster [asset](/guides/build/assets/defining-assets), use `define_dagstermill_asset`: + +{/* TODO convert to */} +```python file=/integrations/dagstermill/iris_notebook_asset.py +from dagstermill import define_dagstermill_asset + +from dagster import file_relative_path + +iris_kmeans_notebook = define_dagstermill_asset( + name="iris_kmeans", + notebook_path=file_relative_path(__file__, "../notebooks/iris-kmeans.ipynb"), +) +``` + +In this code block, we use `define_dagstermill_asset` to create a Dagster asset. We provide the name for the asset with the `name` parameter and the path to our `.ipynb` file with the `notebook_path` parameter. The resulting asset will execute our notebook and store the resulting `.ipynb` file in a persistent location. + +## Notebooks as ops + +Dagstermill also supports running Jupyter notebooks as [ops](/guides/build/ops). We can use `define_dagstermill_op` to turn a notebook into an op: + +{/* TODO convert to */} +```python file=/integrations/dagstermill/iris_notebook_op.py startafter=start +from dagstermill import ConfigurableLocalOutputNotebookIOManager, define_dagstermill_op + +from dagster import file_relative_path, job + +k_means_iris = define_dagstermill_op( + name="k_means_iris", + notebook_path=file_relative_path(__file__, "./notebooks/iris-kmeans.ipynb"), + output_notebook_name="iris_kmeans_output", +) + + +@job( + resource_defs={ + "output_notebook_io_manager": ConfigurableLocalOutputNotebookIOManager(), + } +) +def iris_classify(): + k_means_iris() +``` + +In this code block, we use `define_dagstermill_op` to create an op that will execute the Jupyter notebook. We give the op the name `k_means_iris`, and provide the path to the notebook file. We also specify `output_notebook_name="iris_kmeans_output"`. This means that the executed notebook will be returned in a buffered file object as one of the outputs of the op, and that output will have the name `iris_kmeans_output`.
We then include the `k_means_iris` op in the `iris_classify` [job](/guides/build/jobs) and specify the `ConfigurableLocalOutputNotebookIOManager` as the `output_notebook_io_manager` to store the executed notebook file. + +## Notebook context + +If you look at one of the notebooks executed by Dagster, you'll notice that the `injected-parameters` cell in your output notebooks defines a variable called `context`. This context object mirrors the execution context object that's available in the body of any other asset or op's compute function. + +As with the parameters that `dagstermill` injects, you can also construct a context object for interactive exploration and development by using the `dagstermill.get_context` API in the tagged `parameters` cell of your input notebook. When Dagster executes your notebook, this development context will be replaced with the injected runtime context. + +You can use the development context to access asset and op config and resources, to log messages, and to yield results and other Dagster events just as you would in production. When the runtime context is injected by Dagster, none of your other code needs to change. + +For instance, suppose we want to make the number of clusters (the _k_ in k-means) configurable. 
We'll change our asset definition to include a config field: + +{/* TODO convert to */} +```python file=/integrations/dagstermill/iris_notebook_config.py startafter=start endbefore=end +from dagstermill import define_dagstermill_asset + +from dagster import AssetIn, Field, Int, file_relative_path + +iris_kmeans_jupyter_notebook = define_dagstermill_asset( + name="iris_kmeans_jupyter", + notebook_path=file_relative_path(__file__, "./notebooks/iris-kmeans.ipynb"), + group_name="template_tutorial", + ins={"iris": AssetIn("iris_dataset")}, + config_schema=Field( + Int, + default_value=3, + is_required=False, + description="The number of clusters to find", + ), +) +``` + +You can also provide `config_schema` to `define_dagstermill_op` in the same way demonstrated in this code snippet. + +In our notebook, we'll stub the context as follows (in the `parameters` cell): + +```python +import dagstermill + +context = dagstermill.get_context(op_config=3) +``` + +Now we can use our config value in our estimator. In production, this will be replaced by the config value provided to the job: + +```python +estimator = sklearn.cluster.KMeans(n_clusters=context.op_config) +``` + +## Results and custom materializations + +::: + +The functionality described in this section only works for notebooks run with `define_dagstermill_op`. If you'd like this feature to be prioritized for `define_dagstermill_asset`, give this [GitHub issue](https://github.com/dagster-io/dagster/issues/10557) a thumbs up. + +::: + +If you are using `define_dagstermill_op` and you'd like to yield a result to be consumed downstream of a notebook, you can call `dagstermill.yield_result` with the value of the result and its name. In interactive execution, this is a no-op, so you don't need to change anything when moving from interactive exploration and development to production.
+ +{/* TODO convert to */} +```python file=/integrations/dagstermill/notebook_outputs.py startafter=start_notebook endbefore=end_notebook +# my_notebook.ipynb +import dagstermill + +dagstermill.yield_result(3, output_name="my_output") +``` + +Then, in the Python file that defines the op, declare the output and consume it downstream: + +{/* TODO convert to */} +```python file=/integrations/dagstermill/notebook_outputs.py startafter=start_py_file endbefore=end_py_file +from dagstermill import ConfigurableLocalOutputNotebookIOManager, define_dagstermill_op + +from dagster import Out, file_relative_path, job, op + +my_notebook_op = define_dagstermill_op( + name="my_notebook", + notebook_path=file_relative_path(__file__, "./notebooks/my_notebook.ipynb"), + output_notebook_name="output_notebook", + outs={"my_output": Out(int)}, +) + + +@op +def add_two(x): + return x + 2 + + +@job( + resource_defs={ + "output_notebook_io_manager": ConfigurableLocalOutputNotebookIOManager(), + } +) +def my_job(): + three, _ = my_notebook_op() + add_two(three) +``` + +## Dagster events + +You can also yield Dagster events from your notebook using `dagstermill.yield_event`. + +For example, if you'd like to yield a custom `AssetMaterialization` object (for instance, to tell the Dagster UI where you've saved a plot), you can do the following: + +```python +import dagstermill +from dagster import AssetMaterialization + +dagstermill.yield_event(AssetMaterialization(asset_key="marketing_data_plotted")) +``` diff --git a/docs/docs-beta/docs/integrations/libraries/jupyter/using-notebooks-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/jupyter/using-notebooks-with-dagster.md new file mode 100644 index 0000000000000..3cd6722ef99a3 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/jupyter/using-notebooks-with-dagster.md @@ -0,0 +1,339 @@ +--- +title: "Using Jupyter notebooks with Papermill and Dagster" +description: The Dagstermill package lets you run notebooks using the Dagster tools and integrate them into your data pipelines.
+--- + +{/* TODO add back when implemented */} + +:::tip + +You can find the code for this example on [GitHub](https://github.com/dagster-io/dagster/tree/master/examples/tutorial_notebook_assets/). + +::: + +In this tutorial, we'll walk you through integrating a Jupyter notebook with Dagster using an example project. Before we get started, let's cover some common approaches to writing and integrating Jupyter notebooks with Dagster: + +- **Doing standalone development in a Jupyter notebook**. You could then create two Dagster assets: one for the notebook itself and another for data-fetching logic. This approach, which we'll use to start the tutorial, allows you to configure existing notebooks to work with Dagster. + +- **Using existing Dagster assets as input to notebooks**. If the data you want to analyze is already a Dagster asset, you can directly load the asset's value into the notebook. When the notebook is complete, you can create a Dagster asset for the notebook and factor any data-fetching logic into a second asset, if applicable. This approach allows you to develop new notebooks that work with assets that are already a part of your Dagster project. + +By the end of this tutorial, you will: + +- Explore a Jupyter notebook that fetches and explores a dataset +- Create a Dagster asset from the notebook +- Create a second Dagster asset that only fetches the dataset +- Load existing Dagster assets into a new Jupyter notebook + +## Dagster concepts + +In this guide, we'll use the following Dagster concepts: + +- [Assets](/guides/build/assets/defining-assets) - An asset is a software object that models a data asset. The prototypical example is a table in a database or a file in cloud storage. An executed Jupyter notebook file can also be an asset! That's what we'll be creating in this guide. +- [Definitions](/api/python-api/definitions) - A Dagster `Definitions` object is a collection of Dagster objects, including assets. 
+- [I/O managers](/guides/build/io-managers/) - An I/O manager handles storing and loading assets. In this guide, we'll be using a special I/O manager to store executed Jupyter notebook files. + +## Prerequisites + +To complete this tutorial, you'll need: + +- **To install Dagster and Jupyter**. Run the following to install using pip: + + ```shell + pip install dagster notebook + ``` + + Refer to the [Dagster](/getting-started/installation) installation docs for more info. + +- **To download the [`tutorial_notebook_assets`](https://github.com/dagster-io/dagster/tree/master/examples/tutorial_notebook_assets) Dagster example and install its dependencies:** + + ```shell + dagster project from-example --name tutorial_notebook_assets --example tutorial_notebook_assets + cd tutorial_notebook_assets + pip install -e ".[dev]" + ``` + + This example includes: + + - **A finished version of the tutorial project**, which you can use to check out the finished project. This is the `tutorial_finished` subfolder. + + - **A template version of the tutorial project**, which you can use to follow along with the tutorial. This is the `tutorial_template` subfolder. In this folder, you'll also find: + + - `assets`, a subfolder containing Dagster assets. We'll use `/assets.py` to write these. + - `notebooks`, a subfolder containing Jupyter notebooks. We'll use `/notebooks/iris-kmeans.ipynb` to write a Jupyter notebook. + + + +## Step 1: Explore the Jupyter notebook + +In this tutorial, we'll analyze the Iris dataset, collected in 1936 by the American botanist Edgar Anderson and made famous by statistician Ronald Fisher. The Iris dataset is a basic example of machine learning because it contains three classes of observation: one class is straightforwardly linearly separable from the other two, which can only be distinguished by more sophisticated methods. 
+ +The `/tutorial_template/notebooks/iris-kmeans.ipynb` Jupyter notebook, which is already completed for you, does some analysis on the Iris dataset. + +In the Jupyter notebook, we first fetch the Iris dataset: + +```python +# /tutorial_template/notebooks/iris-kmeans.ipynb + +iris = pd.read_csv( + "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", + names=[ + "Sepal length (cm)", + "Sepal width (cm)", + "Petal length (cm)", + "Petal width (cm)", + "Species", + ], +) +``` + +Next, we'll perform some descriptive analysis to explore the dataset. If you execute these cells, several plots of the Iris dataset will be created: + +![Iris dataset plots](/images/integrations/jupyter/descriptive-plots.png) + +Next, we conduct our K-means analysis: + +```python +estimator = sklearn.cluster.KMeans(n_clusters=3) +estimator.fit( + iris[["Sepal length (cm)", "Sepal width (cm)", "Petal length (cm)", "Petal width (cm)"]] +) +``` + +Lastly, we plot the results of the K-means analysis. From the plots, we can see that one species of Iris is separable from the other two, but a more sophisticated model will be required to distinguish the other two species: + +![kmeans plots](/images/integrations/jupyter/kmeans-plots.png) + +Like many notebooks, this example does some fairly sophisticated work, including producing diagnostic plots and a statistical model. For now, this work is locked away in the `.ipynb` format, only reproducible using a complex Jupyter setup, and only programmatically accessible within the notebook context. We'll address this in the remainder of the tutorial. + +## Step 2: Create a Dagster asset from the Jupyter Notebook + +By creating a Dagster asset from our notebook, we can integrate the notebook as part of our data platform. This enables us to make its contents more accessible to developers, stakeholders, and other assets in Dagster. + +To create a Dagster asset from a Jupyter notebook, we can use the `define_dagstermill_asset` function.
In `/tutorial_template/assets.py` add the following code snippet: + +```python +# /tutorial_template/assets.py + +from dagstermill import define_dagstermill_asset +from dagster import file_relative_path + +iris_kmeans_jupyter_notebook = define_dagstermill_asset( + name="iris_kmeans_jupyter", + notebook_path=file_relative_path(__file__, "notebooks/iris-kmeans.ipynb"), + group_name="template_tutorial", +) +``` + +If you are following along in the template code, uncomment the code block under the `TODO 1` comment. + +Using `define_dagstermill_asset`, we've created and returned a Dagster asset. Let's take a look at the arguments we provided: + +- `name` - This argument names the asset, in this case `iris_kmeans_jupyter` +- `notebook_path` - This argument tells Dagster where to find the notebook the asset should use as a source. In this case, that's our `/notebooks/iris-kmeans.ipynb` file. +- `group_name` - This optional argument places the asset into a group named `template_tutorial`, which is helpful for organizing your assets in the UI. + +When materialized, the `iris_kmeans_jupyter` asset will execute the notebook (`/notebooks/iris-kmeans.ipynb`) and store the resulting `.ipynb` file in a persistent location. + +## Step 3: Add a Dagster Definitions object and supply an I/O manager + +We want to execute our Dagster asset and save the resulting notebook to a persistent location. This is called materializing the asset. To do this, we need to add the asset to a Dagster `Definitions` object. + +Additionally, we need to provide a [resource](/guides/build/external-resources/) to the notebook to tell Dagster how to store the resulting `.ipynb` file. We'll use an [I/O manager](/guides/build/io-managers/) to accomplish this.
+ +Open the `/tutorial_template/definitions.py` file and add the following code: + +```python +# tutorial_template/definitions.py + +from dagster import load_assets_from_modules, Definitions +from dagstermill import ConfigurableLocalOutputNotebookIOManager + +from . import assets + +defs = Definitions( + assets=load_assets_from_modules([assets]), + resources={ + "output_notebook_io_manager": ConfigurableLocalOutputNotebookIOManager() + } +) + +``` + +Let's take a look at what's happening here: + +- Using `load_assets_from_modules`, we've imported all assets in the `assets` module. This approach allows any new assets we create to be automatically added to the `Definitions` object instead of needing to manually add them one by one. + +- We provided a dictionary of resources to the `resources` parameter. In this example, that's the `ConfigurableLocalOutputNotebookIOManager` resource. + + This I/O manager, bound to the `output_notebook_io_manager` key, is responsible for handling the storage of the notebook asset's resulting `.ipynb` file. + +## Step 4: Materialize the notebook asset + +Now that you've created an asset, a resource, and a `Definitions` object, it's time to materialize the notebook asset! Materializing an asset runs the op it contains and saves the results to persistent storage. + +1. To start the Dagster UI, run the following in `/tutorial_template`: + + ```shell + dagster dev + ``` + + Which will result in output similar to: + + ```shell + Serving dagster-webserver on http://127.0.0.1:3000 in process 70635 + ``` + +2. In your browser, navigate to [http://127.0.0.1:3000](http://127.0.0.1:3000). The page will display the notebook asset in the **Asset Graph**. + + If you click the notebook asset, a sidebar containing info about the asset will slide out from the right side of the page. In the **Description** section of the panel is a **View Source Notebook** button: + + ![Notebook asset in UI](/images/integrations/jupyter/ui-one.png) + + This button allows you to view the notebook directly in the UI.
When clicked, Dagster will render the notebook - referenced in the `notebook_path` parameter - that'll be executed when the `iris_kmeans_jupyter` asset is materialized: + + ![View Source Notebook display in the Dagster UI](/images/integrations/jupyter/view-source-notebook.png) + +3. Click the **Materialize** button. To view the execution as it happens, click the **View** button in the alert that displays. + +After the run completes successfully, you can view the executed notebook in the UI. Click the asset again and locate the **View Notebook** button in the **Materialization in Last Run** section of the sidebar: + +![View notebook button in materialization in last run area](/images/integrations/jupyter/ui-two.png) + +Click the button to display the executed notebook - specifically, the notebook that was executed and written to a persistent location: + +![Executed notebook display in the Dagster UI](/images/integrations/jupyter/view-executed-notebook.png) + +## Step 5: Add an upstream asset + +While our `iris-kmeans` notebook asset now materializes successfully, there are still some improvements we can make. The beginning of the notebook fetches the Iris dataset, which means that every time the notebook is materialized, the data is re-fetched. + +To address this, we can factor the Iris dataset into its own asset. This will allow us to: + +- **Use the asset as input to additional notebooks.** This means all notebooks analyzing the Iris dataset will use the same source data, which we only have to fetch once. + +- **Materialize notebooks without fetching data for each materialization.** Instead of making potentially expensive API calls, Dagster can fetch the data from the previous materialization of the Iris dataset and provide that data as input to the notebook. 
+ +In this step, you'll: + +- [Create the Iris dataset asset](#step-51-create-the-iris-dataset-asset) +- [Provide the Iris dataset as input to the notebook](#step-52-provide-the-iris_dataset-asset-to-the-notebook-asset) +- [Modify the notebook](#step-53-modify-the-notebook) + +### Step 5.1: Create the Iris dataset asset + +To create an asset for the Iris dataset, add the following code to `/tutorial_template/assets.py`: + +```python +# /tutorial_template/assets.py + +from dagstermill import define_dagstermill_asset +from dagster import asset, file_relative_path +import pandas as pd + +@asset( + group_name="template_tutorial" +) +def iris_dataset(): + return pd.read_csv( + "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", + names=[ + "Sepal length (cm)", + "Sepal width (cm)", + "Petal length (cm)", + "Petal width (cm)", + "Species", + ], + ) +``` + +If you're following along in the template tutorial, uncomment the code block under the `TODO 2` comment. + +Let's go over what's happening in this code block: + +- Using `@asset`, we create a standard Dagster asset. The name of the Python function (`iris_dataset`) is the name of the asset. +- As with the `iris_kmeans_jupyter` asset, we set the `group_name` parameter to organize our assets in the UI. +- The body of the Python function fetches the Iris dataset, renames the columns, and outputs a Pandas DataFrame. + +### Step 5.2: Provide the iris_dataset asset to the notebook asset + +Next, we need to tell Dagster that the `iris_dataset` asset is input data for the `iris-kmeans` notebook.
To do this, add the `ins` parameter to the notebook asset: + +```python +# tutorial_template/assets.py +from dagstermill import define_dagstermill_asset +from dagster import asset, file_relative_path, AssetIn +import pandas as pd + +# iris_dataset asset removed for clarity + +iris_kmeans_jupyter_notebook = define_dagstermill_asset( + name="iris_kmeans_jupyter", + notebook_path=file_relative_path(__file__, "notebooks/iris-kmeans.ipynb"), + group_name="template_tutorial", + ins={"iris": AssetIn("iris_dataset")}, # this is the new parameter! +) +``` + +If you are following along with the template tutorial, uncomment the line with the `TODO 3` comment. + +The `ins` parameter tells Dagster that the `iris_dataset` asset should be mapped to a variable named `iris` in our notebook. Recall that in our `iris-kmeans` notebook, the Iris dataset is assigned to a variable named `iris`. + +### Step 5.3: Modify the notebook + +We need to make a small change in our Jupyter notebook to allow Dagster to supply the `iris_dataset` asset as input. Behind the scenes, Dagster uses `papermill` to inject parameters into notebooks. `papermill` works by replacing the notebook cell tagged `parameters` with a custom cell that can fetch the desired data. + +To accomplish this, we need to tag the cell in the `iris-kmeans` notebook that fetches the Iris dataset. This allows us to replace the cell with the data-fetching logic that loads the `iris_dataset` asset and retain the ability to run the Jupyter notebook in a standalone context. We'll cover this in more detail later in the tutorial. + +To add the `parameters` tag, you may need to turn on the display of cell tags in Jupyter: + +1. In Jupyter, navigate to **View > Cell Toolbar > Tags**: + + ![Jupyter turn on display of cell tags](/images/integrations/jupyter/jupyter-view-menu.png) + +2. 
Click **Add Tag** to add a `parameters` tag: + + ![Jupyter add tag button](/images/integrations/jupyter/jupyter-tags.png) + +## Step 6: Materialize the assets + +Next, we'll materialize our `iris_dataset` and notebook assets. + +1. In the UI, open the **Asset Graph** page. + +2. Click the **Reload definitions** button. This ensures that the UI picks up the changes you made in the previous steps. + + At this point, the `iris_dataset` asset should display above the `iris_kmeans_jupyter` asset as an upstream dependency: + + ![Upstream Iris dataset asset](/images/integrations/jupyter/ui-three.png) + +3. Click the **Materialize all** button near the top right corner of the page, which will launch a run to materialize the assets. + +That's it! You now have working Jupyter and Dagster assets! + +## Extra credit: Fetch a Dagster asset in a Jupyter notebook + +What if you want to do additional analysis of the Iris dataset and create a new notebook? How can you accomplish this without duplicating code or re-fetching data? + +The answer is simple: use the `iris_dataset` Dagster asset! + +In the Jupyter notebook, import the Dagster `Definitions` object and use the `load_asset_value` function to load the data for the `iris_dataset` asset we created in [Step 5.1: Create the Iris dataset asset](#step-51-create-the-iris-dataset-asset): + +```python +from tutorial_template import template_tutorial + +iris = template_tutorial.load_asset_value("iris_dataset") +``` + +Then, whenever you run the notebook using Jupyter, you'll be able to work with the `iris_dataset` asset: + +```shell +jupyter notebook /path/to/new/notebook.ipynb +``` + +Behind the scenes, when `load_asset_value` is called, Dagster fetches the value of `iris_dataset` that was most recently materialized and stored by an I/O manager. + +To integrate the new notebook, follow the steps from [Step 5.3](#step-53-modify-the-notebook) to add the `parameters` tag to the cell that fetches the `iris_dataset` asset via `load_asset_value`.
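
Before wiring a new notebook into Dagster, it can help to confirm that the data-fetching cell really carries the `parameters` tag. The following is a minimal sketch using only the Python standard library. It assumes the notebook follows the Jupyter notebook JSON format, in which tags are stored in each cell's `metadata.tags` list. The helper name `find_tagged_cells` and the in-memory `notebook` dictionary are hypothetical and are not part of `dagstermill` or `papermill`:

```python
import json


def find_tagged_cells(notebook: dict, tag: str = "parameters") -> list:
    """Return the indexes of cells whose metadata lists the given tag."""
    tagged = []
    for index, cell in enumerate(notebook.get("cells", [])):
        # Cells without any tags simply have no "tags" key in their metadata.
        tags = cell.get("metadata", {}).get("tags", [])
        if tag in tags:
            tagged.append(index)
    return tagged


# Hypothetical two-cell notebook, standing in for a real .ipynb file.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": ['iris = template_tutorial.load_asset_value("iris_dataset")'],
        },
        {"cell_type": "code", "metadata": {}, "source": ["iris.describe()"]},
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
}

# Round-trip through JSON to mimic reading the notebook from disk.
loaded = json.loads(json.dumps(notebook))
print(find_tagged_cells(loaded))
```

To check a real notebook, replace the in-memory dictionary with the result of `json.load` on the `.ipynb` file. An empty result means no cell has been tagged yet.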
+ +## Conclusion + +Now we have successfully created an asset from a Jupyter notebook and integrated it with our Dagster project! To learn about additional `dagstermill` features, refer to the [Dagstermill integration reference](reference). diff --git a/docs/docs-beta/docs/integrations/libraries/looker.md b/docs/docs-beta/docs/integrations/libraries/looker/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/looker.md rename to docs/docs-beta/docs/integrations/libraries/looker/index.md diff --git a/docs/docs-beta/docs/integrations/libraries/looker/using-looker-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/looker/using-looker-with-dagster.md new file mode 100644 index 0000000000000..1431f5e34faac --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/looker/using-looker-with-dagster.md @@ -0,0 +1,185 @@ +--- +title: "Using Looker with Dagster" +description: Represent your Looker assets in Dagster +--- + +::: + +This feature is considered **experimental** + +::: + + +This guide provides instructions for using Dagster with Looker using the `dagster-looker` library. Your Looker assets, such as views, explores, and dashboards, can be represented in the Dagster asset graph, allowing you to track lineage and dependencies between Looker assets. You can also use Dagster to orchestrate Looker PDTs, allowing you to trigger refreshes of these materialized tables on a cadence or based on upstream data changes. + +## What you'll learn + +- How to represent Looker assets in the Dagster asset graph. +- How to customize asset definition metadata for these Looker assets. +- How to materialize Looker PDTs from Dagster. + +
+ Prerequisites + +- The `dagster-looker` library installed in your environment +- Familiarity with asset definitions and the Dagster asset graph +- Familiarity with Dagster resources +- Familiarity with Looker concepts, like views, explores, and dashboards +- A Looker instance +- Looker API credentials to access your Looker instance. For more information, see [Looker API authentication](https://cloud.google.com/looker/docs/api-auth) in the Looker documentation. + +
+ +## Set up your environment + +To get started, you'll need to install the `dagster` and `dagster-looker` Python packages: + +```bash +pip install dagster dagster-looker +``` + +## Represent Looker assets in the asset graph + +To load Looker assets into the Dagster asset graph, you must first construct a `LookerResource`, which allows Dagster to communicate with your Looker instance. You'll need to supply your Looker instance URL and API credentials, which can be passed directly or accessed from the environment using `EnvVar`. + +Dagster can automatically load all views, explores, and dashboards from your Looker instance as asset specs. Call the `load_looker_asset_specs` function, which returns a list of `AssetSpec` objects representing your Looker assets. You can then include these asset specs in your `Definitions` object: + +{/* TODO convert to */} +```python file=/integrations/looker/representing-looker-assets.py +from dagster_looker import LookerResource, load_looker_asset_specs + +import dagster as dg + +looker_resource = LookerResource( + client_id=dg.EnvVar("LOOKERSDK_CLIENT_ID"), + client_secret=dg.EnvVar("LOOKERSDK_CLIENT_SECRET"), + base_url=dg.EnvVar("LOOKERSDK_HOST_URL"), +) + +looker_specs = load_looker_asset_specs(looker_resource=looker_resource) +defs = dg.Definitions(assets=[*looker_specs], resources={"looker": looker_resource}) +``` + +## Load Looker assets from filtered dashboards and explores + +You can load a subset of your Looker assets by providing a `LookerFilter` to the `load_looker_asset_specs` function. All dashboards contained in the folders provided to your `LookerFilter` will be fetched. Additionally, you can fetch only the explores used in these dashboards by passing `only_fetch_explores_used_in_dashboards=True` to your `LookerFilter`. + +Note that the content and size of your Looker instance may affect the performance of your Dagster deployments. Filtering the dashboards and explores from which your Looker assets are loaded is particularly useful for improving loading times.
+ +{/* TODO convert to */} +```python file=/integrations/looker/filtering-looker-assets.py +from dagster_looker import LookerFilter, LookerResource, load_looker_asset_specs + +import dagster as dg + +looker_resource = LookerResource( + client_id=dg.EnvVar("LOOKERSDK_CLIENT_ID"), + client_secret=dg.EnvVar("LOOKERSDK_CLIENT_SECRET"), + base_url=dg.EnvVar("LOOKERSDK_HOST_URL"), +) + +looker_specs = load_looker_asset_specs( + looker_resource=looker_resource, + looker_filter=LookerFilter( + dashboard_folders=[ + ["my_folder", "my_subfolder"], + ["my_folder", "my_other_subfolder"], + ], + only_fetch_explores_used_in_dashboards=True, + ), +) +defs = dg.Definitions(assets=[*looker_specs], resources={"looker": looker_resource}) +``` + +### Customize asset definition metadata for Looker assets + +By default, Dagster will generate asset specs for each Looker asset based on its type, and populate default metadata. You can further customize asset properties by passing a custom `DagsterLookerApiTranslator` subclass to the `load_looker_asset_specs` function. This subclass can implement methods to customize the asset specs for each Looker asset type. + +{/* TODO convert to */} +```python file=/integrations/looker/customize-looker-assets.py +from dagster_looker import ( + DagsterLookerApiTranslator, + LookerApiTranslatorStructureData, + LookerResource, + LookerStructureType, + load_looker_asset_specs, +) + +import dagster as dg + +looker_resource = LookerResource( + client_id=dg.EnvVar("LOOKERSDK_CLIENT_ID"), + client_secret=dg.EnvVar("LOOKERSDK_CLIENT_SECRET"), + base_url=dg.EnvVar("LOOKERSDK_HOST_URL"), +) + + +class CustomDagsterLookerApiTranslator(DagsterLookerApiTranslator): + def get_asset_spec( + self, looker_structure: LookerApiTranslatorStructureData + ) -> dg.AssetSpec: + # We create the default asset spec using super() + default_spec = super().get_asset_spec(looker_structure) + # We customize the team owner tag for all assets, + # and we customize the asset key prefix only for dashboards.
+ return default_spec.replace_attributes( + key=( + default_spec.key.with_prefix("looker") + if looker_structure.structure_type == LookerStructureType.DASHBOARD + else default_spec.key + ), + owners=["team:my_team"], + ) + + +looker_specs = load_looker_asset_specs( + looker_resource, dagster_looker_translator=CustomDagsterLookerApiTranslator() +) +defs = dg.Definitions(assets=[*looker_specs], resources={"looker": looker_resource}) +``` + +Note that `super()` is called in each of the overridden methods to generate the default asset spec. It is best practice to generate the default asset spec before customizing it. + +### Materialize Looker PDTs from Dagster + +You can use Dagster to orchestrate the materialization of Looker PDTs. To model PDTs as assets, build their asset definitions by passing a list of `RequestStartPdtBuild` objects to the `build_looker_pdt_assets_definitions` function. + +{/* TODO convert to */} +```python file=/integrations/looker/materializing-looker-pdts.py +from dagster_looker import ( + LookerResource, + RequestStartPdtBuild, + build_looker_pdt_assets_definitions, + load_looker_asset_specs, +) + +import dagster as dg + +looker_resource = LookerResource( + client_id=dg.EnvVar("LOOKERSDK_CLIENT_ID"), + client_secret=dg.EnvVar("LOOKERSDK_CLIENT_SECRET"), + base_url=dg.EnvVar("LOOKERSDK_HOST_URL"), +) + +looker_specs = load_looker_asset_specs(looker_resource=looker_resource) + +pdts = build_looker_pdt_assets_definitions( + resource_key="looker", + request_start_pdt_builds=[ + RequestStartPdtBuild(model_name="my_model", view_name="my_view") + ], +) + + +defs = dg.Definitions( + assets=[*pdts, *looker_specs], + resources={"looker": looker_resource}, +) +``` + +### Related + +- [`dagster-looker` API reference](/api/python-api/libraries/dagster-looker) +- [Asset definitions](/guides/build/assets/defining-assets) +- [Resources](/guides/build/external-resources/) +- [Using environment variables and secrets](/guides/deploy/using-environment-variables-and-secrets) diff --git
a/docs/docs-beta/docs/integrations/libraries/mssql-bulk-copy-tool b/docs/docs-beta/docs/integrations/libraries/mssql-bulk-copy-tool.md similarity index 94% rename from docs/docs-beta/docs/integrations/libraries/mssql-bulk-copy-tool rename to docs/docs-beta/docs/integrations/libraries/mssql-bulk-copy-tool.md index baf5e15fd0adc..98d13168ad8b3 100644 --- a/docs/docs-beta/docs/integrations/libraries/mssql-bulk-copy-tool +++ b/docs/docs-beta/docs/integrations/libraries/mssql-bulk-copy-tool.md @@ -14,7 +14,7 @@ enabledBy: enables: tags: [community-supported, etl] sidebar_custom_props: - logo: + logo: images/integrations/mssql.png --- The community-supported `dagster-mssql-bcp` package is a custom Dagster I/O manager for loading data into SQL Server using the bcp utility. diff --git a/docs/docs-beta/docs/integrations/libraries/openai.md b/docs/docs-beta/docs/integrations/libraries/openai/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/openai.md rename to docs/docs-beta/docs/integrations/libraries/openai/index.md diff --git a/docs/docs-beta/docs/integrations/libraries/openai/using-openai-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/openai/using-openai-with-dagster.md new file mode 100644 index 0000000000000..4a34c4df37a19 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/openai/using-openai-with-dagster.md @@ -0,0 +1,125 @@ +--- +title: "OpenAI & Dagster | Dagster Docs" +description: "The dagster-openai library provides the ability to build OpenAI pipelines with Dagster and log OpenAI API usage metadata in Dagster Insights." +--- + +::: + +This feature is considered **experimental** + +::: + +The `dagster-openai` library allows you to build OpenAI pipelines with Dagster and log OpenAI API usage metadata in [Dagster Insights](/dagster-plus/features/insights). 
+ +Using this library's `OpenAIResource`, you can easily interact with the [OpenAI REST API](https://platform.openai.com/docs/introduction) via the [OpenAI Python API](https://github.com/openai/openai-python). + +When used with Dagster's [asset definitions](/guides/build/assets/defining-assets), the resource automatically logs OpenAI usage metadata in [asset metadata](/guides/build/assets/metadata-and-tags/). + +## Getting started + +Before you get started with the `dagster-openai` library, we recommend familiarizing yourself with the [OpenAI Python API library](https://github.com/openai/openai-python), which this integration uses to interact with the [OpenAI REST API](https://platform.openai.com/docs/introduction). + +## Prerequisites + +To get started, install the `dagster` and `dagster-openai` Python packages: + +```bash +pip install dagster dagster-openai +``` + +Note that you will need an OpenAI [API key](https://platform.openai.com/api-keys) to use the resource, which can be generated in your OpenAI account. + + +## Connecting to OpenAI + +The first step in using OpenAI with Dagster is to tell Dagster how to connect to an OpenAI client using an OpenAI [resource](/guides/build/external-resources/). This resource contains the credentials needed to interact with the OpenAI API. + +We will supply our credentials as environment variables by adding them to a `.env` file. For more information on setting environment variables in a production setting, see [Using environment variables and secrets](/guides/deploy/using-environment-variables-and-secrets). + +```bash +# .env + +OPENAI_API_KEY=...
+``` + +Then, we can instruct Dagster to authorize the OpenAI resource using the environment variables: + +```python startafter=start_example endbefore=end_example file=/integrations/openai/resource.py +from dagster_openai import OpenAIResource + +from dagster import EnvVar + +# Pull API key from environment variables +openai = OpenAIResource( + api_key=EnvVar("OPENAI_API_KEY"), +) +``` + +## Using the OpenAI resource with assets + +The OpenAI resource can be used in assets in order to interact with the OpenAI API. Note that in this example, we supply our credentials as environment variables directly when instantiating the object. + +{/* TODO convert to */} +```python startafter=start_example endbefore=end_example file=/integrations/openai/assets.py +from dagster_openai import OpenAIResource + +from dagster import AssetExecutionContext, Definitions, EnvVar, asset, define_asset_job + + +@asset(compute_kind="OpenAI") +def openai_asset(context: AssetExecutionContext, openai: OpenAIResource): + with openai.get_client(context) as client: + client.chat.completions.create( + model="gpt-3.5-turbo", + messages=[{"role": "user", "content": "Say this is a test."}], + ) + + +openai_asset_job = define_asset_job(name="openai_asset_job", selection="openai_asset") + +defs = Definitions( + assets=[openai_asset], + jobs=[openai_asset_job], + resources={ + "openai": OpenAIResource(api_key=EnvVar("OPENAI_API_KEY")), + }, +) +``` + +After materializing your asset, your OpenAI API usage metadata will be available in the **Events** and **Plots** tabs of your asset in the Dagster UI. If you are using [Dagster+](/dagster-plus), your usage metadata will also be available in [Dagster Insights](/dagster-plus/features/insights). {/* Refer to the [Viewing and materializing assets in the UI guide](https://docs.dagster.io/guides/build/assets/defining-assets#viewing-and-materializing-assets-in-the-ui) for more information. 
*/} + +## Using the OpenAI resource with ops + +The OpenAI resource can also be used in [ops](/guides/build/ops). + +:::note + +Currently, the OpenAI resource doesn't (out-of-the-box) log OpenAI usage metadata when used in ops. + +::: + +{/* TODO convert to */} +```python startafter=start_example endbefore=end_example file=/integrations/openai/ops.py +from dagster_openai import OpenAIResource + +from dagster import Definitions, EnvVar, GraphDefinition, OpExecutionContext, op + + +@op +def openai_op(context: OpExecutionContext, openai: OpenAIResource): + with openai.get_client(context) as client: + client.chat.completions.create( + model="gpt-3.5-turbo", + messages=[{"role": "user", "content": "Say this is a test"}], + ) + + +openai_op_job = GraphDefinition(name="openai_op_job", node_defs=[openai_op]).to_job() + +defs = Definitions( + jobs=[openai_op_job], + resources={ + "openai": OpenAIResource(api_key=EnvVar("OPENAI_API_KEY")), + }, +) +``` diff --git a/docs/docs-beta/docs/integrations/libraries/pandas.md b/docs/docs-beta/docs/integrations/libraries/pandas/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/pandas.md rename to docs/docs-beta/docs/integrations/libraries/pandas/index.md diff --git a/docs/docs-beta/docs/integrations/libraries/pandas/using-pandas-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/pandas/using-pandas-with-dagster.md new file mode 100644 index 0000000000000..7fc308ad0ead0 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/pandas/using-pandas-with-dagster.md @@ -0,0 +1,146 @@ +--- +title: "Pandas & Dagster | Dagster Docs" +description: "The dagster-pandas library provides the ability to perform data validation, emit summary statistics, and enable reliable dataframe serialization/deserialization." +--- + +# Pandas & Dagster + +:::note + +This page describes the `dagster-pandas` library, which is used for performing data validation. 
To simply use pandas with Dagster, start with the [Dagster Quickstart](/getting-started/quickstart) + +Dagster makes it easy to use pandas code to manipulate data and then store +that data in other systems such as [files on Amazon S3](/api/python-api/libraries/dagster-aws#dagster_aws.s3.s3_pickle_io_manager) or [tables in Snowflake](/integrations/libraries/snowflake/using-snowflake-with-dagster) + +::: + +- [Creating Dagster DataFrame Types](#creating-dagster-dataframe-types) +- [Dagster DataFrame Level Validation](#dagster-dataframe-level-validation) +- [Dagster DataFrame Summary Statistics](#dagster-dataframe-summary-statistics) + +The `dagster_pandas` library provides the ability to perform data validation, emit summary statistics, and enable reliable dataframe serialization/deserialization. On top of this, the Dagster type system generates documentation of your dataframe constraints and makes it accessible in the Dagster UI. + +## Creating Dagster DataFrame Types + +To create a custom `dagster_pandas` type, use `create_dagster_pandas_dataframe_type` and provide a list of `PandasColumn` objects which specify column-level schema and constraints. 
For example, we can construct a custom dataframe type to represent a set of e-bike trips in the following way: + +{/* TODO convert to */} +```python file=/legacy/dagster_pandas_guide/core_trip.py startafter=start_core_trip_marker_0 endbefore=end_core_trip_marker_0 +TripDataFrame = create_dagster_pandas_dataframe_type( + name="TripDataFrame", + columns=[ + PandasColumn.integer_column("bike_id", min_value=0), + PandasColumn.categorical_column("color", categories={"red", "green", "blue"}), + PandasColumn.datetime_column( + "start_time", min_datetime=Timestamp(year=2020, month=2, day=10) + ), + PandasColumn.datetime_column( + "end_time", min_datetime=Timestamp(year=2020, month=2, day=10) + ), + PandasColumn.string_column("station"), + PandasColumn.exists("amount_paid"), + PandasColumn.boolean_column("was_member"), + ], +) +``` + +Once our custom data type is defined, we can use it as the type declaration for the inputs / outputs of our ops: + +{/* TODO convert to */} +```python file=/legacy/dagster_pandas_guide/core_trip.py startafter=start_core_trip_marker_1 endbefore=end_core_trip_marker_1 +@op(out=Out(TripDataFrame)) +def load_trip_dataframe() -> DataFrame: + return read_csv( + file_relative_path(__file__, "./ebike_trips.csv"), + parse_dates=["start_time", "end_time"], + date_parser=lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S.%f"), + dtype={"color": "category"}, + ) +``` + +By passing in these `PandasColumn` objects, we are expressing the schema and constraints we expect our dataframes to follow when Dagster performs type checks for our ops. Moreover, if we go to the op viewer, we can see our schema documented in the UI: + +![tutorial2](/images/integrations/pandas/tutorial2.png) + +## Dagster DataFrame Level Validation + +Now that we have a custom dataframe type that performs schema validation during a run, we can express dataframe-level constraints (e.g., the number of rows or columns).
+ +To do this, we provide a list of dataframe constraints to `create_dagster_pandas_dataframe_type`; for example, using `RowCountConstraint`. More information on the available constraints can be found in the `dagster_pandas` [API docs](/api/python-api/libraries/dagster-pandas). + +This looks like: + +{/* TODO convert to */} +```python file=/legacy/dagster_pandas_guide/shape_constrained_trip.py startafter=start_create_type endbefore=end_create_type +ShapeConstrainedTripDataFrame = create_dagster_pandas_dataframe_type( + name="ShapeConstrainedTripDataFrame", dataframe_constraints=[RowCountConstraint(4)] +) +``` + +If we rerun the above example with this dataframe, nothing should change. However, if we pass in 100 to the row count constraint, we can watch our job fail that type check. + +## Dagster DataFrame Summary Statistics + +Aside from constraint validation, `create_dagster_pandas_dataframe_type` also takes in a summary statistics function that emits metadata dictionaries which are surfaced during runs. Since data systems seldom control the quality of the data they receive, it becomes important to monitor data as it flows through your systems. In complex jobs, this can help debug and monitor data drift over time. 
Let's illustrate how this works in our example: + +{/* TODO convert to */} +```python file=/legacy/dagster_pandas_guide/summary_stats.py startafter=start_summary endbefore=end_summary +def compute_trip_dataframe_summary_statistics(dataframe): + return { + "min_start_time": min(dataframe["start_time"]).strftime("%Y-%m-%d"), + "max_end_time": max(dataframe["end_time"]).strftime("%Y-%m-%d"), + "num_unique_bikes": str(dataframe["bike_id"].nunique()), + "n_rows": len(dataframe), + "columns": str(dataframe.columns), + } + + +SummaryStatsTripDataFrame = create_dagster_pandas_dataframe_type( + name="SummaryStatsTripDataFrame", + metadata_fn=compute_trip_dataframe_summary_statistics, +) +``` + +Now if we run this job in the UI launchpad, we can see that the `SummaryStatsTripDataFrame` type is displayed in the logs along with the emitted metadata. + +![tutorial1.png](/images/integrations/pandas/tutorial1.png) + +## Dagster DataFrame Custom Validation + +`PandasColumn` is user-pluggable with custom constraints. They can be constructed directly and passed a list of `ColumnConstraint` objects. + +To tie this back to our example, let's say that we want to validate that the amount paid for an e-bike must be in 5 dollar increments because that is the price per mile rounded up. As a result, let's implement a `DivisibleByFiveConstraint`. All it needs is a `markdown_description` for the UI, which accepts and renders markdown syntax, an `error_description` for error logs, and a validation method that throws a `ColumnConstraintViolationException` if a row fails validation.
This would look like the following: + +{/* TODO convert to */} +```python file=/legacy/dagster_pandas_guide/custom_column_constraint.py startafter=start_custom_col endbefore=end_custom_col +class DivisibleByFiveConstraint(ColumnConstraint): + def __init__(self): + message = "Value must be divisible by 5" + super().__init__(error_description=message, markdown_description=message) + + def validate(self, dataframe, column_name): + rows_with_unexpected_buckets = dataframe[ + dataframe[column_name].apply(lambda x: x % 5 != 0) + ] + if not rows_with_unexpected_buckets.empty: + raise ColumnConstraintViolationException( + constraint_name=self.name, + constraint_description=self.error_description, + column_name=column_name, + offending_rows=rows_with_unexpected_buckets, + ) + + +CustomTripDataFrame = create_dagster_pandas_dataframe_type( + name="CustomTripDataFrame", + columns=[ + PandasColumn( + "amount_paid", + constraints=[ + ColumnDTypeInSetConstraint({"int64"}), + DivisibleByFiveConstraint(), + ], + ) + ], +) +``` diff --git a/docs/docs-beta/docs/integrations/libraries/pandera.md b/docs/docs-beta/docs/integrations/libraries/pandera/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/pandera.md rename to docs/docs-beta/docs/integrations/libraries/pandera/index.md diff --git a/docs/docs-beta/docs/integrations/libraries/pandera/using-pandera-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/pandera/using-pandera-with-dagster.md new file mode 100644 index 0000000000000..9420109470cc4 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/pandera/using-pandera-with-dagster.md @@ -0,0 +1,89 @@ +--- +title: "Pandera & Dagster | Dagster Docs" +description: Generate Dagster types for dataframes with Pandera. +--- + +# Pandera & Dagster + +The `dagster-pandera` integration library provides an API for generating [Dagster Types](/api/python-api/types) from [Pandera](https://github.com/pandera-dev/pandera) dataframe schemas. 
Like all Dagster types, `dagster-pandera`-generated types can be used to annotate [op](/guides/build/ops) inputs and outputs. + +Using Pandera with Dagster allows you to: + +- Visualize the shape of the data by displaying dataframe structure information in the Dagster UI +- Implement runtime type-checking with rich error reporting + +## Limitations + +Currently, `dagster-pandera` only supports pandas and Polars dataframes, despite Pandera supporting validation on other dataframe backends. + +## Prerequisites + +To get started, you'll need: + +- [To install](/getting-started/installation) the `dagster` and `dagster-pandera` Python packages: + + ```bash + pip install dagster dagster-pandera + ``` + +- Familiarity with [Dagster Types](/api/python-api/types) + +## Usage + +The `dagster-pandera` library exposes only a single public function, `pandera_schema_to_dagster_type`, which generates Dagster types from Pandera schemas. The Dagster type wraps the Pandera schema and invokes the schema's `validate()` method inside its type check function.
+ +{/* TODO convert to */} +```python file=/integrations/pandera/example.py +import random + +import pandas as pd +import pandera as pa +from dagster_pandera import pandera_schema_to_dagster_type +from pandera.typing import Series + +from dagster import Out, job, op + +APPLE_STOCK_PRICES = { + "name": ["AAPL", "AAPL", "AAPL", "AAPL", "AAPL"], + "date": ["2018-01-22", "2018-01-23", "2018-01-24", "2018-01-25", "2018-01-26"], + "open": [177.3, 177.3, 177.25, 174.50, 172.0], + "close": [177.0, 177.04, 174.22, 171.11, 171.51], +} + + +class StockPrices(pa.DataFrameModel): + """Open/close prices for one or more stocks by day.""" + + name: Series[str] = pa.Field(description="Ticker symbol of stock") + date: Series[str] = pa.Field(description="Date of prices") + open: Series[float] = pa.Field(ge=0, description="Price at market open") + close: Series[float] = pa.Field(ge=0, description="Price at market close") + + +@op(out=Out(dagster_type=pandera_schema_to_dagster_type(StockPrices))) +def apple_stock_prices_dirty(): + prices = pd.DataFrame(APPLE_STOCK_PRICES) + i = random.choice(prices.index) + prices.loc[i, "open"] = pd.NA + prices.loc[i, "close"] = pd.NA + return prices + + +@job +def stocks_job(): + apple_stock_prices_dirty() +``` + +In the above example, we defined a toy job (`stocks_job`) with a single op, `apple_stock_prices_dirty`. This op returns a pandas `DataFrame` containing the opening and closing prices of Apple stock (AAPL) for a random week. The `_dirty` suffix is included because we've corrupted the data with a few random nulls. + +Let's look at this job in the UI: + +![Pandera job in the Dagster UI](/images/integrations/pandera/schema.png) + +Notice that information from the `StockPrices` Pandera schema is rendered in the asset detail area of the right sidebar. This is possible because `pandera_schema_to_dagster_type` extracts this information from the Pandera schema and attaches it to the returned Dagster type.
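
To make the wrapping idea concrete, here is a simplified, standard-library-only sketch of a type whose check function delegates to a schema's `validate()` method and whose description is built from the schema's field descriptions. Everything in it (`ToySchema`, `ToyDagsterType`, `schema_to_type`) is a hypothetical stand-in for illustration. It is not how `dagster-pandera` is implemented, and it validates a list of dictionaries rather than a real dataframe:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ToySchema:
    """Hypothetical stand-in for a Pandera schema: field name -> description."""

    name: str
    fields: Dict[str, str]

    def validate(self, rows: List[dict]) -> None:
        # Treat every field as non-nullable, mirroring Pandera's default.
        for i, row in enumerate(rows):
            for field in self.fields:
                if row.get(field) is None:
                    raise ValueError(f"row {i}: null value in column '{field}'")


@dataclass
class ToyDagsterType:
    """Hypothetical stand-in for a Dagster type: a description plus a check."""

    name: str
    description: str
    type_check: Callable[[List[dict]], Tuple[bool, str]]


def schema_to_type(schema: ToySchema) -> ToyDagsterType:
    def type_check(rows: List[dict]) -> Tuple[bool, str]:
        # Delegate to the schema and convert exceptions into a check result.
        try:
            schema.validate(rows)
        except ValueError as err:
            return False, str(err)
        return True, "ok"

    # Surface the schema's field documentation on the generated type.
    description = "\n".join(f"{k}: {v}" for k, v in schema.fields.items())
    return ToyDagsterType(schema.name, description, type_check)


stock_prices = ToySchema(
    "StockPrices",
    {"open": "Price at market open", "close": "Price at market close"},
)
stock_type = schema_to_type(stock_prices)

clean = [{"open": 177.3, "close": 177.0}]
dirty = [{"open": 177.3, "close": 177.0}, {"open": None, "close": None}]

print(stock_type.type_check(clean))
print(stock_type.type_check(dirty))
```

The design point this sketch tries to capture is that the schema stays the single source of truth: the same object supplies both the documentation shown alongside the type and the validation run against each output.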
+ +If we try to run `stocks_job`, our run will fail. This is expected, as our (dirty) data contains nulls and Pandera columns are non-nullable by default. The [Dagster Type](/api/python-api/types) returned by `pandera_schema_to_dagster_type` contains a type check function that calls `StockPrices.validate()`. This is invoked automatically on the return value of `apple_stock_prices_dirty`, leading to a type check failure. + +Notice the `STEP_OUTPUT` event in the following screenshot to see Pandera's full output: + +![Error report for a Pandera job in the Dagster UI](/images/integrations/pandera/error-report.png) diff --git a/docs/docs-beta/docs/integrations/libraries/powerbi.md b/docs/docs-beta/docs/integrations/libraries/powerbi/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/powerbi.md rename to docs/docs-beta/docs/integrations/libraries/powerbi/index.md diff --git a/docs/docs-beta/docs/integrations/libraries/powerbi/using-powerbi-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/powerbi/using-powerbi-with-dagster.md new file mode 100644 index 0000000000000..5f7f9118b9429 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/powerbi/using-powerbi-with-dagster.md @@ -0,0 +1,273 @@ +--- +title: "Using Power BI with Dagster" +description: Represent your Power BI assets in Dagster +--- + +::: + +This feature is considered **experimental** + +::: + +This guide provides instructions for using Dagster with Power BI using the [`dagster-powerbi`](/api/python-api/libraries/dagster-powerbi) library. Your Power BI assets, such as semantic models, data sources, reports, and dashboards, can be represented in the Dagster asset graph, allowing you to track lineage and dependencies between Power BI assets and upstream data assets you are already modeling in Dagster. 
You can also use Dagster to orchestrate Power BI semantic models, allowing you to trigger refreshes of these models on a cadence or based on upstream data changes. + +## What you'll learn + +- How to represent Power BI assets in the Dagster asset graph, including lineage to other Dagster assets. +- How to customize asset definition metadata for these Power BI assets. +- How to materialize Power BI semantic models from Dagster. +- How to customize how Power BI semantic models are materialized. + +
+ Prerequisites + +- The `dagster` and `dagster-powerbi` libraries installed in your environment +- Familiarity with asset definitions and the Dagster asset graph +- Familiarity with Dagster resources +- Familiarity with Power BI concepts, like semantic models, data sources, reports, and dashboards +- A Power BI workspace +- A service principal configured to access Power BI, or an API access token. For more information, see [Embed Power BI content with service principal and an application secret](https://learn.microsoft.com/en-us/power-bi/developer/embedded/embed-service-principal) in the Power BI documentation. + +
+ +## Set up your environment + +To get started, you'll need to install the `dagster` and `dagster-powerbi` Python packages: + +```bash +pip install dagster dagster-powerbi +``` + +## Represent Power BI assets in the asset graph + +To load Power BI assets into the Dagster asset graph, you must first construct a `PowerBIWorkspace` resource, which allows Dagster to communicate with your Power BI workspace. You'll need to supply your workspace ID and credentials. You may configure a service principal or use an API access token, which can be passed directly or accessed from the environment using `EnvVar`. + +Dagster can automatically load all semantic models, data sources, reports, and dashboards from your Power BI workspace as asset specs. Call the `load_powerbi_asset_specs` function, which returns a list of `AssetSpec`s representing your Power BI assets. You can then include these asset specs in your `Definitions` object: + +{/* TODO convert to */} +```python file=/integrations/power-bi/representing-power-bi-assets.py +from dagster_powerbi import ( + PowerBIServicePrincipal, + PowerBIToken, + PowerBIWorkspace, + load_powerbi_asset_specs, +) + +import dagster as dg + +# Connect using a service principal +power_bi_workspace = PowerBIWorkspace( + credentials=PowerBIServicePrincipal( + client_id=dg.EnvVar("POWER_BI_CLIENT_ID"), + client_secret=dg.EnvVar("POWER_BI_CLIENT_SECRET"), + tenant_id=dg.EnvVar("POWER_BI_TENANT_ID"), + ), + workspace_id=dg.EnvVar("POWER_BI_WORKSPACE_ID"), +) + +# Alternatively, connect directly using an API access token +power_bi_workspace = PowerBIWorkspace( + credentials=PowerBIToken(api_token=dg.EnvVar("POWER_BI_API_TOKEN")), + workspace_id=dg.EnvVar("POWER_BI_WORKSPACE_ID"), +) + +power_bi_specs = load_powerbi_asset_specs(power_bi_workspace) +defs = dg.Definitions( + assets=[*power_bi_specs], resources={"power_bi": power_bi_workspace} +) +``` + +By default, Dagster will attempt to snapshot your entire workspace using Power BI's [metadata scanner 
APIs](https://learn.microsoft.com/en-us/fabric/governance/metadata-scanning-overview), which are able to retrieve more detailed information about your Power BI assets, but rely on the workspace being configured to allow this access. + +If you encounter issues with the scanner APIs, you may disable them using `load_powerbi_asset_specs(power_bi_workspace, use_workspace_scan=False)`. + +### Customize asset definition metadata for Power BI assets + +By default, Dagster will generate asset specs for each Power BI asset based on its type, and populate default metadata. You can further customize asset properties by passing a custom `DagsterPowerBITranslator` subclass to the `load_powerbi_asset_specs` function. This subclass can implement methods to customize the asset specs for each Power BI asset type. + +{/* TODO convert to */} +```python file=/integrations/power-bi/customize-power-bi-asset-defs.py +from dagster_powerbi import ( + DagsterPowerBITranslator, + PowerBIServicePrincipal, + PowerBIWorkspace, + load_powerbi_asset_specs, +) +from dagster_powerbi.translator import PowerBIContentType, PowerBITranslatorData + +import dagster as dg + +power_bi_workspace = PowerBIWorkspace( + credentials=PowerBIServicePrincipal( + client_id=dg.EnvVar("POWER_BI_CLIENT_ID"), + client_secret=dg.EnvVar("POWER_BI_CLIENT_SECRET"), + tenant_id=dg.EnvVar("POWER_BI_TENANT_ID"), + ), + workspace_id=dg.EnvVar("POWER_BI_WORKSPACE_ID"), +) + + +# A translator class lets us customize properties of the built +# Power BI assets, such as the owners or asset key +class MyCustomPowerBITranslator(DagsterPowerBITranslator): + def get_asset_spec(self, data: PowerBITranslatorData) -> dg.AssetSpec: + # We create the default asset spec using super() + default_spec = super().get_asset_spec(data) + # We customize the team owner tag for all assets, + # and we customize the asset key prefix only for dashboards. 
+ return default_spec.replace_attributes( + key=( + default_spec.key.with_prefix("prefix") + if data.content_type == PowerBIContentType.DASHBOARD + else default_spec.key + ), + owners=["team:my_team"], + ) + + +power_bi_specs = load_powerbi_asset_specs( + power_bi_workspace, dagster_powerbi_translator=MyCustomPowerBITranslator() +) +defs = dg.Definitions( + assets=[*power_bi_specs], resources={"power_bi": power_bi_workspace} +) +``` + +Note that `super()` is called in each of the overridden methods to generate the default asset spec. It is best practice to generate the default asset spec before customizing it. + +### Load Power BI assets from multiple workspaces + +Definitions from multiple Power BI workspaces can be combined by instantiating multiple resources and merging their specs. This lets you view all your Power BI assets in a single asset graph: + +{/* TODO convert to */} +```python file=/integrations/power-bi/multiple-power-bi-workspaces.py +from dagster_powerbi import ( + PowerBIServicePrincipal, + PowerBIWorkspace, + load_powerbi_asset_specs, +) + +import dagster as dg + +credentials = PowerBIServicePrincipal( + client_id=dg.EnvVar("POWER_BI_CLIENT_ID"), + client_secret=dg.EnvVar("POWER_BI_CLIENT_SECRET"), + tenant_id=dg.EnvVar("POWER_BI_TENANT_ID"), +) + +sales_team_workspace = PowerBIWorkspace( + credentials=credentials, + workspace_id="726c94ff-c408-4f43-8edf-61fbfa1753c7", +) + +marketing_team_workspace = PowerBIWorkspace( + credentials=credentials, + workspace_id="8b7f815d-4e64-40dd-993c-cfa4fb12edee", +) + +sales_team_specs = load_powerbi_asset_specs(sales_team_workspace) +marketing_team_specs = load_powerbi_asset_specs(marketing_team_workspace) + +# Merge the specs into a single set of definitions +defs = dg.Definitions( + assets=[*sales_team_specs, *marketing_team_specs], + resources={ + "marketing_power_bi": marketing_team_workspace, + "sales_power_bi": sales_team_workspace, + }, +) +``` + +## Materialize Power BI semantic models from Dagster + 
+Dagster's default behavior is to pull in representations of Power BI semantic models as external assets, which appear in the asset graph but can't be materialized. However, you can build executable asset definitions that trigger the refresh of Power BI semantic models. The `build_semantic_model_refresh_asset_definition` utility will construct an asset definition that triggers a refresh of a semantic model when materialized. + +{/* TODO convert to */} +```python file=/integrations/power-bi/materialize-semantic-models.py +from dagster_powerbi import ( + PowerBIServicePrincipal, + PowerBIWorkspace, + build_semantic_model_refresh_asset_definition, + load_powerbi_asset_specs, +) + +import dagster as dg + +power_bi_workspace = PowerBIWorkspace( + credentials=PowerBIServicePrincipal( + client_id=dg.EnvVar("POWER_BI_CLIENT_ID"), + client_secret=dg.EnvVar("POWER_BI_CLIENT_SECRET"), + tenant_id=dg.EnvVar("POWER_BI_TENANT_ID"), + ), + workspace_id=dg.EnvVar("POWER_BI_WORKSPACE_ID"), +) + +# Load Power BI asset specs, and use the asset definition builder to +# construct a semantic model refresh definition for each semantic model +power_bi_assets = [ + build_semantic_model_refresh_asset_definition(resource_key="power_bi", spec=spec) + if spec.tags.get("dagster-powerbi/asset_type") == "semantic_model" + else spec + for spec in load_powerbi_asset_specs(power_bi_workspace) +] +defs = dg.Definitions( + assets=[*power_bi_assets], resources={"power_bi": power_bi_workspace} +) +``` + +You can then add these semantic models to jobs or as targets of Dagster sensors or schedules to trigger refreshes of the models on a cadence or based on other conditions. + +### Customizing how Power BI semantic models are materialized + +Instead of using the out-of-the-box `build_semantic_model_refresh_asset_definition` utility, you can build your own asset definitions that trigger the refresh of Power BI semantic models. This allows you to customize how the refresh is triggered or to run custom code before or after the refresh. 
+ +{/* TODO convert to */} +```python file=/integrations/power-bi/materialize-semantic-models-advanced.py +from dagster_powerbi import ( + PowerBIServicePrincipal, + PowerBIWorkspace, + build_semantic_model_refresh_asset_definition, + load_powerbi_asset_specs, +) + +import dagster as dg + +power_bi_workspace = PowerBIWorkspace( + credentials=PowerBIServicePrincipal( + client_id=dg.EnvVar("POWER_BI_CLIENT_ID"), + client_secret=dg.EnvVar("POWER_BI_CLIENT_SECRET"), + tenant_id=dg.EnvVar("POWER_BI_TENANT_ID"), + ), + workspace_id=dg.EnvVar("POWER_BI_WORKSPACE_ID"), +) + + +# Asset definition factory which triggers a semantic model refresh and sends a notification +# once complete +def build_semantic_model_refresh_and_notify_asset_def( + spec: dg.AssetSpec, +) -> dg.AssetsDefinition: + dataset_id = spec.metadata["dagster-powerbi/id"] + + @dg.multi_asset(specs=[spec], name=spec.key.to_python_identifier()) + def rebuild_semantic_model( + context: dg.AssetExecutionContext, power_bi: PowerBIWorkspace + ) -> None: + power_bi.trigger_and_poll_refresh(dataset_id) + # Do some custom work after refreshing here, such as sending an email notification + + return rebuild_semantic_model + + +# Load Power BI asset specs, and use our custom asset definition builder to +# construct a definition for each semantic model +power_bi_assets = [ + build_semantic_model_refresh_and_notify_asset_def(spec=spec) + if spec.tags.get("dagster-powerbi/asset_type") == "semantic_model" + else spec + for spec in load_powerbi_asset_specs(power_bi_workspace) +] +defs = dg.Definitions( + assets=[*power_bi_assets], resources={"power_bi": power_bi_workspace} +) +``` diff --git a/docs/docs-beta/docs/integrations/libraries/sigma.md b/docs/docs-beta/docs/integrations/libraries/sigma/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/sigma.md rename to docs/docs-beta/docs/integrations/libraries/sigma/index.md diff --git 
a/docs/docs-beta/docs/integrations/libraries/sigma/using-sigma-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/sigma/using-sigma-with-dagster.md new file mode 100644 index 0000000000000..e782545bef8bd --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/sigma/using-sigma-with-dagster.md @@ -0,0 +1,180 @@ +--- +title: "Using Sigma with Dagster" +description: Represent your Sigma assets in Dagster +--- + +::: + +This feature is considered **experimental** + +::: + +This guide provides instructions for using Dagster with Sigma using the [`dagster-sigma`](/api/python-api/libraries/dagster-sigma) library. Your Sigma assets, including datasets and workbooks, can be represented in the Dagster asset graph, allowing you to track lineage and dependencies between Sigma assets and upstream data assets you are already modeling in Dagster. + +## What you'll learn + +- How to represent Sigma assets in the Dagster asset graph, including lineage to other Dagster assets. +- How to customize asset definition metadata for these Sigma assets. + +
+ Prerequisites + +- The `dagster-sigma` library installed in your environment +- Familiarity with asset definitions and the Dagster asset graph +- Familiarity with Dagster resources +- Familiarity with Sigma concepts, like datasets and workbooks +- A Sigma organization +- A Sigma client ID and client secret. For more information, see [Generate API client credentials](https://help.sigmacomputing.com/reference/generate-client-credentials#generate-api-client-credentials) in the Sigma documentation. + +
+ +## Set up your environment + +To get started, you'll need to install the `dagster` and `dagster-sigma` Python packages: + +```bash +pip install dagster dagster-sigma +``` + +## Represent Sigma assets in the asset graph + +To load Sigma assets into the Dagster asset graph, you must first construct a `SigmaOrganization` resource, which allows Dagster to communicate with your Sigma organization. You'll need to supply your client ID and client secret alongside the base URL. See [Identify your API request URL](https://help.sigmacomputing.com/reference/get-started-sigma-api#identify-your-api-request-url) in the Sigma documentation for more information on how to find your base URL. + +Dagster can automatically load all datasets and workbooks from your Sigma organization as asset specs. Call the `load_sigma_asset_specs` function, which returns a list of `AssetSpec`s representing your Sigma assets. You can then include these asset specs in your `Definitions` object: + +{/* TODO convert to */} +```python file=/integrations/sigma/representing-sigma-assets.py +from dagster_sigma import SigmaBaseUrl, SigmaOrganization, load_sigma_asset_specs + +import dagster as dg + +sigma_organization = SigmaOrganization( + base_url=SigmaBaseUrl.AWS_US, + client_id=dg.EnvVar("SIGMA_CLIENT_ID"), + client_secret=dg.EnvVar("SIGMA_CLIENT_SECRET"), +) + +sigma_specs = load_sigma_asset_specs(sigma_organization) +defs = dg.Definitions(assets=[*sigma_specs], resources={"sigma": sigma_organization}) +``` + +## Load Sigma assets from filtered workbooks + +You can load a subset of your Sigma assets by providing a `SigmaFilter` to the `load_sigma_asset_specs` function. The `SigmaFilter` object lets you specify the folders from which to load Sigma workbooks and configure which datasets are represented as assets. + +Note that the content and size of your Sigma organization may affect the performance of your Dagster deployments. Filtering the workbooks from which your Sigma assets are loaded is particularly useful for improving loading times. 
+ +{/* TODO convert to */} +```python file=/integrations/sigma/filtering-sigma-assets.py +from dagster_sigma import ( + SigmaBaseUrl, + SigmaFilter, + SigmaOrganization, + load_sigma_asset_specs, +) + +import dagster as dg + +sigma_organization = SigmaOrganization( + base_url=SigmaBaseUrl.AWS_US, + client_id=dg.EnvVar("SIGMA_CLIENT_ID"), + client_secret=dg.EnvVar("SIGMA_CLIENT_SECRET"), +) + +sigma_specs = load_sigma_asset_specs( + organization=sigma_organization, + sigma_filter=SigmaFilter( + # Filter down to only the workbooks in these folders + workbook_folders=[ + ("my_folder", "my_subfolder"), + ("my_folder", "my_other_subfolder"), + ], + # Specify whether to include datasets that are not used in any workbooks + # default is True + include_unused_datasets=False, + ), +) +defs = dg.Definitions(assets=[*sigma_specs], resources={"sigma": sigma_organization}) +``` + +### Customize asset definition metadata for Sigma assets + +By default, Dagster will generate asset specs for each Sigma asset based on its type, and populate default metadata. You can further customize asset properties by passing a custom `DagsterSigmaTranslator` subclass to the `load_sigma_asset_specs` function. This subclass can implement methods to customize the asset specs for each Sigma asset type. 
+ +{/* TODO convert to */} +```python file=/integrations/sigma/customize-sigma-asset-defs.py +from dagster_sigma import ( + DagsterSigmaTranslator, + SigmaBaseUrl, + SigmaOrganization, + SigmaWorkbookTranslatorData, + load_sigma_asset_specs, +) + +import dagster as dg + +sigma_organization = SigmaOrganization( + base_url=SigmaBaseUrl.AWS_US, + client_id=dg.EnvVar("SIGMA_CLIENT_ID"), + client_secret=dg.EnvVar("SIGMA_CLIENT_SECRET"), +) + + +# A translator class lets us customize properties of the built Sigma assets, such as the owners or asset key +class MyCustomSigmaTranslator(DagsterSigmaTranslator): + def get_asset_spec(self, data: SigmaWorkbookTranslatorData) -> dg.AssetSpec: + # We create the default asset spec using super() + default_spec = super().get_asset_spec(data) + # we customize the team owner tag for all Sigma assets + return default_spec.replace_attributes(owners=["team:my_team"]) + + +sigma_specs = load_sigma_asset_specs( + sigma_organization, dagster_sigma_translator=MyCustomSigmaTranslator() +) +defs = dg.Definitions(assets=[*sigma_specs], resources={"sigma": sigma_organization}) +``` + +Note that `super()` is called in each of the overridden methods to generate the default asset spec. It is best practice to generate the default asset spec before customizing it. + +### Load Sigma assets from multiple organizations + +Definitions from multiple Sigma organizations can be combined by instantiating multiple resources and merging their specs. 
This lets you view all your Sigma assets in a single asset graph: + +{/* TODO convert to */} +```python file=/integrations/sigma/multiple-sigma-organizations.py +from dagster_sigma import SigmaBaseUrl, SigmaOrganization, load_sigma_asset_specs + +import dagster as dg + +sales_team_organization = SigmaOrganization( + base_url=SigmaBaseUrl.AWS_US, + client_id=dg.EnvVar("SALES_SIGMA_CLIENT_ID"), + client_secret=dg.EnvVar("SALES_SIGMA_CLIENT_SECRET"), +) + +marketing_team_organization = SigmaOrganization( + base_url=SigmaBaseUrl.AWS_US, + client_id=dg.EnvVar("MARKETING_SIGMA_CLIENT_ID"), + client_secret=dg.EnvVar("MARKETING_SIGMA_CLIENT_SECRET"), +) + +sales_team_specs = load_sigma_asset_specs(sales_team_organization) +marketing_team_specs = load_sigma_asset_specs(marketing_team_organization) + +# Merge the specs into a single set of definitions +defs = dg.Definitions( + assets=[*sales_team_specs, *marketing_team_specs], + resources={ + "marketing_sigma": marketing_team_organization, + "sales_sigma": sales_team_organization, + }, +) +``` + +### Related + +- [`dagster-sigma` API reference](/api/python-api/libraries/dagster-sigma) +- [Asset definitions](/guides/build/assets/defining-assets) +- [Resources](/guides/build/external-resources/) +- [Using environment variables and secrets](/guides/deploy/using-environment-variables-and-secrets) diff --git a/docs/docs-beta/docs/integrations/libraries/snowflake.md b/docs/docs-beta/docs/integrations/libraries/snowflake/index.md similarity index 100% rename from docs/docs-beta/docs/integrations/libraries/snowflake.md rename to docs/docs-beta/docs/integrations/libraries/snowflake/index.md diff --git a/docs/docs-beta/docs/integrations/libraries/snowflake/reference.md b/docs/docs-beta/docs/integrations/libraries/snowflake/reference.md new file mode 100644 index 0000000000000..3286ed0f32d71 --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/snowflake/reference.md @@ -0,0 +1,700 @@ +--- +title: "dagster-snowflake 
integration reference" +description: Store your Dagster assets in Snowflake +sidebar_position: 300 +--- + +This reference page provides information for working with [`dagster-snowflake`](/api/python-api/libraries/dagster-snowflake) features that are not covered as part of the Snowflake & Dagster tutorials ([resources](using-snowflake-with-dagster), [I/O managers](using-snowflake-with-dagster-io-managers)). + +## Authenticating using a private key + +In addition to password-based authentication, you can authenticate with Snowflake using a key pair. To set up private key authentication for your Snowflake account, see the instructions in the [Snowflake docs](https://docs.snowflake.com/en/user-guide/key-pair-auth.html#configuring-key-pair-authentication). + +Currently, Dagster's Snowflake integration only supports encrypted private keys. You can provide the private key directly to the Snowflake resource or I/O manager, or via a file containing the private key. + + + + +**Directly to the resource** + +{/* TODO convert to */} +```python file=/integrations/snowflake/private_key_auth_resource.py startafter=start_direct_key endbefore=end_direct_key +from dagster_snowflake import SnowflakeResource + +from dagster import Definitions, EnvVar + +defs = Definitions( + assets=[iris_dataset], + resources={ + "snowflake": SnowflakeResource( + account="abc1234.us-east-1", + user=EnvVar("SNOWFLAKE_USER"), + private_key=EnvVar("SNOWFLAKE_PK"), + private_key_password=EnvVar("SNOWFLAKE_PK_PASSWORD"), + database="FLOWERS", + ) + }, +) +``` + +**Via a file** + +{/* TODO convert to */} +```python file=/integrations/snowflake/private_key_auth_resource.py startafter=start_key_file endbefore=end_key_file +from dagster_snowflake import SnowflakeResource + +from dagster import Definitions, EnvVar + +defs = Definitions( + assets=[iris_dataset], + resources={ + "snowflake": SnowflakeResource( + account="abc1234.us-east-1", + user=EnvVar("SNOWFLAKE_USER"), + 
private_key_path="/path/to/private/key/file.p8", + private_key_password=EnvVar("SNOWFLAKE_PK_PASSWORD"), + database="FLOWERS", + ) + }, +) +``` + + + + +**Directly to the I/O manager** + +{/* TODO convert to */} +```python file=/integrations/snowflake/private_key_auth_io_manager.py startafter=start_direct_key endbefore=end_direct_key +from dagster_snowflake_pandas import SnowflakePandasIOManager + +from dagster import Definitions, EnvVar + +defs = Definitions( + assets=[iris_dataset], + resources={ + "io_manager": SnowflakePandasIOManager( + account="abc1234.us-east-1", + user=EnvVar("SNOWFLAKE_USER"), + private_key=EnvVar("SNOWFLAKE_PK"), + private_key_password=EnvVar("SNOWFLAKE_PK_PASSWORD"), + database="FLOWERS", + ) + }, +) +``` + +**Via a file** + +{/* TODO convert to */} +```python file=/integrations/snowflake/private_key_auth_io_manager.py startafter=start_key_file endbefore=end_key_file +from dagster_snowflake_pandas import SnowflakePandasIOManager + +from dagster import Definitions, EnvVar + +defs = Definitions( + assets=[iris_dataset], + resources={ + "io_manager": SnowflakePandasIOManager( + account="abc1234.us-east-1", + user=EnvVar("SNOWFLAKE_USER"), + private_key_path="/path/to/private/key/file.p8", + private_key_password=EnvVar("SNOWFLAKE_PK_PASSWORD"), + database="FLOWERS", + ) + }, +) +``` + + + + +## Using the Snowflake resource + +### Executing custom SQL commands + +Using a [Snowflake resource](/api/python-api/libraries/dagster-snowflake#resource), you can execute custom SQL queries on a Snowflake database: + +{/* TODO convert to */} +```python file=/integrations/snowflake/resource.py startafter=start endbefore=end +from dagster_snowflake import SnowflakeResource + +from dagster import Definitions, EnvVar, asset + +# this example executes a query against the IRIS_DATASET table created in Step 2 of the +# Using Dagster with Snowflake tutorial + + +@asset +def small_petals(snowflake: SnowflakeResource): + query = """ + create or replace table 
iris.small_petals as ( + SELECT * + FROM iris.iris_dataset + WHERE petal_length_cm < 1 AND petal_width_cm < 1 + ); + """ + + with snowflake.get_connection() as conn: + conn.cursor().execute(query) + + +defs = Definitions( + assets=[small_petals], + resources={ + "snowflake": SnowflakeResource( + account="abc1234.us-east-1", + user=EnvVar("SNOWFLAKE_USER"), + password=EnvVar("SNOWFLAKE_PASSWORD"), + database="FLOWERS", + schema="IRIS", + ) + }, +) +``` + +Let's review what's happening in this example: + +- We attach the `SnowflakeResource` to the `small_petals` asset +- We use the `get_connection` context manager method of the Snowflake resource to get a [`snowflake.connector.Connection`](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-api#object-connection) object +- We use the connection to execute a custom SQL query against the `IRIS_DATASET` table created in [Step 2](using-snowflake-with-dagster#step-2-create-tables-in-snowflake) of the [Snowflake resource tutorial](using-snowflake-with-dagster) + +For more information on the Snowflake resource, including additional configuration settings, see the API docs. + + +## Using the Snowflake I/O manager + +### Selecting specific columns in a downstream asset + +Sometimes you may not want to fetch an entire table as the input to a downstream asset. With the Snowflake I/O manager, you can select specific columns to load by supplying metadata on the downstream asset. 
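As a rough illustration of what column selection amounts to on the Snowflake side, the sketch below turns a `columns` metadata list into a `SELECT` statement. The `build_select` helper is hypothetical and is not part of `dagster-snowflake`; it only assumes that a missing column list means every column is loaded.

```python
from typing import Optional, Sequence


def build_select(table: str, columns: Optional[Sequence[str]] = None) -> str:
    """Hypothetical helper: SELECT only `columns`, or every column when none are given."""
    column_list = ", ".join(columns) if columns else "*"
    return f"SELECT {column_list} FROM {table}"


# Metadata in the shape it is supplied on the downstream asset's input
metadata = {"columns": ["sepal_length_cm", "sepal_width_cm"]}

statement = build_select("FLOWERS.IRIS.IRIS_DATASET", metadata.get("columns"))
print(statement)
```

The point of the sketch is only that narrowing the column list narrows the query, so less data is transferred into the downstream asset.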
+ +{/* TODO convert to */} +```python file=/integrations/snowflake/downstream_columns.py +import pandas as pd + +from dagster import AssetIn, asset + +# this example uses the iris_dataset asset from Step 2 of the Using Dagster with Snowflake tutorial + + +@asset( + ins={ + "iris_sepal": AssetIn( + key="iris_dataset", + metadata={"columns": ["sepal_length_cm", "sepal_width_cm"]}, + ) + } +) +def sepal_data(iris_sepal: pd.DataFrame) -> pd.DataFrame: + iris_sepal["sepal_area_cm2"] = ( + iris_sepal["sepal_length_cm"] * iris_sepal["sepal_width_cm"] + ) + return iris_sepal +``` + +In this example, we only use the columns containing sepal data from the `IRIS_DATASET` table created in [Step 2](using-snowflake-with-dagster-io-managers#store-a-dagster-asset-as-a-table-in-snowflake) of the [Snowflake I/O manager tutorial](using-snowflake-with-dagster-io-managers). Fetching the entire table would be unnecessarily costly, so to select specific columns, we can add metadata to the input asset. We do this in the `metadata` parameter of the `AssetIn` that loads the `iris_dataset` asset in the `ins` parameter. We supply the key `columns` with a list of names of the columns we want to fetch. + +When Dagster materializes `sepal_data` and loads the `iris_dataset` asset using the Snowflake I/O manager, it will only fetch the `sepal_length_cm` and `sepal_width_cm` columns of the `FLOWERS.IRIS.IRIS_DATASET` table and pass them to `sepal_data` as a Pandas DataFrame. + +### Storing partitioned assets + +The Snowflake I/O manager supports storing and loading partitioned data. In order to correctly store and load data from the Snowflake table, the Snowflake I/O manager needs to know which column contains the data defining the partition bounds. The Snowflake I/O manager uses this information to construct the correct queries to select or replace the data. In the following sections, we describe how the I/O manager constructs these queries for different types of partitions. 
+ + + +To store statically-partitioned assets in Snowflake, specify `partition_expr` metadata on the asset to tell the Snowflake I/O manager which column contains the partition data: + +{/* TODO convert to CodeExample */} +```python file=/integrations/snowflake/static_partition.py +import pandas as pd + +from dagster import AssetExecutionContext, StaticPartitionsDefinition, asset + + +@asset( + partitions_def=StaticPartitionsDefinition( + ["Iris-setosa", "Iris-virginica", "Iris-versicolor"] + ), + metadata={"partition_expr": "SPECIES"}, +) +def iris_dataset_partitioned(context: AssetExecutionContext) -> pd.DataFrame: + species = context.partition_key + + full_df = pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + return full_df[full_df["species"] == species] + + +@asset +def iris_cleaned(iris_dataset_partitioned: pd.DataFrame): + return iris_dataset_partitioned.dropna().drop_duplicates() +``` + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the partition in the downstream asset. When loading a static partition (or multiple static partitions), the following statement is used: + +```sql +SELECT * + WHERE [partition_expr] in ([selected partitions]) +``` + +When the `partition_expr` value is injected into this statement, the resulting SQL query must follow Snowflake's SQL syntax. Refer to the [Snowflake documentation](https://docs.snowflake.com/en/sql-reference/constructs) for more information. + +{/* TODO fix link: When materializing the above assets, a partition must be selected, as described in [Materializing partitioned assets](/concepts/partitions-schedules-sensors/partitioning-assets#materializing-partitioned-assets).*/} When materializing the above assets, a partition must be selected. 
In this example, the query used when materializing the `Iris-setosa` partition of the above assets would be: + +```sql +SELECT * + WHERE SPECIES in ('Iris-setosa') +``` + + + + +As with statically-partitioned assets, you can specify `partition_expr` metadata on the asset to tell the Snowflake I/O manager which column contains the partition data: + +{/* TODO convert to CodeExample */} +```python file=/integrations/snowflake/time_partition.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset + + +@asset( + partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"), + metadata={"partition_expr": "TO_TIMESTAMP(TIME::INT)"}, +) +def iris_data_per_day(context: AssetExecutionContext) -> pd.DataFrame: + partition = context.partition_key + + # get_iris_data_for_date fetches all of the iris data for a given date, + # the returned dataframe contains a column named 'time' that stores + # the time of the row as an integer of seconds since epoch + return get_iris_data_for_date(partition) + + +@asset +def iris_cleaned(iris_data_per_day: pd.DataFrame): + return iris_data_per_day.dropna().drop_duplicates() +``` + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the correct partition in the downstream asset. When loading a time-based partition, the following statement is used: + +```sql +SELECT * + WHERE [partition_expr] >= [partition_start] + AND [partition_expr] < [partition_end] +``` + +When the `partition_expr` value is injected into this statement, the resulting SQL query must follow Snowflake's SQL syntax. Refer to the [Snowflake documentation](https://docs.snowflake.com/en/sql-reference/constructs) for more information. 
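The two statement templates above can be sketched in plain Python to show how a `partition_expr` value and a partition key might combine into a `WHERE` clause. This is a minimal sketch under stated assumptions, not the I/O manager's implementation: `static_partition_clause` and `daily_partition_clause` are hypothetical helpers, the bounds are assumed to be formatted as `YYYY-MM-DD HH:MM:SS`, and the quoting is deliberately naive.

```python
from datetime import datetime, timedelta
from typing import Sequence


def static_partition_clause(partition_expr: str, partition_keys: Sequence[str]) -> str:
    """Hypothetical helper: match rows whose partition column is one of the selected keys."""
    # Naive quoting for illustration only; real code must escape values properly.
    quoted = ", ".join(f"'{key}'" for key in partition_keys)
    return f"{partition_expr} in ({quoted})"


def daily_partition_clause(partition_expr: str, partition_key: str) -> str:
    """Hypothetical helper: match rows inside the half-open window [start, end) of one day."""
    start = datetime.strptime(partition_key, "%Y-%m-%d")
    end = start + timedelta(days=1)
    fmt = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format for the bounds
    return (
        f"{partition_expr} >= '{start.strftime(fmt)}' "
        f"AND {partition_expr} < '{end.strftime(fmt)}'"
    )


static_clause = static_partition_clause("SPECIES", ["Iris-setosa"])
daily_clause = daily_partition_clause("TO_TIMESTAMP(TIME::INT)", "2023-01-02")
print(static_clause)
print(daily_clause)
```

Because `partition_expr` is pasted into the clause verbatim, any expression used there, such as `TO_TIMESTAMP(TIME::INT)`, has to be valid Snowflake SQL on its own.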
+ +{/* TODO fix link: When materializing the above assets, a partition must be selected, as described in [Materializing partitioned assets](/concepts/partitions-schedules-sensors/partitioning-assets#materializing-partitioned-assets). */} When materializing the above assets, a partition must be selected. The `[partition_start]` and `[partition_end]` bounds are of the form `YYYY-MM-DD HH:MM:SS`. In this example, the query when materializing the `2023-01-02` partition of the above assets would be: + +```sql +SELECT * + WHERE TO_TIMESTAMP(TIME::INT) >= '2023-01-02 00:00:00' + AND TO_TIMESTAMP(TIME::INT) < '2023-01-03 00:00:00' +``` + +In this example, the data in the `TIME` column are integers, so the `partition_expr` metadata includes a SQL statement to convert integers to timestamps. A full list of Snowflake functions can be found [here](https://docs.snowflake.com/en/sql-reference/functions-all). + + + + +The Snowflake I/O manager can also store data partitioned on multiple dimensions. To do this, you must specify the column for each partition as a dictionary of `partition_expr` metadata: + +{/* TODO convert to CodeExample */} +```python file=/integrations/snowflake/multi_partition.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import ( + AssetExecutionContext, + DailyPartitionsDefinition, + MultiPartitionKey, + MultiPartitionsDefinition, + StaticPartitionsDefinition, + asset, +) + + +@asset( + partitions_def=MultiPartitionsDefinition( + { + "date": DailyPartitionsDefinition(start_date="2023-01-01"), + "species": StaticPartitionsDefinition( + ["Iris-setosa", "Iris-virginica", "Iris-versicolor"] + ), + } + ), + metadata={ + "partition_expr": {"date": "TO_TIMESTAMP(TIME::INT)", "species": "SPECIES"} + }, +) +def iris_dataset_partitioned(context: AssetExecutionContext) -> pd.DataFrame: + partition = context.partition_key.keys_by_dimension + species = partition["species"] + date = partition["date"] + + # get_iris_data_for_date 
fetches all of the iris data for a given date, + # the returned dataframe contains a column named 'time' with that stores + # the time of the row as an integer of seconds since epoch + full_df = get_iris_data_for_date(date) + + return full_df[full_df["species"] == species] + + +@asset +def iris_cleaned(iris_dataset_partitioned: pd.DataFrame): + return iris_dataset_partitioned.dropna().drop_duplicates() +``` + +Dagster uses the `partition_expr` metadata to craft the `SELECT` statement when loading the correct partition in a downstream asset. For multi-partitions, Dagster concatenates the `WHERE` statements described in the above sections to craft the correct `SELECT` statement. + +{/* TODO fix link: When materializing the above assets, a partition must be selected, as described in [Materializing partitioned assets](/concepts/partitions-schedules-sensors/partitioning-assets#materializing-partitioned-assets). */} When materializing the above assets, a partition must be selected. For example, when materializing the `2023-01-02|Iris-setosa` partition of the above assets, the following query will be used: + +```sql +SELECT * + WHERE SPECIES in ('Iris-setosa') + AND TO_TIMESTAMP(TIME::INT) >= '2023-01-02 00:00:00' + AND TO_TIMESTAMP(TIME::INT) < '2023-01-03 00:00:00' +``` + + + + +### Storing tables in multiple schemas + +If you want to have different assets stored in different Snowflake schemas, the Snowflake I/O manager allows you to specify the schema in a few ways. + +You can specify the default schema where data will be stored as configuration to the I/O manager, like we did in [Step 1](using-snowflake-with-dagster-io-managers#step-1-configure-the-snowflake-io-manager) of the [Snowflake I/O manager tutorial](using-snowflake-with-dagster-io-managers). 
+ +To store assets in different schemas, specify the schema as metadata: + +{/* TODO convert to */} +```python file=/integrations/snowflake/schema.py startafter=start_metadata endbefore=end_metadata dedent=4 +daffodil_dataset = AssetSpec( + key=["daffodil_dataset"], metadata={"schema": "daffodil"} +) + +@asset(metadata={"schema": "iris"}) +def iris_dataset() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) +``` + +You can also specify the schema as part of the asset's asset key: + +{/* TODO convert to */} +```python file=/integrations/snowflake/schema.py startafter=start_asset_key endbefore=end_asset_key dedent=4 +daffodil_dataset = AssetSpec(key=["daffodil", "daffodil_dataset"]) + +@asset(key_prefix=["iris"]) +def iris_dataset() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) +``` + +In this example, the `iris_dataset` asset will be stored in the `IRIS` schema, and the `daffodil_dataset` asset will be found in the `DAFFODIL` schema. + +:::note + + The schema is determined in this order: +
1. If the schema is set via metadata, that schema will be used.
2. Otherwise, the schema set as configuration on the I/O manager will be used.
3. Otherwise, if there is a `key_prefix`, that schema will be used.
4. If none of the above are provided, the default schema will be `PUBLIC`.
+ +::: + +### Storing timestamp data in Pandas DataFrames + +When storing a Pandas DataFrame with the Snowflake I/O manager, the I/O manager will check if timestamp data has a timezone attached, and if not, **it will assign the UTC timezone**. In Snowflake, you will see the timestamp data stored as the `TIMESTAMP_NTZ(9)` type, as this is the type assigned by the Snowflake Pandas connector. + +:::note + +Prior to `dagster-snowflake` version `0.19.0` the Snowflake I/O manager converted all timestamp data to strings before loading the data in Snowflake, and did the opposite conversion when fetching a DataFrame from Snowflake. If you have used a version of `dagster-snowflake` prior to version `0.19.0`, see the [Migration Guide](/guides/migrate/version-migration#extension-libraries) for information about migrating database tables. + +::: + +### Using the Snowflake I/O manager with other I/O managers + +You may have assets that you don't want to store in Snowflake. You can provide an I/O manager to each asset using the `io_manager_key` parameter in the `asset` decorator: + +{/* TODO convert to */} +```python file=/integrations/snowflake/multiple_io_managers.py startafter=start_example endbefore=end_example +import pandas as pd +from dagster_aws.s3.io_manager import s3_pickle_io_manager +from dagster_snowflake_pandas import SnowflakePandasIOManager + +from dagster import Definitions, EnvVar, asset + + +@asset(io_manager_key="warehouse_io_manager") +def iris_dataset() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + +@asset(io_manager_key="blob_io_manager") +def iris_plots(iris_dataset): + # plot_data is a function we've defined somewhere else + # that plots the data in a DataFrame + return plot_data(iris_dataset) + + +defs = Definitions( + assets=[iris_dataset, iris_plots], + resources={ + "warehouse_io_manager": 
SnowflakePandasIOManager( + database="FLOWERS", + schema="IRIS", + account="abc1234.us-east-1", + user=EnvVar("SNOWFLAKE_USER"), + password=EnvVar("SNOWFLAKE_PASSWORD"), + ), + "blob_io_manager": s3_pickle_io_manager, + }, +) +``` + +In this example, the `iris_dataset` asset uses the I/O manager bound to the key `warehouse_io_manager` and `iris_plots` will use the I/O manager bound to the key `blob_io_manager`. In the object, we supply the I/O managers for those keys. When the assets are materialized, the `iris_dataset` will be stored in Snowflake, and `iris_plots` will be saved in Amazon S3. + +### Storing and loading PySpark DataFrames in Snowflake + +The Snowflake I/O manager also supports storing and loading PySpark DataFrames. To use the , first install the package: + +```shell +pip install dagster-snowflake-pyspark +``` + +Then you can use the `SnowflakePySparkIOManager` in your `Definitions` as in [Step 1](using-snowflake-with-dagster-io-managers#step-1-configure-the-snowflake-io-manager) of the [Snowflake I/O manager tutorial](using-snowflake-with-dagster-io-managers). + +{/* TODO convert to */} +```python file=/integrations/snowflake/pyspark_configuration.py startafter=start_configuration endbefore=end_configuration +from dagster_snowflake_pyspark import SnowflakePySparkIOManager + +from dagster import Definitions, EnvVar + +defs = Definitions( + assets=[iris_dataset], + resources={ + "io_manager": SnowflakePySparkIOManager( + account="abc1234.us-east-1", # required + user=EnvVar("SNOWFLAKE_USER"), # required + password=EnvVar("SNOWFLAKE_PASSWORD"), # password or private key required + database="FLOWERS", # required + warehouse="PLANTS", # required for PySpark + role="writer", # optional, defaults to the default role for the account + schema="IRIS", # optional, defaults to PUBLIC + ) + }, +) +``` + +:::note + +When using the `snowflake_pyspark_io_manager` the `warehouse` configuration is required. 
+ +::: + +The `SnowflakePySparkIOManager` requires that a `SparkSession` be active and configured with the [Snowflake connector for Spark](https://docs.snowflake.com/en/user-guide/spark-connector.html). You can either create your own `SparkSession` or use the . + + + + +{/* TODO convert to CodeExample */} +```python file=/integrations/snowflake/pyspark_with_spark_resource.py +from dagster_pyspark import pyspark_resource +from dagster_snowflake_pyspark import SnowflakePySparkIOManager +from pyspark import SparkFiles +from pyspark.sql import DataFrame +from pyspark.sql.types import DoubleType, StringType, StructField, StructType + +from dagster import AssetExecutionContext, Definitions, EnvVar, asset + +SNOWFLAKE_JARS = "net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.12:2.8.2-spark_3.0" + + +@asset(required_resource_keys={"pyspark"}) +def iris_dataset(context: AssetExecutionContext) -> DataFrame: + spark = context.resources.pyspark.spark_session + + schema = StructType( + [ + StructField("sepal_length_cm", DoubleType()), + StructField("sepal_width_cm", DoubleType()), + StructField("petal_length_cm", DoubleType()), + StructField("petal_width_cm", DoubleType()), + StructField("species", StringType()), + ] + ) + + url = "https://docs.dagster.io/assets/iris.csv" + spark.sparkContext.addFile(url) + + return spark.read.schema(schema).csv("file://" + SparkFiles.get("iris.csv")) + + +defs = Definitions( + assets=[iris_dataset], + resources={ + "io_manager": SnowflakePySparkIOManager( + account="abc1234.us-east-1", + user=EnvVar("SNOWFLAKE_USER"), + password=EnvVar("SNOWFLAKE_PASSWORD"), + database="FLOWERS", + warehouse="PLANTS", + schema="IRIS", + ), + "pyspark": pyspark_resource.configured( + {"spark_conf": {"spark.jars.packages": SNOWFLAKE_JARS}} + ), + }, +) +``` + + + + +{/* TODO convert to CodeExample */} +```python file=/integrations/snowflake/pyspark_with_spark_session.py +from dagster_snowflake_pyspark import SnowflakePySparkIOManager +from 
pyspark import SparkFiles +from pyspark.sql import DataFrame, SparkSession +from pyspark.sql.types import DoubleType, StringType, StructField, StructType + +from dagster import Definitions, EnvVar, asset + +SNOWFLAKE_JARS = "net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.12:2.8.2-spark_3.0" + + +@asset +def iris_dataset() -> DataFrame: + spark = SparkSession.builder.config( + key="spark.jars.packages", + value=SNOWFLAKE_JARS, + ).getOrCreate() + + schema = StructType( + [ + StructField("sepal_length_cm", DoubleType()), + StructField("sepal_width_cm", DoubleType()), + StructField("petal_length_cm", DoubleType()), + StructField("petal_width_cm", DoubleType()), + StructField("species", StringType()), + ] + ) + + # addFile expects a single path or URL string, not a tuple + url = "https://docs.dagster.io/assets/iris.csv" + spark.sparkContext.addFile(url) + + return spark.read.schema(schema).csv("file://" + SparkFiles.get("iris.csv")) + + +defs = Definitions( + assets=[iris_dataset], + resources={ + "io_manager": SnowflakePySparkIOManager( + account="abc1234.us-east-1", + user=EnvVar("SNOWFLAKE_USER"), + password=EnvVar("SNOWFLAKE_PASSWORD"), + database="FLOWERS", + warehouse="PLANTS", + schema="IRIS", + ), + }, +) +``` + + + + +### Using Pandas and PySpark DataFrames with Snowflake + +If you work with both Pandas and PySpark DataFrames and want a single I/O manager to handle storing and loading these DataFrames in Snowflake, you can write a new I/O manager that handles both types. To do this, inherit from the base `SnowflakeIOManager` class and implement the `type_handlers` and `default_load_type` methods. The resulting I/O manager will inherit the configuration fields of the base `SnowflakeIOManager`. 
+ +{/* TODO convert to */} +```python file=/integrations/snowflake/pandas_and_pyspark.py startafter=start_example endbefore=end_example +from typing import Optional, Type + +import pandas as pd +from dagster_snowflake import SnowflakeIOManager +from dagster_snowflake_pandas import SnowflakePandasTypeHandler +from dagster_snowflake_pyspark import SnowflakePySparkTypeHandler + +from dagster import Definitions, EnvVar + + +class SnowflakePandasPySparkIOManager(SnowflakeIOManager): + @staticmethod + def type_handlers(): + """type_handlers should return a list of the TypeHandlers that the I/O manager can use. + Here we return the SnowflakePandasTypeHandler and SnowflakePySparkTypeHandler so that the I/O + manager can store Pandas DataFrames and PySpark DataFrames. + """ + return [SnowflakePandasTypeHandler(), SnowflakePySparkTypeHandler()] + + @staticmethod + def default_load_type() -> Optional[type]: + """If an asset is not annotated with a return type, default_load_type will be used to + determine which TypeHandler to use to store and load the output. + In this case, unannotated assets will be stored and loaded as Pandas DataFrames. 
+ """ + return pd.DataFrame + + +defs = Definitions( + assets=[iris_dataset, rose_dataset], + resources={ + "io_manager": SnowflakePandasPySparkIOManager( + account="abc1234.us-east-1", + user=EnvVar("SNOWFLAKE_USER"), + password=EnvVar("SNOWFLAKE_PASSWORD"), + database="FLOWERS", + role="writer", + warehouse="PLANTS", + schema="IRIS", + ) + }, +) +``` diff --git a/docs/docs-beta/docs/integrations/libraries/snowflake/using-snowflake-with-dagster-io-managers.md b/docs/docs-beta/docs/integrations/libraries/snowflake/using-snowflake-with-dagster-io-managers.md new file mode 100644 index 0000000000000..2f7bd07816fbf --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/snowflake/using-snowflake-with-dagster-io-managers.md @@ -0,0 +1,213 @@ +--- +title: "Using Snowflake with Dagster I/O managers" +description: "Learn to integrate Snowflake with Dagster using a Snowflake I/O manager." +sidebar_position: 100 +--- + +This tutorial focuses on how to store and load Dagster's [asset definitions](/guides/build/assets/defining-assets) in Snowflake by using a Snowflake I/O manager. An [**I/O manager**](/guides/build/io-managers/) transfers the responsibility of storing and loading DataFrames as Snowflake tables to Dagster. + +By the end of the tutorial, you will: + +- Configure a Snowflake I/O manager +- Create a table in Snowflake using a Dagster asset +- Make a Snowflake table available in Dagster +- Load Snowflake tables in downstream assets + +This guide focuses on storing and loading Pandas DataFrames in Snowflake, but Dagster also supports using PySpark DataFrames with Snowflake. The concepts from this guide apply to working with PySpark DataFrames, and you can learn more about setting up and using the Snowflake I/O manager with PySpark DataFrames in the [Snowflake reference](reference). + +**Prefer to use resources instead?** Unlike an I/O manager, resources allow you to run SQL queries directly against tables within an asset's compute function. 
For details, see "[Using Snowflake with Dagster resources](using-snowflake-with-dagster)". + +## Prerequisites + +To complete this tutorial, you'll need: + +- **To install the `dagster-snowflake` and `dagster-snowflake-pandas` libraries**: + + ```shell + pip install dagster-snowflake dagster-snowflake-pandas + ``` + +- **To gather the following information**, which is required to use the Snowflake I/O manager: + + - **Snowflake account name**: You can find this by logging into Snowflake and getting the account name from the URL: + + ![Snowflake account name from URL](/images/integrations/snowflake/snowflake-account.png) + + - **Snowflake credentials**: You can authenticate with Snowflake in two ways: with a username and password, or with a username and private key. + + The Snowflake I/O manager can read all of these authentication values from environment variables. In this guide, we use password authentication and store the username and password as `SNOWFLAKE_USER` and `SNOWFLAKE_PASSWORD`, respectively. + + ```shell + export SNOWFLAKE_USER= + export SNOWFLAKE_PASSWORD= + ``` + + Refer to the [Using environment variables and secrets guide](/guides/deploy/using-environment-variables-and-secrets) for more info. + + For more information on authenticating with a private key, see [Authenticating with a private key](reference#authenticating-using-a-private-key) in the Snowflake reference guide. + + + +## Step 1: Configure the Snowflake I/O manager + +The Snowflake I/O manager requires some configuration to connect to your Snowflake instance. The `account` and `user` values are required to connect to Snowflake. One method of authentication is also required: either a password or a private key. Additionally, you need to specify the `database` where all the tables should be stored. + +You can also provide some optional configuration to further customize the Snowflake I/O manager. You can specify a `warehouse` and `schema` where data should be stored, and a `role` for the I/O manager. 
+ +{/* TODO convert to */} +```python file=/integrations/snowflake/io_manager_tutorial/configuration.py startafter=start_example endbefore=end_example +from dagster_snowflake_pandas import SnowflakePandasIOManager + +from dagster import Definitions, EnvVar + +defs = Definitions( + assets=[iris_dataset], + resources={ + "io_manager": SnowflakePandasIOManager( + account="abc1234.us-east-1", # required + user=EnvVar("SNOWFLAKE_USER"), # required + password=EnvVar("SNOWFLAKE_PASSWORD"), # password or private key required + database="FLOWERS", # required + role="writer", # optional, defaults to the default role for the account + warehouse="PLANTS", # optional, defaults to default warehouse for the account + schema="IRIS", # optional, defaults to PUBLIC + ) + }, +) +``` + +With this configuration, if you materialized an asset called `iris_dataset`, the Snowflake I/O manager would be permissioned with the role `writer` and would store the data in the `FLOWERS.IRIS.IRIS_DATASET` table in the `PLANTS` warehouse. + +Finally, in the object, we assign the to the `io_manager` key. `io_manager` is a reserved key to set the default I/O manager for your assets. + +For more info about each of the configuration values, refer to the API documentation. + + + +## Step 2: Create tables in Snowflake + +The Snowflake I/O manager can create and update tables for your Dagster defined assets, but you can also make existing Snowflake tables available to Dagster. + + + + + +### Store a Dagster asset as a table in Snowflake + +To store data in Snowflake using the Snowflake I/O manager, the definitions of your assets don't need to change. You can tell Dagster to use the Snowflake I/O manager, like in [Step 1: Configure the Snowflake I/O manager](#step-1-configure-the-snowflake-io-manager), and Dagster will handle storing and loading your assets in Snowflake. 
+ +{/* TODO convert to */} +```python file=/integrations/snowflake/io_manager_tutorial/create_table.py +import pandas as pd + +from dagster import asset + + +@asset +def iris_dataset() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) +``` + +In this example, we first define our [asset](/guides/build/assets/defining-assets). Here, we are fetching the Iris dataset as a Pandas DataFrame and renaming the columns. The type signature of the function tells the I/O manager what data type it is working with, so it is important to include the return type `pd.DataFrame`. + +When Dagster materializes the `iris_dataset` asset using the configuration from [Step 1: Configure the Snowflake I/O manager](#step-1-configure-the-snowflake-io-manager), the Snowflake I/O manager will create the table `FLOWERS.IRIS.IRIS_DATASET` if it does not exist and replace the contents of the table with the value returned from the `iris_dataset` asset. + + + + + +You may already have tables in Snowflake that you want to make available to other Dagster assets. You can define [external assets](/guides/build/assets/external-assets) for these tables. By defining an external asset for the existing table, you tell Dagster how to find the table so it can be fetched for downstream assets. + +{/* TODO convert to */} +```python file=/integrations/snowflake/source_asset.py +from dagster import AssetSpec + +iris_harvest_data = AssetSpec(key="iris_harvest_data") +``` + +In this example, we create a for a pre-existing table - perhaps created by an external data ingestion tool - that contains data about iris harvests. To make the data available to other Dagster assets, we need to tell the Snowflake I/O manager how to find the data. 
+ +Since we supply the database and the schema in the I/O manager configuration in [Step 1: Configure the Snowflake I/O manager](#step-1-configure-the-snowflake-io-manager), we only need to provide the table name. We do this with the `key` parameter in `AssetSpec`. When the I/O manager needs to load the `iris_harvest_data` in a downstream asset, it will select the data in the `FLOWERS.IRIS.IRIS_HARVEST_DATA` table as a Pandas DataFrame and provide it to the downstream asset. + + + + +## Step 3: Load Snowflake tables in downstream assets + +Once you have created an asset that represents a table in Snowflake, you will likely want to create additional assets that work with the data. Dagster and the Snowflake I/O manager allow you to load the data stored in Snowflake tables into downstream assets. + +{/* TODO convert to */} +```python file=/integrations/snowflake/io_manager_tutorial/downstream.py startafter=start_example endbefore=end_example +import pandas as pd + +from dagster import asset + +# this example uses the iris_dataset asset from Step 2 + + +@asset +def iris_cleaned(iris_dataset: pd.DataFrame) -> pd.DataFrame: + return iris_dataset.dropna().drop_duplicates() +``` + +In this example, we want to provide the `iris_dataset` asset from the [Store a Dagster asset as a table in Snowflake](#store-a-dagster-asset-as-a-table-in-snowflake) example to the `iris_cleaned` asset. In `iris_cleaned`, the `iris_dataset` parameter tells Dagster that the value for the `iris_dataset` asset should be provided as input to `iris_cleaned`. + +When materializing these assets, Dagster will use the `SnowflakePandasIOManager` to fetch the `FLOWERS.IRIS.IRIS_DATASET` as a Pandas DataFrame and pass this DataFrame as the `iris_dataset` parameter to `iris_cleaned`. When `iris_cleaned` returns a Pandas DataFrame, Dagster will use the `SnowflakePandasIOManager` to store the DataFrame as the `FLOWERS.IRIS.IRIS_CLEANED` table in Snowflake. 
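The table names used throughout this tutorial follow a `DATABASE.SCHEMA.ASSET_NAME` pattern. As a rough sketch of that naming convention only (the helper `qualified_table_name` is hypothetical and is not part of `dagster-snowflake`), the name can be assembled like this, assuming the uppercase form shown in the text:

```python
def qualified_table_name(database: str, schema: str, asset_name: str) -> str:
    """Join database, schema and asset name into the uppercase
    DATABASE.SCHEMA.TABLE form used in this tutorial (hypothetical helper)."""
    return ".".join(part.upper() for part in (database, schema, asset_name))


# The two assets from this step, using the configuration from Step 1
print(qualified_table_name("FLOWERS", "IRIS", "iris_dataset"))
# FLOWERS.IRIS.IRIS_DATASET
print(qualified_table_name("FLOWERS", "IRIS", "iris_cleaned"))
# FLOWERS.IRIS.IRIS_CLEANED
```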
+ +## Completed code example + +When finished, your code should look like the following: + +{/* TODO convert to */} +```python file=/integrations/snowflake/io_manager_tutorial/full_example.py +import pandas as pd +from dagster_snowflake_pandas import SnowflakePandasIOManager + +from dagster import AssetSpec, Definitions, EnvVar, asset + +iris_harvest_data = AssetSpec(key="iris_harvest_data") + + +@asset +def iris_dataset() -> pd.DataFrame: + return pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + +@asset +def iris_cleaned(iris_dataset: pd.DataFrame) -> pd.DataFrame: + return iris_dataset.dropna().drop_duplicates() + + +defs = Definitions( + assets=[iris_dataset, iris_harvest_data, iris_cleaned], + resources={ + "io_manager": SnowflakePandasIOManager( + account="abc1234.us-east-1", + user=EnvVar("SNOWFLAKE_USER"), + password=EnvVar("SNOWFLAKE_PASSWORD"), + database="FLOWERS", + role="writer", + warehouse="PLANTS", + schema="IRIS", + ) + }, +) +``` diff --git a/docs/docs-beta/docs/integrations/libraries/snowflake/using-snowflake-with-dagster.md b/docs/docs-beta/docs/integrations/libraries/snowflake/using-snowflake-with-dagster.md new file mode 100644 index 0000000000000..c48d2334ab58b --- /dev/null +++ b/docs/docs-beta/docs/integrations/libraries/snowflake/using-snowflake-with-dagster.md @@ -0,0 +1,257 @@ +--- +title: "Using Snowflake with Dagster resources" +description: "Learn to integrate Snowflake with Dagster using a Snowflake resource." +sidebar_position: 200 +--- + +This tutorial focuses on how to store and load Dagster's [asset definitions](/guides/build/assets/defining-assets) in Snowflake by using Dagster's . A [**resource**](/guides/build/external-resources/) allows you to directly run SQL queries against tables within an asset's compute function. 
+ +By the end of the tutorial, you will: + +- Configure a Snowflake resource +- Use the Snowflake resource to execute a SQL query that creates a table +- Load Snowflake tables in downstream assets +- Add the assets and Snowflake resource to a `Definitions` object + +**Prefer to use an I/O manager?** Unlike resources, an [I/O manager](/guides/build/io-managers/) transfers the responsibility of storing and loading DataFrames as Snowflake tables to Dagster. Refer to the [Snowflake I/O manager guide](using-snowflake-with-dagster-io-managers) for more info. + +## Prerequisites + +To complete this tutorial, you'll need: + +- **To install the following libraries**: + + ```shell + pip install dagster dagster-snowflake pandas + ``` + +- **To gather the following information**, which is required to use the Snowflake resource: + + - **Snowflake account name**: You can find this by logging into Snowflake and getting the account name from the URL: + + ![Snowflake account name in URL](/images/integrations/snowflake/snowflake-account.png) + + - **Snowflake credentials**: You can authenticate with Snowflake in two ways: with a username and password, or with a username and private key. + + The Snowflake resource can read these authentication values from environment variables. In this guide, we use password authentication and store the username and password as `SNOWFLAKE_USER` and `SNOWFLAKE_PASSWORD`, respectively: + + ```shell + export SNOWFLAKE_USER= + export SNOWFLAKE_PASSWORD= + ``` + + Refer to the [Using environment variables and secrets guide](/guides/deploy/using-environment-variables-and-secrets) for more info. + + For more information on authenticating with a private key, see [Authenticating with a private key](reference#authenticating-using-a-private-key) in the Snowflake reference guide. + + +## Step 1: Configure the Snowflake resource + +To connect to Snowflake, we'll use the `dagster-snowflake` `SnowflakeResource`. 
The `SnowflakeResource` requires some configuration: + +- **The `account` and `user` values are required.** +- **One method of authentication is required**: either a password or a private key. +- **Optional**: Using the `warehouse`, `schema`, and `role` attributes, you can specify where data should be stored and a `role` for the resource to use. + +{/* TODO convert to */} +```python file=/integrations/snowflake/resource_tutorial/full_example.py startafter=start_config endbefore=end_config +from dagster_snowflake import SnowflakeResource +from snowflake.connector.pandas_tools import write_pandas + +from dagster import Definitions, EnvVar, MaterializeResult, asset + +snowflake = SnowflakeResource( + account=EnvVar("SNOWFLAKE_ACCOUNT"), # required + user=EnvVar("SNOWFLAKE_USER"), # required + password=EnvVar("SNOWFLAKE_PASSWORD"), # password or private key required + warehouse="PLANTS", + schema="IRIS", + role="WRITER", +) +``` + +With this configuration, if you materialized an asset named `iris_dataset`, the `SnowflakeResource` would use the role `WRITER` and store the data in the `FLOWERS.IRIS.IRIS_DATASET` table using the `PLANTS` warehouse. + +For more info about each of the configuration values, refer to the `SnowflakeResource` API documentation. 
+ +## Step 2: Create tables in Snowflake + + + + +Using the Snowflake resource, you can create Snowflake tables using the Snowflake Python API: + +{/* TODO convert to */} +```python file=/integrations/snowflake/resource_tutorial/full_example.py startafter=start_asset endbefore=end_asset +import pandas as pd +from dagster_snowflake import SnowflakeResource +from snowflake.connector.pandas_tools import write_pandas + +from dagster import MaterializeResult, asset + + +@asset +def iris_dataset(snowflake: SnowflakeResource): + iris_df = pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + with snowflake.get_connection() as conn: + table_name = "iris_dataset" + database = "flowers" + schema = "iris" + success, number_chunks, rows_inserted, output = write_pandas( + conn, + iris_df, + table_name=table_name, + database=database, + schema=schema, + auto_create_table=True, + overwrite=True, + quote_identifiers=False, + ) + + return MaterializeResult( + metadata={"rows_inserted": rows_inserted}, + ) +``` + +In this example, we've defined an asset that fetches the Iris dataset as a Pandas DataFrame. Then, using the Snowflake resource, the DataFrame is stored in Snowflake as the `FLOWERS.IRIS.IRIS_DATASET` table. + + + + +If you have existing tables in Snowflake and other assets defined in Dagster depend on those tables, you may want Dagster to be aware of those upstream dependencies. + +Making Dagster aware of these tables allows you to track the full data lineage in Dagster. You can accomplish this by defining [external assets](/guides/build/assets/external-assets) for these tables. For example: + +{/* TODO convert to */} +```python file=/integrations/snowflake/source_asset.py +from dagster import AssetSpec + +iris_harvest_data = AssetSpec(key="iris_harvest_data") +``` + +In this example, we created an `AssetSpec` for a pre-existing table called `iris_harvest_data`. 
+ +Since we supplied the database and the schema in the resource configuration in [Step 1](#step-1-configure-the-snowflake-resource), we only need to provide the table name. We did this by using the `key` parameter in our `AssetSpec`. When the `iris_harvest_data` asset needs to be loaded in a downstream asset, the data in the `FLOWERS.IRIS.IRIS_HARVEST_DATA` table will be selected and provided to the asset. + + + + +## Step 3: Define downstream assets + +Once you've created an asset that represents a table in Snowflake, you may want to create additional assets that work with the data. In the following example, we've defined an asset that creates a second table, which contains only the data for the _Iris Setosa_ species: + +{/* TODO convert to */} +```python file=/integrations/snowflake/resource_tutorial/full_example.py startafter=start_downstream endbefore=end_downstream +from dagster_snowflake import SnowflakeResource + +from dagster import asset + + +@asset(deps=["iris_dataset"]) +def iris_setosa(snowflake: SnowflakeResource) -> None: + query = """ + create or replace table iris.iris_setosa as ( + SELECT * + FROM iris.iris_dataset + WHERE species = 'Iris-setosa' + ); + """ + + with snowflake.get_connection() as conn: + # cursor() is a method, so it must be called before execute() + conn.cursor().execute(query) +``` + +To accomplish this, we defined a dependency on the `iris_dataset` asset using the `deps` parameter. Then, the SQL query runs and creates the table of _Iris Setosa_ data. + +## Step 4: Definitions object + +The last step is to add the `SnowflakeResource` and the assets to the project's `Definitions` object: + +{/* TODO convert to */} +```python file=/integrations/snowflake/resource_tutorial/full_example.py startafter=start_definitions endbefore=end_definitions +from dagster import Definitions + +defs = Definitions( + assets=[iris_dataset, iris_setosa], resources={"snowflake": snowflake} +) +``` + +This makes the resource and assets available to Dagster tools like the UI and CLI. 
+ +## Completed code example + +When finished, your code should look like the following: + +{/* TODO convert to */} +```python file=/integrations/snowflake/resource_tutorial/full_example.py lines=1,4-16,27-58,67-80,86-88 +import pandas as pd +from dagster_snowflake import SnowflakeResource +from snowflake.connector.pandas_tools import write_pandas + +from dagster import Definitions, EnvVar, MaterializeResult, asset + +snowflake = SnowflakeResource( + account=EnvVar("SNOWFLAKE_ACCOUNT"), # required + user=EnvVar("SNOWFLAKE_USER"), # required + password=EnvVar("SNOWFLAKE_PASSWORD"), # password or private key required + warehouse="PLANTS", + schema="IRIS", + role="WRITER", +) + + +@asset +def iris_dataset(snowflake: SnowflakeResource): + iris_df = pd.read_csv( + "https://docs.dagster.io/assets/iris.csv", + names=[ + "sepal_length_cm", + "sepal_width_cm", + "petal_length_cm", + "petal_width_cm", + "species", + ], + ) + + with snowflake.get_connection() as conn: + table_name = "iris_dataset" + database = "flowers" + schema = "iris" + success, number_chunks, rows_inserted, output = write_pandas( + conn, + iris_df, + table_name=table_name, + database=database, + schema=schema, + auto_create_table=True, + overwrite=True, + quote_identifiers=False, + ) + + return MaterializeResult( + metadata={"rows_inserted": rows_inserted}, + ) + + +@asset(deps=["iris_dataset"]) +def iris_setosa(snowflake: SnowflakeResource) -> None: + query = """ + create or replace table iris.iris_setosa as ( + SELECT * + FROM iris.iris_dataset + WHERE species = 'Iris-setosa' + ); + """ + + with snowflake.get_connection() as conn: + # cursor() is a method, so it must be called before execute() + conn.cursor().execute(query) + + +defs = Definitions( + assets=[iris_dataset, iris_setosa], resources={"snowflake": snowflake} +) +``` diff --git a/docs/docs-beta/docusaurus.config.ts b/docs/docs-beta/docusaurus.config.ts index 99c9cb39a2064..a2364b5c93647 100644 --- a/docs/docs-beta/docusaurus.config.ts +++ b/docs/docs-beta/docusaurus.config.ts @@ -1214,27 +1214,27 @@ const config: Config = { }, { from: '/integrations/deltalake', - to: 
'/integrations/libraries/deltalake', + to: '/integrations/libraries/deltalake/', }, { from: '/integrations/deltalake/using-deltalake-with-dagster', - to: '/integrations/libraries/deltalake', + to: '/integrations/libraries/deltalake/using-deltalake-with-dagster', }, { from: '/integrations/deltalake/reference', - to: '/integrations/libraries/deltalake', + to: '/integrations/libraries/deltalake/reference', }, { from: '/integrations/duckdb', - to: '/integrations/libraries/duckdb', + to: '/integrations/libraries/duckdb/', }, { from: '/integrations/duckdb/using-duckdb-with-dagster', - to: '/integrations/libraries/duckdb', + to: '/integrations/libraries/duckdb/using-duckdb-with-dagster', }, { from: '/integrations/duckdb/reference', - to: '/integrations/libraries/duckdb', + to: '/integrations/libraries/duckdb/reference', }, { from: '/integrations/embedded-elt', @@ -1242,7 +1242,7 @@ const config: Config = { }, { from: '/integrations/embedded-elt/dlt', - to: '/integrations/libraries/dlt', + to: '/integrations/libraries/dlt/', }, { from: '/integrations/embedded-elt/sling', @@ -1254,51 +1254,51 @@ const config: Config = { }, { from: '/integrations/bigquery', - to: '/integrations/libraries/gcp/bigquery', + to: '/integrations/libraries/gcp/bigquery/', }, { from: '/integrations/bigquery/using-bigquery-with-dagster', - to: '/integrations/libraries/gcp/bigquery', + to: '/integrations/libraries/gcp/bigquery/using-bigquery-with-dagster', }, { from: '/integrations/bigquery/reference', - to: '/integrations/libraries/gcp/bigquery', + to: '/integrations/libraries/gcp/bigquery/reference', }, { from: '/integrations/dagstermill', - to: '/integrations/libraries/jupyter', + to: '/integrations/libraries/jupyter/', }, { from: '/integrations/dagstermill/using-notebooks-with-dagster', - to: '/integrations/libraries/', + to: '/integrations/libraries/jupyter/using-notebooks-with-dagster', }, { from: '/integrations/dagstermill/reference', - to: '/integrations/libraries/', + to: 
'/integrations/libraries/jupyter/reference', }, { from: '/integrations/looker', - to: '/integrations/libraries/looker', + to: '/integrations/libraries/looker/', }, { from: '/integrations/openai', - to: '/integrations/libraries/openai', + to: '/integrations/libraries/openai/', }, { from: '/integrations/pandas', - to: '/integrations/libraries/pandas', + to: '/integrations/libraries/pandas/', }, { from: '/integrations/pandera', - to: '/integrations/libraries/pandera', + to: '/integrations/libraries/pandera/', }, { from: '/integrations/powerbi', - to: '/integrations/libraries/powerbi', + to: '/integrations/libraries/powerbi/', }, { from: '/integrations/sigma', - to: '/integrations/libraries/sigma', + to: '/integrations/libraries/sigma/', }, { from: '/integrations/spark', @@ -1306,19 +1306,19 @@ const config: Config = { }, { from: '/integrations/snowflake', - to: '/integrations/libraries/snowflake', + to: '/integrations/libraries/snowflake/', }, { from: '/integrations/snowflake/using-snowflake-with-dagster', - to: '/integrations/libraries/snowflake', + to: '/integrations/libraries/snowflake/using-snowflake-with-dagster', }, { from: '/integrations/snowflake/using-snowflake-with-dagster-io-managers', - to: '/integrations/libraries/snowflake', + to: '/integrations/libraries/snowflake/using-snowflake-with-dagster-io-managers', }, { from: '/integrations/snowflake/reference', - to: '/integrations/libraries/snowflake', + to: '/integrations/libraries/snowflake/reference', }, { from: '/integrations/tableau', diff --git a/docs/docs-beta/static/images/integrations/jupyter/descriptive-plots.png b/docs/docs-beta/static/images/integrations/jupyter/descriptive-plots.png new file mode 100644 index 0000000000000..412f3be317531 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/jupyter/descriptive-plots.png differ diff --git a/docs/docs-beta/static/images/integrations/jupyter/jupyter-tags.png b/docs/docs-beta/static/images/integrations/jupyter/jupyter-tags.png new 
file mode 100644 index 0000000000000..76136e042b0f9 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/jupyter/jupyter-tags.png differ diff --git a/docs/docs-beta/static/images/integrations/jupyter/jupyter-view-menu.png b/docs/docs-beta/static/images/integrations/jupyter/jupyter-view-menu.png new file mode 100644 index 0000000000000..a680af242ba31 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/jupyter/jupyter-view-menu.png differ diff --git a/docs/docs-beta/static/images/integrations/jupyter/kmeans-plots.png b/docs/docs-beta/static/images/integrations/jupyter/kmeans-plots.png new file mode 100644 index 0000000000000..cb4661afeb375 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/jupyter/kmeans-plots.png differ diff --git a/docs/docs-beta/static/images/integrations/jupyter/ui-one.png b/docs/docs-beta/static/images/integrations/jupyter/ui-one.png new file mode 100644 index 0000000000000..b3774a2bafcc8 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/jupyter/ui-one.png differ diff --git a/docs/docs-beta/static/images/integrations/jupyter/ui-three.png b/docs/docs-beta/static/images/integrations/jupyter/ui-three.png new file mode 100644 index 0000000000000..16300623157c2 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/jupyter/ui-three.png differ diff --git a/docs/docs-beta/static/images/integrations/jupyter/ui-two.png b/docs/docs-beta/static/images/integrations/jupyter/ui-two.png new file mode 100644 index 0000000000000..3413fa30ab791 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/jupyter/ui-two.png differ diff --git a/docs/docs-beta/static/images/integrations/jupyter/view-executed-notebook.png b/docs/docs-beta/static/images/integrations/jupyter/view-executed-notebook.png new file mode 100644 index 0000000000000..62fe0f7fd9c60 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/jupyter/view-executed-notebook.png differ 
diff --git a/docs/docs-beta/static/images/integrations/jupyter/view-source-notebook.png b/docs/docs-beta/static/images/integrations/jupyter/view-source-notebook.png new file mode 100644 index 0000000000000..3dea8beacbf7a Binary files /dev/null and b/docs/docs-beta/static/images/integrations/jupyter/view-source-notebook.png differ diff --git a/docs/docs-beta/static/images/integrations/mssql.png b/docs/docs-beta/static/images/integrations/mssql.png new file mode 100644 index 0000000000000..de93818e54b3d Binary files /dev/null and b/docs/docs-beta/static/images/integrations/mssql.png differ diff --git a/docs/docs-beta/static/images/integrations/pandas/tutorial1.png b/docs/docs-beta/static/images/integrations/pandas/tutorial1.png new file mode 100644 index 0000000000000..85871be2fc503 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/pandas/tutorial1.png differ diff --git a/docs/docs-beta/static/images/integrations/pandas/tutorial2.png b/docs/docs-beta/static/images/integrations/pandas/tutorial2.png new file mode 100644 index 0000000000000..b621464746d3c Binary files /dev/null and b/docs/docs-beta/static/images/integrations/pandas/tutorial2.png differ diff --git a/docs/docs-beta/static/images/integrations/pandera/error-report.png b/docs/docs-beta/static/images/integrations/pandera/error-report.png new file mode 100644 index 0000000000000..e9faf8c5a5064 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/pandera/error-report.png differ diff --git a/docs/docs-beta/static/images/integrations/pandera/schema.png b/docs/docs-beta/static/images/integrations/pandera/schema.png new file mode 100644 index 0000000000000..f604137364b95 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/pandera/schema.png differ diff --git a/docs/docs-beta/static/images/integrations/snowflake/snowflake-account.png b/docs/docs-beta/static/images/integrations/snowflake/snowflake-account.png new file mode 100644 index 
0000000000000..9998aa815a454 Binary files /dev/null and b/docs/docs-beta/static/images/integrations/snowflake/snowflake-account.png differ