Databricks Unity Catalog Connector

The Databricks Unity Catalog Connector makes querying Databricks Unity Catalog tables in Spice simple.

Prerequisites

Step 1. Create a Databricks API token

Create a Databricks personal access token by following the Databricks documentation.

Step 2. Create a new directory and initialize a Spicepod

mkdir databricks-catalog-demo
cd databricks-catalog-demo
spice init

Step 3. Add the Databricks Unity Catalog Connector to spicepod.yaml

Add the following configuration to your spicepod.yaml:

catalogs:
  - from: databricks:<CATALOG_NAME>
    name: db_uc
    params:
      mode: spark_connect # or delta_lake
      databricks_token: ${env:DATABRICKS_TOKEN}
      databricks_endpoint: <instance-id>.cloud.databricks.com
      databricks_cluster_id: <cluster-id>

For mode, choose between spark_connect and delta_lake. spark_connect is the default and requires an All-Purpose Compute Cluster to be available. delta_lake mode queries Delta Lake tables directly in object storage and requires Spice to have the necessary permissions to access that object storage.
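As a sketch, a catalog configured in delta_lake mode could look like the following. Note that databricks_cluster_id is only needed for spark_connect mode; in delta_lake mode the object storage credentials from Step 4 are supplied instead (parameter placement here mirrors the examples in this guide; see the documentation for the full parameter list):

catalogs:
  - from: databricks:<CATALOG_NAME>
    name: db_uc
    params:
      mode: delta_lake
      databricks_token: ${env:DATABRICKS_TOKEN}
      databricks_endpoint: <instance-id>.cloud.databricks.com
      # Object storage credentials for the underlying Delta tables go here (see Step 4)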

Set the DATABRICKS_TOKEN environment variable to the Databricks personal access token created in Step 1. A .env file created in the same directory as spicepod.yaml can be used to set the variable, e.g.:

echo "DATABRICKS_TOKEN=<token>" > .env

Visit the documentation for more information on configuring the Databricks Unity Catalog Connector.

Step 4. Set the object storage credentials for delta_lake mode

When using the delta_lake mode, the object storage credentials must be set for Spice to access the data.

Using Delta Lake directly against AWS S3

params:
  mode: delta_lake
  databricks_endpoint: <instance-id>.cloud.databricks.com
  databricks_token: ${env:DATABRICKS_TOKEN}
  databricks_aws_access_key_id: ${env:AWS_ACCESS_KEY_ID}
  databricks_aws_secret_access_key: ${env:AWS_SECRET_ACCESS_KEY}
  databricks_aws_region: <region> # E.g. us-east-1, us-west-2
  databricks_aws_endpoint: <endpoint> # If using an S3-compatible service, like Minio

Set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables to the AWS access key and secret key, respectively.
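For example, following the same .env pattern used for the Databricks token (values are placeholders):

echo "AWS_ACCESS_KEY_ID=<access-key>" >> .env
echo "AWS_SECRET_ACCESS_KEY=<secret-key>" >> .env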

Using Delta Lake directly against Azure Blob Storage

params:
  mode: delta_lake
  databricks_token: ${env:DATABRICKS_TOKEN}
  databricks_azure_storage_account_name: ${env:AZURE_ACCOUNT_NAME}
  databricks_azure_account_key: ${env:AZURE_ACCOUNT_KEY}

Set the AZURE_ACCOUNT_NAME and AZURE_ACCOUNT_KEY environment variables to the Azure storage account name and account key, respectively.
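As with the AWS credentials, these can be appended to the same .env file (values are placeholders):

echo "AZURE_ACCOUNT_NAME=<storage-account-name>" >> .env
echo "AZURE_ACCOUNT_KEY=<account-key>" >> .env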

Using Delta Lake directly against Google Cloud Storage

params:
  mode: delta_lake
  databricks_token: ${env:DATABRICKS_TOKEN}
  databricks_google_service_account: </path/to/service-account.json>
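Set databricks_google_service_account to the local filesystem path of a Google Cloud service account key file (JSON) that can read the underlying storage bucket.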

Step 5. Start the Spice runtime

spice run

Step 6. Query a dataset

spice sql
SELECT * FROM db_uc.<SCHEMA_NAME>.<TABLE_NAME> LIMIT 10;

Example:

sql> select trace_id, block_number from db_uc.default.traces limit 10;
+-------------------------------------------------------------------------------+--------------+
| trace_id                                                                      | block_number |
+-------------------------------------------------------------------------------+--------------+
| call_0x0e981c555b68e4f7847155e348e75de70729d06a5f3f238dd4e7d4e062a62eed_      | 16876417     |
| call_0xf2e97e476aaba415ad6793e5d09e82d7ef52d7d595db956306c44dc4e08d1f72_      | 16876417     |
| call_0x79be0ec50306a78a79eed5c368a38d17ff7d3d51d0c8331e36914b95d8635ef3_5     | 16876417     |
| call_0x79be0ec50306a78a79eed5c368a38d17ff7d3d51d0c8331e36914b95d8635ef3_5_0   | 16876417     |
| call_0x7997d1a4ea8a7ece16d3a306c1e820de66744b7f340e7a51b78a35fad5d789d0_      | 16876417     |
| call_0x7997d1a4ea8a7ece16d3a306c1e820de66744b7f340e7a51b78a35fad5d789d0_0     | 16876417     |
| call_0x7997d1a4ea8a7ece16d3a306c1e820de66744b7f340e7a51b78a35fad5d789d0_0_0   | 16876417     |
| call_0x7997d1a4ea8a7ece16d3a306c1e820de66744b7f340e7a51b78a35fad5d789d0_0_0_0 | 16876417     |
| call_0x7997d1a4ea8a7ece16d3a306c1e820de66744b7f340e7a51b78a35fad5d789d0_0_1   | 16876417     |
| call_0x7997d1a4ea8a7ece16d3a306c1e820de66744b7f340e7a51b78a35fad5d789d0_0_2   | 16876417     |
+-------------------------------------------------------------------------------+--------------+

Time: 1.5179735 seconds. 10 rows.

Step 7. Refresh the datasets in the catalog

When datasets are dynamically added or removed in the Databricks catalog, the Spice runtime refreshes the catalog accordingly.

Create a new Delta Lake table: In the Databricks SQL Editor, run the following query (replace <CATALOG_NAME> and <SCHEMA_NAME> with your catalog and schema names):

CREATE TABLE <CATALOG_NAME>.<SCHEMA_NAME>.test_table_no_v2checkpoint
TBLPROPERTIES (
  'delta.minReaderVersion' = '1',
  'delta.minWriterVersion' = '2'
)
AS
SELECT 1 AS id;

Verify table addition in Spice: Check the Spice runtime log for the new table:

2025-01-18T00:58:48.747333Z  INFO data_components::unity_catalog::provider: Refreshed schema <CATALOG_NAME>.<SCHEMA_NAME>. Tables added: test_table_no_v2checkpoint.

Query the new table: In another terminal, run spice sql and query the new table:

>> spice sql
Welcome to the Spice.ai SQL REPL! Type 'help' for help.
sql> select * from db_uc.<SCHEMA_NAME>.test_table_no_v2checkpoint;
+----+
| id |
+----+
| 1  |
+----+

Delete the table: Run the following command in Databricks to delete the table:

DROP TABLE <CATALOG_NAME>.<SCHEMA_NAME>.test_table_no_v2checkpoint;

Verify table removal in Spice: Observe that the table has been removed in the Spice runtime log:

2025-01-18T00:59:49.121835Z  INFO data_components::unity_catalog::provider: Refreshed schema <CATALOG_NAME>.<SCHEMA_NAME>. Tables removed: test_table_no_v2checkpoint.