dlt.destinations.exceptions.DatabaseUndefinedRelation sql server destination #2088
Hey @axellpadilla, can you provide a script with example data (you can just extract a dictionary that produces this error from your mongo and remove the sensitive data) so I can reproduce this? Otherwise it will be very hard to debug. Thanks.
Yeah @sh-rp, it is indeed sensitive and it will be hard to reproduce manually (but I will try to find time to build a test case). The schema of this collection is about 1,900 lines. Using the modified schema as an import schema for the pipeline fixed the reference issue when NOT using refresh = drop_resources on pipeline execution. After trying a couple of times with write_disposition=replace, the table name in the reference error changed on every execution (random sort order for the creation of the tables?). After this sequence (not sure if it was just good luck) it worked, but it raised a different error about data types or decimals that are too long for SQL Server:
Checking failed jobs in load id '1732580035.753799'
JOB: orders.ab71c490bb.insert_values(orders)
JOB file type: insert_values
JOB file path: C:\Users\bpadilla\.dlt\pipelines\local\mongodb_ecommerce_hn_orders\load\loaded\1732580035.753799\failed_jobs\orders.ab71c490bb.0.insert_values
('22003', "[22003] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]The number '1288.30999999999994543031789362430572509765625' is out of the range for numeric representation (maximum precision 38). (1007) (SQLExecDirectW); [22003] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]The number '353.26999999999998181010596454143524169921875' is out of the range for numeric representation (maximum precision 38). (1007); [22003] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]The number '4.71999999999999975131004248396493494510650634765625' is out of the range for numeric representation (maximum precision 38). (1007); [22003] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]The number '1497.920000000000072759576141834259033203125' is out of the range for numeric representation (maximum precision 38). (1007); [22003] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]The number '4.71999999999999975131004248396493494510650634765625' is out of the range for numeric representation (maximum precision 38). (1007); [22003] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]The number '620.73000000000001818989403545856475830078125' is out of the range for numeric representation (maximum precision 38). (1007)") Another thing, there is a "root_key" creation error when using merge write_disposition on existing data (this shouldn't happen even changing write_disposition if you have "root_key" always enabled, previous runs should have it created). Probably, is that some order of execution of the sql is not being correctly executed on the first case, considering also that for each load the schema could be different, one with text and one with document on the same field, and on the second case, I even checked the code of dlt and can't find a reason why the root_key isn't being created at first use, but maybe the reason is the same, a change from nested to no nested. |
Hey @axellpadilla, I'll close this since I don't have a test to reproduce this.
@axellpadilla did you try to run this outside of dagster? I'm not getting any errors on the duckdb and mssql destinations. I'm on the current version. In your code there's one suspicious thing: you set
This is most probably related to dagster. Are you running several pipelines in parallel?
@rudolfix It failed using that code with multiple variations. It doesn't always fail, but once it starts failing it keeps failing with the same error. The dagster import just provides the pipeline with the destination already created, and it also fails when run directly. I'm using SQL Server 2017. I hope to have more time to reproduce it consistently in a fully featured test in the repository itself (dlt works perfectly with multiple pipelines and collections, but complex ones with multiple nested tables always hit several different problems; maybe at a specific point in memory something executes in the wrong order?)
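For reference, a minimal sketch of the standalone (non-dagster) run being discussed, assuming the verified mongodb source and the mssql destination; the dataset name and the collection argument are placeholders, only the pipeline name is taken from the failed-job path above, and credentials are expected to come from secrets.toml or environment variables as usual:

```python
import dlt
from dlt.destinations import mssql

# the verified source module is assumed to be scaffolded into the project:
# from mongodb import mongodb

pipeline = dlt.pipeline(
    pipeline_name="mongodb_ecommerce_hn_orders",  # name from the failed-job path above
    destination=mssql(),                          # credentials from secrets.toml / env vars
    dataset_name="ecommerce",                     # placeholder dataset name
)

# info = pipeline.run(mongodb(collection_names=["orders"]), write_disposition="merge")
# print(info)
```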
@axellpadilla it just occurred to me that we fixed a similar problem not so long ago, but I cannot find the note of this fix in the release docs. Could you try to run this on dlt 1.5.0 and see if the problem persists?
Sadly, yes, I already updated. I'm currently trying to build a test in my local repo; I will generate enough random data with a changing structure. I realized the pokeapi test is failing, but that should be a change on the API side:
Hi, I got a test that is sometimes non-deterministic (not always the same error). It seems that create_indexes=True is the problem, and I hope it also works for you @sh-rp:

import hashlib
from typing import Dict, List, Any, Literal
import random
from dlt.common import pendulum
from dlt import pipeline
import logging
from dlt.destinations import mssql
import dlt
from dlt.common.utils import uniq_id
from tests.pipeline.utils import (
    load_table_counts,
)


def generate_nested_json_structure(depth: int = 0, max_depth: int = 2) -> Dict[str, Any]:
    if depth >= max_depth:
        return {
            "leaf_value": random.randint(1, 1000),
            "leaf_date": pendulum.now()
        }
    return {
        f"level_{depth}_field": {
            "nested_doc": generate_nested_json_structure(depth + 1, max_depth),
            "nested_array": [
                generate_nested_json_structure(depth + 1, max_depth)
                for _ in range(random.randint(1, 2))
            ]
        }
    }


def generate_deterministic_id(index: int, stage: str) -> str:
    """Generates a deterministic but unique id based on index and stage"""
    return hashlib.sha256(f"{stage}_{index}".encode()).hexdigest()[:24]


def generate_json_like_data(stage: Literal["s1", "s2", "s3"], size: int = 100, seed: int = 42) -> List[Dict[str, Any]]:
    random.seed(seed)
    data = []
    # Pre-generate common values to avoid repeated random calls
    names = ["Alice", "Bob", "Charlie", "David", None, None]  # Added more Nones
    ages = [random.randint(18, 80) for _ in range(10)] + [None, None]  # Added Nones to ages
    timestamps = [pendulum.now().subtract(days=d) for d in range(365)] + [None]
    versions = list(range(1, 6)) + [None]
    # Pre-generate a pool of nested structures with some None values
    nested_pool = {
        'small': [generate_nested_json_structure(0, 2) for _ in range(5)] + [None],
        'medium': [generate_nested_json_structure(0, 3) for _ in range(5)] + [None],
        'large': [generate_nested_json_structure(0, 4) for _ in range(5)] + [None]
    }
    nested_tables = {
        "users": (lambda: {
            "name": random.choice(names),
            "age": random.choice(ages),
            "preferences": random.choice(nested_pool['small']) if random.random() > 0.2 else None
        }, 1.0),
        "orders": (lambda: {
            "order_date": random.choice([pendulum.now(), None, None]),  # Increased chance of None
            "items": random.sample(nested_pool['medium'], k=random.randint(0, 3)) if random.random() > 0.3 else None,
            "shipping": random.choice(nested_pool['small'])
        }, 0.8),
        "products": (lambda: {
            "details": random.choice(nested_pool['large']) if random.random() > 0.2 else None,
            "categories": random.choice([
                random.sample(nested_pool['small'], k=2),
                None,
                [],
                None  # Added another None option
            ]),
            "specs": random.choice(nested_pool['medium']) if random.random() > 0.25 else None
        }, 0.7)
    }
    for i in range(size):
        base_doc = {
            "_id": generate_deterministic_id(i, stage),
            "timestamp": random.choice(timestamps),  # Now can be None
            "version": random.choice(versions)  # Now can be None
        }
        if stage == "s1":
            # Most consistent stage, but still with some missing fields
            doc = {
                **base_doc,
                **{k: v[0]() for k, v in nested_tables.items() if random.random() < v[1]}
            }
        elif stage == "s2":
            # Even more variable with missing fields
            doc = {
                **base_doc,
                **{k: v[0]() for k, v in nested_tables.items() if random.random() < v[1] * 0.9}
            }
            # Add some completely new fields, sometimes null
            if random.random() > 0.5:
                doc["dynamic_field"] = random.choice([
                    generate_nested_json_structure(0, random.randint(1, 4)),
                    None
                ])
        else:  # s3
            # Most chaotic stage with more nulls and missing fields
            doc = {
                **base_doc,
                **{
                    k: random.choice([
                        v[0](),
                        None,
                        None,  # Increased chance of None
                        {"modified": True},
                        [1, 2, 3],
                        "converted_to_string"
                    ]) for k, v in nested_tables.items() if random.random() < v[1] * 0.7
                }
            }
            # Add stage specific fields, sometimes null
            if random.random() > 0.3:
                doc["s3_specific"] = random.choice([
                    generate_nested_json_structure(0, 4),
                    None
                ])
            # Randomly remove some fields
            if random.random() > 0.8:
                keys_to_remove = random.sample(list(doc.keys() - {"_id"}), k=random.randint(1, 2))
                for key in keys_to_remove:
                    doc.pop(key)
        data.append(doc)
    return data


def test_generate_json_like_data():
    logging.basicConfig(level=logging.DEBUG)
    logging.debug("Starting test_generate_json_like_data")
    data = generate_json_like_data("s1", 10)
    assert len(data) == 10
    logging.debug("Finished test_generate_json_like_data")


def test_mssql_json_loading() -> None:
    """Test loading jsonDB-like data into MSSQL with different sequences"""
    dest = mssql(create_indexes=True)

    @dlt.source(name="json_source", parallelized=True)
    def source(stage: str):
        @dlt.resource(name="documents", write_disposition="merge", primary_key="_id")
        def documents():
            yield from generate_json_like_data(stage)
        return documents

    # Test sequence 1: s1 -> s2 -> s3
    p1 = pipeline(
        "json_test_1",
        dataset_name="json_test_1_" + uniq_id()
    )
    p1.run(source("s1"), destination=dest)
    p1.run(source("s2"), destination=dest)
    p1.run(source("s3"), destination=dest)

    # Test sequence 2: s2 -> s1 -> s3
    p2 = pipeline(
        "json_test_2",
        dataset_name="json_test_2_" + uniq_id()
    )
    p2.run(source("s2"), destination=dest)
    p2.run(source("s1"), destination=dest)
    p2.run(source("s3"), destination=dest)

    # Test sequence 3: s3 -> s2 -> s1
    p3 = pipeline(
        "json_test_3",
        dataset_name="json_test_3_" + uniq_id()
    )
    p3.run(source("s3"), destination=dest)
    p3.run(source("s2"), destination=dest)
    p3.run(source("s1"), destination=dest)

    # Test sequence 4: s2 with root_id -> s1 -> s3
    p4 = pipeline(
        "json_test_4",
        dataset_name="json_test_4_" + uniq_id()
    )

    @dlt.source(name="json_source_root_id", root_key=True)
    def source_with_root_id(stage: str):
        @dlt.resource(name="documents", write_disposition="replace", primary_key="_id")
        def documents():
            yield from generate_json_like_data(stage)
        return documents

    p4.run(source_with_root_id("s2"), destination=dest)
    p4.run(source_with_root_id("s1"), destination=dest, write_disposition="merge")
    p4.run(source_with_root_id("s3"), destination=dest, write_disposition="merge")

    # Verify data for each pipeline
    for p in [p1, p2, p3, p4]:
        assert load_table_counts(p, "documents")["documents"] == 300
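As a follow-up to the observation that create_indexes=True seems to trigger the failure, a hedged workaround sketch (an assumption, not a confirmed fix) would be to turn index creation off in the destination factory and rerun the same sequences:

```python
import dlt
from dlt.common.utils import uniq_id
from dlt.destinations import mssql

# same sequences as in the test above, but with index/primary-key DDL disabled on SQL Server
dest_no_indexes = mssql(create_indexes=False)

p = dlt.pipeline("json_test_no_indexes", dataset_name="json_test_no_indexes_" + uniq_id())
# p.run(source("s1"), destination=dest_no_indexes)  # source() as defined in the test above
```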
@sh-rp Were you able to reproduce it?
dlt version
1.4
Describe the problem
While processing documents using source.incremental, some documents cause a lot of errors of this type; this shouldn't happen.
It works without the incremental config.
Edit: Using a filter instead of incremental still causes errors, so this could be an underlying problem with handling data quality issues.
Edit 2: max_table_nesting doesn't work with the verified source; it still fails at a nesting level deeper than the specified parameter. But adding mongodb.resources["orders"].max_table_nesting = 2 does work partially; for example, I still receive errors for orders__shopping_cart__products__key.
Edit 3: The problem continued even without the drop_sources option.
I'm using the verified mongodb source and have tested it in many ways; some collections load if they are cleaner.
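For context, a sketch of the two ways of capping nesting mentioned in Edit 2, using a small inline source as a stand-in for the verified mongodb source (the source, resource, and field names below are illustrative assumptions):

```python
import dlt

@dlt.source(max_table_nesting=2)  # source-level cap (reported above as not taking effect)
def orders_source():
    @dlt.resource(name="orders", write_disposition="merge", primary_key="_id")
    def orders():
        # deeply nested document that would normally produce orders__shopping_cart__products
        yield {"_id": "1", "shopping_cart": {"products": [{"key": "a", "price": 1.5}]}}
    return orders

src = orders_source()
# resource-level cap, as in mongodb.resources["orders"].max_table_nesting = 2
# (reported above as only partially working)
src.resources["orders"].max_table_nesting = 2
```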
Expected behavior
No errors, documents loaded.
Steps to reproduce
Load documents with nested documents in some fields, where some of the values aren't documents, for example an array mixed with undefined values.
Use refresh = "drop_sources"
The error then appears, for example:
Cannot find the object "orders__shopping_cart" because it does not exist or you do not have permissions.
The table doesn't exist in any schema.
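A minimal sketch of these steps, assuming the trigger is the type flip on a nested field combined with refresh="drop_sources"; the table and field names are taken from the error above, everything else (pipeline name, data values) is illustrative, and the mssql credentials come from secrets.toml or environment variables:

```python
import dlt

# the same nested field is a document in one record and a scalar or scalar array in others
docs = [
    {"_id": "1", "shopping_cart": {"products": [{"key": "a", "price": 1.5}]}},
    {"_id": "2", "shopping_cart": "converted_to_string"},
    {"_id": "3", "shopping_cart": {"products": [1, 2, 3]}},
]

pipeline = dlt.pipeline("orders_repro", destination="mssql", dataset_name="orders_repro")
# refresh="drop_sources" drops and recreates the source's tables before the load
pipeline.run(docs, table_name="orders", write_disposition="merge",
             primary_key="_id", refresh="drop_sources")
```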
Operating system
Linux, Windows
Runtime environment
Local
Python version
3.11
dlt data source
dlt verified source mongodb
dlt destination
No response
Other deployment details
sql server destination
Additional information
Consistent error; version 1.3 has the same problem.