I am using the sql_table() source with the pyarrow backend to extract data from a MySQL database with a custom extraction query. The query is passed to query_adapter_callback as a string and selects a subset of the source table's columns, plus some additional derived columns (example below).
Upon calling pipeline.extract(), a dlt.extract.exceptions.ResourceExtractionError is thrown. The stack trace indicates a KeyError for one of the table columns that is not part of the custom query's result set.
More details, with a full reproducible example and stack trace, below.
Note: I am following along with the example in the documentation here
Expected behavior
query_adapter_callback with a custom extraction query should extract only the columns defined in the query into the pyarrow table; no KeyError should occur.
Steps to reproduce
Full reproducible example using a public MySQL database and a DuckDB destination:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/dlt/extract/pipe_iterator.py", line 277, in _get_source_item
pipe_item = next(gen)
File "/usr/local/lib/python3.10/site-packages/dlt/sources/sql_database/helpers.py", line 301, in table_rows
yield from loader.load_rows(backend_kwargs)
File "/usr/local/lib/python3.10/site-packages/dlt/sources/sql_database/helpers.py", line 178, in load_rows
yield from self._load_rows(query, backend_kwargs)
File "/usr/local/lib/python3.10/site-packages/dlt/sources/sql_database/helpers.py", line 200, in _load_rows
yield row_tuples_to_arrow(
File "/usr/local/lib/python3.10/site-packages/dlt/common/configuration/inject.py", line 247, in _wrap
return f(*bound_args.args, **bound_args.kwargs)
File "/usr/local/lib/python3.10/site-packages/dlt/sources/sql_database/arrow_helpers.py", line 22, in row_tuples_to_arrow
return _row_tuples_to_arrow(
File "/usr/local/lib/python3.10/site-packages/dlt/common/libs/pyarrow.py", line 615, in row_tuples_to_arrow
columnar_known_types = {
File "/usr/local/lib/python3.10/site-packages/dlt/common/libs/pyarrow.py", line 616, in <dictcomp>
col["name"]: columnar[col["name"]]
KeyError: 'initials'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/dlt/pipeline/pipeline.py", line 471, in extract
self._extract_source(
File "/usr/local/lib/python3.10/site-packages/dlt/pipeline/pipeline.py", line 1239, in _extract_source
load_id = extract.extract(
File "/usr/local/lib/python3.10/site-packages/dlt/extract/extract.py", line 421, in extract
self._extract_single_source(
File "/usr/local/lib/python3.10/site-packages/dlt/extract/extract.py", line 344, in _extract_single_source
for pipe_item in pipes:
File "/usr/local/lib/python3.10/site-packages/dlt/extract/pipe_iterator.py", line 162, in __next__
pipe_item = self._get_source_item()
File "/usr/local/lib/python3.10/site-packages/dlt/extract/pipe_iterator.py", line 307, in _get_source_item
raise ResourceExtractionError(pipe.name, gen, str(ex), "generator") from ex
dlt.extract.exceptions.ResourceExtractionError: In processing pipe author: extraction of resource author in generator table_rows caused an exception: 'initials'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/prefect/flows/subflows/repro_dlt_query_adapter_issue.py", line 57, in <module>
pipeline.extract(table_source)
File "/usr/local/lib/python3.10/site-packages/dlt/pipeline/pipeline.py", line 226, in _wrap
step_info = f(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/dlt/pipeline/pipeline.py", line 180, in _wrap
rv = f(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/dlt/pipeline/pipeline.py", line 166, in _wrap
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/dlt/pipeline/pipeline.py", line 275, in _wrap
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/dlt/pipeline/pipeline.py", line 492, in extract
raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1741133041.594346 with exception:
<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe author: extraction of resource author in generator table_rows caused an exception: 'initials'
I dug into the internals of the arrow_helpers.row_tuples_to_arrow function a bit and noticed some odd behavior.
The value of the columns variable (dlt/common/libs/pyarrow.py, line 575 at commit e8c5e9b) is not what I would expect: it contains all columns from the source table, as well as my additional derived column, my_custom_column.
The value of len(rows[0]) is 3, which is what I would expect: 3 columns in the results, which aligns with my custom extraction query.
The value of columnar.keys() is dict_keys(['author_id', 'name', 'last_name']), which is not what I would expect: last_name is not included in the result set of my custom query.
It seems something may be going wrong in the operation where columns are zipped with rows (dlt/common/libs/pyarrow.py, lines 595 to 597 at commit e8c5e9b), which may lead to the KeyError later.
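A plain-Python model of that zip (simplified from, and not identical to, what row_tuples_to_arrow actually does; the column names mirror the example above) shows how a 4-column schema paired with 3-value rows produces exactly these symptoms:

```python
# Schema columns dlt knows for the table (simplified: real code uses
# column dicts, not bare names).
schema_columns = ["author_id", "name", "last_name", "initials"]

# Rows returned by the custom query: author_id, name, my_custom_column.
rows = [(1, "Agatha", "A-x"), (2, "Isaac", "I-x")]

# zip() stops at the shorter input: only the first three schema names are
# paired with the three transposed value columns, so "initials" never
# becomes a key and "last_name" silently receives the derived values.
columnar = {col: vals for col, vals in zip(schema_columns, zip(*rows))}

print(list(columnar))  # ['author_id', 'name', 'last_name']

# The later per-column lookup then fails just like the stack trace:
try:
    columnar["initials"]
except KeyError as exc:
    print("KeyError:", exc)  # KeyError: 'initials'
```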
Separately from the issue described here, I wonder whether I am going about this the right way. Is this the preferred approach for extracting custom column sets (including derived columns) from source tables?
An additional requirement I should mention for my own use case: the custom extraction query must be passed to query_adapter_callback as a raw SQL string. It cannot be built with native SQLAlchemy constructs, as I am reusing the same query elsewhere.
On Mar 5, 2025, acaruso7 changed the title from "sql_table() query_adapter_callback function with custom extraction query fails with KeyError (pyarrow backend)" to "sql_table() query_adapter_callback function with custom extraction query fails with KeyError during pipeline.extract() (pyarrow backend)".
dlt version
1.5.0
Describe the problem
https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1740596179916109
Operating system
Linux
Runtime environment
Docker, Docker Compose
Python version
3.10
dlt data source
sql_table() with pyarrow backend
dlt destination
No response
Other deployment details
No response
Additional information
See above.