Schema parsing/resolution appears to run serially (slower than resolving it concurrently with a Python thread pool) #21034
Labels
bug: Something isn't working
needs triage: Awaiting prioritization by a maintainer
python: Related to Python Polars
Checks
Reproducible example
Log output
Generating a DataFrame with 4000 rows and 10000 columns...
Writing 200 Parquet files to 'data'...
Dataset generation complete.
Running sequential scan...
Sequential scan took 4.72 seconds.
Running concurrent scan...
Concurrent scan took 1.86 seconds.
Issue description
Collecting the schema purely within Polars takes much longer than collecting it per file concurrently in a Python thread pool (4.72 seconds versus 1.86 seconds in the log above).
The difference is dramatically larger against files hosted in the cloud, where I'm seeing a 7-20x speed difference.
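For readers without the original script, the comparison can be sketched roughly as below. This is not the reporter's code: the helper names (`schema_single_scan`, `schema_one_file`, `schemas_concurrent`, `timed`), the worker count, and the assumption that the "sequential" path is a single multi-file `pl.scan_parquet(...)` call are all illustrative guesses. Only `pl.scan_parquet` and `LazyFrame.collect_schema` are real Polars calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def schema_single_scan(paths):
    """Resolve the schema with one multi-file scan, entirely inside Polars.

    Assumption: this is what the "sequential scan" in the log corresponds to.
    """
    import polars as pl  # imported lazily so the generic helpers work without it

    return pl.scan_parquet(paths).collect_schema()


def schema_one_file(path):
    """Resolve the schema of a single Parquet file."""
    import polars as pl

    return pl.scan_parquet(path).collect_schema()


def schemas_concurrent(paths, schema_fn=schema_one_file, max_workers=32):
    """Resolve one schema per path in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(schema_fn, paths))


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

With the 200 files from the log in `data`, one might compare `timed(schema_single_scan, paths)` against `timed(schemas_concurrent, paths)`, where `paths = sorted(str(p) for p in pathlib.Path("data").glob("*.parquet"))`. The timings will depend on the machine and storage backend.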
Expected behavior
Collecting the schema purely in Polars should be at least as fast as doing it concurrently in a Python thread pool.
Installed versions