Spatial join very slow? #475
I'm having similar issues when running st_within; basically it never finishes. Did you find any solution for this?
Hey folks, I worked with Opendatasoft's support to give them feedback on how they were generating Parquet files, since querying them was unexpectedly slow. They have since updated how they create their Parquet files; they even support exporting files with zstd compression. Here's an example with the dataset that you provided, using a zstd-compressed Parquet file:

```shell
curl -X 'GET' \
  'https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/geonames-all-cities-with-a-population-1000/exports/parquet?parquet_compression=zstd' \
  -H 'accept: */*'
```
I tried with the new dataset above, but the query is still very slow 👎
Hello! Just had a quick look at this. In the meantime, here's an (ugly) workaround:

```sql
PRAGMA disabled_optimizers='join_order,filter_pushdown';

SELECT
    a.name AS city_a,
    b.name AS city_b,
    ST_Distance(a.coordinates, b.coordinates) AS distance_degrees
FROM cities a
JOIN cities b ON st_intersects(st_buffer(a.coordinates, 5), b.coordinates)
WHERE a.geoname_id != b.geoname_id;
```
By rewriting the distance predicate into an intersects on the buffer, you trigger spatial's IE-join optimizer rule, which performs a range join on the bounding box. (This rewrite should probably be done automatically in the future.) Additionally, you have to disable the join_order and filter_pushdown optimizers, or DuckDB will try to push the WHERE predicate into the join condition again and turn it into a nested loop join before spatial has the chance to rewrite the query. I know this isn't ideal, so I'll investigate how we can make this work better.
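To illustrate why the rewrite above helps, here is a minimal sketch (not DuckDB's actual implementation): rewriting `ST_Distance(a, b) <= 5` as an intersects against the buffer lets the join operate on a bounding-box range condition on x/y first, so the exact distance check only runs on surviving candidate pairs.

```python
import math
import random

def within_distance_bruteforce(points, r):
    # Compare every pair directly: O(n^2) exact distance checks.
    return {(i, j)
            for i in range(len(points))
            for j in range(len(points))
            if i != j and math.dist(points[i], points[j]) <= r}

def within_distance_box_prefilter(points, r):
    # First prune with the bounding-box range condition (what a range
    # join on the buffer's bounding box evaluates), then run the exact
    # distance check only on the surviving candidate pairs.
    out = set()
    for i, (ax, ay) in enumerate(points):
        for j, (bx, by) in enumerate(points):
            if i == j:
                continue
            if abs(ax - bx) <= r and abs(ay - by) <= r:  # cheap box test
                if math.dist((ax, ay), (bx, by)) <= r:   # exact test
                    out.add((i, j))
    return out

# The prefilter can never drop a true pair (|dx| <= dist(a, b)),
# so both approaches return identical results.
random.seed(0)
pts = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(200)]
assert within_distance_bruteforce(pts, 5.0) == within_distance_box_prefilter(pts, 5.0)
```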
With the above changes, the final query plan becomes: [query plan screenshot]
Yeah, the buffering does probably add quite a bit of overhead, but in essence you are comparing everything to everything; you're going to end up doing around ~21,621,643,849 distance comparisons one way or another. Running the following seems slightly faster: [query and timings omitted]
But even if I run it with a limit on the LHS: [timings omitted]
I don't know exactly what PostGIS does to make it that much faster... is the table indexed? I guess it would be possible to create a temporary R-tree index to prune out distance checks in a custom operator, but I don't know if that's what the Postgres optimizer does. We could definitely do that, though, with some work.
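The "temporary index to prune distance checks" idea can be sketched with a uniform grid standing in for an R-tree (this is an illustration, not spatial's or PostGIS's code): each point is bucketed into a cell of side r, so any partner within distance r must lie in the same cell or one of its 8 neighbours.

```python
import math
from collections import defaultdict

def within_distance_grid(points, r):
    # Build a throwaway grid index over the points.
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(int(x // r), int(y // r))].append(idx)
    # Probe: only check points in the 3x3 neighbourhood of each cell,
    # pruning the vast majority of exact distance computations.
    out = set()
    for i, (x, y) in enumerate(points):
        cx, cy = int(x // r), int(y // r)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), []):
                    if i != j and math.dist(points[i], points[j]) <= r:
                        out.add((i, j))
    return out

pts = [(0.0, 0.0), (3.0, 4.0), (100.0, 100.0)]
assert within_distance_grid(pts, 5.0) == {(0, 1), (1, 0)}
```

For roughly uniform data this turns the O(n²) pairwise scan into work proportional to the number of nearby pairs, which is the same effect an R-tree-backed join would have.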
@bertt You may want to take a look at some of these recommendations for GeoParquet; you may have better luck recreating those GeoParquet files following these suggestions.
OK, the Postgres query plan seems to be basically equivalent to ours. Here I've created a table with 100k rows of random ids and points, and it doesn't seem to finish either. Applying an index on the pos column doesn't change the plan. Is there anything else I'm missing to make this work on PostgreSQL? If we instead generate a table with 10,000 rows, it eventually finishes in about 14s, but I don't know how you got the full ~140k dataset to finish in a couple of seconds while I can't even get 10k under 10s (explain analyze output).
While the equivalent query takes DuckDB about 19s on my machine:

```sql
CREATE TABLE t2 AS
SELECT i::VARCHAR AS id, st_point(i, i) AS pos
FROM generate_series(1, 10000) x(i);

SELECT
    a.id AS city_a,
    b.id AS city_b,
    ST_Distance(a.pos, b.pos) AS distance_degrees
FROM t2 a
JOIN t2 b ON a.id <> b.id
WHERE ST_Distance(a.pos, b.pos) <= 5;
```
----
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT a.id AS city_a, b.id AS city_b, ST_Distance( a.pos, b.pos ) AS distance_degrees FROM t2 a JOIN t2 b ON a.id <> b.id WHERE ST_Distance( a.pos, b.pos ) <= 5;
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 19.70s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ city_a │
│ city_b │
│ distance_degrees │
│ │
│ 59988 Rows │
│ (0.01s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ──────────────────── │
│ (ST_Distance(pos, pos) <= │
│ 5.0) │
│ │
│ 59988 Rows │
│ (19.38s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ NESTED_LOOP_JOIN │
│ ──────────────────── │
│ Join Type: INNER │
│ Conditions: id != id ├──────────────┐
│ │ │
│ 99990000 Rows │ │
│ (0.18s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ TABLE_SCAN ││ TABLE_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ Table: t2 ││ Table: t2 │
│ Type: Sequential Scan ││ Type: Sequential Scan │
│ ││ │
│ Projections: ││ Projections: │
│ id ││ id │
│ pos ││ pos │
│ ││ │
│ 10000 Rows ││ 10000 Rows │
│ (0.00s) ││ (0.00s) │
└───────────────────────────┘└───────────────────────────┘

Now, PostGIS being about 5s faster (for 10,000 rows) is still significant, and my guess as to why is that PostGIS has implemented ST_Distance natively, while we still have to call out into GEOS's C API, allocating and copying a bunch of memory back and forth. There are a lot of optimizations we can do here to reduce the overhead, in this case in particular because all geometries are just points. In fact, if we change the table slightly:
The query finishes in less than a second.
So, conclusion: we should optimize the ST_Distance implementation for points.
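The overhead described above can be sketched in miniature (this is an illustration of the idea, not spatial's actual code): a generic path that round-trips coordinates through geometry objects pays allocation/copy cost on every call, while a point-specialized fast path is pure arithmetic on the raw coordinates.

```python
import math

def st_distance_generic(geom_a, geom_b):
    # Stand-in for calling out to a geometry library: build full
    # geometry objects (here, dicts) before computing anything.
    a = {"type": "Point", "coords": tuple(geom_a)}
    b = {"type": "Point", "coords": tuple(geom_b)}
    (ax, ay), (bx, by) = a["coords"], b["coords"]
    return math.hypot(ax - bx, ay - by)

def st_distance_points(ax, ay, bx, by):
    # Specialized point fast path: no intermediate objects at all.
    return math.hypot(ax - bx, ay - by)

# Both paths agree; the fast path just skips the object churn.
assert st_distance_generic((0, 0), (3, 4)) == st_distance_points(0, 0, 3, 4) == 5.0
```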
Hi,
I'm executing the following query on a 147K world city points table (loaded from the Parquet export at https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/export/?disjunctive.cou_name_en&sort=name):
Find pairs of cities that are within 5 degrees (about 555 km) of each other.
but it (seemingly) never finishes (v1.1.1) :-( PostGIS is done within a second...
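A query of that shape (a sketch assembled from the column names used elsewhere in this thread, not necessarily the exact query from the report) would be:

```sql
SELECT
    a.name AS city_a,
    b.name AS city_b,
    ST_Distance(a.coordinates, b.coordinates) AS distance_degrees
FROM cities a
JOIN cities b ON a.geoname_id != b.geoname_id
WHERE ST_Distance(a.coordinates, b.coordinates) <= 5;
```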