Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spatial join very slow? #475

Open
bertt opened this issue Jan 10, 2025 · 10 comments
Open

Spatial join very slow? #475

bertt opened this issue Jan 10, 2025 · 10 comments

Comments

@bertt
Copy link

bertt commented Jan 10, 2025

Hi,

I'm executing the following query on a 147K world city points table (loaded parquet file from https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/export/?disjunctive.cou_name_en&sort=name)

Find pairs of cities that are within 5 degrees (about 555 km) of each other.

but it ?never? finishes (v1.1.1 ) :-( PostGIS is ready within a second...

SELECT
    a.name AS city_a,
    b.name AS city_b,
    ST_Distance(
        a.coordinates,
        b.coordinates
    ) AS distance_degrees
FROM cities a
JOIN cities b ON a.geoname_id <> b.geoname_id
WHERE ST_Distance(
    a.coordinates,
    b.coordinates
) <= 5; -- Distance in degrees
@brpy
Copy link

brpy commented Jan 31, 2025

I'm having similar issues when running st_within.

Basically it never finishes.

Did you get any solution for this?

@gregorywaynepower
Copy link

gregorywaynepower commented Feb 6, 2025

Hey folks, I worked with Opendatasoft's support to give them feedback on how they were generating parquet files since querying them was unexpectedly slow. They have since made updates on how they create their parquet files--they even support exporting files with zstd compression.

Here's an example with the dataset that you provided featuring a zstd compressed parquet file.

curl -X 'GET' \
  'https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/geonames-all-cities-with-a-population-1000/exports/parquet?parquet_compression=zstd' \
  -H 'accept: */*'

@bertt
Copy link
Author

bertt commented Feb 7, 2025

tried with the new dataset above, but query is still very slow 👎

@Maxxen
Copy link
Member

Maxxen commented Feb 7, 2025

Hello!

Just had a quick look at this, it seems that the <> join condition is the culprit as it turns the query plan from a HASH JOIN (fast) into a NESTED LOOP JOIN which basically has to compare everything with everything (M*N complexity). Now, why can't we do a hash join for inequality predicates? Im not entirely sure, but I'll ask around.

In the meantime, here's a (ugly) workaround:

PRAGMA disabled_optimizers='join_order,filter_pushdown';
SELECT
      a.name AS city_a,
      b.name AS city_b,
      ST_Distance(
          a.coordinates,
          b.coordinates
      ) AS distance_degrees
  FROM cities a
  JOIN cities b ON st_intersects(st_buffer(a.coordinates, 5), b.coordinates) 
 WHERE a.geoname_id != b.geoname_id;

By rewriting the distance predicate into an intersects on the buffer you trigger spatials I-E-join optimizer rule which performs a range join on the bounding box. (This rewrite should probably be done automatically in the future). Additionally you have to disable the join_order and filter_pushdown optimizers or DuckDB will try to push the WHERE predicate into the join condition again and turn it into a nested loop join before spatial has the chance to rewrite the query.

I know this isn't ideal, so I'll investigate more how we can make this work better.

@Maxxen
Copy link
Member

Maxxen commented Feb 7, 2025

With the above changes, the final query plan becomes:

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│           city_a          │
│           city_b          │
│      distance_degrees     │
│                           │
│        ~147043 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ (geoname_id != geoname_id)│
│                           │
│        ~147043 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│  ST_Intersects(ST_Buffer  │
│    (coordinates, 5.0),    │
│        coordinates)       │
│                           │
│        ~147043 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│          IE_JOIN          │
│    ────────────────────   │
│      Join Type: INNER     │
│                           │
│        Conditions:        │
│     ST_XMin(ST_Extent     │
│ (ST_Buffer(coordinates, 5 │
│ .0))) <= ST_XMax(ST_Extent│
│       (coordinates))      │
│     ST_XMax(ST_Extent     │
│ (ST_Buffer(coordinates, 5 │
│ .0))) >= ST_XMin(ST_Extent├──────────────┐
│       (coordinates))      │              │
│     ST_YMin(ST_Extent     │              │
│ (ST_Buffer(coordinates, 5 │              │
│ .0))) <= ST_YMax(ST_Extent│              │
│       (coordinates))      │              │
│     ST_YMax(ST_Extent     │              │
│ (ST_Buffer(coordinates, 5 │              │
│ .0))) >= ST_YMin(ST_Extent│              │
│       (coordinates))      │              │
│                           │              │
│        ~147043 Rows       │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│           cities          ││           cities          │
│                           ││                           │
│        Projections:       ││        Projections:       │
│        coordinates        ││        coordinates        │
│         geoname_id        ││         geoname_id        │
│            name           ││            name           │
│                           ││                           │
│        ~147043 Rows       ││        ~147043 Rows       │
└───────────────────────────┘└───────────────────────────┘

@bertt
Copy link
Author

bertt commented Feb 7, 2025

still takes a long time (v1.2.0), but seeing progress bar now to 83%

Image

@Maxxen
Copy link
Member

Maxxen commented Feb 7, 2025

Yeah the buffering does probably add quite a bit of overhead, butI mean, in essence you are comparing everything to everything, you're going to end up with a result doing around ~21621643849 distance comparisons one way or another.

Running the following seems slightly faster

SELECT
        a.name AS city_a,
        b.name AS city_b,
        ST_Distance(
            a.coordinates,
            b.coordinates
        ) AS distance_degrees
    FROM cities a
    JOIN cities b ON st_distance(a.coordinates, b.coordinates) <= 5
   WHERE a.geoname_id != b.geoname_id;

But even if I run it with a limit on the LHS (FROM cities LIMIT 500) a with EXPLAIN ANALYZE the majority of the time is spent in the filter

┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ (ST_Distance(coordinates, │
│    coordinates) <= 5.0)   │
│                           │
│        1493466 Rows       │
│          (14.23s)         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      NESTED_LOOP_JOIN     │
│    ────────────────────   │
│      Join Type: INNER     │
│                           │
│        Conditions:        ├──────────────┐
│  geoname_id != geoname_id │              │
│                           │              │
│       73521000 Rows       │              │
│          (0.13s)          │              │
└─────────────┬─────────────┘              │

@Maxxen
Copy link
Member

Maxxen commented Feb 7, 2025

I don't know exactly what PostGIS does to make it that much faster... is the table indexed? I guess it would be possible to create a temporary r-tree index to prune out distance checks in a custom operator, but I don't know if that's what the Postgres optimizer does. We could definitely do that though with some work.

@gregorywaynepower
Copy link

gregorywaynepower commented Feb 7, 2025

@bertt You may want to take a look at some of these recommendations for GeoParquet--you may have better luck recreating those GeoParquet files around these suggestions.

https://github.com/opengeospatial/geoparquet/pull/254/files

@Maxxen
Copy link
Member

Maxxen commented Feb 7, 2025

Ok, the Postgres query plan seems to be basically equivalent to ours, here ive created a table with 100k rows of random ids and points.

CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point(i, i) as pos FROM generate_series(1, 100000) as I
EXPLAIN SELECT
    a.id AS city_a,
    b.id AS city_b,
    ST_Distance(
        a.pos,
        b.pos
    ) AS distance_degrees
FROM t2 a JOIN t2 b ON a.id <> b.id
WHERE ST_Distance(
    a.pos,
    b.pos
) <= 5;
 Nested Loop  (cost=0.00..166919453918.00 rows=3333300000 width=18)
   Join Filter: (((a.id)::text <> (b.id)::text) AND (st_distance(a.pos, b.pos) <= '5'::double precision))
   ->  Seq Scan on t2 a  (cost=0.00..1834.00 rows=100000 width=37)
   ->  Materialize  (cost=0.00..3116.00 rows=100000 width=37)
         ->  Seq Scan on t2 b  (cost=0.00..1834.00 rows=100000 width=37)

And it doesn't seem to finish either. Applying an index on the pos columns doesn't change the plan. Is there anything else I'm missing to make this work on PostgreSQL?

If we instead generate a table with 10000 rows it eventually finishes in about 14s, but I don't know how you got the full 140k~ dataset to finish in a couple seconds while I can't even get 10k under 10s

(explain analyze output)

 Nested Loop  (cost=0.00..1668375393.00 rows=33330000 width=16) (actual time=0.043..14398.525 rows=59988 loops=1)
   Join Filter: (((a.id)::text <> (b.id)::text) AND (st_distance(a.pos, b.pos) <= '5'::double precision))
   Rows Removed by Join Filter: 99940012
   ->  Seq Scan on t2 a  (cost=0.00..184.00 rows=10000 width=36) (actual time=0.016..0.681 rows=10000 loops=1)
   ->  Materialize  (cost=0.00..234.00 rows=10000 width=36) (actual time=0.000..0.157 rows=10000 loops=10000)
         ->  Seq Scan on t2 b  (cost=0.00..184.00 rows=10000 width=36) (actual time=0.003..1.001 rows=10000 loops=1)
 Planning Time: 0.483 ms
 Execution Time: 14399.476 ms

While the equivalent query takes duckdb about 19s on my machine.

CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point(i, i) as pos FROM generate_series(1, 10000) x(i);

SELECT
     a.id AS city_a,
     b.id AS city_b,
     ST_Distance(
         a.pos,
         b.pos
     ) AS distance_degrees
 FROM t2 a JOIN t2 b ON a.id <> b.id
 WHERE ST_Distance(
     a.pos,
     b.pos
 ) <= 5;


----
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT     a.id AS city_a,     b.id AS city_b,     ST_Distance(         a.pos,         b.pos     ) AS distance_degrees FROM t2 a JOIN t2 b ON a.id <> b.id WHERE ST_Distance(     a.pos,     b.pos ) <= 5;
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 19.70s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│           city_a          │
│           city_b          │
│      distance_degrees     │
│                           │
│         59988 Rows        │
│          (0.01s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ (ST_Distance(pos, pos) <= │
│            5.0)           │
│                           │
│         59988 Rows        │
│          (19.38s)         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      NESTED_LOOP_JOIN     │
│    ────────────────────   │
│      Join Type: INNER     │
│    Conditions: id != id   ├──────────────┐
│                           │              │
│       99990000 Rows       │              │
│          (0.18s)          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         TABLE_SCAN        ││         TABLE_SCAN        │
│    ────────────────────   ││    ────────────────────   │
│         Table: t2         ││         Table: t2         │
│   Type: Sequential Scan   ││   Type: Sequential Scan   │
│                           ││                           │
│        Projections:       ││        Projections:       │
│             id            ││             id            │
│            pos            ││            pos            │
│                           ││                           │
│         10000 Rows        ││         10000 Rows        │
│          (0.00s)          ││          (0.00s)          │
└───────────────────────────┘└───────────────────────────┘

Now Postgis being about 5s faster (for 10000 rows) is still significant, and my guess as to why is that postgis has implemented ST_Distance natively while we still have to call out into GEOS's C-API, allocating and copying a bunch of memory back and forth. There are a lot of optimizations we can do here to reduce the overhead, in this case in particular because all geometries are just points.

In fact, if we change the table slightly and create duckdb POINT_2D instead of GEOMETRY types, using

CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point2d(i, i) as pos FROM generate_series(1, 10000) x(i);

The query finishes in less than second:

EXPLAIN ANALYZE SELECT
     a.id AS city_a,
     b.id AS city_b,
     ST_Distance(
         a.pos,
         b.pos
     ) AS distance_degrees
 FROM t2 a JOIN t2 b ON a.id <> b.id
 WHERE ST_Distance(
     a.pos,
     b.pos
) <= 5;

----
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT     a.id AS city_a,     b.id AS city_b,     ST_Distance(         a.pos,         b.pos     ) AS distance_degrees FROM t2 a JOIN t2 b ON a.id <> b.id WHERE ST_Distance(     a.pos,     b.pos ) <= 5;
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.949s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│           city_a          │
│           city_b          │
│      distance_degrees     │
│                           │
│         59988 Rows        │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ (ST_Distance(pos, pos) <= │
│            5.0)           │
│                           │
│         59988 Rows        │
│          (0.61s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      NESTED_LOOP_JOIN     │
│    ────────────────────   │
│      Join Type: INNER     │
│    Conditions: id != id   ├──────────────┐
│                           │              │
│       99990000 Rows       │              │
│          (0.19s)          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         TABLE_SCAN        ││         TABLE_SCAN        │
│    ────────────────────   ││    ────────────────────   │
│         Table: t2         ││         Table: t2         │
│   Type: Sequential Scan   ││   Type: Sequential Scan   │
│                           ││                           │
│        Projections:       ││        Projections:       │
│             id            ││             id            │
│            pos            ││            pos            │
│                           ││                           │
│         10000 Rows        ││         10000 Rows        │
│          (0.00s)          ││          (0.00s)          │
└───────────────────────────┘└───────────────────────────┘

So conclusion: we should optimize the ST_Distance variant for GEOMETRY in the same way as we have done for POINT_2D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants