Using Mapchete #665

normanb · 2025-01-16T17:13:23Z

I have been using mapchete to process data at scale (most of North America). Thank you for the project!

I have a couple of questions and observations.

The initial process tiles takes a long time if the bounds are not set in the mapchete file. I have to break up my initial North America bounds into smaller regions when doing testing. Is that to be expected or am I doing something wrong? For dev I just process one tile at a time.

 mapchete execute --debug -v --workers 6 --concurrency threads --overwrite my.mapchete

I think ^^ is the right way to call mapchete execute, but if I just specify workers what is the default concurrency? I know when I set it to processes I did spawn a lot of child python processes.

I am outputting vectors, registering the output format was simple, but the kwargs in https://github.com/ungarj/mapchete/blob/main/mapchete/io/vector/write.py#L137 do not get passed into fiona_write which restricts additional fiona supported vector formats.

The docs have not kept up with the code changes. I had to read a lot of code to get started.

I appreciate PRs are the way to improve this, point me in the right direction of where to start!


ungarj · 2025-01-17T07:37:47Z

Hi @normanb and thanks for the feedback!

The initial process tiles takes a long time if the bounds are not set in the mapchete file.

Under certain circumstances this could indeed take a while and there could be a couple of reasons for it:

Determining the process tiles requires checking for all existing output tiles intersecting with the current process area, unless --overwrite is used, in which case this check is skipped. If this area is large and/or the zoom level is high, there will be a lot of tiles to check.
The process area is the union of the coverage of all input datasets (i.e. their bounds in the dataset CRS), but if one of them does not have such bounds (i.e. if the input is a TileDirectory), then the process area will be global and thus many process tiles will intersect.
Checking for existing output tiles on S3 may take longer, as potentially many requests have to be made. We try to optimize this by listing the contents of the row paths to check multiple tiles at once, but it can still take a while.

Given that you use the overwrite option, I assume that your process covers a lot of process tiles. One possible solution is to increase the tile size by playing with the metatiling setting. This works for process tiles as well as for output tiles (but keep in mind that the output tiles cannot be larger than the process tiles). For example, a metatiling of 2 combines 2x2 tiles into one, which effectively means there is only a quarter of the number of tiles to deal with. A rule of thumb is that the process tiles should be as large as the machine can handle memory-wise, as this will also speed up processing drastically.
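To get a rough feel for the numbers, here is a minimal sketch that estimates how many (meta)tiles intersect a bounding box. It does not use mapchete itself; it assumes a geodetic (EPSG:4326) grid with 2x1 tiles at zoom 0, and the function name `count_tiles` and the example bounds are made up for illustration.

```python
import math


def count_tiles(bounds, zoom, metatiling=1):
    """Estimate the number of (meta)tiles intersecting bounds.

    Assumption: geodetic grid with 2 columns x 1 row at zoom 0, so a
    single tile spans 180 / 2**zoom degrees. Not taken from mapchete.
    """
    left, bottom, right, top = bounds
    # Side length of one metatile in degrees.
    tile_size = 180.0 / 2**zoom * metatiling
    # Columns are counted from the left edge of the grid (-180).
    cols = math.ceil((right + 180.0) / tile_size) - math.floor(
        (left + 180.0) / tile_size
    )
    # Rows are counted from the top edge of the grid (+90).
    rows = math.ceil((90.0 - bottom) / tile_size) - math.floor(
        (90.0 - top) / tile_size
    )
    return cols * rows


if __name__ == "__main__":
    global_bounds = (-180.0, -90.0, 180.0, 90.0)
    # Hypothetical regional bounds, only for comparison.
    region_bounds = (-125.0, 24.0, -66.0, 50.0)
    for metatiling in (1, 2, 4):
        print(
            metatiling,
            count_tiles(global_bounds, 10, metatiling),
            count_tiles(region_bounds, 10, metatiling),
        )
```

Under these assumptions, doubling the metatiling cuts the global tile count to a quarter, and restricting the bounds reduces it much further, which is why a process without bounds can end up with so many tiles to check.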

I think ^^ is the right way to call mapchete execute, but if I just specify workers what is the default concurrency? I know when I set it to processes I did spawn a lot of child python processes.

There is no single right way; it all depends on the type of process :). For testing, you can set concurrency to none and turn on debug to see what is going on. The default concurrency is processes.
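As a generic illustration of what the three concurrency modes mean, here is a small sketch using only the Python standard library. It is not mapchete's implementation; `process_tile` and `run_tiles` are hypothetical names and the per-tile work is a stand-in.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def process_tile(tile_id):
    # Stand-in for real per-tile work.
    return tile_id * tile_id


def run_tiles(tile_ids, workers=4, concurrency="processes"):
    """Run process_tile over tile_ids with the chosen concurrency mode."""
    if concurrency == "none":
        # Sequential execution in the current process: easiest to debug.
        return [process_tile(tile_id) for tile_id in tile_ids]
    executors = {
        # Threads share one Python process.
        "threads": ThreadPoolExecutor,
        # Each worker is a separate child Python process.
        "processes": ProcessPoolExecutor,
    }
    with executors[concurrency](max_workers=workers) as executor:
        return list(executor.map(process_tile, tile_ids))


if __name__ == "__main__":
    print(run_tiles(range(5), concurrency="none"))
    print(run_tiles(range(5), workers=2, concurrency="threads"))
```

With a process pool, every worker is its own Python process, which would explain seeing many child python processes when concurrency is set to processes.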

I am outputting vector tiles, registering the output format was simple, but the kwargs in https://github.com/ungarj/mapchete/blob/main/mapchete/io/vector/write.py#L137 do not get passed into fiona_write which restricts additional fiona supported vector formats.

Which vector formats are you interested in? We could certainly extend the vector driver so it would accept any fiona based output format or at the very least enable the desired format. The current way is not really optimal for this but I am all in for improvements.

The docs have not kept up with the code changes. I had to read a lot of code to get started.

Sorry for that, we know we are a bit behind in updating the docs. Which parts specifically were lacking from your point of view?

I appreciate PRs are the way to improve this, point me in the right direction of where to start!

We do too :). It depends on where you want to start and what you are comfortable with, be it code and/or documentation. What should we address first?
