Using Mapchete #665

Open
normanb opened this issue Jan 16, 2025 · 1 comment
Comments

@normanb

normanb commented Jan 16, 2025

I have been using mapchete to process data at scale (most of North America). Thank you for the project!

I have a couple of questions and observations.

The initial process tiles takes a long time if the bounds are not set in the mapchete file. I have to break up my initial North America bounds into smaller regions when doing testing. Is that to be expected or am I doing something wrong? For dev I just process one tile at a time.

 mapchete execute --debug -v --workers 6 --concurrency threads --overwrite my.mapchete

I think ^^ is the right way to call mapchete execute, but if I just specify workers what is the default concurrency? I know when I set it to processes I did spawn a lot of child python processes.

I am outputting vectors, registering the output format was simple, but the kwargs in https://github.com/ungarj/mapchete/blob/main/mapchete/io/vector/write.py#L137 do not get passed into fiona_write which restricts additional fiona supported vector formats.

The docs have not kept up with the code changes. I had to read a lot of code to get started.

I appreciate PRs are the way to improve this, point me in the right direction of where to start!

@ungarj
Owner

ungarj commented Jan 17, 2025

Hi @normanb and thanks for the feedback!

> The initial process tiles takes a long time if the bounds are not set in the mapchete file.

Under certain circumstances this could indeed take a while and there could be a couple of reasons for it:

  • Determining the process tiles requires checking for all existing output tiles that intersect the current process area, unless --overwrite is used, in which case this check is omitted. If this area is large and/or the zoom level is high, there will be a lot of tiles to check.
  • The process area is the union of the coverage of all input datasets (i.e. their bounds in the dataset CRS), but if one of them does not have such bounds (e.g. if the input is a TileDirectory), then the process area will be global and thus many process tiles will intersect.
  • Checking for existing output tiles which are on S3 may take longer as potentially many requests have to be made. We do try to optimize it by listing the contents of the row paths to check multiple tiles at once but still this can take a while.
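The row-listing idea from the last point can be illustrated with a minimal sketch. This is not mapchete's implementation: the function name `existing_tiles` and the `zoom/row/col.ext` directory layout are assumptions made for illustration, and a local directory stands in for S3.

```python
import os
import tempfile
from collections import defaultdict


def existing_tiles(base_dir, tiles, ext="tif"):
    """Return the (zoom, row, col) tiles that already exist under base_dir.

    Hypothetical helper: assumes tiles are stored as zoom/row/col.ext.
    """
    # Group requested columns by their row directory.
    by_row = defaultdict(set)
    for zoom, row, col in tiles:
        by_row[(zoom, row)].add(col)

    found = set()
    for (zoom, row), cols in by_row.items():
        row_dir = os.path.join(base_dir, str(zoom), str(row))
        try:
            # One listing per row instead of one existence check per tile.
            names = set(os.listdir(row_dir))
        except FileNotFoundError:
            continue
        for col in cols:
            if f"{col}.{ext}" in names:
                found.add((zoom, row, col))
    return found


with tempfile.TemporaryDirectory() as tmp:
    row_dir = os.path.join(tmp, "5", "6")
    os.makedirs(row_dir)
    for col in (8, 9):
        open(os.path.join(row_dir, f"{col}.tif"), "w").close()
    wanted = [(5, 6, 8), (5, 6, 9), (5, 6, 10), (5, 7, 8)]
    done = existing_tiles(tmp, wanted)
    print(sorted(done))
```

Here four tiles are checked with two directory listings. On object storage each listing is a request, so the saving grows with the number of tiles per row.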

Given that you use the overwrite option, I assume that your process covers a lot of process tiles. One possible solution is to increase the tile size by playing with the metatiling setting. This works for process tiles as well as for output tiles (but keep in mind the output tiles cannot be larger than the process tiles). So for example a metatiling of 2 would combine 2x2 tiles into one, which effectively means there is only a fourth of the number of tiles to be dealt with. A rule of thumb is that the process tiles should be as large as the machine can handle memory-wise, as this will also speed up processing drastically.
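To make the arithmetic concrete, here is a rough sketch that counts the tiles intersecting a bounding box with and without metatiling. It does not use mapchete itself: `count_tiles` is a hypothetical helper, and it assumes a simplified geodetic pyramid with 2x1 tiles of 180 degrees at zoom 0, where the tile edge halves with every zoom level.

```python
import math


def count_tiles(bounds, zoom, metatiling=1):
    """Count tiles of a simplified geodetic pyramid intersecting bounds.

    Assumption: zoom 0 has 2x1 tiles of 180 degrees each, every zoom level
    halves the tile edge, and metatiling multiplies the tile edge length.
    """
    left, bottom, right, top = bounds
    size = 180.0 / 2**zoom * metatiling
    first_col = math.floor((left + 180.0) / size)
    last_col = math.ceil((right + 180.0) / size)
    first_row = math.floor((90.0 - top) / size)
    last_row = math.ceil((90.0 - bottom) / size)
    return (last_col - first_col) * (last_row - first_row)


# Example bounds (left, bottom, right, top), chosen to align with the grid.
bounds = (-135.0, 22.5, -67.5, 56.25)
single = count_tiles(bounds, zoom=5)
meta = count_tiles(bounds, zoom=5, metatiling=2)
print(single, meta)  # 72 18
```

With bounds that align with the grid the count drops to exactly a quarter. For arbitrary bounds the reduction is only approximate, because partially covered metatiles at the edges still count as whole tiles.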

> I think ^^ is the right way to call mapchete execute, but if I just specify workers what is the default concurrency? I know when I set it to processes I did spawn a lot of child python processes.

There is no single right way; it all depends on the type of process :). For testing, you can set concurrency to none and turn on debug to see what is going on. The default concurrency is processes.

> I am outputting vector tiles, registering the output format was simple, but the kwargs in https://github.com/ungarj/mapchete/blob/main/mapchete/io/vector/write.py#L137 do not get passed into fiona_write which restricts additional fiona supported vector formats.

Which vector formats are you interested in? We could certainly extend the vector driver so it would accept any fiona based output format or at the very least enable the desired format. The current way is not really optimal for this but I am all in for improvements.

> The docs have not kept up with the code changes. I had to read a lot of code to get started.

Sorry for that, we know we are a bit behind in updating the docs. Which parts specifically were lacking from your point of view?

> I appreciate PRs are the way to improve this, point me in the right direction of where to start!

We do too :). It depends on where you want to start and what you are comfortable with, be it code and/or documentation. What should we address first?
