Reading a BP5 series (groupBased) using 120+ GB memory, any fix around it? #1724
-
Hello! I have run a large simulation which in turn produced a huge data file: the `.bp` directory that is created is around 122 GB in size. I was accessing it using:

```python
import openpmd_api as io

series = io.Series(file_path, io.Access.read_only)
```

The RAM usage spikes to 120 GB, which kind of makes sense given the size of the file. Without performing any additional statistics on the data, the RAM used already reaches 120 GB: as soon as the Python code hits the `io.Series(...)` line, it stays there reading the series and reaches 120 GB of RAM without progressing to the next lines of code. Is there a way to not use this much memory? My machine is maxed out at 134 GB of RAM. Is there a method for lazy loading, so that the corresponding data is accessed only as needed? It seems that just opening the series as I did loads up everything, even though I only want to process, say, every 50th or 100th turn's data, while the data is written per turn, if that makes sense. Any suggestions will be appreciated!
Replies: 3 comments · 19 replies
-
Hello,
You can verify 2. and 3. using bpls:
In this combination (group-based iteration encoding written with the ADIOS2 BP5 engine), BP5 has previously been observed to produce metadata output whose size grows quadratically with the number of steps created, due to the assumptions it makes for its serialization. This means that if you write lots of Iterations, this exact situation may occur.

At the moment, the recommended alternative is to use file-based encoding, which creates a new file for each output step. This can be done by including an expansion pattern in the filename (see the sketch below).

For converting your existing dataset to file-based encoding, we will have to use an open mode in ADIOS2 that does not try to consume all metadata at once. There is one slight problem with that: due to the difficulty of correctly associating attributes with steps in that read mode, openPMD-api 0.16 no longer supports it on group-based files, meaning that you would have to temporarily downgrade to openPMD-api 0.15 for the conversion.
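For illustration, a minimal sketch of such a filename pattern (the path is a placeholder; in an ImpactX run the output name is controlled through ImpactX's diagnostics settings rather than by calling openPMD-api directly):

```python
import openpmd_api as io

# The %T placeholder is the expansion pattern: it is replaced by the
# iteration number, so each output step is written to its own file
# (e.g. monitor_50.bp, monitor_100.bp, ...) instead of one large
# group-based .bp directory that accumulates all metadata.
series = io.Series("diags/openPMD/monitor_%T.bp", io.Access.create)
```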
I'm sorry for the trouble that this may cause. I will add a fix in the upcoming patch release so that default configurations no longer create files that are unreadable in certain read modes.
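To illustrate what such a step-wise read mode looks like from Python, a minimal sketch (the path and the every-50th-turn filter are placeholders; per the note above, reading a group-based BP5 file this way requires openPMD-api 0.15):

```python
import openpmd_api as io

# Linear reading parses one iteration at a time instead of loading all
# metadata up front, so memory use stays bounded while stepping through
# the series.
series = io.Series("diags/openPMD/monitor.bp", io.Access.read_linear)
for it in series.read_iterations():
    if it.iteration_index % 50 != 0:
        continue  # skipped iterations are closed automatically
    # ... load and process only what is needed for this turn ...
    it.close()
```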
-
Hi @OLuckyG,

Thank you for your report! I am co-developing both ImpactX and openPMD-api and will try to add a few more details, together with those already shared by @franzpoeschel, to see if we can get this figured out. I am now tracking this as an issue report in BLAST-ImpactX/impactx#868 as well.

First of all, to fully understand your problem: Do you mind sharing your full analysis routines? Is the memory already spiking just at the `io.Series(...)` call? If you don't mind, can you share a reproducer, e.g., the ImpactX input (or a simplified version) and a demonstrator of the analysis? How many turns (outputs) and particles are in your >120 GB simulation?

My guess currently is that your memory is not spiking on […]. Otherwise, I can guide you in the meantime to change the output mode in your ImpactX file to use another format that does not have this issue.

Also, to give you quick relief: are you aware of the new […]?
@franzpoeschel FYI: we default to group-based encoding, so that might be part of the problem with BP5.
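One way to answer the question of whether the memory already spikes at the `io.Series(...)` call alone is to record the peak resident memory right after opening the series, before touching any data. A minimal sketch with a placeholder path (on Linux, `ru_maxrss` is reported in KiB):

```python
import resource

import openpmd_api as io

file_path = "diags/openPMD/monitor.bp"  # placeholder

# Open the series, then report the peak resident set size of this process.
series = io.Series(file_path, io.Access.read_only)
peak_gib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2
print(f"peak RSS right after opening the series: {peak_gib:.1f} GiB")
```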
-
I have replicated a workflow similar to yours and I think I can verify that this is a metadata issue in BP5. I used PIConGPU to create 10000 output steps, once with variable-based encoding and once with group-based encoding: […]
The more than 60 GB size difference is entirely in metadata. For your upcoming runs, either: […]
For your existing data, either: […]
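As a rough sketch of one conversion route, the loop below copies the particle data of a group-based series into a file-based one, step by step. Paths are placeholders; per the note earlier in the thread, opening the group-based BP5 input in linear read mode is assumed to happen with openPMD-api 0.15. Record attributes (`unitSI`, `unit_dimension`, ...), particle patches, and mesh data are not copied here; the `openpmd-pipe` tool that ships with openPMD-api does this generically and is likely the easier route.

```python
import openpmd_api as io

src = io.Series("diags/openPMD/monitor.bp", io.Access.read_linear)
dst = io.Series("diags/openPMD/monitor_%T.bp", io.Access.create)

for in_it in src.read_iterations():
    out_it = dst.write_iterations()[in_it.iteration_index]

    queued = []  # (species, record, component, numpy buffer)
    for sp_name in in_it.particles:
        species = in_it.particles[sp_name]
        for rec_name in species:
            record = species[rec_name]
            for comp_name in record:
                # load_chunk() allocates the buffer now; it is filled when
                # this iteration is flushed/closed below.
                queued.append((sp_name, rec_name, comp_name,
                               record[comp_name].load_chunk()))
    in_it.close()  # performs the queued reads for this step only

    for sp_name, rec_name, comp_name, data in queued:
        out = out_it.particles[sp_name][rec_name][comp_name]
        out.reset_dataset(io.Dataset(data.dtype, data.shape))
        out.store_chunk(data)
    out_it.close()  # writes this step into its own file

dst.close()
src.close()
```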