Repeated CI failures on Windows #238
@mkitti for error 3 in particular, do you have an idea of where I should check PyJulia? It almost looks like Python garbage collected the pointer to the Julia runtime, which is strange.
What changed?
So I have seen a few of these on and off for a while, especially on Windows. However, the rate has gone up recently. Perhaps this is because I have added more unit tests over time and tested more complex functionality (e.g., LoopVectorization.jl), so there is cumulatively a higher chance of each error occurring. I am really not sure what causes errors 1 and 3, though. Errors 2 and 4 seem doable to debug but appear more related to CI than to the code itself, so I am mostly worried about 1 and 3.
I wonder if it has to do with the `_LIBJULIA` module-level variable, i.e., maybe the fix is:

```diff
 def get_libjulia():
+    global _LIBJULIA
     return _LIBJULIA
```

(A fuller sketch of this caching pattern follows below.)
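A minimal sketch of the caching pattern being discussed, assuming PyJulia stores the library handle in a module-level `_LIBJULIA` variable; the loader function and its path handling here are illustrative, not PyJulia's actual API. Note that in Python, `global` is only needed when a function *assigns* to a module-level name; a plain read already resolves to it.

```python
import ctypes

_LIBJULIA = None  # module-level cache; a live reference here keeps the ctypes handle alive


def set_libjulia(path):
    """Load libjulia once and cache the handle at module scope (illustrative only)."""
    global _LIBJULIA  # needed because we assign to the module-level name
    _LIBJULIA = ctypes.CDLL(path)


def get_libjulia():
    # Reading a module-level name does not require `global`, so the suggested
    # `global _LIBJULIA` only changes behavior if this function also assigns
    # to _LIBJULIA somewhere.
    return _LIBJULIA
```

If the handle were instead held on an object that gets garbage collected, the pointer to the Julia runtime could indeed go stale, which is the failure mode speculated about above.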
Edit: looks like the access error in particular was introduced between these two commits: https://github.com/MilesCranmer/PySR/compare/c97f60de90203bd5091c3f49e031f49b17a0c6fa..da0bef974b69dc9215a0986145c53f5f7f4462a9. Maybe it has to do with setting ...
Nope; neither the ... It seems like the access errors first show up in ...
I can't reproduce the errors on a local copy of Windows (in Parallels) with Python 3.10 and Julia 1.8.3. I wonder if the GitHub action is just running out of memory or something...
Running out of memory would definitely put pressure on the garbage collector.
Indeed, I think it is an overuse of memory from some sort of garbage not being properly collected from threads: I was launching searches repeatedly from IPython, and at one point there was 10 GB allocated in RAM. Even when I set ..., the memory was not freed. The short-term solution is to split the CI into separate launches of Python, so that memory is forced to clear after multiple tests (see the sketch below). The long-term solution is to debug exactly why memory is not being freed. Perhaps it has something to do with jobs being added to this list through the use of ...
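As a rough illustration of that short-term idea (not the actual PySR CI configuration), each test subset could be driven from its own Python process, so the interpreter, and with it the embedded Julia runtime, is torn down between subsets. The subset names below are placeholders.

```python
import subprocess
import sys

# Placeholder subset names; PySR's real test layout may differ.
TEST_SUBSETS = [
    "test_subset_1",
    "test_subset_2",
    "test_subset_3",
]

for subset in TEST_SUBSETS:
    # Each run starts a fresh interpreter, so memory cannot accumulate across subsets.
    result = subprocess.run([sys.executable, "-m", "unittest", subset])
    if result.returncode != 0:
        sys.exit(result.returncode)
```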
Edit: it seems like there isn't actually a memory leak; it's just the JIT cache.
Even after splitting it into 10 different subsets of tests, segfaults still occur: https://github.com/MilesCranmer/PySR/actions/runs/3752052933.
Got some cloud compute to try to debug this. It looks like the test triggering the series of access violations is the one at lines 300 to 317 in d045586.
Updates: ...
The poster in #266 confirmed that multiprocessing got rid of their issue, so it seems like a data race issue. I wonder if this is because ... One possible solution is to implement a task handler that will safely kill tasks, as described here: https://discourse.julialang.org/t/how-to-kill-thread/34236/8 (a rough sketch of the idea follows below).
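The linked Discourse thread describes this pattern in Julia; the following is only a rough Python analogue of the same cooperative-cancellation idea, with all names illustrative: workers poll a shared stop flag instead of being killed from the outside mid-operation.

```python
import threading
import time

stop_event = threading.Event()


def worker(worker_id):
    # Cooperative check: the worker only exits between units of work,
    # so it is never interrupted mid-write.
    while not stop_event.is_set():
        time.sleep(0.1)  # stand-in for one unit of search work
    print(f"worker {worker_id} exited cleanly")


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()

time.sleep(1.0)
stop_event.set()  # request shutdown; each worker finishes its current step first
for t in threads:
    t.join()
```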
Presumably fixed by #535 |
Many of the Windows tests are now failing with various segmentation faults, which appear to be randomly triggered:
They seem to occur more frequently on older versions of Julia, and rarely on Julia 1.8.3. Regardless, a segfault anywhere is cause for concern and should be tracked down.
The errors include (log excerpts omitted): several distinct failures seen on Windows, at least one of which also occurs on Ubuntu sometimes.
One other curious thing is that one of these errors is raised on some Windows tests (https://github.com/MilesCranmer/PySR/actions/runs/3664894286/jobs/6195713513), but this should not take place...