Optimize the fault recovery of the jvm oom #875
Labels
chaosblade-exec-jvm
chaosblade-exec-jvm project
Comments
If we do need this optimization, I can take on the work.
For option 1, could you provide a more detailed design?
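Is this issue still being followed up? This approach looks good to me, and I look forward to seeing the capability implemented. To limit the impact, could OOM get an additional parameter of its own rather than reusing the timeout parameter, for example faultRecoveryTimeout?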
Issue Description
Optimize the fault recovery of the jvm oom
Describe what happened (or what feature you want)
JVM OOM experiments, heap OOM in particular, currently have a fault-recovery problem. The only way to recover is to issue the destroy command. In practice, however, once the fault is injected the JVM runs a large number of full GCs (this also happens with wild-mode=false), and the sandbox can no longer receive and process commands properly. In fact, the sandbox's Jetty threads do not set an uncaughtExceptionHandler, so processing a command triggers an OOM and the thread exits. At that point the only way to recover is to restart the JVM.
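To illustrate the uncaughtExceptionHandler point, here is a minimal sketch, assuming a handler is used as the place to free injected memory when a command-processing thread dies from an OOM. The class `OomAwareThreads` and the `releaseHook` parameter are hypothetical names for illustration and are not part of the sandbox or ChaosBlade. Note that a handler cannot keep the thread alive; it only runs on the dying thread before it terminates.

```java
// Hypothetical sketch; not the sandbox's actual thread setup.
final class OomAwareThreads {
    private OomAwareThreads() {}

    /**
     * Starts a thread whose uncaught-exception handler runs releaseHook
     * when the thread is terminated by an OutOfMemoryError.
     */
    static Thread start(String name, Runnable task, Runnable releaseHook) {
        Thread t = new Thread(task, name);
        t.setUncaughtExceptionHandler((thread, error) -> {
            if (error instanceof OutOfMemoryError) {
                // Free memory first; anything that allocates (such as
                // building a log message) may fail while the heap is full.
                releaseHook.run();
            }
            System.err.println(thread.getName() + " terminated: " + error);
        });
        t.start();
        return t;
    }
}
```

A caller would pass the command-processing work as `task` and a callback that drops the injected memory as `releaseHook`.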
Describe what you expected to happen
The fault should recover automatically, without a restart.
So far I have come up with a few possible optimizations:
1. In JVM OOM scenarios, add a ‘timeout’ action parameter, that is, actually pass the blade command's timeout parameter through to the sandbox (chaosblade currently handles timeout specially and does not pass it to the sandbox).
2. Start another thread that releases memory periodically.
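The timeout idea in option 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not ChaosBlade's actual OOM executor: the class `SelfRecoveringHeapFiller`, its methods, and the `recoveryTimeoutMillis` parameter are all hypothetical. The point is that the recovery task is scheduled inside the target JVM before memory is exhausted, so recovery does not depend on the sandbox still being able to receive a destroy command.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not ChaosBlade's actual OOM executor.
final class SelfRecoveringHeapFiller {
    // Blocks retained to put pressure on the heap; dropped on recovery.
    private final List<byte[]> retained = new ArrayList<>();
    // Daemon timer thread, so it never keeps the JVM alive on its own.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "oom-recovery-timer");
                t.setDaemon(true);
                return t;
            });
    private volatile CountDownLatch recovered = new CountDownLatch(0);

    /** Retains `blocks` arrays of `blockSizeBytes` and schedules their release. */
    synchronized void inject(int blocks, int blockSizeBytes, long recoveryTimeoutMillis) {
        recovered = new CountDownLatch(1);
        // Schedule first, so the timer thread is created while memory is still available.
        timer.schedule(this::recover, recoveryTimeoutMillis, TimeUnit.MILLISECONDS);
        for (int i = 0; i < blocks; i++) {
            retained.add(new byte[blockSizeBytes]);
        }
    }

    /** Drops all retained blocks; also usable as a manual destroy. */
    synchronized void recover() {
        retained.clear();
        System.gc(); // Only a hint; the JVM may ignore it.
        recovered.countDown();
    }

    synchronized int retainedBlocks() {
        return retained.size();
    }

    /** Waits until recovery has run; returns false on timeout or interrupt. */
    boolean awaitRecovery(long timeoutMillis) {
        try {
            return recovered.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Option 2 could reuse the same timer with `scheduleAtFixedRate` to release memory periodically instead of once. Whether the delay comes from the existing timeout flag or from a separate parameter is a design decision this sketch does not settle.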
How to reproduce it (as minimally and precisely as possible)
Tell us your environment
Anything else we need to know?