Sporadic and seemingly random crashes on different setups but same issue #13732
Comments
Update: For future reference: Keeping this issue open anyway because the bug is not isolated to this case.
It should, you can try it out and check the |
regarding your original issue: are you ok to experiment with 1 of your crashing nodes? Don't want to cause unexpected downtime/further issues for you if you are not ok with it. The things I would suggest to try next are:
|
I did all the steps, but instead of removing everything I removed only the corrupted snapshot files from 19000 to 19100. Everything looks fine, except that it has been stuck at 100% downloading for the last 25 minutes.
What happens after a restart?
Everything looks good, logs:
After one restart it's still stuck on [1/12 Snapshots] download with progress 100%.
Update: it got unstuck after 40 minutes. I'll wait for it to proceed and update if there is another problem. I added the flags you suggested on the other machine. I will report back if it generates a .prof file.
@Jeko83 ok, just note that the logic for Edit: Make sure you see the |
I pulled the changes and now the git log command gives back:
Build info at erigon startup:
I'm starting this other erigon instance via:
Flags:
System info for this second fully synced machine:
OS: Ubuntu 24.04 noble
Even with this starting command and commit it doesn't look like I'm getting the requested
Thanks again for the help
@Jeko83 sorry, I made a mistake - it should be:
(without |
The Proxmox instance crashed at stage 4/12. The last lines of the log before the crash were:
I started this instance with the experimental flags as well, but no output file was produced, so the heap for this run was good. Another issue happened that I don't really know how to debug. Latest Go logs in the console:
The second bare-metal Ubuntu machine crashed as well and generated the erigon-mem.prof file; PNG here:
System information
Proxmox VE 8.3.3 (pve-manager/8.3.3/f157a38b211595d6 (running kernel: 6.8.12-8-pve))
Erigon running on an Ubuntu LXC Container in proxmox.
Info on the LXC Container:
OS: Ubuntu 24.04 noble
Kernel: x86_64 Linux 6.8.12-8-pve (not modded)
Shell: bash 5.2.21
CPU: AMD Ryzen 9 7900X3D 12-Core @ 20x 5.66GHz (10 cores assigned to container)
Disk: 6.1T assigned zfs raid0 (out of 8T), nvme pci gen4 (2x4T) (Storage is ZFS in raid0 configuration. I know this is not optimal, but for now I don't have much choice. No errors or fragmentation though, and everything is running fine).
Page size for this ZFS pool has been set to 16K.
RAM: 40GB assigned (out of 64GB), 6400MHZ DDR5
Erigon version:
./erigon --version
erigon version 2.61.0-3ea0dd41
OS & Version: Windows/Linux/OSX
LXC Container with Ubuntu 24.04 noble, see above
Commit hash:
3ea0dd4
Erigon Command (with flags/config):
Consensus Layer:
Prysm version v5.2.0
beacon-chain version Prysm/v5.2.0/ac1717f1e44bd218b0bd3af0c4dec951c075f462
Consensus Layer Command (with flags/config):
Chain/Network:
Mainnet
Expected behaviour
Erigon runs without sporadic, random crashes.
Actual behaviour
Erigon crashes randomly and at different points in time.
Steps to reproduce the behaviour
Wait until Erigon crashes. I'm processing each transaction in each block via some of OTS' RPC methods and fetching each transaction's data via normal RPC calls. On top of this, other requests are made if certain conditions are met. I estimate around 100-400+ requests for each ETH block.
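For anyone trying to reproduce the load, the per-block request pattern described above could be approximated with a sketch like the one below. It is not the reporter's actual client. The endpoint URL, the choice of the three per-transaction methods, and the helper names (`build_requests`, `rpc_call`, `replay_block`) are all assumptions made for illustration, and the `ots_` call only works if the `ots` namespace is enabled on the node's HTTP API.

```python
import json
import urllib.request

# Assumption: Erigon's HTTP RPC is reachable here; adjust to your setup.
RPC_URL = "http://localhost:8545"

# Assumption: the issue does not say which methods are called per transaction.
# These are two standard eth_ calls plus one Otterscan (ots_) call as stand-ins.
PER_TX_METHODS = (
    "eth_getTransactionByHash",
    "eth_getTransactionReceipt",
    "ots_getInternalOperations",
)


def build_requests(tx_hashes, methods=PER_TX_METHODS):
    """Return one JSON-RPC 2.0 request object per (transaction, method) pair."""
    requests = []
    for tx_hash in tx_hashes:
        for method in methods:
            requests.append({
                "jsonrpc": "2.0",
                "id": len(requests) + 1,
                "method": method,
                "params": [tx_hash],
            })
    return requests


def rpc_call(payload, url=RPC_URL, timeout=30):
    """POST a single JSON-RPC payload and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def replay_block(block_number, url=RPC_URL):
    """Fetch a block's transaction hashes, then issue every per-tx request."""
    block = rpc_call({
        "jsonrpc": "2.0",
        "id": 0,
        "method": "eth_getBlockByNumber",
        "params": [hex(block_number), False],  # False: return hashes only
    }, url)
    requests = build_requests(block["result"]["transactions"])
    for payload in requests:
        rpc_call(payload, url)
    return len(requests)


if __name__ == "__main__":
    # Hypothetical block number, used only to show the call shape.
    print(replay_block(19_000_000))
```

With three methods per transaction, a block of 100 transactions produces 300 requests, which falls inside the range estimated above. Running `replay_block` over consecutive blocks gives a rough stand-in for the sustained load under which the crashes were observed.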
Logs:
Restarted 3-4 times after the first crash but then it kept crashing.
erigon.log
Other Info:
Discord message and attached thread here
Erigon was running in archive mode (default) on all the machines where this bug presented itself.
In this instance and setup the process crashes during the Snapshots indexing phase (the first phase), but on the other machines it crashed randomly (after a full sync). We can start with this setup and, once this is hopefully fixed, move on to the other machines (different setups, but the bug is very similar).
Thanks in advance for the hard work and the help. I would really appreciate it, because I have never managed to keep an Erigon instance stable for more than a day to a week.