
Sporadic and seemingly random crashes on different setups but same issue #13732

Open

Jeko83 opened this issue Feb 7, 2025 · 10 comments

@Jeko83

Jeko83 commented Feb 7, 2025

System information

Proxmox VE 8.3.3 (pve-manager/8.3.3/f157a38b211595d6 (running kernel: 6.8.12-8-pve))
Erigon running in an Ubuntu LXC container on Proxmox.

Info on the LXC Container:
OS: Ubuntu 24.04 noble
Kernel: x86_64 Linux 6.8.12-8-pve (not modded)
Shell: bash 5.2.21
CPU: AMD Ryzen 9 7900X3D 12-Core @ 20x 5.66GHz (10 cores assigned to container)
Disk: 6.1T assigned ZFS raid0 (out of 8T), NVMe PCIe Gen4 (2x4T). (Storage is ZFS in a raid0 configuration. I know this is not optimal, but for now I don't have much choice. No errors or fragmentation though, and everything is running fine.)
The recordsize for this ZFS pool has been set to 16K.
RAM: 40GB assigned (out of 64GB), 6400MHz DDR5

Erigon version: ./erigon --version

erigon version 2.61.0-3ea0dd41

OS & Version:

Linux - LXC container with Ubuntu 24.04 noble, see above

Commit hash:

3ea0dd4

Erigon Command (with flags/config):

/home/erigon/erigon/build/bin/erigon \
  --snapshots \
  --torrent.download.rate=1024mb \
  --datadir=/home/erigon/erigonData \
  --http \
  --http.port=8545 \
  --ws \
  --ws.port=8546 \
  --http.api=eth,engine,debug,net,trace,web3,erigon,ots,txpool \
  --authrpc.addr=0.0.0.0 \
  --authrpc.vhosts=* \
  --http.corsdomain=* \
  --http.vhosts=* \
  --torrent.conns.perfile=3 \
  --db.read.concurrency=3 \
  --batchSize=16M \
  --rpc.batch.concurrency=2 \
  --db.pagesize=16k
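
One sanity check worth recording here, since --db.pagesize=16k is supposed to line up with the ZFS recordsize mentioned above (a sketch; the dataset name is a placeholder):

zfs get recordsize rpool/erigon    # expect VALUE = 16K for the dataset holding the datadir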

Consensus Layer:

Prysm version v5.2.0
beacon-chain version Prysm/v5.2.0/ac1717f1e44bd218b0bd3af0c4dec951c075f462

Consensus Layer Command (with flags/config):

./consensus/prysm.sh beacon-chain \
  --datadir=/home/prysm/beacon \
  --execution-endpoint=http://192.168.0.102:8551 \
  --mainnet \
  --jwt-secret=/home/prysm/consensus/jwt.hex \
  --slots-per-archive-point=32 \
  --checkpoint-sync-url=https://mainnet-checkpoint-sync.attestant.io \
  --genesis-beacon-api-url=https://mainnet-checkpoint-sync.attestant.io

Chain/Network:

Mainnet

Expected behaviour

Erigon doesn't crash sporadically and seemingly at random.

Actual behaviour

Erigon crashes randomly and at different points in time.

Steps to reproduce the behaviour

Wait until Erigon crashes. I'm processing each transaction in each block via some of Otterscan's (ots) RPC methods and fetching each transaction's data via normal RPC calls. On top of this, other requests are made if certain conditions are met. I estimate around 100-400+ requests for each ETH block.
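
For context, the per-block request pattern is roughly the following (a sketch only; the endpoint matches the --http.port=8545 flag above, and the exact eth_/ots_ methods used are an assumption):

curl -s -X POST http://localhost:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"eth_getBlockByNumber","params":["latest",true]}'
# ...then, per transaction in the block, one or more follow-up calls such as:
curl -s -X POST http://localhost:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":2,"method":"eth_getTransactionReceipt","params":["0x<txhash>"]}'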

Logs:

Restarted 3-4 times after the first crash but then it kept crashing.
erigon.log

Other Info:

Discord message and attached thread here.
Erigon was running in archive mode (default) on all the machines where this bug presented itself.
In this instance and setup the process crashes at the Snapshots indexing phase (the first stage), but on the other machines it was crashing randomly (after a full sync). We can start with this setup and, once it is hopefully fixed, move on to the other machines (different setups, but the bug is very similar).
Thanks in advance for the hard work and the help. I would really appreciate it, because I have never managed to keep an Erigon instance stable for more than a day or a week.

@Jeko83
Author

Jeko83 commented Feb 7, 2025

Update:
Looks like one file in the ZFS raid0 storage is corrupted. This might explain the crashing at exactly the 18500 mark.
Can I safely delete this snapshot file so Erigon re-downloads it, or do I need to do anything else?
Thanks.

For future reference:
Command: zpool status -v

[Image: output of zpool status -v showing the corrupted snapshot file]

Keeping this issue open anyway because the bug is not isolated to this case.
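
For anyone landing here later, the usual ZFS sequence after removing a file flagged by zpool status -v is roughly the following (a sketch; <pool> is a placeholder):

zpool status -v <pool>    # lists files with permanent (unrecoverable) errors
# after deleting or restoring the affected files:
zpool scrub <pool>        # re-verify all data on the pool
zpool status <pool>       # wait for the scrub to finish
zpool clear <pool>        # reset the error counters once the pool is clean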

@taratorio
Member

taratorio commented Feb 7, 2025

It should; you can try it out and check the datadir/snapshots directory.
I would probably delete all files at the 18500 mark and later and try restarting.
If it doesn't re-download the files, you can stop it, run rm -rf datadir/downloader && rm datadir/snapshots/prohibit_new_downloads.lock, and start again.
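
As a shell sketch (the datadir comes from the reporter's command; the glob is illustrative and should be adjusted to the actual segment file names in the directory):

# stop erigon first
DATADIR=/home/erigon/erigonData
ls "$DATADIR"/snapshots              # inspect segment names before deleting
rm "$DATADIR"/snapshots/*018500*     # illustrative glob for files at the 18500 mark; repeat for later ranges
# if nothing re-downloads on restart:
rm -rf "$DATADIR"/downloader
rm "$DATADIR"/snapshots/prohibit_new_downloads.lock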

@taratorio
Member

taratorio commented Feb 7, 2025

regarding your original issue: are you ok to experiment with one of your crashing nodes? I don't want to cause unexpected downtime or further issues for you if you're not ok with it.

The things I would suggest trying next are:

  • git checkout the latest commit on the release/2.61 branch - commit cebcd1c - and build from source (make erigon); see the sketch after this list
  • run Erigon with 2 env vars set, e.g. SAVE_HEAP_PROFILE=true && HEAP_PROFILE_FILE_PATH=<yourdir>/erigon-mem.prof && ./build/bin/erigon --datadir ...
  • what this does: every 30 secs it checks whether your node is close to the OOM killer threshold and, if so, saves a heap profile to HEAP_PROFILE_FILE_PATH
  • if you've run Erigon with the above env vars correctly, you should see logs like [Experiment] heap profile threshold check and [Experiment] saving heap profile as near OOM filePath=<xxx>
  • when you successfully capture that profile file, please attach it to this issue or use go tool pprof -png erigon-mem.prof to generate a png to attach here
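
The checkout/build step as a shell sketch (the repo path is assumed from the reporter's messages):

cd /home/erigon/erigon
git fetch origin release/2.61
git checkout cebcd1c
make erigon
./build/bin/erigon --version    # should report a build at commit cebcd1c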

@Jeko83
Author

Jeko83 commented Feb 7, 2025

It should; you can try it out and check the datadir/snapshots directory. I would probably delete all files at the 18500 mark and later and try restarting. If it doesn't re-download the files, you can stop it, run rm -rf datadir/downloader && rm datadir/snapshots/prohibit_new_downloads.lock, and start again.

I did all the steps, but instead of removing everything I removed the corrupted snapshot files from 19000 to 19100. Everything looks fine, apart from the fact that the download has been stuck at 100% for the last 25 minutes.
It looks like it's still downloading something, but disk I/O (read and write) is basically at 0 and CPU usage is very low.

[Screenshots: download stuck at 100% with near-zero disk I/O and CPU usage]

@taratorio
Member

what happens after restart?

@Jeko83
Author

Jeko83 commented Feb 7, 2025

what happens after restart?

Everything looks good, logs:
erigon.log

After one restart it's still stuck on the [1/12 Snapshots] download at 100% progress.

Update: it got unstuck after 40 minutes. I'll wait for it to proceed and update if there is another problem.

I added the env vars you suggested on the other machine. Will report if it generates a .prof file.

@taratorio
Member

taratorio commented Feb 7, 2025

@Jeko83 ok, just note that the logic for SAVE_HEAP_PROFILE and HEAP_PROFILE_FILE_PATH is not available on Erigon 2.61 commit 3ea0dd4 (which you previously told me you are using) - that's why I suggested moving to the latest commit on the release/2.61 branch. Want to make sure you have updated the version?

Edit: Make sure you see the [Experiment] heap profile threshold check line in your logs
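
A quick way to confirm the hooks are active, assuming stdout is redirected to a log file (the reporter's attachments use the name erigon.log):

grep -F '[Experiment] heap profile threshold check' erigon.log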

@Jeko83
Author

Jeko83 commented Feb 7, 2025

@Jeko83 ok, just note that the logic for SAVE_HEAP_PROFILE and HEAP_PROFILE_FILE_PATH is not available on Erigon 2.61 commit 3ea0dd4 (which you previously told me you are using) - that's why I suggested moving to the latest commit on the release/2.61 branch. Want to make sure you have updated the version?

Edit: Make sure you see the [Experiment] heap profile threshold check line in your logs

I pulled the changes and git log now shows:
commit cebcd1c80663d83b32fc9c2e6db71c78ea7c7171 (HEAD -> release/2.61, origin/release/2.61), which should be correct.

Build info at erigon startup:
INFO[02-07|12:21:17.046] Build info git_branch=release/2.61 git_tag=v2.61.0-26-gcebcd1c git_commit=cebcd1c80663d83b32fc9c2e6db71c78ea7c7171

I'm starting this other erigon instance via:
HEAP_PROFILE_FILE_PATH=/home/erigon/memHEAPDebug/erigon-mem.prof && SAVE_HEAP_PROFILE=true && /home/erigon/erigon/build/bin/erigon ...

Flags:

--internalcl 
--snapshots 
--torrent.download.rate=512mb 
--datadir=/home/erigon/erigonData 
--http 
--http.port=8545 
--ws 
--ws.port=8546 
--http.api=eth,debug,net,trace,web3,erigon,ots,txpool 
--http.corsdomain=* 
--http.vhosts=* 
--torrent.conns.perfile=3 
--db.read.concurrency=3 
--batchSize=64M 
--rpc.batch.concurrency=1

System info for this second fully synced machine:

OS: Ubuntu 24.04 noble
Kernel: x86_64 Linux 6.8.0-45-generic
Shell: bash 5.2.21
Disk: 4T (almost full, ~35GB left - crashes were happening well before the storage filled up, so this is not the issue)
CPU: Intel Core i7-8700 @ 12x 4.6GHz
RAM: 39GB

Even with this start command and commit, I'm not getting the requested [Experiment] heap profile threshold check output. The folder /home/erigon/memHEAPDebug is also empty (no erigon-mem.prof was created).
I'm running the Erigon instance inside a tmux pane.

Thanks again for the help

@taratorio
Member

@Jeko83 sorry, I made a mistake - it should be:

HEAP_PROFILE_FILE_PATH=/home/erigon/memHEAPDebug/erigon-mem.prof SAVE_HEAP_PROFILE=true /home/erigon/erigon/build/bin/erigon ...

(without &&, just spaces)
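
The distinction matters because prefix assignments are placed in the command's environment, whereas VAR=value && cmd only sets an unexported shell variable that the child process never sees. A sketch using the paths from earlier in the thread:

# env vars apply to this single erigon invocation
SAVE_HEAP_PROFILE=true \
HEAP_PROFILE_FILE_PATH=/home/erigon/memHEAPDebug/erigon-mem.prof \
/home/erigon/erigon/build/bin/erigon --datadir=/home/erigon/erigonData ...

# by contrast, this sets a shell-local variable and erigon never inherits it:
# SAVE_HEAP_PROFILE=true && ./build/bin/erigon ...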

@Jeko83
Author

Jeko83 commented Feb 10, 2025

The Proxmox instance crashed while at stage 4/12. The last lines of the log before the crash were these:

[INFO] [02-09|18:01:03.939] [Experiment] heap profile threshold check alloc=5.2GB total=61.9GB
[INFO] [02-09|18:01:03.959] [txpool] stat                            pending=0 baseFee=0 queued=30000 alloc=5.2GB sys=8.7GB
[INFO] [02-09|18:01:04.005] [4/12 Execution] Executed blocks         number=14126125 blk/s=27.4 tx/s=4964.7 Mgas/s=423.9 gasState=0.10 batch=42.8MB alloc=5.2GB sys=8.7>
[INFO] [02-09|18:01:04.442] [mem] memory stats                       Rss=15.9GB Size=0B Pss=15.9GB SharedClean=6.1MB SharedDirty=0B PrivateClean=8.2GB PrivateDirty=7.7>
[INFO] [02-09|18:01:33.939] [Experiment] heap profile threshold check alloc=3.6GB total=61.9GB
[INFO] [02-09|18:01:34.039] [4/12 Execution] Executed blocks         number=14126939 blk/s=27.1 tx/s=4597.7 Mgas/s=418.5 gasState=0.15 batch=59.7MB alloc=3.6GB sys=8.7>
[INFO] [02-09|18:02:03.939] [Experiment] heap profile threshold check alloc=5.0GB total=61.9GB
[INFO] [02-09|18:02:04.042] [4/12 Execution] Executed blocks         number=14127780 blk/s=28.0 tx/s=5128.7 Mgas/s=436.6 gasState=0.20 batch=75.6MB alloc=5.0GB sys=8.7>
[INFO] [02-09|18:02:33.939] [Experiment] heap profile threshold check alloc=6.0GB total=61.9GB
[INFO] [02-09|18:02:34.009] [4/12 Execution] Executed blocks         number=14128602 blk/s=27.4 tx/s=4943.5 Mgas/s=423.0 gasState=0.24 batch=90.7MB alloc=6.0GB sys=8.7>
[INFO] [02-09|18:03:03.939] [Experiment] heap profile threshold check alloc=4.8GB total=61.9GB
[INFO] [02-09|18:03:04.612] [4/12 Execution] Executed blocks         number=14129548 blk/s=30.9 tx/s=5719.5 Mgas/s=476.2 gasState=0.30 batch=107.5MB alloc=4.8GB sys=8.>
[INFO] [02-09|18:03:33.939] [Experiment] heap profile threshold check alloc=6.2GB total=61.9GB
[INFO] [02-09|18:03:33.996] [4/12 Execution] Executed blocks         number=14130491 blk/s=32.1 tx/s=5691.2 Mgas/s=496.5 gasState=0.35 batch=124.4MB alloc=6.2GB sys=8.>
[INFO] [02-09|18:03:41.110] [] Flushed buffer file                   name=erigon-sortable-buf-523445851
[INFO] [02-09|18:03:41.747] [] Flushed buffer file                   name=erigon-sortable-buf-1893321613
[INFO] [02-09|18:04:03.939] [p2p] GoodPeers                          eth67=79 eth66=19 eth68=33
[INFO] [02-09|18:04:03.941] [Experiment] heap profile threshold check alloc=5.1GB total=61.9GB
[INFO] [02-09|18:04:04.493] [mem] memory stats                       Rss=14.4GB Size=0B Pss=14.4GB SharedClean=6.1MB SharedDirty=0B PrivateClean=5.1GB PrivateDirty=9.3>
[INFO] [02-09|18:04:05.637] [txpool] stat                            pending=0 baseFee=0 queued=30000 alloc=5.1GB sys=8.7GB
[INFO] [02-09|18:04:10.513] [] ETL [2/2] Loading                     into=PlainState current_prefix=4be3223f
[INFO] [02-09|18:04:33.940] [Experiment] heap profile threshold check alloc=5.3GB total=61.9GB
[INFO] [02-09|18:04:40.865] [] ETL [2/2] Loading                     into=PlainState current_prefix=7be8076f
[INFO] [02-09|18:05:03.939] [Experiment] heap profile threshold check alloc=5.5GB total=61.9GB
[INFO] [02-09|18:05:10.512] [] ETL [2/2] Loading                     into=PlainState current_prefix=8b166890
[INFO] [02-09|18:05:33.939] [Experiment] heap profile threshold check alloc=5.7GB total=61.9GB
[INFO] [02-09|18:05:40.717] [] ETL [2/2] Loading                     into=PlainState current_prefix=bb75b334
[INFO] [02-09|18:06:03.939] [Experiment] heap profile threshold check alloc=4.0GB total=61.9GB

I started this instance with the experimental env vars as well, but no output file was produced, so heap usage in this run looked fine. A different issue, which I don't really know how to debug, happened instead.

Last Go runtime output in the console:

github.com/erigontech/erigon/p2p.(*Peer).run.gowrap1()                                                                                                                  
        github.com/erigontech/erigon/p2p/peer.go:271 +0x25 fp=0xc15f907fe0 sp=0xc15f907fc0 pc=0x1468c65                                                                 
runtime.goexit({})                                                                                                                                                      
        runtime/asm_amd64.s:1695 +0x1 fp=0xc15f907fe8 sp=0xc15f907fe0 pc=0x49fb81                                                                                       
created by github.com/erigontech/erigon/p2p.(*Peer).run in goroutine 453672                                                                                             
        github.com/erigontech/erigon/p2p/peer.go:271 +0xef                                                                                                              
                                                                                                                                                                        
goroutine 48828730 gp=0xc095788000 m=nil [select]:                                                                                                                      
runtime.gopark(0xc06a57be18?, 0x2?, 0x98?, 0xbc?, 0xc06a57bdf4?)                                                                                                        
        runtime/proc.go:402 +0xce fp=0xc06a57bc50 sp=0xc06a57bc30 pc=0x4677ae                                                                                           
runtime.selectgo(0xc06a57be18, 0xc06a57bdf0, 0x10?, 0x0, 0x28e74a0?, 0x1)                                                                                               
        runtime/select.go:327 +0x725 fp=0xc06a57bd70 sp=0xc06a57bc50 pc=0x479685                                                                                        
github.com/erigontech/erigon/p2p/discover.(*lookup).advance(0xc11b5a8a20)                                                                                               
        github.com/erigontech/erigon/p2p/discover/lookup.go:73 +0xaa fp=0xc06a57be60 sp=0xc06a57bd70 pc=0x143d96a                                                       
github.com/erigontech/erigon/p2p/discover.(*lookup).run(0xc11b5a8a20)                                                                                                   
        github.com/erigontech/erigon/p2p/discover/lookup.go:64 +0x25 fp=0xc06a57be98 sp=0xc06a57be60 pc=0x143d825                                                       
github.com/erigontech/erigon/p2p/discover.(*UDPv4).LookupPubkey(0xc0ba0a8620, 0xc1344a3300)                                                                             
        github.com/erigontech/erigon/p2p/discover/v4_udp.go:324 +0xf9 fp=0xc06a57bf18 sp=0xc06a57be98 pc=0x1446e59                                                      
github.com/erigontech/erigon/p2p/discover.(*UDPv4).loop.func4()                                                                                                         
        github.com/erigontech/erigon/p2p/discover/v4_udp.go:648 +0x99 fp=0xc06a57bfe0 sp=0xc06a57bf18 pc=0x1449a39                                                      
runtime.goexit({})                                                                                                                                                      
        runtime/asm_amd64.s:1695 +0x1 fp=0xc06a57bfe8 sp=0xc06a57bfe0 pc=0x49fb81                                                                                       
created by github.com/erigontech/erigon/p2p/discover.(*UDPv4).loop in goroutine 2614                                                                                    
        github.com/erigontech/erigon/p2p/discover/v4_udp.go:646 +0x745                                                                                                  
                                                                                                                                                                        
rax    0x749f2c000cd0                                                                                                                                                   
rbx    0x0                                                                                                                                                              
rcx    0xd                                                                                                                                                              
rdx    0x749f2c000cd0                                                                                                                                                   
rdi    0x12                                                                                                                                                             
rsi    0x749f2c036d50                                                                                                                                                   
rbp    0x749f2c00b010                                                                                                                                                   
rsp    0x749f673ff790                                                                                                                                                   
r8     0x6665283a43e8                                                                                                                                                   
r9     0x6665283a43e4                                                                                                                                                   
r10    0x1                                                                                                                                                              
r11    0x689ef0000000                                                                                                                                                   
r12    0x742bd20                                                                                                                                                        
r13    0x537ce                                                                                                                                                          
r14    0x698747a40000                                                                                                                                                   
r15    0x66686f2b96c0                                                                                                                                                   
rip    0x41235d                                                                                                                                                         
rflags 0x10206                                                                                                                                                          
cs     0x33                                                                                                                                                             
fs     0x0                                                                                                                                                              
gs     0x0 

For the second, bare-metal Ubuntu machine, a crash happened as well, and it generated the erigon-mem.prof file; png here:
[Image: heap profile from erigon-mem.prof rendered as png]
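
Besides the -png rendering, the profile can be inspected directly (assumes a Go toolchain on the machine):

go tool pprof -top erigon-mem.prof           # top allocation sites as text
go tool pprof -http=:8081 erigon-mem.prof    # interactive view in the browser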
