A fresh node gets locked on startup when requesting its full database sync (BUG 3.2.27) #10383
Comments
@tglman since you are the expert here, could you please read my thread? Perhaps you can spot the problem very fast :)
Hmm, sorry, after another test I got the 'stuck' bug with 3.2.27 too, so it actually seems that the latest version without the bug is 3.2.26 (still testing).
Hi, looking at the first error it seems that one server is running an installation with the agent.jar and another one without it; in the other cases I'm not sure what is actually happening. I would definitely say: make sure you use the same kind of installation on all the servers, with the exact same version, and the same version of the agent.jar. I do definitely suggest using the agent.jar unless you are planning to use Lucene indexes, which are not yet supported by the sync based on the agent.jar. In terms of the piped streams, the current implementation should never use them from the same thread: the backup logic that extracts the database to be sent across the network runs on a dedicated thread started for that purpose.
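To illustrate the point about piped streams, here is a minimal sketch (not OrientDB's actual backup code; the class and variable names are mine) of the safe pattern: the producing end of the pipe runs on its own thread, while the consumer reads on another.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

public class PipedBackupSketch {
    public static void main(String[] args) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);

        // The producer runs on a dedicated thread, mirroring how a backup
        // writer should be separated from the network sender. If producer
        // and consumer shared one thread, a full (or empty) pipe buffer
        // would block that single thread with nobody left to unblock it.
        Thread writer = new Thread(() -> {
            try (out) {
                out.write("backup-data".getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        writer.start();

        // The consumer reads on the current thread until the pipe closes.
        String received = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        writer.join();
        in.close();
        System.out.println(received); // prints "backup-data"
    }
}
```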
Hello @tglman, thanks for the fast answer. All the screens I posted here were taken without the agent.jar in the /plugins folder on all nodes. I don't understand how the agent.jar would modify the behaviour of the full db sync at startup. I tested again (without agent.jar) and I got the random 'lock' symptom on ODB version 3.2.27 (and later)... BUT not on ODB 3.2.26 and earlier versions, so I'm stuck at that version. I will test again with the enterprise agent to check whether ODB 3.2.27 can be fixed that way, but again: which commit would have caused the 'stuck' bug in the 3.2.27 release? I don't really understand. Any idea?
Another question: is there a way to 'force' a replica node to get its full db sync from the first and only master node of the cluster? It seems that a replica node, when it starts, will get its db from the most recently started node (so from another replica node).
So @tglman, I think there is a problem, and it is not related to enterprise mode.
FYI @tglman, I disabled the OClusterHealthChecker just to test (-Ddistributed.checkHealthEvery=0) and I don't get the 'lock' bug. Could that be the source of the lock problem?
Hmm, OK: after a lot of new tests playing with the -Ddistributed.checkHealthEvery=xxx value, when I disable it or set a big value I don't hit the bug. I think the lock bug is related to something like updating the server config in com.orientechnologies.orient.server.distributed.impl.OClusterHealthChecker#checkServerConfig, which introduces some sort of thread lock in the full backup process.
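A sketch of applying this workaround when launching the server (assuming the standard OrientDB tarball layout; treat ORIENTDB_SETTINGS as an assumption and verify against your distribution's bin/server.sh, since the exact variable forwarded as JVM options can differ):

```shell
# Disable the periodic cluster health check so it cannot race with a
# full database sync. Setting the interval to 0 turns the check off.
export ORIENTDB_SETTINGS="-Ddistributed.checkHealthEvery=0"
./bin/server.sh
```

Note that this disables health checking entirely, so it is only a stopgap while the underlying race is unfixed.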
Hi @cschockaert, thanks for the detailed check. I think I saw a similar thing in the past; the health check should be skipped while the database is installing. I will double check it.
Hi, yes, it's that: the master node doing the backup runs a cluster check, sees a change in the distributed configuration, and then tries to save the new config. The change comes from the starting node that is requesting the full data sync.
OrientDB Version: 3.2.27 and tested with 3.2.36 too
Java Version: 17.0.3
OS: linux
Expected behavior
The replica node should start and get a full database sync without a timeout.
Actual behavior
Sometimes a replica node does not finish starting after requesting its full database sync (around 500 MB zipped).
I made a thread dump and found that the server (master node) producing the backup seems locked while adding a file to the zip:
On the replica node side I got this error:
After some research in the JDK 17.0.13 sources (PipedOutputStream.java and PipedInputStream.java), it seems the two ends of a pipe cannot be used by the same thread, or it can lead to blocking behaviour.
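A small demonstration of why single-threaded use of a pipe is a trap (the class name is mine, not from OrientDB): writes succeed only until the pipe's internal buffer, which defaults to 1024 bytes, fills up; with no second thread draining it, the next write never returns.

```java
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipedSameThreadDemo {
    public static void main(String[] args) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        // The pipe's default internal buffer is 1024 bytes.
        PipedInputStream in = new PipedInputStream(out);

        // Writing less than the buffer size on a single thread succeeds,
        // because the bytes simply accumulate in the buffer.
        out.write(new byte[512]);
        System.out.println(in.available()); // 512

        // A write exceeding the remaining buffer space would park this
        // same thread forever, since no other thread is reading -- the
        // suspected shape of the hang seen in the backup thread dump.
        // out.write(new byte[1024]); // would block: do NOT run on one thread

        in.close();
        out.close();
    }
}
```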
I don't know why, but it seems I don't have the problem with these ODB versions: 3.2.10 OK and 3.2.26 OK.
Steps to reproduce
Start a master node with a big DB (> 1 GB uncompressed).
Start a replica node.
If the bug doesn't happen, delete all the replica's data and restart the replica node until the bug appears.
NB: I also tested with the agent.jar (enterprise) and the full database sync got locked too.