A fresh node will get lock on startup requesting his full database sync (BUG 3.2.27) #10383

Open
cschockaert opened this issue Jan 15, 2025 · 14 comments

Comments

@cschockaert

cschockaert commented Jan 15, 2025

OrientDB Version: 3.2.27 and tested with 3.2.36 too

Java Version: 17.0.3

OS: linux

Expected behavior

A replica node should start and complete a full database sync without timing out.

Actual behavior

Sometimes a replica node does not start after requesting its full database sync (around 500 MB zipped).
I took a thread dump and found that the server (master node) making the backup seems to be blocked while adding a file to the zip:

image

On the replica node side I got this error:

image

After some research, it seems that 17.0.13/PipedOutputStream.java and 17.0.13/PipedInputStream.java should not be used by the same thread, otherwise it can lead to a deadlock.

I don't know why, but I don't seem to have the problem with these ODB versions: 3.2.10 (OK) and 3.2.26 (OK).

Steps to reproduce

Start a master node with a big DB (> 1 GB uncompressed).
Start a replica node.
If the bug doesn't happen, delete all data and restart the replica node to trigger it.

NB: I tested with agent.jar (enterprise) and I get locked in the full database sync too.

@cschockaert
Author

cschockaert commented Jan 15, 2025

@tglman since you are the goat here, could you please read my thread? Perhaps you can spot the problem very fast :)

@cschockaert
Author

cschockaert commented Jan 15, 2025

I retried with ODB 3.2.26, took a thread dump during the full sync DB backup, and got this stack server side:

image

As you can see, there is no PipedOutputStream in the stack.
In that version I don't get stuck.

@cschockaert
Author

cschockaert commented Jan 15, 2025

Some more screenshots:

on ODB 3.2.27 (seems OK):

Pasted Graphic 3

on ODB 3.2.28 (got the lock):

image

@cschockaert
Author

Hmm, sorry, after another test I got the 'stuck' bug with 3.2.27 too. So it actually seems that the latest version without the bug is 3.2.26 (still testing).

@tglman
Member

tglman commented Jan 15, 2025

Hi,

Looking at the first error, it seems that on one server you are using an installation with the agent.jar and on another one without it. For the other cases I'm not sure what is actually happening. I would definitely say make sure that you use the same kind of installation on all the servers, with the exact same version and the same version of the agent.jar.

I definitely suggest using the agent.jar if you are not planning to use Lucene indexes, which are not yet supported by the sync based on the agent.jar.

In terms of the use of the piped streams, the current implementation should never use them in the same thread: the backup logic for extracting the database to be sent across the network is run on a specific thread started for that purpose.
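
For readers unfamiliar with the piped-stream pattern discussed here, the following is a minimal sketch, not OrientDB's actual code. It assumes a dedicated writer thread that zips data into a PipedOutputStream while the calling thread drains the connected PipedInputStream. The class name PipedBackupSketch, the method names, the entry name and the 8 KB pipe size are all hypothetical choices for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.concurrent.atomic.AtomicReference;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; none of these names come from OrientDB.
final class PipedBackupSketch {

  /** Zips the payload on a dedicated thread and reads the zip from the pipe on the caller's thread. */
  static byte[] zipThroughPipe(byte[] payload) throws Exception {
    // The pipe buffer is bounded: the writer blocks whenever it is full,
    // so a separate thread must keep reading from the other end.
    PipedInputStream in = new PipedInputStream(8192);
    PipedOutputStream out = new PipedOutputStream(in);
    AtomicReference<IOException> writerError = new AtomicReference<>();

    Thread writer = new Thread(() -> {
      // Closing the zip stream also closes the pipe, which signals end-of-stream to the reader.
      try (ZipOutputStream zip = new ZipOutputStream(out)) {
        zip.putNextEntry(new ZipEntry("database.bin"));
        zip.write(payload);
        zip.closeEntry();
      } catch (IOException e) {
        writerError.set(e);
      }
    }, "backup-writer");
    writer.start();

    // Reader side (stand-in for the code that sends the backup over the network).
    ByteArrayOutputStream received = new ByteArrayOutputStream();
    try (in) {
      in.transferTo(received);
    }
    writer.join();
    if (writerError.get() != null) {
      throw writerError.get();
    }
    return received.toByteArray();
  }

  /** Reads back the single entry of a zip produced by zipThroughPipe. */
  static byte[] unzipSingleEntry(byte[] zipped) throws IOException {
    try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipped))) {
      zip.getNextEntry();
      return zip.readAllBytes();
    }
  }
}
```

If the reading side stops consuming (for example because it is waiting on a lock), the writer blocks as soon as the pipe buffer fills up. That would look like a thread stuck while adding a file to the zip, although this sketch does not establish that this is what happens in this issue.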

@cschockaert
Author

Hello @tglman, thanks for the fast answer. All the screenshots I posted here were taken without agent.jar in the /plugins folder, on all nodes.
So I was not in enterprise mode.

I don't understand how agent.jar modifies the behaviour of the full DB sync at startup?

I tested again (without agent.jar) and I got the random 'lock' symptom on ODB version 3.2.27 (and later) ... BUT not on ODB 3.2.26 and earlier versions ... so I'm stuck on that version.

I will test again with the enterprise agent to check whether that fixes ODB 3.2.27, but again, which commit would have caused the 'stuck' bug in version 3.2.27? I don't really understand. Any idea?

@cschockaert
Author

Another question: is there a way to 'force' a replica node to get its full DB sync from the first and only master node of the cluster?
We are running a cluster with 1 master node and xxx replica nodes.

It seems that when a replica node starts, it gets its DB from the most recently started node (so another replica node).
We were running on ODB 3.0.X and I think that was not the case there.

@cschockaert
Author

OK, I tested with the agent and ODB 3.2.36 and I got the stuck bug (and I'm in enterprise mode):

Server-side thread dump of the node making the backup:

image

This thread is also blocked server side:

image

Client-side thread dump of the node receiving the backup:

image

Client-side exception because the server isn't sending the zip:

image

@cschockaert
Author

So @tglman, I think there is a problem and it's not related to enterprise mode or not.
The bug seems to appear in version 3.2.27.

@cschockaert
Author

Checking the changes:

image

@cschockaert
Author

cschockaert commented Jan 16, 2025

FYI @tglman, I disabled the OClusterHealthChecker just to test (-Ddistributed.checkHealthEvery=0) and I don't get the 'lock' bug.
I saw that 3.2.27 added a lot of modifications to that class and the classes around it.
In my previous thread dump you can see a thread BLOCKED trying to modify the DB distributed configuration while the master node is backing up data to the replica node.

Could that be the source of the lock problem?

@cschockaert
Author

Hmm OK, after a lot of new tests playing with the -Ddistributed.checkHealthEvery=xxx value: when I disable it or set a big value, I don't have the bug.

I think the lock bug is related to something like updating the server config in com.orientechnologies.orient.server.distributed.impl.OClusterHealthChecker#checkServerConfig

and that introduces some sort of thread lock in the full backup process.

@tglman
Member

tglman commented Jan 16, 2025

Hi @cschockaert,

Thanks for the detailed check. I think I saw a similar thing in the past: the health check should skip if the database is installing. I will double check it.
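
The "skip while installing" idea can be sketched as below. This is a hypothetical illustration under stated assumptions, not the OClusterHealthChecker implementation: the class GuardedHealthCheckSketch and all of its methods are invented names, and the real configuration check is replaced by a counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; this is not OrientDB's health checker.
final class GuardedHealthCheckSketch {
  // Number of full database syncs currently being served to other nodes.
  private final AtomicInteger activeSyncs = new AtomicInteger();
  // Counts how many times the (stand-in) configuration check actually ran.
  private final AtomicInteger configChecks = new AtomicInteger();

  /** Called when this node starts sending a full database sync. */
  void beginFullSync() {
    activeSyncs.incrementAndGet();
  }

  /** Called when the full database sync has finished or failed. */
  void endFullSync() {
    activeSyncs.decrementAndGet();
  }

  /** Runs one health-check cycle; returns false if it was skipped because a sync is active. */
  boolean runHealthCheck() {
    if (activeSyncs.get() > 0) {
      return false; // skip: do not touch the distributed configuration during a sync
    }
    configChecks.incrementAndGet(); // stand-in for the real configuration check
    return true;
  }

  int configChecksRun() {
    return configChecks.get();
  }
}
```

Note that this guard is a plain check-then-act: a sync could still begin just after the check passes, so a real fix would likely need stronger coordination between the two code paths.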

@cschockaert
Author

Hi

Yes, that's it: the master node doing the backup runs a cluster check, sees a change in the distributed configuration, then tries to save the new config. The change comes from the starting node that is requesting the full data sync.
I think in this case we should not receive the change until the full data sync has finished successfully.

@cschockaert cschockaert changed the title A fresh node will get lock on startup requesting his full database sync A fresh node will get lock on startup requesting his full database sync (BUG 3.2.27) Jan 17, 2025