A fresh node will get lock on startup requesting his full database sync (BUG 3.2.27) #10383

cschockaert · 2025-01-15T08:54:58Z

OrientDB Version: 3.2.27 and tested with 3.2.36 too

Java Version: 17.0.3

OS: linux

Expected behavior

A replica node should start and complete a full database sync without timing out.

Actual behavior

Sometimes a replica node does not start after requesting its full database sync (around 500 MB zipped).
I took a thread dump and found that the server (master node) making the backup seems to be stuck adding a file to the zip:

On the replica node side I got this error:

After some research, it seems that 17.0.13/PipedOutputStream.java and 17.0.13/PipedInputStream.java must not be used by the same thread, otherwise it can lead to a deadlock.
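To illustrate the piped-stream concern, here is a minimal sketch in plain JDK code (nothing from OrientDB). The pipe has a bounded buffer, so a write larger than the free space blocks until some other thread reads. The class name `PipeStallDemo` and its method are made up for this illustration.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of OrientDB or the JDK.
final class PipeStallDemo {

  /**
   * Starts a writer that pushes more bytes than the pipe can buffer while
   * nobody reads. Returns true if the writer is still stuck after waitMillis.
   */
  static boolean writerStallsWithoutReader(int pipeSize, long waitMillis) {
    try {
      PipedInputStream in = new PipedInputStream(pipeSize);
      PipedOutputStream out = new PipedOutputStream(in);

      Thread writer = new Thread(() -> {
        try {
          out.write(new byte[pipeSize * 2]); // twice the pipe capacity
        } catch (IOException expected) {
          // Raised when the demo interrupts the writer or closes the pipe.
        }
      }, "pipe-writer");
      writer.setDaemon(true);
      writer.start();

      writer.join(waitMillis);            // give the write a chance to finish
      boolean stalled = writer.isAlive(); // still alive means still blocked

      writer.interrupt();                 // release the stuck writer
      in.close();
      return stalled;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

Calling `PipeStallDemo.writerStallsWithoutReader(1024, 300)` should report a stalled writer, because nothing drains the pipe. A single thread that both writes and reads is in the same position as soon as one write exceeds the buffer.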

I don't know why, but it seems that I don't have the problem with these ODB versions: 3.2.10 (OK) and 3.2.26 (OK).

Steps to reproduce

Start a master node with a big DB (> 1 GB uncompressed).
Start a replica node.
If the bug doesn't happen, delete all data and restart the replica node to trigger it.

NB: I tested with agent.jar (enterprise) and I get stuck in the full database sync too.

cschockaert · 2025-01-15T08:55:55Z

@tglman since you are the goat here, could you please read my thread? Perhaps you can spot the problem very fast :)

cschockaert · 2025-01-15T09:33:51Z

I retried with ODB 3.2.26, took a thread dump during the full sync DB backup, and got this stack server side:

As you can see, there is no PipedOutputStream in the stack.
In that version I don't get stuck.

cschockaert · 2025-01-15T10:44:52Z

Some more screenshots:

on ODB 3.2.27 (seems OK):

on ODB 3.2.28 (got the lock):

cschockaert · 2025-01-15T15:28:56Z

Hmm, sorry, after another test I got the 'stuck' bug with 3.2.27 too. So it actually seems that the latest version without the bug is 3.2.26 (still testing).

tglman · 2025-01-15T17:14:59Z

Hi,

Looking at the first error, it seems that one server is running an installation with the agent.jar and another one without it. For the other cases I'm not sure what is actually happening. I would definitely make sure that you use the same kind of installation on all the servers, with the exact same version and the same version of the agent.jar.

I definitely suggest using the agent.jar if you are not planning to use Lucene indexes, which are not yet supported by the sync based on the agent.jar.

As for the use of the piped streams, the current implementation should never use them in the same thread: the backup logic that extracts the database to be sent across the network runs on a dedicated thread started for that purpose.
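The general pattern of producing a zip on its own thread and consuming it from another can be sketched roughly as below. This is only an illustration in plain JDK code, not OrientDB's backup implementation. The names `PipedZipSketch` and `zipThroughPipe` and the 4096-byte pipe size are assumptions made for the example.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch, not OrientDB's actual backup code.
final class PipedZipSketch {

  /**
   * Zips payload on a dedicated producer thread while the calling thread
   * consumes the stream. Returns "<entry name>:<unzipped byte count>".
   */
  static String zipThroughPipe(String entryName, byte[] payload) {
    try (PipedInputStream in = new PipedInputStream(4096)) {
      PipedOutputStream out = new PipedOutputStream(in);

      Thread producer = new Thread(() -> {
        // Closing the zip stream also closes the pipe, signalling end of data.
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
          zip.putNextEntry(new ZipEntry(entryName));
          zip.write(payload);
          zip.closeEntry();
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }, "backup-producer");
      producer.start();

      // The consumer runs on the calling thread, so the producer never
      // has to wait on a reader that lives in its own thread.
      try (ZipInputStream unzip = new ZipInputStream(in)) {
        ZipEntry entry = unzip.getNextEntry();
        int size = unzip.readAllBytes().length;
        producer.join();
        return entry.getName() + ":" + size;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

With this split, a full pipe only pauses the producer until the consumer catches up. If the consumer itself stops reading for any reason, the producer still blocks inside `write`.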

cschockaert · 2025-01-16T08:25:35Z

Hello @tglman, thanks for the fast answer. All the screenshots I posted here were taken without agent.jar in the /plugins folder, on all nodes.
So I was not in enterprise mode.

I don't understand how agent.jar modifies the behaviour of the full DB sync at startup.

I tested again (without agent.jar) and got the random 'lock' symptom on ODB version 3.2.27 (and later), BUT not on ODB 3.2.26 and earlier versions, so I'm stuck at that version.

I will test again with the enterprise agent to see whether ODB 3.2.27 can be fixed that way, but again, which commit would have introduced the 'stuck' bug in 3.2.27? I don't really understand. Any idea?

cschockaert · 2025-01-16T08:27:24Z

Another question: is there a way to 'force' a replica node to get its full DB sync from the first and only master node of the cluster?
We are running a cluster with 1 master node and xxx replica nodes.

It seems that when a replica node starts, it gets its DB from the most recently started node (so another replica node).
We were running on ODB 3.0.X and I think that was not the case there.

cschockaert · 2025-01-16T09:21:11Z

OK, I tested with the agent and ODB 3.2.36 and I got the stuck bug (and I'm in enterprise mode):

Thread dump of the node making the backup (server side):

This thread is also blocked server side:

Thread dump of the client getting the backup (client side):

Client-side exception because the server isn't sending the zip:

cschockaert · 2025-01-16T09:22:04Z

So @tglman, I think there is a problem and it's not related to whether enterprise mode is used.
The bug seems to appear in version 3.2.27.

cschockaert · 2025-01-16T09:23:01Z

checking changes

cschockaert · 2025-01-16T14:49:41Z

FYI @tglman, I disabled the OClusterHealthChecker just to test (-Ddistributed.checkHealthEvery=0) and I don't get the 'lock' bug.
I saw that 3.2.27 made a lot of modifications in that class and the classes around it.
In my previous thread dump you can see a thread BLOCKED trying to modify the DB distributed configuration while the master node is backing up data to the replica node.

Can that be the source of the lock problem?
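For anyone who wants to look for such BLOCKED threads from inside the JVM instead of reading a full thread dump, a small sketch using the standard `java.lang.management` API could look like this. The class name `BlockedThreadReport` is made up. It only lists threads waiting on a monitor and does not know anything about OrientDB.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OrientDB.
final class BlockedThreadReport {

  /** Returns one line per thread that is currently BLOCKED on a monitor. */
  static List<String> blockedThreads() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    List<String> report = new ArrayList<>();
    for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
      if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
        report.add(info.getThreadName()
            + " waits for " + info.getLockName()
            + " held by " + info.getLockOwnerName());
      }
    }
    return report;
  }
}
```

The lock owner name in each line points at the thread to inspect next, which is the same information as the "waiting to lock" and "locked" entries in a jstack dump.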

cschockaert · 2025-01-16T15:33:53Z

Hmm OK, after a lot of new tests playing with the -Ddistributed.checkHealthEvery=xxx value: when I disable it or set a big value, I don't have the bug.

I think the lock bug is related to something like updating the server config in com.orientechnologies.orient.server.distributed.impl.OClusterHealthChecker#checkServerConfig

and that introduces some sort of thread lock in the full backup process.

tglman · 2025-01-16T18:55:04Z

Hi @cschockaert,

Thanks for the detailed check. I think I saw a similar thing in the past: the health check should be skipped if the database is installing. I will double check it.

cschockaert · 2025-01-16T18:57:57Z

Hi

Yes, that's it: the master node doing the backup runs a cluster check, sees a change in the distributed configuration, and then tries to save the new config. The change comes from the starting node that is requesting the full data sync.
I think in this case we should not receive the change until the full data sync has finished successfully.
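The idea of skipping the health check while a database is being synced can be sketched as a simple guard. This is a hypothetical illustration only, not the code of OClusterHealthChecker or of the eventual fix. `SyncAwareHealthCheck`, `syncStarted` and `syncFinished` are invented names, and the real checker does much more than this.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard, not OrientDB's OClusterHealthChecker.
final class SyncAwareHealthCheck {

  // Databases for which a full sync (backup transfer) is in progress.
  private final Set<String> databasesInSync = ConcurrentHashMap.newKeySet();

  void syncStarted(String database) {
    databasesInSync.add(database);
  }

  void syncFinished(String database) {
    databasesInSync.remove(database);
  }

  /**
   * Runs the periodic config check unless the database is being synced.
   * Returns true if the check ran, false if it was skipped.
   */
  boolean checkServerConfig(String database) {
    if (databasesInSync.contains(database)) {
      return false; // do not touch the distributed config during a full sync
    }
    // The real config comparison and update would happen here.
    return true;
  }
}
```

The design choice is that the periodic task stays scheduled and simply returns early, so checks resume on the next tick after the sync finishes.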


cschockaert mentioned this issue Jan 17, 2025

A fresh node announces ONLINE status for a database before completing full sync #10386

Open

cschockaert changed the title ~~A fresh node will get lock on startup requesting his full database sync~~ A fresh node will get lock on startup requesting his full database sync (BUG 3.2.27) Jan 17, 2025

tglman added a commit that referenced this issue Jan 17, 2025

fix: skip the health check if the database is on sync, issues #10383, #10386

b62cc2f

tglman added a commit that referenced this issue Jan 17, 2025

fix: skip the health check if the database is on sync, issues #10383, #10386

6ef186e
