Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC issue in LMDB sail compromising deletes #5254

Open
nguyenm100 opened this issue Feb 16, 2025 · 9 comments · May be fixed by #5257
Open

GC issue in LMDB sail compromising deletes #5254

nguyenm100 opened this issue Feb 16, 2025 · 9 comments · May be fixed by #5257
Labels
🐞 bug issue is a bug

Comments

@nguyenm100
Copy link
Contributor

Current Behavior

Unfortunately I don't have a predictable repro to share at the moment but we're seeing an issue whereby a predicate is "lost" after a series of largish (10k triples) deletes and inserts occur and hoping @kenwenzel can help us identify what's going on. Effectively, the pattern is:

open a new transaction
insert 10k triples
read all triples
rollback the transaction

open a new transaction
insert 10k triples
commit transaction

open a new transaction
remove 10k triples
rollback the transaction

open a new transaction
remove 10k triples
commit transaction

repeat these steps a number of times and at some point, an expected predicate is lost from the database.. that is, lmdbsail goes to fetch it and mdb_get returns back MDB_NOTFOUND

Things we've tried:

  1. turned off multithreading in lmdbsailstore (enableMultithreading)
  2. turn off lazyfetching (just realize the value immediately)

the consequence is that ValueStore.resolveValue() returns false at some point.

@kenwenzel any pointers where to look and/or debugging advice? can we disable certain features (e.g multithreading/lazyfetching/caching/refcounting) to possibly make the issue more predictable to recreate?

what's bizarre is that it's ALWAYS the predicate that goes missing. subject and object are present.

Expected Behavior

everything to work

Steps To Reproduce

open a new transaction
insert 10k triples
read all triples
rollback the transaction

open a new transaction
insert 10k triples
commit transaction

open a new transaction
remove 10k triples
rollback the transaction

open a new transaction
remove 10k triples
commit transaction

Version

5.0.3

Are you interested in contributing a solution yourself?

Perhaps?

Anything else?

No response

@nguyenm100 nguyenm100 added the 🐞 bug issue is a bug label Feb 16, 2025
@kenwenzel
Copy link
Contributor

@nguyenm100 Unfortunately, I don't have time this week to fix this.

A good starting point is here:

protected void filterUsedIdsInTripleStore() throws IOException {

Basically, the GC algorithm goes through removed IDs and looks if they are present in the store. If not then they are removed from the values database.

Maybe the logic inside the triple store has errors:

protected void filterUsedIds(Collection<Long> ids) throws IOException {

If you don't find the cause then you could try to comment out the GC logic.

Best regards,
Ken

@nguyenm100
Copy link
Contributor Author

Thanks @kenwenzel

if I disable the gc it works. What's the implications of disabling the gc during runtime? I see it gets called upon ValueStore creation, which may occur after a machine restart. if I leave that call but disable everything else, will it GC (correctly) just once on startup?

i think what may be occuring is that once the gc marks something for deletion, it's not clear how it gets reinstated.. so if I update a set of triples based on a subj iri, that requires a delete/insert. the delete will remove that subj iri and mark it for deletion, but the insert should unmark again. does that make sense?

@kenwenzel
Copy link
Contributor

If you diable GC then everything just works as with the NativeStore where values are never deleted.

Could you try to create a minimal example that triggers the bug?

@odysa
Copy link

odysa commented Feb 18, 2025

Hello @kenwenzel , thank you for looking into it. We have figured out a minimal reproduction. cc @nguyenm100

To enforce the value eviction, we need to do Thread.sleep(Duration.ofMinutes(1).toMillis()); or set the VALUE_EVICTION_INTERVAL to be 0.
https://github.com/eclipse-rdf4j/rdf4j/blob/main/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStore.java#L97

The following example adds a statement and then deletes it. Repeat it until an exception occurs.

package org.example;

import org.eclipse.rdf4j.common.transaction.IsolationLevels;
import org.eclipse.rdf4j.model.util.Values;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.NotifyingSailConnection;
import org.eclipse.rdf4j.sail.lmdb.LmdbStore;

import java.time.Duration;

class Reproduce {

	public static void main(final String[] args) throws InterruptedException {
		final LmdbStore store = new LmdbStore();
		final Repository repository = new SailRepository(store);
		int count = 0;
		while (true) {
			try (final NotifyingSailConnection conn = store.getConnection()) {
				conn.begin(IsolationLevels.SERIALIZABLE);
				conn.addStatement(
						Values.iri(RDF.NAMESPACE + "sub"), Values.iri(RDF.NAMESPACE + "pred"),
						Values.iri(RDF.NAMESPACE + "obj")
				);
				conn.commit();
			}

			try (final RepositoryConnection conn = repository.getConnection()) {
				conn.begin();
				conn.prepareUpdate("DELETE { ?s ?p ?o } WHERE { ?s ?p ?o }").execute();
				conn.commit();
			}

			System.out.println("Count: " + count++);

			// https://github.com/eclipse-rdf4j/rdf4j/blob/main/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStore.java#L97
			// if you set the VALUE_EVICTION_INTERVAL to be 0, remove this Thread.sleep
			Thread.sleep(Duration.ofMinutes(1).toMillis());
		}
	}
}

On my laptop it failed on the 8th.

Full exception

Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "String.hashCode()" because the return value of "org.eclipse.rdf4j.model.base.AbstractIRI.stringValue()" is null
	at org.eclipse.rdf4j.model.base.AbstractIRI.hashCode(AbstractIRI.java:57)
	at org.eclipse.rdf4j.sail.base.Changeset$SimpleStatementPattern.hashCode(Changeset.java:1066)
	at java.base/java.util.HashMap.hash(HashMap.java:338)
	at java.base/java.util.HashMap.put(HashMap.java:610)
	at java.base/java.util.HashSet.add(HashSet.java:221)
	at org.eclipse.rdf4j.sail.base.Changeset.observe(Changeset.java:323)
	at org.eclipse.rdf4j.sail.base.ObservingSailDataset.getStatements(ObservingSailDataset.java:69)
	at org.eclipse.rdf4j.sail.base.DelegatingSailDataset.getStatements(DelegatingSailDataset.java:72)
	at org.eclipse.rdf4j.sail.base.DelegatingSailDataset.getStatements(DelegatingSailDataset.java:72)
	at org.eclipse.rdf4j.sail.base.UnionSailDataset.getStatements(UnionSailDataset.java:130)
	at org.eclipse.rdf4j.sail.base.SailSourceConnection.remove(SailSourceConnection.java:828)
	at org.eclipse.rdf4j.sail.base.SailSourceConnection.removeStatement(SailSourceConnection.java:635)
	at org.eclipse.rdf4j.repository.sail.helpers.SailUpdateExecutor.deleteBoundTriples(SailUpdateExecutor.java:599)
	at org.eclipse.rdf4j.repository.sail.helpers.SailUpdateExecutor.executeModify(SailUpdateExecutor.java:450)
	at org.eclipse.rdf4j.repository.sail.helpers.SailUpdateExecutor.executeUpdate(SailUpdateExecutor.java:125)
	at org.eclipse.rdf4j.repository.sail.SailUpdate.execute(SailUpdate.java:67)
	at org.example.Reproduce.main(Reproduce.java:32)

@nguyenm100
Copy link
Contributor Author

almost positive it has something to do with: https://github.com/eclipse-rdf4j/rdf4j/blob/main/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStore.java#L1180

if I remove the cleaner callback and just immediately add it to the unusedRevisionIds list, it fails immediately. the pattern we're seeing is very indicative of some race condition occuring during the gc.

@nguyenm100 nguyenm100 changed the title Possible race condition in LMDB GC issue in LMDB sail compromising deletes Feb 20, 2025
@kenwenzel
Copy link
Contributor

OK, I found the problem:
A while ago a special cache for common vocabulary (within W3C namespace like RDF) was introduced that is not correctly cleared if values are GCed. I will create a PR with a fix for this issue.

Afterwards, you can try the following unit test:

package org.eclipse.rdf4j.sail.lmdb;

import java.io.File;

import org.eclipse.rdf4j.common.transaction.IsolationLevels;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.model.util.Values;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.lmdb.config.LmdbStoreConfig;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public class LmdbSailStoreGCTest {

    protected final ValueFactory F = SimpleValueFactory.getInstance();
    protected LmdbStore store;
    protected Repository repo;

    @BeforeEach
    public void before(@TempDir File dataDir) {
        store = new LmdbStore(dataDir, new LmdbStoreConfig("spoc,posc"));
        repo = new SailRepository(store);
        repo.init();
    }

    @Test
    public void testAddAndDelete() {
        int count = 0;
        while (count++ < 100) {
            try (final RepositoryConnection conn = repo.getConnection()) {
                conn.begin(IsolationLevels.SERIALIZABLE);
                conn.add(Values.iri(RDF.NAMESPACE + "sub"), Values.iri(RDF.NAMESPACE + "pred"), Values.iri(RDF.NAMESPACE + "obj"));
                conn.add(Values.iri("x:sub"), Values.iri("x:pred"), Values.iri("x:obj"));
                conn.add(Values.iri("x:sub"), RDF.TYPE, RDF.LIST);
                conn.commit();
            }

            try (final RepositoryConnection conn = repo.getConnection()) {
                conn.begin();
                conn.prepareUpdate("DELETE { ?s ?p ?o } WHERE { ?s ?p ?o }").execute();
                conn.commit();
            }

            ValueStore valueStore = (ValueStore) store.getBackingStore().getValueFactory();
            valueStore.forceEvictionOfValues();
            System.gc();
        }
    }

    @AfterEach
    public void after() {
        repo.shutDown();
    }
}

@nguyenm100
Copy link
Contributor Author

Thanks @kenwenzel!

btw- what are your thoughts for allowing for the configuration of lmdb to specify whether gc is used as well as gc VALUE_EVICTION_INTERVAL ? If you agree, we can help contribute to that if you're busy.

kenwenzel added a commit to kenwenzel/rdf4j that referenced this issue Feb 24, 2025
@kenwenzel kenwenzel linked a pull request Feb 24, 2025 that will close this issue
5 tasks
kenwenzel added a commit to kenwenzel/rdf4j that referenced this issue Feb 24, 2025
@kenwenzel
Copy link
Contributor

@nguyenm100 Please go ahead and add configuration options for gc if you want.

@odysa
Copy link

odysa commented Feb 24, 2025

Hello @kenwenzel, thank you so much for investigation and fixing the issue.

Regarding LMDB GC configuration options, we are thinking about 2 solutions:

  1. Expose both enableGC and valueEvictionInterval. If the interval is not valid (0 or negative), throw an exception.
  2. Expose only valueEvictionInterval. If the interval is set to be invalid (0 or negative), GC is disabled.

Would like to hear your opinions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 bug issue is a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants