Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cuvs-java: Rework the api to be Java 21 friendly #628

Merged
merged 21 commits into from
Feb 6, 2025

Conversation

ChrisHegarty
Copy link
Contributor

@ChrisHegarty ChrisHegarty commented Jan 30, 2025

This change reworks the api to allow it to be used with Java 21. The implementation is moved to an internal package, compiled with JDK 22, and packaged as an mrjar. The benefit of this structure is that the api can be used in environments that compile to a minimum of Java 21, but run on more recent JDKs like 22 and 23 - which is exactly what Elasticsearch and Lucene do. In fact, a minimum compilation target of Java 21 is common, since 21, at the time of writing, is the most recent LTS Java release.

The most significant change is that the non-trivial api types are now, for the most part, interfaces. Instance can be created by one of the factory methods, which lookup an spi to find the implementation. If on a release greater than Java 21, then a functioning implementation is returned. Otherwise, a no-op implementation is returned. This is a reasonably standard way for a Java api to behave, and allows the developer to handle the case where the platform does not have a functioning implementation.

This change also refactors the native downcall method handles so that they are static final constants - which optimise better by the JVM. It's also the generally accepted pattern, where the handles are tied to the lifetime of class which effectively mediates access - by virtue of reachability.

Another thing that I added is the ability to programmatically set the temporary directory used for intermediate operations - this is important to how both Lucene and Elasticsearch work - since they commonly only have permission to write to certain parts of the disk.

Additionally,

  1. the error codes from native calls are plumbed in and checked. As well as cuvsGetLastErrorText.
  2. a state is added to any classes that hold a reference to native resources that could be released.
  3. a local arena is used for memory allocation only needed per downcall invocation, e.g. the return value.
  4. I moved the tests to be integration tests, since they need to run on the jar (rather than the exploded classes). They can be run by any of; mvn verify, or mvn integration-test, or mvn -Dit.test="*Hnsw*" verify
  5. I refactored the entry-points to the api to be static methods and added an spi layer. You can see the minimal impact on the tests.
  6. Move the native library out of the top-level directory in the jar and into an os/arch position in the META-INF.
  7. add service provider support for custom implementations.

Copy link

copy-pr-bot bot commented Jan 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jan 30, 2025
@cjnolet
Copy link
Member

cjnolet commented Jan 30, 2025

/ok to test

@ChrisHegarty ChrisHegarty changed the title cuvs-java: Refactor method handle usage and separate out local areas from long held cuvs-java: Rework the api to only use Java 21 types Feb 1, 2025
@cjnolet
Copy link
Member

cjnolet commented Feb 1, 2025

/ok to test

@ChrisHegarty ChrisHegarty requested a review from a team as a code owner February 2, 2025 14:48
@cjnolet
Copy link
Member

cjnolet commented Feb 2, 2025

/ok to test

@ChrisHegarty ChrisHegarty changed the title cuvs-java: Rework the api to only use Java 21 types cuvs-java: Rework the api to be Java 21 friendly Feb 3, 2025
@cjnolet
Copy link
Member

cjnolet commented Feb 4, 2025

/ok to test

@cjnolet
Copy link
Member

cjnolet commented Feb 4, 2025

/ok to test

@narangvivek10
Copy link

narangvivek10 commented Feb 4, 2025

@ChrisHegarty @chatman and I tried running this branch locally and came across the following. Please help. Thanks.

  • The maven verify is failing (the same tests are passing on branch-25.02) the test case with the following logs (on my RTX 2080Ti) https://gist.github.com/narangvivek10/6fb8f72f8fffdbe67fba0f226fa8a710
  • The examples project needs to be updated so that, that becomes a sample project for a casual user to start using cuvs-java. Ishan and I tried to do it but in our efforts we saw errors like:
java.lang.AssertionError: java.lang.ClassNotFoundException: com.nvidia.cuvs.spi.JDKProvider
	at com.nvidia.cuvs.spi.CuVSServiceProvider$Holder.builtinProvider(CuVSServiceProvider.java:53)
	at com.nvidia.cuvs.spi.CuVSServiceProvider$Holder.loadProvider(CuVSServiceProvider.java:39)
	at com.nvidia.cuvs.spi.CuVSServiceProvider$Holder.<clinit>(CuVSServiceProvider.java:36)
	at com.nvidia.cuvs.spi.CuVSProvider.provider(CuVSProvider.java:67)
	at com.nvidia.cuvs.CuVSResources.create(CuVSResources.java:62)
	at com.nvidia.cuvs.CuVSResources.create(CuVSResources.java:51)
	at com.nvidia.cuvs.CagraBuildAndSearchTest.testIndexingAndSearchingFlow(CagraBuildAndSearchTest.java:81)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:93)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:40)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:530)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:758)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:453)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:211)
Caused by: java.lang.ClassNotFoundException: com.nvidia.cuvs.spi.JDKProvider
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:413)
	at java.base/java.lang.Class.forName(Class.java:404)
	at com.nvidia.cuvs.spi.CuVSServiceProvider$Holder.builtinProvider(CuVSServiceProvider.java:49)
	... 31 more

@ChrisHegarty
Copy link
Contributor Author

@ChrisHegarty @chatman and I tried running this branch locally and came across the following. Please help. Thanks.

From the gist:

[ERROR] Process Exit Code: 134

The JVM is crashing. Please take a look at the hs_err_pid<pid>.log file. The most useful thing is the native stack trace. Please also note that I just fixed today, and committed, a change for getLastErrorText, which might explain the crash - the stack trace you see in the he_err_xxx file will confirm is this is the case.

  • The examples project needs to be updated so that, that becomes a sample project for a casual user to start using cuvs-java. Ishan and I tried to do it but in our efforts we saw errors like:
    java.lang.AssertionError: java.lang.ClassNotFoundException: com.nvidia.cuvs.spi.JDKProvider

That issue arises with running with the classes on the class path rather than the jar file. Lemme see if I can fix it.

@ChrisHegarty
Copy link
Contributor Author

  • The examples project needs to be updated so that, that becomes a sample project for a casual user to start using cuvs-java. Ishan and I tried to do it but in our efforts we saw errors like:
    java.lang.AssertionError: java.lang.ClassNotFoundException: com.nvidia.cuvs.spi.JDKProvider

That issue arises with running with the classes on the class path rather than the jar file. Lemme see if I can fix it.

The examples now compiles and runs for me. There was an issue with the multi-release attribute missing from the jar. Please pull the latest branch and confirm that all is good. Both the crash and examples should be fixed.

@cjnolet
Copy link
Member

cjnolet commented Feb 5, 2025

/ok to test

@chatman
Copy link
Contributor

chatman commented Feb 5, 2025

I tried the example project:

  • I changed loggers to System.out.println()
  • Then "mvn clean compile test package"
  • Saw this class not found isssue
ishan@4090-workstation ~/code/cuvs/java/examples ((HEAD detached at hegarty/java_mh_refactor)) $ java -cp ./target/cuvs-java-examples-25.02.0.jar:/home/ishan/.m2/repository/com/nvidia/cuvs/cuvs-java/25.02.0/cuvs-java-25.02.0.jar com.nvidia.cuvs.examples.CagraExample 
Testing..
Exception in thread "main" java.lang.AssertionError: java.lang.ClassNotFoundException: com.nvidia.cuvs.spi.JDKProvider
	at com.nvidia.cuvs.spi.CuVSServiceProvider$Holder.builtinProvider(CuVSServiceProvider.java:53)
	at com.nvidia.cuvs.spi.CuVSServiceProvider$Holder.loadProvider(CuVSServiceProvider.java:39)
	at com.nvidia.cuvs.spi.CuVSServiceProvider$Holder.<clinit>(CuVSServiceProvider.java:36)
	at com.nvidia.cuvs.spi.CuVSProvider.provider(CuVSProvider.java:67)
	at com.nvidia.cuvs.CuVSResources.create(CuVSResources.java:62)
	at com.nvidia.cuvs.CuVSResources.create(CuVSResources.java:51)
	at com.nvidia.cuvs.examples.CagraExample.main(CagraExample.java:42)
Caused by: java.lang.ClassNotFoundException: com.nvidia.cuvs.spi.JDKProvider
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:413)
	at java.base/java.lang.Class.forName(Class.java:404)
	at com.nvidia.cuvs.spi.CuVSServiceProvider$Holder.builtinProvider(CuVSServiceProvider.java:49)
	... 6 more

Any ideas, please?

Edit: before doing this, I did ./build.sh in java dir first. It caused a crash on test, so I changed "mvn verify" to "mvn package" so that the right artifacts at least get installed to local maven. Then, I tried the example project. Any ideas what I could be doing wrong?

@ChrisHegarty
Copy link
Contributor Author

@chatman Please confirm that you are on the latest commit in the branch, since that error should be fixed.

$ git show HEAD
commit 52d35628032382da6291a53a761bd77b7842f9e6 (HEAD -> java_mh_refactor, chegar/java_mh_refactor)
Merge: 161cab6 8c683b0
Author: Chris Hegarty <[email protected]>
Date:   Tue Feb 4 21:09:11 2025 +0000

    Merge branch 'branch-25.02' into java_mh_refactor

ubuntu@ip-172-31-45-5:~/git/cuvs_latest/java/examples$ 
ubuntu@ip-172-31-45-5:~/git/cuvs_latest/java/examples$ 
ubuntu@ip-172-31-45-5:~/git/cuvs_latest/java/examples$ mvn clean compile assembly:single
[INFO] Scanning for projects...
[INFO] 
[INFO] ------------< com.nvidia.cuvs.examples:cuvs-java-examples >-------------
[INFO] Building cuvs-java-examples 25.02.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:3.4.0:clean (default-clean) @ cuvs-java-examples ---
[INFO] 
[INFO] --- maven-resources-plugin:3.3.1:resources (default-resources) @ cuvs-java-examples ---
[INFO] Copying 1 resource from src/main/resources to target/classes
[INFO] 
[INFO] --- maven-compiler-plugin:3.13.0:compile (default-compile) @ cuvs-java-examples ---
[INFO] Recompiling the module because of changed source code.
[INFO] Compiling 3 source files with javac [debug release 22] to target/classes
[INFO] 
[INFO] --- maven-assembly-plugin:3.4.2:single (default-cli) @ cuvs-java-examples ---
[INFO] Building jar: /home/ubuntu/git/cuvs_latest/java/examples/target/cuvs-java-examples-25.02.0-jar-with-dependencies.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.575 s
[INFO] Finished at: 2025-02-05T10:26:04Z
[INFO] ------------------------------------------------------------------------
ubuntu@ip-172-31-45-5:~/git/cuvs_latest/java/examples$ ~/binaries/jdk-22.0.2/bin/java --enable-native-access=ALL-UNNAMED -jar target/cuvs-java-examples-25.02.0-jar-with-dependencies.jar 
[2025-02-05 10:26:14.890] [RAFT] [info] optimizing graph
[2025-02-05 10:26:14.894] [RAFT] [info] Graph optimized, creating index
[2025-02-05 10:26:14.924] [RAFT] [info] Saving CAGRA index with dataset
[main] INFO com.nvidia.cuvs.examples.CagraExample - [{3=0.038782578, 2=0.3590463, 0=0.83774555}, {0=0.12472608, 2=0.21700792, 1=0.31918612}, {3=0.047766715, 2=0.20332818, 0=0.48305473}, {1=0.15224178, 0=0.59063464, 3=0.5986642}]
[main] INFO com.nvidia.cuvs.examples.CagraExample - [{3=0.038782578, 2=0.3590463, 0=0.83774555}, {0=0.12472608, 2=0.21700792, 1=0.31918612}, {3=0.047766715, 2=0.20332818, 0=0.48305473}, {1=0.15224178, 0=0.59063464, 3=0.5986642}]

@ChrisHegarty
Copy link
Contributor Author

@chatman Ok, I see that ur running the example in a different way. Lemme try what ur doing.

@chatman
Copy link
Contributor

chatman commented Feb 5, 2025 via email

@chatman
Copy link
Contributor

chatman commented Feb 5, 2025

Yes, I tried that commit.
Uploading Screenshot_20250205_160108_com_server_auditor_ssh_client_TerminalActivity.jpg…

@chatman
Copy link
Contributor

chatman commented Feb 5, 2025

Please let me know how you were able to run it.

@ChrisHegarty
Copy link
Contributor Author

@chatman Please clean and rebuild ur local maven repository, to ensure that there are no stale artifacts.

@chatman
Copy link
Contributor

chatman commented Feb 5, 2025

I can confirm the same problem after cleaning maven before building and trying to run again.

@ChrisHegarty
Copy link
Contributor Author

The "with-dependencies.jar" was missing the multi-release attribute. Argh! maven plugin artifact configuration is convoluted. Fixed by 62583b3. Please verify. @chatman

@chatman
Copy link
Contributor

chatman commented Feb 5, 2025

The examples work perfectly now, thanks @ChrisHegarty.
@narangvivek10 and I are investigating the crash (during test).

@ChrisHegarty
Copy link
Contributor Author

ChrisHegarty commented Feb 5, 2025

@narangvivek10 and I are investigating the crash (during test).

I also see a crash in my local testing. The crash/issue appears to the same as the crash that I see in the main branch-25.02 (without this PR). At least, the stack traces seem to indicate so.

@chatman
Copy link
Contributor

chatman commented Feb 5, 2025

Quick summary of the test crash situation in this branch:
./build.sh java runs a "mvn verify" command
That passes on:

  • @ChrisHegarty's AWS VM (based on T4 GPUs?)
  • Passes on @chatman's RTX 4060 Ti (16GB) workstation
  • Crashes on SearchScale's RTX 4090 workstation
  • Crashes on @narangvivek10's RTX 2080 Ti workstation

@ChrisHegarty
Copy link
Contributor Author

Quick summary of the test crash situation in this branch:

Hypothesis on this.

  1. The crash reproduces intermittently - racy and dependent on generated random dataset values
  2. I do not see a crash in this PR branch. Not any more. There was a prior issue, but that has been fixed in this PR branch
    • I do see an error - a java exception [*]
  3. I believe the crash occurs in serializeToHNSW, which happens after an index create. But prior to this PR the error code of create is not checked. So if creation fails, the code will proceed without notice and (here's the speculation!) crash.

I believe that this PR branch does not see the crash, since it checks the error code from native calls.

@cjnolet
Copy link
Member

cjnolet commented Feb 5, 2025

/ok to test

@cjnolet
Copy link
Member

cjnolet commented Feb 6, 2025

/merge

@rapids-bot rapids-bot bot merged commit 228d949 into rapidsai:branch-25.02 Feb 6, 2025
61 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

4 participants