libmarv initialization error during GPU-accelerated search against ColabFold databases using Docker #941
Does the precompiled binary work? wget https://mmseqs.com/latest/mmseqs-linux-gpu.tar.gz Maybe there is something wrong with the Docker image.
@milot-mirdita Yeah, tried that too and got the same result. For my own sanity, have you been able to successfully run GPU search against the colabfold profile databases?
Does the following script work on the A10G instance:
If not, can you try this script on an L4 or L40S based instance? We did most of our testing on those. Is CUDA_VISIBLE_DEVICES set to some odd value?
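The "odd value" question above can be checked mechanically. Below is a minimal sketch of how the CUDA runtime interprets CUDA_VISIBLE_DEVICES; `parse_visible_devices` is a hypothetical helper written for illustration, not part of MMseqs2 or the CUDA toolkit:

```python
import os

def parse_visible_devices(value):
    """Approximate how the CUDA runtime reads CUDA_VISIBLE_DEVICES:
    entries are taken left to right, and parsing stops at the first
    invalid or negative entry, hiding every device after it."""
    if value is None:
        return None  # unset: all devices visible
    devices = []
    for entry in value.split(","):
        entry = entry.strip()
        # UUID-style entries (GPU-... / MIG-...) pass through as-is
        if entry.startswith("GPU-") or entry.startswith("MIG-"):
            devices.append(entry)
            continue
        try:
            idx = int(entry)
        except ValueError:
            break  # invalid entry hides all remaining devices
        if idx < 0:
            break  # e.g. "-1" hides all remaining devices
        devices.append(idx)
    return devices

# Values that silently hide GPUs:
print(parse_visible_devices("0"))       # [0]  -> one device visible
print(parse_visible_devices("-1"))      # []   -> no devices visible
print(parse_visible_devices("0,oops"))  # [0]  -> second entry dropped
```

Checking `os.environ.get("CUDA_VISIBLE_DEVICES")` against this on the failing instance would rule the variable in or out quickly.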
Update: The script you shared generates the same error that I saw before, both from within the container and on the host. I'll try it on an L4 and L40S here in a bit and report back.
Good news! The script works on both the L4 and L40S, both directly on the host and from the container. Maybe there's an issue running on Ampere? My original script works as well.
Closing this for now, but you may want to update the wiki to strongly encourage Lovelace-gen GPUs for best results |
Reopening as we need to keep investigating. MMseqs2-GPU should work even on Turing (albeit rather slowly). Ampere and newer should all work fine. |
I tried on a 2080 Ti (Turing, CUDA cap 7.5), A5000 (Ampere, 8.6), 4090 (Ada Lovelace, 8.9) and L40S (Ada Lovelace, 8.9). Works everywhere. Did you also try on an A100? Does this only happen on an A10G? Would it be possible to give me temporary access to an A10G machine?
Ok, I just tried on an A100 and it also works fine, both inside and outside the container. So, at least so far, it seems specific to the A10G. It looks like the list of CUDA architectures passed to the compiler in the Docker container includes the one for the A10G (8.6), so that's not the problem. I can't think of anything else that would be A10G-specific, but we're just about at the edge of my depth on CUDA.
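The architecture reasoning above can be sketched as a quick check. The capability table below is from NVIDIA's published specs for the GPUs discussed in this thread; the semicolon-separated arch list is a made-up example of what a build might pass via something like CMAKE_CUDA_ARCHITECTURES, not the actual flags in the mmseqs2 Dockerfile:

```python
# Compute capabilities for the GPUs mentioned in this thread
# (per NVIDIA's published specifications).
COMPUTE_CAP = {
    "RTX 2080 Ti": "7.5",  # Turing
    "A100": "8.0",         # Ampere
    "A10G": "8.6",         # Ampere
    "RTX A5000": "8.6",    # Ampere
    "RTX 4090": "8.9",     # Ada Lovelace
    "L4": "8.9",           # Ada Lovelace
    "L40S": "8.9",         # Ada Lovelace
}

def arch_supported(gpu, arch_list):
    """Check whether a GPU's compute capability appears in a
    semicolon-separated arch list such as "75;80;86;89"."""
    cap = COMPUTE_CAP[gpu].replace(".", "")
    return cap in arch_list.split(";")

archs = "75;80;86;89"  # hypothetical example arch list
print(arch_supported("A10G", archs))  # True: 8.6 is compiled in
```

Since the A10G's 8.6 matches the A5000's 8.6, a missing architecture would have broken the A5000 too, which supports the conclusion that the arch list isn't the problem.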
One more update. I switched over to an Ubuntu AMI running CUDA 12.4 on an A10G and it worked. I've seen some past reports of CUDA version inconsistencies when compiling on an A100 vs. an A10G using Amazon Linux, so it's probably something to do with that. However, since (A) it works on a g5 running Ubuntu, and (B) it works on a g6 and p4 running Amazon Linux, I'm satisfied. No edits necessary to the wiki.
Thanks! Can you please share the Driver version used in the last run with CUDA 12.4? |
Yep, that's right. The one that worked on A10G was 550.144.03 + CUDA 12.4 |
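The driver/CUDA pairing above can be validated with a simple version comparison. As a sketch, the minimum Linux driver for CUDA 12.4 is taken here as 550.54.14 — that constant comes from NVIDIA's CUDA release notes and should be treated as an assumption to verify against the current documentation:

```python
def parse_version(v):
    """Turn a dotted driver version like '550.144.03' into a tuple
    that compares numerically, so 550.144.03 > 550.54.14."""
    return tuple(int(part) for part in v.split("."))

# Assumed minimum Linux driver for CUDA 12.4 (verify against
# NVIDIA's release notes before relying on this constant).
MIN_DRIVER_CUDA_12_4 = "550.54.14"

def driver_ok(driver, minimum=MIN_DRIVER_CUDA_12_4):
    """True if the installed driver meets the CUDA 12.4 minimum."""
    return parse_version(driver) >= parse_version(minimum)

print(driver_ok("550.144.03"))  # True: the driver that worked on the A10G
```

Note that string comparison would get this wrong ("550.144.03" < "550.54.14" lexicographically), which is why the versions are parsed into integer tuples first.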
Summary
When running a GPU-accelerated search using the mmseqs2 Docker image against the colabfold_envdb_202108 or uniref30_2302_db databases, I see a libmarv initialization error. The failing line is involved in getting the CUDA device count, so maybe it has something to do with seeing the GPUs on the instance?
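One quick way to check what the instance actually exposes is to count the devices listed by `nvidia-smi -L`, which prints one GPU per line. A minimal sketch follows; the sample output is illustrative, not captured from the failing instance:

```python
def count_gpus(nvidia_smi_l_output):
    """Count device lines in `nvidia-smi -L` output, which lists one
    GPU per line in the form 'GPU 0: NVIDIA A10G (UUID: GPU-...)'."""
    return sum(
        1 for line in nvidia_smi_l_output.splitlines()
        if line.strip().startswith("GPU ")
    )

# Illustrative sample of what a g5.12xlarge (4x A10G) might print:
sample = """\
GPU 0: NVIDIA A10G (UUID: GPU-aaaa)
GPU 1: NVIDIA A10G (UUID: GPU-bbbb)
GPU 2: NVIDIA A10G (UUID: GPU-cccc)
GPU 3: NVIDIA A10G (UUID: GPU-dddd)
"""
print(count_gpus(sample))  # 4
```

If this count is nonzero inside the container but the CUDA runtime still reports no devices, the problem is below the driver-visibility layer, which narrows the search considerably.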
Environment
Steps to reproduce
Things I've tried
✅ SUCCESS: Run search with --gpu 0 (i.e. turn off GPU acceleration)
⛔ ERROR: Set the CUDA_VISIBLE_DEVICES="0" environment variable
⛔ ERROR: Use the ghcr.io/soedinglab/mmseqs2:latest-cuda12 container
⛔ ERROR: Clone the GitHub repo and build the Dockerfile
⛔ ERROR: Modify the Dockerfile to COPY /opt/build/lib/libmarv from the build stage
⛔ ERROR: Modify the Dockerfile to use the precompiled binary
⛔ ERROR: Run using a g5.12xlarge (4x A10G GPUs instead of 1x)