libmarv initialization error during GPU-accelerated search against ColabFold databases using Docker #941

Open
brianloyal opened this issue Jan 23, 2025 · 12 comments


@brianloyal

brianloyal commented Jan 23, 2025

Summary

When running a GPU-accelerated search using the MMseqs2 Docker image against the colabfold_envdb_202108 or uniref30_2302_db databases, I see an error saying

CUDA error: initialization error : /opt/build/lib/libmarv/src/marv.cu, line 85
Error: Prefilter died

That particular line is involved in getting the CUDA device count, so maybe it has something to do with seeing the GPUs on the instance?
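
To check whether this is specific to MMseqs2 or reproducible with any first CUDA runtime call, something like the following standalone program can be compiled with nvcc on the same host. This is just a sketch: the file name check_init.cu is made up, and it assumes the call wrapped at marv.cu line 85 is cudaGetDeviceCount (or a similar runtime call), as the error message suggests.

// check_init.cu -- minimal sketch; assumes the failing call at marv.cu:85 is
// cudaGetDeviceCount() or a similar first CUDA runtime call.
// Build and run: nvcc check_init.cu -o check_init && ./check_init
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // On the failing A10G setup this would be expected to print
        // "initialization error", matching the libmarv message above.
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("CUDA devices visible: %d\n", count);
    return 0;
}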

Environment

  • AWS g5.8xlarge EC2 instance
    • AMD EPYC 7R32 processor
    • 32x vCPU
    • 128 GiB Memory
    • 1x NVIDIA A10G accelerator (24 GiB VRAM)
  • Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.5.1 (Amazon Linux 2023) 20250107
    • Linux kernel 6.1.119-129.201.amzn2023.x86_64
    • CUDA Version: 12.6
    • NVIDIA Driver Version: 560.35.03
    • PyTorch 2.5.1
  • Container ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12
    • Pulled Wednesday, January 22, 2025

Steps to reproduce

mkdir data
wget https://www.rcsb.org/fasta/entry/1UTN -O data/1utn.fasta
wget https://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz
tar -xzvf colabfold_envdb_202108.tar.gz -C data
rm colabfold_envdb_202108.tar.gz
docker pull ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12
docker run -it --rm --gpus all -v "$(pwd)/data:/home/data" ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12 tsv2exprofiledb "/home/data/colabfold_envdb_202108" "/home/data/targetDB" --gpu 1
docker run -it --rm --gpus all -v "$(pwd)/data:/home/data" ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12 createdb "/home/data/1utn.fasta" "/home/data/queryDB"
docker run -it --rm --gpus all -v "$(pwd)/data:/home/data" ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12 search "/home/data/queryDB" "/home/data/targetDB" "/home/data/result" "/home/data/tmp" --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --gpu 1 --prefilter-mode 1

Things I've tried

✅ SUCCESS: Run search with --gpu 0 (i.e. turn off GPU acceleration)
⛔ ERROR: Set the CUDA_VISIBLE_DEVICES="0" environment variable
⛔ ERROR: Use the ghcr.io/soedinglab/mmseqs2:latest-cuda12 container
⛔ ERROR: Clone the GitHub repo and build the Dockerfile
⛔ ERROR: Modify the Dockerfile to COPY /opt/build/lib/libmarv from the build stage
⛔ ERROR: Modify the Dockerfile to use the precompiled binary
⛔ ERROR: Run using a g5.12xlarge (4x A10G GPUs instead of 1x)

@brianloyal brianloyal changed the title libmarv initialization error during GPU-accelerated search against ColabFold databases libmarv initialization error during GPU-accelerated search against ColabFold databases using Docker Jan 23, 2025
@milot-mirdita
Member

Does the precompiled binary work?

wget https://mmseqs.com/latest/mmseqs-linux-gpu.tar.gz

Maybe there is something wrong with the Docker image.

@brianloyal
Author

@milot-mirdita Yeah, I tried that too and got the same result. For my own sanity, have you been able to successfully run a GPU search against the ColabFold profile databases?

@milot-mirdita
Member

Does the following script work on the A10G instance:

wget https://mmseqs.com/latest/mmseqs-linux-gpu.tar.gz
tar xzvf mmseqs-linux-gpu.tar.gz
wget https://raw.githubusercontent.com/soedinglab/MMseqs2/refs/heads/master/examples/QUERY.fasta
./mmseqs/bin/mmseqs easy-search QUERY.fasta QUERY.fasta res tmp --gpu 1

If not, can you try this script on an L4- or L40S-based instance? We did most of our testing on those.

Is CUDA_VISIBLE_DEVICES set to some odd value?

@brianloyal
Author

Update: The script you shared generates the same error that I saw before, both from within the ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12 container and directly on the host

docker_stdout.txt

I'll try it on an L4 and an L40S here in a bit and report back.

@brianloyal
Author

brianloyal commented Jan 23, 2025

Good news! The script works on both the L4 and the L40S, both directly on the host and from the container. Maybe there's an issue running on Ampere?

My original script works as well

@brianloyal
Author

Closing this for now, but you may want to update the wiki to strongly encourage Lovelace-generation GPUs for best results.

@milot-mirdita
Member

Reopening as we need to keep investigating. MMseqs2-GPU should work even on Turing (albeit rather slowly). Ampere and newer should all work fine.

@milot-mirdita
Member

I tried on a 2080 Ti (Turing, compute capability 7.5), an A5000 (Ampere, 8.6), a 4090 (Ada Lovelace, 8.9), and an L40S (Ada Lovelace, 8.9). It works everywhere. Did you also try on an A100? Does this only happen on the A10G? Would it be possible to give me temporary access to an A10G machine?

@brianloyal
Author

Ok, I just tried on an A100 and it also works fine, both inside and outside the container. So, at least so far, it seems specific to the A10G.

It looks like the list of CUDA architectures passed to the compiler in the Docker container includes the one for the A10G (8.6), so that's not the problem. I can't think of anything else that would be A10G-specific, but we're just about at the edge of my depth on CUDA.
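
As a sanity check (just a sketch; the file name query_cc.cu is made up), the compute capability the runtime reports for each visible device can be printed and compared against the architecture list the binary was built with:

// query_cc.cu -- hypothetical helper; prints each device's name and compute
// capability so it can be compared against the compiled-in architectures.
// Build and run: nvcc query_cc.cu -o query_cc && ./query_cc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::printf("device enumeration failed\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            std::printf("GPU %d: %s, compute capability %d.%d\n",
                        i, prop.name, prop.major, prop.minor);
        }
    }
    return 0;
}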

@brianloyal
Author

One more update: I switched over to an Ubuntu AMI running CUDA 12.4 on an A10G and it worked. I've seen some reports in the past of CUDA version inconsistencies when compiling on an A100 vs. an A10G using Amazon Linux, so it's probably something to do with that. However, since (A) it works on a g5 running Ubuntu, and (B) it works on a g6 and a p4 running Amazon Linux, I'm satisfied. No edits to the wiki are necessary.
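
For anyone comparing the two setups, one quick check is to print the CUDA version the installed driver supports next to the runtime version a binary is linked against. This is only a sketch (the file name version_check.cu is made up), not a statement about the root cause:

// version_check.cu -- hypothetical sketch comparing the driver-supported CUDA
// version with the runtime version this program is linked against.
// Build and run: nvcc version_check.cu -o version_check && ./version_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);   // highest CUDA version the driver supports
    cudaRuntimeGetVersion(&runtime); // CUDA runtime version linked into this binary
    std::printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
                driver / 1000, (driver % 1000) / 10,
                runtime / 1000, (runtime % 1000) / 10);
    return 0;
}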

@achacond
Contributor

Thanks! Can you please share the driver version used in the last run with CUDA 12.4?
If I understood correctly, the issue comes from driver version 560.35.03 + CUDA 12.6?

@brianloyal
Author

Yep, that's right. The combination that worked on the A10G was driver 550.144.03 + CUDA 12.4.
