libmarv initialization error during GPU-accelerated search against ColabFold databases using Docker #941

Open
brianloyal opened this issue Jan 23, 2025 · 12 comments


@brianloyal

brianloyal commented Jan 23, 2025

Summary

When running a GPU-accelerated search using the MMseqs2 Docker image against the colabfold_envdb_202108 or uniref30_2302_db databases, I see an error saying

CUDA error: initialization error : /opt/build/lib/libmarv/src/marv.cu, line 85
Error: Prefilter died

That particular line is involved in getting the CUDA device count, so maybe it has something to do with seeing the GPUs on the instance?
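
To check whether this is specific to MMseqs2 or reproducible with any first CUDA runtime call, something like the following standalone program can be compiled with nvcc on the same host. This is just a sketch: the file name check_init.cu is made up, and it assumes the call wrapped at marv.cu line 85 is cudaGetDeviceCount (or a similar runtime call), as the error message suggests.

// check_init.cu -- minimal sketch; assumes the failing call at marv.cu:85 is
// cudaGetDeviceCount() or a similar first CUDA runtime call.
// Build and run: nvcc check_init.cu -o check_init && ./check_init
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // On the failing A10G setup this would be expected to print
        // "initialization error", matching the libmarv message above.
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("CUDA devices visible: %d\n", count);
    return 0;
}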

Environment

  • AWS g5.8xlarge EC2 instance
    • AMD EPYC 7R32 processor
    • 32x vCPU
    • 128 GiB Memory
    • 1x NVIDIA A10G accelerator (24 GiB VRAM)
  • Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.5.1 (Amazon Linux 2023) 20250107
    • Linux kernel 6.1.119-129.201.amzn2023.x86_64
    • CUDA Version: 12.6
    • NVIDIA Driver Version: 560.35.03
    • PyTorch 2.5.1
  • Container ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12
    • Pulled Wednesday, January 22, 2025

Steps to reproduce

mkdir data
wget https://www.rcsb.org/fasta/entry/1UTN -O data/1utn.fasta
wget https://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz
tar -xzvf colabfold_envdb_202108.tar.gz -C data
rm colabfold_envdb_202108.tar.gz
docker pull ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12
docker run -it --rm --gpus all -v "$(pwd)/data:/home/data" ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12 tsv2exprofiledb "/home/data/colabfold_envdb_202108" "/home/data/targetDB" --gpu 1
docker run -it --rm --gpus all -v "$(pwd)/data:/home/data" ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12 createdb "/home/data/1utn.fasta" "/home/data/queryDB"
docker run -it --rm --gpus all -v "$(pwd)/data:/home/data" ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12 search "/home/data/queryDB" "/home/data/targetDB" "/home/data/result" "/home/data/tmp" --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --gpu 1 --prefilter-mode 1

Things I've tried

✅ SUCCESS: Run search with --gpu 0 (i.e. turn off GPU acceleration)
⛔ ERROR: Set the CUDA_VISIBLE_DEVICES="0" environment variable
⛔ ERROR: Use the ghcr.io/soedinglab/mmseqs2:latest-cuda12 container
⛔ ERROR: Clone the GitHub repo and build the Dockerfile
⛔ ERROR: Modify the Dockerfile to COPY /opt/build/lib/libmarv from the build stage
⛔ ERROR: Modify the Dockerfile to use the precompiled binary
⛔ ERROR: Run using a g5.12xlarge (4x A10G GPUs instead of 1x)

@brianloyal brianloyal changed the title libmarv initialization error during GPU-accelerated search against ColabFold databases libmarv initialization error during GPU-accelerated search against ColabFold databases using Docker Jan 23, 2025
@milot-mirdita
Member

Does the precompiled binary work?

wget https://mmseqs.com/latest/mmseqs-linux-gpu.tar.gz

Maybe there is something wrong with the Docker image.

@brianloyal
Author

@milot-mirdita Yeah, I tried that too and got the same result. For my own sanity, have you been able to successfully run a GPU search against the ColabFold profile databases?

@milot-mirdita
Member

Does the following script work on the A10G instance:

wget https://mmseqs.com/latest/mmseqs-linux-gpu.tar.gz
tar xzvf mmseqs-linux-gpu.tar.gz
wget https://raw.githubusercontent.com/soedinglab/MMseqs2/refs/heads/master/examples/QUERY.fasta
./mmseqs/bin/mmseqs easy-search QUERY.fasta QUERY.fasta res tmp --gpu 1

If not, can you try this script on an L4- or L40S-based instance? We did most of our testing on those.

Is CUDA_VISIBLE_DEVICES set to some odd value?

@brianloyal
Author

Update: The script you shared generates the same error that I saw before, both from within the ghcr.io/soedinglab/mmseqs2:17-b804f-cuda12 container and directly on the host

docker_stdout.txt

I'll try it on an L4 and an L40S here in a bit and report back.

@brianloyal
Author

brianloyal commented Jan 23, 2025

Good news! The script works on both the L4 and the L40S, both directly on the host and from the container. Maybe there's an issue running on Ampere?

My original script works as well

@brianloyal
Author

Closing this for now, but you may want to update the wiki to strongly encourage Lovelace-generation GPUs for best results.

@milot-mirdita
Member

Reopening as we need to keep investigating. MMseqs2-GPU should work even on Turing (albeit rather slowly). Ampere and newer should all work fine.

@milot-mirdita
Member

I tried on a 2080 Ti (Turing, compute capability 7.5), an A5000 (Ampere, 8.6), a 4090 (Ada Lovelace, 8.9), and an L40S (Ada Lovelace, 8.9). It works everywhere. Did you also try on an A100? Does this only happen on the A10G? Would it be possible to give me temporary access to an A10G machine?

@brianloyal
Author

Ok, I just tried on an A100 and it also works fine, both inside and outside the container. So, at least so far, it seems specific to the A10G.

It looks like the list of CUDA architectures passed to the compiler in the Docker container includes the one for the A10G (8.6), so that's not the problem. I can't think of anything else that would be A10G-specific, but we're just about at the edge of my depth on CUDA.
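
As a sanity check (just a sketch; the file name query_cc.cu is made up), the compute capability the runtime reports for each visible device can be printed and compared against the architecture list the binary was built with:

// query_cc.cu -- hypothetical helper; prints each device's name and compute
// capability so it can be compared against the compiled-in architectures.
// Build and run: nvcc query_cc.cu -o query_cc && ./query_cc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::printf("device enumeration failed\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            std::printf("GPU %d: %s, compute capability %d.%d\n",
                        i, prop.name, prop.major, prop.minor);
        }
    }
    return 0;
}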

@brianloyal
Author

One more update: I switched over to an Ubuntu AMI running CUDA 12.4 on an A10G and it worked. I've seen some reports in the past of CUDA version inconsistencies when compiling on an A100 vs. an A10G using Amazon Linux, so it's probably something to do with that. However, since (A) it works on a g5 running Ubuntu, and (B) it works on a g6 and a p4 running Amazon Linux, I'm satisfied. No edits to the wiki are necessary.
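
For anyone comparing the two setups, one quick check is to print the CUDA version the installed driver supports next to the runtime version a binary is linked against. This is only a sketch (the file name version_check.cu is made up), not a statement about the root cause:

// version_check.cu -- hypothetical sketch comparing the driver-supported CUDA
// version with the runtime version this program is linked against.
// Build and run: nvcc version_check.cu -o version_check && ./version_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);   // highest CUDA version the driver supports
    cudaRuntimeGetVersion(&runtime); // CUDA runtime version linked into this binary
    std::printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
                driver / 1000, (driver % 1000) / 10,
                runtime / 1000, (runtime % 1000) / 10);
    return 0;
}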

@achacond
Contributor

Thanks! Can you please share the driver version used in the last run with CUDA 12.4?
If I understood correctly, the issue comes from driver version 560.35.03 + CUDA 12.6?

@brianloyal
Author

Yep, that's right. The combination that worked on the A10G was driver 550.144.03 + CUDA 12.4.
