Multi-GPU Inference #9259
Replies: 7 comments 20 replies
-
@ricardorei Have you solved this problem? I find that
-
this doesn't work for you?

```python
trainer = Trainer(..., strategy='ddp')
model = ...
preds = trainer.predict(model, predict_dataloader)
```
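For completeness, here is a minimal self-contained sketch of that suggestion; the toy model and random data are only placeholders for your own LightningModule and DataLoader.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Placeholder model; substitute your own LightningModule."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def predict_step(self, batch, batch_idx, dataloader_idx=0):
        (x,) = batch
        return self.layer(x)


if __name__ == "__main__":
    predict_dataloader = DataLoader(TensorDataset(torch.randn(128, 8)), batch_size=16)

    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
    preds = trainer.predict(ToyModel(), predict_dataloader)
    # Note: under DDP each process runs on its own shard of the data, so
    # `preds` only contains the predictions made by the local rank.
```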
-
@rohitgr7
-
Alright, it took me some time to figure out the best way to do this, but here is my solution using DDP:

```python
import logging
import os
import shutil
import tempfile

import torch
from pytorch_lightning.callbacks import BasePredictionWriter

from .utils import Prediction, flatten_metadata, restore_list_order

logger = logging.getLogger(__name__)


class CustomWriter(BasePredictionWriter):
    """PyTorch Lightning callback that saves predictions and the corresponding batch
    indices in a temporary folder when using multi-GPU inference.

    Args:
        write_interval (str): When to perform write operations. Defaults to 'epoch'.
    """

    def __init__(self, write_interval="epoch") -> None:
        super().__init__(write_interval)

    def write_on_epoch_end(self, trainer, pl_module, predictions, batch_indices):
        """Saves predictions after running inference on all samples."""
        # We need to save predictions in the most secure manner possible to avoid
        # multiple users and processes writing to the same folder.
        # For that we will create a tmp folder that will be shared only across
        # the DDP processes that were created.
        if trainer.is_global_zero:
            output_dir = [tempfile.mkdtemp()]
            logger.info(
                "Created temporary folder to store predictions: {}.".format(output_dir[0])
            )
        else:
            output_dir = [None]

        torch.distributed.broadcast_object_list(output_dir)
        # Make sure every process received the output_dir from RANK=0
        torch.distributed.barrier()

        # Now that we have a single output_dir shared across processes we can save
        # predictions along with their indices.
        self.output_dir = output_dir[0]

        # This will create N (num processes) files in `output_dir`, each containing
        # the predictions of its respective rank.
        torch.save(
            predictions, os.path.join(self.output_dir, f"pred_{trainer.global_rank}.pt")
        )
        # Optionally, you can also save `batch_indices` to get the information about
        # the data index from your prediction data.
        torch.save(
            batch_indices,
            os.path.join(self.output_dir, f"batch_indices_{trainer.global_rank}.pt"),
        )

    def gather_all_predictions(self):
        """Reads all saved predictions from self.output_dir into one single
        Prediction object, respecting the original order of the samples.
        """
        files = sorted(os.listdir(self.output_dir))
        pred = flatten_predictions(
            [torch.load(os.path.join(self.output_dir, f))[0] for f in files if "pred" in f]
        )
        indices = flatten_predictions(
            [torch.load(os.path.join(self.output_dir, f))[0] for f in files if "batch_indices" in f]
        )
        # TODO: building the final output from `pred` and `indices` depends on your application
        return pred, indices

    def cleanup(self):
        """Cleans temporary files."""
        logger.info("Cleanup temporary folder: {}.".format(self.output_dir))
        shutil.rmtree(self.output_dir)
```
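A sketch of how the callback above might be wired up; `model` and `predict_dataloader` are placeholders for your own objects.

```python
from pytorch_lightning import Trainer

pred_writer = CustomWriter(write_interval="epoch")
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp", callbacks=[pred_writer])

# return_predictions=False avoids keeping every prediction in memory on each
# rank; the callback persists them to the shared temporary folder instead.
trainer.predict(model, predict_dataloader, return_predictions=False)

# Only rank zero needs to read the per-rank files back and clean up.
if trainer.is_global_zero:
    pred, indices = pred_writer.gather_all_predictions()
    pred_writer.cleanup()
```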
-
This code is honestly not well documented. 🥁 After experimenting for many hours, this worked for me 🥁:
The problem is that if you don't call gather on all nodes, it will hang waiting for the other nodes to respond. Native PyTorch has comparable functions for
The annoying thing you will find is that this function is called after the model returns predictions, i.e.:
and if you coalesce the results returned by this line with strategy
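If the missing reference above is to the collective ops, a minimal sketch using torch.distributed.all_gather_object could look like the following; the helper name and the way predictions are flattened are illustrative only. Because it is a collective, every rank must call it, which is exactly why skipping the call on some nodes makes the others hang.

```python
import torch.distributed as dist


def gather_predictions(local_predictions, world_size):
    """Hypothetical helper: gather each rank's list of predictions on all ranks."""
    gathered = [None] * world_size
    # Collective call: every process in the group must reach this line,
    # otherwise the remaining ranks block waiting for it.
    dist.all_gather_object(gathered, local_predictions)
    # Every rank now holds the predictions from all ranks.
    return [p for rank_preds in gathered for p in rank_preds]
```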
-
Here is an updated gist of how to do this: https://gist.github.com/will-thompson-k/f6201b68c428d0344a6affa6d53bc91b
-
Is multi-GPU inference currently supported for trainers with strategy "deepspeed_stage_3"?
-
Hi all!
What is the best way to perform inference (predict) using multi-GPU?
ATM in our framework we are relying on DP, which is extremely slow, and when I switch to DDP it basically splits the data loader into several data loaders and produces several "independent" system outputs. I would like something like DDP where, in the end, I could call a "merge" function to gather all the predictions that were performed by the different processes.
Am I missing something? There is probably a way to do this already...
Here is our predict code: https://github.com/Unbabel/COMET/blob/master/comet/models/base.py#L395
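For what it's worth, one way to get that kind of "merge" is to gather inside predict_step with LightningModule.all_gather. This is only a sketch, assuming each prediction is a fixed-size tensor and using a toy stand-in model rather than the COMET one.

```python
import torch
import pytorch_lightning as pl


class ToyRegressor(pl.LightningModule):
    """Hypothetical stand-in model; the gather pattern is the point here."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def predict_step(self, batch, batch_idx, dataloader_idx=0):
        local_pred = self.layer(batch)           # (batch_size, 1) on this rank
        # Under DDP, all_gather adds a leading world_size dimension, so every
        # rank ends up with the predictions computed by every other rank.
        gathered = self.all_gather(local_pred)   # (world_size, batch_size, 1)
        return gathered.flatten(0, 1)            # (world_size * batch_size, 1)
```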