---
title: Multi-threading/processing for large array data, in PyTorch
subtitle: what I learnt reviewing cellfinder PR \#440
author: Alessandro Felder, (Matt Einhorn)
execute:
  enabled: true
format:
  revealjs:
    theme: [default, niu-dark.scss]
    logo: img/logo_niu_dark.png
    footer: "Multi-threading/processing | 2024-09-03"
    slide-number: c
    menu:
      numbers: true
    chalkboard: true
    scrollable: true
    preview-links: false
    view-distance: 10
    mobile-view-distance: 10
    auto-animate: true
    auto-play-media: true
    code-overflow: wrap
    highlight-style: atom-one
    mermaid:
      theme: neutral
      fontFamily: arial
      curve: linear
  html:
    theme: [default, niu-dark.scss]
    logo: img/logo_niu_dark.png
    date: "2023-07-05"
    toc: true
    code-overflow: scroll
    highlight-style: atom-one
    mermaid:
      theme: neutral
      fontFamily: arial
      curve: linear
    margin-left: 0
    embed-resources: true
    page-layout: full
my-custom-stuff:
  my-reuseable-variable: "some stuff"
---
## Table of contents
* Context
* Threads, Processes and Queues in Python
* Multiprocessing in PyTorch
* Redesigning multiprocessing for cellfinder, in PyTorch
## Context {.smaller}
::: {.incremental}
* cellfinder classification has moved to `pytorch` (thanks, Igor!)
* Matt (developer at Cornell) has become a regular cellfinder contributor
* knows pytorch
* his lab needs speed (e.g. for cFos-stained whole-brain samples)
* Matt translated the "cell candidate detection steps" to pytorch
* I needed to learn how parallelisation works in pytorch, to review the code.
* turns out I needed to learn Python first!
:::
## Threads versus Processes[^1]
::: {.fragment .fade-in-then-semi-out}
::: {style="margin-top: 1em; font-size: 0.5em;"}
"A process is an instance of a program (e.g. Jupyter notebook, Python interpreter). Processes spawn threads (sub-processes) to handle subtasks like reading keystrokes, loading HTML pages, saving files. Threads live inside processes and share the same memory space."
:::
:::
:::: {.columns}
::: {.column width="50%" style="font-size: 0.5em;"}
::: {.fragment .fade-in}
Processes
::: {.incremental}
* can have multiple threads
* can execute code simultaneously in the same Python program
* have more overhead than threads, as opening and closing processes takes more time
* share information more slowly than threads do, as processes do not share memory space (pickling!)
:::
:::
:::
::: {.column width="50%" style="font-size: 0.5em;"}
::: {.fragment .fade-in}
Threads
::: {.incremental}
* are like mini-processes that live inside a process
* share memory space, so they can read and write the same variables efficiently
* cannot execute Python code simultaneously in the same program, because of the Global Interpreter Lock (although there are workarounds)
:::
:::
:::
::::
[^1]: [Brendan Fortuner on Medium](https://medium.com/@bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b)
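To make the shared-memory point concrete, here is a minimal sketch (not from the original slides): a thread mutates an ordinary list, and the change is immediately visible to the main thread, with no queue or pickling involved.

```{python}
#| echo: true
from threading import Thread

results = []  # an ordinary list in the process's memory

def worker():
    # the thread writes directly into the parent's list:
    # threads share the process's memory space
    results.append("written by thread")

t = Thread(target=worker)
t.start()
t.join()
print(results)  # the mutation is visible in the main thread
```

With a separate *process*, the child would receive a pickled copy of the list, and the parent's copy would stay empty.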
## A Python Queue
::: {style="text-align: center; margin-top: 1em"}
[A Python Queue](https://docs.python.org/3/library/queue.html#queue.Queue){preview-link="true" style="text-align: center"}
:::
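As a concrete illustration of the linked `queue.Queue` API (this producer/consumer sketch is mine, not from the slides): one thread consumes items while the main thread produces them, coordinated via `task_done()` and `join()`, with `None` as an assumed sentinel value.

```{python}
#| echo: true
from queue import Queue
from threading import Thread

q = Queue(maxsize=4)  # bounded queue: put() blocks when full
consumed = []

def consumer():
    while True:
        item = q.get()       # blocks until an item is available
        if item is None:     # sentinel value: stop consuming
            q.task_done()
            break
        consumed.append(item)
        q.task_done()        # mark this item as processed

c = Thread(target=consumer)
c.start()
for i in range(5):
    q.put(i)
q.put(None)  # tell the consumer to stop
q.join()     # wait until every put() item has been task_done()
c.join()
print(consumed)
```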
## Multithreading
```{python}
#| echo: true
# threads share local memory (by default)
from threading import Thread
from queue import Queue


def put_hello_in_queue(q):
    q.put('hello')


if __name__ == '__main__':
    q = Queue()
    print(type(q))
    threads = []
    for i in range(7):
        t = Thread(target=put_hello_in_queue, args=(q,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    print([q.get() for i in range(q.qsize())])
```
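A higher-level alternative worth knowing about (my addition, not covered in the slides): `concurrent.futures.ThreadPoolExecutor` handles starting, joining and reusing the threads, and collects return values for you.

```{python}
#| echo: true
from concurrent.futures import ThreadPoolExecutor

def say_hello(i):
    return 'hello'

# the executor manages the thread lifecycle internally;
# map() returns results in submission order
with ThreadPoolExecutor(max_workers=7) as pool:
    results = list(pool.map(say_hello, range(7)))
print(results)
```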
## Multiprocessing
```{.python code-line-numbers="1|8-9|14|17"}
import multiprocessing as mp


def put_hello_in_queue(q):
    q.put('hello')


if __name__ == "__main__":
    ctx = mp.get_context('spawn')
    # multiprocessing queue contents are shared across processes
    q = ctx.Queue()
    print(type(q))
    processes = []
    for i in range(7):
        p = ctx.Process(target=put_hello_in_queue, args=(q,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    print([q.get() for i in range(7)])
```
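The `'spawn'` argument selects a *start method*: `spawn` launches a fresh interpreter (slower, but safe and available everywhere), while `fork` copies the parent process (fast, POSIX only) and `forkserver` forks from a clean helper process. A quick way to see what your platform supports (a sketch; the output varies by OS):

```{python}
#| echo: true
import multiprocessing as mp

# all start methods available on this platform
print(mp.get_all_start_methods())
# the method currently in effect (None if not chosen yet)
print(mp.get_start_method(allow_none=True) or "not set yet")
```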
* [Python Multiprocessing module](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing)
* [Multiprocessing Queue](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue)
* [Pickling](https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled)
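Why pickling matters here: anything passed to a `Process` (or through a multiprocessing queue) must be picklable. A minimal sketch of what does and does not survive:

```{python}
#| echo: true
import pickle

# plain built-in data structures pickle fine
payload = {"greeting": "hello", "n": 3}
assert pickle.loads(pickle.dumps(payload)) == payload

# a lambda cannot be pickled, so it cannot cross a process boundary
try:
    pickle.dumps(lambda x: x)
    lambda_picklable = True
except Exception:
    lambda_picklable = False
print("lambda picklable?", lambda_picklable)
```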
## PyTorch multiprocessing
::: {style="text-align: center; margin-top: 1em"}
[torch.multiprocessing is a wrapper around the native multiprocessing module.](https://pytorch.org/docs/stable/multiprocessing.html){preview-link="true" style="text-align: center"}
:::
## PyTorch multiprocessing
::: {style="text-align: center; margin-top: 1em"}
[Sharing CUDA tensors](https://pytorch.org/docs/stable/multiprocessing.html#sharing-cuda-tensors){preview-link="true" style="text-align: center"}
:::
## Cellfinder multiprocessing/threading
New `pytorch`-friendly implementation of parallelisation in cellfinder's cell candidate detection step
::: {style="text-align: center; margin-top: 1em"}
[threading.py](https://github.com/brainglobe/cellfinder/blob/dc1d740589f697680f3868f4a4a0662c1fef1616/cellfinder/core/tools/threading.py){preview-link="true" style="text-align: center"}
:::
::: {style="text-align: center; margin-top: 1em"}
[test_threading.py](https://github.com/brainglobe/cellfinder/blob/dc1d740589f697680f3868f4a4a0662c1fef1616/tests/core/test_unit/test_tools/test_threading.py){preview-link="true" style="text-align: center"}
:::
## Volume Filter
::: {style="text-align: center; margin-top: 1em"}
[Volume filter](https://github.com/brainglobe/cellfinder/blob/dc1d740589f697680f3868f4a4a0662c1fef1616/cellfinder/core/detect/filters/volume/volume_filter.py){preview-link="true" style="text-align: center"}
:::
## Performance and results
::: {style="text-align: center; margin-top: 1em"}
[Matt's benchmarks](https://github.com/brainglobe/cellfinder/pull/440){preview-link="true" style="text-align: center"}
:::
## Performance?
cFos data from Nic Lavoie (MIT) on our HPC, with GPU:
* old version of cellfinder: 9 hours for ~3 million cell candidates
* new version of cellfinder: 2 hours for ~3 million cell candidates
## Next steps
* Turn these slides into docs with nice explanatory images
* Tweak PR 440 (expose extra parameters)
* merge and release!
## Concluding thoughts
* I still don't understand everything
* There are ways to parallelise Python (and PyTorch)
* Processes and threads are appropriate in different situations...
* ... "optimisation" of code is empirical to some extent