Documentation/Tutorial for multithreading #3078

hanbin973 · 2025-01-10T10:35:37Z

Continuing from #3077.

I think this link from numpy docs is a good starting point: https://numpy.org/doc/2.2/reference/random/multithreading.html

The bottom line is that one can run multiple computations concurrently with concurrent.futures.ThreadPoolExecutor, provided that the computation-heavy parts of the program release the GIL.
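As a minimal sketch of that pattern, the snippet below maps a function over independent chunks with a thread pool. It uses only the standard library: hashlib.sha256 is a stand-in for the heavy computation (it releases the GIL when hashing large buffers) and is not a tskit call. The names `heavy` and `chunks` are made up for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def heavy(chunk: bytes) -> str:
    # Stand-in for a computation-heavy call that releases the GIL.
    # hashlib releases the GIL while hashing buffers larger than 2047 bytes.
    return hashlib.sha256(chunk).hexdigest()


# Hypothetical workload: eight independent 1 MB buffers.
chunks = [bytes([i]) * 1_000_000 for i in range(8)]

# Run the chunks concurrently; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(heavy, chunks))

# The serial loop gives the same answers; only the wall-clock time may differ.
serial = [heavy(c) for c in chunks]
```

Whether the threaded version is actually faster depends on how much of the time is spent with the GIL released, so it is worth timing both on a realistic problem size.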

hyanwong · 2025-01-10T15:22:28Z

Should this be in the tskit docs or the tutorials? If the latter, I guess it would come under the "parallelisation" tutorial mooted at tskit-dev/tutorials#151 (comment)?

hanbin973 · 2025-01-10T17:59:19Z

The most straightforward mode of parallelization is splitting the job over windows. After splitting, one can add the results (or average them by some weight) to get the final result. genetic_relatedness_vector falls into this category.
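A rough sketch of the split-and-combine step, using only the standard library. `SITES`, `window_stat` and `make_windows` are hypothetical toy stand-ins, not tskit functions; in practice `window_stat` would be a statistic evaluated on one window. The toy statistic is span-normalised, so the per-window values are combined by a span-weighted average.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (position, value) pairs standing in for
# per-site contributions to a statistic.
SITES = [(5.0, 1.0), (12.0, 2.0), (37.5, 4.0), (60.0, 0.5), (99.0, 3.0)]
SEQUENCE_LENGTH = 100.0


def window_stat(left, right):
    # Span-normalised toy statistic over the half-open window [left, right).
    total = sum(value for pos, value in SITES if left <= pos < right)
    return total / (right - left)


def make_windows(length, num_windows):
    # Equal-width windows covering [0, length).
    edges = [length * i / num_windows for i in range(num_windows + 1)]
    return list(zip(edges[:-1], edges[1:]))


def windowed_stat(num_windows, num_threads):
    windows = make_windows(SEQUENCE_LENGTH, num_windows)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        values = list(pool.map(lambda w: window_stat(*w), windows))
    # Combine: weight each window's value by its span, then renormalise.
    weighted = sum(v * (r - l) for v, (l, r) in zip(values, windows))
    return weighted / SEQUENCE_LENGTH


combined = windowed_stat(num_windows=4, num_threads=2)
whole = window_stat(0.0, SEQUENCE_LENGTH)
```

The toy statistic is pure Python and does not release the GIL, so this only illustrates how the results are split and recombined, not any speed-up.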

I've done some profiling and found that this strategy carries a good amount of overhead from memory allocation, especially in large problems. This could be avoided if we could pass a preallocated array to the statistics functions and update it "in-place" via +=. This would require updating _tskitmodule.c to accept external arrays. The lower-level C functions already operate in place, so they wouldn't need much change. However, this might conflict with common practices in Python.
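To make the idea concrete, here is a sketch of what in-place accumulation could look like from the Python side. Everything here is hypothetical: `add_window_stat` stands in for a statistics function that accepts an output buffer, which is not part of the current API, and a plain list stands in for the array. Each worker owns one buffer and reuses it for all of its windows, so there is one allocation per thread rather than one per window, and no two threads write to the same buffer.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WEIGHTS = 3  # hypothetical number of weight columns


def add_window_stat(out, left, right):
    # Hypothetical in-place statistic: adds this window's contribution
    # to `out` instead of returning a freshly allocated result.
    for k in range(len(out)):
        out[k] += (k + 1) * (right - left)


def worker(windows):
    # One buffer per worker, allocated once and reused for every window.
    out = [0.0] * NUM_WEIGHTS
    for left, right in windows:
        add_window_stat(out, left, right)
    return out


def run(windows, num_threads):
    # Deal the windows out round-robin so each worker gets a similar share.
    shares = [windows[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(worker, shares))
    # Final reduction over num_threads buffers, not num_windows arrays.
    return [sum(column) for column in zip(*partials)]


windows = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0), (30.0, 40.0)]
result = run(windows, num_threads=2)
```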

Any thoughts?

Edit: this might not be a big deal after all, at least for genetic_relatedness_vector, because the bookkeeping variables that are initialized inside the C functions are far bigger than the result array.

jeromekelleher · 2025-01-10T19:15:14Z

Are you sure it's memory allocations here and not overhead associated with seeking along the sequence? I'd be surprised if malloc overhead was significant here.

hanbin973 · 2025-01-10T23:39:59Z

To answer your question: yes, malloc does matter. Here's the result with seq_length=1e7 and num_individuals=1e4, where the weight matrix has 100 dimensions.

However, I don't think it's necessary to change any of the API, because:

- The major malloc happens deeper in the C API and not in _tskitmodule.c, so my initial speculation was wrong. Changing this would require a lot of work.
- The problem, as far as I'm aware, is largely specific to genetic_relatedness_matrix because of the high-dimensional weights. In my particular application, this dimension can go up to tens of thousands. Most statistics won't require this many weights.

Each thread initializes two arrays of size num_weights * num_nodes, so the total allocation scales as num_weights * num_nodes * num_threads.
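For a sense of scale, the arithmetic can be written out as below. The helper `scratch_bytes` is hypothetical, the 8-byte element size assumes double-precision arrays, and the node count in the example is made up for illustration rather than taken from the profiling run. The factor of two for the two arrays per thread is counted explicitly.

```python
def scratch_bytes(num_weights, num_nodes, num_threads,
                  arrays_per_thread=2, itemsize=8):
    # Total bytes across all threads for the per-thread scratch arrays.
    # itemsize=8 assumes 64-bit floats (an assumption, not from the thread).
    elements = num_weights * num_nodes * arrays_per_thread * num_threads
    return elements * itemsize


# Hypothetical example: 100 weight columns, one million nodes, 8 threads.
total = scratch_bytes(num_weights=100, num_nodes=1_000_000, num_threads=8)
total_gb = total / 1e9  # 12.8 GB under these assumptions
```

With tens of thousands of weight columns instead of 100, the same formula grows by two orders of magnitude, which is why the problem is mostly confined to high-dimensional weights.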
