Documentation/Tutorial for multithreading #3078

Open
hanbin973 opened this issue Jan 10, 2025 · 4 comments

Comments

@hanbin973
Contributor

Continuing from #3077.

I think this link from numpy docs is a good starting point: https://numpy.org/doc/2.2/reference/random/multithreading.html

The bottom line is that one can run multiple computations concurrently with concurrent.futures.ThreadPoolExecutor, provided that the computation-heavy parts of the program release the GIL.
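
A minimal sketch of that pattern, using only the standard library. The `heavy` function and the chunk sizes are made up for illustration; `hashlib` is used as a stand-in for any C-level routine that releases the GIL (it does so for buffers larger than 2047 bytes), not as anything tskit-specific.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def heavy(chunk: bytes) -> str:
    # Stand-in for a computation-heavy call that releases the GIL.
    # hashlib releases the GIL while hashing large buffers.
    return hashlib.sha256(chunk).hexdigest()


# Eight independent 1 MB inputs (hypothetical workload).
chunks = [bytes([i]) * 1_000_000 for i in range(8)]

# Threads can overlap because the heavy part runs outside the GIL.
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the inputs.
    digests = list(pool.map(heavy, chunks))
```

If the function passed to the pool held the GIL for its whole duration, the threads would run one at a time and there would be no speedup.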

@hyanwong
Member

Should this be in the tskit docs or the tutorials? If the latter, I guess it would come under the "parallelisation" tutorial mooted at tskit-dev/tutorials#151 (comment)?

@hanbin973
Contributor Author

hanbin973 commented Jan 10, 2025

The most straightforward mode of parallelization is splitting the job over windows. After splitting, one can add the results (or average them by some weight) to get the final result. genetic_relatedness_vector falls into this category.
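
A rough sketch of the split-and-combine idea. `toy_statistic` is a hypothetical stand-in for a per-window statistic (it is not a tskit function); it is chosen so that per-window results add up exactly to the whole-sequence result.

```python
from concurrent.futures import ThreadPoolExecutor


def window_edges(sequence_length, num_windows):
    """Split [0, sequence_length) into equal-width (left, right) windows."""
    step = sequence_length / num_windows
    breaks = [i * step for i in range(num_windows)] + [sequence_length]
    return list(zip(breaks[:-1], breaks[1:]))


def toy_statistic(left, right):
    # Hypothetical additive statistic: the integral of (1, x) over the window.
    return [right - left, (right**2 - left**2) / 2]


def statistic_by_windows(sequence_length, num_windows, num_threads=4):
    windows = window_edges(sequence_length, num_windows)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # One task per window; each task is independent of the others.
        parts = list(pool.map(lambda w: toy_statistic(*w), windows))
    # Combine by adding the per-window vectors element by element.
    return [sum(column) for column in zip(*parts)]
```

For a statistic that is normalised per unit of sequence, the last step would be a span-weighted average instead of a plain sum.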

I've done some profiling and found that this strategy carries a good amount of overhead from memory allocation, especially in large problems. This could be avoided if we could pass a preallocated array to the statistics functions and update it in place via +=. That would require updating _tskitmodule.c to accept external arrays. The lower-level C functions already work in place, so they wouldn't need much change. However, this might conflict with common practices in Python.
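
To make the proposal concrete, here is a toy sketch of the two calling styles. Everything in it is hypothetical: `add_window_result` stands in for an in-place C kernel, and the optional `out` argument is an assumed interface, not something tskit currently offers.

```python
def add_window_result(out, left, right):
    """Hypothetical in-place kernel: add one window's contribution to `out`."""
    span = right - left
    for i in range(len(out)):
        out[i] += span * (i + 1)


def statistic_fresh(windows, size):
    # Current style: allocate a new result per window, then sum them.
    total = [0.0] * size
    for left, right in windows:
        part = [0.0] * size  # one allocation per window
        add_window_result(part, left, right)
        total = [t + p for t, p in zip(total, part)]
    return total


def statistic_in_place(windows, size, out=None):
    # Proposed style: the caller may pass a preallocated buffer to reuse.
    if out is None:
        out = [0.0] * size
    for left, right in windows:
        add_window_result(out, left, right)  # accumulates with +=
    return out
```

With several threads, each thread would need its own `out` buffer (or a lock), since concurrent += on a shared array is not safe.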

Any thoughts?

  • Edit: this might not be a big deal after all, at least for genetic_relatedness_vector, because the bookkeeping variables initialized inside the C functions are much bigger than the result array.

@jeromekelleher
Member

Are you sure it's memory allocation here and not the overhead of seeking along the sequence? I'd be surprised if malloc overhead was significant here.

@hanbin973
Contributor Author

hanbin973 commented Jan 10, 2025

To answer your question: yes, malloc does matter. Here's the profiling result for seq_length=1e7 and num_individuals=1e4, where the weight matrix is 100-dimensional.
[image: profiling result]

However, I think it's not necessary to change any of the API because

  1. The major malloc happens deeper in the C API and not in _tskitmodule.c, so my initial speculation was wrong. Changing it would require a lot of work.
  2. The problem, as far as I'm aware, is largely specific to genetic_relatedness_vector because of the high-dimensional weights. In my particular application, this dimension can go up to tens of thousands. Most statistics won't require this many weights.

Each thread initializes two arrays of size num_weights * num_nodes, for a total of 2 * num_weights * num_nodes * num_threads elements across all threads.
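
A back-of-the-envelope sketch of that footprint, counting both arrays per thread. The 8-byte element size assumes double-precision values, and the node and thread counts in the usage line are hypothetical; only the 100 weights come from the run above.

```python
def bookkeeping_bytes(num_weights, num_nodes, num_threads,
                      arrays_per_thread=2, itemsize=8):
    """Approximate memory held by per-thread bookkeeping arrays.

    itemsize=8 assumes 8-byte doubles (an assumption, not checked
    against the C source).
    """
    elements = arrays_per_thread * num_weights * num_nodes * num_threads
    return elements * itemsize


# Hypothetical sizes: 100 weights, 100,000 nodes, 8 threads.
total = bookkeeping_bytes(num_weights=100, num_nodes=100_000, num_threads=8)
print(f"{total / 1e9:.2f} GB")
```

Because the footprint scales linearly in num_weights, going from 100 weights to tens of thousands multiplies it by a factor of a few hundred.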
