
Optimization of CUDA kernel for multiple GPUs #342

Open
cyborgshead opened this issue Jun 25, 2019 · 1 comment
cyborgshead commented Jun 25, 2019

We have network limits linked to the speed of processing (the rank window) and to the size of the graph (the onboard memory of the GPU or GPUs).

Mainnet will launch with a fairly large rank calculation window (>=100 blocks) and a small amount of network bandwidth, which will give the community time to upgrade the kernel and give validators time to upgrade their hardware.

Right now, preparing the data before sending it to the GPU takes a long time, and we only have a single-GPU CUDA implementation of the PageRank algorithm.

My proposal is to start with a stand-alone optimized kernel and to redefine the data structures while researching performance and implementing a multi-GPU kernel. Then refactor the structures in cyberd and migrate to the new kernel.

References: #229

Note:

  1. Single host, single GPU
    <-----We are here----->
  2. Single host, multiple GPUs (x16 PCI Express)
  3. Multiple hosts, multiple GPUs
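
For stage 1, the single-GPU kernel computes damped PageRank iterations over the link graph. As a point of reference, here is a minimal CPU sketch of one such iteration over a CSR-encoded graph; the names and layout are illustrative only, not cyberd's actual structures:

```cpp
#include <vector>
#include <cstddef>

// One damped PageRank iteration over a CSR-encoded link graph.
// row_ptr[v]..row_ptr[v+1] indexes the inbound links of vertex v in col_idx.
// This is a CPU reference sketch; the CUDA kernel parallelizes the outer loop.
std::vector<double> pagerank_step(const std::vector<std::size_t>& row_ptr,
                                  const std::vector<std::size_t>& col_idx,
                                  const std::vector<std::size_t>& out_degree,
                                  const std::vector<double>& rank,
                                  double damping = 0.85) {
    const std::size_t n = rank.size();
    std::vector<double> next(n, (1.0 - damping) / n);  // teleport term
    for (std::size_t v = 0; v < n; ++v) {
        double sum = 0.0;
        for (std::size_t i = row_ptr[v]; i < row_ptr[v + 1]; ++i) {
            std::size_t u = col_idx[i];        // u links to v
            sum += rank[u] / out_degree[u];    // u's rank split over its out-links
        }
        next[v] += damping * sum;
    }
    return next;
}
```

The point of fixing a reference like this before refactoring is that any redefined data structure (for one GPU or many) can be validated against the same per-iteration output.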
serejandmyself commented Oct 25, 2019

I rewrote your task slightly, as I understand it. It would be good if you could correct me if I got it wrong.

In any case, I also added some possibly useful links to research and articles:

Current situation:

  • Rank is calculated at a rate of 100 ms
  • Data is uploaded in 3 seconds
  • This means a maximum of ~3 million CIDs per calculation
  • A 2D array is used for the GPU calculation

Problem:

  • This is a network limit due to processing speed
  • Preparing the data before sending it to the GPU takes too much time, and there is only a single-GPU CUDA implementation of the PageRank algorithm

Task:

  • Implement a CUDA kernel that can calculate rank on a single host but across multiple GPUs, or
  • Even more desirable: multiple hosts, multiple GPUs

Desired outcome:

  • To shard the knowledge graph across a number of cards on one machine (or a number of cards on multiple machines)
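
One natural starting point for sharding is a contiguous partition of the vertex set, with each GPU owning one slice of the rank vector and exchanging remote ranks between iterations. A host-side sketch of the split arithmetic (purely illustrative; device allocation and transfer omitted):

```cpp
#include <vector>
#include <cstddef>
#include <utility>

// Even, contiguous partition of n vertices across g devices; the first
// n % g shards get one extra vertex. Each GPU would own one [begin, end)
// slice of the rank vector. Illustrative host-side arithmetic only.
std::vector<std::pair<std::size_t, std::size_t>>
shard_vertices(std::size_t n, std::size_t g) {
    std::vector<std::pair<std::size_t, std::size_t>> shards;
    std::size_t base = n / g, extra = n % g, begin = 0;
    for (std::size_t d = 0; d < g; ++d) {
        std::size_t end = begin + base + (d < extra ? 1 : 0);
        shards.emplace_back(begin, end);
        begin = end;
    }
    return shards;
}
```

A contiguous split keeps each shard's CSR rows and rank slice adjacent in memory, which simplifies the per-device upload; balancing by edge count rather than vertex count would be a likely refinement for a skewed knowledge graph.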

Steps to solve:

  • Research the current possibilities of CUDA (current and near future in order to avoid obsolete implementations)
  • Understand the efficiency of CUDA kernels
  • Understand how the implementation fits into the core
  • Describe an in-detail process of implementation (what will it break / what will have to be fixed)
  • Integrate into cyberd

What might be needed (?):

  • An algorithm developer
  • A new kernel that will calculate the Merkle tree for all rank values
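
Such a kernel would fold the rank values into a single Merkle root. A host-side sketch of the tree reduction, with a placeholder 64-bit hash combine standing in for whatever cryptographic hash (e.g. SHA-256) the real implementation would use:

```cpp
#include <vector>
#include <cstdint>
#include <functional>

// Placeholder pairwise hash combine; a real kernel would use a
// cryptographic hash over the serialized child nodes.
std::uint64_t combine(std::uint64_t l, std::uint64_t r) {
    return std::hash<std::uint64_t>{}(
        l ^ (r + 0x9e3779b97f4a7c15ULL + (l << 6) + (l >> 2)));
}

// Merkle root over a vector of leaf hashes: pair left-to-right level by
// level, promoting an odd last node unchanged, until one node remains.
std::uint64_t merkle_root(std::vector<std::uint64_t> level) {
    if (level.empty()) return 0;
    while (level.size() > 1) {
        std::vector<std::uint64_t> next;
        for (std::size_t i = 0; i + 1 < level.size(); i += 2)
            next.push_back(combine(level[i], level[i + 1]));
        if (level.size() % 2 == 1) next.push_back(level.back());
        level.swap(next);
    }
    return level[0];
}
```

Each tree level is embarrassingly parallel (every pair hashes independently), which is what makes this reduction a good fit for a GPU kernel alongside the rank computation itself.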

Some articles and research that might be useful (not all may be):

@cyborgshead cyborgshead unpinned this issue Nov 26, 2019