
hh: Improve our streaming throughput #49

Closed
cirocosta opened this issue May 13, 2019 · 7 comments
@cirocosta
Member

Hey,


I've been noticing that even though we recently got better utilization of our worker nodes when it comes to system CPU usage (see "Why our system CPU usage is so high?"), our throughput when streaming from one worker to another is still pretty disappointing.


(graph: regular workload)

ps.: the bits going from worker to worker are compressed (gzip) - still, there's a lot of bandwidth left to be utilized

ps.: in the image above, the thin pink line at 400Mbps is a web instance piping data from one worker to another (thus, 200 + 200 Mbps)


In order to test the theoretical maximum throughput that we could get on GCP, I started experimenting with stressing some of the edges involved in streaming.

Here are some results.



Machines used

In the following tests, the same machine type was utilized (2 instances) with varying disk configurations:

  • 16 vCPU Intel(R) Xeon(R) CPU @ 2.20GHz
  • 32 GB Mem
  • Ubuntu 16.04 / 4.15.0-1026-gcp

Streaming directly from one node to the other without any disk IO

In this scenario, all we need to do is make the receiving side discard all of the input it receives.

We can do that by taking everything that comes from the listening socket and writing it straight to /dev/null.
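For reference, here's a minimal sketch (not the actual test harness) of what such a receiver could look like in Go; the port and the single-connection handling are arbitrary choices for illustration:

```go
// Minimal sketch of the receiving side used in this test: accept a single
// TCP connection and write every byte to /dev/null so that no disk IO gets
// in the way of the network path.
package main

import (
	"io"
	"log"
	"net"
	"os"
)

func main() {
	devNull, err := os.OpenFile("/dev/null", os.O_WRONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer devNull.Close()

	ln, err := net.Listen("tcp", ":9999") // arbitrary port for the experiment
	if err != nil {
		log.Fatal(err)
	}

	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// io.Copy keeps reading from the socket and writing to /dev/null until EOF,
	// so the only limits in play are CPU and network.
	n, err := io.Copy(devNull, conn)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("received %d bytes", n)
}
```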

What can be observed in this case is that GCP indeed delivers what it promises: a consistent ~10Gbps of network throughput.

We're able to get a consistent worker-to-worker bandwidth of 10Gbps without much effort:

Receiver: (graph)

Transmitter: (graph)

With the results above, we can then establish that without any disk IO, we can get the promised 10Gbps of network throughput.


Streaming from one node to another, writing to disk when receiving

Now, changing the receiver side to write to a disk that gets attached to the node, we're able to see how the story changes.
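The receiver sketch stays the same as in the previous section, except that the sink becomes a file on the attached disk (the mount point below is hypothetical):

```go
// Same receiver as the /dev/null sketch above, but sinking the stream into a
// file on the attached persistent disk instead.
package main

import (
	"io"
	"log"
	"net"
	"os"
)

func main() {
	// Hypothetical mount point for the persistent disk attached to the node.
	out, err := os.Create("/mnt/disks/pd/received.dat")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	ln, err := net.Listen("tcp", ":9999")
	if err != nil {
		log.Fatal(err)
	}

	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// The socket can only be drained as fast as the disk accepts writes, so
	// disk throttling shows up directly as lower network throughput.
	if _, err := io.Copy(out, conn); err != nil {
		log.Fatal(err)
	}
}
```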

1TB non-SSD persistent disk

GCP promise

Operation type                      Read    Write
Sustained random IOPS limit         750     1,500
Sustained throughput limit (MB/s)   120     120

ps.: 120 MB/s ~ 960Mbps

Actual:

  • 460 IOPS
  • 1Gbps (~125 MB/s)

Looking at the results, it seems like we're hitting the bytes-per-second upper limit:

Now, we can clearly see:

  • throttling happening at ~460 IOPS
  • 1Gbps of network throughput, limited by the ~1Gbps (120 MB/s) disk throughput limit
  • iowait times pretty high due to throttling

1TB SSD persistent disk

GCP Promise:

Operation type                      Read     Write
Sustained random IOPS limit         25,000   1,500
Sustained throughput limit (MB/s)   480      480

ps.: 480 MB/s ~ 3.8Gbps

Actual:

  • 1560 IOPS
  • 3.3Gbps (~415 MB/s)

Now, instead of being throttled at the bytes upper limit, we're throttled at the IOPS limit, even though the byte throughput ends up quite close to its limit anyway.


Extra - Local SSDs

I was curious about how much performance we could get using Local SSDs.

Without any particular tuning applied to the filesystems on them, I went forward with three configs, all using the same machine type.

1 Local SSD (375GB)

3 Local SSDs (3 x 375GB) on Software RAID 0 EXT4

3 Local SSDs (3 x 375GB) on Software RAID 0 XFS

Next steps

It's interesting to see that the software-based RAID didn't help much, and that the default config for a 1TB pd-ssd was the setup that achieved the highest throughput.

Could that be a flaw in the way that I'm testing these setups? Very likely!

With a baseline for the best throughput we can get out of disk IO combined with network throughput, we can go ahead with more testing: either adding alternative compression algorithms, or perhaps being more aggressive with streaming by doing it with more concurrency (it's currently done serially).
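As a rough illustration of the concurrency idea, here's a hypothetical sketch using golang.org/x/sync/errgroup; streamArtifact and the artifact names are made up for illustration, and this is not how Concourse is actually wired:

```go
// Hypothetical sketch: instead of streaming artifacts to the destination
// worker one after another, stream several at once and let them share the
// available bandwidth.
package main

import (
	"context"
	"fmt"
	"io"
	"strings"

	"golang.org/x/sync/errgroup"
)

// streamArtifact stands in for "open a source stream, push it to the
// destination worker"; here it just copies a reader into a discard sink.
func streamArtifact(ctx context.Context, name string, src io.Reader) error {
	if _, err := io.Copy(io.Discard, src); err != nil {
		return fmt.Errorf("streaming %s: %w", name, err)
	}
	return nil
}

func main() {
	artifacts := map[string]io.Reader{
		"artifact-a": strings.NewReader("..."),
		"artifact-b": strings.NewReader("..."),
		"artifact-c": strings.NewReader("..."),
	}

	g, ctx := errgroup.WithContext(context.Background())
	for name, src := range artifacts {
		name, src := name, src // capture loop variables
		g.Go(func() error {
			return streamArtifact(ctx, name, src)
		})
	}

	// Wait returns the first error (if any) once every stream has finished.
	if err := g.Wait(); err != nil {
		fmt.Println("streaming failed:", err)
	}
}
```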

Thanks!

@jchesterpivotal

or perhaps being more aggressive with streaming by doing it with more concurrency (it's currently done serially)

I feel like this will be the biggest win in the long run.

However, zstd looks like a quick win because (I assume) it won't require a lot of changes.

@kcmannem
Member

We're looking to have parallel streaming land along with zstd; both are quick wins, but gzip uses too much CPU, which could overload the workers if we just did parallel streaming on its own.
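To illustrate, a rough sketch of what streaming zstd compression could look like using github.com/klauspost/compress/zstd (the library choice and the wiring here are assumptions, not the actual change):

```go
// Sketch of swapping gzip for streaming zstd: wrap the outgoing writer in a
// zstd encoder so compression happens on the fly as bytes are streamed.
package main

import (
	"io"
	"log"
	"os"

	"github.com/klauspost/compress/zstd"
)

// compressTo wraps dst in a zstd encoder and streams src through it.
func compressTo(dst io.Writer, src io.Reader) error {
	enc, err := zstd.NewWriter(dst, zstd.WithEncoderLevel(zstd.SpeedFastest))
	if err != nil {
		return err
	}
	if _, err := io.Copy(enc, src); err != nil {
		enc.Close()
		return err
	}
	// Close flushes the final zstd frame.
	return enc.Close()
}

func main() {
	// Demo: compress stdin to stdout, e.g. `go run main.go < big.tar > big.tar.zst`.
	if err := compressTo(os.Stdout, os.Stdin); err != nil {
		log.Fatal(err)
	}
}
```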

@kcmannem
Member

kcmannem commented May 13, 2019

The fact that raid0 doesn't seem to be helping is really strange to me. I don't know what "software raid" is, but the fact that you have double the physical wires able to push data through should make plain writes faster. We should do the same test with RAID and measure the write speeds without Concourse. If we meet the GCP promise, it's highly likely that our streaming really is the big bottleneck.

Edit: Another confusing aspect to this is that we don't do JIT compression; we compress all at once and then push the data.

@vito
Member

vito commented May 14, 2019

Edit: Another confusing aspect to this is that we don't do JIT compression; we compress all at once and then push the data.

Where do you see that? 🤔 I thought it was all streaming. We shell out to tar but that should stream too.
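For illustration, shelling out to tar can stream like this (a sketch; the path and flags are just examples, not the exact invocation Concourse uses):

```go
// Sketch of why shelling out to tar doesn't preclude streaming: tar's stdout
// is piped straight into the outgoing writer (stdout here stands in for the
// connection to the destination worker), so bytes flow as tar produces them
// rather than being buffered into a temporary file first.
package main

import (
	"io"
	"log"
	"os"
	"os/exec"
)

func main() {
	// Hypothetical volume path; "-" writes the archive to stdout.
	cmd := exec.Command("tar", "-cf", "-", "-C", "/some/volume", ".")

	out, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Copy tar's output as it is produced; in the real setup this writer
	// would be the stream to the other worker (optionally wrapped in a
	// compressor).
	if _, err := io.Copy(os.Stdout, out); err != nil {
		log.Fatal(err)
	}
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
}
```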

@kcmannem
Member

@vito ohhhh, that's why we shell out; I thought it was to avoid writing to a temporary file. My other comment is invalid; it totally makes sense why we're seeing this bottleneck then. It doesn't matter if we RAID or have a fast network if we can't push bytes fast enough.

@cirocosta
Member Author

Update:

🎊

@jchesterpivotal

Are one or both of these likely to land in 5.4?
