BP-65: implement load balance for select bookie #4247
Comments
Here is some data collected from the production environment after we launched this feature. The P99 write latency dropped significantly, from peaks of ten to twenty seconds or more to less than one second. The peak standard deviation of journal disk write throughput decreased from 25 MB/s to 21 MB/s. The peak write throughput range decreased from 150 MB/s to 130 MB/s. The top 10 journal disk write throughput also decreased significantly. Could you help review this BP? @shoothzj @wenbingshen @eolivelli @dlg99 @jiazhai
This is a great idea +1
I like this idea. I'm reviewing this proposal.
Can you please start the discussion on the dev mailing list?
Done, see https://lists.apache.org/thread/ltdls9r55h70mdxc0z8k5qzvzw48nb6d
BP
This is the master ticket for tracking BP-65:
Proposal PR - #4246
Motivation
One of our clusters has 255 bookies, and we find that write pressure across the bookies is very unbalanced.
Usually several bookies have very high write latency, which in turn causes high message publish latency in the Pulsar broker.
Currently, bookies have a quarantine mechanism to deal with this case.
However, this mechanism is not good enough to prevent high bookie write latency.
Proposed Changes
To solve this write pressure problem, we propose introducing a bookie load balance mechanism that supplements the current quarantine mechanism.
When we choose the ensemble for a ledger, if we have load information for all bookies, we can prefer low-load bookies for the ensemble.
We can also avoid writing to high-load bookies, since doing so makes them perform worse and causes high write latency.
Therefore, with the bookie load information, we can better prevent the high write latency problem from occurring.
We notice that BookKeeper already has a DiskWeightBasedPlacement mechanism, which is similar to load balancing. We just need to enhance this mechanism and replace it with LoadWeightBasedPlacement.
The proposed changes involve:
BaseMetricMonitor
BaseMetricMonitor would now collect multiple load metrics: journal IOUtil, ledger IOUtil, bookie write bytes per second, CPU usage, free disk space, and total disk space.
We can then define the bookie load pressure from these metrics.
In our cluster, bookie load pressure is mainly driven by journal IOUtil, because we use HDDs as journal disks and one journal disk serves 3 bookies.
BaseMetricMonitor would collect the metrics once per second by default, but we find that some metrics jitter a lot.
So it is necessary to smooth the collected metrics by averaging them over a 3-second window.
The window size can be changed with the config baseMetricMonitorMetricSlideWindowSize.
If a bookie has multiple disks, we take the average across them.
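The smoothing described above can be sketched as a small sliding-window average. This is only an illustration under assumptions: the class `SlidingWindowAverage` and its methods are hypothetical names, not part of BookKeeper, and the window size simply stands in for the value of baseMetricMonitorMetricSlideWindowSize.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not a BookKeeper class): averages the last N samples of one metric. */
class SlidingWindowAverage {
    private final int windowSize;
    private final Deque<Double> samples = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingWindowAverage(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Records one sample (e.g. one per second) and returns the average over the current window. */
    double add(double sample) {
        samples.addLast(sample);
        sum += sample;
        if (samples.size() > windowSize) {
            // Drop the oldest sample once the window is full.
            sum -= samples.removeFirst();
        }
        return sum / samples.size();
    }

    /** Averages the per-disk values of a bookie that has several disks. */
    static double averageAcrossDisks(double... perDisk) {
        if (perDisk.length == 0) {
            throw new IllegalArgumentException("at least one disk value is required");
        }
        double total = 0.0;
        for (double v : perDisk) {
            total += v;
        }
        return total / perDisk.length;
    }
}
```

With a window of 3, a single spike is spread over three readings instead of dominating one of them, which is the effect the proposal is after for jittery metrics such as journal IOUtil.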
modification of getBookieInfo restApi
The bookie client continues to use the getBookieInfo restApi to acquire load information from each bookie.
That means that if we enable LoadWeightBasedPlacement, the restApi response would contain more information.
GetBookieInfoResponse in BookkeeperProtocol would be changed.
If we disable LoadWeightBasedPlacement, or the restApi call fails because of a timeout or an exception, the load information would be -1.
We have also tested the load this restApi puts on the cluster. For example, in our cluster, with more than 20 brokers and more than 200 bookies, the load from the restApi is still acceptable.
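The -1 convention above could be handled on the client side roughly as follows. This is a minimal sketch under assumptions: `LoadInfoFetcher`, `loadOrUnknown`, and `isKnown` are hypothetical names, and the `Supplier` stands in for the real getBookieInfo call, whose actual signature is not shown in this proposal.

```java
import java.util.function.Supplier;

/** Hypothetical sketch (not BookKeeper API): maps "disabled" and "request failed" to -1. */
class LoadInfoFetcher {
    static final double UNKNOWN_LOAD = -1.0;

    /**
     * Returns the load reported by the request, or -1 when the feature is
     * disabled or the request throws (for example on a timeout).
     */
    static double loadOrUnknown(boolean loadPlacementEnabled, Supplier<Double> request) {
        if (!loadPlacementEnabled) {
            return UNKNOWN_LOAD;
        }
        try {
            Double value = request.get();
            return value == null ? UNKNOWN_LOAD : value;
        } catch (RuntimeException e) {
            // Timeouts and other failures are treated as "no load information".
            return UNKNOWN_LOAD;
        }
    }

    /** A load value is usable for placement only when it is not the -1 sentinel. */
    static boolean isKnown(double load) {
        return load >= 0;
    }
}
```

Keeping the sentinel check in one place makes it harder for a placement strategy to mistake -1 for a very low load.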
modification of RackawareEnsemblePlacementPolicyImpl
Implement a new strategy to select bookies for LoadWeightBasedPlacement.
The target is :
So the designed strategy is :
To avoid the corner case where too many bookies are filtered out, we add the config lowLoadBookieRatio. By default, if more than half of the bookies are filtered out, we fall back to random selection.
Notice that many bookie clients perform ensemble selection independently, so the selection probabilities of the bookies should not differ too much, or writes would skew toward a few bookies.
So we apply a probability smoothing step in the roulette wheel selection.
Furthermore, different users can implement their own selection strategy based on their production environment.
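One possible shape of such a strategy is sketched below. It is not the implementation from the proposal PR: the class `LoadWeightedSelector`, the `loadThreshold` filter, the weight formula `(1 - load) + smoothing`, and the reading of lowLoadBookieRatio as "minimum fraction of bookies that must survive filtering" are all assumptions made for illustration, and loads are assumed to be normalized to [0, 1] with -1 meaning unknown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch; none of these names exist in BookKeeper. */
class LoadWeightedSelector {

    /** Keeps bookies whose load is known (>= 0) and not above the threshold. */
    static List<String> filterLowLoad(Map<String, Double> loads, double loadThreshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Double> e : loads.entrySet()) {
            double load = e.getValue();
            if (load >= 0 && load <= loadThreshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    /** Picks one bookie: roulette wheel over low-load bookies, or uniform random as fallback. */
    static String pick(Map<String, Double> loads, double loadThreshold,
                       double lowLoadBookieRatio, double smoothing, Random rnd) {
        if (loads.isEmpty()) {
            throw new IllegalArgumentException("no bookies to choose from");
        }
        List<String> kept = filterLowLoad(loads, loadThreshold);
        // Fallback: too many bookies were filtered out, so choose uniformly among all.
        if (kept.isEmpty() || kept.size() < loads.size() * lowLoadBookieRatio) {
            List<String> all = new ArrayList<>(loads.keySet());
            return all.get(rnd.nextInt(all.size()));
        }
        // Lower load gives a higher weight; the added constant narrows the gap
        // between the most and least likely bookie (assumed smoothing scheme).
        double[] weights = new double[kept.size()];
        double total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            weights[i] = (1.0 - Math.min(1.0, loads.get(kept.get(i)))) + smoothing;
            total += weights[i];
        }
        // Roulette wheel: walk the cumulative weights until the random point is passed.
        double r = rnd.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0) {
                return kept.get(i);
            }
        }
        return kept.get(kept.size() - 1);
    }
}
```

A larger smoothing constant pushes the selection toward uniform, which limits how strongly many independent clients converge on the same few low-load bookies.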
compatibility
LoadWeightBasedPlacement is a supplemental feature: ledger replication must first obey the RackAwarePolicy/RegionAwarePolicy, and only then try to obey LoadWeightBasedPlacement. The feature can be disabled by configuration.
Because the GetBookieInfo protocol has been changed, this restApi would return an error if the bookie-server and bookie-client versions differ. But since this restApi is used only when diskWeightBasedPlacement is enabled, I think this is not a problem for most people, who do not enable diskWeightBasedPlacement in the client.
Performance
We have applied LoadWeightBasedPlacement to our production clusters, and the high write latency problem no longer occurs.