-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Beat [1/4]: handle sweeper's broadcast error #8893
Beat [1/4]: handle sweeper's broadcast error #8893
Conversation
Important Review skippedAuto reviews are limited to specific labels. 🏷️ Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! Did a quick drive-by review to load up on context. Not super familiar with the new sweeper system, so best not to count this review at all. I'm mostly interested in the block beat implementation itself (which seems to be part 2 of 3).
263006e
to
4630d4f
Compare
a0393a0
to
9d6f8e7
Compare
388981d
to
adca500
Compare
9d6f8e7
to
7183080
Compare
adca500
to
c5e6b49
Compare
7183080
to
8b95551
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
high level pass looks good! I need to spend a bit more time getting familiar with sweeper stuff on the next pass though
1e898e2
to
4548971
Compare
8b95551
to
9ec032a
Compare
e0ec893
to
9180762
Compare
5113dc5
to
b0c3e2c
Compare
b0c3e2c
to
9a213eb
Compare
9a213eb
to
91b8962
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last couple of questions (thanks for the sweeper readme doc!! super helpful for loading context for this).
I think ready to ACK on next round 👍
// TxFatal is sent when the inputs in this tx cannot be retried. Txns | ||
// will end up in this state if they have encountered a non-fee related | ||
// error, which means they cannot be retried with increased budget. | ||
TxFatal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wonder if we should rename TxFailed
to something like TxFeeFailed
or TxInputsFailed
or TxFailedRetriable
or something to make it clear what the difference between that and this is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also just checking my understanding: we mark a tx as this and hence all its inputs as terminal meaning we wont retry any of those inputs even if they have a deadline?
should we not re-submit those inputs so they are regrouped (or potentially handled individually)? or is that the plan in future ? just wondering if one bad input will result in us dropping a bunch of inputs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree the long term goal should be to make sure we know the input which is causing the problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we mark a tx as this and hence all its inputs as terminal meaning we wont retry any of those inputs even if they have a deadline?
yeah correct, so this one bad input will affect all other inputs, which is not great. There's a TODO for that (the 6th item out of ten in #8680, and we are at one🤦🏻), the idea is to use testmempoolaccept
to do a binary search on which input is causing the failure then kick it out - but i guess we need to think about how to handle it in neutrino as there's no mempool.
sweep/fee_bumper.go
Outdated
// Every result must have a tx except the error or fail case. | ||
if b.Tx == nil && b.Event != TxFatal { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just checking fatal case in this commit but comment says error or fail case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok i see it is fixed in upcoming commit - so nbd. but maybe "failed or fatal" makes more sense in the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed the comment
sweep/fee_bumper.go
Outdated
// If it's a failed or fatal event, it must have an error. | ||
if b.Err == nil { | ||
if b.Event == TxFailed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels like this check should be the other way around?
// If it's a failed or fatal event, it must have an error.
if b.Event == TxFatal || b.Event == TxFailed {
if b.Err == nil {
return fmt.Errorf("%w: nil error", ErrInvalidBumpResult)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
sweep/sweeper.go
Outdated
@@ -1752,8 +1802,10 @@ func (s *UtxoSweeper) handleBumpEvent(r *bumpResp) error { | |||
case TxReplaced: | |||
return s.handleBumpEventTxReplaced(r) | |||
|
|||
// There's a fatal error in creating or publishing the tx, we will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just "creating" and not "or publishing" right? cause we only ever set the Event=TxErr in handleInitialBroadcast
afaict (at least in this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handleInitialBroadcast
does also broadcast it, but technically it should fail prior when calling CheckMempoolAcceptance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just "creating" and not "or publishing" right? cause we only ever set the Event=TxErr in
Correct, updated the comments - atm we only mark it as fatal in the initial tx creation.
// height as the starting height. | ||
matured, locktime := pi.isMature(uint32(s.currentHeight)) | ||
if !matured { | ||
defaultDeadline = int32(locktime + s.cfg.NoDeadlineConfTarget) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to apply s.cfg.NoDeadlineConfTarget
to this input? cause isnt that only for inputs with no deadline meaning those will always be mature and wont actually end up here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deadline height is also the width of the feefunction so we need to set it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah suppose the locktime is 40, and the current height is 1000, it means the input cannot be swept until 1040. Then at block height 1040, the input will be swept, and the fee func will need to have a deadline delta, which is default to this config value if not set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ship it 👌
sweep/fee_bumper.go
Outdated
|
||
// Remove the record from the maps if there's an error or the tx is | ||
// confirmed. When there's an error, it means this tx has failed its | ||
// broadcast and cannot be retried. There are two cases it may fail, | ||
// - when the budget cannot cover the fee. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we make sure before attempting the sweep that the budget is always covered ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the comments - notice this means the budget is used up.
// NOTE: part of the InputSet interface. | ||
func (b *BudgetInputSet) Immediate() bool { | ||
for _, inp := range b.inputs { | ||
// As long as one of the inputs is immediate, the whole set is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should not group immediate sweeps with the rest, because feels like immediate are always higher priority and we don'T want other inputs to interfere ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good q - think we need to break down the flag immediate
more - the only way to express priority in the new sweeper is to use the deadline and we need to stick to it. Atm the immediate
flag is more like broadcast without waiting for the next block
flag. If we wanna change the priority it should be done via changing the deadline.
if req.Immediate { | ||
t.handleInitialBroadcast(record, requestID) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so there is no guarantee that the inputs will already be registered in the inputset, before the block comes in so there exist the worst case where we need to wait another block ?
func (t *TxPublisher) Broadcast(req *BumpRequest) (<-chan *BumpResult, error) { | ||
log.Tracef("Received broadcast request: %s", lnutils.SpewLogClosure( | ||
req)) | ||
func (t *TxPublisher) Broadcast(req *BumpRequest) <-chan *BumpResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes definitely preexisting, just a nit, feel free to ignore :func (t *TxPublisher) broadcast(requestID uint64) (*BumpResult, error) {
// has been reached or not. The locktime found is also returned. | ||
func (p *SweeperInput) isMature(currentHeight uint32) (bool, uint32) { | ||
locktime, _ := p.RequiredLockTime() | ||
if currentHeight < locktime { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Maybe a bit offtopic, but if no locktime is required will the final sweep tx have at least a locktime of the currentBlockHeight, I read somewhere is kind of beneficial to prevent miners from some reordering of transactions iirc ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it prevents fee sniping - i think we always set that field when creating the tx (supposed there's no CLTV).
// height as the starting height. | ||
matured, locktime := pi.isMature(uint32(s.currentHeight)) | ||
if !matured { | ||
defaultDeadline = int32(locktime + s.cfg.NoDeadlineConfTarget) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deadline height is also the width of the feefunction so we need to set it.
// means they will be removed from the sweeper and never be tried | ||
// again. | ||
// | ||
// TODO(yy): Find out which input is causing the failure and fail that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very important todo, I would say it even makes sense to open a tracking issue wdyt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tho I think my next PR in the sweeper series would be removing anchor sweeping first, which is more urgent as we rarely see broadcast error.
@@ -1665,6 +1665,14 @@ func (s *UtxoSweeper) monitorFeeBumpResult(set InputSet, | |||
// in sweeper and rely solely on this event to mark | |||
// inputs as Swept? | |||
if r.Event == TxConfirmed || r.Event == TxFailed { | |||
// Exit if the tx is failed to be created. | |||
if r.Tx == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this an error with the even is TxConfirmed but no Tx is found ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah and it will be caught when we receive the bump event and validate it.
@@ -1690,7 +1698,10 @@ func (s *UtxoSweeper) handleBumpEventTxFailed(resp *bumpResp) { | |||
r := resp.result | |||
tx, err := r.Tx, r.Err | |||
|
|||
log.Warnf("Fee bump attempt failed for tx=%v: %v", tx.TxHash(), err) | |||
if tx != nil { | |||
log.Warnf("Fee bump attempt failed for tx=%v: %v", tx.TxHash(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add in the log warning => regrouping inputs for this bump ?
Also updated the loggings. This new state will be used in the following commit.
This prepares the following commit where we now let the fee bumpr decides whether to broadcast immediately or not.
This commit changes how inputs are handled upon receiving a bump result. Previously the inputs are taken from the `BumpResult.Tx`, which is now instead being handled locally as we will remember the input set when sending the bump request, and handle this input set when a result is received.
This commit adds a new method `handleInitialBroadcast` to handle the initial broadcast. Previously we'd broadcast immediately inside `Broadcast`, which soon will not work after the `blockbeat` is implemented as the action to publish is now always triggered by a new block. Meanwhile, we still keep the option to bypass the block trigger so users can broadcast immediately by setting `Immediate` to true.
Previously in `markInputFailed`, we'd remove all inputs under the same group via `removeExclusiveGroup`. This is wrong as when the current sweep fails for this input, it shouldn't affect other inputs.
Also updated `handlePendingSweepsReq` to skip immature inputs so the returned results are the same as those in pre-0.18.0.
With the combination of the following commit we can have a more granular control over the bump result when handling it in the sweeper.
After previous commit, it should be clear that the tx may be failed to created in a `TxFailed` event. We now make sure to catch it to avoid panic.
91b8962
to
9333ba4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🚢 🫀
0dc5191
into
lightningnetwork:yy-feature-blockbeat
NOTE: itest is fixed in the final PR #9227.
Depends on
TestChannelLinkCancelFullCommitment
#9221This PR prepares the incoming blockbeat PRs, the changes are,
BlockEpoch
now has the block data instead of the block header. This block data is used in the incoming blockbeat PR to query spending transactions.TxError
. Inputs resulting in this state will be removed from the sweeper.