Beat [1/4]: handle sweeper's broadcast error #8893

Conversation

yyforyongyu
Member

@yyforyongyu yyforyongyu commented Jul 4, 2024

NOTE: itest is fixed in the final PR #9227.

Depends on

This PR prepares for the upcoming blockbeat PRs. The changes are:

  • BlockEpoch now has the block data instead of the block header. This block data is used in the incoming blockbeat PR to query spending transactions.
  • Added a new sweeping state, TxError. Inputs resulting in this state will be removed from the sweeper.
  • Check each input's CSV and CLTV in the sweeper, and skip sweeping inputs that have not matured.
  • When calculating the deadline, make sure it's derived from mature height, not current height.
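
As a rough illustration of the maturity check described in the list above, the following sketch splits inputs into those that can be swept now and those that must wait. The `input` type, its fields, and `splitByMaturity` are hypothetical stand-ins, not lnd's actual types, and the height comparison ignores any consensus-level off-by-one details.

```go
package main

import "fmt"

// input is a hypothetical, simplified stand-in for a sweeper input.
type input struct {
	name string
	// cltv is the absolute height the input requires, 0 if none.
	cltv uint32
	// csv is the relative delay in blocks, 0 if none.
	csv uint32
	// confHeight is the height at which the spent output confirmed.
	confHeight uint32
}

// matureHeight returns the first height at which this sketch treats the
// input as sweepable: the larger of the CLTV height and the CSV expiry.
func (i input) matureHeight() uint32 {
	h := i.cltv
	if i.csv > 0 && i.confHeight+i.csv > h {
		h = i.confHeight + i.csv
	}
	return h
}

// splitByMaturity separates inputs that can be swept at currentHeight from
// those that must wait for a later block.
func splitByMaturity(inputs []input, currentHeight uint32) (mature, waiting []input) {
	for _, in := range inputs {
		if currentHeight < in.matureHeight() {
			waiting = append(waiting, in)
			continue
		}
		mature = append(mature, in)
	}
	return mature, waiting
}

func main() {
	inputs := []input{
		{name: "to_local", csv: 144, confHeight: 900},
		{name: "htlc_timeout", cltv: 1000},
		{name: "anchor"},
	}
	mature, waiting := splitByMaturity(inputs, 1000)
	fmt.Println("mature:", len(mature), "waiting:", len(waiting))
}
```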

@yyforyongyu yyforyongyu added this to the v0.18.2 milestone Jul 4, 2024
Contributor

coderabbitai bot commented Jul 4, 2024

Important

Review skipped

Auto reviews are limited to specific labels.

🏷️ Labels to auto review (1)
  • llm-review

@yyforyongyu yyforyongyu mentioned this pull request Jul 4, 2024
2 tasks
Collaborator

@guggero guggero left a comment

Very nice! Did a quick drive-by review to load up on context. Not super familiar with the new sweeper system, so best not to count this review at all. I'm mostly interested in the block beat implementation itself (which seems to be part 2 of 3).

chainntnfs/best_block_view_test.go (outdated, resolved)
sweep/sweeper.go (outdated, resolved)
@yyforyongyu yyforyongyu force-pushed the yy-itest-miner branch 2 times, most recently from 263006e to 4630d4f Compare July 11, 2024 09:39
@yyforyongyu yyforyongyu force-pushed the yy-prepare-blockbeat branch from a0393a0 to 9d6f8e7 Compare July 11, 2024 09:45
@saubyk saubyk requested a review from ellemouton July 11, 2024 15:17
@saubyk saubyk added the P1 MUST be fixed or reviewed label Jul 15, 2024
@yyforyongyu yyforyongyu force-pushed the yy-itest-miner branch 2 times, most recently from 388981d to adca500 Compare July 15, 2024 18:02
@yyforyongyu yyforyongyu force-pushed the yy-prepare-blockbeat branch from 9d6f8e7 to 7183080 Compare July 15, 2024 18:04
@yyforyongyu yyforyongyu force-pushed the yy-prepare-blockbeat branch from 7183080 to 8b95551 Compare July 15, 2024 21:47
Collaborator

@ellemouton ellemouton left a comment

high level pass looks good! I need to spend a bit more time getting familiar with sweeper stuff on the next pass though

chainntnfs/interface.go (outdated, resolved)
sweep/fee_bumper.go (outdated, resolved)
sweep/fee_bumper.go (outdated, resolved)
sweep/fee_bumper.go (outdated, resolved)
sweep/fee_bumper_test.go (outdated, resolved)
sweep/fee_bumper_test.go (resolved)
@yyforyongyu yyforyongyu force-pushed the yy-itest-miner branch 4 times, most recently from 1e898e2 to 4548971 Compare July 18, 2024 01:36
@yyforyongyu yyforyongyu changed the title Beat [1/3]: handle sweeper's broadcast error Beat [1/4]: handle sweeper's broadcast error Jul 18, 2024
@yyforyongyu yyforyongyu force-pushed the yy-prepare-blockbeat branch from 8b95551 to 9ec032a Compare July 18, 2024 07:27
@saubyk saubyk modified the milestones: v0.18.3, v0.19.0 Jul 18, 2024
@yyforyongyu yyforyongyu force-pushed the yy-itest-miner branch 2 times, most recently from e0ec893 to 9180762 Compare July 23, 2024 16:49
@guggero guggero deleted the branch lightningnetwork:yy-feature-blockbeat July 23, 2024 18:12
@yyforyongyu yyforyongyu force-pushed the yy-prepare-blockbeat branch 2 times, most recently from 5113dc5 to b0c3e2c Compare November 7, 2024 12:44
Collaborator

@ellemouton ellemouton left a comment

last couple of questions (thanks for the sweeper readme doc!! super helpful for loading context for this).

I think ready to ACK on next round 👍

// TxFatal is sent when the inputs in this tx cannot be retried. Txns
// will end up in this state if they have encountered a non-fee related
// error, which means they cannot be retried with increased budget.
TxFatal
Collaborator

wonder if we should rename TxFailed to something like TxFeeFailed or TxInputsFailed or TxFailedRetriable or something to make it clear what the difference between that and this is.

Collaborator

also just checking my understanding: we mark a tx as this and hence all its inputs as terminal, meaning we won't retry any of those inputs even if they have a deadline?

should we not re-submit those inputs so they are regrouped (or potentially handled individually)? or is that the plan for the future? just wondering if one bad input will result in us dropping a bunch of inputs

Collaborator

agree the long term goal should be to make sure we know the input which is causing the problems.

Member Author

we mark a tx as this and hence all its inputs as terminal, meaning we won't retry any of those inputs even if they have a deadline?

yeah correct, so this one bad input will affect all other inputs, which is not great. There's a TODO for that (the 6th item out of ten in #8680, and we are at one🤦🏻). The idea is to use testmempoolaccept to binary-search for the input causing the failure, then kick it out - but I guess we need to think about how to handle it with neutrino, as there's no mempool.
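
The binary-search idea mentioned here could look roughly like the sketch below. It is not lnd code: `acceptFunc` is a hypothetical oracle (for example, a wrapper that builds a tx from the given inputs and asks testmempoolaccept about it), and the search assumes exactly one bad input and that any subset without it is accepted.

```go
package main

import "fmt"

// acceptFunc is a hypothetical oracle reporting whether a tx spending
// exactly the given inputs would be accepted.
type acceptFunc func(inputs []string) bool

// findBadInput returns the index of the single rejected input, or -1 if the
// whole set is accepted. It assumes exactly one input causes the rejection.
func findBadInput(inputs []string, accepts acceptFunc) int {
	if len(inputs) == 0 || accepts(inputs) {
		return -1
	}

	// Invariant: the bad input lies within inputs[lo:hi].
	lo, hi := 0, len(inputs)
	for hi-lo > 1 {
		mid := lo + (hi-lo)/2
		if accepts(inputs[lo:mid]) {
			// The left half is clean, so the bad input is on the right.
			lo = mid
		} else {
			hi = mid
		}
	}
	return lo
}

func main() {
	inputs := []string{"a", "b", "c", "d", "e"}

	// Toy oracle: any set containing "d" is rejected.
	oracle := func(set []string) bool {
		for _, in := range set {
			if in == "d" {
				return false
			}
		}
		return true
	}

	fmt.Println("bad input index:", findBadInput(inputs, oracle))
}
```

Each probe halves the candidate range, so a set of n inputs needs about log2(n) acceptance checks after the initial one. A backend without a mempool would need a different oracle, which is the open question raised above.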

Comment on lines 268 to 269
// Every result must have a tx except the error or fail case.
if b.Tx == nil && b.Event != TxFatal {
Collaborator

just checking fatal case in this commit but comment says error or fail case

Collaborator

ah ok i see it is fixed in upcoming commit - so nbd. but maybe "failed or fatal" makes more sense in the comment

Member Author

fixed the comment

Comment on lines 283 to 285
// If it's a failed or fatal event, it must have an error.
if b.Err == nil {
if b.Event == TxFailed {
Collaborator

feels like this check should be the other way around?

// If it's a failed or fatal event, it must have an error.
if b.Event == TxFatal || b.Event == TxFailed {
	if b.Err == nil {
		return fmt.Errorf("%w: nil error", ErrInvalidBumpResult)
	}
}

Member Author

updated
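
To make the two validation rules from this thread concrete, here is a minimal, self-contained sketch. The names `BumpResult`, `TxFailed`, `TxFatal`, and `ErrInvalidBumpResult` are taken from the snippets above, but the types are simplified stand-ins (the tx is a plain string pointer, and the event list and `Validate` method are assumptions), so this is not the actual lnd implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// BumpEvent is a simplified stand-in for the fee bumper's event type.
type BumpEvent uint8

const (
	TxPublished BumpEvent = iota
	TxConfirmed
	TxFailed
	TxFatal
)

// ErrInvalidBumpResult is returned when a result is internally inconsistent.
var ErrInvalidBumpResult = errors.New("invalid bump result")

// BumpResult is a simplified stand-in; Tx would be a real tx type in lnd.
type BumpResult struct {
	Event BumpEvent
	Tx    *string
	Err   error
}

// Validate applies the two rules discussed above.
func (b *BumpResult) Validate() error {
	isFailure := b.Event == TxFailed || b.Event == TxFatal

	// Every result must have a tx except the failed or fatal case.
	if b.Tx == nil && !isFailure {
		return fmt.Errorf("%w: nil tx", ErrInvalidBumpResult)
	}

	// If it's a failed or fatal event, it must have an error.
	if isFailure && b.Err == nil {
		return fmt.Errorf("%w: nil error", ErrInvalidBumpResult)
	}

	return nil
}

func main() {
	// A fatal result without an error is rejected.
	r := &BumpResult{Event: TxFatal}
	fmt.Println(r.Validate())

	// A fatal result with an error but no tx is valid.
	r.Err = errors.New("tx creation failed")
	fmt.Println(r.Validate() == nil)
}
```

Checking the event first and the error second, as suggested in the review, means a non-failure result that happens to carry no error never triggers the nil-error rule.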

sweep/sweeper.go (resolved)
sweep/sweeper.go Outdated
@@ -1752,8 +1802,10 @@ func (s *UtxoSweeper) handleBumpEvent(r *bumpResp) error {
case TxReplaced:
return s.handleBumpEventTxReplaced(r)

// There's a fatal error in creating or publishing the tx, we will
Collaborator

just "creating" and not "or publishing" right? cause we only ever set the Event=TxErr in handleInitialBroadcast afaict (at least in this PR)

Collaborator

handleInitialBroadcast does also broadcast it, but technically it should fail prior when calling CheckMempoolAcceptance

Member Author

just "creating" and not "or publishing" right? cause we only ever set the Event=TxErr in

Correct, updated the comments - atm we only mark it as fatal in the initial tx creation.

sweep/sweeper.go (resolved)
sweep/sweeper.go (resolved)
sweep/sweeper.go (resolved)
// height as the starting height.
matured, locktime := pi.isMature(uint32(s.currentHeight))
if !matured {
defaultDeadline = int32(locktime + s.cfg.NoDeadlineConfTarget)
Collaborator

do we need to apply s.cfg.NoDeadlineConfTarget to this input? cause isn't that only for inputs with no deadline, meaning those will always be mature and won't actually end up here?

Collaborator

The deadline height is also the width of the fee function so we need to set it.

Member Author

yeah, suppose the locktime is 40 and the current height is 1000; that means the input cannot be swept until 1040. Then at block height 1040 the input will be swept, and the fee func will need a deadline delta, which defaults to this config value if not set.
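
The arithmetic in this example can be sketched as follows. `defaultDeadline` is a hypothetical helper, and the conf target of 1008 blocks is an assumed value chosen only for illustration; the point is that starting from the mature height keeps the fee function's full width available once the input becomes sweepable.

```go
package main

import "fmt"

// defaultDeadline returns the deadline height for an input with no explicit
// deadline. If the input is not yet mature, the mature height is used as
// the starting height instead of the current height.
func defaultDeadline(currentHeight, matureHeight, confTarget uint32) uint32 {
	start := currentHeight
	if currentHeight < matureHeight {
		start = matureHeight
	}
	return start + confTarget
}

func main() {
	const (
		currentHeight = 1000
		matureHeight  = 1040
		confTarget    = 1008 // assumed value, for illustration only
	)

	deadline := defaultDeadline(currentHeight, matureHeight, confTarget)
	fmt.Println("deadline:", deadline)                          // 2048
	fmt.Println("blocks once mature:", deadline-matureHeight)   // 1008

	// Deriving the deadline from the current height instead would leave a
	// narrower window by the time the input can actually be swept.
	naive := currentHeight + confTarget
	fmt.Println("naive deadline:", naive)                       // 2008
	fmt.Println("naive blocks once mature:", naive-matureHeight) // 968
}
```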

Collaborator

@ziggie1984 ziggie1984 left a comment

Ship it 👌


// Remove the record from the maps if there's an error or the tx is
// confirmed. When there's an error, it means this tx has failed its
// broadcast and cannot be retried. There are two cases it may fail,
// - when the budget cannot cover the fee.
Collaborator

I thought we make sure before attempting the sweep that the budget is always covered ?

Member Author

updated the comments - notice this means the budget is used up.

// NOTE: part of the InputSet interface.
func (b *BudgetInputSet) Immediate() bool {
for _, inp := range b.inputs {
// As long as one of the inputs is immediate, the whole set is
Collaborator

I wonder if we should not group immediate sweeps with the rest, because it feels like immediate sweeps are always higher priority and we don't want other inputs to interfere?

Member Author

Good q - think we need to break down the immediate flag more - the only way to express priority in the new sweeper is to use the deadline, and we need to stick to it. Atm the immediate flag is more like a broadcast-without-waiting-for-the-next-block flag. If we wanna change the priority, it should be done by changing the deadline.
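
A minimal sketch of the "any immediate input makes the whole set immediate" rule shown in the snippet above. The `sweepInput` and `inputSet` types are hypothetical simplifications of the real input set, and the flag here only controls when the set is broadcast, not its priority.

```go
package main

import "fmt"

// sweepInput is a hypothetical, simplified sweeper input.
type sweepInput struct {
	// immediate requests a broadcast without waiting for the next block.
	immediate bool
}

// inputSet is a hypothetical stand-in for a set of inputs swept together.
type inputSet struct {
	inputs []sweepInput
}

// Immediate reports whether the set should be broadcast right away. As long
// as one of the inputs is immediate, the whole set is treated as immediate.
func (s inputSet) Immediate() bool {
	for _, in := range s.inputs {
		if in.immediate {
			return true
		}
	}
	return false
}

func main() {
	mixed := inputSet{inputs: []sweepInput{{}, {immediate: true}}}
	plain := inputSet{inputs: []sweepInput{{}, {}}}

	fmt.Println(mixed.Immediate(), plain.Immediate()) // true false
}
```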

Comment on lines +352 to +409
if req.Immediate {
t.handleInitialBroadcast(record, requestID)
}
Collaborator

so there is no guarantee that the inputs will already be registered in the input set before the block comes in, so there exists a worst case where we need to wait another block?

sweep/fee_bumper.go (outdated, resolved)
func (t *TxPublisher) Broadcast(req *BumpRequest) (<-chan *BumpResult, error) {
log.Tracef("Received broadcast request: %s", lnutils.SpewLogClosure(
req))
func (t *TxPublisher) Broadcast(req *BumpRequest) <-chan *BumpResult {
Collaborator

Yes definitely preexisting, just a nit, feel free to ignore: func (t *TxPublisher) broadcast(requestID uint64) (*BumpResult, error) {

// has been reached or not. The locktime found is also returned.
func (p *SweeperInput) isMature(currentHeight uint32) (bool, uint32) {
locktime, _ := p.RequiredLockTime()
if currentHeight < locktime {
Collaborator

Q: Maybe a bit off-topic, but if no locktime is required, will the final sweep tx have at least a locktime of the current block height? I read somewhere that it's beneficial for preventing miners from reordering transactions, iirc.

Member Author

yeah it prevents fee sniping - I think we always set that field when creating the tx (supposing there's no CLTV).
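
The locktime choice described in this answer can be sketched as below. `sweepLockTime` is a hypothetical helper, not an lnd function; it only illustrates the rule "use the required locktime if there is one, otherwise the current height" that the comment describes as the anti-fee-sniping default.

```go
package main

import "fmt"

// sweepLockTime picks the nLockTime for a sweeping tx in this sketch. When
// an input requires a specific locktime (CLTV), that value must be used.
// Otherwise the current height is used, which discourages fee sniping.
func sweepLockTime(required uint32, hasRequired bool, currentHeight uint32) uint32 {
	if hasRequired {
		return required
	}
	return currentHeight
}

func main() {
	// No CLTV requirement: the tx is locked to the current height.
	fmt.Println(sweepLockTime(0, false, 1000)) // 1000

	// A CLTV requirement takes precedence.
	fmt.Println(sweepLockTime(980, true, 1000)) // 980
}
```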

// means they will be removed from the sweeper and never be tried
// again.
//
// TODO(yy): Find out which input is causing the failure and fail that
Collaborator

Very important todo, I would say it even makes sense to open a tracking issue wdyt ?

Member Author

#8680

Tho I think my next PR in the sweeper series would be removing anchor sweeping first, which is more urgent as we rarely see broadcast errors.

@@ -1665,6 +1665,14 @@ func (s *UtxoSweeper) monitorFeeBumpResult(set InputSet,
// in sweeper and rely solely on this event to mark
// inputs as Swept?
if r.Event == TxConfirmed || r.Event == TxFailed {
// Exit if the tx is failed to be created.
if r.Tx == nil {
Collaborator

Isn't this an error if the event is TxConfirmed but no Tx is found?

Member Author

yeah and it will be caught when we receive the bump event and validate it.

@@ -1690,7 +1698,10 @@ func (s *UtxoSweeper) handleBumpEventTxFailed(resp *bumpResp) {
r := resp.result
tx, err := r.Tx, r.Err

log.Warnf("Fee bump attempt failed for tx=%v: %v", tx.TxHash(), err)
if tx != nil {
log.Warnf("Fee bump attempt failed for tx=%v: %v", tx.TxHash(),
Collaborator

maybe add in the log warning => regrouping inputs for this bump ?
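
The nil guard in this hunk matters because calling a method such as `TxHash()` through a nil tx pointer would typically panic once it dereferences the receiver. A small sketch of the pattern, using a hypothetical `tx` type, a hypothetical `failureLog` helper, and an illustrative error value:

```go
package main

import (
	"errors"
	"fmt"
)

// tx is a hypothetical stand-in for a transaction type.
type tx struct {
	id string
}

// TxHash dereferences the receiver, so it panics when called on a nil *tx.
func (t *tx) TxHash() string {
	return t.id
}

// errExample is an illustrative error used only in this sketch.
var errExample = errors.New("example broadcast error")

// failureLog builds the warning message, only touching the tx when it
// exists. A failed event may carry no tx if the tx could not be created.
func failureLog(t *tx, err error) string {
	if t == nil {
		return fmt.Sprintf("Fee bump attempt failed: %v", err)
	}
	return fmt.Sprintf("Fee bump attempt failed for tx=%v: %v",
		t.TxHash(), err)
}

func main() {
	fmt.Println(failureLog(nil, errExample))
	fmt.Println(failureLog(&tx{id: "abcd"}, errExample))
}
```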

yyforyongyu and others added 12 commits November 14, 2024 16:23
Also updated the logging. This new state will be used in the following
commit.

This prepares the following commit where we now let the fee bumper
decide whether to broadcast immediately or not.

This commit changes how inputs are handled upon receiving a bump result.
Previously the inputs were taken from the `BumpResult.Tx`; they are now
handled locally, as we remember the input set when sending the bump
request and handle this input set when a result is received.

This commit adds a new method `handleInitialBroadcast` to handle the
initial broadcast. Previously we'd broadcast immediately inside
`Broadcast`, which will soon stop working once the `blockbeat` is
implemented, as the action to publish is now always triggered by a new
block. Meanwhile, we still keep the option to bypass the block trigger
so users can broadcast immediately by setting `Immediate` to true.

Previously in `markInputFailed`, we'd remove all inputs under the same
group via `removeExclusiveGroup`. This is wrong, as when the current
sweep fails for this input, it shouldn't affect other inputs.

Also updated `handlePendingSweepsReq` to skip immature inputs so the
returned results are the same as those in pre-0.18.0.

Combined with the following commit, this gives us more granular control
over the bump result when handling it in the sweeper.

After the previous commit, it should be clear that the tx may fail to be
created in a `TxFailed` event. We now make sure to catch it to avoid a
panic.
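
The flow described in the `handleInitialBroadcast` commit message above, where publishing is normally triggered by a new block but can be forced with `Immediate`, might be sketched like this. The `publisher` type, its fields, and `onNewBlock` are hypothetical simplifications; the real publisher tracks far more state.

```go
package main

import "fmt"

// request is a hypothetical, simplified bump request.
type request struct {
	id        uint64
	immediate bool
}

// publisher is a hypothetical stand-in for the tx publisher.
type publisher struct {
	// pending holds records still waiting for their initial broadcast.
	pending map[uint64]bool
	// published lists request IDs in the order they were broadcast.
	published []uint64
}

func newPublisher() *publisher {
	return &publisher{pending: make(map[uint64]bool)}
}

// Broadcast registers the request. It only publishes right away when the
// request is immediate; otherwise the next block triggers the broadcast.
func (p *publisher) Broadcast(req request) {
	p.pending[req.id] = true
	if req.immediate {
		p.handleInitialBroadcast(req.id)
	}
}

// handleInitialBroadcast publishes a pending record exactly once.
func (p *publisher) handleInitialBroadcast(id uint64) {
	if !p.pending[id] {
		return
	}
	delete(p.pending, id)
	p.published = append(p.published, id)
}

// onNewBlock publishes every record still waiting for its first broadcast.
func (p *publisher) onNewBlock() {
	for id := range p.pending {
		p.handleInitialBroadcast(id)
	}
}

func main() {
	p := newPublisher()
	p.Broadcast(request{id: 1})
	p.Broadcast(request{id: 2, immediate: true})
	fmt.Println("before block:", p.published) // [2]

	p.onNewBlock()
	fmt.Println("after block:", p.published) // [2 1]
}
```

With more than one pending record, the order in which `onNewBlock` publishes them follows Go's map iteration order and is therefore not deterministic in this sketch.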
@yyforyongyu yyforyongyu changed the base branch from yy-feature-blockbeat to master November 14, 2024 09:11
@yyforyongyu yyforyongyu changed the base branch from master to yy-feature-blockbeat November 14, 2024 09:11
@ellemouton ellemouton self-requested a review November 14, 2024 16:23
Collaborator

@ellemouton ellemouton left a comment

LGTM! 🚢 🫀

@yyforyongyu yyforyongyu merged commit 0dc5191 into lightningnetwork:yy-feature-blockbeat Nov 15, 2024
18 of 30 checks passed
@yyforyongyu yyforyongyu mentioned this pull request Nov 27, 2024
@yyforyongyu yyforyongyu deleted the yy-prepare-blockbeat branch December 20, 2024 14:00
Labels
no-changelog P1 MUST be fixed or reviewed utxo sweeping
Projects
Status: Done
7 participants