Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

routing+htlcswitch: fix stuck inflight payments #9150

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

yyforyongyu
Copy link
Member

@yyforyongyu yyforyongyu commented Oct 2, 2024

Fix #8975

This PR aims at fixing a scenario where the payment could be stuck. Major changes are,

  • all the db updates for a given payment now happens in a single goroutine (resumePayment)
  • minor optimization on sphinx circuit creation - the error decryptor is now conditionally created when needed, i.e., when processing an update_fail_htlc.

TODOs

  • fix unit test in htlcswitch
  • fix unit test in routing

Copy link
Contributor

coderabbitai bot commented Oct 2, 2024

Important

Review skipped

Auto reviews are limited to specific labels.

🏷️ Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@calvinrzachman calvinrzachman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi YY, made a quick pass through the change set. Looks pretty cool to me! Had a couple questions to make sure I understand the approach and also on whether the modification of GetAttemptResult is strictly necessary for this bug fix as it may have implications (albeit slight) on how the ChannelRouter can be used remotely.

@@ -123,7 +123,8 @@ type PaymentAttemptDispatcher interface {
// longer be in flight. The switch shutting down is signaled by
// closing the channel. If the attemptID is unknown,
// ErrPaymentIDNotFound will be returned.
GetAttemptResult(attemptID uint64, paymentHash lntypes.Hash,
GetAttemptResult(attempt *channeldb.HTLCAttempt,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this PR moves to transporting the hop + session key information across the interface boundary by way of channeldb.HTLCAttempt type. This is what allows the forwarding error decryption to be performed one layer deeper within the Switch rather than within the ChannelRouter. This may have implications for ability to support oblivious payments by providing optionality in server versus client side decryption of forwarding errors.

Maybe this is too future thinking to be relevant, but such an opportunity is a bit more natural when you consider alternative deployment scenarios in which ChannelRouter and Switch run remotely in different processes - perhaps both controlled by the same entity (eg: gRPC reverse proxy to make multiple nodes act as a single logical node) or controlled by different parties in the case where the client would like to build onions and send payments in a way that is private from a hosted node infrastructure provider.

Whether forwarding error decryption should be conducted server/Switch side or by the caller of GetAttemptResult could previously be signaled by the presence of the deobfuscator:

// extractResult uses the given deobfuscator to extract the payment result from
// the given network message.
func (s *Switch) extractResult(deobfuscator ErrorDecrypter, n *networkResult,
	attemptID uint64, paymentHash lntypes.Hash) (*PaymentResult, error) {

...

// If the caller did not provide a deobfuscator, then we'll
// return the onion-encrypted blob that details why the HTLC was
// failed. This blob is only fully decryptable by the entity
// which built the onion packet.
if deobfuscator == nil {
	return &PaymentResult{
		EncryptedError: htlc.Reason,
	}, nil
}

I could imagine a similar approach to optional Switch error decryption by constructing a HTLCAttempt object which does not set the fields needed to perform forwarding error decryption in the Switch, but the attempt.Circuit() call would have to allow that information not being there so the Switch could instead return the encrypted error blob via the result.

In our case, we control both the remote ChannelRouter and the lnd/Switch instances, so whether the forwarding error decryption occurs in the router or switch isn't a huge concern since privacy across the RPC boundary is not relevant.

From the perspective of a router client controlled separately from the entity operating the lnd/Switch instances: With the removal of any ChannelRouter error decryption logic, the client application could not re-use the ChannelRouter type, but it could at least in principle provide a different implementation of a path-finding, onion building, and payment life-cycle manager component that could do client-side decryption of forwarding errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the optimization on sphinx circuit creation needed to fix the bug? Do we gain much by way performance or more logical code separation by burying the decryption of errors within the Switch?

Maybe we like the improvements this brings better than the currently somewhat hypothetical ability to allow ChannelRouter re-use in clients seeking to maintain privacy from hosted node providers. One argument for keeping the error decryption logic in the ChannelRouter could be that it does create the onion after all so will still need to handle or know about the hops + session key.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I followed - if you wanna create onions then you should use SendToOnions?

Is the optimization on sphinx circuit creation needed to fix the bug? Do we gain much by way performance or more logical code separation by burying the decryption of errors within the Switch?

Nope it's not needed but an improvement - an easy way to measure it is to run the payment benchmark, which increases a bit. This is more of a logical code separation since there's no need to re-create the error decryptor if we know it's not a fail but a settle.

One argument for keeping the error decryption logic in the ChannelRouter could be that it does create the onion after all so will still need to handle or know about the hops + session key.

I don't think that's the case, all paymentLifecycle needs is to call RequestRoute and pass the route to the htlcswitch.

Anyway I removed this change to limit the scope of this PR, think we'll take a look at it again when working on #8834.

result *htlcswitch.PaymentResult) bool {

// Save the keys so we know which items to delete from the map.
attempts = append(attempts, a)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only mark an attempt result for deletion if it is actually successfully handled without error by the call to handleAttemptResult below? If a result is not deleted from the map, it will be reprocessed when this function is again called on the next life-cycle loop iteration. On one hand retries sound good, but in this context maybe that risks a perpetually growing result map if certain results are borked and never able to be processed successfully 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I thought about this too - since the only error we get here is DB-related, I don't think retrying it again would help. This is also the current behavior. But I guess some logs would help so will add!

routing/payment_lifecycle.go Show resolved Hide resolved
routing/payment_lifecycle.go Show resolved Hide resolved
routing/payment_lifecycle.go Outdated Show resolved Hide resolved
@ziggie1984 ziggie1984 self-requested a review October 14, 2024 06:34
@yyforyongyu yyforyongyu force-pushed the fix-stuck-payment branch 2 times, most recently from abd6e94 to 930fadf Compare October 14, 2024 12:35
@yyforyongyu yyforyongyu added payments Related to invoices/payments bug fix size/kilo medium, proper context needed, less than 1000 lines labels Oct 14, 2024
@yyforyongyu yyforyongyu added this to the v0.19.0 milestone Oct 14, 2024
@yyforyongyu yyforyongyu self-assigned this Oct 14, 2024
@yyforyongyu yyforyongyu marked this pull request as ready for review October 14, 2024 12:37
Copy link
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactor bringing the result processing into one goroutine.

I might be wrong here but afaict we also need to make sure we cancel the trackpayment subscription when canceling the payment. Right now when encountering a switch result which would trigger a payment cancelation, we would stop all the result collectors never being able to resolve still inflight HTLCs (thinking of the MPP_Timeout). We would only do this after restart, so I think before canceling a payment we should make sure all htlcs are resolved or at least also abort the trackpayment request because it has not chance of resolving inflight htlcs because the payment_lifecycle got canceled.

routing/payment_lifecycle_test.go Show resolved Hide resolved
channeldb/mp_payment.go Show resolved Hide resolved

// refreshPayment returns the latest payment found in the control tower after
// all its attempt results are processed.
func (p *paymentLifecycle) refreshPayment() (DBMPPayment,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: In this commit it does not really look like we are refreshing something, just fetching the payment ..., only makes sense when looking into the following commits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think reading from the db to get the latest state is by definition "refreshing".

routing/payment_lifecycle.go Show resolved Hide resolved
}

// We successfully got a payment result back from the switch.
log.Debugf("Payment %v succeeded with pid=%v", p.identifier,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Payment => HTLC Attempt succeeded ...? Or are we assuming as soon as one attempt succeeds the whole payment succeeds because the preimage was revealed ? But can we be sure in an MPP case that the whole amount was settled ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah nice call, must be leftover and now updated - and we always check the inflight htlcs before considering the payment as finalized or not.

routing/payment_lifecycle.go Show resolved Hide resolved
routing/payment_lifecycle.go Show resolved Hide resolved
func (p *paymentLifecycle) handleAttemptResult(attempt *channeldb.HTLCAttempt,
result *htlcswitch.PaymentResult) (*attemptResult, error) {

// In case of a payment failure, fail the attempt with the control
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can also not only fail the attempt but the whole payment so we should maybe be more detailed here in the comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the comments.

return p.handleSwitchErr(attempt, result.Error)
}

// We successfully got a payment result back from the switch.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: not only a result but a positive result, the case where the payment failed is handled in handleSwitchErr ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the comments

routing/payment_lifecycle.go Show resolved Hide resolved
Copy link
Member Author

@yyforyongyu yyforyongyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong here but afaict we also need to make sure we cancel the trackpayment subscription when canceling the payment.

Checked the trackpayment impl and realized we do send all the updates and optionally terminate the subscription here (maybe we need to take a second look at the subscribe all case),

if terminal {
close(subscriber.queue.ChanIn())
}

routing/payment_lifecycle_test.go Show resolved Hide resolved
channeldb/mp_payment.go Show resolved Hide resolved
channeldb/mp_payment.go Show resolved Hide resolved
channeldb/mp_payment.go Show resolved Hide resolved
routing/payment_lifecycle.go Show resolved Hide resolved

// refreshPayment returns the latest payment found in the control tower after
// all its attempt results are processed.
func (p *paymentLifecycle) refreshPayment() (DBMPPayment,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah rename it to be more verbose.

func (p *paymentLifecycle) handleAttemptResult(attempt *channeldb.HTLCAttempt,
result *htlcswitch.PaymentResult) (*attemptResult, error) {

// In case of a payment failure, fail the attempt with the control
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the comments.

routing/payment_lifecycle.go Show resolved Hide resolved
return p.handleSwitchErr(attempt, result.Error)
}

// We successfully got a payment result back from the switch.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the comments

}

// We successfully got a payment result back from the switch.
log.Debugf("Payment %v succeeded with pid=%v", p.identifier,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah nice call, must be leftover and now updated - and we always check the inflight htlcs before considering the payment as finalized or not.

@ziggie1984
Copy link
Collaborator

ziggie1984 commented Oct 28, 2024

if terminal {
close(subscriber.queue.ChanIn())
}

because htlcs are still in flight, the payment state is not terminated so we will not close the channel here, given the fact that we still have inflight_htlcs. However we will close all result_collectors which makes it impossible for the track_payment stream to report back.

@yyforyongyu
Copy link
Member Author

given the fact that we still have inflight_htlcs. However we will close all result_collectors which makes it impossible for the track_payment stream to report back.

I think that's not the case anymore since we fixed the stuck payment here? Also even if it's not fixed before re the stuckness, the track payment IMO is behaving correctly because it shouldn't quit unless the payment is terminated.

@ziggie1984
Copy link
Collaborator

ziggie1984 commented Nov 3, 2024

So I analysed our payment behaviour and found out, that we would fail a payment without waiting for all result collectors. This would lead also to the case that trackPayment subscriber would be hang indefintily. This is now solved by waiting for all results and updating the payment db. Let me know if this approach is good @yyforyongyu

Thoughts about the current change.

So the reason why we should wait for all shards to resolve is the following: If we launch an MPP payment for example and receive the first error for one of the Shards, we would mark the payment as failed, closing all the collectors in the process. THis means our payment has never the chance to resolve and the trackPayment subscribers will wait indefinitly because the payment never reached a terminal condition (still inflight payments).

@ziggie1984 ziggie1984 force-pushed the fix-stuck-payment branch 2 times, most recently from d5f5a97 to 30f5826 Compare November 3, 2024 21:46
@yyforyongyu
Copy link
Member Author

So the reason why we should wait for all shards to resolve is the following

The assessment is correct, however I don't think we want to add another state to decide whether we should exit the payment lifecycle or not, instead we should stick to p.decideNextStep so we only have a single place to act this logic, which I pushed a fix for this case.

So I analysed our payment behaviour and found out, that we would fail a payment without waiting for all result collectors.

I did try to write an itest for this case but failed. Could you elaborate on this case?

Also I'd like to limit the scope of this PR to focus on fixing this case only, as stated in the issue description. Given the complexity of the payment lifecycle, I think it's better to focus on one issue at a time, and open PRs for other fixes.

Moving forward, I think the handleSwitchErr may need a closer look,

var rtErr htlcswitch.ClearTextError
ok := errors.As(sendErr, &rtErr)
if !ok {
return p.failPaymentAndAttempt(
attemptID, &internalErrorReason, sendErr,
)
}

We need to revisit the errors returned from the link and decide which error causes an attempt failure and which causes a payment failure. I suspect we are prematurely marking the payment as failed when it should instead mark the attempt as failed.

log.Errorf("Failed to find a route for payment %v: %v",
p.identifier, err)

// The payment is now marked as failed, we'll continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this comment holds tho because in the requestRoute func, we would not necessarily fail the payment depending on the error:

	var routeErr noRouteError
	if !errors.As(err, &routeErr) {
		return nil, fmt.Errorf("requestRoute got: %w", err)
	}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice yeah it was handled there already, one less commit

Copy link
Collaborator

@ziggie1984 ziggie1984 Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I think we need to keep this exception, imagine we launch a payment, we need to shard the payment, so we send one paymenet successfully now we try the next shard, run out of routes ? We would then close the payment_lifecycle and remove all the result collectors again (for the first shard which will then never have the chance to get resolved)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's no route error we'd return a nil error and continue the lifecycle due to nil route below.

Copy link
Collaborator

@ziggie1984 ziggie1984 Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm aren't we returning an error when it's no NoRoute err ?:

return nil, fmt.Errorf("requestRoute got: %w", err)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case we also wanna wait for all payment to resolve tho.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case something seriously wrong happened and we should exit, of course it relies on the API to promise what it does - if not we need to fix it in the router. Otherwise, we will end up in an infinite loop here because the payment is not failed and we will keep sending new attempts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, makes sense to exit. My only concern here is then that we are closing the result collectors, which is unfortunate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as the router and switch are active we probably should keep them alive.

@ziggie1984
Copy link
Collaborator

Also I'd like to limit the scope of this PR to focus on fixing #8975 (comment) only, as stated in the issue description. Given the complexity of the payment lifecycle, I think it's better to focus on one issue at a time, and open PRs for other fixes.

Looking at your proposal it looks like it might solve the issue, because we would always wait in the decideNextStep. I am more in favour of not doing so much exceptions and special casing a lot of errors, I would rather prefer the approach were we error out and just wait for all collectors everytime but I can agree on both solutions.

@yyforyongyu
Copy link
Member Author

I am more in favour of not doing so much exceptions and special casing a lot of errors, I would rather prefer the approach were we error out and just wait for all collectors everytime but I can agree on both solutions.

I actually think it's quite the opposite - if we are using another counter and read the chan it creates more questions, e.g., what if there are multiple results being sent to resultCollected? what if resultCollected is empty and blocking? By using the counter approach not only do we need to create more states to deal with, but we also doge the issue in decideNextStep or registerAttempt, and likely to hide future issues - the commit I put there is a hack, but it doesn't pretend the issue is fixed - we need to go deep into the switch and understand when to fail the payment vs the attempt.

That being said, I'm considering removing it too as it's hacky. As for the edge case, say a payment made of three HTLCs,

  1. two of the HTLCs are sent, the third one is blocking on requestRoute
  2. one of the already sent HTLCs has failed, causing the payment to be marked as failed
  3. when sending the third one, the lifecycle loop will error out because registerAttempt will fail due to Registrable returning an error.

If we can confirm that step 2 indeed happens, then we can provide a solid fix to make sure we don't fail the payment prematurely, then we don't need the hack. On the other hand, by using the counter approach, we won't notice this issue and won't attempt a fix.

@ziggie1984
Copy link
Collaborator

two of the HTLCs are sent, the third one is blocking on requestRoute

could you elaborate why we would block at requestRoute imo we fetch the route there and if its not available we fail ? We only block in the decideNextStep or ?

Yeah I think your example might happen not sure how often tho but it seems reasonable to assume.

My current concern and maybe you can jump in and mitigate this case is the following:

Our SendPaymentV2 probably the most used case for sending payments, never listens to the outcome of the payment_lifecycle but it subscribes to the payment and listens to the control_tower to signal the terminal state. So in my opinion we should make sure the control_tower always is able to resolve the payment hence waiting for all shards, and not let the payment_lifecycle terminate the ability for the control_tower to resolve payments by closing the result collectors.

So I am not sure here, maybe we should really invest the time and find some proper solution rather than get this fix it. Maybe my concerns are overrated and we never run in those edge cases that often anymore given the fact that we know sync the payment in one loop ...

@yyforyongyu
Copy link
Member Author

could you elaborate why we would block at requestRoute imo we fetch the route there and if its not available we fail ? We only block in the decideNextStep or ?

I also don't know how it could happen😂, just making assumptions. If requestRoute always returns quickly, then I doubt we will ever run into this edge case because we'd send out all three HTLCs immediately.

but it subscribes to the payment and listens to the control_tower to signal the terminal state.

now that you mention it I think the fix is to notify it in the payment lifecycle, something like this.

@ziggie1984
Copy link
Collaborator

ziggie1984 commented Nov 4, 2024

now that you mention it I think the fix is to notify it in the payment lifecycle, something like this.

I think this can also be a solution. Not super happy with it, because we kinda prevent the payment from being marked as failed until a restart but at least the subscriber does not hang indefinitely.

Maybe this issue should be taken care of in the payment sql refactor I am doing and let's keep a TODO here ?. However @hieblmi might need this feature for his static address functionality.

i think we have all the cards on the table, just need to decide which direction to go. (cc @saubyk I think we need to discuss this one in the after next standup)

@yyforyongyu
Copy link
Member Author

!lightninglabs-deploy mute 336h

Copy link
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are very close, I think the only thing we should do is that we always mark a payment as failed in resumePayment and wait for all active resolvers. This makes sure we do not start any more attempts but also collect all ongoing htlcs attempts and in the process also signaling this to the trackPayment command. So basically in

exitWithErr := func(err error) ([32]byte, *route.Route, error) {
		log.Errorf("Payment %v with status=%v failed: %v",
			p.identifier, payment.GetStatus(), err)
		return [32]byte{}, nil, err
	}

just mark the payment as failed and continue until all resolvers are finsihed, which will that exit this process.

Even when having a timeout set for the payment,this timeout should just set the payment as failed and then when all attempts are resolved we would exit via:

if !wait {
			return stepExit, nil
		}

routing/payment_lifecycle.go Show resolved Hide resolved
return stepExit, err
}

case <-p.resultCollected:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Add a comment that we are here waiting for the result form the switch

log.Errorf("Error reporting payment success to mc: %v", err)
}

// In case of success we atomically store settle result to the DB move
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: store settle resutls to the DB and move ...

@@ -803,8 +803,8 @@ func (p *paymentLifecycle) handleSwitchErr(attempt *channeldb.HTLCAttempt,
// case we can safely send a new payment attempt, and wait for its
// result to be available.
if errors.Is(sendErr, htlcswitch.ErrPaymentIDNotFound) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should differentiate between the payment not found in the network resutts or the attempts, there are imo two different things:

func fetchResult(tx kvdb.RTx, pid uint64) (*networkResult, error) {
var attemptIDBytes [8]byte
binary.BigEndian.PutUint64(attemptIDBytes[:], pid)
networkResults := tx.ReadBucket(networkResultStoreBucketKey)
if networkResults == nil {
return nil, ErrPaymentIDNotFound
}
// Check whether a result is already available.
resultBytes := networkResults.Get(attemptIDBytes[:])
if resultBytes == nil {
return nil, ErrPaymentIDNotFound
}
// Decode the result we found.
r := bytes.NewReader(resultBytes)
return deserializeNetworkResult(r)
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wdyt ?

@@ -262,6 +262,16 @@ lifecycle:
// We found a route to try, create a new HTLC attempt to try.
attempt, err := p.registerAttempt(rt, ps.RemainingAmt)
if err != nil {
// If the error is due to we cannot register another
Copy link
Collaborator

@ziggie1984 ziggie1984 Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should in neither case when receiving an error also in requestRoute just exist the payment without marking it as failed not collecting all results. So what we should do is, every error that occurs should fail the payment in the paymentDB and wait for all collectors to resolve.

This commit caches the creation of sphinx circuit and onion blob to
avoid re-creating them again.
To shorten the method `resumePayment` and make each step more clear.
yyforyongyu and others added 6 commits January 6, 2025 14:31
To further shorten the lifecycle loop.
This commit refactors `collectResultAsync` such that this method is now
only responsible for collecting results from the switch. A new method
`processSwitchResults` is added to process these results in the same
goroutine where we fetch the payment from db, to make sure the lifecycle
loop always have a consistent view of a given payment.
@yyforyongyu yyforyongyu force-pushed the fix-stuck-payment branch 2 times, most recently from fc8d767 to 56ddaee Compare January 6, 2025 07:54
@lightninglabs-deploy
Copy link

@yyforyongyu, remember to re-request review from reviewers when ready

@saubyk saubyk requested a review from bitromortac January 21, 2025 17:42
@ziggie1984 ziggie1984 self-requested a review January 23, 2025 09:43
Copy link
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very close ⚡️

I still think that we should never really abort the payment lifecycle as long as we have pending htlcs we are awaiting the result from. We might only think of quitting the payment if we cannot interpret one of the results in the new result map from the switch, because then we would kinda end up in a cycle always reprocessing. But I really do not understand why we need to cancel the payment lifecylce because based on the RPC calls, we always listen to the TrackPayment stream nonetheless, so we are not really listening to the outcome in SendPaymentAsync (SendPaymentV2).

But maybe in a follow-up PR because this PR already resolves some problems in our payment flow.

ps := payment.GetState()
remainingFees := p.calcFeeBudget(ps.FeesPaid)

log.Debugf("Payment %v: status=%v, active_shards=%v, rem_value=%v, "+
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would also be very helpful to add the number of settled and canceled htlcs as well in the log msg.

Comment on lines 149 to 150
// NOTE: we don't check `p.quit` since `decideNextStep` is
// running in the same goroutine as `resumePayment`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can you detail why we are not listening for p.quit, I don't really undestand this comment.

// Save the result and process it in the next main loop.
p.switchResults.Store(attempt, result)

// Signal that a result has been collected.
select {
// Send the signal or quit.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can you be more detailed which other process is waiting for the signal here, otherwise we can also remove this comment because it just describes the code.

// means the payment lifecycle needs to be terminated.
_, err := p.handleAttemptResult(a, result)
if err != nil {
errReturned = err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a log here that the handling of the result failed and we are therefore exiting the payment loop ?

func (p *paymentLifecycle) handleAttemptResult(attempt *channeldb.HTLCAttempt,
result *htlcswitch.PaymentResult) (*attemptResult, error) {

// If the result has an error, we need to further process it by failing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one question I have here, what if we still are in the process of collecting an HTLC result (it is not in the map yet, now we exit the Payment loop and the trackpayment stream will never have the chance to resolve the payment because we closed the result collector ? Do we want it this way ?

@@ -841,9 +841,9 @@ func TestPaymentControlMultiShard(t *testing.T) {
b.AttemptID = 3
_, err = pControl.RegisterAttempt(info.PaymentIdentifier, &b)
if test.settleFirst {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have the if else here if both do the same thing ?

//
// TODO(yy): remove this check once we have a finer
// control over errors returned from the switch.
if errors.Is(err, channeldb.ErrRegisterAttempt) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this case, because when should this ever happen we now always have one single send and process result thread, so we would already know probably in the decideNextStep func ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix payments Related to invoices/payments size/kilo medium, proper context needed, less than 1000 lines
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[bug]: Payment Flow: Don't abort payment until HTLC attempts are still in flight
4 participants