Instances lost outbound internet access after changing silo default IP pool #7297
Comments
I believe you're correct there, Angela. We basically need a link between the Ideally we would add it as a second pool linked to the gateway, but the API would prevent that today? I don't know whether this behaviour is particularly good, though. We could change things on the OPTE side so that any untagged IPs (i.e., no pool link) can be used as a lower priority option than any explicit IGW matches. The instance is already receiving traffic on that IP, so I think that making sure we can always use it is probably the right thing here.
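The fallback proposed above can be sketched roughly as follows. This is only an illustration of the priority order (explicit IGW match first, untagged IP second), not OPTE's actual implementation. The `ExternalIp` type, the `gateway_ids` field, and `pick_source_ip` are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ExternalIp:
    """Hypothetical model of an external IP as seen by the port."""
    address: str
    # Gateways this IP is tagged with; an empty set models an "untagged" IP
    # (its pool is not linked to any internet gateway).
    gateway_ids: frozenset = frozenset()


def pick_source_ip(ips: Iterable[ExternalIp], target_gateway: str) -> Optional[ExternalIp]:
    """Pick a source IP for outbound traffic routed to target_gateway."""
    ips = list(ips)
    # First choice: an IP explicitly tagged with the target gateway.
    for ip in ips:
        if target_gateway in ip.gateway_ids:
            return ip
    # Lower-priority fallback: any untagged IP.
    for ip in ips:
        if not ip.gateway_ids:
            return ip
    # IPs tagged only for other gateways are never used for this target.
    return None
```

With this ordering, an instance whose only external IP lost its pool link would still get outbound access, while an explicit gateway match continues to win whenever one exists.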
@FelixMcFelix: Thanks for weighing in! If this particular problem only happens when there are instances with external IPs on pools that have been untagged from silo default during IGW migration, we may not want to change how things work to accommodate the one-time trigger of this problem. We can perhaps handle this by documenting the procedures (for us or the customer) to restore the outbound internet connectivity in the release notes. It's quite possible that nobody happened to make the default IP pool change besides us while there are instances still using the old default pool. But if the problem may happen in other situations, we'll need to come up with a solution and understand the trade-offs.
The issue title has been updated as it can also happen outside of release 11 internet gateway migration. Basically, any time an instance has none of its external IPs (ephemeral, floating, or SNAT) matching with the system-created There are two scenarios where the mismatch can happen:
Depending on the intended outcome, we can fix the instance's outbound access in one of the following ways:
@iliana - In order to fix the outbound access issue of iliana's VM (or any other instances in the Oxide silo with the same issue), we can use one of the methods mentioned above. I think the easiest way is the first one, i.e. simply attach the Also copying @augustuswm here in case you get reports of the same issue from other users. I'm going to document this in the Troubleshooting guide and also in the release notes as a known issue. I'll also bring this up for triage with the network team, since it will catch users by surprise when an instance with an external IP somehow has no outbound access.
In the particular case of this VM ("axiomatic"), I'm going to go ahead and recreate it with an internal IP, since it doesn't really need to be public.
Created https://github.com/oxidecomputer/docs/pull/508 to add this to Known issues in r12 release notes and the workarounds for fixing outbound access of the affected instances.
@iliana has an instance with an ephemeral IP that lost its outbound internet access on rack3 (inbound access is working just fine). I took a look at its opte entries and noticed that its router-target in the Outbound Rules section didn't have an internet gateway UUID (meta: router-target=ig).
For comparison, this is what the output looks like for another instance with an ephemeral IP in the same IP pool (we see meta: router-target=ig=46452e5f-1ddc-4b7c-9013-114d1a26d936):
The port in question does have the correct internet gateway id captured in the opte router output:
I wonder if this is because the default IP pool for the silo was changed between when the ephemeral IP was allocated and when the migration script schema/crdb/internet-gateway/up13.sql was executed. The migration script auto-created a default gateway attached to the current default IP pool (the pool named eng-vpn) while the instance has its external IP in the original default pool named public. @FelixMcFelix - thoughts?
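As a rough aid for spotting affected instances, the two checks described in this report could be sketched as below. This is not an Oxide tool. The line format is assumed only from the two `meta: router-target=ig...` examples quoted above, and `gateway_id` and `pools_missing_gateway` are hypothetical helper names.

```python
import re
from typing import Iterable, List, Optional

# Assumed format, based only on the two examples quoted in this issue:
#   meta: router-target=ig
#   meta: router-target=ig=<uuid>
ROUTER_TARGET = re.compile(r"router-target=ig(?:=(?P<gw>[0-9a-fA-F-]{36}))?\s*$")


def gateway_id(meta_line: str) -> Optional[str]:
    """Return the gateway UUID from a router-target line, or None if it is absent."""
    match = ROUTER_TARGET.search(meta_line)
    if match is None:
        raise ValueError(f"not a router-target=ig line: {meta_line!r}")
    return match.group("gw")


def pools_missing_gateway(instance_ip_pools: Iterable[str],
                          gateway_linked_pools: Iterable[str]) -> List[str]:
    """List the instance's IP pools that are not linked to the gateway."""
    return sorted(set(instance_ip_pools) - set(gateway_linked_pools))
```

For the instance in this report, `gateway_id("meta: router-target=ig")` returns `None`, and `pools_missing_gateway(["public"], ["eng-vpn"])` reports `public` as the pool with no gateway link.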