
Instances lost outbound internet access after changing silo default IP pool #7297

Open
askfongjojo opened this issue Dec 28, 2024 · 6 comments
Labels: known issue (To include in customer documentation and training)

Comments

@askfongjojo commented Dec 28, 2024

@iliana has an instance with an ephemeral IP on rack3 that has lost its outbound internet access (inbound access is working just fine). I took a look at its OPTE entries and noticed that the router-target in the Outbound Rules section didn't have an internet gateway UUID (meta: router-target=ig).

BRM42220054 # opteadm dump-layer nat -p opte4
Port opte4 - Layer nat
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO  SRC IP          SPORT  DST IP          DPORT  HITS  ACTION
TCP    92.255.85.253   40574  45.154.216.171  22     0     NAT
TCP    92.255.85.253   40576  45.154.216.171  22     0     NAT
[SNIP]

Outbound Flows
----------------------------------------------------------------------
PROTO  SRC IP      SPORT  DST IP          DPORT  HITS  ACTION
TCP    172.30.0.6  22     92.255.85.253   40574  1     NAT
[SNIP]

Inbound Rules
----------------------------------------------------------------------
ID   PRI  HITS    PREDICATES                   ACTION
5    10   155268  inner.ip.dst=45.154.216.171  "Stateful: 172.30.0.6 <=> (external)"
DEF  --   17176   --                           "allow"

Outbound Rules
----------------------------------------------------------------------
ID   PRI  HITS   PREDICATES                    ACTION
15   10   0      inner.ether.ether_type=IPv4   "Stateful: 172.30.0.6 <=> 45.154.216.171"
                 meta: router-target=ig        
                                               
16   100  0      inner.ether.ether_type=IPv4   "Stateful: 45.154.216.124:16384-32767"
                 meta: router-target=ig        
                                               
17   255  28267  meta: router-target-class=ig  "Deny"
DEF  --   100    --                            "allow"

For comparison, this is how the output looks for another instance with an ephemeral IP in the same IP pool (note meta: router-target=ig=46452e5f-1ddc-4b7c-9013-114d1a26d936):

BRM42220054 # opteadm dump-layer nat -p opte6
Port opte6 - Layer nat
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO  SRC IP          SPORT  DST IP          DPORT  HITS  ACTION
TCP    3.136.208.236   45761  45.154.216.194  49203  0     NAT
TCP    60.167.165.58   48622  45.154.216.194  22     1     NAT
[SNIP]

Outbound Flows
----------------------------------------------------------------------
PROTO  SRC IP      SPORT  DST IP          DPORT  HITS  ACTION
TCP    172.30.0.5  49203  3.136.208.236   45761  0     NAT
[SNIP]

Inbound Rules
----------------------------------------------------------------------
ID   PRI  HITS    PREDICATES                   ACTION
4    10   197877  inner.ip.dst=45.154.216.194  "Stateful: 172.30.0.5 <=> (external)"
DEF  --   11004   --                           "allow"

Outbound Rules
----------------------------------------------------------------------
ID   PRI  HITS   PREDICATES                                                   ACTION
12   10   13687  inner.ether.ether_type=IPv4                                  "Stateful: 172.30.0.5 <=> 45.154.216.194"
                 meta: router-target=ig=46452e5f-1ddc-4b7c-9013-114d1a26d936  
                                                                              
13   100  0      inner.ether.ether_type=IPv4                                  "Stateful: 45.154.216.87:32768-49151"
                 meta: router-target=ig=46452e5f-1ddc-4b7c-9013-114d1a26d936  
                                                                              
14   255  0      meta: router-target-class=ig                                 "Deny"
DEF  --   5908   --                                                           "allow"

The port in question does have the correct internet gateway ID captured in the OPTE router output:

BRM42220054 # opteadm dump-layer router -p opte4
Port opte4 - Layer router
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO  SRC IP  SPORT  DST IP  DPORT  HITS  ACTION

Outbound Flows
----------------------------------------------------------------------
PROTO  SRC IP  SPORT  DST IP  DPORT  HITS  ACTION

Inbound Rules
----------------------------------------------------------------------
ID   PRI  HITS    PREDICATES  ACTION
DEF  --   212410  --          "allow"

Outbound Rules
----------------------------------------------------------------------
ID   PRI  HITS    PREDICATES                         ACTION
1    31   100     inner.ip.dst=172.30.0.0/22         "Meta: Target = Subnet: 172.30.0.0/22"
2    75   110613  inner.ip.dst=0.0.0.0/0             "Meta: Target = IG(Some(00f46642-721c-45aa-b4da-0534ab36b49f))"
0    139  0       inner.ip6.dst=fd37:ff93:8bab::/64  "Meta: Target = Subnet: fd37:ff93:8bab::/64"
3    267  0       inner.ip6.dst=::/0                 "Meta: Target = IG(Some(00f46642-721c-45aa-b4da-0534ab36b49f))"
DEF  --   0       --                                 "deny"

I wonder if this is because the default IP pool for the silo was changed between when the ephemeral IP was allocated and when the migration script schema/crdb/internet-gateway/up13.sql was executed. The migration script auto-created a default gateway attached to the current default IP pool (the pool named eng-vpn), while the instance has its external IP in the original default pool named public. @FelixMcFelix - thoughts?
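
For anyone trying to confirm this kind of mismatch, the check is to compare the silo's IP pools against the pool linked to the VPC's default internet gateway. A rough sketch; the list subcommands and selector flags are assumptions patterned after the ip-pool attach command cited later in this thread:

# Sketch only: the "list" subcommands and selector flags are assumptions;
# only "internet-gateway ip-pool attach --ip-pool" is confirmed in this thread.
oxide ip-pool list
oxide internet-gateway ip-pool list --project $PROJECT --vpc $VPC --gateway default
# If the gateway is linked only to eng-vpn while the instance's ephemeral IP
# came from public, the NAT rules carry no router-target UUID for that IP.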

@FelixMcFelix (Contributor) commented Dec 30, 2024

I believe you're correct there, Angela. We basically need a link between the public pool and an internet gateway for that association to exist in the NAT table. The simplest fix here would be to move the link on the default internet gateway over to public -- which would upset the other two instances that do have IPs in eng-vpn. In that case, we could move this instance to another VPC subnet with a custom router (and a public-specific gateway).
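
A rough sketch of that link move (hypothetical: the detach subcommand and the selector flags are assumptions; only the attach subcommand appears elsewhere in this thread):

# Hypothetical sketch: "detach" and the selector flags are assumptions.
# Note this would break outbound NAT for the two instances using eng-vpn.
oxide internet-gateway ip-pool detach --ip-pool eng-vpn --project $PROJECT --vpc $VPC --gateway default
oxide internet-gateway ip-pool attach --ip-pool public --project $PROJECT --vpc $VPC --gateway default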

Ideally we would add it as a second pool linked to the gateway, but the API would prevent that today?

I don't know whether this behaviour is particularly good, though. We could change things on the OPTE side so that any untagged IPs (i.e., with no pool link) can be used as a lower-priority option than any explicit IGW matches. The instance is already receiving traffic on that IP, so I think that making sure we can always use it is probably the right thing here.
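
To illustrate the idea (a hypothetical rule table, not output from any current OPTE build), the broken port's outbound NAT rules would gain a lower-priority fallback keyed on the router-target class alone, sitting between the tagged rules and the final Deny:

ID   PRI  HITS  PREDICATES                    ACTION
15   10   0     inner.ether.ether_type=IPv4   "Stateful: 172.30.0.6 <=> 45.154.216.171"
                meta: router-target=ig
18   200  0     inner.ether.ether_type=IPv4   "Stateful: 172.30.0.6 <=> 45.154.216.171"
                meta: router-target-class=ig
17   255  0     meta: router-target-class=ig  "Deny"

With the fallback in place, traffic that today falls through to the Deny rule would instead be NATed via the untagged ephemeral IP.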

@askfongjojo (Author)

@FelixMcFelix: Thanks for weighing in! If this particular problem only happens when there are instances with external IPs on pools that were untagged from the silo default during the IGW migration, we may not want to change how things work to accommodate a one-time trigger. We can perhaps handle this by documenting the procedure (for us or the customer) to restore outbound internet connectivity in the release notes. It's quite possible that nobody besides us happened to change the default IP pool while instances were still using the old default pool.

But if the problem may happen in other situations, we'll need to come up with a solution and understand the trade-offs.

@askfongjojo added the "known issue" label Dec 30, 2024
@morlandi7 morlandi7 added this to the 12 milestone Jan 2, 2025
@askfongjojo askfongjojo modified the milestones: 12, 13 Jan 7, 2025
@askfongjojo changed the title from "Instances with ephemeral IP in previous default IP pool lost outbound internet access after internet gateway migration" to "Instances lost outbound internet access after changing silo default IP pool" Jan 7, 2025
@askfongjojo (Author) commented Jan 7, 2025

The issue title has been updated as this can also happen outside of the release 11 internet gateway migration. Basically, any time none of an instance's external IPs (ephemeral, floating, or SNAT) matches the IP pool linked to the system-created default internet gateway, packets routed through the system IG will be dropped.

There are two scenarios where the mismatch can happen:

  1. Some instances were created prior to r11 (using the old pool for both their ephemeral IP and SNAT) and the default IP pool was changed after instance creation. In this case, the default IG created during the r11 upgrade is linked to the new IP pool and doesn't serve outbound requests from the previously created instances.
  2. Post-r11, the default IP pool was changed and new instances were created using the new default pool. The default IG is still linked to the old IP pool and doesn't serve outbound requests from the new instances.
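
In either case, an affected port is easy to spot on the sled hosting the instance: the NAT layer's catch-all Deny rule accumulates hits while the stateful NAT rules stay at zero, as in the opte4 dump above.

# On the sled, using the same opteadm invocation shown earlier:
opteadm dump-layer nat -p opte4 | grep Deny
# 17   255  28267  meta: router-target-class=ig  "Deny"
# A nonzero HITS count on this rule means outbound packets are being dropped.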

Depending on the intended outcome, we can rectify the instance outbound access issue in one of the following ways:

  • If the goal is to keep using both IP pools: link both the old and new IP pools to the default gateway via oxide internet-gateway ip-pool attach --ip-pool $POOL (where $POOL is the old pool in Case 1 and the new pool in Case 2). This has to be done for each VPC in all the projects of the silo (see the sketch after this list).
  • If the goal is to totally replace the old pool with the new one: in Case 1, delete and recreate the existing instances (without destroying the disks) so that all of their external IPs are allocated from the new default pool. In Case 2, attach the new pool to the default gateway, detach the old one, and delete/recreate any instances created prior to the silo default IP pool change. The gateway IP pool change will need to be done on all VPCs in the silo.
  • If the goal is to use the non-default IP pool only for certain special cases, without recreating the instances or linking both IP pools to the default gateway: the user can create a floating IP from the pool matching the default gateway and attach it to their instance. Another alternative is to create a custom gateway in their VPC linked to the IP pool matching the one used by the instances, create a custom router with the custom gateway as target, and attach the custom router to the VPC subnets.
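
A minimal sketch of the first option (the selector flags are assumptions; --ip-pool is the only flag given above):

# Hypothetical selector flags; repeat for every VPC in every project of the
# silo. $POOL is the old pool in Case 1 and the new pool in Case 2.
oxide internet-gateway ip-pool attach --ip-pool $POOL --project $PROJECT --vpc $VPC --gateway default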

@askfongjojo (Author)

@iliana - To fix the outbound access issue for iliana's VM (or any other instances in the Oxide silo with the same issue), we can use one of the methods mentioned above. I think the easiest is the first one, i.e., simply attaching the public pool to the internet gateway in the VPC associated with your VM.

Also copying @augustuswm here in case you get reports of the same issue from other users.

I'm going to document this in the Troubleshooting guide and also in the release notes as a known issue. I'll also bring this up for triage with the network team, since it'll catch users by surprise when an instance with an external IP somehow doesn't have outbound access.

@iliana (Contributor) commented Jan 7, 2025

In the particular case of this VM ("axiomatic"), I'm going to go ahead and recreate it with an internal IP, since it doesn't really need to be public.

@askfongjojo (Author)

Created https://github.com/oxidecomputer/docs/pull/508 to add this to Known issues in the r12 release notes, along with the workarounds for restoring outbound access on affected instances.
