Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Refresh client cert when CA rotation #934

Merged
merged 2 commits into from
Feb 7, 2025
Merged

Conversation

hessjcg
Copy link
Contributor

@hessjcg hessjcg commented Jan 31, 2025

When client CA is rotated, this can cause TLS read errors after the Dialer.Dial() has returned. The certificate should be refreshed.

This adds logic to the dialer to refresh the connection cache if a TLS error occurs after net.Conn is returned to the database driver.

Fixes #932

@hessjcg hessjcg force-pushed the gh-932-refresh-on-tls-error branch from 6a61292 to 20c6162 Compare January 31, 2025 22:27
@hessjcg hessjcg force-pushed the gh-932-test-certs branch 2 times, most recently from 28e50db to 3a7c320 Compare January 31, 2025 22:31
@hessjcg hessjcg force-pushed the gh-932-refresh-on-tls-error branch from 20c6162 to bb306f8 Compare January 31, 2025 22:32
@hessjcg hessjcg force-pushed the gh-932-refresh-on-tls-error branch from bb306f8 to 0ae60ad Compare January 31, 2025 22:39
@hessjcg hessjcg marked this pull request as ready for review January 31, 2025 22:41
@hessjcg hessjcg requested a review from a team as a code owner January 31, 2025 22:41
@hessjcg hessjcg force-pushed the gh-932-refresh-on-tls-error branch 2 times, most recently from 95d468c to fb0dcbf Compare January 31, 2025 22:47
dialer.go Outdated
Comment on lines 488 to 496
key := createKey(i)
if cachedC, ok := d.cache[key]; ok && cachedC == c {
delete(d.cache, key)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we add a comment around this logic?

What case is this for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments updated. removeCached() may be called multiple times for the same monitoredCache instance, for example if several sockets are open and they all error at the same time.

The logic here will ensure that the cache entry is safely removed from the cache. This accounts for the case where this instance was already removed, and for the case where it was removed and replaced with a new instance.

dialer_test.go Outdated
Comment on lines 1196 to 1198
// Note: this succeeds with lazy refresh, but fails with lazy.
// because dialer.ForceRefresh does not block connections while the
// refresh is in progress.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a typo in this comment...?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 This looks like a dev-artifact that isn't true anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Deleted the comment - it was a dev artifact.

dialer.go Outdated
}, d.dialerID, cn.String())
},
func(err error) {
// ignore EOF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we expand this comment on why we're ignoring EOF? I think EOFs are the server sending an RST packet, but that's not obvious knowledge for future maintainers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

dialer.go Outdated
@@ -435,7 +437,22 @@ func (d *Dialer) Dial(ctx context.Context, icn string, opts ...DialOption) (conn
iConn := newInstrumentedConn(tlsConn, func() {
n := atomic.AddUint64(c.openConnsCount, ^uint64(0)) // c.openConnsCount = c.openConnsCount - 1
trace.RecordOpenConnections(context.Background(), int64(n), d.dialerID, cn.String())
}, d.dialerID, cn.String())
},
func(err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace nit: putting this up a line would make it more obvious it's an argument at a glance.

Alternatively, you could pull this anonymous function out into a variable:

errFunc := func(err error) {
  // ...
}
iConn := newInstrumentedConn(/*...*/, errFunc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -446,12 +463,19 @@ func (d *Dialer) Dial(ctx context.Context, icn string, opts ...DialOption) (conn
}
return iConn, nil
}
func (d *Dialer) isTLSError(err error) bool {
if nErr, ok := err.(net.Error); ok {
return !nErr.Timeout() && // it's a permanent net error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you manually tested this logic as part of doing a cert rotation to confirm it's working as intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Also the new tests in dialer_test.go create genuine TLS certificate errors, so this is covered by unit tests too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding an e2e test for this, but otherwise this seems fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only happens on a support action: when the client CA is rotated, but the server certificate remains the same. There is no public API for that. So I decided e2e was overkill.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got it. I missed that detail. I agree then -- no e2e is fine.

dialer_test.go Outdated
Comment on lines 1196 to 1198
// Note: this succeeds with lazy refresh, but fails with lazy.
// because dialer.ForceRefresh does not block connections while the
// refresh is in progress.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 This looks like a dev-artifact that isn't true anymore?

dialer_test.go Outdated
// Expect an error on read. This should trigger the dialer to refresh.
_, err = io.ReadAll(conn)
if err != nil {
t.Log("Got error on read as expected.", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Logf("Got error on read as expected: %v")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

dialer_test.go Outdated
t.Log("Second attempt should fail...")
_, err := d.Dial(context.Background(), "my-project:my-region:my-instance")
if err != nil {
t.Log("Got error on dial as expected.", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Got/got/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

dialer_test.go Outdated
if err != nil {
t.Log("Got error on dial as expected.", err)
} else {
t.Fatal("Want dial error, got no error")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Want/want/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

dialer_test.go Outdated
t.Log("Third attempt OK.")
}

func TestDialerRefreshesAfterRotateServerCA(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this test and the one above be combined into a table test by parameterizing the only differences between the two tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test parameterized.

if f.serverCAMode == "" || f.serverCAMode == "GOOGLE_MANAGED_INTERNAL_CA" {
// legacy server mode, return only the server cert
return toPEMFormat(f.certs.serverCert)
}
return toPEMFormat(f.certs.casServerCertificate, f.certs.serverIntermediateCaCert, f.certs.serverCaCert)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's omit these changes from the PR since they're unrelated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

r := &Request{
reqMethod: http.MethodGet,
reqPath: fmt.Sprintf("/sql/v1beta4/projects/%s/instances/%s/connectSettings", i.project, i.name),
reqCt: ct,
handle: func(resp http.ResponseWriter, _ *http.Request) {
var ips []*sqladmin.IpMapping
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a necessary change? Looks like formatting to me?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is necessary. The Server CA changes during the test. The ServerCaCert response field needs to hold the up-to-date value. Thus, this method needs to create the response at the time that the request is made.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it -- a comment to that effect above i.serverCACert() would help future maintainers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@hessjcg hessjcg force-pushed the gh-932-test-certs branch 2 times, most recently from bb7814e to 78d62bf Compare February 5, 2025 20:48
@hessjcg hessjcg force-pushed the gh-932-refresh-on-tls-error branch 2 times, most recently from 4b7b13e to 27125b6 Compare February 5, 2025 20:56
Base automatically changed from gh-932-test-certs to main February 5, 2025 21:00
@hessjcg hessjcg force-pushed the gh-932-refresh-on-tls-error branch from 27125b6 to 0f5f778 Compare February 5, 2025 21:01
@@ -446,12 +463,19 @@ func (d *Dialer) Dial(ctx context.Context, icn string, opts ...DialOption) (conn
}
return iConn, nil
}
func (d *Dialer) isTLSError(err error) bool {
if nErr, ok := err.(net.Error); ok {
return !nErr.Timeout() && // it's a permanent net error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding an e2e test for this, but otherwise this seems fine.

) {
d.logger.Debugf(
ctx,
"[%v] Removing connection info from cache: %v",
i.String(),
err,
)

// If this instance of monitoredCache is still in the cache, remove it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we just always call delete, regardless of whether the monitoredCache is present or not in cache?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removeCached() method can be called multiple times for the same monitoredCache instance, and needs to be idempotent. Therefore, it should only delete the cache entry if the monitoredCache instance is the same. Otherwise, it is not idempotent and will cause incorrect behavior.

Here's an example scenario:

Suppose at the start that d.cache["proj:reg:inst"] contains a valid monitoredCache instance.

Then, the instance "proj:reg:inst" Client CA certificates for are rotated. Now the dialer no longer has valid metadata for this instance, new sockets will fail with a TLS error after being returned from d.Dial().

The following actions occur across 3 different goroutines. By chance, they occur in this specific order.

  • (goroutine 1) The application calls d.Dial("proj:reg:inst"), it establishes a connection that is about to fail.
  • (goroutine 2) The application calls d.Dial("proj:reg:inst"), it establishes a second connection that is about to fail.
  • (goroutine 1) The first connection fails, and removeCached(c) is called. d.cache["proj:reg:inst"] == c, so the the invalid monitoredCache instance removed from the d.cache. Then the invalid monitoredCache instance is closed.
  • (goroutine 3) Then the application calls d.Dial() again, Dialer does not contain a cache entry, so it fetches the instance data and creates a new, valid monitoredCache instance. A third successful connection is created.

Now d.cache contains a new, valid monitoredCache instance that was created during the third call to d.Dial(). However the second connection hasn't failed yet. Then...

  • (goroutine 2) The second connection fails, and removeCached(c) is called for the old, invalid instance. d.cache["proj:reg:inst"] != c. Instead, the cache holds the valid, new monitoredCache instance.

The removeCache(c) method should leave the new, valid monitoredCache instance in the cache.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That's probably an edge case, but also worth solving for. Maybe add a note like, "in the case of multiple goroutines creating cache entries, only purge the cache if it's for a known failed attempt and don't overwrite separate goroutine entries." This is a subtle little detail otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe just:

// If this instance was already removed from the cache or *a separate goroutine* replaced it with
// a new instance, do nothing.

dialer_test.go Outdated
wantErrorOnDial bool
wantErrorOnRead bool
useLazyRefresh bool
}{{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this isn't consistently formatted as below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

dialer_test.go Outdated

opts := []Option{
WithTokenSource(mock.EmptyTokenSource{}),
//WithDebugLogger(&dialerTestLogger{t: t}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I uncommented it. It helps to understand the test.

r := &Request{
reqMethod: http.MethodGet,
reqPath: fmt.Sprintf("/sql/v1beta4/projects/%s/instances/%s/connectSettings", i.project, i.name),
reqCt: ct,
handle: func(resp http.ResponseWriter, _ *http.Request) {
var ips []*sqladmin.IpMapping
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it -- a comment to that effect above i.serverCACert() would help future maintainers.

@hessjcg hessjcg force-pushed the gh-932-refresh-on-tls-error branch 2 times, most recently from eb3a6db to 6b9221f Compare February 6, 2025 18:14
@hessjcg hessjcg requested a review from enocom February 6, 2025 20:32
) {
d.logger.Debugf(
ctx,
"[%v] Removing connection info from cache: %v",
i.String(),
err,
)

// If this instance of monitoredCache is still in the cache, remove it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That's probably an edge case, but also worth solving for. Maybe add a note like, "in the case of multiple goroutines creating cache entries, only purge the cache if it's for a known failed attempt and don't overwrite separate goroutine entries." This is a subtle little detail otherwise.

) {
d.logger.Debugf(
ctx,
"[%v] Removing connection info from cache: %v",
i.String(),
err,
)

// If this instance of monitoredCache is still in the cache, remove it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe just:

// If this instance was already removed from the cache or *a separate goroutine* replaced it with
// a new instance, do nothing.

Copy link
Collaborator

@jackwotherspoon jackwotherspoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✅

Thanks for the extra assist here @enocom 👏

@hessjcg hessjcg force-pushed the gh-932-refresh-on-tls-error branch from 6b9221f to 76c0a59 Compare February 7, 2025 18:59
@hessjcg hessjcg merged commit c22e2d4 into main Feb 7, 2025
14 checks passed
@hessjcg hessjcg deleted the gh-932-refresh-on-tls-error branch February 7, 2025 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When the server CA certificates are rotated, refresh the client certificates.
3 participants