-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Refresh client cert when CA rotation #934
Conversation
6a61292
to
20c6162
Compare
28e50db
to
3a7c320
Compare
20c6162
to
bb306f8
Compare
3a7c320
to
d237673
Compare
bb306f8
to
0ae60ad
Compare
d237673
to
1d5dd2e
Compare
95d468c
to
fb0dcbf
Compare
dialer.go
Outdated
key := createKey(i) | ||
if cachedC, ok := d.cache[key]; ok && cachedC == c { | ||
delete(d.cache, key) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we add a comment around this logic?
What case is this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments updated. removeCached() may be called multiple times for the same monitoredCache instance, for example if several sockets are open and they all error at the same time.
The logic here will ensure that the cache entry is safely removed from the cache. This accounts for the case where this instance was already removed, and for the case where it was removed and replaced with a new instance.
dialer_test.go
Outdated
// Note: this succeeds with lazy refresh, but fails with lazy. | ||
// because dialer.ForceRefresh does not block connections while the | ||
// refresh is in progress. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a typo in this comment...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 This looks like a dev-artifact that isn't true anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Deleted the comment - it was a dev artifact.
dialer.go
Outdated
}, d.dialerID, cn.String()) | ||
}, | ||
func(err error) { | ||
// ignore EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we expand this comment on why we're ignoring EOF? I think EOFs are the server sending an RST packet, but that's not obvious knowledge for future maintainers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
dialer.go
Outdated
@@ -435,7 +437,22 @@ func (d *Dialer) Dial(ctx context.Context, icn string, opts ...DialOption) (conn | |||
iConn := newInstrumentedConn(tlsConn, func() { | |||
n := atomic.AddUint64(c.openConnsCount, ^uint64(0)) // c.openConnsCount = c.openConnsCount - 1 | |||
trace.RecordOpenConnections(context.Background(), int64(n), d.dialerID, cn.String()) | |||
}, d.dialerID, cn.String()) | |||
}, | |||
func(err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace nit: putting this up a line would make it more obvious it's an argument at a glance.
Alternatively, you could pull this anonymous function out into a variable:
errFunc := func(err error) {
// ...
}
iConn := newInstrumentedConn(/*...*/, errFunc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -446,12 +463,19 @@ func (d *Dialer) Dial(ctx context.Context, icn string, opts ...DialOption) (conn | |||
} | |||
return iConn, nil | |||
} | |||
func (d *Dialer) isTLSError(err error) bool { | |||
if nErr, ok := err.(net.Error); ok { | |||
return !nErr.Timeout() && // it's a permanent net error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you manually tested this logic as part of doing a cert rotation to confirm it's working as intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Also the new tests in dialer_test.go create genuine TLS certificate errors, so this is covered by unit tests too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth adding an e2e test for this, but otherwise this seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only happens on a support action: when the client CA is rotated, but the server certificate remains the same. There is no public API for that. So I decided e2e was overkill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah got it. I missed that detail. I agree then -- no e2e is fine.
dialer_test.go
Outdated
// Note: this succeeds with lazy refresh, but fails with lazy. | ||
// because dialer.ForceRefresh does not block connections while the | ||
// refresh is in progress. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 This looks like a dev-artifact that isn't true anymore?
dialer_test.go
Outdated
// Expect an error on read. This should trigger the dialer to refresh. | ||
_, err = io.ReadAll(conn) | ||
if err != nil { | ||
t.Log("Got error on read as expected.", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.Logf("Got error on read as expected: %v")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
dialer_test.go
Outdated
t.Log("Second attempt should fail...") | ||
_, err := d.Dial(context.Background(), "my-project:my-region:my-instance") | ||
if err != nil { | ||
t.Log("Got error on dial as expected.", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Got/got/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
dialer_test.go
Outdated
if err != nil { | ||
t.Log("Got error on dial as expected.", err) | ||
} else { | ||
t.Fatal("Want dial error, got no error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Want/want/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
dialer_test.go
Outdated
t.Log("Third attempt OK.") | ||
} | ||
|
||
func TestDialerRefreshesAfterRotateServerCA(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this test and the one above be combined into a table test by parameterizing the only differences between the two tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test parameterized.
internal/mock/cloudsql.go
Outdated
if f.serverCAMode == "" || f.serverCAMode == "GOOGLE_MANAGED_INTERNAL_CA" { | ||
// legacy server mode, return only the server cert | ||
return toPEMFormat(f.certs.serverCert) | ||
} | ||
return toPEMFormat(f.certs.casServerCertificate, f.certs.serverIntermediateCaCert, f.certs.serverCaCert) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's omit these changes from the PR since they're unrelated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
r := &Request{ | ||
reqMethod: http.MethodGet, | ||
reqPath: fmt.Sprintf("/sql/v1beta4/projects/%s/instances/%s/connectSettings", i.project, i.name), | ||
reqCt: ct, | ||
handle: func(resp http.ResponseWriter, _ *http.Request) { | ||
var ips []*sqladmin.IpMapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a necessary change? Looks like formatting to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is necessary. The Server CA changes during the test. The ServerCaCert response field needs to hold the up-to-date value. Thus, this method needs to create the response at the time that the request is made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it -- a comment to that effect above i.serverCACert()
would help future maintainers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
bb7814e
to
78d62bf
Compare
4b7b13e
to
27125b6
Compare
wip: merged down new tests
27125b6
to
0f5f778
Compare
@@ -446,12 +463,19 @@ func (d *Dialer) Dial(ctx context.Context, icn string, opts ...DialOption) (conn | |||
} | |||
return iConn, nil | |||
} | |||
func (d *Dialer) isTLSError(err error) bool { | |||
if nErr, ok := err.(net.Error); ok { | |||
return !nErr.Timeout() && // it's a permanent net error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth adding an e2e test for this, but otherwise this seems fine.
) { | ||
d.logger.Debugf( | ||
ctx, | ||
"[%v] Removing connection info from cache: %v", | ||
i.String(), | ||
err, | ||
) | ||
|
||
// If this instance of monitoredCache is still in the cache, remove it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we just always call delete, regardless of whether the monitoredCache is present or not in cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removeCached()
method can be called multiple times for the same monitoredCache instance, and needs to be idempotent. Therefore, it should only delete the cache entry if the monitoredCache instance is the same. Otherwise, it is not idempotent and will cause incorrect behavior.
Here's an example scenario:
Suppose at the start that d.cache["proj:reg:inst"]
contains a valid monitoredCache instance.
Then, the instance "proj:reg:inst" Client CA certificates for are rotated. Now the dialer no longer has valid metadata for this instance, new sockets will fail with a TLS error after being returned from d.Dial()
.
The following actions occur across 3 different goroutines. By chance, they occur in this specific order.
- (goroutine 1) The application calls
d.Dial("proj:reg:inst")
, it establishes a connection that is about to fail. - (goroutine 2) The application calls
d.Dial("proj:reg:inst")
, it establishes a second connection that is about to fail. - (goroutine 1) The first connection fails, and
removeCached(c)
is called.d.cache["proj:reg:inst"] == c
, so the the invalid monitoredCache instance removed from thed.cache
. Then the invalid monitoredCache instance is closed. - (goroutine 3) Then the application calls d.Dial() again, Dialer does not contain a cache entry, so it fetches the instance data and creates a new, valid monitoredCache instance. A third successful connection is created.
Now d.cache
contains a new, valid monitoredCache instance that was created during the third call to d.Dial()
. However the second connection hasn't failed yet. Then...
- (goroutine 2) The second connection fails, and
removeCached(c)
is called for the old, invalid instance.d.cache["proj:reg:inst"] != c
. Instead, the cache holds the valid, new monitoredCache instance.
The removeCache(c) method should leave the new, valid monitoredCache instance in the cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. That's probably an edge case, but also worth solving for. Maybe add a note like, "in the case of multiple goroutines creating cache entries, only purge the cache if it's for a known failed attempt and don't overwrite separate goroutine entries." This is a subtle little detail otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe just:
// If this instance was already removed from the cache or *a separate goroutine* replaced it with
// a new instance, do nothing.
dialer_test.go
Outdated
wantErrorOnDial bool | ||
wantErrorOnRead bool | ||
useLazyRefresh bool | ||
}{{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this isn't consistently formatted as below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
dialer_test.go
Outdated
|
||
opts := []Option{ | ||
WithTokenSource(mock.EmptyTokenSource{}), | ||
//WithDebugLogger(&dialerTestLogger{t: t}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I uncommented it. It helps to understand the test.
r := &Request{ | ||
reqMethod: http.MethodGet, | ||
reqPath: fmt.Sprintf("/sql/v1beta4/projects/%s/instances/%s/connectSettings", i.project, i.name), | ||
reqCt: ct, | ||
handle: func(resp http.ResponseWriter, _ *http.Request) { | ||
var ips []*sqladmin.IpMapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it -- a comment to that effect above i.serverCACert()
would help future maintainers.
eb3a6db
to
6b9221f
Compare
) { | ||
d.logger.Debugf( | ||
ctx, | ||
"[%v] Removing connection info from cache: %v", | ||
i.String(), | ||
err, | ||
) | ||
|
||
// If this instance of monitoredCache is still in the cache, remove it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. That's probably an edge case, but also worth solving for. Maybe add a note like, "in the case of multiple goroutines creating cache entries, only purge the cache if it's for a known failed attempt and don't overwrite separate goroutine entries." This is a subtle little detail otherwise.
) { | ||
d.logger.Debugf( | ||
ctx, | ||
"[%v] Removing connection info from cache: %v", | ||
i.String(), | ||
err, | ||
) | ||
|
||
// If this instance of monitoredCache is still in the cache, remove it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe just:
// If this instance was already removed from the cache or *a separate goroutine* replaced it with
// a new instance, do nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ✅
Thanks for the extra assist here @enocom 👏
6b9221f
to
76c0a59
Compare
When client CA is rotated, this can cause TLS read errors after the
Dialer.Dial()
has returned. The certificate should be refreshed.This adds logic to the dialer to refresh the connection cache if a TLS error occurs after
net.Conn
is returned to the database driver.Fixes #932