[RSDK-8292] Create Single App Connection and Use for Net Appender and Restart Checker #4746

Merged
merged 52 commits into from
Jan 29, 2025
Changes from 7 commits
Commits
09186d8
barely stub out new connection wrapper
bashar-515 Jan 23, 2025
e18211c
checkpoint
bashar-515 Jan 24, 2025
1306520
repeatedly retry connecting to App
bashar-515 Jan 27, 2025
5e5a791
no longer store initial error
bashar-515 Jan 27, 2025
7543b72
lint
bashar-515 Jan 27, 2025
b333e48
lint again
bashar-515 Jan 27, 2025
8827cd4
remove TODO comment
bashar-515 Jan 27, 2025
67d8e55
lint
bashar-515 Jan 27, 2025
1394587
log error
bashar-515 Jan 27, 2025
88e576b
return error
bashar-515 Jan 27, 2025
c9d46a9
lint
bashar-515 Jan 27, 2025
a974c16
use sublogger
bashar-515 Jan 27, 2025
1fd26c1
create context.Context within NewAppConn()
bashar-515 Jan 27, 2025
0511a68
use context.Context passed in as function parameter in repeated dials…
bashar-515 Jan 27, 2025
b87a1b5
no longer store error
bashar-515 Jan 27, 2025
5c4f7d3
always dial again
bashar-515 Jan 27, 2025
c4799cf
rename logger
bashar-515 Jan 27, 2025
11aa86f
add TODO comment
bashar-515 Jan 27, 2025
92ceb64
use unified connection in RestartChecker
bashar-515 Jan 27, 2025
a8a5af3
remove context.Context
bashar-515 Jan 27, 2025
1baabf6
lint
bashar-515 Jan 27, 2025
eb11c9e
check parent context
bashar-515 Jan 27, 2025
6f3f163
check parent context
bashar-515 Jan 27, 2025
a3b9956
restore web/server/restart_checker.go file
bashar-515 Jan 28, 2025
6ecf9ef
revert to checkpoint
bashar-515 Jan 28, 2025
6303b44
create connection to App before checking cloud config
bashar-515 Jan 28, 2025
c8c1b5d
refactor error conditions
bashar-515 Jan 28, 2025
43d718f
defer context cancel and eliminate race condition
bashar-515 Jan 28, 2025
1963013
lint
bashar-515 Jan 28, 2025
5e2b24b
release lock sooner
bashar-515 Jan 28, 2025
59101de
use stoppable workers
bashar-515 Jan 28, 2025
e2a592e
add comment to Close()
bashar-515 Jan 28, 2025
c626935
lint
bashar-515 Jan 28, 2025
51a9b3d
make comments more concise
bashar-515 Jan 28, 2025
2e9033b
only stop stoppable workers if non-nil
bashar-515 Jan 28, 2025
31dee57
change spacing
bashar-515 Jan 28, 2025
f1615a2
rephrase comments
bashar-515 Jan 28, 2025
f035a59
lint
bashar-515 Jan 28, 2025
93b7314
create connection to app without checking conditions
bashar-515 Jan 28, 2025
48dc67b
check for non-nil cloud config before creating connection to App
bashar-515 Jan 29, 2025
f0c8f07
use global connection to App in restart checker
bashar-515 Jan 29, 2025
40e87cf
clean up code
bashar-515 Jan 29, 2025
3ab3a53
lint
bashar-515 Jan 29, 2025
ded9d43
add to server connection to App
bashar-515 Jan 29, 2025
ff6953e
no longer check if error when initially dialing is due to time out
bashar-515 Jan 29, 2025
3da6860
use Debugw()
bashar-515 Jan 29, 2025
8f67114
lint
bashar-515 Jan 29, 2025
725d4d9
comment that the connection to App can be nil
bashar-515 Jan 29, 2025
24ba32f
only acquire lock if dial succeeds
bashar-515 Jan 29, 2025
728d692
lint
bashar-515 Jan 29, 2025
78fa924
remove comment from final return statement in AppConn constructor
bashar-515 Jan 29, 2025
b477b5a
remove TODO comment
bashar-515 Jan 29, 2025
6 changes: 3 additions & 3 deletions config/reader.go
@@ -181,7 +181,7 @@ func isLocationSecretsEqual(prevCloud, cloud *Cloud) bool {
return true
}

-func getTimeoutCtx(ctx context.Context, shouldReadFromCache bool, id string) (context.Context, func()) {
+func GetTimeoutCtx(ctx context.Context, shouldReadFromCache bool, id string) (context.Context, func()) {
timeout := readTimeout
// When environment indicates we are behind a proxy, bump timeout. Network
// operations tend to take longer when behind a proxy.
@@ -257,7 +257,7 @@ func readFromCloud(
if !cfg.Cloud.SignalingInsecure && (checkForNewCert || tls.certificate == "" || tls.privateKey == "") {
logger.Debug("reading tlsCertificate from the cloud")

-ctxWithTimeout, cancel := getTimeoutCtx(ctx, shouldReadFromCache, cloudCfg.ID)
+ctxWithTimeout, cancel := GetTimeoutCtx(ctx, shouldReadFromCache, cloudCfg.ID)
certData, err := readCertificateDataFromCloudGRPC(ctxWithTimeout, cloudCfg, logger)
if err != nil {
cancel()
@@ -609,7 +609,7 @@ func processConfig(unprocessedConfig *Config, fromCloud bool, logger logging.Log
func getFromCloudOrCache(ctx context.Context, cloudCfg *Cloud, shouldReadFromCache bool, logger logging.Logger) (*Config, bool, error) {
var cached bool

-ctxWithTimeout, cancel := getTimeoutCtx(ctx, shouldReadFromCache, cloudCfg.ID)
+ctxWithTimeout, cancel := GetTimeoutCtx(ctx, shouldReadFromCache, cloudCfg.ID)
defer cancel()

cfg, errorShouldCheckCache, err := getFromCloudGRPC(ctxWithTimeout, cloudCfg, logger)
85 changes: 85 additions & 0 deletions grpc/app_conn.go
@@ -0,0 +1,85 @@
package grpc

import (
"context"
"net/url"
"time"

"github.com/pkg/errors"

"go.viam.com/rdk/config"
"go.viam.com/rdk/logging"
"go.viam.com/utils/rpc"
)

type AppConn struct {
ReconfigurableClientConn
Member Author
Given that an AppConn instance attempts to establish a connection in a background Goroutine, all calls to its methods would require a nil check and the acquisition of a lock. This code/functionality is already written out in the ReconfigurableClientConn, so I figured it'd be best to just reuse that.


// Err stores the most recent error returned by the serialized dial attempts running in the background. It can also be used to tell
// whether dial attempts are currently happening; If err is a non-nil value, dial attempts have stopped. Accesses to Err should respect
// ReconfigurableClientConn.connMu
Err error
Member
do we use this anywhere? If not, would prefer to not have it at all

Member Author
Right now, no. I'll get rid of it. I only included it in this PR bc I felt bad throwing away the error every time but agree that it has no use atm.

}

func NewAppConn(ctx context.Context, cloud *config.Cloud, logger logging.Logger) (*AppConn, error) {
grpcURL, err := url.Parse(cloud.AppAddress)
if err != nil {
return nil, err
}

dialOpts := dialOpts(cloud)

if grpcURL.Scheme == "http" {
dialOpts = append(dialOpts, rpc.WithInsecure())
}

appConn := &AppConn{}

// a lock is not necessary here because this call is blocking
appConn.conn, err = rpc.DialDirectGRPC(ctx, grpcURL.Host, logger, dialOpts...)
if err != nil {
if errors.Is(err, context.DeadlineExceeded) {
go func() {
for {
appConn.connMu.Lock()

ctxWithTimeOut, ctxWithTimeOutCancel := context.WithTimeout(context.Background(), 5*time.Second)
Member Author
@bashar-515 bashar-515 Jan 27, 2025
We can't use the context.Context passed to NewAppConn() when creating this context.WithTimeout because the one passed as a function parameter has its own timeout and is cancelled when this function returns (during which this Goroutine might still be running).

Member
using the context passed in would be good because then a context cancellation stemming from killing the viam-server will propagate properly


appConn.conn, err = rpc.DialDirectGRPC(ctxWithTimeOut, grpcURL.Host, logger, dialOpts...)
if errors.Is(err, context.DeadlineExceeded) {
appConn.connMu.Unlock()

// only dial again if previous attempt timed out
Member
is context.DeadlineExceeded the only error you get if you're in low-connectivity environments, or if you were out of and then back in wifi range? I would be ok with this routine running in the background for as long as viam-server is running no matter what the error, since this connection is so important

Member
(the routine running in the background for the entire lifetime of the viam-server while we don't have a connection)

Member Author
Got it. Should we log this error? Only log if it's not a timeout? Always log?

Member
@cheukt cheukt Jan 27, 2025
ok to always log at debug level, may reassess if it's really annoying

continue
}

ctxWithTimeOutCancel()

appConn.Err = err
Member Author
For now, I'm handling errors returned by the background dials by simply storing them in the AppConn struct and stopping the repeating dial attempts.

Member
see my other question around this, I'm not sure if we need this field (yet)

Member
eventually we'd want to expose whether the connection is connected or not, but that can come later (the connection being not nil does not tell us whether it is disconnected or not)


break
}

appConn.connMu.Unlock()
}()
} else {
return nil, err
}
}

return appConn, nil
}

func dialOpts(cloud *config.Cloud) []rpc.DialOption {
dialOpts := make([]rpc.DialOption, 0, 2)
// Only add credentials when secret is set.
if cloud.Secret != "" {
dialOpts = append(dialOpts, rpc.WithEntityCredentials(cloud.ID,
rpc.Credentials{
Type: "robot-secret",
Payload: cloud.Secret,
},
))
}
return dialOpts
}
6 changes: 5 additions & 1 deletion web/server/entrypoint.go
@@ -24,6 +24,7 @@ import (
"go.viam.com/utils/rpc"

"go.viam.com/rdk/config"
+"go.viam.com/rdk/grpc"
"go.viam.com/rdk/logging"
"go.viam.com/rdk/resource"
"go.viam.com/rdk/robot"
@@ -191,13 +192,16 @@ func RunServer(ctx context.Context, args []string, _ logging.Logger) (err error)
// Start remote logging with config from disk.
// This is to ensure we make our best effort to write logs for failures loading the remote config.
if cfgFromDisk.Cloud != nil && (cfgFromDisk.Cloud.LogPath != "" || cfgFromDisk.Cloud.AppAddress != "") {
+ctxWithTimeout, ctxWithTimeoutCancel := config.GetTimeoutCtx(ctx, true, cfgFromDisk.Cloud.ID)
Member
I would actually create the ctx with timeout inside the NewAppConn function, since I would consider the lifetime of AppConn to be the lifetime of the viam-server. so it'd make more sense to just pass in the server context and then AppConn can branch off of it if needed

+appConn, err := grpc.NewAppConn(ctxWithTimeout, cfgFromDisk.Cloud, logger) // TODO(RSDK-8292): [q] what logger should I pass here?
Member Author
[qu]

Member
a sublogger, possibly networking.app_connection

+ctxWithTimeoutCancel()
netAppender, err := logging.NewNetAppender(
&logging.CloudConfig{
AppAddress: cfgFromDisk.Cloud.AppAddress,
ID: cfgFromDisk.Cloud.ID,
Secret: cfgFromDisk.Cloud.Secret,
},
-nil, false, logger.Sublogger("networking").Sublogger("netlogger"),
+appConn, false, logger.Sublogger("networking").Sublogger("netlogger"),
)
if err != nil {
return err