-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Request to add DNS server implementation for sidecar #328
Conversation
Thank you for this PR and this valuable feature addition. I would like to make sure that we reuse the local registry cache instead of making a remote API call for every query. @zvi not related to this PR, but we need to think about how SRV records figure in this scenario. It would mostly require merging rules that contain backend weights and priorities. At some point I would like to convert our examples to use DNS as well. |
return nil | ||
} | ||
|
||
// Shutdown stpos the DNS server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo. Stops
|
||
switch endPointType { | ||
case "tcp": | ||
ip, err = validateEndPointTypeTCPAndUDP(serviceInstance.Endpoint.Value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two can become one switch statement.
case "udp": | ||
ip, err = validateEndPointTypeTCPAndUDP(serviceInstance.Endpoint.Value) | ||
|
||
case "http": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need https as well
} | ||
|
||
func validateEndPointTypeHTTP(value string) (net.IP, error) { | ||
parsedURL, err := url.Parse(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YOu may not even have URLs in value. That case has to be handled. Today, that is the default case as well.
} | ||
return record | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DO we have support for SRV records?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as of yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating SRV records is a no brainer here. Its a simple function, where you populate the fields with the proto, IP, port and weight/priorities information. The last two can be obtained from the controller rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simple support (without weight/priorities) for SRV should be pretty straight-forward. @tomerGolany can you please look into adding support for that?
ServiceInstances, err = s.config.DiscoveryClient.ListServiceInstances(serviceName) | ||
} else { | ||
filters := client.InstanceFilter{ServiceName: serviceName, Tags: tags} | ||
ServiceInstances, err = s.config.DiscoveryClient.ListInstances(filters) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zcahana @GregHanson is this going to remote registry or is it going to the local cache in sidecar for registry and controller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code, as is wired today, is calling the remote registry.
@tomerGolany has started working on another change-set so that this will be served from the local cache. Specifically, he is changing the RegistryMonitor
so that it'll also implement the github.com/amalgam8/amalgam8/registry/client.Discovery
interface, and to inject that into the dns server, instead of the existing HTTP-based discovery client.
I would suggest moving the DNS resolution capability to the registry instead of the sidecar. When a service is using DNS-based service discovery (e.g., Consul, SkyDNS), then it makes sense to move the DNS server to the control plane, instead of having a DNS server for every sidecar. Is there a use case that I am missing here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my new comments, esp regarding tags, choice of instances, etc.
var ServiceInstances []*client.ServiceInstance | ||
var err error | ||
if len(tags) == 0 { | ||
ServiceInstances, err = s.config.DiscoveryClient.ListServiceInstances(serviceName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If tags are not provided, as will often be the case, this should actually look up the service in the rules to see if there are specific rules for that service. The rules provide the default set of tags to use for the service, and if there are no tags, they provide weights for each set of tags. This would enable the weighted load balancing. And it covers use cases such as route to instances in the same data center. Arbitrarily choosing an instance will end up in a very suboptimal setup.
In the current implementation, the DNS based lookup is essentially bypassing everything that we did for tag/rule based routing, weights, actions and such. So, lets please use the controller rules to add some smarts to the DNS responses.
It also brings into focus a bigger question: you will now be effectively duplicating what the nginx proxy is doing by doing version-based routing via DNS.
Finally, how do you propose to route all DNS calls to the sidecar? Replacing the nameservers in /etc/resolv.conf ? or some other trick? I don't see that in the patch set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicating the rules processing logic is indeed a big question. There's a non-trivial amount of work/code to be written to support that, and of course a maintenance cost. I would say this is feasible only if DNS support gains momentum and clients.
Alternatively, if we ever to replace nginx with our own Golang-based proxy (e.g., by integrating/extending Træfɪk, Caddy, or Skipper), we can easily share the same rules processing logic with the DNS and proxy'ing components. There are also other benefits in having a Go-based proxy, but that's outside the scope of this PR.
case dns.TypeAAAA: | ||
case dns.TypeANY: | ||
default: | ||
response.SetRcode(request, dns.RcodeServerFailure) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that in Golang, you don't need to have expressions below a case statement and that there is no automatic fall through. But that said, for the sake of readability, it would be much nicer if you hoisted the s.retrieveServices
into the switch case.
Secondly, given that the code below is only supporting TYPEA and TYPEAAA, you should not be having a TYPE_ANY here. ANY is a very bad idea as well. Folks like CloudFlare have deprecated it.
Finally, given that you are retrieving multiple service instances, supporting DNS_SRV seems like a no brainer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This switch-case style is a conventional pattern in Golang. You'll find similar usages in other parts of our code, as well as stdlib, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the TYPE_ANY - reading this, CloudFlare have deprecated it mostly due to the complexity involved in supporting it. I don't think this complexity is relevant to our tiny, single-client DNS server. Either way, I'm not strictly against removing it, I'm just saying supporting it is probably not that of a bad idea.
continue | ||
} | ||
numOfMatchingRecords++ | ||
if ip.To4() != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use the dns.QuestionType here to determine if the user wants a A, AAAA or a SRV record and return the appropriate list.
} | ||
return record | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating SRV records is a no brainer here. Its a simple function, where you populate the fields with the proto, IP, port and weight/priorities information. The last two can be obtained from the controller rules.
After discussing with @tomerGolany, here are the proposed changes to support SRV records:
|
@zcahana "Priority/weight will be uniformly assigned to 1 (at least until we figure if/how to support routing rules)." The sidecar already has a local copy of the rules. If there are gRPC clients that are using this sidecar for resolving their DNS SRV requests, they can take advantage of the weights and priorities for version-based routing. Also, FYI: a simple DNS server in nginx: https://gist.github.com/agentzh/6c50d37510daef792ed220fa0d970393 |
filters := client.InstanceFilter{ServiceName: serviceName, Tags: tags} | ||
ServiceInstances, err = s.config.DiscoveryClient.ListInstances(filters) | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented-out code can/should be removed..
@@ -140,37 +143,140 @@ func (s *Server) handleQuestion(question dns.Question, request, response *dns.Ms | |||
} | |||
|
|||
func (s *Server) retrieveServices(question dns.Question, request, response *dns.Msg) ([]*client.ServiceInstance, error) { | |||
var ServiceInstances []*client.ServiceInstance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by convention, local variable names start with a lowercase letter (--> serviceInstances
).
return ServiceInstances, err | ||
} | ||
|
||
func (s *Server) retrieveServicesFromSRVQuery(serviceName string, tagOrProtocol string) ([]*client.ServiceInstance, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no real need for both retrieveServicesFromSRVQuery()
and retrieveServicesFromRegularQuery()
, which mostly duplicate each other. Here's one implementation that does both:
func (s *Server) retrieveInstancesForServiceQuery(serviceName string, tagOrProtocol ...string) ([]*client.ServiceInstance, error) {
protocol := ""
tags := make([]string, 0, len(tagOrProtocol))
// Split tags and protocol filters
for _, tag := range tagsOrProtocol {
switch tag {
case "tcp", "udp", "http", "https":
if protocol != "" {
return nil, fmt.Errorf("invalid DNS query: more than one protocol specified")
}
protocol = tag
default:
tags = append(tags, tag)
}
}
filters := client.InstanceFilter{ServiceName: serviceName, Tags: tags}
// Dispatch query to registry
serviceInstances, err := s.config.DiscoveryClient.ListInstances(filters)
if err != nil {
return nil, err
}
// Apply protocol filter
if protocol != "" {
k := 0
for i, serviceInstance := range serviceInstances {
if serviceInstance.Endpoint.Type == protocol {
serviceInstances[k] = serviceInstance
k++
}
}
serviceInstances = serviceInstances[:k]
}
return serviceInstances, nil
}
|
||
} | ||
|
||
func (s *Server) retrieveServicesFromInstanceQuery(instanceID string) ([]*client.ServiceInstance, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aligned with the function name in prev comment - retrieveInstancesForInstanceQuery()
} | ||
|
||
func (s *Server) retrieveServicesFromInstanceQuery(instanceID string) ([]*client.ServiceInstance, error) { | ||
ServiceInstances, err := s.config.DiscoveryClient.ListInstances(client.InstanceFilter{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto regarding lowercase variable name.
fullDomainRequestArray := dns.SplitDomainName(question.Name) | ||
domainName := fullDomainRequestArray[len(fullDomainRequestArray)-1] | ||
instanceID := serviceInstance.ID | ||
targetName := instanceID + ".instance." + domainName + "." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More efficient (less string allocations/copying):
fmt.Sprintf("%s.instance.%s.", instanceID, domainName)
func validateEndPointTypeHTTP(value string) (net.IP,string, error) { | ||
startsWithHTTP := strings.HasPrefix(value, "http://") | ||
startsWithHTTPS := strings.HasPrefix(value, "https://") | ||
if !startsWithHTTPS && !startsWithHTTP { | ||
value += "http://" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appends "http://" to the end of value, e.g. 172.107.0.10:8080http://
. Not what you want :)
func validateEndPointTypeTCPAndUDP(value string) (net.IP, error) { | ||
ip, _, err := net.SplitHostPort(value) | ||
func validateEndPointTypeTCPAndUDP(value string) (net.IP,string, error) { | ||
ip, port, err := net.SplitHostPort(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
net.SplitHostPort()
doesn't work when no port is specified (e.g. 172.17.0.10
): https://play.golang.org/p/kLnRJxrQAi
I think you'll have to parse it on your own.
} | ||
host := parsedURL.Host | ||
ip, _, err := net.SplitHostPort(host) | ||
ip, port , err := net.SplitHostPort(host) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For http/https endpoint type, the port is optional. If not specified, it should default to 80 for http, and 443 for https.
|
||
} | ||
|
||
func (suite *TestSuite) TestRequestsSRVWuthTag() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo 😄
response.SetRcode(request, dns.RcodeNameError) | ||
return nil, fmt.Errorf("service name wasn't included in domain %s", question.Name) | ||
} | ||
if fullDomainRequestArray[numberOfLabels-2] == "service" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this hard coded subdomain?
Perhaps an easier way would be to let people specify their local search domains in the config file and we serve srv or A records for those local domains while forwarding rest to other DNS servers.
var err error | ||
// parse query : | ||
// Query format: | ||
// [tag or endpoint type]*.<service>.service.<domain>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use the DNS domain specified in the config? This would let people use custom DNS domains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean why are you using the hard coded "service" value in the DNS name instead of reusing the one provided by user in config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rshriram The top-level domain is configurable by the client (defaults to amalgam8
).
The hardcoded 'service'/'instance' labels are second-level domains, which are used to distinguish service records vs. instance records (used with SRV records).
For example, reviews.service.amalgam8
vs. 98q7wfah4.instance.amalgam8
More details here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there a better way rather than asking the user to explicitly specify whether they are querying a service or an instance ? I feel that most of the time, the query would be for a service and the instance specific query would be relatively rarer. Maybe the service can be made as default if not specified. And if the instance word is present, then we resolve a specific instance.
By the way, reserved words should be prefixed with an _ to avoid collision with naturally occurring DNS names.
SRV records - if we are not supporting weights are not being set, they should be set to 0, as per the RFC.
A records - how are we load balancing the results from the DNS? Are we picking one randomly or using round robin? Can you point me to the function in code where this is being done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Omitting the
service
keyword will create ambiguous queries. Doesxyz.instance.amalgam8
queries for an instance with IDxyz
, or rather for instances of the service namedinstance
with the tagxyz
? - Not sure what you mean here. Our DNS server is authoritative over its top-level domain (e.g.,
amalgam8
), and can defineservice.amalgam8
as a subdomain under which it resolves names. There's no other naturally occurring names under this domain, other than these which we choose to assign records to. - Agree, will change to 0 (and also for priority).
- We don't load balance, we return all records matching a query. The client is responsible for load balancing however it wants (e.g., randomly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) it will be the latter. I am trying to avoid strict requirements (.service., and .instance.) for using the DNS in our sidecar, as it would require changes to the application code (when someone uses just DNS based discovery, without nginx, for their existing application).
For a clearer example, see this https://github.com/skynetservices/skydns (under Service Discovery using DNS).
querying 1.rails.staging.us.east.skydns.local returns that instance's
IP address, while querying rails.staging.us.east.skydns.local returns a set of records (1.rails, 2.rails, 3.rails..).
<tags|instanceIDs|protocol>*.<serviceName>.<domainName>.
for A/AAAA (no SRV)
<_serviceName>.<_tag|instanceIDs|protocol>.<domainName>.
for SRV (as per the RFC)
where user can define the domainName.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, we should atleast randomize the list of A records that we return. I am not sure what is the typical behavior of most DNS clients that query A records. If they always pick the first record, then all load will end up on the first host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's go with your proposed query syntax.
To resolve the ambiguity between tag/protocol/instanceID, we'll first try to match the label as an instance ID, afterwards as the protocol (endpoint type) and lastly as a tag.
And +1 for randomizing the returned records.
instanceID := serviceInstance.ID | ||
targetName := fmt.Sprintf("%s.instance.%s.", instanceID, domainName) | ||
targetName := fmt.Sprintf("%s.instance.%s", instanceID, domainName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove the '.' from the suffix? I don't think domainName
ends with a dot.
if parsedIP == nil { | ||
return nil, port, fmt.Errorf("url ip value %s is not a valid ip format", ip) | ||
if port == "0" && startsWithHTTPS { | ||
port = "430" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
433...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's port 443. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops 😳
return nil | ||
} | ||
|
||
func (m *registry) copyInstances(instances []*client.ServiceInstance) []client.ServiceInstance { | ||
arrayToReturn := make([]client.ServiceInstance, 0, len(instances)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a bug here: arrayToReturn
is allocated with length 0, so accessing arrayToReturn[i]
panics (this is what has been causing the integration test to fail - all proxy sidecars were crashing...).
Use instead:
arrayToReturn := make([]client.ServiceInstance, len(instances))
for i, instance := range instances {
arrayToReturn[i] = *instance
}
return arrayToReturn
m.lock.Lock() | ||
m.hashed = latestCatalog | ||
m.lock.Unlock() | ||
instances, _ := m.registryClient.ListInstances(client.InstanceFilter{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid querying the remote registry again in this case. We already have the list of instances acquired from m.getLatestCatalog()
. We can have m.getLatestCatalog()
return a list instead of a map, have another function to construct a map out of this list, and use the list for notifying the listeners.
for _, listener := range m.listeners { | ||
if err = listener.CatalogChange(latestCatalog); err != nil { | ||
m.lock.RLock() | ||
copyOfListeners := m.listeners |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can copy the list of listeners under the w-lock above. It's quicker and simpler than re-acquiring the r-lock.
I've added a bunch of commits addressing all recent review comments (+ some additional refactoring and bug fixes). Also opened amalgam8/amalgam8.github.io#89 for documenting the DNS function. @rshriram I'm approving my own changes, please approve your review as well so we can merge. |
Added DNS server implementation for sidecar , include testing and configurations.