Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request to add DNS server implementation for sidecar #328

Merged
merged 27 commits into from
Nov 2, 2016

Conversation

tomerGolany
Copy link
Contributor

Added DNS server implementation for sidecar , include testing and configurations.

@rshriram
Copy link
Member

rshriram commented Oct 9, 2016

Thank you for this PR and this valuable feature addition. I would like to make sure that we reuse the local registry cache instead of making a remote API call for every query.

@zvi not related to this PR, but we need to think about how SRV records figure in this scenario. It would mostly require merging rules that contain backend weights and priorities.

At some point I would like to convert our examples to use DNS as well.

return nil
}

// Shutdown stpos the DNS server
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo. Stops


switch endPointType {
case "tcp":
ip, err = validateEndPointTypeTCPAndUDP(serviceInstance.Endpoint.Value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two can become one switch statement.

case "udp":
ip, err = validateEndPointTypeTCPAndUDP(serviceInstance.Endpoint.Value)

case "http":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need https as well

}

func validateEndPointTypeHTTP(value string) (net.IP, error) {
parsedURL, err := url.Parse(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YOu may not even have URLs in value. That case has to be handled. Today, that is the default case as well.

}
return record
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DO we have support for SRV records?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not as of yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating SRV records is a no brainer here. Its a simple function, where you populate the fields with the proto, IP, port and weight/priorities information. The last two can be obtained from the controller rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple support (without weight/priorities) for SRV should be pretty straight-forward. @tomerGolany can you please look into adding support for that?

ServiceInstances, err = s.config.DiscoveryClient.ListServiceInstances(serviceName)
} else {
filters := client.InstanceFilter{ServiceName: serviceName, Tags: tags}
ServiceInstances, err = s.config.DiscoveryClient.ListInstances(filters)
Copy link
Member

@rshriram rshriram Oct 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zcahana @GregHanson is this going to remote registry or is it going to the local cache in sidecar for registry and controller?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code, as is wired today, is calling the remote registry.
@tomerGolany has started working on another change-set so that this will be served from the local cache. Specifically, he is changing the RegistryMonitor so that it'll also implement the github.com/amalgam8/amalgam8/registry/client.Discovery interface, and to inject that into the dns server, instead of the existing HTTP-based discovery client.

@rshriram
Copy link
Member

I would suggest moving the DNS resolution capability to the registry instead of the sidecar. When a service is using DNS-based service discovery (e.g., Consul, SkyDNS), then it makes sense to move the DNS server to the control plane, instead of having a DNS server for every sidecar. Is there a use case that I am missing here?

Copy link
Member

@rshriram rshriram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my new comments, esp regarding tags, choice of instances, etc.

var ServiceInstances []*client.ServiceInstance
var err error
if len(tags) == 0 {
ServiceInstances, err = s.config.DiscoveryClient.ListServiceInstances(serviceName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If tags are not provided, as will often be the case, this should actually look up the service in the rules to see if there are specific rules for that service. The rules provide the default set of tags to use for the service, and if there are no tags, they provide weights for each set of tags. This would enable the weighted load balancing. And it covers use cases such as route to instances in the same data center. Arbitrarily choosing an instance will end up in a very suboptimal setup.

In the current implementation, the DNS based lookup is essentially bypassing everything that we did for tag/rule based routing, weights, actions and such. So, lets please use the controller rules to add some smarts to the DNS responses.

It also brings into focus a bigger question: you will now be effectively duplicating what the nginx proxy is doing by doing version-based routing via DNS.

Finally, how do you propose to route all DNS calls to the sidecar? Replacing the nameservers in /etc/resolv.conf ? or some other trick? I don't see that in the patch set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicating the rules processing logic is indeed a big question. There's a non-trivial amount of work/code to be written to support that, and of course a maintenance cost. I would say this is feasible only if DNS support gains momentum and clients.

Alternatively, if we ever to replace nginx with our own Golang-based proxy (e.g., by integrating/extending Træfɪk, Caddy, or Skipper), we can easily share the same rules processing logic with the DNS and proxy'ing components. There are also other benefits in having a Go-based proxy, but that's outside the scope of this PR.

case dns.TypeAAAA:
case dns.TypeANY:
default:
response.SetRcode(request, dns.RcodeServerFailure)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that in Golang, you don't need to have expressions below a case statement and that there is no automatic fall through. But that said, for the sake of readability, it would be much nicer if you hoisted the s.retrieveServices into the switch case.

Secondly, given that the code below is only supporting TYPEA and TYPEAAA, you should not be having a TYPE_ANY here. ANY is a very bad idea as well. Folks like CloudFlare have deprecated it.

Finally, given that you are retrieving multiple service instances, supporting DNS_SRV seems like a no brainer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switch-case style is a conventional pattern in Golang. You'll find similar usages in other parts of our code, as well as stdlib, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the TYPE_ANY - reading this, CloudFlare have deprecated it mostly due to the complexity involved in supporting it. I don't think this complexity is relevant to our tiny, single-client DNS server. Either way, I'm not strictly against removing it, I'm just saying supporting it is probably not that of a bad idea.

continue
}
numOfMatchingRecords++
if ip.To4() != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the dns.QuestionType here to determine if the user wants a A, AAAA or a SRV record and return the appropriate list.

}
return record
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating SRV records is a no brainer here. Its a simple function, where you populate the fields with the proto, IP, port and weight/priorities information. The last two can be obtained from the controller rules.

@zcahana
Copy link
Member

zcahana commented Oct 20, 2016

After discussing with @tomerGolany, here are the proposed changes to support SRV records:

  • The query syntax will be modified from [tag]*.<service_name>.<domain>. to one of the following:

    1. [tag or protocol]*.<service_name>.service.<domain>. (A/AAAA/SRV service lookup)
    2. _<service_name>._<tag or protocol>.service.<domain>. (SRV service lookup per RFC 2782)
    3. <instance_id>.instance.<domain>. (A record lookup for a specific instance).
  • The answer to an SRV query (either via query syntax 1 or 2 above) will be an one or more SRV records, each with the target field set to <instance_id>.instance.<domain>. (syntax 3 above). The answer will also include (in the "additional" section) the A records corresponding to these targets. For example:

    QUESTION SECTION:
    _myservice._http.service.consul.     IN  SRV
    
    ANSWER SECTION:
    _myservice._http.service.consul.  0   IN  SRV 1 1 9080 98q7wfah4.instance.amalgam8.
    
    ADDITIONAL SECTION:
    98q7wfah4.instance.amalgam8. 0   IN  A   172.10.0.18
    
  • The tag part of the query has been expanded to support both regular tags, as well as the protocol (endpoint type). For "tcp", "udp", "http" and "https", the tag is interpreted as protocol, whereas all others are tags.

  • Priority/weight will be uniformly assigned to 1 (at least until we figure if/how to support routing rules).

@rshriram
Copy link
Member

@zcahana "Priority/weight will be uniformly assigned to 1 (at least until we figure if/how to support routing rules)."
-- Why?

The sidecar already has a local copy of the rules. If there are gRPC clients that are using this sidecar for resolving their DNS SRV requests, they can take advantage of the weights and priorities for version-based routing.

Also, FYI: a simple DNS server in nginx: https://gist.github.com/agentzh/6c50d37510daef792ed220fa0d970393

@zcahana
Copy link
Member

zcahana commented Oct 23, 2016

@rshriram see here.

filters := client.InstanceFilter{ServiceName: serviceName, Tags: tags}
ServiceInstances, err = s.config.DiscoveryClient.ListInstances(filters)

/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out code can/should be removed..

@@ -140,37 +143,140 @@ func (s *Server) handleQuestion(question dns.Question, request, response *dns.Ms
}

func (s *Server) retrieveServices(question dns.Question, request, response *dns.Msg) ([]*client.ServiceInstance, error) {
var ServiceInstances []*client.ServiceInstance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by convention, local variable names start with a lowercase letter (--> serviceInstances).

return ServiceInstances, err
}

func (s *Server) retrieveServicesFromSRVQuery(serviceName string, tagOrProtocol string) ([]*client.ServiceInstance, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no real need for both retrieveServicesFromSRVQuery() and retrieveServicesFromRegularQuery(), which mostly duplicate each other. Here's one implementation that does both:

func (s *Server) retrieveInstancesForServiceQuery(serviceName string, tagOrProtocol ...string) ([]*client.ServiceInstance, error) {
    protocol := ""
    tags := make([]string, 0, len(tagOrProtocol))

    // Split tags and protocol filters
    for _, tag := range tagsOrProtocol {
        switch tag {
            case "tcp", "udp", "http", "https":
                if protocol != "" {
                    return nil, fmt.Errorf("invalid DNS query: more than one protocol specified")
                }
                protocol = tag
            default:
                tags = append(tags, tag)
        }
    }
    filters := client.InstanceFilter{ServiceName: serviceName, Tags: tags}

    // Dispatch query to registry
    serviceInstances, err := s.config.DiscoveryClient.ListInstances(filters)
    if err != nil {
        return nil, err
    }

    // Apply protocol filter
    if protocol != "" {
        k := 0
        for i, serviceInstance := range serviceInstances {
            if serviceInstance.Endpoint.Type == protocol {
                serviceInstances[k] = serviceInstance
                k++
            }
        }
        serviceInstances = serviceInstances[:k]
    }

    return serviceInstances, nil
}


}

func (s *Server) retrieveServicesFromInstanceQuery(instanceID string) ([]*client.ServiceInstance, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligned with the function name in prev comment - retrieveInstancesForInstanceQuery()

}

func (s *Server) retrieveServicesFromInstanceQuery(instanceID string) ([]*client.ServiceInstance, error) {
ServiceInstances, err := s.config.DiscoveryClient.ListInstances(client.InstanceFilter{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto regarding lowercase variable name.

fullDomainRequestArray := dns.SplitDomainName(question.Name)
domainName := fullDomainRequestArray[len(fullDomainRequestArray)-1]
instanceID := serviceInstance.ID
targetName := instanceID + ".instance." + domainName + "."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More efficient (less string allocations/copying):

fmt.Sprintf("%s.instance.%s.", instanceID, domainName)

func validateEndPointTypeHTTP(value string) (net.IP,string, error) {
startsWithHTTP := strings.HasPrefix(value, "http://")
startsWithHTTPS := strings.HasPrefix(value, "https://")
if !startsWithHTTPS && !startsWithHTTP {
value += "http://"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appends "http://" to the end of value, e.g. 172.107.0.10:8080http://. Not what you want :)

func validateEndPointTypeTCPAndUDP(value string) (net.IP, error) {
ip, _, err := net.SplitHostPort(value)
func validateEndPointTypeTCPAndUDP(value string) (net.IP,string, error) {
ip, port, err := net.SplitHostPort(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

net.SplitHostPort() doesn't work when no port is specified (e.g. 172.17.0.10): https://play.golang.org/p/kLnRJxrQAi
I think you'll have to parse it on your own.

}
host := parsedURL.Host
ip, _, err := net.SplitHostPort(host)
ip, port , err := net.SplitHostPort(host)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For http/https endpoint type, the port is optional. If not specified, it should default to 80 for http, and 443 for https.


}

func (suite *TestSuite) TestRequestsSRVWuthTag() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo 😄

response.SetRcode(request, dns.RcodeNameError)
return nil, fmt.Errorf("service name wasn't included in domain %s", question.Name)
}
if fullDomainRequestArray[numberOfLabels-2] == "service" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this hard coded subdomain?
Perhaps an easier way would be to let people specify their local search domains in the config file and we serve srv or A records for those local domains while forwarding rest to other DNS servers.

var err error
// parse query :
// Query format:
// [tag or endpoint type]*.<service>.service.<domain>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the DNS domain specified in the config? This would let people use custom DNS domains.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean why are you using the hard coded "service" value in the DNS name instead of reusing the one provided by user in config?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rshriram The top-level domain is configurable by the client (defaults to amalgam8).
The hardcoded 'service'/'instance' labels are second-level domains, which are used to distinguish service records vs. instance records (used with SRV records).
For example, reviews.service.amalgam8 vs. 98q7wfah4.instance.amalgam8

More details here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a better way rather than asking the user to explicitly specify whether they are querying a service or an instance ? I feel that most of the time, the query would be for a service and the instance specific query would be relatively rarer. Maybe the service can be made as default if not specified. And if the instance word is present, then we resolve a specific instance.

By the way, reserved words should be prefixed with an _ to avoid collision with naturally occurring DNS names.

SRV records - if we are not supporting weights are not being set, they should be set to 0, as per the RFC.

A records - how are we load balancing the results from the DNS? Are we picking one randomly or using round robin? Can you point me to the function in code where this is being done?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Omitting the service keyword will create ambiguous queries. Does xyz.instance.amalgam8 queries for an instance with ID xyz, or rather for instances of the service named instance with the tag xyz?
  2. Not sure what you mean here. Our DNS server is authoritative over its top-level domain (e.g., amalgam8), and can define service.amalgam8 as a subdomain under which it resolves names. There's no other naturally occurring names under this domain, other than these which we choose to assign records to.
  3. Agree, will change to 0 (and also for priority).
  4. We don't load balance, we return all records matching a query. The client is responsible for load balancing however it wants (e.g., randomly).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) it will be the latter. I am trying to avoid strict requirements (.service., and .instance.) for using the DNS in our sidecar, as it would require changes to the application code (when someone uses just DNS based discovery, without nginx, for their existing application).
For a clearer example, see this https://github.com/skynetservices/skydns (under Service Discovery using DNS).
querying 1.rails.staging.us.east.skydns.local returns that instance's IP address, while querying rails.staging.us.east.skydns.local returns a set of records (1.rails, 2.rails, 3.rails..).

<tags|instanceIDs|protocol>*.<serviceName>.<domainName>. for A/AAAA (no SRV)
<_serviceName>.<_tag|instanceIDs|protocol>.<domainName>. for SRV (as per the RFC)
where user can define the domainName.
What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, we should atleast randomize the list of A records that we return. I am not sure what is the typical behavior of most DNS clients that query A records. If they always pick the first record, then all load will end up on the first host.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's go with your proposed query syntax.
To resolve the ambiguity between tag/protocol/instanceID, we'll first try to match the label as an instance ID, afterwards as the protocol (endpoint type) and lastly as a tag.

And +1 for randomizing the returned records.

instanceID := serviceInstance.ID
targetName := fmt.Sprintf("%s.instance.%s.", instanceID, domainName)
targetName := fmt.Sprintf("%s.instance.%s", instanceID, domainName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the '.' from the suffix? I don't think domainName ends with a dot.

if parsedIP == nil {
return nil, port, fmt.Errorf("url ip value %s is not a valid ip format", ip)
if port == "0" && startsWithHTTPS {
port = "430"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

433...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's port 443. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops 😳

return nil
}

func (m *registry) copyInstances(instances []*client.ServiceInstance) []client.ServiceInstance {
arrayToReturn := make([]client.ServiceInstance, 0, len(instances))
Copy link
Member

@zcahana zcahana Nov 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bug here: arrayToReturn is allocated with length 0, so accessing arrayToReturn[i] panics (this is what has been causing the integration test to fail - all proxy sidecars were crashing...).

Use instead:

arrayToReturn := make([]client.ServiceInstance, len(instances))
for i, instance := range instances {
    arrayToReturn[i] = *instance
}
return arrayToReturn

m.lock.Lock()
m.hashed = latestCatalog
m.lock.Unlock()
instances, _ := m.registryClient.ListInstances(client.InstanceFilter{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid querying the remote registry again in this case. We already have the list of instances acquired from m.getLatestCatalog(). We can have m.getLatestCatalog() return a list instead of a map, have another function to construct a map out of this list, and use the list for notifying the listeners.

for _, listener := range m.listeners {
if err = listener.CatalogChange(latestCatalog); err != nil {
m.lock.RLock()
copyOfListeners := m.listeners
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can copy the list of listeners under the w-lock above. It's quicker and simpler than re-acquiring the r-lock.

@zcahana
Copy link
Member

zcahana commented Nov 2, 2016

I've added a bunch of commits addressing all recent review comments (+ some additional refactoring and bug fixes). Also opened amalgam8/amalgam8.github.io#89 for documenting the DNS function.

@rshriram I'm approving my own changes, please approve your review as well so we can merge.

@rshriram rshriram merged commit f821e3b into amalgam8:master Nov 2, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants