Bobcares

How to troubleshoot DNS failures with Amazon EKS

Sep 4, 2021

Looking for how to troubleshoot DNS failures with Amazon EKS? We can help you with this!

As a part of our AWS Support Services, we often receive similar requests from our AWS customers.

Today, let’s see the steps our Support Techs follow to help our customers troubleshoot DNS failures with Amazon EKS.

 

DNS failures with Amazon EKS

 
For querying internal and external DNS records, pods running inside the EKS cluster use the CoreDNS service’s cluster IP as the default name server.

So if there is any issue with the CoreDNS pods, it can cause DNS resolution failures in applications.

To troubleshoot issues with CoreDNS pods, we must verify that all components of the kube-dns service are working properly, because the CoreDNS pods are exposed through the kube-dns service.

The resolution below assumes the CoreDNS service ClusterIP is 10.100.0.10.

1. Firstly, we need to run the following command to find the ClusterIP of the CoreDNS service:

kubectl get service kube-dns -n kube-system

2. Then run the following command to verify that the DNS endpoints point to the CoreDNS pods:

kubectl -n kube-system get endpoints kube-dns

For example, the above command returns output similar to the following:

NAME ENDPOINTS AGE
kube-dns 192.166.4.219:53,192.166.3.117:53,192.168.2.228:53 + 1 more... 100d

3. Also, verify that a security group or network ACL isn’t blocking the pods when they communicate with CoreDNS.
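The endpoint list from step 2 is a comma-separated string; for scripting, it can be split into individual CoreDNS pod IPs with standard shell tools. A minimal sketch, using the sample output shown above:

```shell
# Sample ENDPOINTS column from 'kubectl -n kube-system get endpoints kube-dns'
endpoints="192.166.4.219:53,192.166.3.117:53,192.168.2.228:53"

# Split on commas and strip the :53 port to get one pod IP per line
echo "$endpoints" | tr ',' '\n' | sed 's/:53$//'
```

These IPs are the CoreDNS pod addresses that the later steps query directly with nslookup.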
 

Check that the kube-proxy pod is working

 
To verify that the kube-proxy pod has access to the API server of the EKS cluster, we need to check its logs for timeout errors or 403 unauthorized errors.

Run the following command to view the kube-proxy logs:

kubectl logs -n kube-system --selector 'k8s-app=kube-proxy'

 

To troubleshoot the DNS issue, connect to the application pod

 

1. Firstly, run the following command to access a shell inside the running pod:

kubectl exec -it your-pod-name -- sh

If we get an error similar to the following, the application pod might not have a shell binary available:

OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown
command terminated with exit code 126

2. Now check the resolv.conf file to verify that the ClusterIP of the kube-dns service is in the pod’s /etc/resolv.conf:

cat /etc/resolv.conf

For example, the following resolv.conf shows a pod that’s configured to point at 10.100.0.10 for DNS requests:

nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

Here the nameserver IP should match the ClusterIP of your kube-dns service.
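This comparison can be scripted. The following sketch parses resolv.conf content (here the sample shown above, embedded as a string; in a real pod, read /etc/resolv.conf instead) and checks the nameserver against an assumed kube-dns ClusterIP of 10.100.0.10:

```shell
# resolv.conf content to check (sample from above)
resolv_conf="nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5"

# ClusterIP of the kube-dns service (assumed; take the real value from step 1)
expected="10.100.0.10"

# Pull the nameserver IP out of the resolv.conf content
nameserver=$(printf '%s\n' "$resolv_conf" | awk '/^nameserver/ {print $2; exit}')

if [ "$nameserver" = "$expected" ]; then
  echo "OK: pod resolves via kube-dns ($nameserver)"
else
  echo "MISMATCH: pod uses $nameserver, expected $expected"
fi
```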

3. Now, using the nslookup command, we can verify that the pod can resolve an internal domain using the default ClusterIP:

nslookup kubernetes 10.100.0.10

The result of the nslookup command:

Server: 10.100.0.10
Address: 10.100.0.10#53
Name: kubernetes.default.svc.cluster.local
Address: 10.100.0.1

4. Similarly, we can verify that the pod can resolve an external domain using the default ClusterIP:

nslookup amazon.com 10.100.0.10

The result of the nslookup command:

Server: 10.100.0.10
Address: 10.100.0.10#53
Non-authoritative answer:
Name: amazon.com
Address: 176.32.98.167
Name: amazon.com
Address: 205.251.243.104
Name: amazon.com
Address: 176.32.103.204

5. Also, using the nslookup command, we can verify that the pod can resolve directly against the IP address of a CoreDNS pod:

nslookup kubernetes COREDNS_POD_IP

nslookup amazon.com COREDNS_POD_IP

Replace COREDNS_POD_IP with one of the endpoint IPs from the kube-dns endpoints listed earlier.
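To cover every CoreDNS pod, the two lookups can be repeated per endpoint IP. Here is a small sketch that only prints the commands to run from inside the application pod (the IPs are the sample endpoint values from earlier; substitute your own):

```shell
# CoreDNS pod IPs (sample values from the kube-dns endpoints output; replace with yours)
pod_ips="192.166.4.219 192.166.3.117 192.168.2.228"

# Print one internal and one external lookup per CoreDNS pod
for ip in $pod_ips; do
  echo "nslookup kubernetes $ip"
  echo "nslookup amazon.com $ip"
done
```

If one pod IP resolves and another doesn’t, the problem is likely with that specific CoreDNS pod or its node, not with the service as a whole.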
 

Logs from CoreDNS pods

 
Let’s see the steps to get detailed logs from the CoreDNS pods for debugging:

1. First, enable debug logging for the CoreDNS pods by adding the log plugin to the CoreDNS ConfigMap. Open the ConfigMap for editing with the following command:

kubectl -n kube-system edit configmap coredns

2. In the editor window, add the log string to the Corefile, as in the example below.

Also note that CoreDNS can take several minutes to reload its configuration. To apply the changes immediately, we can restart the CoreDNS pods one by one.

kind: ConfigMap
apiVersion: v1
data:
  Corefile: |
    .:53 {
        log    # Enabling CoreDNS Logging
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
    ...
    ...

3. Now run the following command to check whether the CoreDNS logs show hits (or failures) from the application pod:

kubectl logs --follow -n kube-system --selector 'k8s-app=kube-dns'

 

Search and ndots combination

 
DNS uses the nameserver for name resolution; in a pod, this is usually the cluster IP of the kube-dns service. To complete a query name into a fully qualified domain name, DNS uses the search list. The ndots value is the minimum number of dots that must appear in a queried name before the resolver tries it as an absolute name; names with fewer dots are first tried with each search domain appended.

For example, with the default ndots value of 5, a domain name that’s not fully qualified (and has fewer than five dots) isn’t queried as-is first. Instead, for all external domains that don’t fall under the internal domain cluster.local, the search domains are appended before querying.

Example with the /etc/resolv.conf setting of the application pod:

nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

The resolver looks for at least five dots in the domain being queried before treating it as an absolute name.

The logs look similar to the following if the pod makes a DNS resolution call for amazon.com:

[INFO] 192.168.3.72:33238 - 36534 "A IN amazon.com.default.svc.cluster.local. udp 54 false 512" NXDOMAIN qr,aa,rd 147 0.000473434s
[INFO] 192.168.3.72:57098 - 43241 "A IN amazon.com.svc.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000066171s
[INFO] 192.168.3.72:51937 - 15588 "A IN amazon.com.cluster.local. udp 42 false 512" NXDOMAIN qr,aa,rd 135 0.000137489s
[INFO] 192.168.3.72:52618 - 14916 "A IN amazon.com.ec2.internal. udp 41 false 512" NXDOMAIN qr,rd,ra 41 0.001248388s
[INFO] 192.168.3.72:51298 - 65181 "A IN amazon.com. udp 28 false 512" NOERROR qr,rd,ra 106 0.001711104s

NXDOMAIN indicates that the domain record wasn’t found.

NOERROR indicates that the domain record was found.

Here, each search domain is appended to amazon.com in turn before the final query for the absolute domain. That final name carries a trailing dot, which makes it a fully qualified domain name. This means that every external domain name query can generate four or five additional DNS calls.
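The expansion order in those log lines can be simulated in plain shell. This sketch performs no DNS queries; it only prints the names the resolver will try, given the search list and ndots value from the resolv.conf above:

```shell
name="amazon.com"
ndots=5
search="default.svc.cluster.local svc.cluster.local cluster.local ec2.internal"

# Count the dots in the queried name (amazon.com has 1)
dots=$(printf '%s' "$name" | awk -F. '{print NF-1}')

if [ "${name%.}" != "$name" ] || [ "$dots" -ge "$ndots" ]; then
  # Trailing dot, or at least ndots dots: query the name as-is, no search expansion
  echo "$name"
else
  # Otherwise: try each search domain first, then the absolute name last
  for domain in $search; do
    echo "$name.$domain"
  done
  echo "$name."
fi
```

The five lines it prints match the five queries in the CoreDNS log above; with ndots set to 1, or with a trailing dot on the name, only the final absolute query remains.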

To fix this issue, we should either change ndots to 1, or append a dot to the end of the domain being queried so that it’s treated as fully qualified:

nslookup example.com.

 

VPC resolver (AmazonProvidedDNS) limits

 
The VPC resolver can accept a maximum of 1024 packets per second per network interface.

If more than one CoreDNS pod runs on the same worker node, the chances of hitting this limit are higher.

To schedule the CoreDNS pods on separate instances, we need to add PodAntiAffinity rules like the following to the CoreDNS deployment:

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: k8s-app
              operator: In
              values:
              - kube-dns
          topologyKey: kubernetes.io/hostname

 

Conclusion

 
To conclude, today we discussed the steps our Support Engineers follow to help customers troubleshoot DNS failures with Amazon EKS.
