Orphan DNS Records¶
This document covers multi-cluster DNS, where more than one gateway instance shares a common hostname with other gateways. It assumes you have the observability stack set up.
What is an orphan record?¶
An orphan DNS record is a record, or set of records, owned by an instance of the DNS Operator that no longer has a representation of those records on its cluster.
How do orphan records occur?¶
Orphan records can occur when a DNSRecord resource (a resource created in response to a DNSPolicy) is deleted without allowing the owning controller time to clean up the associated records in the DNS provider. Generally, for this to happen, you would need to do one of the following: force-remove a finalizer from the DNSRecord resource; delete the kuadrant-system namespace directly; uninstall Kuadrant (delete the subscription if using OLM) without first cleaning up existing policies; or delete a cluster entirely without first cleaning up the associated DNSPolicies. These scenarios are not common, but when they do occur they can leave behind records in your DNS provider that may point to IPs or hosts that are no longer valid.
How do you spot that orphan records exist?¶
A Prometheus-based alert uses metrics exposed by the DNS components to detect this situation. If you have installed the Kuadrant alerts from the examples folder, the alerts tab will show an alert called PossibleOrphanedDNSRecords. When it is firing, there are likely to be orphaned records in your provider.
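If you prefer to check from a terminal rather than the alerts tab, the following is a minimal sketch that looks for the alert in the output of the Prometheus HTTP API endpoint /api/v1/alerts. The helper name orphan_alert_present and the PROM_URL variable are assumptions for illustration; they are not part of Kuadrant. Note that this endpoint lists active alerts, which includes alerts that are pending as well as those that are firing.

```shell
# Hypothetical helper: succeeds when the JSON read from stdin (the response
# of the Prometheus /api/v1/alerts endpoint) contains the orphan-record alert.
orphan_alert_present() {
  grep -q '"alertname"[[:space:]]*:[[:space:]]*"PossibleOrphanedDNSRecords"'
}

# Usage (PROM_URL is an assumption; point it at your own Prometheus instance):
#   curl -s "${PROM_URL}/api/v1/alerts" | orphan_alert_present \
#     && echo "PossibleOrphanedDNSRecords is active"
```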
How do you get rid of an orphan record?¶
To remove an orphan record, we must first identify the owner that is no longer aware of the record. To do this, we need an existing DNSRecord in another cluster.
Example: You have two clusters that each have a gateway, share the host apps.example.com, and have a DNSPolicy set up for each gateway. On cluster 1 you remove the kuadrant-system namespace without first cleaning up the existing DNSPolicies targeting the gateway in your ingress-gateway namespace. Now there is a set of records that were being managed for that gateway and have not been removed.
On cluster 2, the DNS Operator managing the existing DNSRecord in that cluster has a record of all owners of that DNS name.
The Prometheus alert detects that the number of owners does not match the number of DNSRecord resources and fires.
To remedy this, rather than going to the DNS provider directly and trying to figure out which records to remove, follow the steps below.
Get the owner IDs from the DNSRecord on cluster 2 for the shared host
Get all the owner IDs:
kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.domainOwners}'
# output
# ["26aacm1z","49qn0wp7"]
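To work out which of these IDs is orphaned, one option is to filter out the ID that belongs to the DNSRecord still present on this cluster. The sketch below uses a hypothetical helper, orphan_owner_ids, and assumes the local record reports its own ID at .status.ownerID; check the status of your DNSRecord to confirm that field before relying on it.

```shell
# Hypothetical helper: print every owner ID from a domainOwners JSON array
# (first argument) except the local owner ID (second argument), one per line.
orphan_owner_ids() {
  printf '%s\n' "$1" \
    | sed 's/[][" ]//g' \
    | tr ',' '\n' \
    | grep -v -x -F -e "$2" -e ''
}

# Usage (assumes .status.ownerID holds the ID of the local record):
#   owners=$(kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.domainOwners}')
#   localOwner=$(kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.ownerID}')
#   orphan_owner_ids "$owners" "$localOwner"
```

Each ID it prints is a candidate for the placeholder step that follows.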
Create a placeholder DNSRecord with a non-active ownerID
For each owner ID returned that is not the owner ID of the existing DNSRecord on this cluster (that is, each owner whose records we want to remove), create a DNSRecord resource with that owner ID and then delete it. This triggers the operator running in this cluster to clean up those records.
This is one of the owner IDs that does not belong to the existing DNSRecord on this cluster:
export ownerID=26aacm1z
export rootHost=$(kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.spec.rootHost}')
Export the namespace that contains the AWS credentials:
export targetNS=kuadrant-system
kubectl apply -f - <<EOF
apiVersion: kuadrant.io/v1alpha1
kind: DNSRecord
metadata:
  name: delete-old-loadbalanced-dnsrecord
  namespace: ${targetNS}
spec:
  providerRef:
    name: my-aws-credentials
  ownerID: ${ownerID}
  rootHost: ${rootHost}
  endpoints:
    - dnsName: ${rootHost}
      recordTTL: 60
      recordType: CNAME
      targets:
        - klb.doesnt-exist.${rootHost}
EOF
Delete the DNSRecord
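A minimal sketch of the delete, using the placeholder name and namespace from the manifest in the previous step. The wrapper function delete_placeholder_record and the KUBECTL override are assumptions added for illustration, so that the command can be previewed before it is run; they are not part of Kuadrant.

```shell
# Hypothetical wrapper: delete the placeholder DNSRecord created above.
# Assumes targetNS is still exported in this shell; fails early if it is not.
# KUBECTL is an illustrative override and defaults to the real kubectl.
delete_placeholder_record() {
  "${KUBECTL:-kubectl}" delete dnsrecord.kuadrant.io \
    delete-old-loadbalanced-dnsrecord -n "${targetNS:?targetNS must be set}"
}

# Preview the command without touching the cluster:
#   KUBECTL=echo delete_placeholder_record
# Run it for real:
#   delete_placeholder_record
```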
Verification
We can verify that the steps worked by checking the DNSRecord again. Note that it may take several minutes for the other record to update. We can force an update by adding a label to the record:
kubectl label dnsrecord.kuadrant.io somerecord test=test -n my-gateway-ns
kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.domainOwners}'
You should also see your alert eventually stop triggering.
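To make the domainOwners check less manual, the following sketch tests whether a given owner ID still appears in the list. The helper name owner_still_listed is hypothetical, and the usage assumes ownerID is still exported from the earlier step.

```shell
# Hypothetical helper: succeeds if the owner ID (second argument) still
# appears as a quoted entry in the domainOwners JSON array (first argument).
owner_still_listed() {
  printf '%s' "$1" | grep -q -F "\"$2\""
}

# Usage:
#   owners=$(kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.domainOwners}')
#   if owner_still_listed "$owners" "$ownerID"; then
#     echo "owner ${ownerID} is still listed; wait and check again"
#   else
#     echo "owner ${ownerID} has been removed"
#   fi
```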