Monitoring Kubernetes on Mesosphere DC/OS with Splunk Enterprise and Splunk Cloud
April 4, 2019

If you are using Kubernetes on Mesosphere DC/OS, you may find that our default configuration does not provide all the metrics and information out of the box. In this blog post we will guide you through all the configuration changes needed to get the information you need to monitor the health of your clusters and the performance of your applications.
We used the Quickstart guide for Kubernetes on DC/OS on AWS as an example.
Fix for the cgroup filesystem
If you run the troubleshooting command `verify` on one of the collectorforkubernetes Pods, you may find that it fails to find the cgroups for the Pods and Containers.
```
Kubernetes configuration:
  + api: OK
  x pod cgroup: FAILED pods = 0 (with cgroup filter = ^/([^/\s]+/)*kubepods(\.slice)?/((kubepods-)?(burstable|besteffort)(\.slice)?/)?([^/]*)pod([0-9a-f]{32}|[0-9a-f\-_]{36})(\.slice)?$)
  x container cgroup: FAILED containers = 0 (with cgroup filter = ^/([^/\s]+/)*kubepods(\.slice)?/((kubepods-)?(burstable|besteffort)(\.slice)?/)?([^/]*)pod([0-9a-f]{32}|[0-9a-f\-_]{36})(\.slice)?/(docker-|crio-)?[0-9a-f]{64}(\.scope)?(\/.+)?$)
  + volumes root: OK
  + runtime: OK docker
```
This is because we mount the cgroup filesystem under `/rootfs/sys/fs/cgroup`, and if you look at the different types of cgroups:
```
/rootfs/sys/fs/cgroup# ls -alh
total 0
drwxr-xr-x. 13 root root 340 Apr  3 22:48 .
drwxr-xr-x.  7 root root   0 Apr  3 22:48 ..
drwxr-xr-x.  3 root root   0 Apr  3 22:48 blkio
lrwxrwxrwx.  1 root root  26 Apr  3 22:48 cpu -> /sys/fs/cgroup/cpu,cpuacct
drwxr-xr-x.  3 root root   0 Apr  3 22:48 cpu,cpuacct
lrwxrwxrwx.  1 root root  26 Apr  3 22:48 cpuacct -> /sys/fs/cgroup/cpu,cpuacct
```
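If `readlink` is available in the container image, you can follow one of these links from inside the collector container to see the problem (an illustrative check):

```bash
# Run from a shell inside a collectorforkubernetes Pod (illustrative):
readlink /rootfs/sys/fs/cgroup/cpu
# prints /sys/fs/cgroup/cpu,cpuacct -- the target resolves against the
# container's root filesystem, not against /rootfs
```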
You'll notice that these links are broken: the `cpu` cgroup points to `/sys/fs/cgroup/cpu,cpuacct`, when inside the container it should point to `/rootfs/sys/fs/cgroup/cpu,cpuacct` (or better, the relative `./cpu,cpuacct`). To fix that, you can mount the cgroups inside the container differently in our configuration. In both DaemonSets, `collectorforkubernetes` and `collectorforkubernetes-master`, change the volumeMounts from
```yaml
- name: cgroup
  mountPath: /rootfs/sys/fs/cgroup
  readOnly: true
```
To
```yaml
- name: cgroup-cpu
  mountPath: /rootfs/sys/fs/cgroup/cpu
  readOnly: true
- name: cgroup-cpu
  mountPath: /rootfs/sys/fs/cgroup/cpuacct
  readOnly: true
- name: cgroup-blkio
  mountPath: /rootfs/sys/fs/cgroup/blkio
  readOnly: true
- name: cgroup-memory
  mountPath: /rootfs/sys/fs/cgroup/memory
  readOnly: true
```
And change the volumes from
```yaml
- name: cgroup
  hostPath:
    path: /sys/fs/cgroup
```
To
```yaml
- name: cgroup-cpu
  hostPath:
    path: /sys/fs/cgroup/cpu,cpuacct
- name: cgroup-blkio
  hostPath:
    path: /sys/fs/cgroup/blkio
- name: cgroup-memory
  hostPath:
    path: /sys/fs/cgroup/memory
```
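Assuming your manifest is the `collectorforkubernetes.yaml` file referenced later in this guide, re-applying it is a single command:

```bash
kubectl apply -f collectorforkubernetes.yaml
```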
After applying the change, you can run the `verify` command again, and you should see that the problem is fixed:
```
Kubernetes configuration:
  + api: OK
  + pod cgroup: OK pods = 7
  + container cgroup: OK containers = 16
  + volumes root: OK
  + runtime: OK docker
```
Pods from the DaemonSet collectorforkubernetes-master fail to start
If you see that Pods from the DaemonSet `collectorforkubernetes-master` fail to start with `CrashLoopBackOff`, look at the events for this Pod with
```bash
kubectl describe pod --namespace collectorforkubernetes collectorforkubernetes-master-wbv62
```
If you find something similar to
```
Events:
  Warning  Failed  2m33s (x4 over 3m20s)  kubelet, kube-control-plane-0-instance.devkubernetes01.mesos  Error: failed to start container "collectorforkubernetes": Error response from daemon: OCI runtime create failed: container_linux.go:337: starting container process caused "process_linux.go:403: container init caused \"process_linux.go:368: setting cgroup config for procHooks process caused \\\"failed to write 200000 to cpu.cfs_quota_us: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod2889c500-5665-11e9-a692-a6728d2eb688/collectorforkubernetes/cpu.cfs_quota_us: invalid argument\\\"\"": unknown
```
That means that the parent cgroup has a lower limit for the CPU: the error shows Kubernetes failing to write `200000` (a `2000m` limit) to `cpu.cfs_quota_us`. Change the CPU limit for the `collectorforkubernetes-master` DaemonSet to `1000m` or `500m`. In our case we see that the parent cgroup for the master Pods has a `cpu.cfs_quota_us` equal to `160000` (`1600m`, given the default `100000` period):
```
cat 7d57d61b-4c0f-4133-b658-bdfa902f67b2/cpu.cfs_quota_us
160000
```
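A minimal sketch of the change, assuming the container in the `collectorforkubernetes-master` DaemonSet already declares a `resources` section (only the CPU limit needs to change):

```yaml
resources:
  limits:
    # must stay below the parent cgroup quota of 1600m
    cpu: 500m
```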
After lowering the CPU limit, apply the configuration, and you should see that the Pods from `collectorforkubernetes-master` are scheduled on the master nodes.
CoreDNS metrics
If you want to collect `coredns` metrics, just run the following command to attach annotations that tell Collectord to start forwarding metrics from the coredns Pods to Splunk:
```bash
kubectl annotate deployment/coredns --namespace kube-system \
    'collectord.io/prometheus.1-path=/metrics' \
    'collectord.io/prometheus.1-port=9153' \
    'collectord.io/prometheus.1-source=coredns' \
    --overwrite
```
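To double-check that the annotations landed on the Deployment, you can print them back (an optional sanity check):

```bash
kubectl get deployment/coredns --namespace kube-system \
    -o jsonpath='{.metadata.annotations}'
```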
etcd metrics
To be able to monitor the etcd cluster with our application, Monitoring Kubernetes for Splunk Enterprise and Splunk Cloud, you need to retrieve the etcd certificates from the Kubernetes API Pod and modify the configuration in `collectorforkubernetes.yaml`.
To retrieve the certificates from the Kubernetes API, find the name of one of the `kube-apiserver` Pods and copy three files: `ca-crt.pem`, `kube-apiserver-crt.pem`, and `kube-apiserver-key.pem`.
```bash
kubectl cp --namespace kube-system kube-apiserver-kube-control-plane-0-instance.devkubernetes01.mesos:/data/ca-crt.pem .
kubectl cp --namespace kube-system kube-apiserver-kube-control-plane-0-instance.devkubernetes01.mesos:/data/kube-apiserver-crt.pem .
kubectl cp --namespace kube-system kube-apiserver-kube-control-plane-0-instance.devkubernetes01.mesos:/data/kube-apiserver-key.pem .
```
Create a secret `etcd-cert` in the `collectorforkubernetes` namespace from the files you just retrieved:
```bash
kubectl create secret generic --namespace collectorforkubernetes etcd-cert \
    --from-file=./ca-crt.pem \
    --from-file=./kube-apiserver-crt.pem \
    --from-file=./kube-apiserver-key.pem
```
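You can verify that the secret contains all three files (optional; the byte counts will vary):

```bash
kubectl describe secret --namespace collectorforkubernetes etcd-cert
```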
Now you need to modify the `collectorforkubernetes.yaml` configuration. First, find the stanza `[input.prometheus::etcd]` and disable it with `disabled = true`, as in the snippet below; that default stanza is used when etcd is deployed on the master nodes, which is not the case here.
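The change is a single line in the existing stanza (everything else in it can stay as-is):

```
[input.prometheus::etcd]
disabled = true
```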
In the `ConfigMap`, in the file `004-addon.conf`, add the following configuration for each etcd cluster member:
```
[input.prometheus::etcd-0]
disabled = false
type = kubernetes_prometheus
index =
host = etcd-0-peer.devkubernetes01
source = etcd
interval = 60s
endpoint.https = https://etcd-0-peer.devkubernetes01.autoip.dcos.thisdcos.directory:2379/metrics
tokenPath =
certPath = /etcd-cert/ca-crt.pem
clientCertPath = /etcd-cert/kube-apiserver-crt.pem
clientKeyPath = /etcd-cert/kube-apiserver-key.pem
insecure = false
includeHelp = false
output =

[input.prometheus::etcd-1]
disabled = false
type = kubernetes_prometheus
index =
host = etcd-1-peer.devkubernetes01
source = etcd
interval = 60s
endpoint.https = https://etcd-1-peer.devkubernetes01.autoip.dcos.thisdcos.directory:2379/metrics
tokenPath =
certPath = /etcd-cert/ca-crt.pem
clientCertPath = /etcd-cert/kube-apiserver-crt.pem
clientKeyPath = /etcd-cert/kube-apiserver-key.pem
insecure = false
includeHelp = false
output =

[input.prometheus::etcd-2]
disabled = false
type = kubernetes_prometheus
index =
host = etcd-2-peer.devkubernetes01
source = etcd
interval = 60s
endpoint.https = https://etcd-2-peer.devkubernetes01.autoip.dcos.thisdcos.directory:2379/metrics
tokenPath =
certPath = /etcd-cert/ca-crt.pem
clientCertPath = /etcd-cert/kube-apiserver-crt.pem
clientKeyPath = /etcd-cert/kube-apiserver-key.pem
insecure = false
includeHelp = false
output =
```
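Before applying the new configuration, you can sanity-check one of the endpoints with `curl`, assuming the three certificate files are still in your working directory (the URL is the example from this cluster):

```bash
curl --cacert ./ca-crt.pem \
     --cert ./kube-apiserver-crt.pem \
     --key ./kube-apiserver-key.pem \
     https://etcd-0-peer.devkubernetes01.autoip.dcos.thisdcos.directory:2379/metrics
```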
You can find the URLs of the `etcd` members in the configuration of the `kube-apiserver`:
```
kubectl describe --namespace kube-system pod kube-apiserver-kube-control-plane-0-instance.devkubernetes01.mesos | grep etcd-servers
      --etcd-servers=https://etcd-0-peer.devkubernetes01.autoip.dcos.thisdcos.directory:2379
```
And as the last step, mount the `etcd-cert` secret into the `collectorforkubernetes-addon` Deployment in `collectorforkubernetes.yaml`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: collectorforkubernetes-addon
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: collectorforkubernetes
        ...
        volumeMounts:
        ...
        - name: etcd-cert
          mountPath: /etcd-cert/
          readOnly: true
      volumes:
      ...
      - name: etcd-cert
        secret:
          secretName: etcd-cert
```
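After updating the manifest, re-apply it and watch the rollout; a minimal sketch using the names from this guide:

```bash
kubectl apply -f collectorforkubernetes.yaml
kubectl rollout status deployment/collectorforkubernetes-addon \
    --namespace collectorforkubernetes
```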
Now you have all the features of the Monitoring Kubernetes application, which will help you monitor the health of your Kubernetes clusters and the performance of your applications running on Kubernetes clusters deployed with Mesosphere DC/OS.