Troubleshooting
Verify configuration
Available from collectorforopenshift v5.2
Get the list of the pods
$ oc get pods -n collectorforopenshift
NAME                                           READY     STATUS    RESTARTS   AGE
collectorforopenshift-addon-857fccb8b9-t9qgq   1/1       Running   1          1h
collectorforopenshift-master-bwmwr             1/1       Running   0          1h
collectorforopenshift-xbnaa                    1/1       Running   0          1h
Considering that we have three different deployment types - the DaemonSet we deploy on masters (collectorforopenshift-master), the DaemonSet we deploy on non-master nodes (collectorforopenshift), and one Deployment addon (collectorforopenshift-addon) - verify one pod from each deployment (in the example below, change the pod names to the pods that are running on your cluster).
$ oc exec -n collectorforopenshift collectorforopenshift-addon-857fccb8b9-t9qgq -- /collectord verify
$ oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord verify
$ oc exec -n collectorforopenshift collectorforopenshift-xbnaa -- /collectord verify
For each command you will see an output similar to
Version = 5.2.176
Build date = 181012
Environment = openshift

General:
  + conf: OK
  + db: OK
  + db-meta: OK
  + instanceID: OK
      instanceID = 2LEKCFD4KT4MUBIAQSUG7GRSAG
  + license load: OK
      trial
  + license expiration: OK
      license expires 2018-11-12 15:51:18.200772266 -0500 EST
  + license connection: OK

Splunk output:
  + OPTIONS(url=https://10.0.2.2:8088/services/collector/event/1.0): OK
  + POST(url=https://10.0.2.2:8088/services/collector/event/1.0, index=): OK

Kubernetes configuration:
  + api: OK
  + pod cgroup: OK
      pods = 18
  + container cgroup: OK
      containers = 39
  x volumes root: FAILED
      failed to find any volumes under /rootfs/var/lib/minishift/openshift.local.volumes/
  + runtime: OK
      docker

Docker configuration:
  + connect: OK
      containers = 43
  + path: OK
  + cgroup: OK
      containers = 40
  + files: OK

CRI-O configuration:
  - ignored: OK
      kubernetes uses other container runtime

File Inputs:
  x input(syslog): FAILED
      no matches
  + input(logs): OK
      path /rootfs/var/log/
  x input(audit-logs): FAILED
      cannot access /rootfs/var/lib/origin/openpaas-oscp-audit/ (err = stat /rootfs/var/lib/origin/openpaas-oscp-audit/: no such file or directory)

System Input:
  + path cgroup: OK
  + path proc: OK

Network stats Input:
  + path proc: OK

Network socket table Input:
  + path proc: OK

Proc Input:
  + path proc: OK

Mount Input:
  + stats: OK

Prometheus input:
  + input(kubernetes-api): OK
  + input(webconsole): OK
  x input(etcd): FAILED
      failed to load metrics from specified endpoints [https://:2379/metrics]
  x input(controller): FAILED
      failed to load metrics from specified endpoints [https://127.0.0.1:8444/metrics]
  + input(kubelet): OK

Errors: 5
The number of errors is reported at the end. In our example we show output from minishift, where we see some invalid configurations:

- volumes root: FAILED - this version of minishift mounts the runtime to /var/lib/minishift/base/openshift.local.volumes, that is why we cannot find volume roots under /rootfs/var/lib/minishift/openshift.local.volumes/
- input(syslog) - minishift does not persist syslog output to disk, so we will not be able to see these logs in the application
- input(audit-logs) - we have not enabled audit logs, so we will not be able to see audit data in the application
- input(etcd) - etcd is embedded in the origin image
- input(controller) - the controller is embedded in the origin image
If you find an error in the configuration, such as an incorrect Splunk URL, after applying the change with kubectl apply -f ./collectorforopenshift.yaml you will need to recreate the pods. For that you can simply delete all of them in our namespace with kubectl delete pods --all -n collectorforopenshift; the workloads will recreate them.
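A minimal sketch of that sequence, assuming the YAML file name used during installation:

kubectl apply -f ./collectorforopenshift.yaml
kubectl delete pods --all -n collectorforopenshift
# the DaemonSets and the Deployment recreate the pods; watch them come back
kubectl get pods -n collectorforopenshift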
Describe command
Available from collectorforopenshift v5.12
When you apply annotations at the namespace, workload, configuration, and pod levels, it can be hard to track which annotations are applied to a Pod or Container. You can use the collectord describe command to see which annotations are used for a specific Pod. You can run this command from any collectord Pod on the cluster:
oc exec -n collectorforopenshift collectorforopenshift-master-4gjmc -- /collectord describe --namespace default --pod postgres-pod --container postgres
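A sketch of the full flow, reusing the postgres-pod example from above (replace the collectord pod name and the target namespace, pod, and container with values from your cluster):

oc get pods -n collectorforopenshift
oc exec -n collectorforopenshift collectorforopenshift-master-4gjmc -- /collectord describe --namespace default --pod postgres-pod --container postgres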
Collect diagnostic information
If you need to open a support case, you can collect diagnostic information, including performance data, metrics, and configuration (excluding the Splunk URL and token).
Please run all 4 steps to collect diagnostic information.
1. Collect internal diag information from the Collectord instance by running the following command
Available from collectorforopenshift v5.2
Choose the pod from which you want to collect the diag information.
The following command takes several minutes.
oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord diag --stream 1>diag.tar.gz
You can extract the tar archive to verify the information that we collect. We include information about performance, memory usage, basic telemetry metrics, an information file with the host Linux version, and basic information about the license.
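If you want to review what is included before sending the archive to support, you can list its contents with standard tar (a quick check, not a required step):

tar -tzf diag.tar.gz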
Since 5.20.400, performance information is not collected by default unless you include the --include-performance-profiles flag in the command.
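On 5.20.400 and later, if support asks for the performance profiles as well, the same command can be run with the flag added (a sketch reusing the pod name from the example above):

oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord diag --stream --include-performance-profiles 1>diag.tar.gz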
2. Collect logs
oc logs -n collectorforopenshift --timestamps collectorforopenshift-master-bwmwr 1>collectorforopenshift.log 2>&1
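If support asks for logs from every collectord pod, not only the one above, a small loop like this sketch collects them all (the file names are illustrative):

for pod in $(oc get pods -n collectorforopenshift -o name); do
  # write one log file per pod, keeping timestamps and stderr
  oc logs -n collectorforopenshift --timestamps "$pod" > "$(basename "$pod").log" 2>&1
done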
3. Run verify
Available since collectorforopenshift v5.2
oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord verify > verify.log
4. Prepare tar archive
tar -czvf collectorforopenshift-$(date +%s).tar.gz verify.log collectorforopenshift.log diag.tar.gz
Pod is not getting scheduled
Verify that daemonsets have scheduled pods on the nodes
oc get daemonset --namespace collectorforopenshift
If in the output the numbers under DESIRED, CURRENT, READY, or UP-TO-DATE are 0, something may be wrong with the configuration
NAME                           DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
collectorforopenshift          0         0         0         0            0           <none>          1m
collectorforopenshift-master   0         0         0         0            0           <none>          1m
You can run the following command to describe the current state of the daemonsets
$ oc describe daemonsets --namespace collectorforopenshift
In the output there will be two daemonsets. For each of them, the last lines show the events reported for that daemonset, for example
...
Events:
  FirstSeen  LastSeen  Count  From        SubObjectPath  Type     Reason        Message
  ---------  --------  -----  ----        -------------  ----     ------        -------
  2m         43s       15     daemon-set                 Warning  FailedCreate  Error creating: pods "collectorforopenshift-" is forbidden: unable to validate against any security context constraint: [provider anyuid: .spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed provider anyuid: .spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used securityContext.runAsUser: Invalid value: 0: UID on container collectorforopenshift does not match required range. Found 0, required min: 1000000000 max: 1000009999 provider restricted: .spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed provider restricted: .spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]
The error above means that you forgot to add the collectorforopenshift service account to the privileged security context constraint. Run the command
$ oc adm policy add-scc-to-user privileged system:serviceaccount:collectorforopenshift:collectorforopenshift
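To double-check that the service account is now listed on the privileged SCC, you can describe it and look for the service account in the Users field (a quick verification, not a required step):

$ oc describe scc privileged | grep collectorforopenshift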
Try to run describe again in a few moments (it can take up to a few minutes)
$ oc describe daemonsets --namespace collectorforopenshift
In the output you may still see the old events, but you should also see a new SuccessfulCreate event
Events:
  FirstSeen  LastSeen  Count  From        SubObjectPath  Type    Reason            Message
  ---------  --------  -----  ----        -------------  ----    ------            -------
  ...
  1m         1m        1      daemon-set                 Normal  SuccessfulCreate  Created pod: collectorforopenshift-55t61
Failed to pull the image
When you run the command
$ oc get daemonsets --namespace collectorforopenshift
you may find that the number under READY does not match DESIRED
NAMESPACE   NAME                    DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
default     collectorforopenshift   1         1         0         1            0           <none>          6m
Try to find the pods that OpenShift failed to start
$ oc get pods --namespace collectorforopenshift
If you see that the collectorforopenshift- pod has an ImagePullBackOff error, as in the example below
NAMESPACE   NAME                          READY     STATUS             RESTARTS   AGE
default     collectorforopenshift-55t61   0/1       ImagePullBackOff   0          2m
In that case you need to verify that your OpenShift cluster has access to the hub.docker.com registry or registry.connect.redhat.com, depending on which Configuration Reference you use.
You can run the command
$ oc describe pods --namespace collectorforopenshift
which should show output for each pod, including the events raised for every pod
Events:
  FirstSeen  LastSeen  Count  From                SubObjectPath                            Type     Reason      Message
  ---------  --------  -----  ----                -------------                            ----     ------      -------
  3m         2m        4      kubelet, localhost  spec.containers{collectorforopenshift}  Normal   Pulling     pulling image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:5.23.431"
  3m         2m        4      kubelet, localhost  spec.containers{collectorforopenshift}  Warning  Failed      Failed to pull image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:5.23.431": rpc error: code = 2 desc = unexpected http code: 500, URL: https://registry.connect.redhat.com/auth/realms/rhc4tp/protocol/docker-v2/auth?scope=repository%3Aoutcoldsolutions%2Fcollectorforopenshift%3Apull&service=docker-registry
  3m         1m        6      kubelet, localhost  spec.containers{collectorforopenshift}  Normal   BackOff     Back-off pulling image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:5.23.431"
  3m         1m        11     kubelet, localhost                                           Warning  FailedSync  Error syncing pod
Failed to pull image from registry.connect.redhat.com
Images in the Red Hat Container Catalog are listed under two registries: registry.access.redhat.com and registry.connect.redhat.com. Originally all images (including those provided by Red Hat and by partners) were in registry.access.redhat.com, but starting from the beginning of 2018 partner images are being moved to registry.connect.redhat.com. OpenShift Container Platform has very good support for the registry.access.redhat.com registry, but registry.connect.redhat.com currently lacks good documentation and good out-of-the-box support in OpenShift Container Platform.
If in the events for the pods you see that OpenShift failed to download the image from registry.connect.redhat.com because of authorization issues, you can fall back to the image from hub.docker.com, or you can authenticate with this registry and save the secret. See the Configuration Reference page for how to authenticate with registry.connect.redhat.com.
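As a rough sketch of the authenticated approach (the secret name redhat-registry-secret is only an example; the Configuration Reference page has the authoritative steps), you can create a pull secret and link it to the collectorforopenshift service account:

$ oc create secret docker-registry redhat-registry-secret \
    --docker-server=registry.connect.redhat.com \
    --docker-username=[redhat-username] \
    --docker-password=[redhat-user-password] \
    -n collectorforopenshift
$ oc secrets link collectorforopenshift redhat-registry-secret --for=pull -n collectorforopenshift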
Blocked access to external registries
If you block external registries (hub.docker.com or registry.connect.redhat.com) for security reasons, you can copy the image from the external registry to your own registry using one host that has access to the external registry.
Copying image from hub.docker.com to your own registry
$ docker pull outcoldsolutions/collectorforopenshift:5.23.431
After that, you can re-tag it by prefixing with your own registry
docker tag outcoldsolutions/collectorforopenshift:5.23.431 [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:5.23.431
And push it to your registry
docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:5.23.431
After that, you will need to change your configuration YAML file to specify that you want to use the image from a different location
image: [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:5.23.431
If you need to move the image between computers, you can export it to a tar file
$ docker image save outcoldsolutions/collectorforopenshift:5.23.431 > collectorforopenshift.tar
And load it on different docker host
$ cat collectorforopenshift.tar | docker image load
Copying image from registry.connect.redhat.com to your own registry
Log in to registry.connect.redhat.com using docker login and your Red Hat account
$ docker login registry.connect.redhat.com
Username: [redhat-username]
Password: [redhat-user-password]
Login Succeeded
Make sure to use your username, not your email, when you log in to this registry. Both allow you to log in, but if you logged in with the email, you will not be able to download the image.
$ docker pull registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:5.23.431
After that you can re-tag it by prefixing with your own registry
docker tag registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:5.23.431 [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:5.23.431
And push it to your registry
docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:5.23.431
After that, you will need to change your configuration YAML file to specify that you want to use the image from a different location
image: [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:5.23.431
If you need to move the image between computers, you can export it to a tar file
$ docker image save registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:5.23.431 > collectorforopenshift.tar
And load it on different docker host
$ cat collectorforopenshift.tar | docker image load
Pod is crashing or running, but you don't see any data
Get the Pod information
First, get information about the Pod (replace pod name with the one that is crashing)
oc get pod -n collectorforopenshift -o yaml collectorforopenshift-master-mshxd
If in the lastState you see something similar to
lastState:
  terminated:
    containerID: docker://8e9086aaf65b86d6d070f98ef4c5c59d9c838401a1f40765dd997723144d65db
    exitCode: 128
    finishedAt: "2022-10-16T05:58:13Z"
    message: path / is mounted on / but it is not a shared or slave mount
    reason: ContainerCannotRun
    startedAt: "2022-10-16T05:58:13Z"
You will need to modify how rootfs is mounted inside the Pod. In the collectorforopenshift.yaml file, find all mountPropagation: HostToContainer entries and comment them out.
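A sketch of what a commented-out entry could look like in the DaemonSet spec; the surrounding volumeMounts names differ between versions of collectorforopenshift.yaml, so treat this as an illustration only:

volumeMounts:
  - name: rootfs
    mountPath: /rootfs
    readOnly: true
    # mountPropagation: HostToContainer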
The only feature that will not work is the ability for Collectord to auto-discover volumes with the Application Logs. Please email us at support@outcoldsolutions.com so we can help you configure it properly.
Check Collectord logs
Start by looking at the logs of collectord; this is what normal output looks like
$ oc logs -f collectorforopenshift-gvhgw --namespace collectorforopenshift
INFO 2018/01/24 02:40:17.547485 main.go:213: Build date = 180116, version = 2.1.65
You are running trial version of this software. Trial version valid for 30 days. Contact sales@outcoldsolutions.com to purchase the license or extend trial. See details on https://www.outcoldsolutions.com
INFO 2018/01/24 02:40:17.553805 main.go:207: InstanceID = 2K69F0F36DFT7E1RDBL9MSNROC, created = 2018-01-24 00:29:18.635604451 +0000 UTC
INFO 2018/01/24 02:40:17.681765 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/24 02:40:17.681798 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/24 02:40:17.681803 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/24 02:40:17.682663 watcher.go:150: added file /rootfs/var/lib/docker/containers/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069-json.log
INFO 2018/01/24 02:40:17.682854 watcher.go:150: added file /rootfs/var/lib/docker/containers/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d-json.log
INFO 2018/01/24 02:40:17.683300 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/24 02:40:17.683357 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/24 02:40:17.683406 watcher.go:150: added file /rootfs/var/lib/docker/containers/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18-json.log
INFO 2018/01/24 02:40:17.683860 watcher.go:150: added file /rootfs/var/lib/docker/containers/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02-json.log
INFO 2018/01/24 02:40:17.683994 watcher.go:150: added file /rootfs/var/lib/docker/containers/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f-json.log
INFO 2018/01/24 02:40:17.684166 watcher.go:150: added file /rootfs/var/lib/docker/containers/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f-json.log
INFO 2018/01/24 02:40:17.685787 watcher.go:150: added file /rootfs/var/lib/docker/containers/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77-json.log
INFO 2018/01/24 02:40:17.686062 watcher.go:150: added file /rootfs/var/lib/docker/containers/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f-json.log
INFO 2018/01/24 02:40:17.687023 watcher.go:150: added file /rootfs/var/lib/docker/containers/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3-json.log
INFO 2018/01/24 02:40:17.944910 license_check_pipe.go:102: license-check openshift 1 1519345758 2K69F0F36DFT7E1RDBL9MSNROC 1516753758 1516761617 2.1.65 1516060800 true true 0
If you forget to set the url and token for the Splunk output, you will see
INFO 2018/01/24 05:08:14.254306 main.go:213: Build date = 180116, version = 2.1.65
Configuration validation failed [output.splunk]/url is required
If the connection to our license server fails, you will see that in the logs. If your containers and hosts do not have access to the internet, please contact us for a license that does not require internet access.
If the connection to your Splunk instances fails, you will see that in the logs as well.
If you don't see any mention of *-json.log files, but you have containers running, you possibly have the journald logging driver enabled instead of json-file. Please see our installation steps in Monitoring OpenShift Installation.
As an example
INFO 2018/01/25 02:51:21.749190 main.go:213: Build date = 180116, version = 2.1.65
You are running trial version of this software. Trial version valid for 30 days. Contact sales@outcoldsolutions.com to purchase the license or extend trial. See details on https://www.outcoldsolutions.com
INFO 2018/01/25 02:51:21.756258 main.go:207: InstanceID = 2K6ERLN622EBISIITVQE34PHA4, created = 2018-01-25 02:51:21.755847967 +0000 UTC m=+0.010852259
INFO 2018/01/25 02:51:21.910598 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/25 02:51:21.910909 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/25 02:51:21.910915 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/25 02:51:21.914101 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/25 02:51:21.914354 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/25 02:51:22.468489 license_check_pipe.go:102: license-check openshift 1 1519440681 2K6ERLN622EBISIITVQE34PHA4 1516848681 1516848681 2.1.65 1516060800 true true 0
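To check which logging driver a node is using (assuming the Docker runtime), you can run the following on the node; json-file is the value the collector expects:

$ docker info --format '{{.LoggingDriver}}'
json-file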
If you don't see any errors, but you don't see any data in the Monitoring OpenShift application, it is possible that you have specified an index other than main for the Splunk HTTP Event Collector token you use. In that case you can add this index as a default index for the Splunk role you are using, or change our macros in the application to prefix them with index=your_index. You can find the macros in the Splunk Web UI under Settings, Advanced Search, Search Macros.
As an example, for the macro macro_openshift_logs you will need to change the value from (sourcetype=openshift_logs) to (index=your_index sourcetype=openshift_logs). All our dashboards are built on top of these macros, so changing them should have an immediate effect on the application.
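To confirm that events are reaching your index after the change, a quick search like the following (where your_index is a placeholder for your actual index) should return container logs:

index=your_index sourcetype=openshift_logs | head 10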
Links
- Installation
  - Start monitoring your OpenShift environments in under 10 minutes.
  - Automatically forward host, container and application logs.
  - Test our solution with the embedded 30 days evaluation license.
- Collector Configuration
  - Collector configuration reference.
- Annotations
  - Changing index, source, sourcetype for namespaces, workloads and pods.
  - Forwarding application logs.
  - Multi-line container logs.
  - Fields extraction for application and container logs (including timestamp extractions).
  - Hiding sensitive data, stripping terminal escape codes and colors.
  - Forwarding Prometheus metrics from Pods.
- Audit Logs
  - Configure audit logs.
  - Forwarding audit logs.
- Prometheus metrics
  - Collect metrics from control plane (etcd cluster, API server, kubelet, scheduler, controller).
  - Configure collector to forward metrics from the services in Prometheus format.
- Configuring Splunk Indexes
  - Using not default HTTP Event Collector index.
  - Configure the Splunk application to use not searchable by default indexes.
- Splunk fields extraction for container logs
  - Configure search-time fields extractions for container logs.
  - Container logs source pattern.
- Configurations for Splunk HTTP Event Collector
  - Configure multiple HTTP Event Collector endpoints for Load Balancing and Fail-overs.
  - Secure HTTP Event Collector endpoint.
  - Configure the Proxy for HTTP Event Collector endpoint.
- Monitoring multiple clusters
  - Learn how you can monitor multiple clusters.
  - Learn how to set up ACL in Splunk.
- Streaming OpenShift Objects from the API Server
  - Learn how you can stream all changes from the OpenShift API Server.
  - Stream changes and objects from OpenShift API Server, including Pods, Deployments or ConfigMaps.
- License Server
  - Learn how you can configure remote License URL for Collectord.
- Monitoring GPU
- Alerts
- Troubleshooting
- Release History
- Upgrade instructions
- Security
- FAQ and the common questions
- License agreement
- Pricing
- Contact