Collectord update - thruput and time correction
July 24, 2019

Today we have shipped an updated version of Collectord (version 5.10.252) that brings two features: configuration for thruput and time correction.
If you have been running your OpenShift, Kubernetes or Docker clusters for a while, you may have accumulated a lot of logs on the nodes. When you deploy Collectord, it will run as fast as it can (proving its outstanding performance), which may put a lot of load on your Splunk deployment. To let you control how this data is preloaded, we are providing two new features:
- Thruput - configure thruput at the global level (Collectord instance) or specifically for container or host logs.
- Time correction - configure the time range in which you want to forward logs. For example, you can forward only logs in the time range (-48 hours, +1 hour). All events outside of this time range will be ignored.
Thruput
First, you can configure the global thruput in the Collectord configuration. Under the section [general] you can find thruputPerSecond, which you can set, for example, to 256Kb. Collectord will apply this thruput to all the logs it ships from this node. Note that metrics shipped from this node do not count toward the thruput: we do not want to throttle metrics delivery, so that throttling does not trigger unwanted alerts.
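When picking a global limit, it can help to estimate how long a node will need to forward its accumulated logs. The Python sketch below is only an illustration and not part of Collectord: parse_size and drain_seconds are hypothetical helpers, the 1 GiB backlog is a made-up figure, and the sketch assumes that 1Kb means 1024 bytes.

```python
import re

# Assumption for this sketch: 1Kb = 1024 bytes, 1Mb = 1024Kb, and so on.
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size string such as '256Kb' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit.lower()]


def drain_seconds(backlog_bytes: int, thruput_per_second: str) -> float:
    """Seconds needed to forward a backlog at a fixed thruput limit."""
    return backlog_bytes / parse_size(thruput_per_second)


# Hypothetical node holding 1 GiB of old logs, with the 256Kb limit from the text.
seconds = drain_seconds(1024 ** 3, "256Kb")
print(f"{seconds:.0f} s (~{seconds / 3600:.1f} h)")  # prints "4096 s (~1.1 h)"
```

Under these assumptions the node needs a bit over an hour to catch up. New logs produced in the meantime are not accounted for in this estimate.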
For each container you can configure thruput independently, and for host logs you can configure thruput per set of files.
For example, if you configure thruputPerSecond under [input.files::logs], Collectord will apply a single thruput limit, shared by all the files that match the [input.files::logs] configuration.
If you configure thruputPerSecond under [input.files] (container logs), each container will have its own thruput. For example, if the node has two containers, one sending 100Kb per second and another 50Kb per second, and you have set thruputPerSecond to 80Kb, only the first container will be throttled to 80Kb, because the second produces less than 80Kb per second.
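The per-container behaviour in this example can be sketched in a few lines of Python. This illustrates the arithmetic only and is not Collectord's implementation; forwarded_rate and the container names are hypothetical.

```python
def forwarded_rate(produced_kb: float, limit_kb: float) -> float:
    """Rate actually forwarded for one container under its own limit."""
    return min(produced_kb, limit_kb)


# The two containers from the example, rates in Kb per second.
containers = {"container-a": 100, "container-b": 50}
limit_kb = 80  # thruputPerSecond = 80Kb under [input.files]

for name, produced in containers.items():
    sent = forwarded_rate(produced, limit_kb)
    status = "throttled" if produced > limit_kb else "not throttled"
    print(f"{name}: produces {produced}Kb/s, forwards {sent}Kb/s ({status})")
```

Because the limit applies to each container separately, the node as a whole may still forward more than 80Kb per second (130Kb per second in this sketch).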
For container logs you can also override this configuration with annotations. For example, you can apply collectord.io/logs-ThruputPerSecond: 50Kb.
Alerts for throttled logs
We are providing two different alerts. The first one will tell you if Collectord containers are producing WARN messages. The message will look similar to:
WARN 2019/07/24 18:53:00.815293 outcoldsolutions.com/collector/pipeline/pipes/thruput/pipe.go:70: pipeline is getting throttled - /rootfs/var/lib/docker/containers/b2aa6678086cbe2cd4ca374743a25e89225279db26ec34c7f4af8434b43b9b38 - maximum thruput = 10240 bytes per second
We produce this WARN message at most once a minute.
You can see these WARN messages with the alert "Collectord reports warnings or errors" in Splunk.
You will also know if logs are getting throttled from the alert "Warning: Increasing lag between event time and indexing time in container logs", where we compare the _time of an event to its _indextime and check whether the lag is growing.
Time correction
Similarly to thruput, you can configure which events are too old or too new to be forwarded to Splunk.
Under the section [general] in the configuration you can find two keys, tooOldEvents and tooNewEvents, which you can set to durations. For example:
[general]
...
# 168h = 7 days
tooOldEvents = 168h
# anything newer than 1 hour ahead is getting dropped
tooNewEvents = 1h
You can also configure these keys independently for container logs and host logs. For container logs, you can override these values with annotations:
annotations:
  collectord.io/logs-TooOldEvents: 24h
  collectord.io/logs-tooNewEvents: 30m
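The effect of these two keys can be pictured as a window around the current time. The Python sketch below is an illustration under stated assumptions, not Collectord's code: in_window is a hypothetical helper, the sample events are made up, and treating the window boundaries as inclusive is a guess.

```python
from datetime import datetime, timedelta, timezone


def in_window(event_time, now, too_old, too_new):
    """Return True if the event lies between now - too_old and now + too_new.

    Inclusive boundaries are an assumption of this sketch.
    """
    return now - too_old <= event_time <= now + too_new


# Values from the [general] example: tooOldEvents = 168h, tooNewEvents = 1h.
TOO_OLD = timedelta(hours=168)
TOO_NEW = timedelta(hours=1)

now = datetime(2019, 7, 24, 18, 0, tzinfo=timezone.utc)
samples = {
    "yesterday": now - timedelta(hours=24),
    "two weeks ago": now - timedelta(days=14),
    "two hours ahead": now + timedelta(hours=2),
}
for label, event_time in samples.items():
    verdict = "forward" if in_window(event_time, now, TOO_OLD, TOO_NEW) else "skip"
    print(f"{label}: {verdict}")
```

In this sketch only the event from yesterday is forwarded. The other two fall outside the (-168h, +1h) window and would be skipped.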
Alerts for time correction
If Collectord finds events that are too new or too old, it will raise a WARN message:
WARN 2019/07/24 18:28:15.516115 outcoldsolutions.com/collector/pipeline/pipes/timecorrection/pipe.go:88: skipping too old or too new events - /rootfs/var/lib/docker/containers/7bef94bc58965ff059f7989ad9ae7db0b123b9e60615ffb28055884b85664cd3 - events should be in the scope (-7h, +30m)
We produce this WARN message at most once a minute.
You can see these WARN messages with the alert "Collectord reports warnings or errors" in Splunk.
Upgrade
If you are on version 5.10, just upgrade the image to version 5.10.252. If you are on a previous version, please look at our upgrade instructions.