After building a Graphite cluster capable of handling a huge amount of metrics, there is some maintenance to be done on it: updates, config changes and so on. No problem so far, but if nearly 100% uptime is required, some tricks are necessary. The strategy is to do wave deployments of updates delivered as RPM packages. This is done by yadt, an open-source toolset for datacenter automation. It iterates over the nodes, stopping the corresponding services, updating the RPMs and starting the services again. To make this work, the cluster has to be able to handle this procedure.
First tests showed that several thousand metrics were lost after updating the cluster. A deeper look inside and some days of investigation later, two problems were found:
- Stopping a cache or relay instance is done by the RHEL init functions in the init script, which send a TERM signal to the process, wait a few seconds and then follow up with a KILL signal. A few seconds is far too short for an instance under heavy load: the waiting time was not enough to flush the queued metrics, causing a loss of metrics.
- Stopping a relay does not stop the load balancer from sending traffic to the stopping instance, which increases the number of lost metrics.
After trying to modify the behavior of the init functions to wait long enough, we decided to remove them and implement the shutdown ourselves. The solution was to define a configurable queue_flush_wait_time, representing the maximum time to wait for the process to terminate properly. The procedure sends a TERM signal and checks every second whether the process has terminated; only when queue_flush_wait_time is reached does it send the KILL signal. This allows a graceful shutdown of cache and relay processes, with a guaranteed termination in case of failures.
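The stop procedure can be sketched in shell roughly like this (a minimal sketch, not the actual init script: the function name, the default timeout and the pid handling are assumptions, only queue_flush_wait_time is taken from above):

```shell
#!/bin/sh
# Maximum number of seconds to wait for the process to flush its
# queues and exit after SIGTERM (configurable, as described above).
QUEUE_FLUSH_WAIT_TIME=${QUEUE_FLUSH_WAIT_TIME:-300}

# stop_gracefully PID: send SIGTERM, poll once per second until the
# process is gone, escalate to SIGKILL after QUEUE_FLUSH_WAIT_TIME.
stop_gracefully() {
    pid=$1
    kill -TERM "$pid" 2>/dev/null

    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$QUEUE_FLUSH_WAIT_TIME" ]; then
            # Process did not terminate in time: force it.
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

The return code distinguishes a clean flush from a forced kill, so the init script can log which case occurred.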
To protect the relay instances from incoming metrics while they are stopping, iptables comes into play, closing the corresponding ports before stopping and opening them again afterwards. This was the first attempt at solving the problem of lost metrics and may be obsolete now that we wait long enough before killing the process. I will test it.
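A sketch of the iptables part could look like this (the port is the carbon line-receiver default, not necessarily the one this cluster uses, and the `IPT` indirection is only there so the sketch can be dry-run without root):

```shell
#!/bin/sh
# Relay line-receiver port (2003 is the carbon default; the actual
# cluster ports are an assumption here).
RELAY_PORT=${RELAY_PORT:-2003}
# Set IPT="echo iptables" to dry-run the commands without root.
IPT=${IPT:-iptables}

# Reject new metric traffic before stopping the relay, so the
# load balancer fails over to the remaining nodes.
close_port() {
    $IPT -I INPUT -p tcp --dport "$RELAY_PORT" -j REJECT
}

# Remove the rule again after the relay is back up.
open_port() {
    $IPT -D INPUT -p tcp --dport "$RELAY_PORT" -j REJECT
}
```

REJECT (instead of DROP) makes the load balancer's health check fail immediately rather than waiting for a timeout.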
After the modifications were made, I ran a wave deployment to see if there were any problems left. Verification was quite easy: spot-checking a few metrics for None values in the time range of the deployment, which would indicate a loss of metrics. Nothing was found. Another way to verify the process was to take a deeper look at some carbon metrics like metricsReceived for the caches and relays. The following picture displays these metrics during deployment. The green graph shows a small dip in total metrics received during the process, which is probably normal, since connections get terminated and clients wait some time before reconnecting. The purple graph drops on each node shutdown and rises massively afterwards, showing the queued metrics being replayed to the caches.
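The spot check can be scripted against the render API; a minimal sketch, assuming the JSON output format where each lost datapoint shows up as `[null, <timestamp>]` (the host and metric in the usage comment are placeholders):

```shell
#!/bin/sh
# Count null datapoints in a Graphite render API JSON response
# read from stdin; anything above 0 means lost metrics.
count_null_datapoints() {
    grep -o '\[null,' | wc -l
}

# Typical usage (host and metric are placeholders):
#   curl -s "http://graphite.example.com/render?target=some.metric&from=-1h&format=json" \
#       | count_null_datapoints
```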
A small print statement in the init script revealed the astonishing average of 230s per cache instance to terminate properly. This makes deployment a really long-running task, but that is much better than losing metrics, especially if your monitoring and alerting is largely based on Graphite data. 😉
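Such a measurement needs nothing more than a timestamp before and after the stop call; a minimal sketch (the wrapped stop command is whatever the init script calls, the function name here is made up):

```shell
#!/bin/sh
# Run the given stop command and print how long it took, in seconds.
time_stop() {
    start=$(date +%s)
    "$@"    # e.g. the init script's stop routine
    end=$(date +%s)
    echo "stopped in $((end - start))s"
}
```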