As mentioned before, we run a four-node cluster for redundancy and sharding. A pair of nodes persists all metrics. To see if anything is wrong, a monitoring check tests for duplicate metrics in places where they shouldn’t be.
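The idea behind such a duplicate check can be sketched roughly as follows. This is only a minimal illustration, not our actual check: it assumes that the two nodes of a pair shard the metrics between them, so a metric name found on both nodes of the same pair counts as a duplicate. The node names, the metric names and the function `find_double_metrics` are made up for the example, and the listing of metric names per node is assumed to be collected elsewhere.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_double_metrics(
    metrics_by_node: Dict[str, Set[str]],
    pairs: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Return, per node pair, the metric names stored on both nodes.

    Assumption: within a pair each metric should live on exactly one node,
    so any name present on both nodes is reported as a duplicate.
    """
    doubles: Dict[Tuple[str, str], List[str]] = {}
    for node_a, node_b in pairs:
        # Nodes missing from the listing are treated as holding no metrics.
        overlap = metrics_by_node.get(node_a, set()) & metrics_by_node.get(node_b, set())
        if overlap:
            doubles[(node_a, node_b)] = sorted(overlap)
    return doubles


if __name__ == "__main__":
    # Hypothetical node and metric names, for illustration only.
    pairs = [("graphite-a1", "graphite-a2"), ("graphite-b1", "graphite-b2")]
    metrics_by_node = {
        "graphite-a1": {"servers.web01.load", "servers.web02.load"},
        "graphite-a2": {"servers.web02.load", "servers.db01.load"},
        "graphite-b1": {"servers.web01.load"},
        "graphite-b2": {"servers.web02.load", "servers.db01.load"},
    }
    doubles = find_double_metrics(metrics_by_node, pairs)
    for pair, names in doubles.items():
        print("duplicate metrics on", pair, ":", names)
    # A monitoring check would typically exit non-zero when duplicates exist.
    raise SystemExit(2 if doubles else 0)
```

The same metric appearing in both pairs is expected here, because that is the redundancy; only an overlap inside one pair is flagged.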

Once you have built a Graphite cluster that can handle a huge number of metrics, it needs some maintenance: updates, config changes and so on. That is no problem in itself, but if you need nearly 100% uptime, some tricks are necessary.

Graphite is a toolset for distributed metric collection from a nearly unlimited number of sources. After some months of beta-testing Graphite, we decided to make it the backend for a new monitoring architecture that measures nearly everything but alerts on only some of the metrics. This prevents redundant data gathering and makes it possible to do really nice stuff like self-learning anomaly detection on this huge number of metrics.