Graphite is a toolset for distributed metric collection from a nearly unlimited number of sources. Anything that can send an IP packet can report metrics to it. It consists of multiple components, each for a specialized task. There is carbon-relay, which receives metrics and forwards them in a configurable manner: it can be set up to spread metrics across other instances using a consistent hashing algorithm or to simply duplicate them. Another component is carbon-cache, which persists metrics to so-called whisper database files and holds as many of them as possible in a RAM cache to speed up queries. For quick and easy access to the metric data there is graphite-web, a powerful tool allowing simple graphing of almost everything using wildcards, overlaying and a rich set of functions, all usable through a very nice URL-based API.
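
Reporting a metric is as simple as writing a line of plain text to a carbon listener. The following minimal sketch pushes a single datapoint over the plaintext protocol; the hostname and port are assumptions and have to match your relay's line receiver settings.

    import socket
    import time

    # Minimal sketch: send one datapoint using carbon's plaintext protocol,
    # i.e. "<metric path> <value> <unix timestamp>\n" per line.
    CARBON_HOST = "graphite.example.com"  # assumption, use your relay address
    CARBON_PORT = 2003                    # default plaintext line receiver port

    line = "web.devweb01.system.cpu.iowait 0.42 %d\n" % int(time.time())
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5)
    sock.sendall(line.encode("ascii"))
    sock.close()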

After some months of beta-testing Graphite we decided to make it the backend for a new monitoring architecture, measuring nearly everything but alerting only on some of the metrics. This prevents redundant data gathering and enables us to do really nice things like self-learning anomaly detection on this huge amount of metrics.

While the testing environment consisted of two servers, both persisting all received metrics, we needed something bigger to handle the huge amount of data we wanted to collect. There were some requirements to fulfill:

  • geo-redundant data storage (saving metrics on two datacenter locations)
  • possibility to handle 500,000 metrics/min or bursts of 1,000,000 metrics/min
  • capability for wave-deployments of new software versions without impacting read or write access to the cluster
  • distributed i/o load to prevent massive load on single hypervisors
  • possibility to easily extend the cluster if necessary in the future

With respect to these requirements we tried out different setups, each with its own disadvantages. The only one matching all requirements is shown below.

A DNS-view-based service address, which resolves depending on the location of the querying system, points to the responsible load balancer. The load balancers route requests to the first level of carbon-relay instances, two of them per node. These first-level relays duplicate the data to their partner node in the other datacenter. The second-level relays are configured to distribute the metric data across a number of carbon-cache instances in the same location. This allows redundant and distributed storage of the data.
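
What follows is only a sketch of how the two relay levels could be configured; hostnames, ports and instance names (top, bottom, a-d) are assumptions for illustration, not our exact production config. The first-level relay duplicates every metric via relay rules to the local second-level relay and to the partner node in the other datacenter; the second-level relay spreads the load over the local carbon-cache instances using consistent hashing.

    # carbon.conf on node A1 (sketch) -- two relay instances overriding a base [relay] section
    [relay]
    LINE_RECEIVER_INTERFACE = 0.0.0.0
    PICKLE_RECEIVER_INTERFACE = 0.0.0.0

    # first-level relay: duplicate metrics via relay rules
    [relay:top]
    LINE_RECEIVER_PORT = 2003
    PICKLE_RECEIVER_PORT = 2004
    RELAY_METHOD = rules
    DESTINATIONS = 127.0.0.1:2014:bottom, node-b1.example.com:2014:bottom

    # second-level relay: consistent hashing over the local carbon-cache instances
    [relay:bottom]
    LINE_RECEIVER_PORT = 2013
    PICKLE_RECEIVER_PORT = 2014
    RELAY_METHOD = consistent-hashing
    REPLICATION_FACTOR = 1
    DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b, 127.0.0.1:2304:c, 127.0.0.1:2404:d

    # relay-rules.conf for the first-level relay: a single default rule that
    # sends every metric to both destinations listed above
    [default]
    default = true
    destinations = 127.0.0.1:2014:bottom, node-b1.example.com:2014:bottom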

[Figure: graphite cluster structure]

This is realized by separate config files for each instance, configuring different ports and the parameters needed for metric routing. Specialized multi-init-scripts iterate over the configured instances for service start/stop/status. All carbon-cache instances write into the same folder on a node. Using consistent hashing not only distributes the i/o load but also ensures that no two cache instances ever write to the same file at the same time: a metric (and its whisper data file) is always handled by the same cache instance on a node.
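
As a rough sketch, the per-instance configuration boils down to instance sections in carbon.conf plus an init loop; the ports and instance names below are assumptions and only have to match the DESTINATIONS of the second-level relay.

    # carbon.conf (sketch) -- multiple carbon-cache instances sharing one whisper folder
    [cache]
    LOCAL_DATA_DIR = /opt/graphite/storage/whisper/

    [cache:a]
    LINE_RECEIVER_PORT = 2103
    PICKLE_RECEIVER_PORT = 2104
    CACHE_QUERY_PORT = 7002

    [cache:b]
    LINE_RECEIVER_PORT = 2203
    PICKLE_RECEIVER_PORT = 2204
    CACHE_QUERY_PORT = 7102

    # ... [cache:c] and [cache:d] analogous ...

    # a multi-init-script essentially loops over the configured instances:
    for i in a b c d; do carbon-cache.py --instance=$i start; done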

On each node a graphite-web instance is running that knows all carbon-cache instances on the same node for access to local metrics. Furthermore, all graphite-web instances on the other nodes are configured as CLUSTER_SERVERS, to be queried if graphite-web cannot satisfy a request from its local carbon-cache instances. This ensures that every graphite-web instance is able to access all the data stored in the cluster, which allows read access to be load-balanced (round-robin).
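
A minimal sketch of the relevant local_settings.py entries on one node could look like this; the hostnames and cache-query ports are assumptions matching the instance sketch above.

    # graphite-web local_settings.py (sketch)
    # local carbon-cache instances, queried for datapoints still held in RAM:
    CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7102:b",
                        "127.0.0.1:7202:c", "127.0.0.1:7302:d"]
    # graphite-web instances on the other nodes, queried when the local caches
    # and whisper files cannot satisfy a request:
    CLUSTER_SERVERS = ["node-a2.example.com:80",
                       "node-b1.example.com:80",
                       "node-b2.example.com:80"]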

The following graph displays the total number of metrics received by the cluster (green line), meaning the metricsReceived metric summarized over all first-level relay instances. The red line represents the sum of metricsReceived over the carbon-cache instances. The red line should always be twice the green line because of the geo-redundant data duplication inside the cluster; in consequence 300,000 received metrics per minute require the caches to persist 600,000 metrics. The purple line is located directly behind the red line, which means that every metric received by the second-level relays is also persisted by the carbon-caches. These simple graphs tell a lot about the health of a graphite cluster.

[Figure: Graphite cluster default load]
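
Graphs like the one above can be built from carbon's self-reported metrics with the render API. The following URL is only a sketch; the exact instance naming under carbon.relays.* and carbon.agents.* depends on your hostnames and instance names.

    http://graphite.example.com/render?width=800&height=300
      &target=alias(sumSeries(carbon.relays.*-top.metricsReceived),"received (1st-level relays)")
      &target=alias(sumSeries(carbon.relays.*-bottom.metricsReceived),"received (2nd-level relays)")
      &target=alias(sumSeries(carbon.agents.*.metricsReceived),"received (carbon-caches)")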

If you take a closer look inside the cluster and the single instances, there are some interesting facts to find. Since carbon-relay hashes the metric name (e.g. web.devweb01.system.cpu.iowait) in its consistent hashing algorithm, there should be a roughly equal distribution of metrics over all carbon-cache instances, provided there is enough randomness in the metric names. But reality looks a little bit different. The following graph shows metricsReceived for all carbon-caches while sending random metrics from a simple performance testing tool.

[Figure: consistent hashing distribution over caches]
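
The uneven spread can be reproduced offline with a simplified re-implementation of the hash ring. The sketch below is modeled on carbon's ConsistentHashRing (md5-based positions on a 16-bit ring, 100 replicas per destination) but is not the original code; the instance names and the synthetic metric names are made up.

    import hashlib
    import random
    from bisect import bisect_left
    from collections import Counter

    REPLICAS = 100  # replica positions per destination, as in carbon's default

    def ring_position(key):
        # first 16 bits of the md5 digest, interpreted as an integer
        return int(hashlib.md5(key.encode()).hexdigest()[:4], 16)

    def build_ring(destinations):
        ring = []
        for dest in destinations:
            for i in range(REPLICAS):
                ring.append((ring_position("%s:%d" % (dest, i)), dest))
        return sorted(ring)

    def get_destination(ring, metric):
        # a metric belongs to the first ring position at or after its own hash
        index = bisect_left(ring, (ring_position(metric),)) % len(ring)
        return ring[index][1]

    caches = ["cache:%s" % c for c in "abcdefgh"]
    ring = build_ring(caches)
    counts = Counter(get_destination(ring, "test.host%06d.cpu.load" % random.randrange(10**6))
                     for _ in range(100000))
    for dest, n in sorted(counts.items()):
        print(dest, n)

Even with perfectly random metric names the per-cache counts come out noticeably uneven, because each destination only occupies a limited number of replica positions on a small 16-bit ring.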

  • Zdenek Pizl

    25 February 2014

    Hello,

    I am really interested in your experience with the proposed solution (http://adminberlin.de/graphite-scale-out/).

    There are some points I'd like to ask about:
    - what is the hardware specification of the physical nodes (A1, A2, B1, B2)?
    - do I understand correctly that the 1st layer of relay instances replicates data among the datacenters and the 2nd layer balances the load across the whisper storages?
    - what is your experience with the inter-datacenter replication? I am afraid of a huge network load, especially with your amount of metrics and datapoints.
    - are you going to publish some example configs of a node's relay and cache instances?

    Thank you, this blogpost is amazing. Best regards Z. Pizl

    • admin

      25 February 2014

      Hi,

      you're right. The 1st layer of relay instances is configured to duplicate incoming metrics using the relay-rules functionality of carbon-relay. The second layer of relays spans a consistent hash ring over the carbon-cache instances, distributing the write load. This scenario works "well" if you have proper hardware, but the system architecture is quite complex. We call it spider! With some HP G1 blade boxes with 24GB of memory and 2x240GB SAS (RAID1) we encountered limits at roughly 450k metric updates per minute. If there is no external influence the cluster runs fine, but as soon as anything happens it has no spare capacity to buffer. The new hardware setup is HP G7 blades with 64GB of memory and 2x400GB SSD! This performs really well with 600k metric updates per minute at the moment. If you look at the stats it seems that the nodes hold all the whisper files in the fs-cache, which is a huge advantage for performance. Flushing a lot of small writes to disk is also no problem any more with SSDs. Maybe having a lot of memory is enough, but having SSDs on top is quite awesome.

      If you start thinking about a big setup that needs to scale out, have a look at InfluxDB, a quite amazing project. I hope that there will be the ability to plug graphite-web on top of it as an alternative backend. (http://influxdb.org/)

      Greets,
      Marco

      • Zdenek Pizl

        24 March 2014

        Marco,

        thank you for your response. I am going to deploy a similar beast within our test cloud and will see what the load turns out to be.

        Regards, Zdenek Pizl.

  • Mike

    1 September 2014

    Hi, this is a very interesting setup. I tried something similar and ran into ugly performance problems with the graphite-web dashboards. I have graphite-web running on top of the carbon-cache servers (I have 4 carbon-cache servers which hold all metrics). My setup receives 500,000 metrics per minute and the traffic slows down the frontend. All CPUs (16) plus the complete RAM (48 GB) plus swap are 100% used because of the httpd processes. Did you experience a similar effect testing your setup?

    • Marco

      4 September 2014

      I've seen similar conditions where there are lots of requests against graphite-web, but not as severely. Can you give me some details about your setup? How many requests per second is your cluster seeing? Do you see iowait?

