Friday, 24 August 2007

The case of the missing metrics...

Yesterday, I helped Phil install MonAMI on the Durham CE and update his Ganglia web front-end so it now has the nice graphs.

However, we hit a snag: a few of the metrics "disappear" every so often. This is most likely happening because gmond is losing the UDP (multicast) metric update messages. After "a while" (the DMAX value) without an update, gmond assumes that the metric is no longer being monitored and purges it. The purged metrics no longer have their data written to the RRD file by gmetad, leaving a gap in the graph.
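
To make the purge rule concrete, here is a minimal sketch in Python of the expiry check described above. It is not gmond's actual code: the function name `purge_expired`, the dictionary layout and the convention that a DMAX of 0 means "never expire" are all assumptions made for illustration.

```python
def purge_expired(metrics, now):
    """Return only the metrics that are still considered alive.

    `metrics` maps a metric name to a (last_update, dmax) pair, both in
    seconds. This layout is hypothetical, not gmond's real data structure.
    A dmax of 0 is treated here as "never expire" (an assumption).
    """
    alive = {}
    for name, (last_update, dmax) in metrics.items():
        # Keep the metric if it never expires, or if the most recent
        # update arrived within the last dmax seconds.
        if dmax == 0 or now - last_update <= dmax:
            alive[name] = (last_update, dmax)
    return alive
```

If enough consecutive updates for a metric are lost, `now - last_update` grows past DMAX and the metric is dropped, even though the sender is still happily reporting it.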

When we encountered this at Glasgow it was caused by incoming UDP packets overflowing gmond's network buffer. MonAMI's Ganglia plugin has a work-around: every 200 packets it will pause "a while" (100 ms by default). Looks like this isn't enough for Durham.
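
The work-around amounts to simple burst throttling. Below is a rough Python sketch of the idea, not MonAMI's actual implementation (MonAMI is not written in Python). The class name `ThrottledSender` and its parameters are hypothetical; only the figures of 200 packets and 100 ms come from the description above.

```python
import time


class ThrottledSender:
    """Pause briefly after every `burst` packets so the receiver can catch up."""

    def __init__(self, send, burst=200, pause=0.1, sleep=time.sleep):
        self._send = send      # callable that actually transmits one packet
        self._burst = burst    # packets to send between pauses
        self._pause = pause    # pause length in seconds (0.1 s = 100 ms)
        self._sleep = sleep    # injectable, so the pause can be faked in tests
        self._count = 0        # packets sent since the last pause

    def send(self, packet):
        self._send(packet)
        self._count += 1
        if self._count >= self._burst:
            # Give the receiver time to drain its network buffer.
            self._sleep(self._pause)
            self._count = 0
```

With a real UDP socket, `send` could be something like `lambda pkt: sock.sendto(pkt, (host, port))`. Note that a fixed pause is only a guess at how quickly the receiver drains its buffer, which would explain why a value that works for one site may be too short for another.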

The long-term solution is for someone to fix gmond: it should either be multithreaded (to stop gmetad downloads from blocking metric updates) or accept data using a reliable transport (e.g. TCP).

