Speed up your CFEngine by using a RAM disk!

When using CFEngine daily with a heavy set of promises, you may one day run into the following problem, especially on older releases of CFEngine: the underlying databases get slow over time and consume a lot of I/O. On a “standard” machine this causes no harm, but on a heavily loaded one, like a log centralization machine, a VM hypervisor or a database host, CFEngine will suffer.

One solution we tested was using a tmpfs (RAM disk) for the CFEngine state directory.

RAM disks!?

Well, the idea is quite simple, actually: if you have a heavily I/O-loaded machine, and a tiny daemon that suffers from that I/O load while working on non-critical information weighing a few megabytes, why not keep this very same information directly in RAM?

This is where the concept of a RAM disk kicks in: generally speaking, it creates a special mount point on the machine, usable like any filesystem but storing its content directly in RAM. This way, in exchange for a fraction of your available memory, you get a small space to store files that does not touch the fixed storage's I/O resources at all, and it is blazing fast (see: https://forums.bukkit.org/threads/harddrive-vs-ramdisk-vs-tmpfs-benchmark.2710/).
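If you want to see the gap on your own hardware, a quick and dirty way is to time a write to disk against a write to a scratch tmpfs; the mount point, file paths and sizes below are arbitrary choices for the sake of the example:

# Create a throwaway 128 MB tmpfs and compare write throughput with dd
mkdir -p /mnt/ramtest
/bin/mount -t tmpfs -o size=128M tmpfs /mnt/ramtest
dd if=/dev/zero of=/var/tmp/ddtest bs=1M count=64 oflag=direct   # disk-backed write
dd if=/dev/zero of=/mnt/ramtest/ddtest bs=1M count=64            # RAM-backed write
rm -f /var/tmp/ddtest
umount /mnt/ramtest

dd prints the throughput of each run, so the difference is immediately visible.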

Of course, there is one drawback: the RAM disk's content is lost every time it is unmounted. It is purely volatile and should absolutely not be used for important data, unless you periodically synchronize said content to persistent storage like a hard drive or an SSD.

Here is an interesting practical application of the RAM disk approach for web browser profiles: https://wiki.archlinux.org/index.php/Profile-sync-daemon

As you can see, RAM disks are pretty much what we need to solve the I/O problem we had with CFEngine.

Using RAM disks with CFEngine

Now, straight to the point: I have a Linux machine on a heavily loaded hypervisor, and my CFEngine has a lot of promises to apply. Problem: the execution of the agent becomes slow, and using strace to find the bottleneck shows that /var/cfengine/state/cf_state.tcdb and /var/cfengine/state/cf_lock.tcdb are the culprits: a lot of time is spent storing things in these databases, while the hard disk is already struggling with the virtual machines' I/O.
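If you want to reproduce this kind of diagnosis, strace's -c switch gives a per-syscall time summary, which makes I/O-bound behavior stand out quickly; the agent path below is the one from my setup, adjust it to yours:

# Summarize time spent per system call, following child processes
strace -c -f /opt/rudder/bin/cf-agent -KI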

As explained in the CFEngine reference manual and documentation, this directory stores the data CFEngine needs to keep track of the machine's state: these files (cf_lock.tcdb and cf_state.tcdb) hold information about persistent classes and locks, as well as various internal data such as cf-monitord metrics and “last seen” data for peer communication. There is also a package list cache for package_method backends (software_packages.csv). You can take a look at the content of the databases using the “tchmgr” tool bundled with the all-inclusive packages from cfengine.com.
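For instance, to peek into the state database (this assumes tchmgr is in your PATH; otherwise use the copy shipped in the CFEngine package):

# Show record count and file size of the state database
tchmgr inform /var/cfengine/state/cf_state.tcdb
# Dump the keys it stores
tchmgr list /var/cfengine/state/cf_state.tcdb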

Thus, it is relatively harmless to store these files in a temporary place as long as the machine does not reboot often, unless you rely heavily on persistent classes or need long-term tracking of the latest peer exchanges via cf-key. The consequence of putting this directory on a RAM disk is that CFEngine will completely forget what it knows about the machine upon reboot, and the first run will behave as if you had used the -K switch (ignore locking).

On Linux, the easiest and quickest way to deploy a RAM disk is using tmpfs: http://fr.wikipedia.org/wiki/Tmpfs

Now, this is how you mount a tmpfs of 128 MB on /var/cfengine/state:

/bin/mount -t tmpfs -o size=128M,nr_inodes=2k,mode=0700,noexec,nosuid,noatime,nodiratime tmpfs /var/cfengine/state
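You can then check that the mount is in place and sized as expected:

df -h /var/cfengine/state
mount | grep /var/cfengine/state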

And voilà, that's all, folks. Now, you might want to add a cron job that periodically rsyncs this directory to another one, “just in case”.
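Something along these lines should do, in a cron file; the file name and destination directory are just examples, pick whatever suits your setup:

# /etc/cron.d/cfengine-state-backup: sync the volatile state to disk every 30 minutes
*/30 * * * * root /usr/bin/rsync -a --delete /var/cfengine/state/ /var/lib/cfengine-state-backup/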

Results

Well, the improvement is quite easy to measure with “/usr/bin/time sudo /opt/rudder/bin/cf-agent -KI”.

Before:

4.92 user
0.40 system
0:06.65 elapsed
80% CPU (0 avgtext + 0 avgdata 150544 maxresident)k
0 inputs + 1952 outputs (0 major + 27848 minor) pagefaults 0 swaps

After:

1.94 user
0.23 system
0:02.71 elapsed
80% CPU (0 avgtext + 0 avgdata 150528 maxresident)k
0 inputs + 1136 outputs (0 major + 27817 minor) pagefaults 0 swaps

Those results were obtained on a “calm” machine; the difference grows dramatically as I/O usage gets more intensive, especially as you approach your storage backend's limits. As you can see, the CPU usage is nearly the same while the execution time is drastically reduced. strace also shows that cf-agent now spends almost no time on I/O to the tcdb databases, whereas before it could block for a few milliseconds on each call (and there were quite a lot of them…)

Now, if you are satisfied with the results and would like to make this permanent, a simple entry appended to the end of /etc/fstab will take care of it:

# Tmpfs for the CFEngine state backend storage directory
tmpfs		/var/cfengine/state	tmpfs	size=128M,nr_inodes=2k,mode=0700,noexec,nosuid,noatime,nodiratime	0	0
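You can activate the new entry without rebooting, and, if you set up the rsync backup above, restore the saved state right after mounting (the paths are the ones used in the cron example, adjust them to yours):

# Mount the fstab entry, then repopulate the state from the last backup
/bin/mount /var/cfengine/state
/usr/bin/rsync -a /var/lib/cfengine-state-backup/ /var/cfengine/state/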

Well, it solved the problem for me, and I think the pros of this solution outweigh the cons 🙂 Has anyone tried another (maybe more elegant) solution?
