Marrying Cfengine and Salt for node data collection

At the University of Oslo [1], where I work, we have been using Cfengine3 [2] as the core component for continuous automation and maintenance of the Red Hat Enterprise Linux platform since around 2010. Cfengine really shines due to its flexibility, reliability, speed and low resource consumption: flexibility because of its context class system, reliability because of its distributed nature and few dependencies, speed and low resource consumption because it is written in C by developers who write efficient code. For these reasons Cfengine is my preferred choice for ensuring stable operation of critical infrastructure without worrying much about the promised outcome.

However, reporting, knowledge management and transparency remain a difficult challenge when we need to support design decisions about changing or improving the infrastructure policies. Automation tools like Cfengine are powerful, hence it is possible to make quite powerful mistakes too. Change = risk as always, but change supported by infrastructure knowledge poses less risk than change without this knowledge. Ultimately we would like to answer the question: “Will the change I am about to deploy work as intended, and will it not lead to unforeseen negative consequences for the operation?”. Infrastructure engineers around the world are constantly faced with this question. Tools like Cfengine (combined with code review tools, automatic testing and frequent small changes) are certainly a big step in the right direction, but ultimately we would like to remove all uncertainty regarding the consequences of deploying changes.

It is tempting to think that we could collect every aspect of our infrastructure, such as a component inventory and the relationships between all components. However, this does not scale. We need to find a balance between the information and knowledge we collect and the risk we are trying to address. As Mark Burgess points out in many eloquent ways in his book “In search of certainty” [3], we must accept that uncertainty will always exist and manage it as best we can. In order to minimise uncertainty, infrastructure should be built from self-repairing autonomous components that ensure stable and predictable behaviour, thus contributing to stability and predictability of the infrastructure at a larger scale. Cfengine is a tool which enables this behaviour.

But in the end it is us, the infrastructure engineers, the dev-ops, the software developers, who together must create and orchestrate all the different components in a manner that supports and contributes to a stable desired state, in spite of ever-changing desires driven by business goals. Changes in desires necessitate adaptation of infrastructure components, which leads to policy change, which again poses risk. The risk can be minimised by testing and by domain knowledge about the affected components. If we could collect some kind of feedback from the components, in the form of information about their current actual state and function, that could support and extend our domain knowledge.

Mike Svoboda of LinkedIn found himself in a similar situation [4], and therefore created the sysops-api [5] with Cfengine, Redis and Python. His method utilises Cfengine’s excellent capabilities for spreading activities over both time (splaytime) and space (select_class) in order to enable massive data collection from infrastructure components into Redis [6] data stores. The sysops-api can query the collected data across the whole infrastructure and answer the questions that need answering before a policy change. The tool is designed for LinkedIn’s particular needs. A couple of possible drawbacks with using this as a general approach:

  • A lot of data needs to be pulled from the Redis servers when querying for global infrastructure questions.
  • There is no encryption, although compressing the data obscures it somewhat. Encryption could of course easily be added by having the components encrypt the data too.

Lately I have become fascinated by ZeroMQ [7], its design philosophy and its adoption by different tools. ZeroMQ is basically a set of libraries for almost all known programming languages enabling efficient and reliable asynchronous network communication; network sockets with superpowers. I wondered whether I could leverage ZeroMQ to help build a tool for information collection and started looking around for helpful code examples. Then suddenly SaltStack [8] popped up at the top of the list. A parallel execution framework built on ZeroMQ; that looked interesting.

So I started reading documentation and installed a couple of salt minions and a salt master to test with. SaltStack is both a parallel execution system and a configuration management tool where you can specify a basic desired state in YAML files and have it adapted to context using what salt calls “grains”. Grains are somewhat similar to facts in Puppet and classes in Cfengine. The syntax of the configuration language seemed simple enough, however advanced adaptation to context seemed more cumbersome than in Cfengine, at least at first glance. Also, the installation pulled down a bunch of Python library dependencies (Salt is built in Python). Hence I still think Cfengine is the best alternative when it comes to stability and desired state outcome.

But that asynchronous parallel execution system looks interesting for information collection. Because of ZeroMQ, Salt differs from other parallel shells and similar tools. When minions start, they establish contact with their master to listen for commands. It does not matter if the master is not yet up; the minion will just silently sit and wait until the master is ready. The minions themselves do not open any additional network ports to listen for incoming connections.

Another interesting aspect was that Salt seemed to be using the same kind of PKI design as Cfengine for encrypting traffic and authenticating between master and minions. So if I already have a running Cfengine infrastructure with agents and server(s), would it be possible to reuse the Cfengine keys to establish trusted communication between salt minions and salt master? It turns out that the answer is yes!

First let me explain why it seems like a good idea to use salt for inventory collection.

  1. The minion has a lot of built in knowledge about structured data collection just by listing the grains that the minion detects.
  2. It seems fairly easy to extend salt with new execution modules for further data collection.
  3. The command output is structured as serialised data in the form of YAML, JSON or other possible outputs, hence it would require little effort to push the output straight into a document database like Elasticsearch, for instance.
  4. Asynchronous traffic; ZeroMQ manages data queues efficiently on both the client and the server side, making clients wait in turn until the server is ready to accept data.

If I could just enable salt on all Cfengine nodes I would instantly get basic inventory information and a generic way of collecting arbitrary information from hosts and get the results as structured data.
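
As a rough illustration of point 3, the sketch below turns a dictionary of grains per minion into the newline-delimited format accepted by the Elasticsearch bulk API. It is only a sketch under assumptions: the input shape (minion ID mapped to the grains returned by grains.items), the function name grains_to_bulk and the index name "inventory" are all hypothetical, and older Elasticsearch releases also expect a "_type" in the action line.

```python
import json


def grains_to_bulk(grains_by_minion, index="inventory"):
    """Turn {minion_id: grains} into newline-delimited bulk API lines.

    The index name and the input shape are assumptions made for this sketch.
    """
    lines = []
    for minion_id in sorted(grains_by_minion):
        # Action line: index the document under the minion ID
        lines.append(json.dumps({"index": {"_index": index, "_id": minion_id}}))
        # Source line: the grains dictionary itself
        lines.append(json.dumps(grains_by_minion[minion_id], sort_keys=True))
    # The bulk format requires a trailing newline after the last line
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the output
    sample = {"web01": {"os": "RedHat", "num_cpus": 4}}
    print(grains_to_bulk(sample), end="")
```

Using the minion ID as the document ID means that a later collection run overwrites the previous document for the same host instead of piling up duplicates.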

Minions that do not have a public master key will initially accept the master they are configured to contact. Minions that do have a public master key will not accept any master other than the one that matches that key. In order to make all minions securely trust the right master, we can make promises in Cfengine to place a copy of the salt master public key in the correct location on the minion. The master key must of course be copied from the salt master and placed in a Cfengine file repository that is available to the agents.

The salt master does not trust any minion before the minion’s key is accepted. Acceptance is a matter of running salt-key -a “hostname”. All this does is place the minion public key into the directory of accepted keys on the salt master. From this point on, the salt master is capable of executing modules on the minion and collecting the resulting data.

It turns out that Cfengine and salt are using exactly the same type of keypairs, hence it should be possible to:

  1. Copy or link the cf-agent private and public host keys to the salt-minion private and public key files.
  2. Copy the public agent keys from the Cfengine server to the accepted keys directory of the salt master.

… and hence get pre-authenticated and encrypted salt-stack communication by means of pre-existing Cfengine RSA keys.

Well, there are still a few obstacles to overcome. First of all, the cf-agent private key is password protected. This is a bit odd, since the password is publicly available in the Cfengine source code on github and thus adds no real security. The password is not needed for security anyway. The secure communication and trust of both salt-stack and Cfengine rely on the private key file never being disclosed outside of the owning host, not on the private key having a password. Salt minions expect a private key without a password. There are a couple of ways to fix this. The easiest is probably to make cf-agent promise to use the openssl command line client to make a passwordless copy of its private key and place it in the minion’s PKI directory. Another way would be to copy or symlink the cf-agent key and patch the minion Python code to know about the passphrase of the Cfengine key so it can be loaded into the minion daemon.

This can be done by altering the following in the file “/usr/lib/python2.6/site-packages/salt/crypt.py”:

         def get_keys(self):
             '''
             Returns a key objects for the minion
             '''
             # Make sure all key parent directories are accessible
             user = self.opts.get('user', 'root')
             salt.utils.verify.check_path_traversal(self.opts['pki_dir'], user)
             if os.path.exists(self.rsa_path):
    -            key = RSA.load_key(self.rsa_path)
    +            # Passphrase callback handed to RSA.load_key
    +            ppfunc = lambda x: "Cfengine passphrase"
    +            key = RSA.load_key(self.rsa_path, ppfunc)
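
The first option, the passwordless copy made with the openssl command line client, could be sketched roughly as follows. This is a minimal sketch and not tested against a real Cfengine key: the function names are hypothetical, the passphrase has to be supplied by the caller, and the paths in the usage comment are what I assume to be the default locations for the cf-agent key and the minion key.

```python
import os
import subprocess


def openssl_strip_command(src, dst, passphrase):
    """Build the openssl call that rewrites an RSA key without a passphrase."""
    return ["openssl", "rsa",
            "-in", src,
            "-passin", "pass:" + passphrase,
            "-out", dst]


def write_passwordless_copy(src, dst, passphrase):
    """Run openssl and restrict the resulting key file to its owner."""
    old_umask = os.umask(0o077)  # never create the copy world-readable
    try:
        subprocess.check_call(openssl_strip_command(src, dst, passphrase))
    finally:
        os.umask(old_umask)
    os.chmod(dst, 0o400)


# Hypothetical usage, assuming default key locations:
# write_passwordless_copy("/var/cfengine/ppkeys/localhost.priv",
#                         "/etc/salt/pki/minion/minion.pem",
#                         "Cfengine passphrase")
```

Passing the passphrase on the command line exposes it in the process list, which is only acceptable here because the passphrase is public anyway. In practice the same openssl call could be wrapped in a cf-agent commands promise instead of a Python script.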

The other obstacle is that the cf-agent public keys stored on the server side are in PKCS#1 format (PEM header -----BEGIN RSA PUBLIC KEY-----), while the salt master seems to expect a key in the “X.509 SubjectPublicKeyInfo” format (PEM header -----BEGIN PUBLIC KEY-----). I tried to convert the Cfengine keys using the “openssl” command line utility, however it seems that PKCS#1 is not an input format this utility understands. The following perl snippet will do the trick though:

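    use strict;
    use warnings;
    use Crypt::OpenSSL::RSA;
    # Read the whole PKCS#1 public key from standard input
    my $str = do { local $/; <STDIN> };
    my $rsa_pub = Crypt::OpenSSL::RSA->new_public_key($str);
    # Print the key in X.509 SubjectPublicKeyInfo format
    print $rsa_pub->get_public_key_x509_string;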
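
Where Crypt::OpenSSL::RSA is not at hand, the same conversion can be sketched in plain Python 3, since a SubjectPublicKeyInfo structure is essentially the PKCS#1 key wrapped together with an algorithm identifier for RSA. The function names below are hypothetical, and the sketch does not validate the key it wraps, so treat it as an illustration of the format difference rather than a hardened tool.

```python
import base64

# DER encoding of AlgorithmIdentifier { rsaEncryption (1.2.840.113549.1.1.1), NULL }
RSA_ALGORITHM_ID = bytes.fromhex("300d06092a864886f70d0101010500")


def der_length(n):
    """Encode a DER length field (short form below 128, long form otherwise)."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def pem_to_der(pem, label):
    """Strip the PEM armour with the given label and return the raw DER bytes."""
    begin = "-----BEGIN %s-----" % label
    end = "-----END %s-----" % label
    start = pem.index(begin) + len(begin)
    stop = pem.index(end)
    return base64.b64decode("".join(pem[start:stop].split()))


def pkcs1_to_spki(pem):
    """Wrap a PKCS#1 RSAPublicKey PEM in an X.509 SubjectPublicKeyInfo PEM."""
    pkcs1 = pem_to_der(pem, "RSA PUBLIC KEY")
    # BIT STRING: tag 0x03, length, zero unused bits, then the PKCS#1 key
    bit_string = b"\x03" + der_length(len(pkcs1) + 1) + b"\x00" + pkcs1
    body = RSA_ALGORITHM_ID + bit_string
    spki = b"\x30" + der_length(len(body)) + body  # outer SEQUENCE
    b64 = base64.b64encode(spki).decode("ascii")
    lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
    return ("-----BEGIN PUBLIC KEY-----\n" + "\n".join(lines)
            + "\n-----END PUBLIC KEY-----\n")
```

The converted key could then be written to the accepted keys directory on the salt master, in a file named after the minion ID, which is what step 2 above amounts to.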

The next steps will be to export the data into Elasticsearch [9] and use it for reporting and querying, in order to reduce uncertainty when changing Cfengine policies.

References:

[1] http://www.usit.uio.no/english/about/organisation/it-drift/gd/gid/
[2] https://github.com/cfengine
[3] http://markburgess.org/certainty.html
[4] https://www.youtube.com/watch?v=H1dVsSvKBlM
[5] https://github.com/linkedin/sysops-api
[6] http://redis.io/
[7] http://zeromq.org/
[8] https://github.com/saltstack
[9] http://www.elasticsearch.org/
