Signing Docker images with Notary server

This is the second of 2 blogs that we will do something with Docker and Security. In the first blogpost (Click), we will start Clair and use a tool called clair-scanner to scan Docker images that are on your host. In the 2nd blogpost (This one) we will start a Registry and Notary Server|Signer to sign Docker images. Notary allows use to sign images and we can configure the Docker daemon to only start containers from signed images.

For both blogposts, we will be using a sample configuration from the following Github repository: https://github.com/dj-wasabi/clair-and-docker-notary-example

This repository contains a docker-compose.yml file and some necessary files that are needed to run the applications. The docker-compose file contains the following applications:

  • Docker Registry v2
  • Notary (Server)
  • Notary (Signer)
  • Notary (DB)
  • Clair (DB)
  • Clair
  • Clair-scanner

The rest of the files are configuration files specific to these applications and I provided some self-signed certificates. These SSL certificates can only used for demo purposes.

Before we continue, lets do a clone of the Github repository and make sure you have Docker and docker-compose installed and running.

Why do we want to make use of a Notary server? Once you have Docker running, you are able to download all kinds of Docker images and run them. Some of them are the official ones, like Debian, CentOS and/or Hashicorp’s Consul, but you can also download and run Docker images from some one else. But you don’t know for sure what is installed and running in an image when you download one. With the previous blog-post we used Clair which can help to find if there are vulnerabilities in an Docker image, but you don’t know if an image is tampered with.

Notary will not fix this problem, it doesn’t scan the image to see if it is tampered with, but with Notary we will be able to sign our own Docker images. When a specific environment variable is set, we can only use these signed Docker images to run on our host(s). If we do want to download a Docker image from for example Docker hub, it will provide an error message.

We need to prepare some things before we can start the containers. First we need to make sure we add some entries in the hosts file, so we can resolve 2 FQDN’s which are used for the Registry Server and for the Notary Server.

Add following to hosts file:

127.0.0.1     notary-server.example.local registry-server.example.local

Once we have done that, we have to copy a configuration file and the ca-root certificate to a directory in our home-dir.

mkdir -p ~/.notary && cp files/config/config.json files/certs/ca-root.crt ~/.notary

Now we are done preparing and we can start de containers. We start the Registry server and the Notary server. When starting the Notary server, we will also automatically start the Notary signer and the database. So don’t be confused when you see extra containers running.

docker-compose up -d notary-server registry-server

You can verify if Notary server is correctly started by executing the following command:

openssl s_client -connect notary-server.example.local:4443 -CAfile files/certs/ca-root.crt -no_ssl3 -no_ssl2

This will return some information about the SSL certificate that is configured for the Notary server. Example:

$ openssl s_client -connect notary-server.example.local:4443 -CAfile files/certs/ca-root.crt -no_ssl3 -no_ssl2
CONNECTED(00000005)
depth=1 C = EU, ST = Example, L = Example, O = Example, OU = Example, CN = ca.example.local, emailAddress = root@ca.example.local
verify return:1
depth=0 C = EU, ST = Example, O = Example, CN = notary-server.example.local
verify return:1
---
Certificate chain
 0 s:/C=EU/ST=Example/O=Example/CN=notary-server.example.local
   i:/C=EU/ST=Example/L=Example/O=Example/OU=Example/CN=ca.example.local/emailAddress=root@ca.example.local

So lets pull an image and retag it so we can push it later on to our newly started Registry server. Lets make sure the image does have a tag like latest or 1.2.1.

docker pull wdijkerman/clair-scanner
docker tag wdijkerman/clair-scanner registry.example.local:5000/wdijkerman/clair-scanner:latest

But don’t push it yet, we first have to make sure that we have set some environment variables before doing so.

export DOCKER_CONTENT_TRUST_SERVER=https://notary-server.example.local:4443
export DOCKER_CONTENT_TRUST=1

These 2 environment variables mean that we enable Docker Content Trust, so when we wants to do something with the image it will be checked with Notary server which is available on the provided URL.

The Registry server is configured with basic authentication, so we have to login first:

docker login registry.example.local:5000

Username: admin
Password: password

Now we are ready and we can now push our newly tagged image to the Docker Registry:

$ docker push registry.example.local:5000/wdijkerman/clair-scanner:latest
The push refers to repository [registry.example.local:5000/wdijkerman/clair-scanner]
4737f34f33f3: Pushed
5ff3301a32f4: Pushed
7bff100f35cb: Pushed
latest: digest: sha256:2f876d115399b206181e8f185767f9d86a982780785f13eb62f982c958151a32 size: 946
Signing and pushing trust metadata
You are about to create a new root signing key passphrase. This passphrase
will be used to protect the most sensitive key in your signing system. Please
choose a long, complex passphrase and be careful to keep the password and the
key file itself secure and backed up. It is highly recommended that you use a
password manager to generate the passphrase and keep it safe. There will be no
way to recover this key. You can find the key in your config directory.
Enter passphrase for new root key with ID 98a3a53:
Repeat passphrase for new root key with ID 98a3a53:
Enter passphrase for new repository key with ID 3dd4fb6:
Repeat passphrase for new repository key with ID 3dd4fb6:
Finished initializing "registry.example.local:5000/wdijkerman/clair-scanner"
Successfully signed registry.example.local:5000/wdijkerman/clair-scanner:latest

Because this is the first time we have pushed an image, it asks us to enter a passphrase for the root key and for the repository. Generate a passphrase and enter these with the push command.

Ok, so now the Docker image is pushed in our Registry server and it is signed by the Notary server. We can verify this by executing the next command:

 $ notary -s ${DOCKER_CONTENT_TRUST_SERVER} -d ~/.docker/trust list registry.example.local:5000/wdijkerman/clair-scanner
NAME      DIGEST                                                              SIZE (BYTES)    ROLE
----      ------                                                              ------------    ----
latest    2f876d115399b206181e8f185767f9d86a982780785f13eb62f982c958151a32    946             targets

Here you see that we have a single image pushed to the Notary server. The value in the DIGEST is the same as the Docker image ID.

Before we continue, in the home directory we have a hidden “.docker” directory. In one of the sub directories the keys that where generated with the first push a stored here. These are important, so make sure to backup these files. There is also a possibility to set some environment variables with the passphrase so you won’t have to backup these files, but couldn’t find them yet.

$ ls -l ~/.docker/trust/private/
total 16
-rw-------  1 wdijkerman  staff  477 Feb 23 20:00 3dd4fb64fbd1524884b02fefde0771d0708082c70201511f15580b42244f37cf.key
-rw-------  1 wdijkerman  staff  416 Feb 23 20:00 98a3a53715f98652478ab6cf0c58f56a720956cc405292a72fb7a97fb0fb4618.key

So lets remove the 2 images from the host so we can pull them later again.

docker image rm registry.example.local:5000/wdijkerman/clair-scanner:latest
docker image rm wdijkerman/clair-scanner

And now we will do 2 pulls, 1 from our Registry server to verify that it just works. After this, we download an image from Docker hub.

$ docker pull registry.example.local:5000/wdijkerman/clair-scanner:latest
Pull (1 of 1): registry.example.local:5000/wdijkerman/clair-scanner:latest@sha256:2f876d115399b206181e8f185767f9d86a982780785f13eb62f982c958151a32
sha256:2f876d115399b206181e8f185767f9d86a982780785f13eb62f982c958151a32: Pulling from wdijkerman/clair-scanner
cd784148e348: Pull complete
8297cc41e539: Pull complete
ef2f20c2497d: Pull complete
Digest: sha256:2f876d115399b206181e8f185767f9d86a982780785f13eb62f982c958151a32
Status: Downloaded newer image for registry.example.local:5000/wdijkerman/clair-scanner@sha256:2f876d115399b206181e8f185767f9d86a982780785f13eb62f982c958151a32
Tagging registry.example.local:5000/wdijkerman/clair-scanner@sha256:2f876d115399b206181e8f185767f9d86a982780785f13eb62f982c958151a32 as registry.example.local:5000/wdijkerman/clair-scanner:latest

And now we download an image from the Docker hub:

$ docker pull wdijkerman/clair-scanner:latest
Error: error contacting notary server: x509: certificate signed by unknown authority

Where you just as me happy that it failed? 🙂

As you see, the pull failed from the Docker hub, as the image is not registered with Notary server so there is now no chance of running an container from any other source than our own Registry server. Mission accomplished.

Links

Docker Content Trust

Docker Notary Server

Docker Registry Server

Scanning Docker images with CoreOS Clair

This is the first of 2 blogs that we will do something with Docker and Security. In the first blogpost (This one), we will start Clair and use a tool called clair-scanner to scan Docker images that are on your host. In the 2nd blogpost we will start a Registry and Notary Server|Signer to sign Docker images. Notary allows use to sign images and we can configure the Docker daemon to only start containers from signed images.

For both blogposts, we will be using a sample configuration from the following Github repository: https://github.com/dj-wasabi/clair-and-docker-notary-example

This repository contains a docker-compose.yml file and some necessary files that are needed to run the applications. The docker-compose file contains the following applications:

  • Docker Registry v2
  • Notary (Server)
  • Notary (Signer)
  • Notary (DB)
  • Clair (DB)
  • Clair
  • Clair-scanner

The rest of the files are configuration files specific to these applications and I provided some self-signed certificates. These SSL certificates can only used for demo purposes.

Before we continue, lets do a clone of the Github repository and make sure you have Docker and docker-compose installed and running.

Clair

Clair is an open source project for the static analysis of vulnerabilities in application containers (currently including appc and docker). Clair will analyze a layer to see if it finds any vulnerabilities. If vulnerabilities are found, Clair will provide information about the vulnerability. To let Clair scan these layers, we use a tool called “clair-scanner“. clair-scanner will get all layers from an Docker image on your host and provide these to Clair by uploading them 1-by-1. Once all layers have been scanned, the clair-scanner will provide the vulnerabilities (if there are any).

Lets start Clair by executing the following command:

docker-compose up -d clair

It will start a PostgreSQL container and the Clair container itself. Once Clair is started, it will fetch the vulnerabilities for the various operating systems that is configured in the file: files/config/clair-config.yaml in earlier mentioned repository. This might take a while (In my case it was 15 minutes).

The following is configured in earlier mentioned configuration file:

  updater:
    interval: 1m
    enabledupdaters:
      - debian
      - ubuntu
      - rhel
      - oracle
      - alpine
      - suse

As you see, Clair will download vulnerabilities information from the above mentioned operating systems.

Occasionally check the logfile of clair (docker logs -f clair) and see if you find the following log messages appear:

{"Event":"could not get NVD data feed hash","Level":"warning","Location":"nvd.go:137","Time":"2019-01-26 20:19:59.682956","data feed name":"2018","error":"invalid .meta file format"}
{"Event":"could not get NVD data feed hash","Level":"warning","Location":"nvd.go:137","Time":"2019-01-26 20:19:59.682956","data feed name":"2019","error":"invalid .meta file format"}

You’ll see them from year 2002 to 2019. Once these messages are logged, we are able to continue with scanning (a) Docker image(s).

Some basic Clair information

Some information about Clair while we are waiting.

When you want to scan an image, Clair expects to analyze each layer of a Docker image. This kind data needs to be POST’ed to the following endpoint: http://localhost:6060/v1/layers

Example of a POST data request (Can also be found on: https://coreos.com/clair/docs/latest/api_v1.html#layers)

{
  "Layer": {
    "Name": "523ef1d23f222195488575f52a39c729c76a8c5630c9a194139cb246fb212da6",
    "Path": "https://mystorage.com/layers/523ef1d23f222195488575f52a39c729c76a8c5630c9a194139cb246fb212da6/layer.tar",
    "Headers": {
      "Authorization": "Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.EkN-DOsnsuRjRO6BxXemmJDm3HbxrbRzXglbN2S4sOkopdU4IsDxTI8jO19W_A4K8ZPJijNLis4EZsHeY559a4DFOd50_OqgHGuERTqYZyuhtF39yxJPAjUESwxk2J5k_4zM3O-vtd1Ghyo4IbqKKSy6J9mTniYJPenn5-HIirE"
    },
    "ParentName": "140f9bdfeb9784cf8730e9dab5dd12fbd704151cf555ac8cae650451794e5ac2",
    "Format": "Docker"
  }
}

The most important keys are the Name, Path and the ParentName.

When an image has 3 layers (Layer A, is the base image lets say debian:latest, B is installing a package and C is adding a file), Clair expects first a POST request to the /v1/layers endpoint with the layer information of layer A. The Name should have the SHA256 value of layer “A”, and the Path should contain an URL on which the layer can be downloaded. Clair will download the layer and run the actual analysis.

When layer A is analysed, layer B should be uploaded to the endpoint. But now the ParentName should contain the layer SHA256 value of layer A. When layer C is analysed, the ParentName should contain the SHA256 value of Layer B and etc.

Yes, you read that correctly. Clair will download the layer from for example a Docker Registry. This means that an image should for example already been pushed to a Docker Registry. But this is a bit to late, as you should only push an image to a Docker Registry if it doesn’t contain any vulnerabilities. Here comes the clair-scanner tool into play. This tool will basically start a web server where Clair can download the layers from when analysing.

Once Clair is completely started we can continue. Lets download an Docker image:

docker pull wdijkerman/consul

(You can of course also download an other Docker image)

This is a very basic Alpine image running Consul and Python. So lets check that Docker image.

 $ docker-compose run --rm clair-scanner wdijkerman/consul
2019/01/26 19:43:52 [INFO]  Start clair-scanner
2019/01/26 19:43:55 [INFO]  Server listening on port 9279
2019/01/26 19:43:55 [INFO]  Analyzing 5491dce778832e33c284cd8185100e76d6daa18f8cbc32458c706776894127fc
2019/01/26 19:43:55 [INFO]  Analyzing 28a9cc8dcad2060c54ae345db266ad00e4d84b1f7526e5186f93844eb3bb426e
2019/01/26 19:43:56 [INFO]  Analyzing 746b97d6fd172bacbe51699e383b5a47ceb3d779c3580b9dd35dfb7bd4a72a83
2019/01/26 19:43:56 [INFO]  Analyzing 1e04d30b4435c531eafe3d3b17155f3f3f4a9b4874ca1f1d3115ad273db43d1e
2019/01/26 19:43:56 [INFO]  Analyzing c3453aa5ff961a1d1710c2f110a788d796a5456241c664489e65fd269f0e1687
2019/01/26 19:43:56 [INFO]  Image [wdijkerman/consul] contains NO unapproved vulnerabilities

It shows us that there are 5 layers in this image and no vulnerabilities where detected! (This was during time of writing this blogpost, it can always be the case when new vulnerabilities are found!)

This blogpost is not really successful if we only show things that are ok, so lets check an image that contains 1 or more vulnerabilities.

Lets check the postgres:latest image (It is part of the Clair installation if you where wondering why you have downloaded the image). So lets check that one.

 $ docker-compose run --rm clair-scanner postgres:latest
2019/01/26 19:25:46 [INFO]  Start clair-scanner
2019/01/26 19:25:54 [INFO]  Server listening on port 9279
2019/01/26 19:25:54 [INFO]  Analyzing 08bf86d6624450c487db18071224c88003d970848fb8c5b2b07df27e3f6869b2
2019/01/26 19:25:54 [INFO]  Analyzing f419c5f6b63090e31755da12d65829dfd90ac42b90c70a725fb5dc7856395fc7
2019/01/26 19:25:54 [INFO]  Analyzing 906fb3014e147615f2219607d99604bdc53d0a6cdb0f4886ebf99548df918073
2019/01/26 19:25:54 [INFO]  Analyzing 1439e9b10c58144ac2acb85fa9aab36127201d1b2550b45216a341fa32957d17
2019/01/26 19:25:54 [INFO]  Analyzing 75d637800b713ea9c0bcd3a19eed8c144598ef8477da147a50d10cd6e85d2919
2019/01/26 19:25:54 [INFO]  Analyzing d91867cee1db8a866d638ae1d66c8078abfd236cda83c7ba72a5d214c5c8c4a3
2019/01/26 19:25:54 [INFO]  Analyzing b7d5ec0a0cb0939be115288b61e074a17359f8eb283e0deab13d30b4c0a060e8
2019/01/26 19:25:54 [INFO]  Analyzing 130e7676deae310571cbc46260a81adfc9f0de8a8684bbc33077b12c388594b7
2019/01/26 19:25:54 [INFO]  Analyzing 2bc990d0b93546a555b6abd28c365a1383f58fb64fd36142c5a9a0cbd26131e2
2019/01/26 19:25:54 [INFO]  Analyzing f4dfa6837911fd604bedc0d96126c12b2209a87421a7a0f56b0781d507b0aca8
2019/01/26 19:25:54 [INFO]  Analyzing a9598ee0b475f1cafbe8f63d6c7243ca37da704b9496e2d08c164238e8d0be3c
2019/01/26 19:25:54 [INFO]  Analyzing 1622bb5b03dc3ce4a5af7f2f89c443ce749b54be0703b92d7822f8789cf79281
2019/01/26 19:25:54 [INFO]  Analyzing fb4daa3b039b8e9889bc9f3c675c4811c85d34746c8862c673fe3da1998ae08b
2019/01/26 19:25:54 [INFO]  Analyzing 0b6857f87b6965b43ed41cc7a54591b3697b6049603cbd1b760030045915e3de
2019/01/26 19:25:54 [WARN]  Image [postgres:latest] contains 86 total vulnerabilities
2019/01/26 19:25:54 [ERRO]  Image [postgres:latest] contains 86 unapproved vulnerabilities
+------------+-----------------------------+--------------+------------------------+--------------------------------------------------------------+
| STATUS     | CVE SEVERITY                | PACKAGE NAME | PACKAGE VERSION        | CVE DESCRIPTION                                              |
+------------+-----------------------------+--------------+------------------------+--------------------------------------------------------------+
| Unapproved | High CVE-2017-16997         | glibc        | 2.24-11+deb9u3         | elf/dl-load.c in the GNU C Library (aka glibc or libc6)      |
|            |                             |              |                        | 2.19 through 2.26 mishandles RPATH and RUNPATH containing    |
|            |                             |              |                        | $ORIGIN for a privileged (setuid or AT_SECURE) program,      |
|            |                             |              |                        | which allows local users to gain privileges via a Trojan     |
|            |                             |              |                        | horse library in the current working directory, related      |
|            |                             |              |                        | to the fillin_rpath and decompose_rpath functions.           |
|            |                             |              |                        | This is associated with misinterpretion of an empty          |
|            |                             |              |                        | RPATH/RUNPATH token as the "./" directory. NOTE: this        |
|            |                             |              |                        | configuration of RPATH/RUNPATH for a privileged program      |
|            |                             |              |                        | is apparently very uncommon; most likely, no such            |
|            |                             |              |                        | program is shipped with any common Linux distribution.       |
|            |                             |              |                        | https://security-tracker.debian.org/tracker/CVE-2017-16997   |
+------------+-----------------------------+--------------+------------------------+--------------------------------------------------------------+
| Unapproved | High CVE-2017-12424         | shadow       | 1:4.4-4.1              | In shadow before 4.5, the newusers tool could be             |
|            |                             |              |                        | made to manipulate internal data structures in ways          |
|            |                             |              |                        | unintended by the authors. Malformed input may lead          |
|            |                             |              |                        | to crashes (with a buffer overflow or other memory           |
|            |                             |              |                        | corruption) or other unspecified behaviors. This             |
|            |                             |              |                        | crosses a privilege boundary in, for example, certain        |
|            |                             |              |                        | web-hosting environments in which a Control Panel allows     |
|            |                             |              |                        | an unprivileged user account to create subaccounts.          |
|            |                             |              |                        | https://security-tracker.debian.org/tracker/CVE-2017-12424   |
+------------+-----------------------------+--------------+------------------------+--------------------------------------------------------------+

Oops. 86 vulnerabilities! Well, this Docker image might not be really safe to use but in the end, that is all up 2 you.

As you might see in the postgres example, there is a column “STATUS” in the output and all of them are “Unapproved“. Why is that? clair-scanner allows you to whitelist specific vulnerabilities when scanning images.

The clair-scanner tool has an exit code of 0 when no vulnerabilities are found and has an exit code of !=0 when vulnerabilities are found. So if you would run the clair-scanner as part of your CI pipeline, this would fail. However, there could be a reason to whitelist a vulnerability and the clair-scanner will not provide an exit code of !=0 when whitelisted vulnerabilities are found.

Example of a whitelist file.

generalwhitelist: #Approve CVE for any image
  CVE-2017-6055: XML
  CVE-2017-5586: OpenText
images:
  ubuntu: #Apprive CVE only for ubuntu image, regardles of the version
    CVE-2017-5230: Java
    CVE-2017-5230: XSX
  alpine:
    CVE-2017-3261: SE

So we have 2 CVE vulnerabilities that we whitelist, no matter what base Docker OS image is used. For Ubuntu we whitelist 2 CVE’s and 1 for Alpine. I’m not sure, but I would say the XML, OpenText is just a basic description for what package the CVE belongs to.

Summary

So with this blogpost we where able to start Clair and do some Docker image analysing with the tool clair-scanner. It showed us that the postgresql image contains some vulnerabilities. So now you can update your CI pipeline by adding a check to scan for vulnerabilities, before pushing the image to a Docker Registry. Next blogpost, we will start a secure Docker Registry and we will sign Docker images with the Notary Server and Signer tool.

Links

Clair
Clair-scanner

Continuous deployment of Ansible Roles

Ansible Logo

There are a lot of articles about Ansible with continuous deployment, but these are only about using Ansible as a tool to do continuous deployment. There is not much (Well, I can’t really find none) about continuous deployment on code changes in Ansible Roles/playbooks itself. But first: Why do you want to do that?

Well, it is very easy to make changes in a role or a playbook and deploy that to a (production) machine(s). I do hope that these changes are commited into the git repository (And pushed) so that changes on the host can be tracked back to the code. Hopefully you didn’t make any errors in the playbook or role so all will be fine during deployment and no unwanted downtime is caused, because nothing is tested.

When you are part of a team, this would be a downside of using Ansible. It is very easy to make changes to a playbook or a role locally and not commit it to the repository, deploy it to a production server and continue like it didn’t happen. You can make agreements on these kinds of procedures on when and how to execute playbooks, but you always have that coworker that don’t (or partly) want to follow procedures or just because of lack of time (“It has to be working this morning!” or just any other lame excuse to not test your code before deployment).

When the team/serverpark grows bigger and/or the company you work for matures and even has an SLA, you can’t just deploy any untested code anymore. You’ll have to make sure that changes you made to code is tested, like any other code. Application developers write unit tests on their code and the application is tested by either automated tests or by using test|qa team. Application development is not any different than writing software for your infrastructure, it all needs to be tested before you use it on production.

What I will describe in this blogpost is just a suggestion on how to do this. This might not be foolproof or maybe there are other or better ways on how todo this or … (Fill in some something other reason). As this is something that works for me, it might help you to create your own pipeline. YMMV.

I haven’t looked at all at Ansible Tower or the open sourced version, so it might be that parts or maybe all of what I am describing here can be done by Tower.

Before we do anyting, I’ll first describe how my Ansible setup looks like so we have some background before we do anything. All of my roles has their own git repository, including documentation and Jenkinsfiles. A Jenkinsfile is the Jenkins job configuration file that contains all steps that Jenkins will execute. Its the .travis.yml (Of Travis CI) file equivalent of Jenkins and we will come back later to this. I also have 1 git repository that contains all ansible data, like host_vars, group_vars and the inventory file.

I have a Jenkins running with the Docker plugin and once a job is started, a Docker container will be started and the job will be executed from this container. Once the Job is done (Succeeded or Failed doesn’t matter which), the container and all data in this container is removed.

Jenkins Jobs

All my Ansible roles has 3 jenkinsfiles stored in the git repository for the following actions:

  1. Molecule Tests
  2. Staging deployment
  3. Production deployment

Molecule Tests

The first job is that the role is tested with Molecule. With Molecule we create 1 or more Docker containers and the role is deployed to these containers. Once that is done, we do an idempotent check and with TestInfra we verify if installation/configuration is done correctly. We can also execute some commands to verify that the deployed service is running correctly. Once these tests are completed, we can successfully deploy the ansible role without any problems. (On this page I have described some information on Molecule.)

How does the Jenkinsfile looks like:

node() {
    try {
        stage ("Get Latest Code") {
            checkout scm
            sh 'git rev-parse HEAD > .git/commit-id'
        }
        stage ("Install Application Dependencies") {
            sh 'sudo pip install --upgrade ansible==${ANSIBLE_VERSION} molecule==${MOLECULE_VERSION} docker'
        }
        stage ("Executing Molecule lint") {
            sh 'molecule lint'
        }
        stage ("Executing Molecule create") {
            sh 'molecule create'
        }
        stage ("Executing Molecule converge") {
            sh 'molecule converge'
        }
        stage ("Executing Molecule idemotence") {
            sh 'molecule idempotence'
        }
        stage ("Executing Molecule verify") {
            sh 'molecule verify'
        }
        stage('Tag git'){
            def commit_id = readFile('.git/commit-id').trim()
            withEnv(["COMMIT_ID=${commit_id}"]){
                sh '''#!/bin/bash
                if [[ $(git tag | grep ${COMMIT_ID} | wc -l) -eq 1 ]]
                    then    echo "Tag already exists"
                    else    echo "Tag will be created"
                            git config user.name "jenkins"
                            git config user.email "jenkins@localhost"
                            git tag -a $COMMIT_ID -m "Added tagging"
                            git push --tags
                fi
                '''
            }
        }
        stage('Start Staging Job') {
            def commit_id = readFile('.git/commit-id').trim()
            withEnv(["COMMIT_ID=${commit_id}"]){
                build job: 'ansible-access-2-staging', wait: false, parameters: [string(name: 'COMMIT_ID', value: "${COMMIT_ID}") ]
            }
        }
    } catch(all) {
        currentBuild.result = "FAILURE"
        throw err
    }
}

First stage of the Job is the checkout of the sourcecode of the git repository, so that we have data in the container. We get the latest git commit id, because I use this id to create a tag in git once the Molecule Tests succeeds.

First Molecule action is the lint. First we do some linting on the role and test files to make sure it is compliant. If it find some errors, it fails quickly and we can fix it. Then it proceeds with the Molecule actions create, converge, idemptence and verify. For those who are familiar with Molecule will notice that I use different stages for each action and not 1 stage which executes molecule test.

Stages overview of Jenkins job.

I use separate stages with single commands so I can quickly see on which part the job fails and focus on that immediately without going to the console output and scrolling down to see where it fails. After the Molecule verify stage, the Tag git stage is executed. This will use the latest commit id as a tag, so I know that this tag was triggered by Jenkins to run a build and was successful.

Last stage in the job is to start the 2nd job in Jenkins. This stage will start the job ansible-access-2-staging with the COMMIT_ID as parameter to the job and in the background (wait: false).

Currently, the Molecule configuration only has 1 “default” scenario. If I had more scenarios than the Jenkinsfile had probably a lot more stages or maybe more Jenkinsfiles.

Staging deployment

The first job was executed correctly and now this job is triggered. As mentioned before, the commit id of the previous job is passed into this job. The goal for this job is to deploy the role to an staging server and validate if everything is still working correctly. In this case we will execute the same tests on the staging staging as we did with Molecule, but we can also create an other test file and use that. In my case, there is only one staging server but it could also be a group of servers.

The Jenkinsfile for this job looks like this:

node() {
    try {
        stage ("Get the Code") {
            checkout scm: [$class: 'GitSCM', branches: [[name: "refs/tags/${params.COMMIT_ID}"]], extensions: [[$class: 'RelativeTargetDirectory', relativeTargetDir: 'ansible-access']], userRemoteConfigs: [[url: 'ssh://git@192.168.1.206:2222/ansible/access.git']]]
            checkout scm: [$class: 'GitSCM', branches: [[name: '*/master']], extensions: [[$class: 'RelativeTargetDirectory', relativeTargetDir: 'environment']], doGenerateSubmoduleConfigurations: false, userRemoteConfigs: [[url: 'ssh://git@192.168.1.206:2222/ansible/environment.git']]]
            sh 'pwd > workspace'
        }
        stage ("Install Application Dependencies") {
            sh 'sudo pip install --upgrade ansible==${ANSIBLE_VERSION} testinfra docker'
        }
        stage ("Execute role on host(s)") {
            dir("environment") {
                sh "ansible-playbook -i hosts -l staging playbooks/ansible-access.yml"
            }
        }
        stage ("Test Role execution") {
            workspace = readFile('workspace').trim()
            withEnv(["WORKSPACE_DIR=${workspace}", "MOLECULE_INVENTORY_FILE=${workspace}/environment/hosts"]){
                dir("environment/") {
                    sh "testinfra --connection=ansible --ansible-inventory=hosts --hosts=staging ${WORKSPACE_DIR}/ansible-access/molecule/default/tests/test_default.py --verbose"
                }
            }
        }
        stage('Tag git'){
            withEnv(["COMMIT_ID=${params.COMMIT_ID}_staging"]){
                dir("ansible-access/") {
                    sh '''#!/bin/bash
                    if [[ $(git tag | grep "${COMMIT_ID}" | wc -l) -eq 1 ]]
                        then    echo "Tag already exists"
                        else    echo "Tag will be created"
                                git config user.name "jenkins"
                                git config user.email "jenkins@localhost"
                                git tag -a $COMMIT_ID -m "Added tagging"
                                git push --tags
                    fi
                    '''
                }
            }
        }
        stage('Start Production Job') {
            build job: 'ansible-access-3-production', wait: false, parameters: [string(name: 'COMMIT_ID', value: "${params.COMMIT_ID}") ]
        }
    } catch(all) {
        currentBuild.result = "FAILURE"
        throw err
    }
}

The first stage is to checkout 2 git repositories: The Ansible Role and the 2nd is my “environment” repository that contains all Ansible data and both are stored in their own sub directory. With the Ansible role we checkout the provided tag refs/tags/${params.COMMIT_ID}. I also had to configure the url for the git repositories. Last step is to create a file that holds the output of the pwd file. We need this location in a later stage.

The 2nd Stage is to install the required applications, so not very interesting. The 3rd stage is to execute the playbook. In my “environment” repository (That holds all Ansible data) there is a playbooks directory and in that directory contains the playbooks for the roles. For deploying the ansible-access role, a playbook named ansible-access.yml is present and will be use to install the role on the host:

---
- hosts: all:!localhost
  become: True
  roles:
    - role: ansible-access

Very basic/simple. The 4th stage is to execute the Testinfra test script from the molecule directory to the staging server to verify the correct installation/configuration. In this case I used the same tests as Molecule, but I could also create a seperate file with some extra or other tests to verify the correct behaviour of the host.

And when all tests are complete, we create a new tag. In this job we create a new tag ${params.COMMIT_ID}_staging and push it so we know that the provided tag is deployed to our staging server.

With the last stage, we start the 3rd and last job, the job to deploy the role on the rest of the servers.

Production deployment

This is the job that deploys the Ansible role to the rest of the servers. This Jenkinsfile looks almost the same as the previous one, but with a few exceptions.

node() {
    try {
        stage ("Get the Code") {
            checkout scm: [$class: 'GitSCM', branches: [[name: "refs/tags/${params.COMMIT_ID}"]], extensions: [[$class: 'RelativeTargetDirectory', relativeTargetDir: 'ansible-access']], userRemoteConfigs: [[url: 'ssh://git@192.168.1.206:2222/ansible/access.git']]]
            checkout scm: [$class: 'GitSCM', branches: [[name: '*/master']], extensions: [[$class: 'RelativeTargetDirectory', relativeTargetDir: 'environment']], doGenerateSubmoduleConfigurations: false, userRemoteConfigs: [[url: 'ssh://git@192.168.1.206:2222/ansible/environment.git']]]
            sh 'pwd > workspace'
        }
        stage ("Install Application Dependencies") {
            sh 'sudo pip install --upgrade ansible==${ANSIBLE_VERSION} testinfra docker'
        }
        stage ("Execute role on host(s)") {
            dir("environment") {
                sh "ansible-playbook -i hosts -l 'all:!localhost:!staging' playbooks/ansible-access.yml"
            }
        }
        stage ("Test Role execution") {
            workspace = readFile('workspace').trim()
            withEnv(["WORKSPACE_DIR=${workspace}", "MOLECULE_INVENTORY_FILE=${workspace}/environment/hosts"]){
                dir("environment") {
                    sh "testinfra --connection=ansible --ansible-inventory=hosts --hosts='all:!localhost:!staging' ${WORKSPACE_DIR}/ansible-access/molecule/default/tests/test_default.py --verbose"
                }
            }
        }
        stage('Tag git'){
            withEnv(["COMMIT_ID=${params.COMMIT_ID}_production"]){
                dir("ansible-access") {
                    sh '''#!/bin/bash
                    if [[ $(git tag | grep "${COMMIT_ID}" | wc -l) -eq 1 ]]
                        then    echo "Tag already exists"
                        else    echo "Tag will be created"
                                git config user.name "jenkins"
                                git config user.email "jenkins@localhost"
                                git tag -a $COMMIT_ID -m "Added tagging"
                                git push --tags
                    fi
                    '''
                }
            }
        }
    } catch(all) {
        currentBuild.result = "FAILURE"
        throw err
    }
}

With the 3rd stage “Execute role on host(s)” we use an different limit. We now use all:!localhost:!staging to deploy to all hosts, but not to localhost and staging. Same is for the 4th stage, for executing the tests. As the last stage in the job, we create a tag ${params.COMMIT_ID}_production and we push it. Once we see this tag in our repository, we know that the changes is installed correctly on all servers.

Keep in mind that this can only be successful if you use proper and correct tests. You’ll really need to be sure that your tests is covering all of the components that is changed by your role. This deployment will fail or succeed with the quality of your tests.

Good luck and if you have suggestions please let me know.

Using Molecule V2 to test Ansible Roles

Its been a few weeks now since Molecule V2 was released. So lets go into some details with Molecule V2 and lets upgrade my dj-wasabi.zabbix-agent role to Molecule V2 during this blogpost.

For those who are unfamiliar with Molecule: Molecule allows you to development and test Ansible Roles. With Molecule, 1 or more Docker containers are created and the Ansible role is executed on these Docker containers (You can also configure Vagrant and some other providers). You can then verify if the role is installed/configured correctly in the container. Is the package installed by Ansible, is the service running, is the configuration file correctly placed with the correct information etc etc.

This would allow you to increase reliability and stability of your Role. For almost all of my publicly available Ansible Roles, I have tests configured. If someone makes an Pull Request on Github with a change, these tests will help me to see if the change won’t break anything and thus the Pull Request can easily been merged. If not, some more attention to the change is needed.

This might be obvious, but if you do not have Molecule installed or already have something installed lets update it to the latest version (As moment of writing 2.0.3):

pip install —upgrade molecule

When we execute the –version:

$ molecule --version
molecule, version 2.0.3

We see the version: 2.0.3

Porting

The people behind Molecule have created a page for porting a role that is already configured with Molecule V1 to port it to Molecule V2. As the page mentions about a python script and doing it manually, we use the manually option for migrating the role to Molecule V2.

I have created a git branch (port_molecule_v2) on my mac and will execute the first command that is described on the porting guide:

(environment) wdijkerman@Werners-MacBook-Pro [ ~/git/ansible/ansible-zabbix-agent -- Tue Sep 05 13:16:00 ]
(port_molecule_v2) $ molecule init scenario -r ansible-zabbix-agent -s default -d docker
--> Initializing new scenario default...
Initialized scenario in /Users/wdijkerman/git/ansible/zabbix-agent/molecule/default successfully.
(environment) wdijkerman@Werners-MacBook-Pro [ ~/git/ansible/ansible-zabbix-agent -- Tue Sep 05 13:16:02 ]
(port_molecule_v2) $

This command will create a “default” scenario. The biggest improvement of using Molecule V2 is using scenarios. You can use 1 “default” scenario or you might want to use 5 scenario’s. Its completely up to you on how you want to test your role.

The init command has created a new directory called “molecule”. This directory will contain all scenario’s:

(environment) wdijkerman@Werners-MacBook-Pro [ ~/git/ansible/ansible-zabbix-agent -- Tue Sep 05 13:22:22 ] (port_molecule_v2) $ tree molecule
molecule
└── default
    ├── Dockerfile.j2
    ├── INSTALL.rst
    ├── create.yml
    ├── destroy.yml
    ├── molecule.yml
    ├── playbook.yml
    └── tests
        └── test_default.py

2 directories, 7 files

Here you see the “default” scenario we just created earlier. This scenario contains several files. We will discuss some files later on this post.

Back to the porting guide. The 2nd option on the porting guide is to move the current testinfra tests to the file molecule/default/tests/test_default.py. So lets move all the tests (And I mean only the tests and not the other testinfra specific code) from one file to the other. Keep the contents of the new test_default.yml in place, as this is needed for Molecule.

The 3rd option on the porting guide is for ServerSpec, as we don’t use this we will skip this and continue with the 4th option. The 4th option on the porting guide is to port the old molecule.yml file to the new one. Now its get interesting.

The current default molecule.yml file in the scenario/default directory:

---
dependency:
  name: galaxy
driver:
  name: docker
lint:
  name: yamllint
platforms:
  - name: instance
    image: centos:7
provisioner:
  name: ansible
  lint:
    name: ansible-lint
scenario:
  name: default
verifier:
  name: testinfra
  lint:
    name: flake8

It will end like this:

---
dependency:
  name: galaxy
driver:
  name: docker
lint:
  name: yamllint

platforms:
  - name: zabbix-agent-centos
    image: milcom/centos7-systemd:latest
    groups:
      - group1
    privileged: True
  - name: zabbix-agent-debian
    image: maint/debian-systemd:latest
    groups:
      - group1
    privileged: True
  - name: zabbix-agent-ubuntu
    image: solita/ubuntu-systemd:latest
    groups:
      - group1
    privileged: True
  - name: zabbix-agent-mint
    image: vcatechnology/linux-mint
    groups:
      - group1
    privileged: True
provisioner:
  name: ansible
  lint:
    name: ansible-lint
scenario:
  name: default
verifier:
  name: testinfra
  lint:
    name: flake8

platforms

The platforms is a generic configuration approach to configure the instances in Molecule V2. With Molecule V1, you’ll had a docker configuration, a vagrant configuration etc etc for configuring the instances, but with V2 you only have platforms.

In the above example I have configured 4 instances, named zabbix-agent-centos, zabbix-agent-debian, zabbix-agent-ubuntu and zabbix-agent-mint. The all have an image configured and I have placed them in the group1 group. I don’t do anything with the groups with this Role, but lets add them anyways. I also added the “privileged: True”, because the role does use systemd and needs a privileged container to execute successfully. Later in this blog post we do something with dependencies and some Ansible configuration, so don’t run away just yet. 😉

The 5th option in the porting guide is to port the existing playbook.yml to the new playbook.yml in the default directory. So I’ll move the contents from one file to an other file.

As 6th and last option in the porting guide is to cleanup the old stuff. So remove the old files and directories and we can continue with the molecule test command.

Lets execute it.

(port_molecule_v2) $ molecule test
--> Test matrix
    
└── default
    ├── destroy
    ├── dependency
    ├── syntax
    ├── create
    ├── converge
    ├── idempotence
    ├── lint
    ├── side_effect
    ├── verify
    └── destroy
--> Scenario: 'default'
--> Action: 'destroy'
    
    PLAY [Destroy] *****************************************************************
    
    TASK [Destroy molecule instance(s)] ********************************************
    changed: [localhost] => (item=(censored due to no_log))
    
    PLAY RECAP *********************************************************************
    localhost                  : ok=1    changed=1    unreachable=0    failed=0
    
    
--> Scenario: 'default'
--> Action: 'dependency'
Skipping, missing the requirements file.
--> Scenario: 'default'
--> Action: 'syntax'
    
    playbook: /Users/wdijkerman/git/ansible/zabbix-agent/molecule/default/playbook.yml
    
--> Scenario: 'default'
--> Action: 'create'

There is a lot of output now which I won’t add now, but just take a look at the beginning which I pasted above this line. During the output it shows which scenario is executing and which task. You can see that the line begins with “–> Scenario: ” and with “–> Action: “.

This is why Molecule V2 is awesome:

Molecule V2 uses Ansible itself to create the instances on which we want to install/test our Ansible role. You can see that by opening the create.yml file in the default directory. If we just place the last task in this blogpost

- name: Create molecule instance(s)
  docker_container:
    name: "{{ item.name }}"
    hostname: "{{ item.name }}"
    image: "molecule_local/{{ item.image }}"
    state: started
    recreate: False
    log_driver: syslog
    command: "{{ item.command | default('sleep infinity') }}"
    privileged: "{{ item.privileged | default(omit) }}"
    volumes: "{{ item.volumes | default(omit) }}"
    capabilities: "{{ item.capabilities | default(omit) }}"
  with_items: "{{ molecule_yml.platforms }}"

This last task in the create.yml file will create the actual Docker instance which we have configured in the molecule.yml file in the “platforms” section, which you can see at the “with_items” option. This is very cool, this means that you can configure the docker container with all the settings that Ansible allows you to use and Molecule will not limit this for you.

You can easily add for example the “oom_killer” option to the create.yml playbook and add it to the platform configuration in molecule.yml, without adding an feature request at Molecule and waiting when the feature is implemented.  Not that the waiting was long, the people behind Molecule are very fast fixing issues and adding features, so kuddo’s to them!

As you have guessed already, the create.yml file is for creating the instanced and destroy.yml will destroy those instances. You can override this if you don’t like the names.

This is an example if you really want to use other names for the playbooks (Or if you want to share playbooks when you have multiple scenario’s):

provisioner:
  name: ansible
  options:
    vvv: True
  playbooks:
    create: ../playbook/create-instances.yml
    converge: playbook.yml
    destroy:../playbook/destroy-instances.yml

Back to the molecule test command. The molecule test command fails on my first run during the lint action. (I will not show all output, as the list is very long!)

--> Scenario: 'default'
--> Action: 'lint'
--> Executing Yamllint on files found in /Users/wdijkerman/git/ansible/zabbix-agent/...
    /Users/wdijkerman/git/ansible/zabbix-agent/defaults/main.yml
      7:37      warning  too few spaces before comment  (comments)
      10:17     warning  truthy value is not quoted  (truthy)
      15:81     error    line too long (120 > 80 characters)  (line-length)
      21:81     error    line too long (106 > 80 characters)  (line-length)
      30:30     warning  truthy value is not quoted  (truthy)
      31:26     warning  truthy value is not quoted  (truthy)
    
    /Users/wdijkerman/git/ansible/zabbix-agent/handlers/main.yml
      8:11      warning  truthy value is not quoted  (truthy)
      8:14      error    no new line character at the end of file  (new-line-at-end-of-file)

Some of these messages is something I can work with, some I actually do not care. The output shows you every “failing” rule with the file. So the first file, defaults/main.yml has 6 failing rules. Per rule it shows you the following:

  • which line and character position
  • Type of error (warning or error)
  • The message

In my output of the lint actions, I see a lot of “line too long” messages. Personally I find the 80 characters limit a little bit to small these days, so lets update it to something higher. We first have to update the molecule.yml file and we have to update the lint section. First the lint section looked like this:

lint:
  name: yamllint

Now configure it like so it looks like this:

lint:
  name: yamllint
  options:
    config-file: molecule/default/yaml-lint.yml

We specify the yamllint by configuring a configuration file. Lets create the file yaml-lint.yml in the default directory and add something like this:

---

extends: default

rules:
  line-length:
    max: 120
    level: warning

We extend the current yaml-lint configuration by adding some of our own rules to overwrite the defaults. In this case, we overwrite the “line-length” rule to set the max to 120 characters and we set the level to warning (It was error). Every rule that results in an error will fail the lint action and in this case I don’t want to fail the tests because the line length was 122 characters.

When we run it again (I have fixed some other linting issues now, so output is a little different)

--> Scenario: 'default'
--> Action: 'lint'
--> Executing Yamllint on files found in /Users/wdijkerman/git/ansible/zabbix-agent/...
    /Users/wdijkerman/git/ansible/zabbix-agent/defaults/main.yml
      15:81     warning  line too long (120 > 80 characters)  (line-length)
      21:81     warning  line too long (106 > 80 characters)  (line-length)
    
    /Users/wdijkerman/git/ansible/zabbix-agent/molecule/default/create.yml
      9:81      warning  line too long (87 > 80 characters)  (line-length)
      10:81     warning  line too long (85 > 80 characters)  (line-length)
      16:81     warning  line too long (116 > 80 characters)  (line-length)
      30:81     warning  line too long (92 > 80 characters)  (line-length)
      33:81     warning  line too long (124 > 80 characters)  (line-length)

It keeps showing the “line too long”,  but as a warning and the lint action continues working. After this, the verify works too and the test is done!

Well, now I can commit my changes and push them to GitHub and lets Travis verify that it works. (Will not discuss that here).

group_vars

The zabbix-agent role doesn’t have any group_vars configured, but some of my other roles have group_vars configured in Molecule. Lets give a basic example of configuring the pizza property in the group_vars.

We have to update the provisioner section in molecule.yml:

provisioner:
  name: ansible
  lint:
    name: ansible-lint
  inventory:
    group_vars:
      group1:
        pizza: "Yes Please"

Here we “add” a property named pizza for all hosts that are in the group “group1”. If we had configured this with earlier on with the zabbix-agent role, all of the configured instances had access to the pizza property.

What if we have multiple scenarios and all use the same group_vars? We can create in the git_root directory of the role a directory named inventory and this has 1 or 2 subdirectories: group_vars and host_vars (if needed). To make the pizza property work, we create a file inventory/group_vars/group1 and add

---
pizza: "Yes Please"

Then we update the provisioner section in molecule.yml:

provisioner:
  name: ansible
  inventory:
    links:
      group_vars: ../../../inventory/group_vars/
      host_vars: ../../../inventory/host_vars/

Is this awesome or not?

Dependencies

This is almost the same as with Molecule V1, but with Molecule V2 the file should be present in the specific scenario directory (In my case molecule/default/) and should have the name requirement.yml.

The file requirement.yml is still in the same format as how it was (As this is specific to Ansible and not Molecule ;-))

---
- src: geerlingguy.apache
- src: geerlingguy.mysql
- src: geerlingguy.postgresql

If you want to add some options, you can do that by changing the dependency section of molecule.yml:

dependency:
  name: galaxy
  options:
    ignore-certs: True
    ignore-errors: True

With Molecule V1, there was a possibility to point to a requirements file, with Molecule V2 not.

ansible.cfg

This file is not needed anymore, we can all do this with the provisioner section in molecule.yml. So we don’t have to store the ansible.cfg and point it to the molecule.yml file like how it was with Molecule V1.

Lets say we have an ansible.cfg with the following contents:

[defaults]
library = Library

[ssh_connection]
scp_if_ssh = True

We can easily do this by updating the provisioner section to this:

provisioner:
  name: ansible
  config_options:
    defaults:
      library: Library
    ssh_connection:
      scp_if_ssh: True

TL;DR

Just upgrade to Molecule V2 and have fun! This is just awesome.

@Molecule coders: Thank you for this awesome version!

Automatically generate PKI certificates with Vault

A while a go I wrote an item on how to setup a secure Vault with Consul as backend and its time to do something with Vault again. With this blogpost we will setup Vault with the PKI backend. With the PKI backend we can generate or revoke short lived ssl certificates with Vault.

The goal with this blogpost is that we create intermediate CA certificate, configure Vault and generate certificates via the cmd line and via the API. The reason we use intermediate CA certificate is that if something might happen with the certificate/key, its much easier to revoke it and recreate a new intermediate certificate. If this would happen with the actual ROOT CA, you’ll have a troubles and work to fix it again. So keep the ROOT CA files on a safe place!

Preparations

We will create an intermediate certificate that Vault will be using to create and sign certificate requests. We have to create a new key and the certificate needs to be signed by the ROOT CA. First we create the key:

openssl genrsa -out private/intermediate_ca.key.pem 4096

And now we need to create a certificate signing request:

openssl req -config intermediate/openssl.cnf -new -sha256 \
-key private/intermediate_ca.key.pem -out \
csr/intermediate_ca.csr.pem

We have to make sure that we fill in the same information as the original CA, but in this case we use a slightly different Organisation Unit name so we know/verify that a certificate is signed by this intermediate CA instance. Once we filled in all data, we have to sign it with the ROOT CA to create the actual certificate:

openssl ca -keyfile private/cakey.pem -cert \
dj-wasabi.local.pem -extensions v3_ca -notext -md \
sha256 -in csr/intermediate_ca.csr.pem -out \
certs/intermediate_ca.crt.pem
Using configuration from /etc/pki/tls/openssl.cnf
Check that the request matches the signature
Signature ok
Certificate Details:
        Serial Number: 18268543712502854739 (0xfd86e7b7336db453)
        Validity
            Not Before: Aug 23 13:56:08 2017 GMT
            Not After : Aug 21 13:56:08 2027 GMT
        Subject:
            countryName               = NL
            stateOrProvinceName       = Utrecht
            organizationName          = dj-wasabi
            organizationalUnitName    = Vault CA
            commonName                = dj-wasabi.local
            emailAddress              = ikben@werner-dijkerman.nl
        X509v3 extensions:
            X509v3 Subject Key Identifier: 
                93:46:3D:69:24:32:C7:11:C4:B7:27:66:89:67:FB:1F:8E:1B:50:97
            X509v3 Authority Key Identifier: 
                keyid:60:63:7E:0F:54:5E:7D:A5:37:A8:6F:BD:27:BF:73:15:56:B2:89:31

            X509v3 Basic Constraints: 
                CA:TRUE
Certificate is to be certified until Aug 21 13:56:08 2027 GMT (3650 days)
Sign the certificate? [y/n]:y

1 out of 1 certificate requests certified, commit? [y/n]y
Write out database with 1 new entries
Data Base Update

With the -keyfile and -cert we provide the key and crt file of the root CA to sign the new intermediate ssl certificate. Ok, 10 years might be a little bit to long, but this is just for my local environment and my setup probably won’t last that long. 🙂

We are almost done with the preparations and one thing we need to do before we go configuring Vault. We have to combine both the CA certificates and the intermediate private key into a single file, before we can upload it to Vault.

cat certs/intermediate_ca.crt.pem dj-wasabi.local.pem \
private/intermediate_ca.key.pem > certs/ca_bundle.pem

First we print the contents of the newly created crt file, then the ROOT ca crt file and as last the intermediate private key and place that all in a single file called ca_bundle.pem.

Vault

Now we are ready to continue with the Vault part. We open a terminal to the host/container running Vault and before we can do somehting, we have to authenticate ourself first. I use the root token for authenticating:

export VAULT_TOKEN=<_my_root_token_>

The pki backend is disabled at default so we have to enabled it before we can use it. You can enable it multiple times, each enabled backend can be used for a specific domain. In this post we only use one domain, but lets pretend we need to create a lot more after this so we don’t use “defaults” in paths and naming.

We will mount the pki plugin for the dj-wasabi.local domain, so lets use the path: dj-wasabi. We give it a small description and then we specify the pki backend and then hit enter.

vault mount -path=dj-wasabi -description="dj-wasabi Vault CA" pki

There are some more options we don’t use for now with this example but maybe you want some more control for it, you can see them by executing the command: vault mount –help.
We can verify that we have mounted the pki backend by executing the vault mounts command:

bash-4.3$ vault mounts
Path        Type       Accessor            Plugin  Default TTL  Max TTL    Force No Cache  Replication Behavior  Description
cubbyhole/  cubbyhole  cubbyhole_2540c354  n/a     n/a          n/a        false           local                 per-token private secret storage
dj-wasabi/  pki        pki_6e5dc562        n/a     system       system     false           replicated            dj-wasabi Vault CA
secret/     generic    generic_fb0527dd    n/a     system       system     false           replicated            generic secret storage
sys/        system     system_347beff9     n/a     n/a          n/a        false           replicated            system endpoints used for control, policy and debugging

Now its time to upload the intermediate bundle file. I have temporarily placed the file in the config directory of Vault (Its a host mount, so it was easier to copy the file to the container) and now we have to upload it to our dj-wasabi backend. We have to upload our ca bundle file into the path we earlier used to mount the pki backend: <mount_path>/config/ca, in my case it is dj-wasabi/config/ca:

vault write dj-wasabi/config/ca \
pem_bundle="@/vault/config/ca_bundle.pem"
Success! Data written to: dj-wasabi/config/ca

If you get an error now, it probably means something went wrong with either creating the ca bundle file or validating the intermediate certificate.

Now we need to set some correct urls. These urls are placed in the certificates that are generated and that allows browsers/applications to do some validations. We will set the following urls:

  • issuing_certificates: The endpoint on which browsers/3rd party tools can request information about the CA;
  • crl_distribution_points: The endpoint on which the Certification Revocation List is available. This is a list with revoked Certificates;
  • ocsp_servers: The url on which the OCSP service is available. OCSP Stands for Online Certificate Status Protocol and is used to determine the state of the Certificate. You can see it as a better version of the Certificate Revocation List;

Lets configure the urls:

vault write dj-wasabi/config/urls \
issuing_certificates="https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/ca" \
crl_distribution_points="https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/crl" \
ocsp_servers="https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/ocsp"
Success! Data written to: dj-wasabi/config/urls

We will come later on the blog post about this. 🙂

Before we can generate certificates, we need to create a role in Vault. With this role we map a name to a policy. This policy describes the configuration that is needed for generating the certificates. For example we have to configure on which domain we need create the certificates, can we create sub domains and most important, what is the ttl of a certificate.

vault write dj-wasabi/roles/dj-wasabi-dot-local allowed_domains="dj-wasabi.local" allow_subdomains="true" max_ttl="72h"
Success! Data written to: dj-wasabi/roles/dj-wasabi-dot-local

We are all set now, so lets create a certificate.

We specify the just created role and at minimum we have to provide the common_name (In this case small-test.dj-wasabi.local). You can find here all the options you can give when generating a certificate. The command looks like this:

vault write dj-wasabi/issue/dj-wasabi-dot-local common_name=small-test.dj-wasabi.local
Key             	Value
---             	-----
ca_chain        	[-----BEGIN CERTIFICATE-----
MIIFtTCCA52gAwIBAgIJAP2G57czbbRTMA0GCSqGSIb3DQEBCwUAMFcxCzAJBgNV
...
-----END CERTIFICATE-----
issuing_ca      	-----BEGIN CERTIFICATE-----
MIIFtTCCA52gAwIBAgIJAP2G57czbbRTMA0GCSqGSIb3DQEBCwUAMFcxCzAJBgNV
...
-----END CERTIFICATE-----
private_key     	-----BEGIN RSA PRIVATE KEY-----
MIIEpQIBAAKCAQEAsFSmpBCFN945+Chyz/YqsB2a/T73kdst4v7qm2ZLK50RxCj0
...
-----END RSA PRIVATE KEY-----
private_key_type	rsa
serial_number   	03:f2:bb:f5:27:16:81:20:76:0d:91:6f:fd:10:05:2d:a6:e1:59:e3

The command provides a lot of information and I have removed some of it to not full a whole page with unreadable data. It provides you all the data you’ll need to create a service that needs ssl certificates. As you see, it provides the certificate and the private_key, but also the ca_chain.

API

Lets generate a SSL certificate via the API.

curl -XPOST -k -H 'X-Vault-Token: <_my_root_token_>' \
-d '{"common_name": "blog.dj-wasabi.local"}' \
https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/issue/dj-wasabi-dot-local

We do an POST, and as a minimum we only provide the common_name (In this case blog.dj-wasabi.local). We use the X-Vault-Token which in my case is the ROOT Token as a header and we post it to the url https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/issue/dj-wasabi-dot-local url. If you remember, the dj-wasabi-dot-local is the name of the role, so this role has the correct ttl etc.

Lets execute it and once the certificate is created, a lot of output is returned in json format:

curl -XPOST -k -H 'X-Vault-Token: <_my_root_token_>' \
-d '{"common_name": "blog.dj-wasabi.local"}' \
https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/issue/dj-wasabi-dot-local
{"request_id":"e1d0f686-d0d8-d1d8-d7ab-428c7322229b","lease_id":"","renewable":false,"lease_duration":0,"data":{"ca_chain":["-----BEGIN CERTIFICATE-----asas-----END CERTIFICATE-----","-----BEGIN CERTIFICATE----asas------END CERTIFICATE-----"],"certificate":"-----BEGIN CERTIFICATE-----asas-----END CERTIFICATE-----","issuing_ca":"-----BEGIN CERTIFICATE-----asas-----END CERTIFICATE-----","private_key":"-----BEGIN RSA PRIVATE KEY-----asas-----END RSA PRIVATE KEY-----","private_key_type":"rsa","serial_number":"11:42:ba:66:94:b4:c9:5c:e5:1a:77:da:76:2e:57:5d:b5:64:f5:c3"},"wrap_info":null,"warnings":null,"auth":null}

Again I removed a lot of unreadable data from the example. Again you’ll see the private_key, certificate and the ca_chain which can be used with a service like nginx.

Lets do an overview of all certificates stored in our Vault:

curl -XGET -H 'X-Vault-Token: <_my_root_token_>' \
--request LIST https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/certs 
{"request_id":"fb5e7060-0d02-211a-ae25-50507a334706","lease_id":"","renewable":false,"lease_duration":0,"data":{"keys":["03-f2-bb-f5-27-16-81-20-76-0d-91-6f-fd-10-05-2d-a6-e1-59-e3","11-42-ba-66-94-b4-c9-5c-e5-1a-77-da-76-2e-57-5d-b5-64-f5-c3"]},"wrap_info":null,"warnings":null,"auth":null}

We see that there are 2 certificates stored in the Vault, the “keys” has 2 values. These keys are the Serial Numbers of the certificates. We have to use this Serial Number if we want to revoke it or we just want to get the certificate. An example of getting the certificate:

curl -XGET -H 'X-Vault-Token: df80e726-d3f0-8344-3782-fec19fe7a745' \
https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/cert/11-42-ba-66-94-b4-c9-5c-e5-1a-77-da-76-2e-57-5d-b5-64-f5-c3
{"request_id":"ae6e63f9-c04e-ac4c-d8a8-254347284771","lease_id":"","renewable":false,"lease_duration":0,"data":{"certificate":"-----BEGIN CERTIFICATE-----asasas-----END CERTIFICATE-----\n","revocation_time":0},"wrap_info":null,"warnings":null,"auth":null}

Again I removed some data from the example. You can only get the certificate, not the private key. I’ve copied the contents of the certificate in a file called blog.dj-wasabi.local.crt on my Mac, so when I run the openssl x509 command, it will show some information about this certificate:

openssl x509 -in blog.dj-wasabi.local.crt -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            11:42:ba:66:94:b4:c9:5c:e5:1a:77:da:76:2e:57:5d:b5:64:f5:c3
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=NL, ST=Utrecht, O=dj-wasabi, OU=Vault CA, CN=dj-wasabi.local/emailAddress=ikben@werner-dijkerman.nl
        Validity
            Not Before: Aug 23 16:51:36 2017 GMT
            Not After : Aug 26 16:52:05 2017 GMT
        Subject: CN=blog.dj-wasabi.local
 ...
            Authority Information Access: 
                OCSP - URI:https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/ocsp
                CA Issuers - URI:https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/ca

            X509v3 Subject Alternative Name: 
                DNS:blog.dj-wasabi.local
            X509v3 CRL Distribution Points: 

                Full Name:
                  URI:https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/crl
 ...

The output shows that the certificate is only valid (Validity) for 3 days (72 hours). If you take a look at the “Authority Information Access”, you’ll see the urls (OCSP and the CA Issuers) we have set earlier. And a little bit further we see the CRL Distribution Points, an url we also have set with the set urls command.

Keep in mind: Only during the generation of the certificate, the private key is returned. If you did loose the private key, then revoke the certificate and generate a new one.

As last command in this blogpost we do a revoke of an certificate. We have to do an POST and sent the serial_number to the revoke endpoint.

curl -XPOST -k -H 'X-Vault-Token: <_my_root_token_>' \
-d '{"serial_number":"03-f2-bb-f5-27-16-81-20-76-0d-91-6f-fd-10-05-2d-a6-e1-59-e3"}' \
https://vault.service.dj-wasabi.local:8200/v1/dj-wasabi/revoke
{"request_id":"ea8a7132-231f-7075-f42b-f81b272cc9cd","lease_id":"","renewable":false,"lease_duration":0,"data":{"revocation_time":1503506236,"revocation_time_rfc3339":"2017-08-23T16:37:16.755130614Z"},"wrap_info":null,"warnings":null,"auth":null}

It returns a json output with a key named revocation_time. This is the time since epoch when the certificate is revoked, 0 if the certificate isn’t revoked.

So, that was it! Have fun!

Monitoring Consul with statsd exporter and Prometheus

My choice for using a monitoring tool is currently Prometheus. With Prometheus you can easily gather metrics of applications and/or databases to see the actual performance of the application/database. When you have a tool like Zabbix or Nagios, you’ll need to write one or multiple scripts to gather all metrics and see how much you can store in your database without loosing performance of your monitoring tool. About the why Prometheus and not doing this with Zabbix or other monitoring tool is an subject for maybe an other blogpost.

One interesting application to monitor is Consul. When you look for monitoring Consul in google, you’ll find a lot of pages that shows you that you can use Consul as a monitoring tool but not many on how you can monitor Consul itself. With this blogpost I’ll describe what steps I have taken to monitor Consul. Please keep in mind this is is just a start and it is incomplete, so if you have suggestions to improve it please let me know.

On this blogpost we will do the following actions:

  • Configure Consul
  • Configure statsd exporter
  • Create some graphs

Configure Consul

Consul has a way for exposing metrics, called Telemetry. With Telemetry you can configure Consul for sending performance metrics to external tools/applications to monitor the performance of Consul. You can see some more information about configuring Consul for Telemetry on this page https://www.consul.io/docs/agent/options.html#telemetry. With this blogpost we will use the “statsd_address” option. In order to make this happen, we have to update our Consul configuration on the Consul Servers to add the following configuration:

    "telemetry": {
        "statsd_address": "192.168.1.202:9125"
    },

The IP Address is from the host itself, and in this case we have to send it to port 9125. Once we have configured this on all the Consul Servers, we need to restart them one by one so we keep the Consul Cluster running.

Configure statsd-exporter

When you use Prometheus, you’ll use exporters for your applications or databases to expose the metrics for Prometheus. Prometheus will scrape these metrics every 15 seconds (Well, you can configure that) and store them in the database. Consul doesn’t have an endpoint available to gather these metrics, we have to make use of the “statsd-exporter”. We already configured the Consul Servers to send metrics to a statsd server, so we only have to make sure we start one on each host running Consul Server.

Before we start an statsd-exporter, we first have to do some configuration first. We need to make sure we have a statd mapper file. With this file we map statsd fields into fields for prometheus and we can add labels per metric. On this page I have configured almost all mapping entries: https://gist.github.com/dj-wasabi/d9b31c4b74e561c72512f4edbdfe6927

Lets explain how an entry looks like:

consul.*.runtime.*
name="consul_runtime"
type="$2"
host="{{ inventory_hostname }}"

The first line in this mapping construction is the name of the statsd field. You’ll see asteriks, these are wildcards and these can be used as a value by assiging it to a filter. First asteric can be used as $1, second as $2 etc. The “name” is the name of the metric field in Prometheus, in this case the name is consul_runtime. Prometheus doesn’t accept dots in the names, so we have to use underscores for this.

We then create a label named “type” and we assign the value $2. The original statsd field that Consul has sent to the statsd-exporter looks like this:

consul.b139924a6f44.runtime.num_goroutines

With this mapping construction, we assign $1 with value b139924a6f44 and $2 with value num_goroutines. The last “host” label is something I add with Ansible. I use Ansible to deploy this statsd mapper file (And all other monitoring related configuration) to all my Consul servers and then I can filter in Prometheus or other graphing tool like Grafana which metrics belongs to which host.

I use the Docker container for the statsd-exporter, I place the statsd mapper file on /data/statsd-exporter.conf and start the following command:

docker run --name statsd-exporter \
-v /data/statsd-exporter.conf:/tmp/statsd-exporter.conf:ro \
-p 9102:9102 -p 9125:9125/udp prom/statsd-exporter \
-statsd.mapping-config=/tmp/statsd-exporter.conf \
-statsd.add-suffix=false

I mount the statsd mapper file as ro (Read Only), open 2 ports and configure the statsd-exporter tool to use the mapper file. In this case 2 ports are openend. One port on which the statsd is available for retrieving performance metrics (9125) and the other port (9102) is used for Prometheus to scrape these metrics.

Prometheus

At this moment, I have added the following into the Prometheus configuration to let Prometheus scrape the statsd-exporter metrics:

scrape_configs:
  - job_name: 'consul'
    static_configs:
      - targets: ['192.168.1.202:9102']
        labels:  {'host': 'vserver-202'}
      - targets: ['192.168.1.203:9102']
        labels:  {'host': 'vserver-203'}
      - targets: ['192.168.1.204:9102']
        labels:  {'host': 'vserver-204'}

This works for now because I Ansible to generate a Prometheus configuration, but I’ll go probably using a consul_sd_config in the near future so I won’t have to add all kinds of static configuration.

Once we have restarted Prometheus and started the statsd-exporter containers, I can see the following metrics appear in Prometheus:

consul_runtime{host="vserver-204",type="free_count"} 2.3117552e+08
consul_runtime{host="vserver-204",type="heap_objects"} 22853
consul_runtime{host="vserver-204",type="num_goroutines”} 82

(And much more, but the above 3 are examples which are used as an explanation in the previous paragraphs.)

Create some graphs

Now we have the metrics in Prometheus, but now we need to create some graphs. We use Grafana for this. Grafana can be used for creating Graphs to show the actual performance of Consul. I’ve created a Dashboard and uploaded it to grafana.com: https://grafana.com/dashboards/2351

Grafana Dashboard for Consul

Some of the following can be found on the dashboard:

  • Who is the Consul Leader;
  • How many Consul Servers are running?
  • Some CPU idle utilisation and load information (You’ll need the node-exporter for this);
  • Performance of writing information on the Consul leader to disk or the other nodes;
  • etc

This dashboard is not finished yet and is a mixed combination of Consul Leader data and Consul Server specific. So some graphs shows information specific to the selected Consul server (Dropdown at the top of the page) and some graphs show specific data for the Consul Leader.

If you have suggestions to improve the current situation, by either suggestion a better statsd mapper configuration file or for the Dashboard, please let me know so I can improve it. I hope we can all benefit from each other to improve the availability and performance of Consul with this.

Setting up a secure Vault with a Consul backend

vault_logo

With this blogpost we continue working with a secure Consul environment: We are configuring a secure Vault setup with Consul as backend. YMMV, but this is what I needed to configure to make it work.

Environment

We should have an working Consul Cluster environment. If you don’t have one, please take a look at here for creating one. With this blogpost we expect a secure Consul cluster with SSL certificates and using ACL’s.

In this blogpost we make use of the wdijkerman/vault container. This container is created by myself and is running Vault (At moment of writing release 0.6.4) on Alpine (running on 3.5). Vault is running as user ‘vault’ and the container can be configured to use SSL certificates.

prerequisites

We have to create SSL certificates for the vault service. In this blogpost we use the domain ‘dj-wasabi.local’, as Consul is already running with this domain configuration so we have to create ssl certificates for the FQDN: ‘vault.service.dj-wasabi.local’.

On my host where my OpenSSL CA configuration is stored, I execute the following commands:

openssl genrsa -out private/vault.service.dj-wasabi.local.key 4096

Generate the key.

openssl req -new -extensions usr_cert -sha256 -subj "/C=NL/ST=Utrecht/L=Nieuwegin/O=dj-wasabi/CN=vault.service.dj-wasabi.local" -key private/vault.service.dj-wasabi.local.key -out csr/vault.service.dj-wasabi.local.csr

Create a signing request file and then sign it with the CA.

openssl ca -batch -config /etc/pki/tls/openssl.cnf -notext -in csr/vault.service.dj-wasabi.local.csr -out certs/vault.service.dj-wasabi.local.crt

We copy the ‘vault.service.dj-wasabi.local.key’, ‘vault.service.dj-wasabi.local.crt’ and the caroot certificate file to the hosts which will be running the Vault container into the directory /data/vault/ssl. Hashicorp advises to run vault on hosts where Consul Agents are running, not Consul Servers. This has probably todo with that for most use cases they see is that Consul is part of large networks and thus the servers will handle a lot of request (High load). As the Consul Servers will be very busy, it would then be wise to not run anything else on those servers.

But this is my own versy small environment (With 10 machines) so I will run Vault on the hosts running the Consul Server.

ACL

Before we do anything on these hosts, we create a ACL in Consul. We have to make sure that Vault can create keys in the key/value store and we have to allow that Vault may create a service in Consul named vault.

So our (Client) ACL will look like this:

key "vault/" {
  policy = "write"
}
service "vault" {
  policy = "write"
}

We use this in the ui on the Consul Server and create the ACL. In my case, the ACL is created with id ’94c507b4-6be8-9132-ea15-3fc5b196ea29′. This ID is needed later on when we configure Vault. Also check your ACL for the ‘Anonymous token’. Please make sure you have set the following rule if the Consul default policy is set to deny:

service "vault" {
  policy = "read"
}

With this, we make sure the service is resolvable via dns. In my case this is for ‘vault.service.dj-wasabi.local’.

Configuration

We have to configure the vault docker container. We have to create a directory that will be mounted in the container. First we have to create an user on the host and then we create the directory: /data/vault/config and own it to the just created user.

useradd -u 994 vault
mkdir /data/vault/config
chown vault:vault /data/vault/config

The container is using a user named vault and has UID 994 and we have to make sure that everything is in sync with names and id. Now we create a config.hcl file in the earlier mentioned directory:

backend "consul" {
  address = "vserver-202.dc1.dj-wasabi.local:8500"
  check_timeout = "5s"
  path = "vault/"
  token = "94c507b4-6be8-9132-ea15-3fc5b196ea29"
  scheme = "https"
  tls_skip_verify = 0
  tls_key_file = "/vault/ssl/vault.service.dj-wasabi.local.key"
  tls_cert_file = "/vault/ssl/vault.service.dj-wasabi.local.crt”
  tls_ca_file = "/vault/ssl/dj-wasabi.local.pem"
}

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = 0
  tls_key_file = "/vault/ssl/vault.service.dj-wasabi.local.key"
  tls_cert_file = "/vault/ssl/vault.service.dj-wasabi.local.crt"
  cluster_address = "0.0.0.0:8201"
}

disable_mlock = false

First we configure a backend for Vault. As we use Consul, we use the Consul backend. Because the Consul is running on https and is using certificates, we have to use the fqdn of the Consul node as the address (same as how we did in configuring Registratror in this post). We also have to configure the options ‘tls_key_file’, ‘tls_cert_file’ and ‘tls_ca_file’, these are the ssl certificates needed for accessing the secure Consul via SSL. Because of this, we have to set the ‘scheme’ to ‘https’ and we have to specify the token for the ACL we created earlier and add the value to the the token option.

Next we configure the listener for Vault. We configure the listener that it listens on all ips on port 8200. We also make sure we configure the earlier created SSL certificates by using them in the ‘tls_key_file’ and ‘tls_cert_file’ options.

The last option is to make sure that Vault can not swap data to the local disk.

Starting Vault

Now we are ready to start the docker container. We use the following command for this:

docker run -d -h vserver-202 --name vault \
--dns=172.17.0.2 --dns-search=service.dj-wasabi.local \
--cap-add IPC_LOCK -p 8200:8200 -p 8201:8201 \
-v /data/vault/ssl:/vault/ssl:ro \
-e VAULT_ADDR=https://vault.service.dj-wasabi.local:8200 \
-e VAULT_CLUSTER_ADDR=https://192.168.1.202:8200 \
-e VAULT_REDIRECT_ADDR=https://192.168.1.202:8200 \
-e VAULT_ADVERTISE_ADDR=https://192.168.1.202:8200 \
-e VAULT_CACERT=/vault/ssl/dj-wasabi.local.pem \
wdijkerman/vault

We have the SSL certificates stored in the /data/vault/ssl and we mount these as read only on /vault/ssl. With the VAULT_ADDR we specifiy on which url the vault service is available on, this is the url which Consul provides like any other server. With the VAULT_CACERT we specify on which location the CA Certificate file of our domain. The other 3 environment variables are needed for a High Available Vault environment and is to make sure how other vault instances can contact it.

When Vault is started, we will see something like this with the docker logs vault command:

==> Vault server configuration:

Backend: consul (HA available)
Cgo: disabled
Cluster Address: https://192.168.1.202:8200
Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", tls: "enabled")
Log Level: info
Mlock: supported: true, enabled: true
Redirect Address: https://192.168.1.202:8200
Version: Vault v0.6.4
Version Sha: f4adc7fa960ed8e828f94bc6785bcdbae8d1b263

==> Vault server started! Log data will stream in below:

But where are not done yet. When Vault is started, it is in a sealed state and because this is the first vault in the cluster we have to initialise it to. Also when you check the ui of Consul, you’ll see that the vault is in an error state. Why? When Vault starts, it automatically creates a service in Consul and add health checks. These health checks will check if a vault instance is sealed or not.

Initialise

As vault is running in the container, we open a terminal to the container:

docker exec -it vault bash

Now we have a bash shell running and we going to initialise vault. First we have to make sure we set the ‘VAULT_ADDR’ to this container, by executing the following command:

export VAULT_ADDR='https://127.0.0.1:8200'

Every time we want to do something with the vault instance, we have to set the ‘VAULT_ADDR’ to localhost. If we won’t do that, we will send the commands directly against the cluster.

As this is the first vault instance in the environment, we have to initialise it and we do that by executing the following command:

vault init -tls-skip-verify
Unseal Key 1: hemsIyJD+KQSWtKp0fQ0r109fOv8TUBnugGUKVl5zjAB
Unseal Key 2: lIiIaKI1F6pJ11Jw/g1CiLyZurpfhCM9AYIylrG/SKUC
Unseal Key 3: 298bn4H8bLbJRsPASOl3R+RPuDKIt6i5fYzqxQ3wL4ED
Unseal Key 4: W4RUiOU3IzQSZ8GD2z8jBEg2wK/q17ldr3zJipFjzKQE
Unseal Key 5: FNPHf8b+WCiS9lAzbdsWyxDgwic95DLZ03IR2S0sq4AF
Initial Root Token: ed220674-24da-d446-375d-bbd0334bcb31

Vault initialized with 5 keys and a key threshold of 3. Please
securely distribute the above keys. When the Vault is re-sealed,
restarted, or stopped, you must provide at least 3 of these keys
to unseal it again.

Vault does not store the master key. Without at least 3 keys,
your Vault will remain permanently sealed.

As we set the ‘VAULT_ADDR’ to ‘https://127.0.0.1:8200&#8217;, we have to add the ‘-tls-skip-verify’ option to the vault command. If we don’t do that, it will complain the it can not validate the certificate that matches the configured url ‘vault.service.dj-wasabi.local.

After executing the command, we see some output appear. This output is very important and needs to be saved somewhere on a secure location. The output provides us 5 unseal keys and the root token. Every time a vault instance is (re)started, the instance will be in a sealed state and needs to be unsealed. 3 of the 5 tokens needs to be used when you need to unseal a vault instance.

bash-4.3$ vault unseal -tls-skip-verify
Key (will be hidden):
Sealed: true
Key Shares: 5
Key Threshold: 3
Unseal Progress: 1
bash-4.3$ vault unseal -tls-skip-verify
Key (will be hidden):
Sealed: true
Key Shares: 5
Key Threshold: 3
Unseal Progress: 2
bash-4.3$ vault unseal -tls-skip-verify
Key (will be hidden):
Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0

We have executed 3 times the unseal command and now this Vault instance is unsealed. You can see the ‘Unseal Progress’ changing after we enter an unseal key. We can verify that state of the vault instance by executing the vault status command:

bash-4.3$ vault status -tls-skip-verify
Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0
Version: 0.6.4
Cluster Name: vault-cluster-7e01e371
Cluster ID: b9446acf-4551-e4c2-fa5f-03bd1bcf872f

High-Availability Enabled: true
Mode: active
Leader: https://192.168.1.202:8200

We see that this vault instance is not sealed and that the mode of this node is active. You can also see that the leader of the vault instance is in my case the current host. (Not strange as this is the first Vault instance of the environment.) If we want to add a 2nd and more, we have to execute the same commands as before. With the exception of the vault init command, as we already have an initialised environment.

As we are still logged in on the node, lets create a simple entry.

bash-4.3$ export VAULT_TOKEN=ed220674-24da-d446-375d-bbd0334bcb31
bash-4.3$ vault write secret/password value=secret
Success! Data written to: secret/password

We first set the ‘VAULT_TOKEN’ variable, this value of this variable is the value of the ‘Initial root token’. After that, we created a simple entry in the database. Key ‘secret/password’ is created and had the value ‘secret’.

It took some time to investigate how to setup a High Available Vault environment with Consul, not much information can be found on the internet. So maybe this page will help you setting one up yourself. If you do have improvements please let me know.