Merging YAML and JSON Documents

This article shows how to merge documents programmatically in YAML and, or JSON formats.

Table of Contents

Motivation and Introduction

Recently, I had to merge two Kubernetes configs in YAML format together. I did this manually. But at some point things turned out not only to be error-prone but also to be cumbersome. This brings up the following research questions:

  1. What does it take to merge two YAML files programmatically?
  2. Can this be done with JSON files as well?

I will demonstrate how to achieve this goal by merging new entries from another YAML file into an existing ~/.kube/config (in YAML format) file.

When merging documents, one also has to make some decisions when it comes to duplicate entries in the documents. Should they be kept? Should they be appended to existing lists? Some tools offer more granular approaches on how to deal with these situations.

But before we start working with YAML and JSON files, let’s have a look at some nice Shell features: here documents and process substitution. Both will help us to reduce the number of temporary files.

Here Documents

Here Documents are special code blocks: they use some form of I/O redirection to feed lines via stdin to interactive programs. Let’s try this out:

cat <<EOF
one
two
  three
EOF
one
two
  three

As cat reads from stdin the contents of the here document will be written to stdout. Note the third line: tabs in the beginning of lines will be kept. As it is sometimes helpful, there are ways to get rid of them:

cat <<-EOF
	one two
	three four
	five six
	EOF
one two
three four
five six

Process Substitution

Shells allow for piping stdout from one command to the next. What if you need to pipe the stdout of multiple commands? Or what if a command only accepts files as input? Process substitution to the rescue: it provides a temporary path to given commands. By reading from this path, the output of the substituted process will be provided to the reading process. Another example:

echo <(/bin/true)
/dev/fd/63

As you can see, the result of the <() process substitution is the path to a file which contains stdout of the program execution when being read. Let’s now provide the diff command with two such temporary files:

diff -Nau <(cat <<-EOF
	{"a": "b"}
	EOF
) <(cat <<-EOF
	{"c": "d"}
	EOF
)
echo
--- /dev/fd/63	2022-11-15 10:09:07.641339172 +0100
+++ /dev/fd/61	2022-11-15 10:09:07.641339172 +0100
@@ -1 +1 @@
-{"a": "b"}
+{"c": "d"}

This is where here documents come in handy: multiple lines of input can now be provided easily to other commands. Please note that the invocation of echo solely serves the purpose of adding a trailing newline character to clear any buffering. Now we have all that’s needed to merge YAML and JSON documents!

Merging two JSON documents

Multiple JSON documents can be merged easily with the well-known jq utility:

jq -n --argfile o1 <file1> --argfile o2 <file2>

Let’s try it out with process substitution and here documents:

We are looking for the multiplication operator, which the jq documentation describes as follows:

Multiplying two objects will merge them recursively: this works like addition but if both objects contain a value for the same key, and the values are objects, the two are merged with the same strategy.

If two or more objects share the same key and if that key refers to a scalar or array, then the later objects in the input will overwrite the value (source: StackOverflow).

jq -n --argfile o1 <(cat <<-EOF
	{"a": "b"}
	EOF
) --argfile o2 <(cat <<-EOF
	{"a": "e", "c": "d"}
	EOF
) '$o1 * $o2'
{
  "a": "e",
  "c": "d"
}

The result shows how jq deals with duplicate keys, a in this case: as this key exists in both documents, the latter has precedence, i.e. the mapping of "a": "e" can be found in the output.

Alternatively, multiple JSON documents can be passed using the --slurp or the --null-input flags:

echo '{"a": ["b"]}' '{"a": ["e"], "c": "d"}' | jq --slurp '.[0] * .[1]'
{
  "a": [
    "e"
  ],
  "c": "d"
}

When using the --null-input or -n flag, an iterable called inputs will be returned. It can be processed with functions such as reduce:

echo '{"a": ["b"]}' '{"a": ["e"], "c": "d"}' | jq --null-input 'reduce inputs as $item ({}; . + $item)'
{
  "a": [
    "e"
  ],
  "c": "d"
}

Let’s discuss the reduce cal for a bit: reduce inputs as $item ({}; . + $item). Here, all input documents are provided in a variable called inputs. We iterate over it and assign each document to the local variable $item. For each $item we append to the root of the resulting object: . + $item. The empty curly braces {} refer to the starting object, which is empty in our case.

Merging two YAML documents

Let’s head over to the YAML format. Here, the yq utility will help us to achieve what we want:

yq eval-all '. as $item ireduce ({}; . * $item)' <(cat <<EOF
---
a: [1, 2]
b: foo
EOF
) <(cat <<EOF
---
a: [3, 4]
c: bar
EOF
)
---
a: [3, 4]
b: foo
c: bar

The behavior is similar to what we have seen with jq. However, with yq one has more granular control over how values shall be merged. The *+ operator can be used to append list values (source: StackOverflow), e.g.

yq eval-all '. as $item ireduce ({}; . *+ $item)' <(cat <<EOF
---
a: [1, 2]
b: foo
EOF
) <(cat <<EOF
---
a: [3, 4]
c: bar
EOF
)
---
a: [1, 2, 3, 4]
b: foo
c: bar

Merging Kubernetes Configs

Merging two Kubernetes config objects is not as straight forward as it may seem at first sight. Let’s first have a look at the document structure:

---
apiVersion: v1
clusters:
  - cluster:
      certificate-authority-data: ...
      server: https://192.168.56.10:6443
    name: kubernetes
contexts:
  - context:
      cluster: kubernetes
      user: kubernetes-admin
    name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
  - name: kubernetes-admin
    user:
      client-certificate-data: ...
      client-key-data: ...

These entries look quite generic. Some of these, I want to rename before merging them into my ~/.kube/config. In addition to that, each of the three (clusters, contexts, and users) contain a list of these respective elements. As yq can only sum up elements of lists or replace them completely (in contrast to: element wise), we first have to remove elements that we don’t want to find in the result.

First, let’s rename elements in the file that should be merged into the existing ~/.kube/config:

1PARTIAL_CONFIG="/tmp/admin.conf"
2
3yq eval-all '.clusters[0].name = "k8s-cka"
4		| .contexts[0].context.cluster = "k8s-cka"
5		| .contexts[0].context.user = "k8s-cka-admin@k8s-cka"
6		| .contexts[0].name = "k8s-cka-admin@k8s-cka"
7		| .users[0].name = "k8s-cka-admin"' \
8	"${PARTIAL_CONFIG}"

Nothing is written anywhere except for stdout. Later on, we will use this output with process substitution and write our changes inline into the Kubernetes config. Now, all elements have a telling name. As mentioned before, we have to remove entries identified by these names from the original first:

1ORIGINAL_CONFIG="/tmp/config" # <=== adjust this path (original Kubernetes config)
2
3yq eval-all 'del(.clusters[] | select(.name == "k8s-cka"))
4	| del(.contexts[] | select(.context.cluster == "k8s-cka"))
5	| del(.users[] | select(.name == "k8s-cka-admin"))' "${ORIGINAL_CONFIG}"

Note that the del operator allows to specify a regular filter expression. Here, we combine three invocations of del() to get rid of the three entries we will merge in from our second file but it should also be possible to get this done in a single call go del(). We can now combine the two steps:

  1. Removing the existing entries
  2. Merge the new entries
    • But before that, rename them
 1ORIGINAL_CONFIG="/tmp/config" # <=== adjust this path (original Kubernetes config)
 2PARTIAL_CONFIG="/tmp/admin.conf"
 3
 4# first, remove existing entries in-place
 5yq eval-all -i 'del(.clusters[] | select(.name == "k8s-cka"))
 6	| del(.contexts[] | select(.context.cluster == "k8s-cka"))
 7	| del(.users[] | select(.name == "k8s-cka-admin"))' "${ORIGINAL_CONFIG}"
 8
 9# then, add new entries in-place, but rename them beforehand
10yq eval-all -i ". as \$item ireduce ({}; . *+ \$item) | .current-context = \"k8s-cka\"" \
11	"${ORIGINAL_CONFIG}" \
12	<(yq eval '.clusters[0].name = "k8s-cka"
13			| .contexts[0].context.cluster = "k8s-cka"
14			| .contexts[0].context.user = "k8s-cka-admin@k8s-cka"
15			| .contexts[0].name = "k8s-cka-admin@k8s-cka"
16			| .users[0].name = "k8s-cka-admin"' \
17		"${PARTIAL_CONFIG}")
18
19# show the result
20yq eval-all -P "${ORIGINAL_CONFIG}"

Et voila: we have reached our goal. Addtional configurations can now be added programmatically into our existing Kubernetes configuration. We instruct yq to write its changes directly into the original config by using the -i flag.

Alternative approach

Instead of merging configurations, an alternative approach would be to work with multiple Kubernetes configurations and direnv. In order to supply kubectl with the desired configuration, one can use the KUBECONFIG environment variable. Using direnv the value of this variable can be set when entering a given directory.

This way it is also easier to not confuse different clusters because one will only “see” one of them at a time - depending on the current working directory.

Conclusion

Merging and transforming documents such as YAML and JSON documents takes some practice. Luckily, required tools, jq and yq, are very well documented by their authors. Furthermore, loads of examples and questions can be found on StackOverflow.

With this knowledge at hand, I can enrich my personal toolbox with shortcuts to simplify my daily live as a DevOps engineer.