Multus networking integration with Ceph (not finalized yet and subject to update)

We have already explored and explained the benefit of multi-homed networking, so this document will not rehearse that but simply focus on the implementation for the Ceph backend. If you are interested in learning more about multi-homed networking you can read the design documentation on that matter.

To make the story short, Multus should allow us to get the same performance benefit as HostNetworking as well as increasing the security. Using HostNetworking results in exposing all the network interfaces (the entire stack) of the host inside the container where Multus allows you to pick the one you want. Also, this minimizes the need of privileged containers (required for HostNetworking).

Proposed CRD changed

We already have a network CRD property, which looks like:

network:
  provider:
  selectors:

We will expand the selectors with the following two hardcoded keys:

selectors:
  public:
  cluster:

Each selector represents a NetworkAttachmentDefinition object in Multus. At least, one must be provided and by default they will represent:

public: data daemon public network (binds to public_network Ceph option).
cluster: data daemon cluster network (binds to cluster_network Ceph option).

If only public is set then cluster will take the value of public.

Multus supported configuration

Interface type

As part of the CNI spec, Multus supports several interface types. Rook will naturally support any of them as they don't fundamentally change the working behavior.

IPAM type

This is where things get more complex. Currently there are three different IPAM solutions available.

As part of our research we have found that the following IPAM types are not good candidates:

host-local: Maintains a local database of allocated IPs, this only works on a per host basis, so not suitable for a distributed environment since we will end up with IP collision To fix this, the whereabouts project looks promising but is not officially supported.
static: Allocate a static IPv4/IPv6 addresses to container and it's useful in debugging purpose. This cannot at scale because this means we will have to allocate IPs for all the daemon so it's not scalable.

You can find a more detailed analysis at the end of the document in the rejected proposal section.

Ceph daemons implementation challenges

Monitors and OSDs

The Ceph monitors only need access to the public network. The OSDs needs access to both public and cluster networks.

Monitors requirements so far are the following:

Predictable IP addresses
Keep the same IP address for the life time of the deployment (IP should survive a restart)

RGW implementation

Only need access to the public network. They use service IP with a load-balancer so we need to be careful.

MDS/RBD-MIRROR/NFS implementation

Only need access to the public network. Nothing to do in particular since they don't use any service IPs.

CSI pods

Steps must be taken to fix a CSI-with-multus issue documented here. To summarize the issue: when a CephFS/RBD volume is mounted in a pod using Ceph CSI and then the CSI CephFS/RBD plugin is restarted or terminated (e.g. by restarting or deleting its DaemonSet), all operations on the volume become blocked, even after restarting the CSI pods. The only workaround is to restart the node where the Ceph CSI plugin pod was restarted.

When deploying a CephCluster resource configured to use multus networks, a multus-connected network interface will be added to the host network namespace for all nodes that will run CSI plugin pods. This will allow Ceph CSI pods to run using host networking and still access Ceph's public multus network.

The design for mitigating the issue is to add a new DaemonSet that will own the network for all CephFS mounts as well as RBD mapped devices. The csi-{cephfs,rbd}plugin DaemonSet are left untouched.

New "holder" DaemonSet

The Rook-Ceph Operator's CSI controller creates a csi-plugin-holder DaemonSet configured to use the network.selectors.public network specified for the CephCluster. This DaemonSet runs on all the nodes along side the csi-{cephfs,rbd}plugin DaemonSet.

This Pod should only be stopped and restarted when a node is stopped so that volume operations do not become blocked. The Rook-Ceph Operator's CSI controller should set the DaemonSet's update strategy to OnDelete so that the pods do not get deleted if the DaemonSet is updated while also ensuring that the pods will be updated on the next node reboot (or node drain).

The new holder DaemonSet only contains a single container called holder, the container responsible for pinning the network for filesystem mounts and mapped block devices. It is used as a passthrough by the Ceph-CSI plugin pod which when mounting or mapping will use the network namespace of that holder pod.

One task of the Holder container is to pass the required PID into the csi-plugin pod. To solve this, the container will create a symlink to an equivalent of /proc/$$/ns/net in the hostPath volume shared with the csi-plugin pod. Then it will sleep forever. In order to support multiple Ceph-CSI consuming multiple Ceph clusters, the name of the symlink should be based on the Ceph cluster FSID. The complete name of the symlink can be decided during the implementation phase of the design. That symlink name will be stored in the Ceph-CSI configmap with a key name that we will define during the implementation.

Interactions between components

When the Ceph-CSI plugin mounts or maps a block device it will use the network namespace of the Holder pod. This is achieved by using nsenter to enter the network namespace of the Holder pod. Some changes are expected as part of the mounting/mapping method in today's csi-plugin pod. Instead of invoking rbd map ..., invoke nsenter --net=<net ns file of pause in long-running pod> rbd map ... (and same for mount -t ceph ...). The csi-plugin pod is already privileged, it already adds CAP_SYS_ADMIN. The only missing piece is to expose the host PID namespace with hostPID: true on the csi-cephfsplugin DaemonSet (RBD already has it).

Cleaning up the symlink is not a requirement nor a big issue. Once the Holder container stops, its network namespace goes away with it. So at worst, we end up with a broken symlink.

When a new node is added, both DaemonSets are started, and the process described above occurs on the node.

The initial implementation of this design is limited to supporting a single CephCluster with Multus. Until we stop using HostNetwork entirely it is impossible to support multiple CephClusters with and without Multus enabled.

Accepted proposal

So far, the team has decided to go with the whereabouts IPAM. It's an IP Address Management (IPAM) CNI plugin that assigns IP addresses cluster-wide. If you need a way to assign IP addresses dynamically across your cluster -- Whereabouts is the tool for you. If you've found that you like how the host-local CNI plugin works, but, you need something that works across all the nodes in your cluster (host-local only knows how to assign IPs to pods on the same node) -- Whereabouts is just what you're looking for. Whereabouts can be used for both IPv4 & IPv6 addressing.

It is under active development and is not ready but ultimately will allow to:

Have static IP addresses distributed across the cluster
These IPs will survive any deployment restart
The allocation will be done by whereabouts and not by Rook

The only thing that is not solved yet is how can we predict the IPs for the upcoming monitors? We might need to rework the way we bootstrap the monitors a little bit to not require to know the IP in advance.

Rejected proposals

The following proposal were rejected but we keep them here for traceability and knowledge.

IPAM type 'DHCP'

In this scenario, a DHCP server will distribute an IP address to a pod using a given range.

Pros:

Pods will get a dedicated IP on a physical network interface
No changes required in Ceph, Rook will detect the CIDR via the NetworkAttachmentDefinition then populate Ceph's flag public_network and cluster_network

Cons:

IP allocation not predictable, we don't know it until the pod is up and running. So the detection must append inside the monitor container is running, similarly to what the OSD code does today.
Requires drastic changes in the monitor bootstrap code
This adds a DHCP daemon on every part of the cluster and this has proven to be troublesome

Assuming we go with this solution, we might need to change the way the monitors are bootstrapped:

let the monitor discovered its own IP based on an interface
once the first mon is bootstrapped, we register its IP in a ConfigMap as well as populating clusterInfo
boot the second mon, look up in clusterInfo for the initial member (if the op dies in the process, we always CreateOrLoadClusterInfo() as each startup based on the cm so no worries)
go on and on with the rest of the monitors

TBT: if the pod restarts, it keeps the same IP.

IPAM type 'DHCP' with service IP

We could and this is pure theory at this point use a IPAM with DHCP along with service IP. This would require interacting with Kubeproxy and there is no such feature yet. Even if it was there, we decided not to go with DHCP so this is not relevant.

multus-network.md 9.8 KB Permalink History Raw