!!! note
    Deprecated in Rook v1.11 due to lack of usage and maintainership.
Openshift uses Machines and MachineSets from the cluster-api to dynamically provision nodes. Fencing is a remediation method that reboots or deletes Machine CRDs to solve problems with automatically provisioned nodes. Once the MachineHealthCheck controller detects that a node is NotReady (or meets some other configured condition), it will remove the associated Machine, which will cause the node to be deleted. The MachineSet controller will then replace the Machine via the machine-api. The exception is baremetal platforms, where fencing reboots the underlying BareMetalHost object instead of deleting the Machine.
Why not a PodDisruptionBudget? Fencing does not use the eviction API; it operates on Machines, not Pods.
Will fencing cause data loss? Hopefully not. On cloud platforms, the OSDs can be rescheduled on new nodes along with their backing PVs, and on baremetal, where the local PVs are tied to a node, fencing will simply reboot the node instead of destroying it.
We need to ensure that only one node can be fenced at a time and that Ceph is fully recovered (all PGs are active+clean) before any fencing is initiated. The available pattern for limiting fencing is the MachineDisruptionBudget (MDB), which allows us to specify maxUnavailable. However, this alone is not sufficient to ensure that Ceph has recovered before fencing is initiated, as MachineHealthCheck checks nothing other than the node state.
Therefore, we will control how many nodes match the MDB by dynamically adding and removing labels as well as dynamically updating the MDB. By manipulating the MDB into a state where desiredHealthy > currentHealthy, we can disable fencing on the nodes the MDB points to.
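As a rough illustration of that gate, here is a minimal sketch in Go, assuming the MDB follows PodDisruptionBudget semantics, where desiredHealthy is derived from the number of selected Machines minus maxUnavailable (these semantics are an assumption, not a confirmed detail of the MDB implementation):

```go
package controllers

// fencingAllowed sketches the assumed MachineDisruptionBudget arithmetic:
// fencing one more Machine is permitted only while doing so keeps at least
// desiredHealthy Machines healthy.
func fencingAllowed(totalMachines, currentHealthy, maxUnavailable int) bool {
	desiredHealthy := totalMachines - maxUnavailable

	// With maxUnavailable = 0, desiredHealthy == totalMachines, so
	// currentHealthy-1 < desiredHealthy always holds and fencing is blocked.
	// With maxUnavailable = 1, at most one Machine may be fenced at a time.
	return currentHealthy-1 >= desiredHealthy
}
```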
We will implement two controllers, the machinedisruptionbudget-controller and the machine-controller, following the controller pattern described here. Each controller watches a set of object kinds and reconciles one of them. The bottom line is that fencing is blocked if the PG state is not active+clean, but fencing proceeds as usual on Machines that lack the label indicating OSD resources are running there.
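The wiring might look like the following sketch (illustrative, not Rook's actual code): both reconcilers are registered with a single controller-runtime manager and watch the OpenShift CRDs as unstructured objects. The MachineDisruptionBudget group/version here is an assumption and may differ across OpenShift releases; the reconciler bodies are elided and sketched in the sections below.

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Stub reconcilers; their Reconcile logic is sketched in the sections below.
type machineReconciler struct{ client client.Client }
type mdbReconciler struct{ client client.Client }

func (r *machineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func (r *mdbReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// addControllers registers both controllers with a controller-runtime manager.
func addControllers(mgr ctrl.Manager) error {
	machine := &unstructured.Unstructured{}
	machine.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "machine.openshift.io", Version: "v1beta1", Kind: "Machine",
	})
	if err := ctrl.NewControllerManagedBy(mgr).
		For(machine).
		Complete(&machineReconciler{client: mgr.GetClient()}); err != nil {
		return err
	}

	// Assumed GVK for the MachineDisruptionBudget CRD.
	mdb := &unstructured.Unstructured{}
	mdb.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "healthchecking.openshift.io", Version: "v1alpha1", Kind: "MachineDisruptionBudget",
	})
	return ctrl.NewControllerManagedBy(mgr).
		For(mdb).
		Complete(&mdbReconciler{client: mgr.GetClient()})
}
```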
This controller watches Ceph PGs and CephClusters. We will ensure the reconciler is enqueued every 60s. It ensures that each CephCluster has an MDB created, and that the MDB's value of maxUnavailable reflects the health of the Ceph cluster's PGs: if all PGs are active+clean, maxUnavailable = 1; otherwise, maxUnavailable = 0.
We can share a Ceph health cache with the other controller-runtime reconcilers that have to watch PG cleanliness.
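A minimal sketch of such a shared cache, with illustrative names: a writer records the latest observed PG state, and reconcilers read it without querying Ceph directly.

```go
package controllers

import (
	"sync"
	"time"
)

// pgHealthCache holds the last observed PG state so that multiple
// reconcilers can share one Ceph health check. Names are illustrative.
type pgHealthCache struct {
	mu          sync.RWMutex
	pgsClean    bool
	lastChecked time.Time
}

// Set records the latest PG cleanliness observation.
func (c *pgHealthCache) Set(clean bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pgsClean, c.lastChecked = clean, time.Now()
}

// Clean reports whether all PGs were active+clean at the last check.
func (c *pgHealthCache) Clean() bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.pgsClean
}
```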
The MDB will target Machines selected by a label maintained by the machine-controller. The label is fencegroup.rook.io/<cluster-name>.
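A sketch of the reconcile step under these assumptions (spec.maxUnavailable and a PodDisruptionBudget-style spec.selector are assumed field paths, and pgsClean would come from the shared health cache above):

```go
package controllers

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileMDB mirrors Ceph PG health into the MDB: one Machine may be
// fenced while all PGs are active+clean, none otherwise.
func reconcileMDB(ctx context.Context, c client.Client, mdb *unstructured.Unstructured, clusterName string, pgsClean bool) (ctrl.Result, error) {
	maxUnavailable := int64(0)
	if pgsClean {
		maxUnavailable = 1 // Ceph is fully recovered; allow one fence at a time.
	}
	if err := unstructured.SetNestedField(mdb.Object, maxUnavailable, "spec", "maxUnavailable"); err != nil {
		return ctrl.Result{}, err
	}

	// Target only the Machines labeled by the machine-controller.
	if err := unstructured.SetNestedStringMap(mdb.Object, map[string]string{
		"fencegroup.rook.io/" + clusterName: "",
	}, "spec", "selector", "matchLabels"); err != nil {
		return ctrl.Result{}, err
	}

	if err := c.Update(ctx, mdb); err != nil {
		return ctrl.Result{}, err
	}

	// Re-enqueue unconditionally so the PG state is re-checked every 60s.
	return ctrl.Result{RequeueAfter: 60 * time.Second}, nil
}
```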
This controller watches OSDs and Machines. It ensures that each Machine with running OSDs from a CephCluster has the label fencegroup.rook.io/<cluster-name>, and that Machines without running OSDs do not have the label. This ensures that no Machine without running OSDs is protected by the MDB.
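A sketch of the label reconciliation, where hasRunningOSDs is a hypothetical input that a real implementation would derive from the OSD Deployments scheduled on the Machine's node:

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileFenceLabel adds the fencegroup label to a Machine that hosts
// running OSDs of the given CephCluster and strips it from one that does
// not, so the MDB only protects Machines that actually carry OSDs.
func reconcileFenceLabel(ctx context.Context, c client.Client, machine *unstructured.Unstructured, clusterName string, hasRunningOSDs bool) error {
	label := "fencegroup.rook.io/" + clusterName

	labels := machine.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	_, labeled := labels[label]

	switch {
	case hasRunningOSDs && !labeled:
		labels[label] = ""
	case !hasRunningOSDs && labeled:
		delete(labels, label)
	default:
		return nil // Already in the desired state; nothing to update.
	}

	machine.SetLabels(labels)
	return c.Update(ctx, machine)
}
```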
Two scenarios need to be considered:

- Node needs to be fenced, and the OSDs on the node are down too.
- Node needs to be fenced, but the OSDs on the node are up.