Faster RBD/CephFS RWO recovery in case of node loss.
For RBD RWO recovery:
When a node is lost where a pod is running with an RBD RWO volume mounted, the volume cannot automatically be mounted on another node. If two clients write to the same volume, it could cause corruption. The node must be guaranteed to be down before the volume can be mounted on another node.
For CephFS recovery:
With the proposed design, node recovery will also be faster for CephFS.
For RBD RWO recovery:
We have a manual solution to the problem which involves forcefully deleting the pod so that forced detachment and attachment are possible. The problem with the current solution is that, even after the forced pod deletion, it takes around 11 minutes for the volume to mount on the new node. There is also still a chance of data corruption if the old pod on the lost node comes back online: if the documented steps to manually block the node are not followed, this causes multiple writers and can lead to data corruption.
For CephFS recovery:
Currently, CephFS recovery is slower in case of node loss.
Note: This solution requires a minimum Kubernetes version of 1.26.0.
The Kubernetes feature Non-Graceful Node Shutdown is available starting in Kubernetes 1.26 to help improve volume recovery during node loss. When a node is lost, the admin is required to manually add the `out-of-service`
taint to the node. After the node is tainted, Kubernetes will forcefully delete the pods stuck terminating on that node and detach their volumes, so that the pods can be scheduled and the volumes attached on another node.
Once this taint is applied manually, Rook will create a NetworkFence CR. The csi-addons operator will then blocklist the node to prevent any Ceph RBD/CephFS client on the lost node from writing any more data.
After the new pod is running on the new node and the lost node comes back online, Rook will delete the NetworkFence CR.
Example of the taint to be applied to the lost node:

```console
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
# or
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule
```
Note: This feature will be enabled by default in Rook if the NetworkFence CR is found. If for some reason the user wants to disable this feature in Rook, they can edit the `rook-ceph-operator-config` ConfigMap and set `ROOK_WATCH_FOR_NODE_FAILURE: "false"`.
There are multiple networking options available, for example Host Networking, Pod Networking, Multus, etc. This makes it difficult to know which node IP address to blocklist. For this we'll follow the approach below, which works for all networking options except when connected to an external Ceph cluster.
1. Get the `volumesInUse` field from the node which has the `out-of-service` taint.
2. List all the PVs and compare each PV's `spec.csi.volumeHandle` with the `volumeHandle` part of the node's `volumesInUse` entries.
Below is a sample node `volumesInUse` field:

```yaml
volumesInUse:
- kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000002-24862838-240d-4215-9183-abfc0e9e4002
# Note: The volumesInUse naming convention is `kubernetes.io/csi/<CSI driver name>^<volumeHandle>`
```

and the following is the corresponding PV `volumeHandle`:

```yaml
volumeHandle: 0001-0009-rook-ceph-0000000000000002-24862838-240d-4215-9183-abfc0e9e4002
```
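The matching in steps 1 and 2 can be sketched in Python. This is a minimal illustration of the `kubernetes.io/csi/<driver>^<volumeHandle>` naming convention only; the function names are hypothetical, and fetching the node and PV objects from the Kubernetes API is assumed to happen elsewhere:

```python
def parse_volume_in_use(entry):
    """Split a node volumesInUse entry into (driver, volumeHandle).

    The naming convention is: kubernetes.io/csi/<driver>^<volumeHandle>
    """
    prefix = "kubernetes.io/csi/"
    if not entry.startswith(prefix) or "^" not in entry:
        return None  # not a CSI volume entry
    driver, _, handle = entry[len(prefix):].partition("^")
    return driver, handle


def volumes_to_fence(volumes_in_use, pv_handles):
    """Return (driver, volumeHandle) pairs from the lost node whose
    handle matches some PV's spec.csi.volumeHandle."""
    matches = []
    for entry in volumes_in_use:
        parsed = parse_volume_in_use(entry)
        if parsed and parsed[1] in pv_handles:
            matches.append(parsed)
    return matches
```

For the sample entry above, `parse_volume_in_use` yields the driver `rook-ceph.rbd.csi.ceph.com` and the same handle that appears in the PV's `volumeHandle` field.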
3. For the Ceph volumes on that node, find the client addresses:
   - If it is an RBD PVC, make use of the `rbd status` API. Example:

```console
$ rbd status <poolname>/<image_name>
Watchers:
	watcher=172.21.12.201:0/4225036114 client.17881 cookie=18446462598732840961
```
   - If it is a CephFS PVC, use the below CLI to list the clients connected to the subvolume. Example:

```console
$ ceph tell mds.* client ls
...
"addr": {
    "type": "v1",
    "addr": "192.168.39.214:0",
    "nonce": 1301050887
}
...
```
4. Get the IPs from step 3 (in the above example, `172.21.12.201`) and blocklist the IPs where the volumes are mounted.
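Extracting the client IPs from the two command outputs in step 3 could be sketched as follows. This is an illustrative helper, not Rook's implementation: it assumes `rbd status` watcher lines have the `watcher=<ip>:<port>/<nonce>` shape shown above, and that `client ls` output has been obtained as a JSON array of client records with the `addr` object shown above:

```python
import json
import re


def rbd_watcher_ips(rbd_status_output):
    """Extract client IPs from `rbd status` output.

    Watcher lines look like:
      watcher=172.21.12.201:0/4225036114 client.17881 cookie=...
    The address is <ip>:<port>/<nonce>; keep only the IP part.
    """
    return {m.group(1)
            for m in re.finditer(r"watcher=([0-9.]+):\d+/\d+", rbd_status_output)}


def cephfs_client_ips(client_ls_json):
    """Extract client IPs from `client ls` JSON (a list of client records)."""
    ips = set()
    for client in json.loads(client_ls_json):
        addr = client.get("addr", {}).get("addr", "")
        if addr:
            ips.add(addr.rsplit(":", 1)[0])  # strip the :port suffix
    return ips
```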
Example of a NetworkFence CR that the Rook operator would create when a `node.kubernetes.io/out-of-service` taint is added on the node:

```yaml
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: <name> # We will keep the name the same as the node name
  namespace: <ceph-cluster-namespace>
spec:
  driver: <driver-name> # extract the driver name from the PV object
  fenceState: <fence-state> # For us it will be `Fenced`
  cidrs:
    - 172.21.12.201
  secret:
    name: <csi-rbd-provisioner-secret-name/csi-cephfs-provisioner-secret-name> # from the PV object
    namespace: <ceph-cluster-namespace>
  parameters:
    clusterID: <clusterID> # from pv.spec.csi.volumeAttributes
```
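A minimal sketch of assembling this CR from the PV object and the blocklisted IPs, following the template above. The helper name is hypothetical and the `pv` dict is a simplified stand-in for the real PersistentVolume object:

```python
def build_network_fence(node_name, namespace, pv, cidrs, secret_name):
    """Build a NetworkFence manifest for the lost node.

    `pv` is a dict shaped like a PersistentVolume: the driver comes from
    spec.csi.driver and the clusterID from spec.csi.volumeAttributes.
    """
    csi = pv["spec"]["csi"]
    return {
        "apiVersion": "csiaddons.openshift.io/v1alpha1",
        "kind": "NetworkFence",
        # The CR name is kept the same as the node name.
        "metadata": {"name": node_name, "namespace": namespace},
        "spec": {
            "driver": csi["driver"],
            "fenceState": "Fenced",
            "cidrs": cidrs,
            "secret": {"name": secret_name, "namespace": namespace},
            "parameters": {"clusterID": csi["volumeAttributes"]["clusterID"]},
        },
    }
```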
Once the node is back online, the admin removes the taint:

```console
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
# or
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule-
```
Rook will detect that the taint has been removed from the node and immediately unfence the node by deleting the corresponding NetworkFence CR.
Rook will not automate tainting nodes when they go offline; this is a decision the admin needs to make. However, Rook will consider creating a sample script that watches for unavailable nodes and automatically taints a node based on how long it has been offline. The admin can choose to enable automated taints by running this example script.
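The decision logic of such a sample script could look like the sketch below. The threshold and function name are illustrative assumptions; actually applying the taint would use `kubectl taint` or the Kubernetes API, which is omitted here:

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold: how long a node must be NotReady before tainting.
OFFLINE_THRESHOLD = timedelta(minutes=5)


def nodes_to_taint(node_conditions, now=None):
    """Given {node_name: (ready, last_transition_time)} from each node's
    Ready condition, return the nodes that have been NotReady for longer
    than the threshold and should receive the out-of-service taint."""
    now = now or datetime.now(timezone.utc)
    return [
        name
        for name, (ready, since) in node_conditions.items()
        if not ready and now - since > OFFLINE_THRESHOLD
    ]
```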