Rook and Ceph upgrades are designed to ensure data remains available even while the upgrade is proceeding. Rook will perform the upgrades in a rolling fashion such that application pods are not disrupted. To ensure the upgrades are seamless, it is important to begin the upgrades with Ceph in a fully healthy state. This guide reviews ways of verifying the health of a CephCluster.
See the troubleshooting documentation for any issues during upgrades.
In a healthy Rook cluster, all pods in the Rook namespace should be in the `Running` (or `Completed`) state and have few, if any, pod restarts.
```bash
ROOK_CLUSTER_NAMESPACE=rook-ceph
kubectl -n $ROOK_CLUSTER_NAMESPACE get pods
```
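A quick way to surface unhealthy pods is to filter on the pod phase. Below is a minimal sketch using standard `kubectl` field selectors; note that `Completed` pods report the `Succeeded` phase, so both healthy phases are excluded:

```bash
# List only pods that are neither Running nor Completed (Succeeded).
kubectl -n $ROOK_CLUSTER_NAMESPACE get pods \
  --field-selector=status.phase!=Running,status.phase!=Succeeded
```

If this prints nothing other than a "No resources found" message, all pods are in a healthy phase; restart counts still need to be checked in the full `get pods` output.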
The Rook toolbox contains the Ceph tools that give status details of the cluster with the `ceph status` command. Run it from the toolbox pod:
```bash
TOOLS_POD=$(kubectl -n $ROOK_CLUSTER_NAMESPACE get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[*].metadata.name}')
kubectl -n $ROOK_CLUSTER_NAMESPACE exec -it $TOOLS_POD -- ceph status
```
The output should look similar to the following:
```
  cluster:
    id:     a3f4d647-9538-4aff-9fd1-b845873c3fe9
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,c,a
    mgr: a(active)
    mds: myfs-1/1/1 up {0=myfs-a=up:active}, 1 up:standby-replay
    osd: 6 osds: 6 up, 6 in
    rgw: 1 daemon active

  data:
    pools:   9 pools, 900 pgs
    objects: 67 objects, 11 KiB
    usage:   6.1 GiB used, 54 GiB / 60 GiB avail
    pgs:     900 active+clean

  io:
    client:   7.4 KiB/s rd, 681 B/s wr, 11 op/s rd, 4 op/s wr
    recovery: 164 B/s, 1 objects/s
```
In the output above, note the following indications that the cluster is in a healthy state:

* Cluster health: The overall cluster status is `HEALTH_OK` and there are no warning or error status messages displayed.
* Monitors (mon): All of the monitors are included in the `quorum` list.
* Manager (mgr): The Ceph manager is in the `active` state.
* OSDs (osd): All OSDs are `up` and `in`.
* Placement groups (pgs): All PGs are in the `active+clean` state.
* (If applicable) Ceph filesystem metadata servers (mds): All MDSes are `active` for all filesystems.
* (If applicable) Ceph object store RADOS gateways (rgw): All daemons are `active`.
If the `ceph status` output has deviations from the general good health described above, there may be an issue that needs to be investigated further. Other commands may show more relevant details on the health of the system, such as `ceph osd status`. See the Ceph troubleshooting docs for help.
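For example, the toolbox can run these follow-up commands directly; a sketch reusing the `TOOLS_POD` variable from above:

```bash
# Per-OSD table of usage, host placement, and up/in state.
kubectl -n $ROOK_CLUSTER_NAMESPACE exec -it $TOOLS_POD -- ceph osd status
# Expanded descriptions of any active health warnings or errors.
kubectl -n $ROOK_CLUSTER_NAMESPACE exec -it $TOOLS_POD -- ceph health detail
```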
Rook will not upgrade Ceph daemons if the health is in a `HEALTH_ERR` state. Rook can be configured to proceed with the (potentially unsafe) upgrade by setting either `skipUpgradeChecks: true` or `continueUpgradeAfterChecksEvenIfNotHealthy: true` as described in the cluster CR settings.
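If you do choose to force the upgrade, the setting can be applied to a live cluster with a merge patch. This is a sketch that assumes the CephCluster resource is named `rook-ceph` (the common default); adjust the name for your cluster:

```bash
# WARNING: allows the upgrade to proceed even if pre-upgrade health checks fail.
kubectl -n $ROOK_CLUSTER_NAMESPACE patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"skipUpgradeChecks":true}}'
```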
The container version running in a specific pod in the Rook cluster can be verified in its pod spec output. For example, for the monitor pod `mon-b`, verify the container version it is running with the below commands:
```bash
POD_NAME=$(kubectl -n $ROOK_CLUSTER_NAMESPACE get pod -o custom-columns=name:.metadata.name --no-headers | grep rook-ceph-mon-b)
kubectl -n $ROOK_CLUSTER_NAMESPACE get pod ${POD_NAME} -o jsonpath='{.spec.containers[0].image}'
```
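The JSONPath above only shows the first container. If a pod has more than one container, a small variant (same assumptions) lists every container image:

```bash
# Print the image of each container in the pod, one per line.
kubectl -n $ROOK_CLUSTER_NAMESPACE get pod ${POD_NAME} \
  -o jsonpath='{range .spec.containers[*]}{.image}{"\n"}{end}'
```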
The status and container versions for all Rook pods can be collected all at once with the following commands:
```bash
# The operator may run in a different namespace than the cluster; adjust as needed.
ROOK_OPERATOR_NAMESPACE=rook-ceph
kubectl -n $ROOK_OPERATOR_NAMESPACE get pod -o jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.status.phase}{"\t\t"}{.spec.containers[0].image}{"\t"}{.spec.initContainers[0].image}{"\n"}{end}' && \
kubectl -n $ROOK_CLUSTER_NAMESPACE get pod -o jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.status.phase}{"\t\t"}{.spec.containers[0].image}{"\t"}{.spec.initContainers[0].image}{"\n"}{end}'
```
The `rook-version` label exists on resources managed by Rook. A summary of these resource controllers can be gained with the commands below. These will report the requested, updated, and currently available replicas for various Rook resources, in addition to the Rook version for resources managed by Rook. Note that the operator and toolbox deployments do not have a `rook-version` label set.
```bash
kubectl -n $ROOK_CLUSTER_NAMESPACE get deployments -o jsonpath='{range .items[*]}{.metadata.name}{" \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{" \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'
kubectl -n $ROOK_CLUSTER_NAMESPACE get jobs -o jsonpath='{range .items[*]}{.metadata.name}{" \tsucceeded: "}{.status.succeeded}{" \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'
```
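During an upgrade it can be useful to reduce this output to the distinct Rook versions present; once the upgrade has converged, the sketch below should print a single version (deployments without the label, such as the toolbox, contribute an empty line):

```bash
# Print the unique rook-version labels across the cluster's deployments.
kubectl -n $ROOK_CLUSTER_NAMESPACE get deployments \
  -o jsonpath='{range .items[*]}{.metadata.labels.rook-version}{"\n"}{end}' | sort -u
```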
Any pod that is using a Rook volume should also remain healthy:

* In the `Running` state with few, if any, restarts
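As a sketch, assuming the application runs in a namespace named `my-app` (a hypothetical placeholder), the phase and restart counts can be checked with custom columns:

```bash
# "my-app" is a placeholder; substitute your application's namespace.
kubectl -n my-app get pods \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount'
```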