Currently, a Rook admin can declare how they want their cluster deployed by specifying values in the Cluster CRD. However, after a cluster has been initially declared and deployed, it is not possible to update the Cluster CRD and have those desired changes reflected in the actual cluster state. This document describes a design for how cluster updating can be implemented, along with considerations, trade-offs, and a suggested scope of work.
As previously mentioned, the interface for a user who wants to update their cluster will be the Cluster CRD. To specify changes to a Rook cluster, the user could run a command like the following:
```
kubectl -n rook-ceph edit cluster.ceph.rook.io rook-ceph
```
This will bring up a text editor with the current value of the cluster CRD.
After their desired edits are made, for instance to add a new storage node, they will save and exit the editor.
Of course, it is also possible to update a Cluster CRD via the Kubernetes API instead of `kubectl`.
This will trigger an update of the CRD object, for which the operator is already subscribed to events. The update event is provided both the new and old cluster objects, making it possible to perform a diff between desired and actual state. Once the difference is calculated, the operator will begin to bring actual state into alignment with desired state by performing operations similar to those it performs when creating a cluster in the first place. Controllers, pod templates, config maps, etc. will be updated and configured, with the end result that the Rook cluster pods and state represent the user's desired cluster state.
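As a rough illustration of this flow, the sketch below shows an update handler that diffs the old and new cluster specs and triggers only the orchestration that is needed. The `ClusterSpec` fields and the `reconcileMons`/`reconcileStorage` helpers are simplified assumptions for illustration, not the actual Rook types or functions.

```go
// Sketch only: simplified types and helpers, not the actual Rook implementation.
package operator

import "reflect"

// ClusterSpec is a trimmed-down stand-in for the real Cluster CRD spec.
type ClusterSpec struct {
	MonCount      int
	UseAllDevices bool
	DeviceFilter  string
	Nodes         []string
}

// onClusterUpdate is called by the CRD watch when an update event arrives.
// It diffs desired (new) state against the previously observed (old) state
// and triggers only the orchestration that is actually needed.
func onClusterUpdate(oldSpec, newSpec ClusterSpec) {
	if oldSpec.MonCount != newSpec.MonCount {
		// Scale mons one at a time, waiting for quorum between steps.
		reconcileMons(oldSpec.MonCount, newSpec.MonCount)
	}
	if oldSpec.UseAllDevices != newSpec.UseAllDevices ||
		oldSpec.DeviceFilter != newSpec.DeviceFilter ||
		!reflect.DeepEqual(oldSpec.Nodes, newSpec.Nodes) {
		// Storage changes are handled node by node, one OSD at a time.
		reconcileStorage(oldSpec, newSpec)
	}
}

// Placeholders for the orchestration described later in this document.
func reconcileMons(current, desired int)            {}
func reconcileStorage(oldSpec, newSpec ClusterSpec) {}
```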
The most common case for updating a Rook cluster will be to add and remove storage resources. This will essentially alter the number of OSDs in the cluster, which will cause data rebalancing and migration. Therefore, updating storage resources should be performed by the operator with special consideration so as not to degrade cluster performance and health beyond acceptable levels.
The Cluster CRD has many fields, but not all of them will be updatable (i.e., the operator will not attempt to make any changes to the cluster for updates to some fields).
The following fields will be supported for updates:
- `mon`: Ceph mon specific settings can be changed:
  - `count`: The number of monitors can be updated, and the operator will ensure that as monitors are scaled up or down the cluster remains in quorum.
  - `allowMultiplePerNode`: The policy to allow multiple mons to be placed on one node can be toggled.
- `deviceFilter`: The regex filter for devices allowed to be used for storage can be updated, and OSDs will be added or removed to match the new filter pattern.
- `devicePathFilter`: The regex filter for paths of devices allowed to be used for storage can be updated, and OSDs will be added or removed to match the new filter pattern.
- `useAllDevices`: If this value is updated to `true`, then OSDs will be added to start using all devices on nodes. However, if this value is updated to `false`, the operator will only allow OSDs to be removed if there is a value set for `deviceFilter`. This is to prevent an unintentional action by the user that would effectively remove all data in the cluster.
- `useAllNodes`: This value will be treated similarly to `useAllDevices`. Updating it to `true` is a safe action as it will add more nodes and their storage to the cluster, but updating it to `false` is not always a safe action. If there are no individual nodes listed under the `nodes` field, then updating this field to `false` will not be allowed.
- `resources`: The CPU and memory limits can be dynamically updated.
- `placement`: The placement of daemons across the cluster can be updated, but it is dependent on the specific daemon. For example, monitors can dynamically update their placement as part of their ongoing health checks. OSDs can not update their placement at all since they have data gravity that is tied to specific nodes. Other daemons can decide when and how to update their placement, for example doing nothing for current pods and only honoring new placement settings for new pods.
- `nodes`: Specific storage nodes can be added and removed, as well as additional properties on the individual nodes that have not already been described above:
  - `devices`: The list of devices to use for storage can have entries added and removed.
  - `directories`: The list of directories to use for storage can also be updated.

All other properties not listed above are not supported for runtime updates. Some particular unsupported fields to note:
- `dataDirHostPath`: Once the local host directory for storing cluster metadata and config has been set and populated, migrating it to a new location is not supported.
- `hostNetwork`: After the cluster has been initialized to either use host networking or pod networking, the value can not be changed. Changing this value dynamically would very likely cause a difficult-to-support transition period while pods are transferring between networks and would certainly impact cluster health.

It is in the user's best interests to provide early feedback if they have made an update to their Cluster CRD that is invalid or not supported. Along with issue 1000, we should use the Kubernetes CRD validation feature to verify any changes to the Cluster CRD and provide helpful error messages in the case that their update can not be fulfilled.
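To give that early feedback, updates that touch unsupported fields could be rejected before any orchestration begins. The following is a minimal sketch of such a check, assuming a simplified spec with only the two fields discussed above; in practice this would likely be expressed through the CRD validation work referenced in issue 1000.

```go
// Sketch only: a simplified admission-style check for unsupported updates.
package operator

import "fmt"

// ClusterSpec here contains only the fields relevant to this check.
type ClusterSpec struct {
	DataDirHostPath string
	HostNetwork     bool
}

// validateUpdate rejects changes to fields that cannot be updated at runtime.
func validateUpdate(oldSpec, newSpec ClusterSpec) error {
	if oldSpec.DataDirHostPath != newSpec.DataDirHostPath {
		return fmt.Errorf("dataDirHostPath cannot be changed after the cluster is created")
	}
	if oldSpec.HostNetwork != newSpec.HostNetwork {
		return fmt.Errorf("hostNetwork cannot be changed after the cluster is created")
	}
	return nil
}
```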
It is important to remember that Linux device names can change across reboots. Because of this, we need to be very careful when determining whether it is a safe operation to remove an OSD. We need to be absolutely sure that the user really intended to remove the OSD from a device, as opposed to the device name randomly changing and becoming out of the device filter or list.
What is especially challenging here is that before the initial deployment of OSDs onto a node, which creates the UUIDs for each device, there is no known consistent and user friendly way to specify devices.
A lot of environments do not have labels, IDs, UUIDs, etc. for their devices at first boot, and the only way to address them is by device name, such as `sda`. This is unfortunate because the device name is a volatile identifier.
Some environments do have IDs at first boot and we should consider allowing users to specify devices by those IDs instead of names in the near future.
That effort is being tracked by issue 1228.
The main approach that will be taken to solve this issue is to always compare the device UUID from a node's saved OSD config map against the device UUIDs of the current set of device names. If the two do not match, then it is not a safe operation to remove the OSD from the device. Let's walk through a couple simple scenarios to illustrate this approach:
- NOT SAFE: Device name has changed, but the filter has not been updated by the user:
  1. The user specifies `sda` via device filter or list. Rook configures `sda` and gets an OSD up and running.
  2. After a reboot, the filter still specifies `sda`, but the device has changed its name to `sdb`. The device is now out of the filter.
  3. The node's saved OSD config map shows that the OSD was configured on `sda` with device UUID `wxyz-1234`.
  4. The device UUID of the current `sda` does not match the saved config, so going by the saved config and not what the current device names are, we know that the old `sda` (device UUID `wxyz-1234`), which is now `sdb`, should NOT be removed.
- SAFE: The user has updated the filter and the device name has not changed:
  1. The user specifies `sda` via device filter or list. Rook configures `sda` and gets an OSD up and running.
  2. The user updates the filter to only match `sdb`. `sda` is no longer in the filter.
  3. The operator looks up the device UUID recorded for `sda` in its saved config and compares that to the device UUID of the current `sda`.
  4. They match, so it is safe to remove the OSD from `sda`.
When the operator receives an event that the Cluster CRD has been updated, it will need to perform some orchestration in order to bring the actual state of the cluster into agreement with the desired state.
For example, when `mon.count` is updated, the operator will add or remove a single monitor at a time, ensuring that quorum is restored before moving on to the next monitor.
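A rough sketch of that loop is shown below; `addMon`, `removeMon`, and `waitForQuorum` are hypothetical placeholders for the operator's existing mon orchestration and health checks.

```go
// Sketch only: scale monitors one at a time, restoring quorum between steps.
package operator

import (
	"fmt"
	"time"
)

func reconcileMonCount(current, desired int) error {
	for current != desired {
		if current < desired {
			if err := addMon(); err != nil {
				return err
			}
			current++
		} else {
			if err := removeMon(); err != nil {
				return err
			}
			current--
		}
		// Do not touch another monitor until the cluster is back in quorum.
		if err := waitForQuorum(5 * time.Minute); err != nil {
			return fmt.Errorf("quorum not restored: %v", err)
		}
	}
	return nil
}

// Placeholders for the real orchestration and health checks.
func addMon() error                             { return nil }
func removeMon() error                          { return nil }
func waitForQuorum(timeout time.Duration) error { return nil }
```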
Updates to the storage spec for the cluster require even more careful consideration and management by the operator, which will be discussed in this section.
First and foremost, changes to the cluster state should not be carried out when the cluster is not in a healthy state. The operator should wait until cluster health is restored before carrying out any orchestration.
It is important to remember that a single OSD pod can contain multiple OSD processes and that the operator itself does not have detailed knowledge of the storage resources of each node.
More specifically, the devices that can be used for storage (e.g., those that match `deviceFilter`) are not known until the OSD pod has been started on a given node.
As mentioned previously, it is recommended to make storage changes to the cluster one OSD at a time. Therefore, the operator and the OSD pods will need to coordinate their efforts in order to adhere to this guidance. When a cluster update event is received by the operator, it will work on a node by node basis, ensuring all storage updates are completed by the OSD pod for that node before moving to the next.
When an OSD pod starts up and has completed its device discovery, it will need to perform a diff of the desired storage against the actual storage that is currently included in the cluster. This diff will determine the set of OSD instances that need to be removed or added within the pod. Fortunately, the OSD pod start up is already idempotent and already handles new storage additions, so the remaining work is primarily to determine and carry out the removal of OSD instances that are no longer desired.
We should consider an implementation that allows the OSD pod to refresh its set of OSDs without restarting the entire pod, but since the replication controller's pod template spec needs to be updated by the operator in order to convey this information to the pod, we may need to live with restarting the pod either way. Remember that this will be done one node at a time to mitigate impact to cluster health.
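For illustration, the diff the OSD pod computes at startup might look like the sketch below, keyed by device UUID so that renamed devices are not mistaken for removals. The types are simplified assumptions, not the actual Rook code.

```go
// Sketch only: compute which OSDs to create and which to remove on this node.
package osd

// storageDiff compares the desired set of devices (after applying the filter
// or device list from the CRD) against the devices that already back OSDs on
// this node, keyed by device UUID so renamed devices are not treated as removals.
type storageDiff struct {
	ToCreate []string // device UUIDs that should get a new OSD
	ToRemove []string // device UUIDs whose OSDs are no longer desired
}

func computeStorageDiff(desired, actual map[string]bool) storageDiff {
	var d storageDiff
	for uuid := range desired {
		if !actual[uuid] {
			d.ToCreate = append(d.ToCreate, uuid)
		}
	}
	for uuid := range actual {
		if !desired[uuid] {
			d.ToRemove = append(d.ToRemove, uuid)
		}
	}
	return d
}
```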
Also, other types of update operations to the cluster (e.g., software upgrade) should be blocked while a cluster update is ongoing.
The Cluster CRD status will be kept up to date by the operator so the user has some insight into the process being carried out.
While the operator is carrying out an update to the cluster, the Cluster CRD `status` will be set to `updating`. If there are any errors during the process, the `message` field will be updated with a specific reason for the failure.
We should also update documentation for our users with easy commands to query the status and message fields so they can get more information easily.
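For illustration only, the status the operator writes back could look something like the following sketch; the real field names are defined by the Rook API types, and the `error` state shown here is an assumption.

```go
// Sketch only: simplified view of the status the operator writes back to the
// Cluster CRD while an update is in progress.
package operator

// ClusterStatus mirrors the idea of the CRD status: a state string plus a
// human-readable message explaining failures.
type ClusterStatus struct {
	State   string `json:"state"`
	Message string `json:"message"`
}

// beginUpdate marks the cluster as updating; persistStatus would write the
// status back through the Kubernetes API.
func beginUpdate(persistStatus func(ClusterStatus) error) error {
	return persistStatus(ClusterStatus{State: "updating"})
}

// failUpdate records a specific reason for the failure in the message field.
func failUpdate(persistStatus func(ClusterStatus) error, reason error) error {
	return persistStatus(ClusterStatus{State: "error", Message: reason.Error()})
}
```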
As mentioned previously, the OSD pods need to communicate to the operator when they are done orchestrating their local OSD instance changes. To make this effort more resilient and tolerant of operator restarts, this effort should be able to be resumed. For example, if the operator restarts while an OSD pod is draining OSDs, the operator should not start telling other OSD pods to do work.
The OSDs and operator will jointly maintain a config map to track the status of storage update operations within the cluster.
When the operator initially requests an OSD pod to compute its storage diff, it will update a config map with an entry for the OSD containing a status of `computingDiff` and a current timestamp. When the OSD pod has finished computation and started orchestrating changes, it will update the entry with a status of `orchestrating` and a current timestamp. Finally, when the pod has finished, it will update the entry with `completed` and a current timestamp again, letting the operator know it is safe to move onto the next node. If the operator is restarted during this flow, it will look in the config map for any OSD pod that is not in the `completed` state.
If it finds any, then it will wait until they are completed before moving onto another node.
This approach will ensure that only 1 OSD pod is performing changes at a time.
Note that this approach can also be used to ask an OSD pod to compute changes without having to restart the pod needlessly.
If the OSD pods are watching the config map for changes, then they can compute a diff upon request of the operator.
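A sketch of the shared status entries and the operator's wait loop is shown below; the entry layout and names are assumptions for illustration, not the actual config map format.

```go
// Sketch only: the per-node status entries shared through a config map.
package operator

import "time"

type updateStatus string

const (
	statusComputingDiff updateStatus = "computingDiff"
	statusOrchestrating updateStatus = "orchestrating"
	statusCompleted     updateStatus = "completed"
)

// nodeUpdateEntry is what the operator and the OSD pod read and write for a node.
type nodeUpdateEntry struct {
	Status    updateStatus `json:"status"`
	Timestamp time.Time    `json:"timestamp"`
}

// waitForNodeCompletion polls the entry for a node until the OSD pod reports
// completed, ensuring only one node's storage is being changed at a time.
func waitForNodeCompletion(getEntry func() (nodeUpdateEntry, error), interval time.Duration) error {
	for {
		entry, err := getEntry()
		if err != nil {
			return err
		}
		if entry.Status == statusCompleted {
			return nil
		}
		time.Sleep(interval)
	}
}
```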
This section covers the general sequence for updating storage resources and outlines important considerations for cluster health. Before any changes begin, we will temporarily disable scrubbing of placement groups (the process of verifying data integrity of stored objects) to maximize cluster resources that can go to both client I/O and recovery I/O for data migration:
```
ceph osd set noscrub
ceph osd set nodeep-scrub
```
Some Ceph documentation also recommends limiting backfill and recovery work while storage is being added or removed. The intent is to maximize client I/O while sacrificing throughput of data migration. I do not believe this is strictly necessary and at this point I would prefer to not limit recovery work in the hopes of finishing data migrations as quickly as possible. I suspect that most cluster administrators would not be removing storage when the cluster is under heavy load in the first place. This trade-off can be revisited if we see unacceptable performance impact.
As mentioned previously, we will add one OSD at a time in order to allow the cluster to rebalance itself in a controlled manner and to avoid getting into a situation where there is an unacceptable amount of churn and thrashing. Adding a new OSD is fairly simple since the OSD pod logic already supports it:
1. If the OSD is on a new storage node, add the node to the CRUSH map: `ceph osd crush add-bucket {bucket-name} {type}`
2. Wait for all placement groups to reach the `active+clean` state, meaning data migration is complete (see the sketch below for one way to gate on this).
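For example, the gate between OSD additions could be a polling check like the sketch below, which assumes a hypothetical `pgStates` helper (e.g., backed by `ceph pg stat`) that returns PG counts by state.

```go
// Sketch only: block until every placement group reports active+clean.
package operator

import "time"

// waitForCleanPGs polls a hypothetical pgStates helper until all PGs are
// active+clean or the timeout expires.
func waitForCleanPGs(pgStates func() (map[string]int, error), timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		states, err := pgStates()
		if err == nil {
			clean := true
			for state, count := range states {
				if state != "active+clean" && count > 0 {
					clean = false
					break
				}
			}
			if clean {
				return true
			}
		}
		time.Sleep(interval)
	}
	return false
}
```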
Removing storage is a more involved process, and it will also be done one OSD at a time to ensure the cluster returns to a clean state.
Of special note for removing storage is that a check should be performed to ensure that the cluster has enough remaining storage to recover (backfill) the entire set of objects from the OSD that is being removed.
If the cluster does not have enough space for this (e.g., it would hit the `full` ratio), then the removal should not proceed.
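A simple capacity pre-check might look like the following sketch; the total, used, and OSD capacity values would come from Ceph (e.g., `ceph osd df`) in practice, and the full ratio is whatever the cluster is configured with (commonly 0.95).

```go
// Sketch only: verify the cluster can absorb an OSD's data without hitting the full ratio.
package operator

// canRemoveOSD returns true if the data currently stored in the cluster would
// still fit below fullRatio (e.g., 0.95) once the OSD's capacity is removed
// and its objects are backfilled onto the remaining OSDs.
func canRemoveOSD(totalBytes, usedBytes, osdCapacityBytes uint64, fullRatio float64) bool {
	if osdCapacityBytes >= totalBytes {
		return false
	}
	remainingCapacity := totalBytes - osdCapacityBytes
	return float64(usedBytes)/float64(remainingCapacity) < fullRatio
}
```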
For each OSD to remove, the following steps should be performed:
1. Reweight the OSD to 0 with `ceph osd crush reweight osd.<id> 0.0`, which will trigger data migration from the OSD.
2. Wait for all placement groups to reach the `active+clean` state.
3. Mark the OSD `out` with `ceph osd out osd.<id>`.
4. Remove the OSD from the CRUSH map: `ceph osd crush remove osd.<id>`
5. Delete the OSD's authentication key: `ceph auth del osd.<id>`
6. Remove the OSD from the cluster: `ceph osd rm osd.<id>`
7. Delete the OSD's directory on the host (under `dataDirHostPath`): `rm -fr /var/lib/rook/<osdID>`
If the entire node is being removed, ensure that the host node is also removed from the CRUSH map:
```
ceph osd crush rm <host-bucket-name>
```
After all storage updates are completed, both additions and removals, then we can once again enable scrubbing:
```
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```
The number of placement groups in the cluster compared to the number of OSDs is a difficult trade-off without knowing the user's intent for future cluster growth. The general rule of thumb is that you want around 100 PGs per OSD. With less than that, you have potentially unbalanced distribution of data with certain OSDs storing more than others. With more PGs than that, you have increased overhead in the cluster because more OSDs need to coordinate with each other, impacting performance and reliability.
It's important to note that shrinking placement group count (merging) is still not supported in Ceph. Therefore, you can only increase the number of placement groups (splitting) over time.
If the cluster grows such that we have too few placement groups per OSD, then we can consider increasing the number of PGs in the cluster by incrementing the `pg_num` and `pgp_num` values for each storage pool.
Similar to adding new OSDs, this increase of PGs should be done incrementally and in a coordinated fashion to avoid degrading performance significantly in the cluster.
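For illustration, the incremental increase might be driven by a loop like the sketch below, assuming a hypothetical `setPoolPGNum` helper that wraps `ceph osd pool set <pool> pg_num <n>` (with `pgp_num` following) and a health gate like the PG check shown earlier.

```go
// Sketch only: grow a pool's PG count in small steps, waiting for the cluster
// to settle between increments.
package operator

import "fmt"

func increasePGs(pool string, current, target, step int,
	setPoolPGNum func(pool string, pgNum int) error,
	waitForClean func() bool) error {

	for current < target {
		next := current + step
		if next > target {
			next = target
		}
		if err := setPoolPGNum(pool, next); err != nil {
			return err
		}
		// Let splitting and any resulting data movement finish before continuing.
		if !waitForClean() {
			return fmt.Errorf("cluster did not return to a clean state after increasing pg_num to %d", next)
		}
		current = next
	}
	return nil
}
```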
Placement group management will be tracked in further detail in issue 560.
The implementation of the design described in this document could be done in a phased approach in order to get critical features out sooner. One proposal for implementation phases would be:
1. Support updates to the simpler settings (e.g., `mon`, `resources`, etc.).