Rook allows creation and customization of storage clusters through the custom resource definitions (CRDs). There are primarily four different modes in which to create your cluster.
See the separate topics for a description and examples of each of these scenarios.
Settings can be specified at the global level to apply to the cluster as a whole, while other settings can be specified at more fine-grained levels. If any setting is unspecified, a suitable default will be used automatically.
name
: The name that will be used internally for the Ceph cluster. Most commonly the name is the same as the namespace since multiple clusters are not supported in the same namespace.
namespace
: The Kubernetes namespace that will be created for the Rook cluster. The services, pods, and other resources created by the operator will be added to this namespace. The common scenario is to create a single Rook cluster. If multiple clusters are created, they must not have conflicting devices or host paths.
external
:
enable
: If true, the cluster will not be managed by Rook but by an external entity. This mode is intended to connect to an existing cluster. In this case, Rook will only consume the external cluster. However, Rook will be able to deploy various daemons in Kubernetes such as object gateways, mds, and nfs if an image is provided, and will refuse otherwise. If this setting is enabled, all the other options will be ignored except cephVersion.image and dataDirHostPath. See external cluster configuration. If cephVersion.image is left blank, Rook will refuse the creation of extra CRs like object, file, and nfs.
cephVersion
: The version information for launching the ceph daemons.
image
: The image used for running the ceph daemons. For example, quay.io/ceph/ceph:v17.2.6. For more details read the container images section.
For the latest ceph images, see the Ceph DockerHub.
To ensure a consistent version of the image is running across all nodes in the cluster, it is recommended to use a very specific image version.
Tags also exist that would give the latest version, but they are only recommended for test environments. For example, the tag v17 will be updated each time a new Quincy build is released.
Using the v17 tag is not recommended in production because it may lead to inconsistent versions of the image running across different nodes in the cluster.
allowUnsupported
: If true, allow an unsupported major version of the Ceph release. Currently quincy and reef are supported. Future versions such as squid (v19) would require this to be set to true. Should be set to false in production.
imagePullPolicy
: The image pull policy for the ceph daemon pods. Possible values are Always, IfNotPresent, and Never. The default is IfNotPresent.
dataDirHostPath
: The path on the host (hostPath) where config and data should be stored for each of the services. If the directory does not exist, it will be created. Because this directory persists on the host, it will remain after pods are deleted. The following paths and any of their subpaths must not be used: /etc/ceph, /rook, or /var/log/ceph.
When a new cluster is created on hosts that previously ran a cluster, the path used by dataDirHostPath must be deleted first. Otherwise, stale keys and other config will remain from the previous cluster and the new mons will fail to start.
If this value is empty, each pod will get an ephemeral directory to store its config files that is tied to the lifetime of the pod running on that node. More details can be found in the Kubernetes empty dir docs.
skipUpgradeChecks
: If set to true, Rook won't perform any upgrade checks on Ceph daemons during an upgrade. Use this at YOUR OWN RISK, only if you know what you're doing. To understand Rook's upgrade process of Ceph, read the upgrade doc.
continueUpgradeAfterChecksEvenIfNotHealthy
: If set to true, Rook will continue the OSD daemon upgrade process even if the PGs are not clean, or continue with the MDS upgrade even if the file system is not healthy.
dashboard
: Settings for the Ceph dashboard. To view the dashboard in your browser see the dashboard guide.
enabled
: Whether to enable the dashboard to view cluster status.
urlPrefix
: Allows serving the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy).
port
: Allows changing the default port where the dashboard is served.
ssl
: Whether to serve the dashboard via SSL. Ignored on Ceph versions older than 13.2.2.
monitoring
: Settings for monitoring Ceph using Prometheus. To enable monitoring on your cluster see the monitoring guide.
enabled
: Whether to enable the prometheus service monitor for an internal cluster. For an external cluster, whether to create an endpoint port for the metrics. Default is false.
metricsDisabled
: Whether to disable the metrics reported by Ceph. If false, the prometheus mgr module and Ceph exporter are enabled. If true, the prometheus mgr module and Ceph exporter are both disabled. Default is false.
externalMgrEndpoints
: External cluster manager endpoints.
externalMgrPrometheusPort
: External prometheus manager module port. See external cluster configuration for more details.
port
: The internal port where the prometheus mgr module listens. The port may need to be configured when host networking is enabled.
interval
: The interval for the prometheus module to scrape targets.
network
: For the network settings for the cluster, refer to the network configuration settings.
mon
: Contains mon related options; see mon settings. For more details on the mons and when to choose a number other than 3, see the mon health doc.
mgr
: manager top level section
count
: Set the number of ceph managers, between 1 and 2. The default value is 2.
If there are two managers, it is important for all mgr services to point to the active mgr and not the standby mgr. Rook automatically updates the label mgr_role on the mgr pods to be either active or standby. Therefore, services only need to add the label mgr_role=active to their selector to point to the active mgr. This applies to all services that rely on the ceph mgr, such as the dashboard or the prometheus metrics collector.
modules
: The list of Ceph manager modules to enable.
crashCollector
: The settings for crash collector daemon(s).
disable
: If set to true, the crash collector will not run on any node where a Ceph daemon runs.
daysToRetain
: Specifies the number of days to keep crash entries in the Ceph cluster. By default the entries are kept indefinitely.
logCollector
: The settings for log collector daemon.
enabled
: If set to true, the log collector will run as a side-car next to each Ceph daemon. The Ceph configuration option log_to_file will be turned on, meaning Ceph daemons will log to files in addition to still logging to the container's stdout. These logs will be rotated. In case a daemon terminates with a segfault, the coredump files will commonly be generated in the /var/lib/systemd/coredump directory on the host, depending on the underlying OS location. (default: true)
periodicity
: How often to rotate the daemon's log. (default: 24h). Specified with a time suffix which may be h for hours or d for days. Rotating too often will slightly impact the daemon's performance since the signal briefly interrupts the program.
annotations
: annotations configuration settings
labels
: labels configuration settings
placement
: placement configuration settings
resources
: resources configuration settings
priorityClassNames
: priority class names configuration settings
storage
: Storage selection and configuration that will be used across the cluster. Note that these settings can be overridden for specific nodes.
useAllNodes
: true or false, indicating if all nodes in the cluster should be used for storage according to the cluster level storage selection and configuration values. If individual nodes are specified under the nodes field, then useAllNodes must be set to false.
nodes
: Names of individual nodes in the cluster that should have their storage included in accordance with either the cluster level configuration specified above or any node specific overrides described in the next section below. useAllNodes must be set to false to use specific nodes and their config. See node settings below.
config
: Config settings applied to all OSDs on the node unless overridden by devices. See the config settings below.
onlyApplyOSDPlacement
: Whether the placement specific to OSDs is merged with the all placement. If false, the OSD placement will be merged with the all placement. If true, the OSD placement will be applied and the all placement will be ignored. The placement for OSDs is computed from several different places depending on the type of OSD:
placement.all and placement.osd
placement.all and inside the storageClassDeviceSets from the placement or preparePlacement
flappingRestartIntervalHours
: Defines the time for which an OSD pod will sleep before restarting, if it stopped due to flapping. Flapping occurs when OSDs are marked down by Ceph more than 5 times in 600 seconds. The OSDs will stay down when flapping since they likely have a bad disk or other issue that needs investigation. If the issue with the OSD is fixed manually, the OSD pod can be manually restarted. The sleep is disabled if this interval is set to 0.
disruptionManagement
: The section for configuring management of daemon disruptions
managePodBudgets
: If true, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically via the strategy outlined in the design. The operator will block eviction of OSDs by default and unblock them safely when drains are detected.
osdMaintenanceTimeout
: A duration in minutes that determines how long an entire failureDomain like region/zone/host will be held in noout (in addition to the default DOWN/OUT interval) when it is draining. The default value is 30 minutes.
pgHealthCheckTimeout
: A duration in minutes that the operator will wait for the placement groups to become healthy (see pgHealthyRegex) after a drain was completed and OSDs came back up. The operator will continue with the next drain if the timeout is exceeded. No value or 0 means that the operator will wait until the placement groups are healthy before unblocking the next drain.
pgHealthyRegex
: The regular expression that is used to determine which PG states should be considered healthy. The default is ^(active\+clean|active\+clean\+scrubbing|active\+clean\+scrubbing\+deep)$.
removeOSDsIfOutAndSafeToRemove
: If true, the operator will remove the OSDs that are down and whose data has been restored to other OSDs. In Ceph terms, the OSDs are out and safe-to-destroy when they are removed.
cleanupPolicy
: cleanup policy settings
security
: security page for key management configuration
cephConfig
: Set Ceph config options using the Ceph Mon config store
csi
: Set CSI Driver options
Official releases of Ceph Container images are available from Docker Hub.
These are general purpose Ceph containers with all necessary daemons and dependencies installed.
TAG | MEANING |
---|---|
vRELNUM | Latest release in this series (e.g., v17 = Quincy) |
vRELNUM.Y | Latest stable release in this stable series (e.g., v17.2) |
vRELNUM.Y.Z | A specific release (e.g., v17.2.6) |
vRELNUM.Y.Z-YYYYMMDD | A specific build (e.g., v17.2.6-20230410) |
A specific build will contain a specific release of Ceph as well as security fixes from the Operating System.
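To make the cluster-level settings described so far more concrete, here is a minimal sketch of a CephCluster resource that pins a specific image tag. The apiVersion, the metadata values, the /var/lib/rook host path, and the chosen values are illustrative assumptions rather than recommendations taken from this page; any setting left out falls back to its default.

```yaml
apiVersion: ceph.rook.io/v1        # assumed API group/version for the CephCluster CRD
kind: CephCluster
metadata:
  name: rook-ceph                  # commonly the same as the namespace
  namespace: rook-ceph
spec:
  cephVersion:
    # Pin a specific release instead of a floating tag such as v17.
    image: quay.io/ceph/ceph:v17.2.6
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook   # assumed path; must not be /etc/ceph, /rook, or /var/log/ceph
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
  dashboard:
    enabled: true
    ssl: true
  monitoring:
    enabled: false
  crashCollector:
    disable: false
  logCollector:
    enabled: true
    periodicity: 24h
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30      # minutes
    pgHealthCheckTimeout: 0        # 0: wait until the PGs are healthy
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "^sd[b-d]"       # illustrative filter on short kernel names
```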
count
: Set the number of mons to be started. The number must be between 1 and 9. The recommended value is most commonly 3.
For highest availability, an odd number of mons should be specified.
For higher durability in case of mon loss, an even number can be specified although availability may be lower.
To maintain quorum a majority of mons must be up. For example, if there are three mons, two must be up.
If there are four mons, three must be up. If there are two mons, both must be up.
If quorum is lost, see the disaster recovery guide to restore quorum from a single mon.
allowMultiplePerNode
: Whether to allow the placement of multiple mons on a single node. Default is false for production. Should only be set to true in test environments.
volumeClaimTemplate
: A PersistentVolumeSpec used by Rook to create PVCs for monitor storage. This field is optional, and when not provided, HostPath volume mounts are used. The fields currently used from the template are storageClassName and the storage resource request and limit. The default storage size request for new PVCs is 10Gi. Ensure that the associated storage class is configured to use volumeBindingMode: WaitForFirstConsumer. This setting only applies to new monitors that are created when the requested number of monitors increases, or when a monitor fails and is recreated. An example CRD configuration is provided below.
failureDomainLabel
: The label that is expected on each node where the mons are expected to be deployed. The labels must be found in the list of well-known topology labels.
zones
: The failure domain names where the Mons are expected to be deployed. There must be at least three zones specified in the list. Each zone can be backed by a different storage class by specifying the volumeClaimTemplate.
name
: The name of the zone, which is the value of the domain label.
volumeClaimTemplate
: A PersistentVolumeSpec used by Rook to create PVCs for monitor storage. This field is optional, and when not provided, HostPath volume mounts are used. The fields currently used from the template are storageClassName and the storage resource request and limit. The default storage size request for new PVCs is 10Gi. Ensure that the associated storage class is configured to use volumeBindingMode: WaitForFirstConsumer. This setting only applies to new monitors that are created when the requested number of monitors increases, or when a monitor fails and is recreated. An example CRD configuration is provided below.
stretchCluster
: The stretch cluster settings that define the zones (or other failure domain labels) across which to configure the cluster.
failureDomainLabel
: The label that is expected on each node where the cluster is expected to be deployed. The labels must be found in the list of well-known topology labels.
subFailureDomain
: Within a zone, the data replicas must be spread across OSDs in the subFailureDomain. The default is host.
zones
: The failure domain names where the Mons and OSDs are expected to be deployed. There must be three zones specified in the list. This element is always named zone even if a non-default failureDomainLabel is specified. The elements have the following values:
name
: The name of the zone, which is the value of the domain label.
arbiter
: Whether the zone is expected to be the arbiter zone which only runs a single mon. Exactly one zone must be labeled true.
volumeClaimTemplate
: A PersistentVolumeSpec used by Rook to create PVCs for monitor storage. This field is optional, and when not provided, HostPath volume mounts are used. The fields currently used from the template are storageClassName and the storage resource request and limit. The default storage size request for new PVCs is 10Gi. Ensure that the associated storage class is configured to use volumeBindingMode: WaitForFirstConsumer. This setting only applies to new monitors that are created when the requested number of monitors increases, or when a monitor fails and is recreated. An example CRD configuration is provided below.
The two zones that are not the arbiter zone are expected to have OSDs deployed.
If these settings are changed in the CRD, the operator will update the number of mons during a periodic check of the mon health, which by default is every 45 seconds.
To change the defaults that the operator uses to determine the mon health and whether to failover a mon, refer to the health settings. The intervals should be small enough that you have confidence the mons will maintain quorum, while also being long enough to ignore network blips where mons are failed over too often.
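As a rough illustration of the mon settings above, the fragment below sketches the mon section of a CephCluster spec that requests three mons backed by PVCs. The storage class name local-storage is a hypothetical placeholder, not something defined in this document; substitute a class from your cluster that uses volumeBindingMode: WaitForFirstConsumer.

```yaml
mon:
  # Odd count: with three mons, two must be up to keep quorum.
  count: 3
  allowMultiplePerNode: false
  volumeClaimTemplate:
    spec:
      storageClassName: local-storage   # hypothetical storage class name
      resources:
        requests:
          storage: 10Gi                 # matches the default request for new PVCs
```

If volumeClaimTemplate is omitted, HostPath volume mounts are used for mon storage instead.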
You can use the cluster CR to enable or disable any manager module. This can be configured like so:
mgr:
  modules:
    - name: <name of the module>
      enabled: true
Some modules will have special configuration to ensure the module is fully functional after being enabled. Specifically:
pg_autoscaler
: Rook will configure all new pools with PG autoscaling by setting: osd_pool_default_pg_autoscale_mode = on
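When two mgrs are running, a Service that must reach the active mgr can select on the mgr_role label described under the mgr settings. The following is only a sketch: the Service name, the app and rook_cluster labels, and port 8443 are assumptions chosen for illustration and should be checked against the labels and ports actually present on the mgr pods in your cluster.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-mgr-dashboard      # hypothetical name
  namespace: rook-ceph
spec:
  selector:
    app: rook-ceph-mgr             # assumed label on the mgr pods
    rook_cluster: rook-ceph        # assumed label carrying the cluster name
    mgr_role: active               # route only to the active mgr, never the standby
  ports:
    - name: dashboard
      port: 8443                   # assumed dashboard port
      targetPort: 8443
      protocol: TCP
```

Because Rook updates mgr_role on the pods when a failover happens, a selector like this should follow the active mgr without manual changes.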
If not specified, the default SDN will be used. Configure the network that will be enabled for the cluster and services.
provider
: Specifies the network provider that will be used to connect the network interface. You can choose between host and multus.
selectors
: Used for the multus provider only. Select NetworkAttachmentDefinitions to use for Ceph networks.
public
: Select the NetworkAttachmentDefinition to use for the public network.
cluster
: Select the NetworkAttachmentDefinition to use for the cluster network.
addressRanges
: Used for host or multus providers only. Allows overriding the address ranges (CIDRs) that Ceph will listen on.
public
: A list of individual network ranges in CIDR format to use for Ceph's public network.
cluster
: A list of individual network ranges in CIDR format to use for Ceph's cluster network.
ipFamily
: Specifies the network stack Ceph daemons should listen on.
dualStack
: Specifies that Ceph daemons should listen on both IPv4 and IPv6 network stacks.
connections
: Settings for network connections using Ceph's msgr2 protocol
requireMsgr2
: Whether to require communication over msgr2. If true, the msgr v1 port (6789) will be disabled and clients will be required to connect to the Ceph cluster with the v2 port (3300). Requires a kernel that supports msgr2 (kernel 5.11 or CentOS 8.4 or newer). Default is false.
encryption
: Settings for encryption on the wire to Ceph daemons
enabled
: Whether to encrypt the data in transit across the wire to prevent eavesdropping on the network. The default is false. When encryption is enabled, all communication between clients and Ceph daemons, or between Ceph daemons, will be encrypted. When encryption is not enabled, clients still establish a strong initial authentication and data integrity is still validated with a crc check.
IMPORTANT: Encryption requires the 5.11 kernel for the latest nbd and cephfs drivers. Alternatively, for testing only, set "mounter: rbd-nbd" in the rbd storage class, or "mounter: fuse" in the cephfs storage class. The nbd and fuse drivers are not recommended in production since restarting the csi driver pod will disconnect the volumes. If this setting is enabled, CephFS volumes also require setting CSI_CEPHFS_KERNEL_MOUNT_OPTIONS to "ms_mode=secure" in operator.yaml.
compression
:
enabled
: Whether to compress the data in transit across the wire. The default is false. See the kernel requirements above for encryption.
Caution: Changing networking configuration after a Ceph cluster has been deployed is NOT supported and will result in a non-functioning cluster.
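The connection settings listed above nest under network.connections. The fragment below is a minimal sketch of how they might be combined; the values are illustrative, and the kernel requirements described for requireMsgr2 and encryption still apply.

```yaml
network:
  connections:
    # Disable the msgr v1 port (6789); clients must use the v2 port (3300).
    requireMsgr2: true
    encryption:
      # Encrypt traffic between clients and daemons and between daemons.
      enabled: true
    compression:
      # Leave on-the-wire compression at its default.
      enabled: false
```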
Ceph daemons can operate on up to two distinct networks: public, and cluster.
Ceph daemons always use the public network, which is the Kubernetes pod network by default. The public network is used for client communications with the Ceph cluster (reads/writes).
If specified, the cluster network is used to isolate internal Ceph replication traffic. This includes additional copies of data replicated between OSDs during client reads/writes. This also includes OSD data recovery (re-replication) when OSDs or nodes go offline. If the cluster network is unspecified, the public network is used for this traffic instead.
Some Rook network providers allow manually specifying the public and cluster network interfaces that Ceph will use for data traffic. Use addressRanges to specify this. For example:
network:
  provider: host
  addressRanges:
    public:
      - "192.168.100.0/24"
      - "192.168.101.0/24"
    cluster:
      - "192.168.200.0/24"
This spec translates directly to Ceph's public_network and cluster_network configurations.
Refer to Ceph networking documentation for more details.
The default, unspecified network provider cannot make use of these configurations.
Ceph public and cluster network configurations are allowed to change, but this should be done with great care. When updating underlying networks or Ceph network settings, Rook assumes that the current network configuration used by Ceph daemons will continue to operate as intended. Network changes are not applied to Ceph daemon pods (like OSDs and MDSes) until the pod is restarted. When making network changes, ensure that restarted pods will not lose connectivity to existing pods, and vice versa.
To use host networking, set provider: host.
To instruct Ceph to operate on specific host interfaces or networks, use addressRanges to select the network CIDRs Ceph will bind to on the host.
If the host networking setting is changed in a cluster where mons are already running, the existing mons will remain running with the same network settings with which they were created. To complete the conversion to or from host networking after you update this setting, you will need to failover the mons in order to have mons on the desired network configuration.
Rook supports using Multus NetworkAttachmentDefinitions for Ceph public and cluster networks.
Refer to Multus documentation for details about how to set up and select Multus networks.
Rook will attempt to auto-discover the network CIDRs for selected public and/or cluster networks. This process is not guaranteed to succeed. Furthermore, this process will get a new network lease for each CephCluster reconcile. Specify addressRanges manually if the auto-detection process fails or if the selected network configuration cannot automatically recycle released network leases.
Only OSD pods will have both public and cluster networks attached (if specified). The rest of the Ceph component pods and CSI pods will only have the public network attached. The Rook operator will not have any networks attached; it proxies Ceph commands via a sidecar container in the mgr pod.
A NetworkAttachmentDefinition must exist before it can be used by Multus for a Ceph network. A recommended definition will look like the following:
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ceph-multus-net
  namespace: rook-ceph
spec:
  config: '{
      "cniVersion": "0.3.0",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.200.0/24"
      }
    }'
master matches the network interface on hosts that you want to use. It must be the same across all hosts.
macvlan is highly recommended. It has less CPU and memory overhead compared to traditional Linux bridge configurations.
whereabouts is recommended because it ensures each pod gets an IP address unique within the Kubernetes cluster. No DHCP server is required. If a DHCP server is present on the network, ensure the IP range does not overlap with the DHCP server's range.
NetworkAttachmentDefinitions are selected for the desired Ceph network using selectors. Selector values should include the namespace in which the NAD is present. public and cluster may be selected independently. If public is left unspecified, Rook will configure Ceph to use the Kubernetes pod network for Ceph client traffic.
Consider the example below which selects a hypothetical Kubernetes-wide Multus network in the
default namespace for Ceph's public network and selects a Ceph-specific network in the rook-ceph
namespace for Ceph's cluster network. The commented-out portion shows an example of how address
ranges could be manually specified for the networks if needed.
network:
  provider: multus
  selectors:
    public: default/kube-multus-net
    cluster: rook-ceph/ceph-multus-net
  # addressRanges:
  #   public:
  #     - "192.168.100.0/24"
  #     - "192.168.101.0/24"
  #   cluster:
  #     - "192.168.200.0/24"
We highly recommend validating your Multus configuration before you install Rook. A tool exists to facilitate validating the Multus configuration. After installing the Rook operator and before installing any Custom Resources, run the tool from the operator pod.
The tool's CLI is designed to be as helpful as possible. Get help text for the multus validation tool like so:
kubectl --namespace rook-ceph exec -it deploy/rook-ceph-operator -- rook multus validation run --help
Then, update the args in the multus-validation job template. Minimally, add the NAD name(s) for public and/or cluster as needed, and then create the job to validate the Multus configuration.
If the tool fails, it will suggest what things may be preventing Multus networks from working properly, and it will request the logs and outputs that will help debug issues.
Check the logs of the pod created by the job to know the status of the validation test.
Daemons leveraging Kubernetes service IPs (Monitors, Managers, Rados Gateways) do not listen on the NAD specified in the selectors. Instead, each daemon listens on the default network; however, the NAD is attached to the container, allowing the daemon to communicate with the rest of the cluster. There is work in progress to fix this issue in the multus-service repository. At the time of writing it's unclear when this will be supported.
Provide single-stack IPv4 or IPv6 protocol to assign corresponding addresses to pods and services. This field is optional. Possible inputs are IPv6 and IPv4. Empty value will be treated as IPv4. To enable dual stack see the network configuration section.
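For instance, a single-stack IPv6 cluster might be requested with a fragment like the one below. This is a sketch under the assumption that the pod and service networks of the Kubernetes cluster are themselves IPv6-capable; leaving ipFamily empty is treated as IPv4.

```yaml
network:
  # Single-stack IPv6; possible inputs are IPv6 and IPv4.
  ipFamily: "IPv6"
  # Listen on one stack only.
  dualStack: false
```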
In addition to the cluster level settings specified above, each individual node can also specify configuration to override the cluster level settings and defaults. If a node does not specify any configuration then it will inherit the cluster level settings.
name
: The name of the node, which should match its kubernetes.io/hostname label.
config
: Config settings applied to all OSDs on the node unless overridden by devices. See the config settings below.
When useAllNodes is set to true, Rook attempts to make Ceph cluster management as hands-off as possible while still maintaining reasonable data safety. If a usable node comes online, Rook will begin to use it automatically. To maintain a balance between hands-off usability and data safety, nodes are removed from Ceph as OSD hosts only (1) if the node is deleted from Kubernetes itself or (2) if the node has its taints or affinities modified in such a way that the node is no longer usable by Rook. Any changes to taints or affinities, intentional or unintentional, may affect the data reliability of the Ceph cluster. In order to help protect against this somewhat, deletion of nodes by taint or affinity modifications must be "confirmed" by deleting the Rook Ceph operator pod and allowing the operator deployment to restart the pod.
For production clusters, we recommend that useAllNodes is set to false to prevent the Ceph cluster from suffering reduced data reliability unintentionally due to a user mistake. When useAllNodes is set to false, Rook relies on the user to be explicit about when nodes are added to or removed from the Ceph cluster. Nodes are only added to the Ceph cluster if the node is added to the Ceph cluster resource. Similarly, nodes are only removed if the node is removed from the Ceph cluster resource.
Nodes can be added and removed over time by updating the Cluster CRD, for example with kubectl -n rook-ceph edit cephcluster rook-ceph.
This will bring up your default text editor and allow you to add and remove storage nodes from the cluster.
This feature is only available when useAllNodes has been set to false.
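The explicit node list described above could look like the following sketch. The node names are hypothetical and must match the kubernetes.io/hostname label of real nodes; the cluster-level deviceFilter is an illustrative choice that each listed node inherits because it specifies no configuration of its own.

```yaml
storage:
  useAllNodes: false          # required when individual nodes are listed
  useAllDevices: false
  deviceFilter: "^sd[b-d]"    # cluster-level default inherited by the nodes below
  nodes:
    - name: "node-a"          # hypothetical node name
    - name: "node-b"          # hypothetical node name
    # Adding an entry adds that node's storage to the cluster;
    # removing an entry removes the node from the Ceph cluster.
```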
Below are the settings for a host-based cluster. This type of cluster can specify devices for OSDs, both at the cluster and individual node level, for selecting which storage resources will be included in the cluster.
useAllDevices
: true or false, indicating whether all devices found on nodes in the cluster should be automatically consumed by OSDs. Not recommended unless you have a very controlled environment where you will not risk formatting of devices with existing data. When true, all devices and partitions will be used. It is overridden by deviceFilter if specified. LVM logical volumes are not picked by useAllDevices.
deviceFilter
: A regular expression for short kernel names of devices (e.g. sda) that allows selection of devices and partitions to be consumed by OSDs. LVM logical volumes are not picked by deviceFilter. If individual devices have been specified for a node then this filter will be ignored. This field uses golang regular expression syntax. For example:
sdb
: Only selects the sdb device if found
^sd.
: Selects all devices starting with sd
^sd[a-d]
: Selects devices starting with sda, sdb, sdc, and sdd if found
^s
: Selects all devices that start with s
^[^r]
: Selects all devices that do not start with r
devicePathFilter
: A regular expression for device paths (e.g. /dev/disk/by-path/pci-0:1:2:3-scsi-1) that allows selection of devices and partitions to be consumed by OSDs. LVM logical volumes are not picked by devicePathFilter. If individual devices or deviceFilter have been specified for a node then this filter will be ignored. This field uses golang regular expression syntax. For example:
^/dev/sd.
: Selects all devices starting with sd
^/dev/disk/by-path/pci-.*
: Selects all devices which are connected to PCI bus
devices
: A list of individual device names belonging to this node to include in the storage cluster.
name
: The name of the devices and partitions (e.g., sda). The full udev path can also be specified for devices, partitions, and logical volumes (e.g. /dev/disk/by-id/ata-ST4000DM004-XXXX - this will not change after reboots).
config
: Device-specific config settings. See the config settings below.
A host-based cluster supports raw devices, partitions, logical volumes, encrypted devices, and multipath devices. Be sure to see the quickstart doc prerequisites for additional considerations.
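The three ways of selecting devices can be compared side by side in a sketch like the one below. Node names are hypothetical, the by-id path is the placeholder used in the text above, and each node uses exactly one selection method, since explicitly listed devices take precedence over deviceFilter, which in turn takes precedence over devicePathFilter.

```yaml
storage:
  useAllNodes: false
  useAllDevices: false
  nodes:
    - name: "node-a"                    # hypothetical node name
      # Explicit devices: any filter would be ignored for this node.
      devices:
        - name: "sdb"
        - name: "/dev/disk/by-id/ata-ST4000DM004-XXXX"
    - name: "node-b"                    # hypothetical node name
      # Short kernel names starting with sda, sdb, sdc, or sdd.
      deviceFilter: "^sd[a-d]"
    - name: "node-c"                    # hypothetical node name
      # Devices connected to the PCI bus, matched by device path.
      devicePathFilter: "^/dev/disk/by-path/pci-.*"
```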
Below are the settings for a PVC-based cluster.
storageClassDeviceSets
: Explained in Storage Class Device Sets
The following are the settings for Storage Class Device Sets which can be configured to create OSDs that are backed by block mode PVs.
name
: A name for the set.
count
: The number of devices in the set.
resources
: The CPU and RAM requests/limits for the devices. (Optional)
placement
: The placement criteria for the devices. (Optional) Default is no placement criteria.
The syntax is the same as for other placement configuration. It supports nodeAffinity, podAffinity, podAntiAffinity, and tolerations keys.
It is recommended to configure the placement such that the OSDs will be as evenly spread across nodes as possible. At a minimum, anti-affinity should be added so at least one OSD will be placed on each available node.
However, if there are more OSDs than nodes, this anti-affinity will not be effective. Another placement scheme to consider is to add labels to the nodes in such a way that the OSDs can be grouped on those nodes, create multiple storageClassDeviceSets, and add node affinity to each of the device sets that will place the OSDs in those sets of nodes.
Rook will automatically add required nodeAffinity to the OSD daemons to match the topology labels that are found on the nodes where the OSD prepare jobs ran. To ensure data durability, the OSDs are required to run in the same topology that the Ceph CRUSH map expects. For example, if the nodes are labeled with rack topology labels, the OSDs will be constrained to a certain rack. Without the topology labels, Rook will not constrain the OSDs beyond what is required by the PVs, for example to run in the zone where provisioned. See the OSD Topology section for the related labels.
preparePlacement
: The placement criteria for the preparation of the OSD devices. Creating OSDs is a two-step process and the prepare job may require different placement than the OSD daemons. If the preparePlacement is not specified, the placement will instead be applied for consistent placement for the OSD prepare jobs and OSD deployments. The preparePlacement is only useful for portable OSDs in the device sets. OSDs that are not portable will be tied to the host where the OSD prepare job initially runs.
portable
: If true, the OSDs will be allowed to move between nodes during failover. This requires a storage class that supports portability (e.g. aws-ebs, but not the local storage provisioner). If false, the OSDs will be assigned to a node permanently. Rook will configure Ceph's CRUSH map to support the portability.
tuneDeviceClass
: For example, Ceph cannot detect AWS volumes as HDDs from the storage class "gp2", so you can improve Ceph performance by setting this to true.
tuneFastDeviceClass
: For example, Ceph cannot detect Azure disks as SSDs from the storage class "managed-premium", so you can improve Ceph performance by setting this to true.
volumeClaimTemplates
: A list of PVC templates to use for provisioning the underlying storage devices.
metadata.name
: "data", "metadata", or "wal". If a single template is provided, the name must be "data". If the name is "metadata" or "wal", the devices are used to store the Ceph metadata or WAL respectively. In both cases, the devices must be raw devices or LVM logical volumes.
resources.requests.storage
: The desired capacity for the underlying storage devices.
storageClassName
: The StorageClass to provision PVCs from. Default would be to use the cluster-default StorageClass.
volumeMode
: The volume mode to be set for the PVC, which should be `Block`.
accessModes
: The access mode for the PVC to be bound by OSD.
schedulerName
: Scheduler name for OSD pod placement. (Optional)
encrypted
: whether to encrypt all the OSDs in a given storageClassDeviceSet
See the table in OSD Configuration Settings to know the allowed configurations.
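As an illustration of how the device set settings above fit together, here is a minimal sketch of a single device set. The set name `set1`, the count, the requested capacity, and the choice of the `gp2` StorageClass are assumptions made for the example, not recommendations.

```yaml
spec:
  storage:
    storageClassDeviceSets:
      - name: set1                # hypothetical set name
        count: 3                  # number of devices in the set
        portable: true            # needs a storage class that supports portability
        tuneDeviceClass: true     # assumption: gp2 volumes should be treated as HDDs
        encrypted: false
        volumeClaimTemplates:
          - metadata:
              name: data          # a single template must be named "data"
            spec:
              resources:
                requests:
                  storage: 10Gi   # example capacity
              storageClassName: gp2
              volumeMode: Block
              accessModes:
                - ReadWriteOnce
```

Placement and resource settings are left out of this sketch, so the defaults described above would apply.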
The following storage selection settings are specific to Ceph and do not apply to other backends. All variables are key-value pairs represented as strings.
metadataDevice
: Name of a device or lvm to use for the metadata of OSDs on each node. Performance can be improved by using a low latency device (such as SSD or NVMe) as the metadata device, while other spinning platter (HDD) devices on a node are used to store data. Provisioning will fail if the user specifies a `metadataDevice` but that device is not used as a metadata device by Ceph. Notably, `ceph-volume` will not use a device of the same device class (HDD, SSD, NVMe) as OSD devices for metadata, resulting in this failure.
databaseSizeMB
: The size in MB of a bluestore database. Include quotes around the size.
walSizeMB
: The size in MB of a bluestore write ahead log (WAL). Include quotes around the size.
deviceClass
: The CRUSH device class to use for this selection of storage devices. (By default, if a device's class has not already been set, OSDs will automatically set a device's class to either `hdd`, `ssd`, or `nvme` based on the hardware properties exposed by the Linux kernel.) These device classes can then be used to select the devices backing a storage pool by specifying them as the value of the pool spec's `deviceClass` field.
initialWeight
: The initial OSD weight in TiB units. By default, this value is derived from the OSD's capacity.
primaryAffinity
: The primary-affinity value of an OSD, within range `[0, 1]` (default: `1`).
osdsPerDevice
: The number of OSDs to create on each device. High performance devices such as NVMe can handle running multiple OSDs. If desired, this can be overridden for each node and each device.
encryptedDevice
: Encrypt OSD volumes using dmcrypt ("true" or "false"). By default this option is disabled. See encryption for more information on encryption in Ceph.
crushRoot
: The value of the `root` CRUSH map label. The default is `default`. Generally, you should not need to change this. However, if any of your topology labels may have the value `default`, you need to change `crushRoot` to avoid conflicts, since CRUSH map values need to be unique.

Allowed configurations are:
| block device type | host-based cluster | PVC-based cluster |
|---|---|---|
| disk | | |
| part | `encryptedDevice` must be `false` | `encrypted` must be `false` |
| lvm | `metadataDevice` must be `""`, `osdsPerDevice` must be `1`, and `encryptedDevice` must be `false` | `metadata.name` must not be `metadata` or `wal` and `encrypted` must be `false` |
| crypt | | |
| mpath | | |
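The storage selection settings can be sketched as follows. This fragment assumes the settings are given under a `config` key of the `storage` section; the device name `nvme0n1` and the sizes are hypothetical values for illustration only. All values are quoted because they are passed as strings.

```yaml
spec:
  storage:
    config:
      metadataDevice: "nvme0n1"   # hypothetical low latency device
      databaseSizeMB: "1024"      # quoted, as required above
      osdsPerDevice: "1"
      encryptedDevice: "false"
```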
Annotations and Labels can be specified so that the Rook components will have those annotations / labels added to them.
You can set annotations / labels for Rook components for the list of key value pairs:
all
: Set annotations / labels for all components except `clusterMetadata`.
mgr
: Set annotations / labels for MGRs
mon
: Set annotations / labels for mons
osd
: Set annotations / labels for OSDs
prepareosd
: Set annotations / labels for OSD Prepare Jobs
monitoring
: Set annotations / labels for service monitor
crashcollector
: Set annotations / labels for crash collectors
clusterMetadata
: Set annotations only to `rook-ceph-mon-endpoints` configmap and the `rook-ceph-mon` and `rook-ceph-admin-keyring` secrets. These annotations will not be merged with the `all` annotations. The common usage is for backing up these critical resources with `kubed`.

When other keys are set, `all` will be merged together with the specific component.

Placement configuration for the cluster services includes the following keys: `mgr`, `mon`, `arbiter`, `osd`, `prepareosd`, `cleanup`, and `all`.
Each service will have its placement configuration generated by merging the generic configuration under `all` with the most specific one (which will override any attributes).
In stretch clusters, if the `arbiter` placement is specified, that placement will only be applied to the arbiter.
The `arbiter` placement is also not merged with the `all` placement, which allows the arbiter to be fully independent of other daemon placement.
The remaining mons will still use the `mon` and/or `all` sections.
!!! note
Placement of OSD pods is controlled using the [Storage Class Device Set](#storage-class-device-sets), not the general `placement` configuration.
A Placement configuration is specified (according to the kubernetes PodSpec) as:
nodeAffinity
: kubernetes NodeAffinity
podAffinity
: kubernetes PodAffinity
podAntiAffinity
: kubernetes PodAntiAffinity
tolerations
: list of kubernetes Toleration
topologySpreadConstraints
: kubernetes TopologySpreadConstraints

If you use `labelSelector` for `osd` pods, you must write two rules, one for `rook-ceph-osd` and one for `rook-ceph-osd-prepare`, like the example configuration. This follows from the design in which each OSD has these two pods. For more detail, see the osd design doc and the related issue.
The Rook Ceph operator creates a Job called `rook-ceph-detect-version` to detect the full Ceph version used by the given `cephVersion.image`. The placement from the `mon` section is used for the Job except for the `PodAntiAffinity` field.
To control where various services will be scheduled by kubernetes, use the placement configuration sections below.
The example under `all` would have all services scheduled on kubernetes nodes labeled with `role=storage-node`.
Specific node affinity and tolerations that only apply to the `mon` daemons in this example require the label `role=storage-mon-node` and also tolerate the control plane taint.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.6
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  # enable the ceph dashboard for viewing cluster status
  dashboard:
    enabled: true
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - storage-node
    mon:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - storage-mon-node
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
Resources should be specified so that the Rook components are handled according to the Kubernetes Pod Quality of Service classes. This helps keep Rook components running when, for example, a node runs out of memory, because whether a component is killed depends on its Quality of Service class.
You can set resource requests/limits for Rook components through the Resource Requirements/Limits structure in the following keys:
mon
: Set resource requests/limits for mons
osd
: Set resource requests/limits for OSDs. This key applies to all OSDs regardless of their device classes. To apply resource requests/limits to OSDs with a particular device class, use the specific osd keys below. If the memory resource is declared, Rook will automatically set the OSD configuration `osd_memory_target` to the same value. This aims to ensure that the actual OSD memory consumption is consistent with the OSD pods' resource declaration.
osd-<deviceClass>
: Set resource requests/limits for OSDs on a specific device class. Rook will automatically detect `hdd`, `ssd`, or `nvme` device classes. Custom device classes can also be set.
mgr
: Set resource requests/limits for MGRs
mgr-sidecar
: Set resource requests/limits for the MGR sidecar, which is only created when `mgr.count: 2`. The sidecar requires very few resources since it only executes every 15 seconds to query Ceph for the active mgr and update the mgr services if the active mgr changed.
prepareosd
: Set resource requests/limits for OSD prepare job
crashcollector
: Set resource requests/limits for the crash collector. This pod runs wherever there is a Ceph pod running. It scrapes for Ceph daemon core dumps and sends them to the Ceph manager crash module so that core dumps are centralized and can be easily listed/accessed. You can read more about the Ceph Crash module.
logcollector
: Set resource requests/limits for the log collector. When enabled, this container runs as a sidecar to each Ceph daemon.
cleanup
: Set resource requests/limits for cleanup job, responsible for wiping cluster's data after uninstall
exporter
: Set resource requests/limits for Ceph exporter.

In order to provide the best possible experience running Ceph in containers, Rook internally recommends minimum memory limits if resource limits are passed. If a user configures a limit or request value that is too low, Rook will still run the pod(s) and print a warning to the operator log.
mon
: 1024MB
mgr
: 512MB
osd
: 2048MB
crashcollector
: 60MB
mgr-sidecar
: 100MB limit, 40MB requests
prepareosd
: no limits (see the note)
exporter
: 128MB limit, 50MB requests

!!! note
We recommend not setting memory limits on the OSD prepare job to prevent OSD provisioning failure due to memory constraints.
The OSD prepare job bursts memory usage during the OSD provisioning depending on the size of the device, typically
1-2Gi for large disks. The OSD prepare job only bursts a single time per OSD.
All future runs of the OSD prepare job will detect the OSD is already provisioned and skip the provisioning.
!!! hint
The resources for MDS daemons are not configured in the Cluster. Refer to the [Ceph Filesystem CRD](../Shared-Filesystem/ceph-filesystem-crd.md) instead.
For more information on resource requests/limits see the official Kubernetes documentation: Kubernetes - Managing Compute Resources for Containers
requests
: Requests for cpu or memory.
cpu
: Request for CPU (example: one CPU core `1`, 50% of one CPU core `500m`).
memory
: Request for Memory (example: one gigabyte of memory `1Gi`, half a gigabyte of memory `512Mi`).
limits
: Limits for cpu or memory.
cpu
: Limit for CPU (example: one CPU core `1`, 50% of one CPU core `500m`).
memory
: Limit for Memory (example: one gigabyte of memory `1Gi`, half a gigabyte of memory `512Mi`).

!!! warning
Before setting resource requests/limits, please take a look at the Ceph documentation for recommendations for each component: [Ceph - Hardware Recommendations](http://docs.ceph.com/docs/master/start/hardware-recommendations/).
This example shows that you can override these requests/limits for OSDs per node when using `useAllNodes: false` in the `node` item in the `nodes` list.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.6
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  storage:
    useAllNodes: false
    nodes:
    - name: "172.17.4.201"
      resources:
        limits:
          cpu: "2"
          memory: "4096Mi"
        requests:
          cpu: "2"
          memory: "4096Mi"
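The cluster-wide component keys listed earlier can be sketched in a similar way. This fragment assumes the keys are placed under `spec.resources`; the CPU and memory values are illustrative only and were simply chosen to sit at or above the recommended minimums.

```yaml
spec:
  resources:
    mon:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        memory: "1Gi"
    mgr:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        memory: "512Mi"
    osd:
      requests:
        cpu: "1"
        memory: "4Gi"       # Rook sets osd_memory_target from the declared memory
      limits:
        memory: "4Gi"
    # prepareosd is intentionally omitted: no memory limit is recommended for it
```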
Priority class names can be specified so that the Rook components will have those priority class names added to them.
You can set priority class names for Rook components for the list of key value pairs:
all
: Set priority class names for MGRs, Mons, OSDs, and crashcollectors.
mgr
: Set priority class names for MGRs. Examples default to system-cluster-critical.
mon
: Set priority class names for Mons. Examples default to system-node-critical.
osd
: Set priority class names for OSDs. Examples default to system-node-critical.
crashcollector
: Set priority class names for crashcollectors.

The specific component keys will act as overrides to `all`.
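A minimal sketch of the override behavior, assuming the keys are set under `spec.priorityClassNames`. The two class names are the built-in Kubernetes priority classes mentioned above; which class to use for which component is a choice for the cluster administrator.

```yaml
spec:
  priorityClassNames:
    all: system-node-critical      # applies to every component without its own key
    mgr: system-cluster-critical   # overrides "all" for the MGRs only
```

In this sketch the mons, OSDs, and crashcollectors would receive `system-node-critical`, while the MGRs would receive `system-cluster-critical`.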
The Rook Ceph operator will monitor the state of the CephCluster on various components by default. The following CRD settings are available:
healthCheck
: main ceph cluster health monitoring section

Currently three health checks are implemented:
mon
: health check on the ceph monitors, which checks whether monitors are members of the quorum. If a given monitor has not rejoined the quorum after a certain timeout, it will be failed over and replaced by a new monitor.
osd
: health check on the ceph osds
status
: ceph health status check, which periodically checks the Ceph health state and reflects it in the CephCluster CR status field.

The liveness probe and startup probe of each daemon can also be controlled via `livenessProbe` and `startupProbe` respectively. The settings are valid for `mon`, `mgr` and `osd`.
Here is a complete example for `daemonHealth`, `livenessProbe`, and `startupProbe`:
healthCheck:
  daemonHealth:
    mon:
      disabled: false
      interval: 45s
      timeout: 600s
    osd:
      disabled: false
      interval: 60s
    status:
      disabled: false
  livenessProbe:
    mon:
      disabled: false
    mgr:
      disabled: false
    osd:
      disabled: false
  startupProbe:
    mon:
      disabled: false
    mgr:
      disabled: false
    osd:
      disabled: false
The probe's timing values and thresholds (but not the probe itself) can also be overridden. For more info, refer to the Kubernetes documentation.
For example, you could change the `mgr` probe by applying:
healthCheck:
  startupProbe:
    mgr:
      disabled: false
      probe:
        initialDelaySeconds: 3
        periodSeconds: 3
        failureThreshold: 30
  livenessProbe:
    mgr:
      disabled: false
      probe:
        initialDelaySeconds: 3
        periodSeconds: 3
Changing the liveness probe is an advanced operation and should rarely be necessary. If you do need to change it, override only the specific settings you want to modify.
The operator regularly configures and checks the health of the cluster. The results of the configuration and health checks can be seen in the `status` section of the CephCluster CR.
kubectl -n rook-ceph get CephCluster -o yaml
[...]
status:
  ceph:
    health: HEALTH_OK
    lastChecked: "2021-03-02T21:22:11Z"
    capacity:
      bytesAvailable: 22530293760
      bytesTotal: 25757220864
      bytesUsed: 3226927104
      lastUpdated: "2021-03-02T21:22:11Z"
  message: Cluster created successfully
  phase: Ready
  state: Created
  storage:
    deviceClasses:
    - name: hdd
  version:
    image: quay.io/ceph/ceph:v17.2.6
    version: 16.2.6-0
  conditions:
  - lastHeartbeatTime: "2021-03-02T21:22:11Z"
    lastTransitionTime: "2021-03-02T21:21:09Z"
    message: Cluster created successfully
    reason: ClusterCreated
    status: "True"
    type: Ready
Ceph is constantly monitoring the health of the data plane and reporting back if there are any warnings or errors. If everything is healthy from Ceph's perspective, you will see `HEALTH_OK`.
If Ceph reports any warnings or errors, the details will be printed to the status. If further troubleshooting is needed to resolve these issues, the toolbox will likely be needed where you can run `ceph` commands to find more details.
The `capacity` of the cluster is reported, including bytes available, total, and used. The available space will be less than you may expect due to overhead in the OSDs.
The `conditions` represent the status of the Rook operator.

If the cluster is fully configured, the `Ready` condition is raised with `ClusterCreated` reason and no other conditions. The cluster will remain in the `Ready` condition after the first successful configuration since it is expected the storage is consumable from this point on. If there are issues preventing the storage layer from working, they are expected to show as Ceph health errors.

If the cluster is connected to an external cluster, the `Ready` condition will have the reason `ClusterConnected`.

If the operator is still configuring the cluster, there will be a `Progressing` condition.

If there was a failure, the condition status will be `false` and the `message` will give a summary of the error. See the operator log for more details.

There are several other properties for the overall status including:
message, phase, and state
: A summary of the overall current state of the cluster, which is somewhat duplicated from the conditions for backward compatibility.
storage.deviceClasses
: The names of the types of storage devices that Ceph discovered in the cluster. These types will be `ssd` or `hdd` unless they have been overridden with the `crushDeviceClass` in the `storageClassDeviceSets`.
version
: The version of the Ceph image currently deployed.

The topology of the cluster is important in production environments where you want your data spread across failure domains. The topology can be controlled by adding labels to the nodes. When the labels are found on a node at first OSD deployment, Rook will add them to the desired level in the CRUSH map.
The complete list of labels in hierarchy order from highest to lowest is:
topology.kubernetes.io/region
topology.kubernetes.io/zone
topology.rook.io/datacenter
topology.rook.io/room
topology.rook.io/pod
topology.rook.io/pdu
topology.rook.io/row
topology.rook.io/rack
topology.rook.io/chassis
For example, if the following labels were added to a node:
kubectl label node mynode topology.kubernetes.io/zone=zone1
kubectl label node mynode topology.rook.io/rack=zone1-rack1
These labels would result in the following hierarchy for OSDs on that node (this command can be run in the Rook toolbox):
$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                     STATUS  REWEIGHT  PRI-AFF
 -1         0.01358  root default
 -5         0.01358      zone zone1
 -4         0.01358          rack zone1-rack1
 -3         0.01358              host mynode
  0    hdd  0.00679                  osd.0             up   1.00000  1.00000
  1    hdd  0.00679                  osd.1             up   1.00000  1.00000
Ceph requires unique names at every level in the hierarchy (CRUSH map). For example, you cannot have two racks with the same name that are in different zones. Racks in different zones must be named uniquely.
Note that the `host` is added automatically to the hierarchy by Rook. The host cannot be specified with a topology label.
All topology labels are optional.
!!! hint
When setting the node labels prior to `CephCluster` creation, these settings take immediate effect. However, applying this to an already deployed `CephCluster` requires removing each node from the cluster first and then re-adding it with new configuration to take effect. Do this node by node to keep your data safe! Check the result with `ceph osd tree` from the [Rook Toolbox](../../Troubleshooting/ceph-toolbox.md). The OSD tree should display the hierarchy for the nodes that already have been re-added.
To utilize the `failureDomain` based on the node labels, specify the corresponding option in the CephBlockPool:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: rack # this matches the topology labels on nodes
  replicated:
    size: 3
This configuration will split the replication of volumes across unique racks in the data center setup.
During deletion of a CephCluster resource, Rook protects against accidental or premature destruction of user data by blocking deletion if there are any other Rook Ceph Custom Resources that reference the CephCluster being deleted. Rook will warn about which other resources are blocking deletion in three ways until all blocking resources are deleted:
Rook has the ability to clean up resources and data that were deployed when a CephCluster is removed.
The policy settings indicate which data should be forcibly deleted and in what way the data should be wiped.
The `cleanupPolicy` has several fields:
confirmation
: Only an empty string and `yes-really-destroy-data` are valid values for this field. If this setting is empty, the `cleanupPolicy` settings will be ignored and Rook will not clean up any resources during cluster removal. To reinstall the cluster, the admin would then be required to follow the cleanup guide to delete the data on hosts. If this setting is `yes-really-destroy-data`, the operator will automatically delete the data on hosts. Because this cleanup policy is destructive, after the confirmation is set to `yes-really-destroy-data` Rook will stop configuring the cluster as if the cluster is about to be destroyed.
sanitizeDisks
: sanitizeDisks represents advanced settings that can be used to delete data on drives.
method
: indicates if the entire disk should be sanitized or simply ceph's metadata. Possible choices are `quick` (default) or `complete`.
dataSource
: indicates where to get random bytes from to write on the disk. Possible choices are `zero` (default) or `random`. Using random sources will consume entropy from the system and will take much more time than the zero source.
iteration
: overwrite N times instead of the default (1). Takes an integer value.
allowUninstallWithVolumes
: If set to true, then the cephCluster deletion doesn't wait for the PVCs to be deleted. Default is `false`.

To automate activation of the cleanup, you can use the following command. WARNING: DATA WILL BE PERMANENTLY DELETED:
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"cleanupPolicy":{"confirmation":"yes-really-destroy-data"}}}'
Nothing will happen until the deletion of the CR is requested, so this can still be reverted. However, all new configuration by the operator will be blocked with this cleanup policy enabled.
Rook waits for the deletion of PVs provisioned using the cephCluster before proceeding to delete the cephCluster. To force deletion of the cephCluster without waiting for the PVs to be deleted, you can set `allowUninstallWithVolumes` to true under `spec.cleanupPolicy`.
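Put together, the cleanup policy fields described above might look like the following sketch. The values shown are the documented defaults, with `confirmation` left empty so that nothing would be wiped; the layout under `spec.cleanupPolicy` follows the field names listed above.

```yaml
spec:
  cleanupPolicy:
    # Leave empty until the cluster is really meant to be destroyed;
    # the only other valid value is "yes-really-destroy-data".
    confirmation: ""
    sanitizeDisks:
      method: quick        # or "complete" to sanitize the entire disk
      dataSource: zero     # or "random" (slower, consumes system entropy)
      iteration: 1         # number of overwrite passes
    allowUninstallWithVolumes: false
```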
!!! attention
This feature is experimental.
The Ceph config options are applied after the MONs are all in quorum and running.
To set Ceph config options, you can add them to your `CephCluster` spec as shown below.
See the Ceph config reference for detailed information about how to configure Ceph.
spec:
  # [...]
  cephConfig:
    # Who's the target for these config options?
    global:
      # All values must be quoted so they are considered a string in YAML
      osd_pool_default_size: "3"
      mon_warn_on_pool_no_redundancy: "false"
      osd_crush_update_on_start: "false"
    # Make sure to quote special characters
    "osd.*":
      osd_max_scrubs: "10"
!!! warning
Rook performs no direct validation on these config options, so the validity of the settings is the user's responsibility.
The operator does not unset any removed config options. It is the user's responsibility to unset each removed option, or set it back to its default value, manually using the Ceph CLI.
The CSI driver options mentioned here are applied per Ceph cluster. The following options are available:
readAffinity
: RBD and CephFS volumes allow serving reads from an OSD in proximity to the client. Refer to the read affinity section in the Ceph CSI Drivers for more details.
enabled
: Whether to enable read affinity for the CSI driver. Default is `false`.
crushLocationLabels
: Node labels to use as CRUSH location, corresponding to the values set in the CRUSH map. Defaults to the labels mentioned in the OSD topology topic.
cephfs
:
kernelMountOptions
: Mount options for kernel mounter. Refer to the kernel mount options for more details.
fuseMountOptions
: Mount options for fuse mounter. Refer to the fuse mount options for more details.
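
As a rough sketch, the CSI driver options above could be combined as follows. This assumes the options are set under a `csi` key of the CephCluster spec; the zone label and both mount option values are placeholders chosen for illustration and should be checked against the mount option references before use.

```yaml
spec:
  csi:
    readAffinity:
      enabled: true
      crushLocationLabels:
        - topology.kubernetes.io/zone   # one of the OSD topology labels
    cephfs:
      kernelMountOptions: ms_mode=secure   # placeholder kernel mount option
      fuseMountOptions: debug              # placeholder fuse mount option
```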