Target version: 1.5
In a production environment, it is generally expected to have three failure domains across which three replicas of the data are stored. If any single failure domain goes down, the data is still available for reads and writes in the other two failure domains.
For environments that only have two failure domains available for data replication, we also need to support the case where one failure domain is lost and the data remains fully available in the remaining failure domain.
To support this scenario, Ceph has integrated support for stretch clusters with an arbiter mon, as described in the upstream Ceph PR and design docs.
Rook will enable the stretch clusters when requested by the admin through the usual Rook CRDs.
To enable the stretch cluster based on the Ceph architecture:
- The arbiter zone will commonly contain just a single node that is also a K8s master node, although the arbiter zone may certainly contain more nodes.
- The type of failure domain used for stretch clusters is commonly "zone", but can be set to a different failure domain.
Distributed systems are impacted by the network latency between critical components. In a stretch cluster, the critical latency is between the etcd servers backing Kubernetes. K8s only supports latency of up to 5ms (10ms round trip), which is a stricter requirement than that of any Ceph component. Ceph mons can handle higher latency and are designed for up to 700ms round trip.
The topology of the K8s cluster is to be determined by the admin, outside the scope of Rook. Rook will simply detect the topology labels that have been added to the nodes.
If the desired failure domain is a "zone", the `topology.kubernetes.io/zone` label should be added to the nodes. Any of the topology labels supported by OSDs can be used.
In the minimum configuration, two nodes in each data zone would be labeled, while one node in the arbiter zone is required.
For example:
```
topology.kubernetes.io/zone=a
topology.kubernetes.io/zone=a
topology.kubernetes.io/zone=b
topology.kubernetes.io/zone=b
topology.kubernetes.io/zone=arbiter
```
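For illustration, these labels could be applied with `kubectl`. The node names below are hypothetical placeholders and should be replaced with the actual nodes in each zone.

```console
# Hypothetical node names; label two nodes in each data zone and one node in the arbiter zone
$ kubectl label node node-a1 topology.kubernetes.io/zone=a
$ kubectl label node node-a2 topology.kubernetes.io/zone=a
$ kubectl label node node-b1 topology.kubernetes.io/zone=b
$ kubectl label node node-b2 topology.kubernetes.io/zone=b
$ kubectl label node master-0 topology.kubernetes.io/zone=arbiter
```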
The core changes to Rook are to associate the mons with the required zones. The new configuration is found under the `stretchCluster` setting, where the three zones must be listed and the arbiter zone identified.
```yaml
mon:
  count: 5
  allowMultiplePerNode: false
  stretchCluster:
    # The cluster is most commonly stretched over zones, but could also be stretched over
    # another failure domain such as datacenter or region. Must be one of the labels
    # used by OSDs as documented at https://rook.io/docs/rook/latest/ceph-cluster-crd.html#osd-topology.
    failureDomainLabel: topology.kubernetes.io/zone
    zones:
    # There must be exactly three zones in this list, with one of them being the arbiter
    - name: arbiter
      arbiter: true
    - name: a
    - name: b
```
The operator will track which mons belong to which zone. The zone assignment for each mon will be stored in the `rook-ceph-mon-endpoints` configmap in the same data structure where the host assignment is stored for mons with node affinity.
For example, if the zones are called "arbiter", "zone1", and "zone2", the configmap would contain:
```yaml
data:
  data: a=10.99.109.200:6789,b=10.98.18.147:6789,c=10.96.86.248:6789,d=10.96.86.249:6789,e=10.96.86.250:6789
  mapping: '{"node":{"a":"arbiter","b":"zone1","c":"zone1","d":"zone2","e":"zone2"}}'
  maxMonId: "4"
```
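As a sketch of how the assignments could be inspected on a running cluster (assuming the default `rook-ceph` namespace), the mapping can be read back directly from the configmap:

```console
# Assumes the operator namespace is rook-ceph
$ kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o jsonpath='{.data.mapping}'
{"node":{"a":"arbiter","b":"zone1","c":"zone1","d":"zone2","e":"zone2"}}
```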
If no zones are listed under `stretchCluster.zones`, the cluster is not considered a stretch cluster. If anything other than three zones is listed, it is an error and the cluster will not be configured.
Mon failover will replace a mon in the same zone where the previous mon failed.
The mons can all be backed by a host path or a PVC, similar to any other Rook cluster.
In this example, all the mons will be backed by the same PVC:
```yaml
mon:
  count: 5
  allowMultiplePerNode: false
  stretchCluster:
    zones:
    - name: arbiter
      arbiter: true
    - name: a
    - name: b
  volumeClaimTemplate:
    spec:
      storageClassName: gp2
      resources:
        requests:
          storage: 10Gi
```
It is possible that the mons will need to be backed by a different type of storage in different zones. In that case, a `volumeClaimTemplate` specified under a zone will override the default storage settings for the mons in that zone.
For example, the arbiter could specify a different backing store in this manner:
```yaml
mon:
  count: 5
  allowMultiplePerNode: false
  stretchCluster:
    zones:
    - name: arbiter
      arbiter: true
      volumeClaimTemplate:
        spec:
          storageClassName: alternative-storage
          resources:
            requests:
              storage: 10Gi
    - name: a
    - name: b
  volumeClaimTemplate:
    spec:
      storageClassName: gp2
      resources:
        requests:
          storage: 10Gi
```
Question: Is there a need for a `dataDirHostPath` to be specified for the mons in one zone when the other mons are using PVCs? For now, we assume all the mons are either using a PVC or a host path, but not a mix.
Rook will set the mon election strategy to the new "connectivity" algorithm.
```
mon election default strategy: 3
```
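One way this setting could be applied (an assumption about the mechanism rather than a statement of how the operator will do it) is through Ceph's centralized config store:

```console
# Set the mon election strategy to "connectivity" (value 3) for all mons
$ ceph config set mon mon_election_default_strategy 3
```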
The mons will be configured so that Ceph associates each mon with the correct failure domain. Rook will set each mon's location to its zone name:

```console
$ ceph mon set_location <mon> <zone>
```
The stretch mode will be enabled with the command:
```console
$ ceph mon enable_stretch_mode tiebreaker_mon <mon> new_crush_rule <rule> dividing_bucket zone
```
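Continuing the configmap example above, where mon `a` is the arbiter and mons `b` through `e` are split across `zone1` and `zone2`, the sequence could look like the following sketch. It follows the command forms shown above, and the CRUSH rule name `stretch_rule` is only an assumed placeholder.

```console
# Associate each mon with its zone (names taken from the earlier configmap example)
$ ceph mon set_location a arbiter
$ ceph mon set_location b zone1
$ ceph mon set_location c zone1
$ ceph mon set_location d zone2
$ ceph mon set_location e zone2
# Enable stretch mode with mon "a" as the tiebreaker; "stretch_rule" is a placeholder rule name
$ ceph mon enable_stretch_mode tiebreaker_mon a new_crush_rule stretch_rule dividing_bucket zone
```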
For data protection in the stretch cluster, all pools should be created with the following configuration, whether they are pools for an RBD storage class (CephBlockPool), a shared filesystem (CephFilesystem), or an object store (CephObjectStore).
Two replicas of the data are stored in each zone using a special CRUSH rule, specified by `replicasPerFailureDomain: 2`:
```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: stretchedreplica
  namespace: rook-ceph
spec:
  failureDomain: zone
  replicated:
    size: 4
    replicasPerFailureDomain: 2
```
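As a hedged verification sketch, once such a pool has been created, its replica count and the CRUSH rule assigned to it can be inspected with standard Ceph commands:

```console
# Confirm the pool's replica count and which CRUSH rule it uses
$ ceph osd pool get stretchedreplica size
$ ceph osd pool get stretchedreplica crush_rule
# Dump all CRUSH rules to review the generated stretch rule
$ ceph osd crush rule dump
```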
See this issue for more details on implementation of the CRUSH rule.
Erasure coded pools are not currently supported in Ceph stretch clusters.