
Rook StorageClassDeviceSets

Target version: Rook 1.1

Background

The primary motivation for this feature is to take advantage of the mobility of storage in cloud-based environments by defining storage based on sets of devices consumed as block-mode PVCs. In environments like AWS you can request remote storage that can be dynamically provisioned and attached to any node in a given Availability Zone (AZ). The design can also accommodate non-cloud environments through the use of local PVs.

StorageClassDeviceSet struct

type StorageClassDeviceSet struct {
    Name                 string                     // A unique identifier for the set
    Count                int                        // Number of devices in this set
    Resources            v1.ResourceRequirements    // Requests/limits for the devices
    Placement            rook.Placement             // Placement constraints for the devices
    Config               map[string]string          // Provider-specific device configuration
    VolumeClaimTemplates []v1.PersistentVolumeClaim // List of PVC templates for the underlying storage devices
}

A provider will be able to use the StorageClassDeviceSet struct to describe the properties of a particular set of StorageClassDevices. In this design, the notion of a "StorageClassDevice" is an abstract concept, separate from underlying storage devices. There are three main aspects to this abstraction:

  1. A StorageClassDevice is both storage and the software required to make the storage available and manage it in the cluster. As such, the struct takes into account the resources required to run the associated software, if any.
  2. A single StorageClassDevice could be composed of multiple underlying storage devices, specified by having more than one item in the volumeClaimTemplates field.
  3. Since any storage devices that are part of a StorageClassDevice will be represented by block-mode PVCs, they will need to be associated with a Pod so that they can be attached to cluster nodes.

A StorageClassDeviceSet will have the following fields associated with it:

  • name: A name for the set. [required]
  • count: The number of devices in the set. [required]
  • resources: The CPU and RAM requests/limits for the devices. Default is no resource requests.
  • placement: The placement criteria for the devices. Default is no placement criteria.
  • config: Granular device configuration. This is a generic map[string]string to allow for provider-specific configuration.
  • volumeClaimTemplates: A list of PVC templates to use for provisioning the underlying storage devices.

An entry in volumeClaimTemplates should specify the following fields (a minimal device set entry is sketched after this list):

  • resources.requests.storage: The desired capacity for the underlying storage devices. [required]
  • storageClassName: The StorageClass to provision PVCs from. Default would be to use the cluster-default StorageClass.
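
Putting the fields together, a minimal device set entry might look like the following sketch. The set name, capacity, and StorageClass name are illustrative, and storageClassName could be omitted to fall back to the cluster-default StorageClass.

storageClassDeviceSets:
- name: example-set        # required, unique within the cluster
  count: 3                 # required, number of devices to provision
  volumeClaimTemplates:
  - spec:
      resources:
        requests:
          storage: 1Ti     # desired capacity per underlying device
      storageClassName: fast-ebs  # illustrative; defaults to the cluster-default StorageClass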

Example Workflow: rook-ceph OSDs

The CephCluster CRD could be extended to include a new field: spec.storage.storageClassDeviceSets, which would be a list of one or more StorageClassDeviceSets. If elements exist in this list, the CephCluster controller would then create enough PVCs to match each StorageClassDeviceSet's Count field, attach them to individual OsdPrepare Jobs, and then attach them to OSD Pods once the Jobs have completed. For the initial implementation, only one entry in volumeClaimTemplates would be supported, if only to tighten the scope for an MVP.

The PVCs would be provisioned against a configured or default StorageClass. It is recommended that the admin set up a StorageClass with volumeBindingMode: WaitForFirstConsumer.
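
As a rough sketch, assuming the in-tree AWS EBS provisioner and the gp2-ebs name used by the examples later in this document, such a StorageClass might look like:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-ebs
provisioner: kubernetes.io/aws-ebs       # in-tree AWS EBS provisioner
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer  # delay binding until a consuming Pod is scheduled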

If the admin wishes to control device placement, it will be up to them to label the desired nodes appropriately so that the Kubernetes scheduler can distribute the OSD Pods according to the placement criteria.

In keeping with current Rook-Ceph patterns, the resources and placement for the OSDs specified in the StorageClassDeviceSet would override any cluster-wide configurations for OSDs. Additionally, other conflicting configuration parameters in the CephCluster CRD, such as useAllDevices, will be ignored by device sets.

OSD Deployment Behavior

While the current strategy of deploying OSD Pods as individual Kubernetes Deployments would be kept, some of the deployment logic would need to change. The workflow would look something like this (a sketch of a generated PVC follows the list):

  1. Get all matching OSD PVCs
  2. Create any missing OSD PVCs
  3. Get all matching OSD Deployments
  4. Check that all OSD Deployments are using valid OSD PVCs
    • If not, probably remove the OSD Deployment?
    • Remove any PVCs used by OSD Deployments from the list of PVCs to be worked on
  5. Run an OsdPrepare Job on all unused and uninitialized PVCs
    • This would be one Job per PVC
  6. Create an OSD Deployment for each unused but initialized PVC
    • Deploy the OSD with ceph-volume if available.
      • If the PV is not backed by an LV, create an LV on the PV.
      • If the PV is backed by an LV, use the PV as is.
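
For illustration, a PVC the controller might generate in step 2 from the volumeClaimTemplates entry of Example 1 below could look like the following sketch; the naming scheme and the label used to tie the PVC back to its device set are hypothetical.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cluster1-set1-0-data               # hypothetical per-device name
  labels:
    ceph.rook.io/DeviceSet: cluster1-set1  # hypothetical label for matching PVCs to their set
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Block                        # the OSD consumes the device as a raw block device
  resources:
    requests:
      storage: 5Ti
  storageClassName: gp2-ebs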

Additional considerations for local storage

This design can also be applied to non-cloud environments. To take advantage of this, the admin should deploy the sig-storage-local-static-provisioner to create local PVs for the desired local storage devices and then follow the recommended directions for local PVs.
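
For reference, a block-mode local PV of the kind the provisioner creates might look like the following sketch; the device path, node name, capacity, and StorageClass name are illustrative.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node1-sdb
spec:
  capacity:
    storage: 5Ti
  accessModes:
  - ReadWriteOnce
  volumeMode: Block                      # expose the device as a raw block device
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cluster1-local-storage
  local:
    path: /dev/sdb                       # the local device on the node
  nodeAffinity:                          # pin the PV to the node that owns the device
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node1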

Possible Implementation: DriveGroups

Creating real-world OSDs is complex:

  • Some configurations deploy multiple OSDs on a single drive
  • Some configurations use more than one drive for a single OSD.
  • Deployments often look similar on multiple hosts.
  • There are some advanced configurations possible, like encrypted drives.

All of these setups are valid real-world configurations that need to be supported.

The Ceph project defines a data structure that allows defining groups of drives to be provisioned in a specific way by ceph-volume: Drive Groups. Drive Groups were originally designed to be ephemeral, but it turns out that orchestrators like DeepSea store them permanently in order to have a source of truth when (re-)provisioning OSDs. Drive Groups were also originally designed to be host-specific, but the concept of hosts is not actually required for the Drive Group data structure to make sense, as a Drive Group only selects a subset of the available drives.

DeepSea documents some example Drive Groups. A complete specification can be found in the Ceph documentation.

DeviceSet vs DriveGroup

A DeviceSet to provision 8 OSDs on 8 drives could look like:

name: "my_device_set"
count: 8

The equivalent Drive Group would look like this:

host_pattern: "hostname1"
data_devices:
  count: 8

A Drive Group with 8 OSDs using a shared fast drive could look similar to this:

host_pattern: "hostname1"
data_devices:
  count: 8
  model: MC-55-44-XZ
db_devices:
  model: NVME
db_slots: 8

ResourceRequirements and Placement

Drive Groups don't yet provide orchestrator-specific extensions, like resource requirements or placement specs, but those could be added trivially. A name could also be added to Drive Groups.
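
To illustrate, an extended Drive Group might look like the following sketch; the name, resources, and placement fields are hypothetical additions and not part of the current Drive Group specification.

name: "my_drive_group"   # hypothetical addition
resources:               # hypothetical addition
  requests:
    cpu: 2
    memory: 4Gi
placement: {}            # hypothetical addition, placement spec elided
data_devices:
  count: 8
  model: MC-55-44-XZ
db_devices:
  model: NVME
db_slots: 8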

OSD Configuration Examples

Given the complexity of this design, here are a few examples to showcase possible configurations for OSD StorageClassDeviceSets.

Example 1: AWS cross-AZ

type: CephCluster
name: cluster1
...
spec:
  ...
  storage:
    ...
    storageClassDeviceSets:
    - name: cluster1-set1
      count: 3
      resources:
        requests:
          cpu: 2
          memory: 4Gi
      placement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: "rook.io/cluster"
                  operator: In
                  values:
                    - cluster1
              topologyKey: "failure-domain.beta.kubernetes.io/zone"
      volumeClaimTemplates:
      - spec:
          resources:
            requests:
              storage: 5Ti
          storageClassName: gp2-ebs

In this example, podAntiAffinity is used to spread the OSD Pods across as many AZs as possible. In addition, all OSD Pods would have to be given the label rook.io/cluster=cluster1 to denote that they belong to this cluster (sketched below), so that the scheduler will try not to schedule multiple Pods with that label in the same AZ when possible. The CPU and memory requests allow the scheduler to determine whether a given node can support running an OSD process.

It should be noted that, in the case where the only nodes that can run a new OSD Pod already have OSD Pods on them, one of those nodes would still be selected, since the anti-affinity is only preferred. In addition, EBS volumes cannot move between AZs once created, so a given Pod is guaranteed to always be limited to the AZ it was created in.
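
As a sketch, the labels the operator would need to apply in each OSD Deployment's pod template for the labelSelector above to match might look like the following excerpt; the app label is an assumption about existing operator behavior.

template:
  metadata:
    labels:
      app: rook-ceph-osd        # assumed existing OSD Pod label
      rook.io/cluster: cluster1 # label matched by the podAffinityTerm labelSelector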

Example 2: Single AZ

type: CephCluster
name: cluster1
...
spec:
  ...
  resources:
    osd:
      requests:
        cpu: 2
        memory: 4Gi
  storage:
    ...
    storageClassDeviceSets:
    - name: cluster1-set1
      count: 3
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "failure-domain.beta.kubernetes.io/zone"
                operator: In
                values:
                - us-west-1a
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: "rook.io/cluster"
                  operator: In
                  values:
                    - cluster1
      volumeClaimTemplates:
      - spec:
          resources:
            requests:
              storage: 5Ti
          storageClassName: gp2-ebs
    - name: cluster1-set2
      count: 3
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "failure-domain.beta.kubernetes.io/zone"
                operator: In
                values:
                - us-west-1b
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: "rook.io/cluster"
                  operator: In
                  values:
                    - cluster1
      volumeClaimTemplates:
      - spec:
          resources:
            requests:
              storage: 5Ti
          storageClassName: gp2-ebs
    - name: cluster1-set3
      count: 3
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "failure-domain.beta.kubernetes.io/zone"
                operator: In
                values:
                - us-west-1c
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: "rook.io/cluster"
                  operator: In
                  values:
                    - cluster1
      volumeClaimTemplates:
      - spec:
          resources:
            requests:
              storage: 5Ti
          storageClassName: gp2-ebs

In this example, we've added a nodeAffinity to the placement that restricts all OSD Pods to a specific AZ. This case is only really useful if you specify multiple StorageClassDeviceSets for different AZs, so that has been done here. We also specify a top-level resources definition, since we want that to be the same for all OSDs in the device sets.

Example 3: Different resource needs

type: CephCluster
name: cluster1
...
spec:
  ...
  storage:
    ...
    storageClassDeviceSets:
    - name: cluster1-set1
      count: 3
      resources:
        requests:
          cpu: 2
          memory: 4Gi
      placement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: "rook.io/cluster"
                  operator: In
                  values:
                    - cluster1
              topologyKey: "failure-domain.beta.kubernetes.io/zone"
      volumeClaimTemplates:
      - spec:
          resources:
            requests:
              storage: 5Ti
          storageClassName: gp2-ebs
    - name: cluster1-set2
      count: 3
      resources:
        requests:
          cpu: 2
          memory: 8Gi
      placement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: "rook.io/cluster"
                  operator: In
                  values:
                    - cluster1
              topologyKey: "failure-domain.beta.kubernetes.io/zone"
      volumeClaimTemplates:
      - spec:
          resources:
            requests:
              storage: 10Ti
          storageClassName: gp2-ebs

In this example, we have two StorageClassDeviceSets with different capacities for the devices in each set. The devices with larger capacities would require a greater amount of memory to operate the OSD Pods, so that is reflected in the resources field.

Example 4: Simple local storage

type: CephCluster
name: cluster1
...
spec:
  ...
  storage:
    ...
    storageClassDeviceSets:
    - name: cluster1-set1
      count: 3
      volumeClaimTemplates:
      - spec:
          resources:
            requests:
              storage: 1
          storageClassName: cluster1-local-storage

In this example, we expect there to be nodes that are each configured with one local storage device, but they would not be specified in the nodes list. Prior to this, the admin would have had to deploy the local-storage-provisioner, create local PVs for each of the devices, and create a StorageClass to allow binding of the PVs to PVCs (a sketch of such a StorageClass follows the notes below). At this point, the workflow is the same as the cloud use case: you simply specify a count of devices and a template with a StorageClass. Two notes here:

  1. The count of devices would need to match the number of existing devices to consume them all.
  2. The capacity for each device is irrelevant, since we will simply consume the entire storage device and get that capacity regardless of what is set for the PVC.
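
For completeness, the StorageClass the admin would create for this example might look like the following sketch, assuming statically provisioned local PVs:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cluster1-local-storage
provisioner: kubernetes.io/no-provisioner  # local PVs are created statically, not dynamically
volumeBindingMode: WaitForFirstConsumer    # bind only once the consuming Pod is scheduled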

Example 5: Multiple devices per OSD

type: CephCluster
name: cluster1
...
spec:
  ...
  storage:
    ...
    storageClassDeviceSets:
    - name: cluster1-set1
      count: 3
      config:
        metadataDevice: "/dev/rook/device1"
      volumeClaimTemplates:
      - metadata:
          name: osd-data
        spec:
          resources:
            requests:
              storage: 1
          storageClassName: cluster1-hdd-storage
      - metadata:
          name: osd-metadata
        spec:
          resources:
            requests:
              storage: 1
          storageClassName: cluster1-nvme-storage

In this example, we are using NVMe devices to store OSD metadata while having HDDs store the actual data. We do this by creating two StorageClasses, one for the NVMe devices and one for the HDDs. Then, assuming our implementation will always provide the block devices in a deterministic manner, we specify the location of the NVMe device (as seen in the container) as the metadataDevice in the OSD config. We can guarantee that a given OSD Pod will always select two devices that are on the same node if we configure volumeBindingMode: WaitForFirstConsumer in the StorageClasses, as that allows us to offload that logic to the Kubernetes scheduler. Finally, we also provide a name in the metadata of each volume claim template, which can be used to identify whether a given PVC represents the data device or the metadata device.

Example 6: Additional OSD configuration

type: CephCluster
name: cluster1
...
spec:
  ...
  storage:
    ...
    storageClassDeviceSets:
    - name: cluster1-set1
      count: 3
      config:
        osdsPerDevice: "3"
      volumeClaimTemplates:
      - spec:
          resources:
            requests:
              storage: 5Ti
          storageClassName: cluster1-local-storage

In this example, we show how additional OSD configuration can be provided in the StorageClassDeviceSet. The config field is a generic map[string]string, so any provider-specific settings can go in this field.