Targeted for v0.9
Provisioning OSDs today is done directly by Rook. This needs to be simplified and improved by building on the functionality provided by the `ceph-volume` tool that is included in the ceph image.
As Rook is implemented today, the provisioning has a lot of complexity, for example around supporting a metadata device where the WAL and DB are placed on a different device from the data. Since this is mostly handled by `ceph-volume` now, Rook should replace its own provisioning code and rely on `ceph-volume`.
`ceph-volume` is a CLI tool included in the `ceph/ceph` image that will be used to configure and run Ceph OSDs. `ceph-volume` will replace the OSD provisioning mentioned previously in the legacy design.
At a high level this flow remains unchanged from the flow in the one-osd-per-pod design. No new jobs or pods need to be launched from what we have today. The sequence of events in the OSD provisioning will be the following (a sketch of the corresponding commands is shown after the list).

During provisioning on each storage node:
- Call `ceph-volume lvm batch` to prepare the OSDs on the node. A single call is made with all of the devices unless more specific settings are included for LVM and partitions.
- Call `ceph-volume lvm list` to retrieve the results of the OSD configuration. Store the results in a configmap for the operator to take the next step.
In each OSD daemon pod:
- `rook` is the entrypoint for the container.
- `ceph-volume lvm activate` is called to activate the osd, which mounts the config directory such as `/var/lib/ceph/osd-0`, using a tmpfs mount. The OSD options such as `--bluestore`, `--filestore`, `OSD_ID`, and `OSD_FSID` are passed to the command as necessary.
- `ceph-osd` is then started to run the OSD daemon.
- When `ceph-osd` exits, `rook` will exit and the pod will be restarted by K8s.
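As a rough, non-authoritative sketch of the commands behind this flow (the device names, OSD id, and fsid are placeholders; the exact flags are chosen by Rook at runtime):

```console
# Provisioning, once per storage node: prepare all requested devices in a single call
ceph-volume lvm batch --prepare /dev/sdd /dev/sde
# Retrieve the resulting OSD configuration so Rook can store it in a configmap
ceph-volume lvm list --format json

# OSD daemon pod, once per OSD: activate the OSD, then run the daemon in the foreground
ceph-volume lvm activate --bluestore --no-systemd 0 <osd-fsid>
ceph-osd --foreground --id 0
```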
`ceph-volume` enables rook to expose several new features:
- encrypting OSDs with dmcrypt
- configuring multiple OSDs on a single device
- setting the crush device class for the OSDs on a device
The Cluster CRD will be updated with the following settings to enable these features. All of these settings can be specified globally when placed under the `storage` element as in this example. The `config` element can also be specified under individual nodes or devices.
storage:
  config:
    # whether to encrypt the contents of the OSD with dmcrypt
    encryptedDevice: "true"
    # how many OSDs should be configured on each device. only recommended to be greater than 1 for NVME devices
    osdsPerDevice: 1
    # the class name for the OSD(s) on devices
    crushDeviceClass: ssd
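For illustration only (this mapping is not spelled out in this design), the global settings above correspond roughly to existing `ceph-volume lvm batch` flags; the devices shown are hypothetical:

```console
# encryptedDevice -> --dmcrypt, osdsPerDevice -> --osds-per-device, crushDeviceClass -> --crush-device-class
ceph-volume lvm batch --prepare --dmcrypt --osds-per-device 1 --crush-device-class ssd /dev/sdd /dev/sde
```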
If more flexibility is needed than consuming raw devices, LVM or partition names can also be used for specific nodes. Properties are shown for both bluestore and filestore OSDs.
storage:
  nodes:
  - name: node2
    # OSDs on LVM (open design question: need to re-evaluate the logicalDevice settings when they are implemented after 0.9 and whether they should be under the more general storage node "config" settings)
    logicalDevices:
    # bluestore: the DB, WAL, and Data are on separate LVs
    - db: db_lv1
      wal: wal_lv1
      data: data_lv1
      dbVolumeGroup: db_vg
      walVolumeGroup: wal_vg
      dataVolumeGroup: data_vg
    # bluestore: the DB, WAL, and Data are all on the same LV
    - volume: my_lv1
      volumeGroup: my_vg
    # filestore: data and journal on the same LV
    - data: my_lv2
      dataVolumeGroup: my_vg
    # filestore: data and journal on different LVs
    - data: data_lv3
      dataVolumeGroup: data_vg
      journal: journal_lv3
      journalVolumeGroup: journal_vg
    # devices support both filestore and bluestore configurations based on the "config.storeType" setting at the global, node, or device level
    devices:
    # OSD on a raw device
    - name: sdd
    # OSD on a partition (partition support is new)
    - name: sdf1
    # Multiple OSDs on a high performance device
    - name: nvme01
      config:
        osdsPerDevice: 5
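As a non-authoritative sketch, the first bluestore `logicalDevices` entry above (separate DB, WAL, and data LVs) might translate into a `ceph-volume lvm prepare` call along these lines; the vg/lv names come from the example and the exact invocation is left to Rook:

```console
ceph-volume lvm prepare --bluestore \
    --data data_vg/data_lv1 \
    --block.db db_vg/db_lv1 \
    --block.wal wal_vg/wal_lv1
```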
The above options for LVM and partitions look very tedious. Questions:
Rook will need to continue supporting clusters that are running different types of OSDs. All of the v0.8 OSDs must continue running after Rook is upgraded to v0.9 and beyond, whether they were filestore or bluestore running on directories or devices.
Since `ceph-volume` only supports devices that have not been previously configured by Rook:
- If a `directory` is specified in the CRD, `ceph-volume` will be skipped and the OSDs will be started as previously.
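For example, a cluster spec that still uses directories keeps those OSDs on the legacy path. A minimal illustration using the existing `directories` element of the Cluster CRD (the path shown is hypothetical):

```yaml
storage:
  directories:
  # OSDs backed by directories keep using the legacy provisioning, without ceph-volume
  - path: /rook/storage-dir
```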
Rook relies on very recent developments in `ceph-volume` that are not yet available in luminous or mimic releases. For example, rook needs to run the command:
ceph-volume lvm batch --prepare <devices>
The `batch` command and the flag `--prepare` have been added recently. While the latest `ceph-volume` changes will soon be merged to luminous and mimic, Rook needs to know if it is running an image that contains the required functionality. To detect if `ceph-volume` supports the required options, Rook will run the `batch` command with all the flags that are required. To avoid side effects when testing for the version of `ceph-volume`, no devices are passed to the `batch` command.
ceph-volume lvm batch --prepare
- If the required options are supported, `ceph-volume` has an exit code of `0`.
- If the required options are not supported, `ceph-volume` has an exit code of `2`.

Since Rook orchestrates different versions of Ceph, Rook (at least initially) will need to support running images that may not have the features necessary from `ceph-volume`. When a supported version of `ceph-volume` is not detected, Rook will execute the legacy code to provision devices.
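A minimal shell illustration of the detection described above; Rook performs the equivalent check internally before choosing between the new and legacy provisioning code:

```console
# No devices are passed, so nothing is modified; only the exit code matters
ceph-volume lvm batch --prepare
echo $?   # 0 if the required options are supported, 2 if they are not
```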