The solution plan agreed upon with the telemetry team is for the Rook operator to add telemetry to the Ceph mon `config-key` database, and Ceph will read each of those items for telemetry retrieval.
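
As a rough illustration of the read side, the sketch below picks the Rook items out of a set of `config-key` entries. It assumes the entries are available as a JSON object mapping key names to string values, and that all Rook telemetry keys share the `rook/` prefix used by the metric names in this document. The function name `rook_telemetry_items` is hypothetical and is not taken from Ceph or Rook code.

```python
import json

# Assumption: every Rook telemetry key starts with this prefix.
ROOK_PREFIX = "rook/"


def rook_telemetry_items(config_key_dump):
    """Return only the Rook telemetry entries from a JSON object of config-key items."""
    items = json.loads(config_key_dump)
    return {key: value for key, value in items.items() if key.startswith(ROOK_PREFIX)}


if __name__ == "__main__":
    # "some/other/key" stands in for any unrelated config-key entry.
    dump = '{"rook/version": "vx.y.z", "some/other/key": "ignored"}'
    print(rook_telemetry_items(dump))
```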
Telemetry items should not use `config-key` keys that can grow arbitrarily large, so that space usage of the mon database stays low (limited growth is still acceptable).

Metric names will indicate a hierarchy that can be parsed to add them to Ceph telemetry collection in a more ordered fashion.
For example, `rook/version` and `rook/kubernetes/version` would be put into a structure like the one shown:

    "rook": {
      "version": "vx.y.z",
      "kubernetes": {
        "version": "vX.Y.Z"
      }
    }
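
One way the hierarchy could be parsed is sketched below: each metric name is split on `/`, and every segment except the last becomes a nested group. This is a minimal sketch only. The function name `build_telemetry_tree` is hypothetical, and treating a leaf/group name clash as an error is an assumption, not something this design specifies.

```python
import json


def build_telemetry_tree(items):
    """Nest flat 'a/b/c' metric names into dictionaries keyed by path segment."""
    tree = {}
    # Sorting puts a key before any longer key that extends it, so a leaf
    # can never silently overwrite a group that was created earlier.
    for key in sorted(items):
        *groups, leaf = key.split("/")
        node = tree
        for group in groups:
            child = node.setdefault(group, {})
            if not isinstance(child, dict):
                # Assumed policy: a name cannot be both a value and a group.
                raise ValueError(f"{key!r} nests under a leaf value")
            node = child
        node[leaf] = items[key]
    return tree


if __name__ == "__main__":
    flat = {"rook/version": "vx.y.z", "rook/kubernetes/version": "vX.Y.Z"}
    print(json.dumps(build_telemetry_tree(flat), indent=2))
```

For the two example keys, the function returns a dictionary with the same nesting as the structure shown above.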
- `rook/version` - Rook version.
- `rook/kubernetes/...`
  - `rook/kubernetes/version` - Kubernetes version.
- `rook/csi/...`
  - `rook/csi/version` - Ceph CSI version.
- `rook/node/count/...` - Node scale information
  - `rook/node/count/kubernetes-total` - Total number of Kubernetes nodes
  - `rook/node/count/with-ceph-daemons` - Number of nodes running Ceph daemons.
    - `-1` to represent "unknown"
  - `rook/node/count/with-csi-rbd-plugin` - Number of nodes with CSI RBD plugin pods
  - `rook/node/count/with-csi-cephfs-plugin` - Number of nodes with CSI CephFS plugin pods
  - `rook/node/count/with-csi-nfs-plugin` - Number of nodes with CSI NFS plugin pods
- `rook/usage/storage-class/...` - Info about storage classes related to the Ceph cluster
  - `rook/usage/storage-class/count/...` - Number of storage classes of a given type
    - `rook/usage/storage-class/count/total` - This is additionally useful in the case of a newly-added storage class type not recognized by an older Ceph telemetry version
    - `rook/usage/storage-class/count/rbd`
    - `rook/usage/storage-class/count/cephfs`
    - `rook/usage/storage-class/count/nfs`
    - `rook/usage/storage-class/count/bucket`
- `rook/cluster/storage/...` - Info about storage configuration
  - `rook/cluster/storage/device-set/...` - Info about storage class device sets
    - `rook/cluster/storage/device-set/count/...` - Number of device sets of given types
      - `rook/cluster/storage/device-set/count/total`
      - `rook/cluster/storage/device-set/count/portable`
      - `rook/cluster/storage/device-set/count/non-portable`
- `rook/cluster/mon/...` - Info about monitors and mon configuration
  - `rook/cluster/mon/count` - The desired mon count
  - `rook/cluster/mon/allow-multiple-per-node` - true/false if allowing multiple mons per node
  - `rook/cluster/mon/max-id` - The highest mon ID, which increases as mons fail over
  - `rook/cluster/mon/pvc/enabled` - true/false whether mons are on PVC
  - `rook/cluster/mon/stretch/enabled` - true/false if mons are in a stretch configuration
- `rook/cluster/network/...`
  - `rook/cluster/network/provider` - The network provider used for the cluster (default, host, multus)
- `rook/cluster/external-mode` - true/false if the cluster is in external mode

This strategy will allow Rook time to add telemetry items as it is able without rushing. Because the telemetry fields will be approved all at once, it will also minimize the coordination that is required between Ceph and Rook. The Ceph team will not need to create PRs one-to-one with Rook, and we can limit version mismatch issues as the telemetry is being added.
Future updates will follow a similar pattern where new telemetry is suggested by updates to this design doc in Rook, then batch-added by Ceph.
Rook will define all telemetry config-keys in a common file, so that it is easy to see from the code which telemetry a given version of Rook implements.
The below one-liner should list each individual metric in this design doc, which can help in creating Ceph issue trackers for adding Rook telemetry features.

    grep -E -o -e '- `rook/.*[^\.]`' design/ceph/ceph-telemetry.md | grep -E -o -e 'rook/.*[^`]'
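
The same extraction can be sketched in Python, which may be easier to reuse from a script. This assumes, as the `grep` pattern does, that each metric appears in the doc as a list item whose name is wrapped in backticks, and that names ending in `.` (the `...` group entries) are skipped. Unlike the greedy `grep` pattern, this version matches only up to the name's closing backtick. The name `list_metrics` is hypothetical.

```python
import re

# "- `rook/<name>`", where the character before the closing backtick is not ".".
METRIC_PATTERN = re.compile(r"- `(rook/[^`]*[^.`])`")


def list_metrics(markdown_text):
    """Return metric names found in backtick-quoted list items, in document order."""
    names = []
    for line in markdown_text.splitlines():
        match = METRIC_PATTERN.search(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    # Path taken from the one-liner above; run from the repository root.
    with open("design/ceph/ceph-telemetry.md", encoding="utf-8") as doc:
        print("\n".join(list_metrics(doc.read())))
```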
Rejected metrics are included to capture the full discussion, and they can be revisited at any time as new information or needs arise.
Count of each type of CR: cluster, object, file, object store, mirror, bucket topic, bucket notification, etc.
This was rejected for version one for a few reasons:
We can revisit this on a case-by-case basis for specific CRs or features. For example, we may want insight into COSI usage when that is available.
The memory/CPU requests/limits set on Ceph daemon types.
This was rejected for a few reasons:
Unless we can provide good reasoning for why this particular metric is valuable, this is likely too much work for too little benefit.
The number of PVCs/PVs of the different CSI types.
This was rejected primarily because it would require adding new get/list permissions to the Rook operator, which runs counter to Rook's goal of keeping permissions as minimal as possible.