AWS Container Insights Receiver

Status
Stability beta: metrics
Distributions contrib, aws, observiq, sumo
Warnings Other
Issues Open issues Closed issues
Code Owners @Aneurysm9, @pxaws

Overview

AWS Container Insights Receiver (awscontainerinsightreceiver) is an AWS-specific receiver that supports CloudWatch Container Insights. CloudWatch Container Insights collects, aggregates, and summarizes metrics and logs from your containerized applications and microservices. Data is collected as performance log events using the embedded metric format (EMF). From the EMF data, Amazon CloudWatch can create aggregated CloudWatch metrics at the cluster, node, pod, task, and service level.

CloudWatch Container Insights has been supported by the ECS Agent and the CloudWatch Agent, which collect infrastructure metrics for many resources such as CPU, memory, disk, and network. To help existing customers migrate to OpenTelemetry, the AWS Container Insights Receiver (together with the CloudWatch EMF Exporter) aims to support the same CloudWatch Container Insights experience on the following platforms:

  • Amazon ECS
  • Amazon EKS
  • Kubernetes platforms on Amazon EC2

Design of AWS Container Insights Receiver

See the design doc

Configuration

Example configuration:

receivers:
  awscontainerinsightreceiver:
    # all parameters are optional
    collection_interval: 60s
    container_orchestrator: eks
    add_service_as_attribute: true 
    prefer_full_pod_name: false 
    add_full_pod_name_metric_label: false 

There is no need to provide any parameters since they are all optional.

collection_interval (optional)

The interval at which metrics should be collected. The default is 60 seconds.

container_orchestrator (optional)

The type of container orchestration service, e.g. eks or ecs. The default is eks.

add_service_as_attribute (optional)

Whether to add the associated service name as an attribute. The default is true.

prefer_full_pod_name (optional)

The "PodName" attribute is set based on the name of the relevant controller, such as DaemonSet, Job, ReplicaSet, or ReplicationController. If it cannot be set that way and prefer_full_pod_name is true, the "PodName" attribute is set to the pod's own name. The default is false.
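As a small illustration of this option, the fragment below turns it on. The comments restate the behavior described in this section; the directly created pod is an assumed scenario chosen for illustration, not one taken from the receiver's documentation.

```yaml
receivers:
  awscontainerinsightreceiver:
    # Default (false): "PodName" is derived only from the owning controller
    # (DaemonSet, Job, ReplicaSet, ReplicationController, ...).
    # true: when no controller name can be determined (assumed example:
    # a pod created directly, without a controller), "PodName" falls back
    # to the pod's own name.
    prefer_full_pod_name: true
```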

add_full_pod_name_metric_label (optional)

The "FullPodName" attribute is the pod name including its suffix. If this option is false, the FullPodName label is not added. The default is false.
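A minimal sketch of how this label might be used together with the awsemf exporter. It assumes that, once the label is emitted, FullPodName can be listed as a dimension in metric_declarations like the other attributes; that pairing is not confirmed by this README, so treat it as a starting point to verify in your own cluster.

```yaml
receivers:
  awscontainerinsightreceiver:
    # Emit the "FullPodName" label (pod name including its suffix).
    add_full_pod_name_metric_label: true

exporters:
  awsemf:
    namespace: ContainerInsights
    resource_to_telemetry_conversion:
      enabled: true
    metric_declarations:
      # Assumption: FullPodName is usable as a dimension once emitted.
      - dimensions: [[FullPodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
```

Each distinct pod name becomes its own dimension value, so a per-pod dimension like this is likely to raise the number of CloudWatch metrics in clusters where pods are replaced often.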

Sample configuration for Container Insights

This is a sample configuration for AWS Container Insights using the awscontainerinsightreceiver and awsemfexporter for an EKS cluster:

# create namespace
apiVersion: v1
kind: Namespace
metadata:
  name: aws-otel-eks
  labels:
    name: aws-otel-eks

---
# create cwagent service account and role binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-otel-sa
  namespace: aws-otel-eks

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: aoc-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/stats", "configmaps", "events"]
    verbs: ["create", "get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["otel-container-insight-clusterleader"]
    verbs: ["get","update"]

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: aoc-agent-role-binding
subjects:
  - kind: ServiceAccount
    name: aws-otel-sa
    namespace: aws-otel-eks
roleRef:
  kind: ClusterRole
  name: aoc-agent-role
  apiGroup: rbac.authorization.k8s.io

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf
  namespace: aws-otel-eks
  labels:
    app: opentelemetry
    component: otel-agent-conf
data:
  otel-agent-config: |
    extensions:
      health_check:

    receivers:
      awscontainerinsightreceiver:

    processors:
      batch/metrics:
        timeout: 60s

    exporters:
      awsemf:
        namespace: ContainerInsights
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{NodeName}'
        resource_to_telemetry_conversion:
          enabled: true
        dimension_rollup_option: NoDimensionRollup
        parse_json_encoded_attr_values: [Sources, kubernetes]
        metric_declarations:
          # node metrics
          - dimensions: [[NodeName, InstanceId, ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
              - node_cpu_usage_total
              - node_cpu_limit
              - node_memory_working_set
              - node_memory_limit

          # pod metrics
          - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - pod_cpu_utilization
              - pod_memory_utilization
              - pod_network_rx_bytes
              - pod_network_tx_bytes
              - pod_cpu_utilization_over_pod_limit
              - pod_memory_utilization_over_pod_limit
          - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - pod_cpu_reserved_capacity
              - pod_memory_reserved_capacity
          - dimensions: [[PodName, Namespace, ClusterName]]
            metric_name_selectors:
              - pod_number_of_container_restarts

          # cluster metrics
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - cluster_node_count
              - cluster_failed_node_count

          # service metrics
          - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - service_number_of_running_pods

          # node fs metrics
          - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
            metric_name_selectors:
              - node_filesystem_utilization

          # namespace metrics
          - dimensions: [[Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - namespace_number_of_running_pods


      debug:
        verbosity: detailed

    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [batch/metrics]
          exporters: [awsemf]

      extensions: [health_check]

---
# create Daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-otel-eks-ci
  namespace: aws-otel-eks
spec:
  selector:
    matchLabels:
      name: aws-otel-eks-ci
  template:
    metadata:
      labels:
        name: aws-otel-eks-ci
    spec:
      containers:
        - name: aws-otel-collector
          image: {collector-image-url}
          env:
            #- name: AWS_REGION
            #  value: "us-east-1"
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          imagePullPolicy: Always
          command:
            - "/awscollector"
            - "--config=/conf/otel-agent-config.yaml"
          volumeMounts:
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
            - name: otel-agent-config-vol
              mountPath: /conf
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
      volumes:
        - configMap:
            name: otel-agent-conf
            items:
              - key: otel-agent-config
                path: otel-agent-config.yaml
          name: otel-agent-config-vol
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            path: /run/containerd/containerd.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      serviceAccountName: aws-otel-sa

To deploy to an EKS cluster:

kubectl apply -f config.yaml

Available Metrics and Resource Attributes

Cluster

Metric Unit
cluster_failed_node_count Count
cluster_node_count Count



Resource Attribute
ClusterName
NodeName
Type
Timestamp
Version
Sources





Cluster Namespace

Metric Unit
namespace_number_of_running_pods Count



Resource Attribute
ClusterName
NodeName
Namespace
Type
Timestamp
Version
Sources
kubernetes





Cluster Service

Metric Unit
service_number_of_running_pods Count



Resource Attribute
ClusterName
NodeName
Namespace
Service
Type
Timestamp
Version
Sources
kubernetes





Node

Metric Unit
node_cpu_limit Millicore
node_cpu_request Millicore
node_cpu_reserved_capacity Percent
node_cpu_usage_system Millicore
node_cpu_usage_total Millicore
node_cpu_usage_user Millicore
node_cpu_utilization Percent
node_memory_cache Bytes
node_memory_failcnt Count
node_memory_hierarchical_pgfault Count/Second
node_memory_hierarchical_pgmajfault Count/Second
node_memory_limit Bytes
node_memory_mapped_file Bytes
node_memory_max_usage Bytes
node_memory_pgfault Count/Second
node_memory_pgmajfault Count/Second
node_memory_request Bytes
node_memory_reserved_capacity Percent
node_memory_rss Bytes
node_memory_swap Bytes
node_memory_usage Bytes
node_memory_utilization Percent
node_memory_working_set Bytes
node_network_rx_bytes Bytes/Second
node_network_rx_dropped Count/Second
node_network_rx_errors Count/Second
node_network_rx_packets Count/Second
node_network_total_bytes Bytes/Second
node_network_tx_bytes Bytes/Second
node_network_tx_dropped Count/Second
node_network_tx_errors Count/Second
node_network_tx_packets Count/Second
node_number_of_running_containers Count
node_number_of_running_pods Count



Resource Attribute
ClusterName
InstanceType
NodeName
Timestamp
Type
Version
Sources
kubernetes





Node Disk IO

Metric Unit
node_diskio_io_serviced_async Count/Second
node_diskio_io_serviced_read Count/Second
node_diskio_io_serviced_sync Count/Second
node_diskio_io_serviced_total Count/Second
node_diskio_io_serviced_write Count/Second
node_diskio_io_service_bytes_async Bytes/Second
node_diskio_io_service_bytes_read Bytes/Second
node_diskio_io_service_bytes_sync Bytes/Second
node_diskio_io_service_bytes_total Bytes/Second
node_diskio_io_service_bytes_write Bytes/Second



Resource Attribute
AutoScalingGroupName
ClusterName
InstanceId
InstanceType
NodeName
Timestamp
EBSVolumeId
device
Type
Version
Sources
kubernetes



Node Filesystem

Metric Unit
node_filesystem_available Bytes
node_filesystem_capacity Bytes
node_filesystem_inodes Count
node_filesystem_inodes_free Count
node_filesystem_usage Bytes
node_filesystem_utilization Percent



Resource Attribute
AutoScalingGroupName
ClusterName
InstanceId
InstanceType
NodeName
Timestamp
EBSVolumeId
device
fstype
Type
Version
Sources
kubernetes



Node Network

Metric Unit
node_interface_network_rx_bytes Bytes/Second
node_interface_network_rx_dropped Count/Second
node_interface_network_rx_errors Count/Second
node_interface_network_rx_packets Count/Second
node_interface_network_total_bytes Bytes/Second
node_interface_network_tx_bytes Bytes/Second
node_interface_network_tx_dropped Count/Second
node_interface_network_tx_errors Count/Second
node_interface_network_tx_packets Count/Second



Resource Attribute
AutoScalingGroupName
ClusterName
InstanceId
InstanceType
NodeName
Timestamp
Type
Version
interface
Sources
kubernetes



Pod

Metric Unit
pod_cpu_limit Millicore
pod_cpu_request Millicore
pod_cpu_reserved_capacity Percent
pod_cpu_usage_system Millicore
pod_cpu_usage_total Millicore
pod_cpu_usage_user Millicore
pod_cpu_utilization Percent
pod_cpu_utilization_over_pod_limit Percent
pod_memory_cache Bytes
pod_memory_failcnt Count
pod_memory_hierarchical_pgfault Count/Second
pod_memory_hierarchical_pgmajfault Count/Second
pod_memory_limit Bytes
pod_memory_mapped_file Bytes
pod_memory_max_usage Bytes
pod_memory_pgfault Count/Second
pod_memory_pgmajfault Count/Second
pod_memory_request Bytes
pod_memory_reserved_capacity Percent
pod_memory_rss Bytes
pod_memory_swap Bytes
pod_memory_usage Bytes
pod_memory_utilization Percent
pod_memory_utilization_over_pod_limit Percent
pod_memory_working_set Bytes
pod_network_rx_bytes Bytes/Second
pod_network_rx_dropped Count/Second
pod_network_rx_errors Count/Second
pod_network_rx_packets Count/Second
pod_network_total_bytes Bytes/Second
pod_network_tx_bytes Bytes/Second
pod_network_tx_dropped Count/Second
pod_network_tx_errors Count/Second
pod_network_tx_packets Count/Second
pod_number_of_container_restarts Count
pod_number_of_containers Count
pod_number_of_running_containers Count

Resource Attribute
AutoScalingGroupName
ClusterName
InstanceId
InstanceType
K8sPodName
Namespace
NodeName
PodId
Timestamp
Type
Version
Sources
kubernetes
pod_status



Pod Network

Metric Unit
pod_interface_network_rx_bytes Bytes/Second
pod_interface_network_rx_dropped Count/Second
pod_interface_network_rx_errors Count/Second
pod_interface_network_rx_packets Count/Second
pod_interface_network_total_bytes Bytes/Second
pod_interface_network_tx_bytes Bytes/Second
pod_interface_network_tx_dropped Count/Second
pod_interface_network_tx_errors Count/Second
pod_interface_network_tx_packets Count/Second



Resource Attribute
AutoScalingGroupName
ClusterName
InstanceId
InstanceType
K8sPodName
Namespace
NodeName
PodId
Timestamp
Type
Version
interface
Sources
kubernetes
pod_status



Container

Metric Unit
container_cpu_limit Millicore
container_cpu_request Millicore
container_cpu_usage_system Millicore
container_cpu_usage_total Millicore
container_cpu_usage_user Millicore
container_cpu_utilization Percent
container_memory_cache Bytes
container_memory_failcnt Count
container_memory_hierarchical_pgfault Count/Second
container_memory_hierarchical_pgmajfault Count/Second
container_memory_limit Bytes
container_memory_mapped_file Bytes
container_memory_max_usage Bytes
container_memory_pgfault Count/Second
container_memory_pgmajfault Count/Second
container_memory_request Bytes
container_memory_rss Bytes
container_memory_swap Bytes
container_memory_usage Bytes
container_memory_utilization Percent
container_memory_working_set Bytes
number_of_container_restarts Count



Resource Attribute
AutoScalingGroupName
ClusterName
ContainerId
ContainerName
InstanceId
InstanceType
K8sPodName
Namespace
NodeName
PodId
Timestamp
Type
Version
Sources
kubernetes
container_status
container_status_reason
container_last_termination_reason

The attribute container_status_reason is present only when container_status is "Waiting" or "Terminated". The attribute container_last_termination_reason is present only when container_status is "Terminated".

This is a sample configuration for AWS Container Insights using the awscontainerinsightreceiver and awsemfexporter for an ECS cluster to collect the instance level metrics:

receivers:
  awscontainerinsightreceiver:
    collection_interval: 10s
    container_orchestrator: ecs

processors:
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsightsEC2Instance
    log_group_name: '/aws/ecs/containerinsights/{ClusterName}/performance'
    log_stream_name: 'instanceTelemetry/{ContainerInstanceId}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources]
    metric_declarations:
      # instance metrics
      - dimensions: [ [ ContainerInstanceId, InstanceId, ClusterName] ]
        metric_name_selectors:
          - instance_cpu_utilization
          - instance_memory_utilization
          - instance_network_total_bytes
          - instance_cpu_reserved_capacity
          - instance_memory_reserved_capacity
          - instance_number_of_running_tasks
          - instance_filesystem_utilization
      - dimensions: [ [ClusterName] ]
        metric_name_selectors:
          - instance_cpu_utilization
          - instance_memory_utilization
          - instance_network_total_bytes
          - instance_cpu_reserved_capacity
          - instance_memory_reserved_capacity
          - instance_number_of_running_tasks
          - instance_cpu_usage_total
          - instance_cpu_limit
          - instance_memory_working_set
          - instance_memory_limit
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf,debug]

To deploy to an ECS cluster, see this doc for details.

Available Metrics and Resource Attributes

Instance

Metric Unit
instance_cpu_limit Millicore
instance_cpu_reserved_capacity Percent
instance_cpu_usage_system Millicore
instance_cpu_usage_total Millicore
instance_cpu_usage_user Millicore
instance_cpu_utilization Percent
instance_memory_cache Bytes
instance_memory_failcnt Count
instance_memory_hierarchical_pgfault Count/Second
instance_memory_hierarchical_pgmajfault Count/Second
instance_memory_limit Bytes
instance_memory_mapped_file Bytes
instance_memory_max_usage Bytes
instance_memory_pgfault Count/Second
instance_memory_pgmajfault Count/Second
instance_memory_reserved_capacity Percent
instance_memory_rss Bytes
instance_memory_swap Bytes
instance_memory_usage Bytes
instance_memory_utilization Percent
instance_memory_working_set Bytes
instance_network_rx_bytes Bytes/Second
instance_network_rx_dropped Count/Second
instance_network_rx_errors Count/Second
instance_network_rx_packets Count/Second
instance_network_total_bytes Bytes/Second
instance_network_tx_bytes Bytes/Second
instance_network_tx_dropped Count/Second
instance_network_tx_errors Count/Second
instance_network_tx_packets Count/Second
instance_number_of_running_tasks Count



Resource Attribute
ClusterName
InstanceType
AutoScalingGroupName
Timestamp
Type
Version
Sources
ContainerInstanceId
InstanceId





Instance Disk IO

Metric Unit
instance_diskio_io_serviced_async Count/Second
instance_diskio_io_serviced_read Count/Second
instance_diskio_io_serviced_sync Count/Second
instance_diskio_io_serviced_total Count/Second
instance_diskio_io_serviced_write Count/Second
instance_diskio_io_service_bytes_async Bytes/Second
instance_diskio_io_service_bytes_read Bytes/Second
instance_diskio_io_service_bytes_sync Bytes/Second
instance_diskio_io_service_bytes_total Bytes/Second
instance_diskio_io_service_bytes_write Bytes/Second



Resource Attribute
ClusterName
InstanceType
AutoScalingGroupName
Timestamp
Type
Version
Sources
ContainerInstanceId
InstanceId
EBSVolumeId





Instance Filesystem

Metric Unit
instance_filesystem_available Bytes
instance_filesystem_capacity Bytes
instance_filesystem_inodes Count
instance_filesystem_inodes_free Count
instance_filesystem_usage Bytes
instance_filesystem_utilization Percent



Resource Attribute
ClusterName
InstanceType
AutoScalingGroupName
Timestamp
Type
Version
Sources
ContainerInstanceId
InstanceId
EBSVolumeId



Instance Network

Metric Unit
instance_interface_network_rx_bytes Bytes/Second
instance_interface_network_rx_dropped Count/Second
instance_interface_network_rx_errors Count/Second
instance_interface_network_rx_packets Count/Second
instance_interface_network_total_bytes Bytes/Second
instance_interface_network_tx_bytes Bytes/Second
instance_interface_network_tx_dropped Count/Second
instance_interface_network_tx_errors Count/Second
instance_interface_network_tx_packets Count/Second



Resource Attribute
ClusterName
InstanceType
AutoScalingGroupName
Timestamp
Type
Version
Sources
ContainerInstanceId
InstanceId
EBSVolumeId



Warnings

Root permissions

When using this component, the collector process needs root permissions to read the files in the following locations:

  • /
  • /var/run/docker.sock
  • /var/lib/docker
  • /run/containerd/containerd.sock
  • /sys
  • /dev/disk

This requirement comes from the fact that this component is based on cAdvisor.
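On Kubernetes, one way to meet this requirement is to run the collector container as root explicitly. The fragment below is a sketch against the DaemonSet from the EKS sample; the securityContext block is an addition for illustration and is not part of that sample. Whether it is needed depends on the default user of the collector image you deploy, which this README does not state.

```yaml
# Fragment of the DaemonSet pod template (names match the EKS sample).
spec:
  template:
    spec:
      containers:
        - name: aws-otel-collector
          # Assumption: the image would otherwise run as a non-root user.
          securityContext:
            runAsUser: 0   # UID 0 = root, so the hostPath mounts are readable
```

If cluster policy forbids root containers, this component may not be able to read the host paths listed above.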