Creating a Kubernetes Cluster with Runtime Observability

With contributions from Sebastian Choren, Adnan Rahić and Ken Hamric.

Kubernetes is an open source system widely used in the cloud native landscape to provide ways to deploy and scale containerized applications in the cloud. Its ability to observe logs and metrics is well-known and documented, but its observability regarding application traces is new.

Here is a brief synopsis of the recent activity in the Kubernetes ecosystem:

In investigating the current state of tracing with Kubernetes, we found very few articles documenting how to enable it, like this article on Kubernetes blog about kubelet observability. We decided to document our findings and provide step-by-step instructions to set Kubernetes up locally and inspect traces.

You’ll learn how to use this instrumentation with Kubernetes to start observing traces on its API (kube-apiserver), node agent (kubelet), and container runtime (containerd) by setting up a local observability environment and later doing a local install of Kubernetes with tracing enabled.

First, install the following tools on your local machine:

  • Docker: a container environment that allows us to run containerized environments
  • k3d: a wrapper to run k3s (a lightweight Kubernetes distribution) with Docker
  • kubectl: a Kubernetes CLI to interact with clusters

Setting up an Observability Stack to Monitor Traces

To set up the observability stack, you’ll run the OpenTelemetry (OTel) Collector, a tool that receives telemetry data from different apps and sends it to a tracing backend. As a tracing backend, you’ll use Jaeger, an open source tool that collects traces and lets you query them.

On your machine, create a directory called kubetracing and create a file called otel-collector.yaml, copy the contents of the following snippet, and save it in a folder of your preference.

This file will configure the OpenTelemetry Collector to receive traces in OpenTelemetry format and export them to Jaeger.

receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 100
  batch:
    timeout: 100ms
exporters:
  logging:
    logLevel: debug
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp/jaeger, logging]

After that, in the same folder, create a docker-compose.yaml file that will have two containers, one for Jaeger and another for the OpenTelemetry Collector.

services:
  jaeger:
    healthcheck:
      test:
        - CMD
        - wget
        - --spider
        - localhost:16686
      timeout: 3s
      interval: 1s
      retries: 60
    image: jaegertracing/all-in-one:latest
    restart: unless-stopped
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - 16686:16686
  otel-collector:
    command:
      - --config
      - /otel-local-config.yaml
    depends_on:
      jaeger:
        condition: service_started
    image: otel/opentelemetry-collector:0.54.0
    ports:
      - 4317:4317
    volumes:
      - ./otel-collector.yaml:/otel-local-config.yaml

Now, start the observability environment by running the following command in the kubetracing folder:

docker compose up

This will start both Jaeger and the OpenTelemetry Collector, enabling them to receive traces from other apps.

Creating a Kubernetes Cluster with Runtime Observability

With the observability environment set up, create the configuration files to enable OpenTelemetry tracing in kube-apiserver, kubelet, and containerd.

Inside the kubetracing folder, create a subfolder called config that will have the following two files.

First, the apiserver-tracing.yaml, which contains the tracing configuration used by kube-apiserver to export traces containing execution data of the Kubernetes API. In this configuration, set the API to send 100% of the traces with the samplingRatePerMillion config. Set the endpoint as host.k3d.internal:4317 to allow the cluster created by k3d/k3s to call another API on your machine. In this case, the OpenTelemetry Collector deployed via docker compose on port 4317.

apiVersion: apiserver.config.k8s.io/v1beta1
kind: TracingConfiguration
endpoint: host.k3d.internal:4317
samplingRatePerMillion: 1000000 # 100%

The second file is kubelet-tracing.yaml, which provides additional configuration for kubelet. Here you’ll enable the feature flag KubeletTracing (a beta feature in Kubernetes 1.27, the current version when this article was written) and set the same tracing settings that were set on kube-apiserver.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  endpoint: host.k3d.internal:4317
  samplingRatePerMillion: 1000000 # 100%

Returning to the kubetracing folder, create the last file, config.toml.tmpl, which is a template file used by k3s to configure containerd. This file is similar to the default configuration that k3s uses, with two more sections at the end of the file that configures containerd to send traces.

version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "{{ .NodeConfig.Containerd.Opt }}"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = {{ .NodeConfig.SELinux }}
  enable_unprivileged_ports = {{ .EnableUnprivileged }}
  enable_unprivileged_icmp = {{ .EnableUnprivileged }}

{{- if .DisableCgroup}}
  disable_cgroup = true
{{end}}
{{- if .IsRunningInUserNS }}
  disable_apparmor = true
  restrict_oom_score_adj = true
{{end}}

{{- if .NodeConfig.AgentConfig.PauseImage }}
  sandbox_image = "{{ .NodeConfig.AgentConfig.PauseImage }}"
{{end}}

{{- if .NodeConfig.AgentConfig.Snapshotter }}
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "{{ .NodeConfig.AgentConfig.Snapshotter }}"
  disable_snapshot_annotations = {{ if eq .NodeConfig.AgentConfig.Snapshotter "stargz" }}false{{else}}true{{end}}
{{ if eq .NodeConfig.AgentConfig.Snapshotter "stargz" }}
{{ if .NodeConfig.AgentConfig.ImageServiceSocket }}
[plugins."io.containerd.snapshotter.v1.stargz"]
cri_keychain_image_service_path = "{{ .NodeConfig.AgentConfig.ImageServiceSocket }}"
[plugins."io.containerd.snapshotter.v1.stargz".cri_keychain]
enable_keychain = true
{{end}}
{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.mirrors."{{$k}}"]
  endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{if $v.Rewrites}}
  [plugins."io.containerd.snapshotter.v1.stargz".registry.mirrors."{{$k}}".rewrite]
{{range $pattern, $replace := $v.Rewrites}}
    "{{$pattern}}" = "{{$replace}}"
{{end}}
{{end}}
{{end}}
{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.configs."{{$k}}".auth]
  {{ if $v.Auth.Username }}username = {{ printf "%q" $v.Auth.Username }}{{end}}
  {{ if $v.Auth.Password }}password = {{ printf "%q" $v.Auth.Password }}{{end}}
  {{ if $v.Auth.Auth }}auth = {{ printf "%q" $v.Auth.Auth }}{{end}}
  {{ if $v.Auth.IdentityToken }}identitytoken = {{ printf "%q" $v.Auth.IdentityToken }}{{end}}
{{end}}
{{ if $v.TLS }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.configs."{{$k}}".tls]
  {{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
  {{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
  {{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
  {{ if $v.TLS.InsecureSkipVerify }}insecure_skip_verify = true{{end}}
{{end}}
{{end}}
{{end}}
{{end}}
{{end}}

{{- if not .NodeConfig.NoFlannel }}
[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "{{ .NodeConfig.AgentConfig.CNIBinDir }}"
  conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"
{{end}}

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = {{ .SystemdCgroup }}

{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."{{$k}}"]
  endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{if $v.Rewrites}}
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."{{$k}}".rewrite]
{{range $pattern, $replace := $v.Rewrites}}
    "{{$pattern}}" = "{{$replace}}"
{{end}}
{{end}}
{{end}}

{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins."io.containerd.grpc.v1.cri".registry.configs."{{$k}}".auth]
  {{ if $v.Auth.Username }}username = {{ printf "%q" $v.Auth.Username }}{{end}}
  {{ if $v.Auth.Password }}password = {{ printf "%q" $v.Auth.Password }}{{end}}
  {{ if $v.Auth.Auth }}auth = {{ printf "%q" $v.Auth.Auth }}{{end}}
  {{ if $v.Auth.IdentityToken }}identitytoken = {{ printf "%q" $v.Auth.IdentityToken }}{{end}}
{{end}}
{{ if $v.TLS }}
[plugins."io.containerd.grpc.v1.cri".registry.configs."{{$k}}".tls]
  {{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
  {{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
  {{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
  {{ if $v.TLS.InsecureSkipVerify }}insecure_skip_verify = true{{end}}
{{end}}
{{end}}
{{end}}

{{range $k, $v := .ExtraRuntimes}}
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."{{$k}}"]
  runtime_type = "{{$v.RuntimeType}}"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."{{$k}}".options]
  BinaryName = "{{$v.BinaryName}}"
{{end}}

[plugins."io.containerd.tracing.processor.v1.otlp"]
  endpoint = "host.k3d.internal:4317"
  protocol = "grpc"
  insecure = true

[plugins."io.containerd.internal.v1.tracing"]
  sampling_ratio = 1.0
  service_name = "containerd"

After creating these files, open a terminal inside the kubetracing folder and run k3d to create a cluster. Before running this command, replace the [CURRENT_PATH] placeholder for the entire path of the kubetracing folder. You can retrieve it by running the echo $PWD command in the terminal in that folder.

k3d cluster create tracingcluster \
  --image=rancher/k3s:v1.27.1-k3s1 \
  --volume '[CURRENT_PATH]/config.toml.tmpl:/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl@server:*' \
  --volume '[CURRENT_PATH]/config:/etc/kube-tracing@server:*' \
  --k3s-arg '--kube-apiserver-arg=tracing-config-file=/etc/kube-tracing/apiserver-tracing.yaml@server:*' \
  --k3s-arg '--kube-apiserver-arg=feature-gates=APIServerTracing=true@server:*' \
  --k3s-arg '--kubelet-arg=config=/etc/kube-tracing/kubelet-tracing.yaml@server:*'

This command will create a Kubernetes cluster with version v1.27.1, and set up in three docker containers on your machine. If you run the command kubectl cluster-info now, you will see this output:

Kubernetes control plane is running at https://0.0.0.0:60503
CoreDNS is running at https://0.0.0.0:60503/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://0.0.0.0:60503/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy

Going back to the logs of the observability environment, you should see some spans of internal Kubernetes operations being published in OpenTelemetry Collector, like this:

Span #90
    Trace ID       : 03a7bf9008d54f02bcd4f14aa5438202
    Parent ID      :
    ID             : d7a10873192f7066
    Name           : KubernetesAPI
    Kind           : SPAN_KIND_SERVER
    Start time     : 2023-05-18 01:51:44.954563708 +0000 UTC
    End time       : 2023-05-18 01:51:44.957555323 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Attributes:
     -> net.transport: STRING(ip_tcp)
     -> net.peer.ip: STRING(127.0.0.1)
     -> net.peer.port: INT(54678)
     -> net.host.ip: STRING(127.0.0.1)
     -> net.host.port: INT(6443)
     -> http.target: STRING(/api/v1/namespaces/kube-system/pods/helm-install-traefik-crd-8w4wd)
     -> http.server_name: STRING(KubernetesAPI)
     -> http.user_agent: STRING(k3s/v1.27.1+k3s1 (linux/amd64) kubernetes/bc5b42c)
     -> http.scheme: STRING(https)
     -> http.host: STRING(127.0.0.1:6443)
     -> http.flavor: STRING(2)
     -> http.method: STRING(GET)
     -> http.wrote_bytes: INT(4724)
     -> http.status_code: INT(200)

Testing the Cluster Runtime

With the Observability environment and the Kubernetes cluster set up, you can now trigger commands against Kubernetes and see traces of these actions in Jaeger.

Open the browser, and navigate to the Jaeger UI located at http://localhost:16686/search. You’ll see that the apiserver, containerd, and kubelet services are publishing traces:

Jaeger screen with services dropdown open showing apiserver, containerd and kubelet services as options

Choose apiserver and click on “Find Traces”. Here you see traces from the Kubernetes control plane:

Jaeger screen showing a list of spans found for apiserver

Let’s run a sample command against Kubernetes with kubectl, like running an echo:

$ kubectl run -it --rm --restart=Never --image=alpine echo-command -- echo hi

# Output
# If you don't see a command prompt, try pressing enter.
# warning: couldn't attach to pod/echo-command, falling back to streaming logs: unable to upgrade connection: container echo-command not found in pod echo-command_default
# Hi
# pod "echo-command" deleted

And now, open Jaeger again, choose the kubelet service, operation syncPod, and add the tag k8s.pod=default/echo-command, you should be able to see spans related to this pod:

Jaeger screen showing a list of spans found for the syncPod operation on kubelet service

Expanding one trace, you’ll see the operations that created this pod:

Jaeger screen showing a single syncPod expanded

Conclusion

Even in beta, both traces for kubelet and apiserver can help a developer understand what’s happening under the hood in Kubernetes and start debugging issues.

This will be helpful for developers that create custom tasks, like Kubernetes Operators that update internal resources to add more functionalities to Kubernetes.

As a team focused on building an open source tool in the observability space, the opportunity to help the overall OpenTelemetry community was important to us. That’s why we were researching finding new ways of collecting traces from the core Kubernetes engine. With the current level of observability being exposed by Kubernetes we wanted to publish our findings in order to help others interested in seeing the current state of distributed tracing in the Kubernetes engine. Daniel Dias and Sebastian Choren are working on Tracetest, an open source tool that allows you to develop and test your distributed system with OpenTelemetry. It works with any OTel compatible system and enables trace–based tests to be created. Check it out at https://github.com/kubeshop/tracetest.

The example sources used in this article, and setup instructions are available from the Tracetest repository.

References