
eBPF agent saturates internal queue (sending queue is full) on one Kubernetes node even with restricted instrumentation #957

@merouaneagar

Description

Hello team,
I’m facing an issue where the eBPF OTel agent saturates its internal queue (sending queue is full) repeatedly on one Kubernetes node, even with:

  • restricted namespaces in discovery.instrument
  • additional exclude_services
  • reduced instrumentation scope
  • and the OpenTelemetry Collector in front of Instana (not sending directly)

Environment:

  • Kubernetes cluster: OpenShift 4.x (IBM Cloud)
  • eBPF instrumentation version: v0.2.0
  • Image: custom registry (ghcr.io/open-telemetry/opentelemetry-ebpf-instrumentation/ebpf-instrument)
  • Mode: DaemonSet (12 nodes)
  • Only one node out of the 12 produces the issue

Observed behavior:

On one worker node, the agent repeatedly logs:

time=... level=ERROR msg="error sending trace to consumer" error="sending queue is full"
time=... level=ERROR msg="error sending trace to consumer" error="sending queue is full"
time=... level=ERROR msg="error sending trace to consumer" error="sending queue is full"
time=... level=ERROR msg="error sending trace to consumer" error="sending queue is full"

This happens continuously.
The logs also show very frequent process attachment:

instrumenting process cmd=/myapp/app/bin/tg_loader ...
instrumenting process cmd=/myapp/app/bin/tg_daemon ...
instrumenting process cmd=/usr/bin/coreutils ...
instrumenting process cmd=/usr/bin/bash ...

Even after tightening the configuration (namespace restrictions, exe_path exclusions, etc.), the issue persists on this node only.
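For reference, the tightened discovery block looks roughly like this (a minimal sketch: field names follow what we use in v0.2.0, glob-style matching is assumed, and the namespace name is a placeholder):

discovery:
  instrument:
    # only our application namespace and binaries
    - k8s_namespace: myapp-namespace   # placeholder name
      exe_path: /myapp/app/bin/*
  exclude_services:
    # noisy short-lived executables seen in the attach logs
    - exe_path: "*/bash"
    - exe_path: "*/coreutils"
    - exe_path: "*/conmon"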

Troubleshooting done:

  • Restricted discovery.instrument to only the binaries of interest (/myapp/app/bin/*)
  • Excluded noisy executables (bash, coreutils, conmon)
  • Tried with and without the OpenTelemetry Collector
  • Added sampling on the collector side (a sketch of the collector config follows this list)
  • Enabled (or prepared to enable) k8s-cache component
  • Verified node load (CPU/Memory OK)
  • Verified that the OTLP receiver is not lagging
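The collector-side sampling and queueing mentioned above looked roughly like this (a minimal OpenTelemetry Collector sketch; the endpoint, sampling percentage, and queue sizes are placeholders, not our production values):

receivers:
  otlp:
    protocols:
      grpc:

processors:
  probabilistic_sampler:
    sampling_percentage: 10   # placeholder rate

exporters:
  otlp:
    endpoint: instana-otlp-endpoint:4317   # placeholder endpoint
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000   # enlarged queue on the collector side

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlp]

Note that this only tunes the collector's exporter queue; the "sending queue is full" errors above come from the agent itself and were not reduced by these changes.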

Questions:

  1. Is the agent's internal sending queue size configurable?
  2. Is this behavior expected under high syscall churn?
  3. Could the process attach/detach loop cause queue saturation?
  4. Is this known / documented behavior in v0.2.0?
  5. Should the sampling or filtering also apply before the queue?
  6. Any recommended mitigations for a single-node hotspot?

Thanks!
