Hello team,
I’m facing an issue where the eBPF OTel agent repeatedly saturates its internal queue ("sending queue is full") on one Kubernetes node, even with:
- restricted namespaces in discovery.instrument
- additional exclude_services
- reduced instrumentation scope
- the OpenTelemetry Collector in front of Instana (not sending directly)
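For reference, this is roughly the shape of the restriction I'm running (a trimmed sketch with anonymized values; discovery.instrument, exclude_services and exe_path are the keys I'm actually using, while the k8s_namespace selector name and the globs below are illustrative):

```yaml
# Trimmed sketch of the agent discovery config (values anonymized).
# k8s_namespace is my reading of the selector name; globs are illustrative.
discovery:
  instrument:
    # only instrument our own binaries in the application namespaces
    - k8s_namespace: myapp-*
      exe_path: /myapp/app/bin/*
  exclude_services:
    # drop the noisy short-lived executables seen in the logs
    - exe_path: "*/bash"
    - exe_path: "*/coreutils"
    - exe_path: "*/conmon"
```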
Environment:
- Kubernetes cluster: OpenShift 4.x (IBM Cloud)
- eBPF instrumentation version: v0.2.0
- Image: custom registry (ghcr.io/open-telemetry/opentelemetry-ebpf-instrumentation/ebpf-instrument)
- Mode: DaemonSet (12 nodes)
- Only one node out of the 12 produces the issue
Observed behavior:
On one worker node, the agent repeatedly logs:
```
time=... level=ERROR msg="error sending trace to consumer" error="sending queue is full"
time=... level=ERROR msg="error sending trace to consumer" error="sending queue is full"
time=... level=ERROR msg="error sending trace to consumer" error="sending queue is full"
time=... level=ERROR msg="error sending trace to consumer" error="sending queue is full"
```
This happens continuously.
The logs also show very frequent process attachment:
```
instrumenting process cmd=/myapp/app/bin/tg_loader ...
instrumenting process cmd=/myapp/app/bin/tg_daemon ...
instrumenting process cmd=/usr/bin/coreutils ...
instrumenting process cmd=/usr/bin/bash ...
```
Even after tightening the configuration (namespace restrictions, exe_path exclusions, etc.), the issue persists on this node only.
Troubleshooting done:
- Restricted discovery.instrument to only the binaries of interest (/myapp/app/bin/*)
- Excluded noisy executables (bash, coreutils, conmon)
- Tried with and without OpenTelemetry Collector
- Added sampling on the collector side (see the sketch after this list)
- Enabled (or prepared to enable) k8s-cache component
- Verified node load (CPU/Memory OK)
- Verified that the OTLP receiver is not lagging
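For completeness, the Collector-side sampling mentioned above looks roughly like this (a sketch, not my exact pipeline; the exporter endpoint and the numeric values are placeholders, and probabilistic_sampler comes from the Collector contrib distribution):

```yaml
# Sketch of the Collector pipeline in front of Instana (names/values are placeholders).
receivers:
  otlp:
    protocols:
      grpc:

processors:
  probabilistic_sampler:
    sampling_percentage: 10      # keep ~10% of traces

exporters:
  otlp:
    endpoint: instana-backend:4317   # placeholder endpoint
    sending_queue:
      enabled: true
      queue_size: 5000               # larger buffer on the Collector side

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlp]
```

Even with this in place, the "sending queue is full" errors are logged by the agent itself, i.e. before the data ever reaches the Collector.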
Questions:
- Is the internal eBPF queue size configurable?
- Is this behavior expected under high syscall churn?
- Could the process attach/detach loop cause queue saturation?
- Is this known / documented behavior in v0.2.0?
- Should the sampling or filtering also apply before the queue?
- Any recommended mitigations for a single-node hotspot?
Thanks!