Description
Summary
E2E tests using Velero backup/restore can experience network connectivity failures in restored namespaces due to OVN-Kubernetes flow table corruption. Pods fail to initialize because they cannot reach DNS services or communicate with other pods.
Environment
- OpenShift: 4.21.0-ec.2
- OVN-Kubernetes networking
- Platform: AWS (ARM64)
- Test: CSI backup/restore e2e tests (mongo-persistent namespace)
Reproduction Steps
- Run the CSI backup/restore e2e test (a rough manual equivalent using the velero CLI is sketched after these steps):
# Backup created: mongo-csi-e2e-471ef930-bf64-11f0-9ade-422e9fe364e5
# Restore created: mongo-csi-e2e-471f015a-bf64-11f0-9ade-422e9fe364e5
- Observe the restored pods:
NAME                        READY   STATUS
mongo-56447c6857-p5ltl      2/2     Running
todolist-68896cbc79-6862c   0/1     Init:0/1   # Stuck
- Check the init container logs:
kubectl logs todolist-68896cbc79-6862c -n mongo-persistent -c init-myservice
# Output: "Trying to connect to mongo DB port" (repeating until timeout)
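For reference, a rough manual equivalent of the test flow using the velero CLI; the backup/restore names here are illustrative, and the exact snapshot flags depend on how CSI support is configured in the Velero install:
# Back up the test namespace, including volume snapshots (illustrative names)
velero backup create mongo-csi-manual --include-namespaces mongo-persistent --snapshot-volumes
# Delete the namespace, then restore it from the backup
kubectl delete namespace mongo-persistent
velero restore create mongo-csi-manual-restore --from-backup mongo-csi-manual
# Watch for pods stuck in Init:0/1 after the restore
kubectl get pods -n mongo-persistent -w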
Root Cause Analysis
Network Connectivity Symptoms
Restored pods exhibited selective network failures:
- ✅ Working: Direct pod IP connectivity (e.g., 10.131.0.101:27017)
- ✅ Working: Service IP connectivity (e.g., 172.30.130.155:27017)
- ❌ Failing: DNS resolution (timeout to 172.30.0.10:53)
- ❌ Failing: Gateway connectivity (100% packet loss to the gateway)
- ❌ Failing: Pod-to-pod communication across subnets
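A minimal sketch of the checks behind the classification above, run from the stuck pod; the IPs and names are from this run, and it assumes the container image ships nc, nslookup, and ping:
POD=todolist-68896cbc79-6862c; NS=mongo-persistent
# Worked: direct pod IP and service IP
kubectl exec -n $NS $POD -c init-myservice -- nc -zv -w 3 10.131.0.101 27017
kubectl exec -n $NS $POD -c init-myservice -- nc -zv -w 3 172.30.130.155 27017
# Failed: DNS via the cluster resolver and ICMP to the node gateway
kubectl exec -n $NS $POD -c init-myservice -- nslookup kubernetes.default.svc.cluster.local 172.30.0.10
kubectl exec -n $NS $POD -c init-myservice -- ping -c 3 <gateway-ip>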
OVS Flow Table Corruption
Investigation revealed missing OVS datapath flows on the affected node (ip-10-0-97-45.ec2.internal):
# Packet trace showing immediate drop
ovs-appctl ofproto/trace br-int in_port=98,icmp
Flow: icmp,in_port=98,...
bridge("br-int")
 0. priority 0
    drop
Final flow: unchanged
Datapath actions: drop
Key findings:
- OVN logical ports: Correctly configured in NB/SB databases
- OVS interfaces: Properly created and linked to pods
- OVS flow tables: Missing ingress flows for application pod ports
  - Only 54 flows existed in table 0 (there should be hundreds)
  - All existing flows were for infrastructure ports from ~4 days earlier
  - No flows were programmed for any application pods
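A sketch of how the table 0 state was inspected; it assumes host-level access to the affected node via oc debug (OVS runs on the host on OpenShift nodes):
oc debug node/ip-10-0-97-45.ec2.internal
chroot /host
# Count flows in table 0 of br-int (only 54 were present on this node)
ovs-ofctl -O OpenFlow13 dump-flows br-int table=0 | wc -l
# Trace a packet arriving on the affected pod's ofport (98 here); it hit only the priority-0 drop
ovs-appctl ofproto/trace br-int in_port=98,icmp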
OVN Controller State
kubectl exec -n openshift-ovn-kubernetes ovnkube-node-fg5b4 -c ovn-controller -- \
ovn-appctl -t ovn-controller coverage/show | grep lflow_run
# Output: lflow_run 0.0/sec 0.000/sec 0.0000/sec total: 11
The lflow_run count of only 11 indicated the controller was not properly processing flow updates. The chassis had nb_cfg: 0, confirming it wasn't syncing configuration changes from the northbound database.
Despite claiming ports and marking them "ovn-installed", the OVN controller failed to:
- Compute logical flows for the restored pods
- Translate them to OVS datapath flows
- Program the flows into the br-int bridge
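Before falling back to the pod restart described below, one option worth trying (not verified in this run, and the appctl command name varies by OVN release) is forcing ovn-controller to do a full recompute and then re-checking the counter:
# Force a full logical-flow recompute (older OVN releases expose this as "recompute")
kubectl exec -n openshift-ovn-kubernetes ovnkube-node-fg5b4 -c ovn-controller -- \
  ovn-appctl -t ovn-controller inc-engine/recompute
# Confirm lflow_run starts increasing again
kubectl exec -n openshift-ovn-kubernetes ovnkube-node-fg5b4 -c ovn-controller -- \
  ovn-appctl -t ovn-controller coverage/show | grep lflow_run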
Workaround
Manual Recovery Steps
- Restart the OVN pod on the affected node:
kubectl delete pod ovnkube-node-<id> -n openshift-ovn-kubernetes --force
- Recreate the application pods to get fresh network state:
kubectl delete pod <pod-name> -n <namespace>
After these steps, the new OVN controller properly computed flows, and recreated pods had full network connectivity.
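A hedged sketch of scripting these two steps for a known-bad node; the node name and namespace are from this run, and the app=ovnkube-node label matches the OpenShift daemonset but should be verified:
NODE=ip-10-0-97-45.ec2.internal
NS=mongo-persistent
# Find and restart the ovnkube-node pod scheduled on the affected node
OVN_POD=$(kubectl get pods -n openshift-ovn-kubernetes -l app=ovnkube-node \
  --field-selector spec.nodeName=$NODE -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$OVN_POD" -n openshift-ovn-kubernetes --force
kubectl wait pods -n openshift-ovn-kubernetes -l app=ovnkube-node \
  --field-selector spec.nodeName=$NODE --for=condition=Ready --timeout=180s
# Recreate the application pods so they pick up fresh network state
kubectl delete pods --all -n "$NS"
kubectl wait pods --all -n "$NS" --for=condition=Ready --timeout=300s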
Impact on E2E Tests
This issue causes intermittent test failures in CSI backup/restore scenarios:
- Symptom: Init containers timeout waiting for services
- Frequency: Appears to affect specific worker nodes after restore operations
- Test Impact: False negatives - restore succeeds but pods can't communicate
Recommendations
Short-term (Test Suite)
- Add post-restore validation (see the sketch after this list):
# Check network connectivity before declaring success
kubectl exec <pod> -- ping -c 1 <gateway-ip>
kubectl exec <pod> -- nslookup kubernetes.default.svc.cluster.local
- Implement automatic recovery:
  - Detect stuck init containers after restore
  - Trigger an OVN pod restart on affected nodes
  - Force pod recreation if necessary
- Monitor OVN health:
# Check chassis nb_cfg and lflow_run counts
ovn-sbctl list chassis | grep nb_cfg
ovn-appctl -t ovn-controller coverage/show | grep lflow_run
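A sketch of what the post-restore validation and OVN health check could look like in the test suite; automatic recovery would build on the workaround script above, and the selectors and probes here are illustrative:
NS=mongo-persistent
# 1. Post-restore validation: fail fast if pods never become Ready
if ! kubectl wait pods --all -n "$NS" --for=condition=Ready --timeout=300s; then
  # 2. OVN health: dump lflow_run for every node hosting restored pods
  for NODE in $(kubectl get pods -n "$NS" -o jsonpath='{.items[*].spec.nodeName}' | tr ' ' '\n' | sort -u); do
    OVN_POD=$(kubectl get pods -n openshift-ovn-kubernetes -l app=ovnkube-node \
      --field-selector spec.nodeName=$NODE -o jsonpath='{.items[0].metadata.name}')
    echo "=== $NODE / $OVN_POD ==="
    kubectl exec -n openshift-ovn-kubernetes "$OVN_POD" -c ovn-controller -- \
      ovn-appctl -t ovn-controller coverage/show | grep lflow_run
  done
  exit 1
fi
# 3. Connectivity probe from one restored pod (assumes nslookup is available in the image)
POD=$(kubectl get pods -n "$NS" -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NS" "$POD" -- nslookup kubernetes.default.svc.cluster.local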
Long-term (Product)
- Post-restore hook: Add Velero restore hook to restart OVN pods on nodes with restored pods
- OVN controller resilience: Investigate why lflow_run stops processing after certain restore operations
- Flow verification: Add OVN controller checks to detect missing flows and trigger recomputation
- Documentation: Document this known issue in backup/restore troubleshooting guides
Related Information
- Test Type: CSI backup/restore with VolumeSnapshots
- Backup Storage: AWS S3 (BSL)
- Snapshot Location: AWS EBS CSI (VSL)
- Affected Namespace: mongo-persistent (with PVCs, Services, Deployments)
Additional Context
This appears to be a race condition or state management issue in OVN-Kubernetes when handling pods that are rapidly created from restored manifests. The controller claims the logical ports but enters a state where it stops computing flows (nb_cfg: 0, low lflow_run count).
The issue is node-specific - other worker nodes in the same cluster processed flows correctly for their pods.
Note
Responses generated with Claude