E2E Test: Network connectivity failures after Velero restore due to OVN flow corruption #2024

Summary

E2E tests using Velero backup/restore can experience network connectivity failures in restored namespaces when OVN-Kubernetes fails to program OVS flows for the restored pods (the flow-table corruption described below). Pods fail to initialize because they cannot reach DNS services or communicate with other pods.

Environment

  • OpenShift: 4.21.0-ec.2
  • OVN-Kubernetes networking
  • Platform: AWS (ARM64)
  • Test: CSI backup/restore e2e tests (mongo-persistent namespace)

Reproduction Steps

  1. Run the CSI backup/restore e2e test (a sketch of the equivalent Velero CLI calls follows these steps):

    # Backup created: mongo-csi-e2e-471ef930-bf64-11f0-9ade-422e9fe364e5
    # Restore created: mongo-csi-e2e-471f015a-bf64-11f0-9ade-422e9fe364e5
  2. Observe restored pods:

    NAME                        READY   STATUS     
    mongo-56447c6857-p5ltl      2/2     Running    
    todolist-68896cbc79-6862c   0/1     Init:0/1   # Stuck
    
  3. Check init container logs:

    kubectl logs todolist-68896cbc79-6862c -n mongo-persistent -c init-myservice
    # Output: Trying to connect to mongo DB port (repeating, timeout)
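
For step 1, the e2e harness drives Velero through its API, but the equivalent CLI calls look roughly like the sketch below (backup/restore names and namespace are the ones from the run above; it assumes a Velero install with the CSI plugin enabled):

    # Hedged sketch: the Velero operations the e2e test performs
    velero backup create mongo-csi-e2e-471ef930-bf64-11f0-9ade-422e9fe364e5 \
      --include-namespaces mongo-persistent \
      --snapshot-volumes

    velero restore create mongo-csi-e2e-471f015a-bf64-11f0-9ade-422e9fe364e5 \
      --from-backup mongo-csi-e2e-471ef930-bf64-11f0-9ade-422e9fe364e5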

Root Cause Analysis

Network Connectivity Symptoms

Restored pods exhibited selective network failures (a sketch of the corresponding checks follows this list):

  • Working: Direct pod IP connectivity (e.g., 10.131.0.101:27017)
  • Working: Service IP connectivity (e.g., 172.30.130.155:27017)
  • Failing: DNS resolution (timeout to 172.30.0.10:53)
  • Failing: Gateway connectivity (100% packet loss to gateway)
  • Failing: Pod-to-pod communication across subnets
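
A minimal sketch of the checks behind this matrix, run against the stuck pod from this report (it assumes the init container image ships nc, nslookup, ping, and ip; the IPs and ports are the ones listed above):

    NS=mongo-persistent
    POD=todolist-68896cbc79-6862c

    # Direct pod IP and Service IP for mongo (these worked)
    kubectl exec -n "$NS" "$POD" -c init-myservice -- nc -zv -w3 10.131.0.101 27017
    kubectl exec -n "$NS" "$POD" -c init-myservice -- nc -zv -w3 172.30.130.155 27017

    # Cluster DNS (timed out)
    kubectl exec -n "$NS" "$POD" -c init-myservice -- nslookup kubernetes.default.svc.cluster.local 172.30.0.10

    # Default gateway (100% packet loss)
    GW=$(kubectl exec -n "$NS" "$POD" -c init-myservice -- sh -c 'ip route | awk "/default/ {print \$3}"')
    kubectl exec -n "$NS" "$POD" -c init-myservice -- ping -c 3 -W 2 "$GW"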

OVS Flow Table Corruption

Investigation revealed missing OVS datapath flows on the affected node (ip-10-0-97-45.ec2.internal):

# Packet trace showing immediate drop
ovs-appctl ofproto/trace br-int in_port=98,icmp
Flow: icmp,in_port=98,...
bridge("br-int")
 0. priority 0
    drop
Final flow: unchanged
Datapath actions: drop
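
For reference, a trace like this can be reproduced for a specific pod by first looking up its OVS interface and OpenFlow port. A rough sketch, assuming host access via oc debug node and OVN-Kubernetes' usual <namespace>_<pod> iface-id naming:

    NODE=ip-10-0-97-45.ec2.internal

    # Look up the OVS interface backing the pod and its OpenFlow port number
    oc debug node/"$NODE" -- chroot /host ovs-vsctl --columns=name,ofport \
      find Interface external_ids:iface-id=mongo-persistent_todolist-68896cbc79-6862c

    # Trace an ICMP packet arriving on that port through br-int
    oc debug node/"$NODE" -- chroot /host ovs-appctl ofproto/trace br-int in_port=98,icmp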

Key findings:

  • OVN logical ports: Correctly configured in NB/SB databases
  • OVS interfaces: Properly created and linked to pods
  • OVS flow tables: Missing ingress flows for application pod ports
  • Only 54 flows existed in table 0 (should be hundreds)
  • All flows were for infrastructure ports from ~4 days ago
  • No flows programmed for any application pods
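
A sketch of the flow-table inspection behind these findings (again via oc debug node; the -O OpenFlow13 option is assumed to match what br-int negotiates):

    NODE=ip-10-0-97-45.ec2.internal

    # Count the flows in table 0 of br-int (only 54 were present on the bad node)
    oc debug node/"$NODE" -- chroot /host sh -c \
      'ovs-ofctl -O OpenFlow13 dump-flows br-int table=0 | grep -c cookie='

    # Check whether any table 0 flow matches the pod's in_port (none did)
    oc debug node/"$NODE" -- chroot /host sh -c \
      'ovs-ofctl -O OpenFlow13 dump-flows br-int table=0 | grep in_port=98'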

OVN Controller State

kubectl exec -n openshift-ovn-kubernetes ovnkube-node-fg5b4 -c ovn-controller -- \
  ovn-appctl -t ovn-controller coverage/show | grep lflow_run
# Output: lflow_run  0.0/sec  0.000/sec  0.0000/sec  total: 11

The lflow_run count of only 11 indicated the controller was not properly processing flow updates. The chassis had nb_cfg: 0, confirming it wasn't syncing configuration changes from the northbound database.

Despite claiming ports and marking them "ovn-installed", the OVN controller failed to:

  1. Compute logical flows for the restored pods
  2. Translate them to OVS datapath flows
  3. Program the flows into the br-int bridge
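
Both halves of this contradiction can be checked directly: the ovn-installed marker sits in the OVS interface's external_ids, while the stalled flow computation shows up in the lflow_run counter quoted above. A sketch, reusing the node and pod names from this run:

    NODE=ip-10-0-97-45.ec2.internal
    OVN_POD=ovnkube-node-fg5b4

    # The port is claimed and marked ovn-installed in the interface external_ids...
    oc debug node/"$NODE" -- chroot /host ovs-vsctl --columns=external_ids \
      find Interface external_ids:iface-id=mongo-persistent_todolist-68896cbc79-6862c

    # ...yet ovn-controller has barely recomputed flows (total stuck at 11)
    kubectl exec -n openshift-ovn-kubernetes "$OVN_POD" -c ovn-controller -- \
      ovn-appctl -t ovn-controller coverage/show | grep lflow_run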

Workaround

Manual Recovery Steps

  1. Restart OVN pod on affected node:

    kubectl delete pod ovnkube-node-<id> -n openshift-ovn-kubernetes --force
  2. Recreate application pods to get fresh network state:

    kubectl delete pod <pod-name> -n <namespace>

After these steps, the new OVN controller properly computed flows, and recreated pods had full network connectivity.
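
A sketch that strings the two steps together for a single affected node (the app=ovnkube-node label and DaemonSet-managed recreation are assumptions about the stock OpenShift OVN-Kubernetes layout; names are the ones from this run):

    NODE=ip-10-0-97-45.ec2.internal
    NS=mongo-persistent
    STUCK_POD=todolist-68896cbc79-6862c

    # 1. Restart the OVN pod on the affected node; its DaemonSet recreates it
    OVN_POD=$(kubectl get pods -n openshift-ovn-kubernetes -l app=ovnkube-node \
      --field-selector spec.nodeName="$NODE" -o jsonpath='{.items[0].metadata.name}')
    kubectl delete pod "$OVN_POD" -n openshift-ovn-kubernetes --force --grace-period=0

    # 2. Wait for a fresh ovnkube-node pod on that node, then recreate the stuck pod
    sleep 10
    kubectl wait pod -l app=ovnkube-node --field-selector spec.nodeName="$NODE" \
      -n openshift-ovn-kubernetes --for=condition=Ready --timeout=300s
    kubectl delete pod "$STUCK_POD" -n "$NS"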

Impact on E2E Tests

This issue causes intermittent test failures in CSI backup/restore scenarios:

  • Symptom: Init containers timeout waiting for services
  • Frequency: Appears to affect specific worker nodes after restore operations
  • Test Impact: Spurious failures - the restore itself succeeds, but the restored pods cannot communicate

Recommendations

Short-term (Test Suite)

  1. Add post-restore validation:

    # Check for network connectivity before declaring success
    kubectl exec <pod> -- ping -c 1 <gateway-ip>
    kubectl exec <pod> -- nslookup kubernetes.default.svc.cluster.local
  2. Implement automatic recovery (see the sketch after this list):

    • Detect stuck init containers after restore
    • Trigger OVN pod restart on affected nodes
    • Force pod recreation if necessary
  3. Monitor OVN health:

    # Check chassis nb_cfg and lflow_run counts
    ovn-sbctl list chassis | grep nb_cfg
    ovn-appctl -t ovn-controller coverage/show | grep lflow_run
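
For item 2, a rough sketch of what that detection/recovery step could look like in the test harness (the Init:* heuristic, the app=ovnkube-node label, and the blanket pod recreation are assumptions, not existing helpers in the suite):

    NS=mongo-persistent

    # Find pods still stuck in an Init:* state after the restore settles
    STUCK=$(kubectl get pods -n "$NS" --no-headers | awk '$3 ~ /^Init:/ {print $1}')

    for pod in $STUCK; do
      node=$(kubectl get pod "$pod" -n "$NS" -o jsonpath='{.spec.nodeName}')
      echo "pod $pod stuck in Init on $node: restarting OVN there and recreating the pod"

      ovn_pod=$(kubectl get pods -n openshift-ovn-kubernetes -l app=ovnkube-node \
        --field-selector spec.nodeName="$node" -o jsonpath='{.items[0].metadata.name}')
      kubectl delete pod "$ovn_pod" -n openshift-ovn-kubernetes
      kubectl delete pod "$pod" -n "$NS"
    done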

Long-term (Product)

  1. Post-restore hook: Add Velero restore hook to restart OVN pods on nodes with restored pods
  2. OVN controller resilience: Investigate why lflow_run stops processing after certain restore operations
  3. Flow verification: Add OVN controller checks to detect missing flows and trigger recomputation
  4. Documentation: Document this known issue in backup/restore troubleshooting guides

Related Information

  • Test Type: CSI backup/restore with VolumeSnapshots
  • Backup Storage: AWS S3 (BSL)
  • Snapshot Location: AWS EBS CSI (VSL)
  • Affected Namespace: mongo-persistent (with PVCs, Services, Deployments)

Additional Context

This appears to be a race condition or state management issue in OVN-Kubernetes when handling pods that are rapidly created from restored manifests. The controller claims the logical ports but enters a state where it stops computing flows (nb_cfg: 0, low lflow_run count).

The issue is node-specific - other worker nodes in the same cluster processed flows correctly for their pods.


Note

Responses generated with Claude
