E2E Test: Network connectivity failures after Velero restore due to OVN flow corruption #2024

Summary

E2E tests using Velero backup/restore can experience network connectivity failures in restored namespaces when OVN-Kubernetes fails to program OVS flows for the restored pods (the flow-table corruption described below). Pods fail to initialize because they cannot reach DNS services or communicate with other pods.

Environment

  • OpenShift: 4.21.0-ec.2
  • OVN-Kubernetes networking
  • Platform: AWS (ARM64)
  • Test: CSI backup/restore e2e tests (mongo-persistent namespace)

Reproduction Steps

  1. Run the CSI backup/restore e2e test (a sketch of the equivalent Velero CLI calls follows these steps):

    # Backup created: mongo-csi-e2e-471ef930-bf64-11f0-9ade-422e9fe364e5
    # Restore created: mongo-csi-e2e-471f015a-bf64-11f0-9ade-422e9fe364e5
  2. Observe restored pods:

    NAME                        READY   STATUS     
    mongo-56447c6857-p5ltl      2/2     Running    
    todolist-68896cbc79-6862c   0/1     Init:0/1   # Stuck
    
  3. Check init container logs:

    kubectl logs todolist-68896cbc79-6862c -n mongo-persistent -c init-myservice
    # Output: Trying to connect to mongo DB port (repeating, timeout)
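
For step 1, the e2e harness drives Velero through its API, but the equivalent CLI calls look roughly like the sketch below (backup/restore names and namespace are the ones from the run above; it assumes a Velero install with the CSI plugin enabled):

    # Hedged sketch: the Velero operations the e2e test performs
    velero backup create mongo-csi-e2e-471ef930-bf64-11f0-9ade-422e9fe364e5 \
      --include-namespaces mongo-persistent \
      --snapshot-volumes

    velero restore create mongo-csi-e2e-471f015a-bf64-11f0-9ade-422e9fe364e5 \
      --from-backup mongo-csi-e2e-471ef930-bf64-11f0-9ade-422e9fe364e5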

Root Cause Analysis

Network Connectivity Symptoms

Restored pods exhibited selective network failures (a sketch of the corresponding checks follows this list):

  • Working: Direct pod IP connectivity (e.g., 10.131.0.101:27017)
  • Working: Service IP connectivity (e.g., 172.30.130.155:27017)
  • Failing: DNS resolution (timeout to 172.30.0.10:53)
  • Failing: Gateway connectivity (100% packet loss to gateway)
  • Failing: Pod-to-pod communication across subnets
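
A minimal sketch of the checks behind this matrix, run against the stuck pod from this report (it assumes the init container image ships nc, nslookup, ping, and ip; the IPs and ports are the ones listed above):

    NS=mongo-persistent
    POD=todolist-68896cbc79-6862c

    # Direct pod IP and Service IP for mongo (these worked)
    kubectl exec -n "$NS" "$POD" -c init-myservice -- nc -zv -w3 10.131.0.101 27017
    kubectl exec -n "$NS" "$POD" -c init-myservice -- nc -zv -w3 172.30.130.155 27017

    # Cluster DNS (timed out)
    kubectl exec -n "$NS" "$POD" -c init-myservice -- nslookup kubernetes.default.svc.cluster.local 172.30.0.10

    # Default gateway (100% packet loss)
    GW=$(kubectl exec -n "$NS" "$POD" -c init-myservice -- sh -c 'ip route | awk "/default/ {print \$3}"')
    kubectl exec -n "$NS" "$POD" -c init-myservice -- ping -c 3 -W 2 "$GW"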

OVS Flow Table Corruption

Investigation revealed missing OVS datapath flows on the affected node (ip-10-0-97-45.ec2.internal):

# Packet trace showing immediate drop
ovs-appctl ofproto/trace br-int in_port=98,icmp
Flow: icmp,in_port=98,...
bridge("br-int")
 0. priority 0
    drop
Final flow: unchanged
Datapath actions: drop
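
For reference, a trace like this can be reproduced for a specific pod by first looking up its OVS interface and OpenFlow port. A rough sketch, assuming host access via oc debug node and OVN-Kubernetes' usual <namespace>_<pod> iface-id naming:

    NODE=ip-10-0-97-45.ec2.internal

    # Look up the OVS interface backing the pod and its OpenFlow port number
    oc debug node/"$NODE" -- chroot /host ovs-vsctl --columns=name,ofport \
      find Interface external_ids:iface-id=mongo-persistent_todolist-68896cbc79-6862c

    # Trace an ICMP packet arriving on that port through br-int
    oc debug node/"$NODE" -- chroot /host ovs-appctl ofproto/trace br-int in_port=98,icmp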

Key findings:

  • OVN logical ports: Correctly configured in NB/SB databases
  • OVS interfaces: Properly created and linked to pods
  • OVS flow tables: Missing ingress flows for application pod ports
  • Only 54 flows existed in table 0 (should be hundreds)
  • All flows were for infrastructure ports from ~4 days ago
  • No flows programmed for any application pods
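
A sketch of the flow-table inspection behind these findings (again via oc debug node; the -O OpenFlow13 option is assumed to match what br-int negotiates):

    NODE=ip-10-0-97-45.ec2.internal

    # Count the flows in table 0 of br-int (only 54 were present on the bad node)
    oc debug node/"$NODE" -- chroot /host sh -c \
      'ovs-ofctl -O OpenFlow13 dump-flows br-int table=0 | grep -c cookie='

    # Check whether any table 0 flow matches the pod's in_port (none did)
    oc debug node/"$NODE" -- chroot /host sh -c \
      'ovs-ofctl -O OpenFlow13 dump-flows br-int table=0 | grep in_port=98'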

OVN Controller State

kubectl exec -n openshift-ovn-kubernetes ovnkube-node-fg5b4 -c ovn-controller -- \
  ovn-appctl -t ovn-controller coverage/show | grep lflow_run
# Output: lflow_run  0.0/sec  0.000/sec  0.0000/sec  total: 11

The lflow_run count of only 11 indicated the controller was not properly processing flow updates. The chassis had nb_cfg: 0, confirming it wasn't syncing configuration changes from the northbound database.

Despite claiming ports and marking them "ovn-installed", the OVN controller failed to:

  1. Compute logical flows for the restored pods
  2. Translate them to OVS datapath flows
  3. Program the flows into the br-int bridge
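
Both halves of this contradiction can be checked directly: the ovn-installed marker sits in the OVS interface's external_ids, while the stalled flow computation shows up in the lflow_run counter quoted above. A sketch, reusing the node and pod names from this run:

    NODE=ip-10-0-97-45.ec2.internal
    OVN_POD=ovnkube-node-fg5b4

    # The port is claimed and marked ovn-installed in the interface external_ids...
    oc debug node/"$NODE" -- chroot /host ovs-vsctl --columns=external_ids \
      find Interface external_ids:iface-id=mongo-persistent_todolist-68896cbc79-6862c

    # ...yet ovn-controller has barely recomputed flows (total stuck at 11)
    kubectl exec -n openshift-ovn-kubernetes "$OVN_POD" -c ovn-controller -- \
      ovn-appctl -t ovn-controller coverage/show | grep lflow_run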

Workaround

Manual Recovery Steps

  1. Restart OVN pod on affected node:

    kubectl delete pod ovnkube-node-<id> -n openshift-ovn-kubernetes --force
  2. Recreate application pods to get fresh network state:

    kubectl delete pod <pod-name> -n <namespace>

After these steps, the new OVN controller properly computed flows, and recreated pods had full network connectivity.
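
A sketch that strings the two steps together for a single affected node (the app=ovnkube-node label and DaemonSet-managed recreation are assumptions about the stock OpenShift OVN-Kubernetes layout; names are the ones from this run):

    NODE=ip-10-0-97-45.ec2.internal
    NS=mongo-persistent
    STUCK_POD=todolist-68896cbc79-6862c

    # 1. Restart the OVN pod on the affected node; its DaemonSet recreates it
    OVN_POD=$(kubectl get pods -n openshift-ovn-kubernetes -l app=ovnkube-node \
      --field-selector spec.nodeName="$NODE" -o jsonpath='{.items[0].metadata.name}')
    kubectl delete pod "$OVN_POD" -n openshift-ovn-kubernetes --force --grace-period=0

    # 2. Wait for a fresh ovnkube-node pod on that node, then recreate the stuck pod
    sleep 10
    kubectl wait pod -l app=ovnkube-node --field-selector spec.nodeName="$NODE" \
      -n openshift-ovn-kubernetes --for=condition=Ready --timeout=300s
    kubectl delete pod "$STUCK_POD" -n "$NS"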

Impact on E2E Tests

This issue causes intermittent test failures in CSI backup/restore scenarios:

  • Symptom: Init containers timeout waiting for services
  • Frequency: Appears to affect specific worker nodes after restore operations
  • Test Impact: Spurious failures - the restore itself succeeds, but the restored pods cannot communicate

Recommendations

Short-term (Test Suite)

  1. Add post-restore validation:

    # Check for network connectivity before declaring success
    kubectl exec <pod> -- ping -c 1 <gateway-ip>
    kubectl exec <pod> -- nslookup kubernetes.default.svc.cluster.local
  2. Implement automatic recovery (see the sketch after this list):

    • Detect stuck init containers after restore
    • Trigger OVN pod restart on affected nodes
    • Force pod recreation if necessary
  3. Monitor OVN health:

    # Check chassis nb_cfg and lflow_run counts
    ovn-sbctl list chassis | grep nb_cfg
    ovn-appctl -t ovn-controller coverage/show | grep lflow_run
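
For item 2, a rough sketch of what that detection/recovery step could look like in the test harness (the Init:* heuristic, the app=ovnkube-node label, and the blanket pod recreation are assumptions, not existing helpers in the suite):

    NS=mongo-persistent

    # Find pods still stuck in an Init:* state after the restore settles
    STUCK=$(kubectl get pods -n "$NS" --no-headers | awk '$3 ~ /^Init:/ {print $1}')

    for pod in $STUCK; do
      node=$(kubectl get pod "$pod" -n "$NS" -o jsonpath='{.spec.nodeName}')
      echo "pod $pod stuck in Init on $node: restarting OVN there and recreating the pod"

      ovn_pod=$(kubectl get pods -n openshift-ovn-kubernetes -l app=ovnkube-node \
        --field-selector spec.nodeName="$node" -o jsonpath='{.items[0].metadata.name}')
      kubectl delete pod "$ovn_pod" -n openshift-ovn-kubernetes
      kubectl delete pod "$pod" -n "$NS"
    done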

Long-term (Product)

  1. Post-restore hook: Add Velero restore hook to restart OVN pods on nodes with restored pods
  2. OVN controller resilience: Investigate why lflow_run stops processing after certain restore operations
  3. Flow verification: Add OVN controller checks to detect missing flows and trigger recomputation
  4. Documentation: Document this known issue in backup/restore troubleshooting guides

Related Information

  • Test Type: CSI backup/restore with VolumeSnapshots
  • Backup Storage: AWS S3 (BSL)
  • Snapshot Location: AWS EBS CSI (VSL)
  • Affected Namespace: mongo-persistent (with PVCs, Services, Deployments)

Additional Context

This appears to be a race condition or state management issue in OVN-Kubernetes when handling pods that are rapidly created from restored manifests. The controller claims the logical ports but enters a state where it stops computing flows (nb_cfg: 0, low lflow_run count).

The issue is node-specific - other worker nodes in the same cluster processed flows correctly for their pods.


Note

Responses generated with Claude
