Troubleshooting Guide
Overview
This guide covers common issues in the RH OVE ecosystem and gives a systematic approach to diagnosing and resolving problems across the virtualization, networking, storage, and monitoring components.
General Troubleshooting Approach
Diagnostic Flow
graph TD
A[Issue Identified] --> B[Gather Information]
B --> C[Check Logs]
C --> D[Verify Configuration]
D --> E[Test Components]
E --> F{Issue Resolved?}
F -->|No| G[Escalate/Deep Dive]
F -->|Yes| H[Document Solution]
G --> I[Advanced Diagnostics]
I --> J[Vendor Support]
Essential Commands
# Cluster overview
oc get nodes
oc get pods --all-namespaces
oc get events --all-namespaces --sort-by='.lastTimestamp'
# Resource utilization
oc adm top nodes
oc adm top pods --all-namespaces
# Detailed investigation
oc describe node <node-name>
oc logs -f <pod-name> -n <namespace>
Virtual Machine Issues
VM Won't Start
Symptoms
- VM remains in "Pending" or "Scheduling" state
- VM fails to boot or crashes during startup
Troubleshooting Steps
- Check VM Definition
- Verify Node Resources
- Check DataVolume Status
- Review Events
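The four checks above can be run in sequence. The following is a minimal sketch, not a definitive procedure: the function name `diagnose_vm_start` and its arguments are illustrative, and it assumes `oc` is logged in to the cluster and cluster metrics are available for `oc adm top`.

```shell
# Sketch: run the four "VM won't start" checks for one VM.
# diagnose_vm_start is a hypothetical helper name, not part of any CLI.
diagnose_vm_start() {
  vm=$1; ns=$2
  # 1. VM definition, including the status conditions KubeVirt reports
  oc get vm "$vm" -n "$ns" -o yaml
  # 2. Current node usage, to spot nodes with no headroom
  oc adm top nodes
  # 3. DataVolumes in the namespace and their phase
  oc get datavolume -n "$ns"
  # 4. Events that reference the VM by name, oldest first
  oc get events -n "$ns" --field-selector involvedObject.name="$vm" --sort-by='.lastTimestamp'
}

# Example: diagnose_vm_start <vm-name> <namespace>
```

Scheduling failures usually show up in the events output as messages about insufficient CPU or memory, or about unmatched node selectors.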
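Common Solutions
- Insufficient Resources: Add cluster capacity or reduce the VM's CPU and memory requests
- DataVolume Issues: Check the CDI logs and confirm that the storage class exists and can provision volumes
- Node Affinity: Verify that the node selector and affinity rules match at least one schedulable node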
VM Performance Issues
Symptoms
- Slow VM performance
- High CPU/memory usage
- Network latency
Troubleshooting Steps
- Check VM Resource Allocation
- Monitor VM Metrics
- Verify Host Resources
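One way to carry out these steps from the command line is sketched below. The helper name `check_vm_performance` is illustrative. The sketch assumes the VM is running (so a VMI with a `status.nodeName` exists) and that the virt-launcher pods carry the standard `kubevirt.io=virt-launcher` label.

```shell
# Sketch: compare what a VM was given with what it and its node are using.
# check_vm_performance is a hypothetical helper name.
check_vm_performance() {
  vm=$1; ns=$2
  # 1. CPU topology and resource requests/limits from the VM template
  oc get vm "$vm" -n "$ns" \
    -o jsonpath='{.spec.template.spec.domain.cpu}{"\n"}{.spec.template.spec.domain.resources}{"\n"}'
  # 2. Live usage of the virt-launcher pods that host VMs in this namespace
  oc adm top pods -n "$ns" -l kubevirt.io=virt-launcher
  # 3. Usage of the node the VMI is currently scheduled on
  node=$(oc get vmi "$vm" -n "$ns" -o jsonpath='{.status.nodeName}')
  if [ -n "$node" ]; then
    oc adm top node "$node"
  else
    echo "VMI $vm has no node assigned (is the VM running?)" >&2
  fi
}

# Example: check_vm_performance <vm-name> <namespace>
```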
Solutions
- Adjust the VM's CPU and memory allocation
- Enable CPU pinning (dedicated CPU placement) for critical VMs
- Check storage performance and IOPS limits
Networking Issues
Cilium Network Problems
Symptoms
- Pods cannot communicate
- Network policies not working
- DNS resolution failures
Troubleshooting Steps
- Check Cilium Status
- Verify Network Policies
- Monitor Network Flows
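The steps above might look like the following sketch. The helper name `check_cilium` is illustrative, and the namespace is an assumption: Cilium is commonly installed in `cilium` or `kube-system`, so pass the one your installation uses. Recent Cilium releases may name the in-pod CLI `cilium-dbg` instead of `cilium`.

```shell
# Sketch: Cilium health, policies, and dropped traffic.
# check_cilium is a hypothetical helper; the default namespace is an assumption.
check_cilium() {
  cilium_ns=${1:-cilium}
  # 1. Agent pods (one per node) and a short health summary from one agent
  oc get pods -n "$cilium_ns" -l k8s-app=cilium -o wide
  oc exec -n "$cilium_ns" ds/cilium -- cilium status --brief
  # 2. Kubernetes and Cilium network policies across the cluster
  oc get networkpolicy --all-namespaces
  oc get ciliumnetworkpolicy --all-namespaces
  # 3. Dropped packets seen by that agent (streams until interrupted with Ctrl+C)
  oc exec -n "$cilium_ns" ds/cilium -- cilium monitor --type drop
}

# Example: check_cilium kube-system
```

Because `ds/cilium` resolves to a single agent pod, the status and drop output describe one node only; repeat against a specific pod to inspect another node.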
Common Solutions
# Debug network connectivity
apiVersion: v1
kind: Pod
metadata:
  name: network-debug
spec:
  containers:
  - name: debug
    image: nicolaka/netshoot
    command: ['sleep', '3600']
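A possible way to use the debug pod is sketched below. It assumes the manifest above has been saved as `network-debug.yaml` (a hypothetical file name) and that the target is a host name reachable over HTTP. The helper name `run_network_debug` is illustrative.

```shell
# Sketch: start the netshoot pod, test DNS and HTTP reachability, then clean up.
# run_network_debug and network-debug.yaml are hypothetical names.
run_network_debug() {
  ns=$1; target=$2
  oc apply -n "$ns" -f network-debug.yaml
  # Wait until the pod is ready before running commands in it
  oc wait -n "$ns" --for=condition=Ready pod/network-debug --timeout=120s || return 1
  # DNS resolution from inside the pod network
  oc exec -n "$ns" network-debug -- nslookup "$target"
  # HTTP connectivity with a short timeout
  oc exec -n "$ns" network-debug -- curl -sv --max-time 5 "http://$target"
  # Remove the pod when finished
  oc delete pod network-debug -n "$ns"
}

# Example: run_network_debug <namespace> <service>.<namespace>.svc.cluster.local
```

Running the pod in the same namespace as the affected workload matters, because network policies are evaluated per namespace and per pod label.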
VM Network Connectivity
Symptoms
- VM cannot reach external networks
- Inter-VM communication failures
- Service discovery issues
Troubleshooting Steps
- Check VM Network Configuration
- Verify Service Configuration
- Test Connectivity from VM
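The first two steps can be scripted. The third is interactive. The sketch below uses an illustrative helper name, `check_vm_network`, and assumes the VM is running so that the VMI reports interface status.

```shell
# Sketch: inspect the networks a VM requests, the interfaces it actually has,
# and the Services and Endpoints in its namespace.
# check_vm_network is a hypothetical helper name.
check_vm_network() {
  vm=$1; ns=$2
  # 1. Networks declared in the VM template vs. interfaces reported by the VMI
  oc get vm "$vm" -n "$ns" -o jsonpath='{.spec.template.spec.networks}{"\n"}'
  oc get vmi "$vm" -n "$ns" -o jsonpath='{.status.interfaces}{"\n"}'
  # 2. Services and the endpoints behind them; a Service with no endpoints
  #    usually means its selector does not match the VM's pod labels
  oc get svc,endpoints -n "$ns"
}

# 3. Test from inside the guest (interactive; leave the console with Ctrl+]):
#    virtctl console <vm-name> -n <namespace>
```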
Storage Issues
DataVolume Problems
Symptoms
- DataVolume stuck in "Pending" state
- Import/clone operations failing
- Storage quota exceeded
Troubleshooting Steps
- Check DataVolume Status
- Review CDI Logs
- Verify Storage Classes
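A sketch of these steps follows. The helper name `check_datavolume` is illustrative. It assumes the CDI controller runs as `deployment/cdi-deployment` in the `openshift-cnv` namespace, which is typical for OpenShift Virtualization; upstream CDI installs usually use the `cdi` namespace, so pass that as the third argument if it applies.

```shell
# Sketch: DataVolume status, CDI controller logs, and storage configuration.
# check_datavolume is a hypothetical helper; the CDI namespace is an assumption.
check_datavolume() {
  dv=$1; ns=$2; cdi_ns=${3:-openshift-cnv}
  # 1. Phase, progress, and conditions of the DataVolume and its PVC
  oc get datavolume "$dv" -n "$ns"
  oc describe datavolume "$dv" -n "$ns"
  oc get pvc -n "$ns"
  # 2. Recent CDI controller log lines
  oc logs -n "$cdi_ns" deployment/cdi-deployment --tail=100
  # 3. Storage classes and the CDI storage profiles derived from them
  oc get storageclass
  oc get storageprofile
}

# Example: check_datavolume <datavolume-name> <namespace>
```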
Solutions
# Debug DataVolume with verbose logging
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: debug-dv
  annotations:
    cdi.kubevirt.io/debug: "true"
spec:
  pvc:
    accessModes: [ReadWriteOnce]
    resources:
      requests:
        storage: 10Gi
  source:
    blank: {}
Storage Performance Issues
Symptoms
- Slow disk I/O
- High storage latency
- VM disk full errors
Troubleshooting Steps
- Check Storage Metrics
- Verify PVC Usage
- Monitor Storage Node Performance
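The following sketch shows one possible way to gather this data. The helper name `check_storage` is illustrative. It assumes the default OpenShift monitoring stack (pod `prometheus-k8s-0` in `openshift-monitoring`) and uses the kubelet volume metrics, which are typically reported only for filesystem-mode volumes.

```shell
# Sketch: PVC inventory, PVC fill ratio from Prometheus, and node usage.
# check_storage is a hypothetical helper; the Prometheus pod name is an assumption.
check_storage() {
  ns=$1
  # PVCs with their capacity, access modes, and storage class
  oc get pvc -n "$ns" -o wide
  # Used/capacity ratio per PVC (values close to 1 mean the volume is nearly full)
  oc exec -n openshift-monitoring -c prometheus prometheus-k8s-0 -- \
    curl -s http://localhost:9090/api/v1/query \
    --data-urlencode "query=kubelet_volume_stats_used_bytes{namespace=\"$ns\"} / kubelet_volume_stats_capacity_bytes{namespace=\"$ns\"}"
  # CPU and memory pressure on the nodes
  oc adm top nodes
}

# Example: check_storage <namespace>
```

Disk usage inside a guest is not visible to these metrics when the disk is a block-mode volume; in that case check from inside the VM.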
Monitoring Issues
Dynatrace Agent Problems
Symptoms
- Missing VM metrics in Dynatrace
- OneAgent not reporting data
- High resource usage by monitoring
Troubleshooting Steps
- Check OneAgent Status
- Verify VM Annotations
- Review Dynatrace Logs
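A sketch of these steps, assuming OneAgent is deployed through the Dynatrace Operator in a namespace named `dynatrace` (both are assumptions about the installation). The helper name `check_oneagent` is illustrative.

```shell
# Sketch: Dynatrace Operator and OneAgent state, plus the annotations on a VM.
# check_oneagent is a hypothetical helper; the default namespace is an assumption.
check_oneagent() {
  vm=$1; ns=$2; dt_ns=${3:-dynatrace}
  # 1. DynaKube resource status and the operator/OneAgent pods
  oc get dynakube -n "$dt_ns"
  oc get pods -n "$dt_ns" -o wide
  # 2. Annotations set on the VM template (compare with what your setup expects)
  oc get vm "$vm" -n "$ns" -o jsonpath='{.spec.template.metadata.annotations}{"\n"}'
  # 3. Recent operator log lines
  oc logs -n "$dt_ns" deployment/dynatrace-operator --tail=100
}

# Example: check_oneagent <vm-name> <namespace>
```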
Prometheus Metrics Missing
Symptoms
- Missing metrics in Grafana
- ServiceMonitor not working
- Prometheus targets down
Troubleshooting Steps
- Check ServiceMonitor Configuration
- Verify Metrics Endpoints
- Check Prometheus Targets
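These checks can be sketched as follows. The helper name `check_metrics` is illustrative, and the sketch assumes the platform Prometheus (`prometheus-k8s-0` in `openshift-monitoring`); workloads scraped by user workload monitoring live in `openshift-user-workload-monitoring` instead.

```shell
# Sketch: ServiceMonitor definition, the endpoints it should select, and
# the scrape targets Prometheus currently knows about.
# check_metrics is a hypothetical helper; the Prometheus pod name is an assumption.
check_metrics() {
  sm=$1; ns=$2
  # 1. Selector, port name, and path configured in the ServiceMonitor
  oc get servicemonitor "$sm" -n "$ns" -o yaml
  # 2. Endpoints in the namespace; the ServiceMonitor port must match a named port here
  oc get endpoints -n "$ns"
  # 3. Active and dropped targets, as JSON, from the Prometheus API
  oc exec -n openshift-monitoring -c prometheus prometheus-k8s-0 -- \
    curl -s http://localhost:9090/api/v1/targets
}

# Example: check_metrics <servicemonitor-name> <namespace>
```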
GitOps and Argo CD Issues
Application Sync Failures
Symptoms
- Applications stuck in "OutOfSync" state
- Sync operations failing
- Resource conflicts
Troubleshooting Steps
- Check Application Status
- Verify Git Repository Access
- Review Resource Conflicts
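With the `argocd` CLI, these steps might look like the sketch below. It assumes a session already established with `argocd login`; the helper name `check_argocd_app` is illustrative.

```shell
# Sketch: application status, repository connectivity, and live-vs-Git differences.
# check_argocd_app is a hypothetical helper name.
check_argocd_app() {
  app=$1
  # 1. Sync status, health, and any error conditions
  argocd app get "$app"
  # 2. Registered repositories and their connection state
  argocd repo list
  # 3. Differences between the live resources and the desired state in Git
  #    (exits non-zero when differences exist)
  argocd app diff "$app"
}

# Example: check_argocd_app <app-name>
```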
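Solutions
# Refresh the application from Git, then sync
# --force deletes and recreates resources that cannot be patched; use it with care
argocd app get <app-name> --refresh
argocd app sync <app-name> --force
# Restart the application's Deployments (rolling restart)
argocd app actions run <app-name> restart --kind Deployment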
Performance Issues
Cluster Resource Exhaustion
Symptoms
- High CPU/memory usage
- Pod evictions
- Slow response times
Troubleshooting Steps
- Identify Resource Consumers
- Check Node Capacity
- Review Resource Quotas
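A minimal sketch of these steps, assuming cluster metrics are available for `oc adm top`. The helper name `check_cluster_resources` is illustrative.

```shell
# Sketch: largest consumers, allocatable capacity per node, and quotas.
# check_cluster_resources is a hypothetical helper name.
check_cluster_resources() {
  # 1. Pods ordered by CPU and by memory usage
  oc adm top pods --all-namespaces --sort-by=cpu
  oc adm top pods --all-namespaces --sort-by=memory
  # 2. Allocatable CPU and memory on each node
  oc get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
  # 3. Quotas and how much of each is used
  oc get resourcequota --all-namespaces
}
```

Pipe the `top` output through `head` to keep only the largest consumers on busy clusters.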
VM Live Migration Issues
Symptoms
- Migration fails or takes too long
- VM downtime during migration
- Network connectivity loss
Troubleshooting Steps
- Check Migration Status
- Verify Node Compatibility
- Monitor Migration Progress
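The migration checks can be sketched as follows. The helper name `check_migration` is illustrative; `vmim` is the short name for VirtualMachineInstanceMigration objects.

```shell
# Sketch: migration objects, migration state on the VMI, and candidate nodes.
# check_migration is a hypothetical helper name.
check_migration() {
  vm=$1; ns=$2
  # 1. Migration objects in the namespace and their phase
  oc get vmim -n "$ns"
  #    Source/target node and timestamps of the most recent migration
  oc get vmi "$vm" -n "$ns" -o jsonpath='{.status.migrationState}{"\n"}'
  # 2. Whether KubeVirt considers the VMI migratable, with the reason if not
  oc get vmi "$vm" -n "$ns" -o jsonpath='{.status.conditions[?(@.type=="LiveMigratable")]}{"\n"}'
  #    Nodes that can currently accept VMs
  oc get nodes -l kubevirt.io/schedulable=true
}

# 3. Follow progress until the phase changes (interrupt with Ctrl+C):
#    oc get vmim -n <namespace> -w
```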
Emergency Procedures
Cluster Recovery
When Multiple Nodes Are Down
- Check etcd Health
- Restore from Backup
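The etcd health check can be sketched as below, assuming an OpenShift 4 cluster where the etcd pods in `openshift-etcd` carry the label `k8s-app=etcd` and include a container named `etcdctl`. The helper name `check_etcd` is illustrative. The restore itself is disruptive and is deliberately not scripted here; follow the Red Hat restore procedure for your OpenShift version.

```shell
# Sketch: etcd operator status and member health.
# check_etcd is a hypothetical helper; label and container names are assumptions.
check_etcd() {
  # Operator view: Available / Progressing / Degraded
  oc get clusteroperator etcd
  # One etcd pod per control plane node
  oc get pods -n openshift-etcd -l k8s-app=etcd -o wide
  # Ask one member for the health of every member in the cluster
  pod=$(oc get pods -n openshift-etcd -l k8s-app=etcd -o name | head -n 1)
  if [ -z "$pod" ]; then
    echo "no etcd pod found" >&2
    return 1
  fi
  oc exec -n openshift-etcd -c etcdctl "$pod" -- etcdctl endpoint health --cluster
}
```

If the API server is unreachable because quorum is lost, these commands will fail as well; that outcome is itself a signal to move on to the backup restore.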
VM Emergency Access
When VM Console Is Unresponsive
- Use virtctl
- Force VM Restart
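A sketch of both steps with `virtctl` follows. The helper name `restart_vm` and its `force` mode argument are illustrative. A forced restart skips the guest's graceful shutdown, so unsaved guest data may be lost.

```shell
# Try the serial console first (interactive; leave it with Ctrl+]):
#   virtctl console <vm-name> -n <namespace>

# Sketch: restart a VM gracefully, or forcefully when the guest does not respond.
# restart_vm is a hypothetical helper name.
restart_vm() {
  vm=$1; ns=$2; mode=${3:-graceful}
  if [ "$mode" = "force" ]; then
    # Terminates the guest immediately without waiting for shutdown
    virtctl restart "$vm" -n "$ns" --force --grace-period=0
  else
    virtctl restart "$vm" -n "$ns"
  fi
}

# Example: restart_vm <vm-name> <namespace> force
```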
Advanced Diagnostics
Debug Pod Creation
apiVersion: v1
kind: Pod
metadata:
  name: debug-tools
spec:
  containers:
  - name: debug
    image: registry.redhat.io/ubi8/ubi:latest
    command: ['sleep', '3600']
    securityContext:
      privileged: true
    volumeMounts:
    - name: host
      mountPath: /host
  volumes:
  - name: host
    hostPath:
      path: /
  nodeSelector:
    kubernetes.io/hostname: <node-name>
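As an alternative to maintaining the manifest above, `oc debug` can start a temporary privileged pod on a node with the host filesystem mounted at `/host`. The sketch below uses it to read recent kubelet logs. The helper name `host_kubelet_logs` is illustrative, and it assumes your account is allowed to run node debug pods.

```shell
# Sketch: read the last kubelet log lines from a node through a temporary debug pod.
# host_kubelet_logs is a hypothetical helper name.
host_kubelet_logs() {
  node=$1
  # chroot /host switches into the node's own filesystem so host tools are available
  oc debug "node/$node" -- chroot /host journalctl -u kubelet --no-pager -n 50
}

# Example: host_kubelet_logs <node-name>
```

The debug pod is removed automatically when the command exits, which avoids leaving a privileged pod running on the node.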
Log Collection Script
#!/bin/bash
# Comprehensive log collection script
# Optional argument: namespace whose pods are listed in detail (default: "default")
NAMESPACE=${1:-default}
OUTPUT_DIR="troubleshooting-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUTPUT_DIR"
# Cluster information
oc cluster-info > "$OUTPUT_DIR/cluster-info.txt"
oc get nodes -o wide > "$OUTPUT_DIR/nodes.txt"
oc get pods --all-namespaces > "$OUTPUT_DIR/all-pods.txt"
oc get pods -n "$NAMESPACE" -o wide > "$OUTPUT_DIR/namespace-pods.txt"
# VM specific information
oc get vm --all-namespaces -o yaml > "$OUTPUT_DIR/vms.yaml"
oc get vmi --all-namespaces -o yaml > "$OUTPUT_DIR/vmis.yaml"
oc get datavolume --all-namespaces -o yaml > "$OUTPUT_DIR/datavolumes.yaml"
# Events
oc get events --all-namespaces --sort-by='.lastTimestamp' > "$OUTPUT_DIR/events.txt"
# Logs from key components
# The CDI controller runs as deployment/cdi-deployment. On OpenShift Virtualization it
# usually shares the openshift-cnv namespace; upstream CDI installs use namespace "cdi".
oc logs -n openshift-cnv deployment/virt-controller > "$OUTPUT_DIR/virt-controller.log"
oc logs -n openshift-cnv deployment/virt-api > "$OUTPUT_DIR/virt-api.log"
oc logs -n openshift-cnv deployment/cdi-deployment > "$OUTPUT_DIR/cdi-controller.log"
# Package the results
tar -czf "$OUTPUT_DIR.tar.gz" "$OUTPUT_DIR"
echo "Logs collected in $OUTPUT_DIR.tar.gz"
Support and Escalation
When to Escalate
- Hardware failures
- Data corruption issues
- Security breaches
- Performance degradation > 50%
- Multiple component failures
Information to Gather
- Environment Details
  - OpenShift version
  - KubeVirt version
  - Cluster size and configuration
- Problem Description
  - Timeline of events
  - Error messages
  - Impact assessment
- Diagnostic Data
  - Logs (sanitized)
  - Configuration files
  - Resource utilization data
Support Contacts
- Red Hat Support: https://access.redhat.com/support/
- Community Forums: https://commons.openshift.org/
- KubeVirt Community: https://kubevirt.io/community/
Review and update this guide regularly so that it stays current with the RH OVE platform and with operational experience.