Day-2 Operations

Overview

This document covers the day-2 operational activities needed to maintain the multi-cluster RH OVE ecosystem. It gives guidelines for the management cluster and the application clusters: ongoing maintenance, upgrades, performance tuning, and routine operational tasks across the entire fleet.

Maintenance Tasks

Regular Cluster Health Checks

  • Node Status Monitoring: Regularly check node health and availability.

    oc get nodes -o wide
    

  • Resource Usage Monitoring: Monitor CPU, memory, and storage utilization.

    oc adm top nodes
    oc adm top pods --all-namespaces
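
The node status check above can be reduced to a list of nodes that need attention. The following is a minimal sketch, assuming the default column layout of oc get nodes (NAME first, STATUS second); the function name not_ready_nodes is hypothetical and not part of the oc CLI.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads the output of `oc get nodes --no-headers` on stdin.
# Cordoned nodes ("Ready,SchedulingDisabled") are reported as well,
# because they cannot accept new workloads.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage against a live cluster (requires a logged-in oc session):
#   oc get nodes --no-headers | not_ready_nodes
```

An empty result means every node reports Ready and is schedulable.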
    

Backup Management

  • Review Backup Logs: Confirm that each backup job completed and check the agent logs for anomalies.

    oc logs -n rubrik rubrik-agent-
    

  • Data Integrity Checks: Periodically verify backup integrity and accessibility.
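
Reviewing backup logs by eye does not scale across a fleet, so a simple filter can help surface suspicious lines. This is a rough sketch only: the keywords (error, fail, timeout) are assumptions about what an anomaly looks like, not a documented Rubrik log format, and the function name log_anomalies is hypothetical.

```shell
# Print log lines that mention an error, a failure, or a timeout
# (case-insensitive). Reads log text on stdin.
# `|| true` keeps the exit status at 0 when nothing matches, so the
# function is safe to use in scripts running under `set -e`.
log_anomalies() {
  grep -iE 'error|fail|timed? ?out' || true
}

# Usage: pipe the output of the `oc logs` command for the backup agent
# pod into the filter:
#   oc logs -n rubrik <agent-pod-name> | log_anomalies
```

The keyword list should be adjusted once the actual agent log format is known.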

Upgrades

OpenShift Cluster Upgrades

  • Plan Your Upgrade: Evaluate impact, and schedule during maintenance windows.
  • Review OpenShift Upgrade Guide

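  • In-place Upgrades: Use OpenShift's built-in upgrade mechanism to update cluster components. Run without arguments, oc adm upgrade only reports the current version and the available updates; the --to-latest flag starts the upgrade.

    oc adm upgrade                    # show current version and available updates
    oc adm upgrade --to-latest=true   # upgrade to the latest available version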
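
Before starting an upgrade it is worth confirming that every cluster operator is healthy. The sketch below assumes the default column layout of oc get clusteroperators (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED); the function name unhealthy_operators is hypothetical.

```shell
# Print the names of cluster operators that are not Available or are Degraded.
# Reads the output of `oc get clusteroperators --no-headers` on stdin.
# Column 3 is AVAILABLE and column 5 is DEGRADED.
unhealthy_operators() {
  awk '$3 != "True" || $5 != "False" { print $1 }'
}

# Usage against a live cluster (requires a logged-in oc session):
#   oc get clusteroperators --no-headers | unhealthy_operators
```

An empty result is a reasonable precondition for scheduling the upgrade; any listed operator should be investigated first.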
    

Component Upgrades

  • Operator Lifecycle Manager (OLM): Upgrade operators through OLM. Listing the installed ClusterServiceVersions shows which operator versions are currently running.

    oc get clusterserviceversions -n openshift-operators
    

  • KubeVirt Upgrades: Follow the KubeVirt upgrade process for virtualization components.

  • Refer to KubeVirt Upgrade Guide
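
When an operator subscription uses manual approval, OLM creates an InstallPlan that waits until someone approves it. The following sketch lists those pending plans, assuming the default column layout of oc get installplan -A (NAMESPACE, NAME, CSV, APPROVAL, APPROVED); the function name pending_installplans is hypothetical.

```shell
# Print "namespace/name" for every InstallPlan that requires manual
# approval and has not been approved yet.
# Reads the output of `oc get installplan -A --no-headers` on stdin.
pending_installplans() {
  awk '$4 == "Manual" && $5 == "false" { print $1 "/" $2 }'
}

# Usage against a live cluster (requires a logged-in oc session):
#   oc get installplan -A --no-headers | pending_installplans
#
# A pending plan can then be approved by setting spec.approved:
#   oc patch installplan <name> -n <namespace> --type merge \
#     --patch '{"spec":{"approved":true}}'
```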

Performance Tuning

Resource Balancing

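  • Node Selector and Affinity Rules: Use node selectors and affinity rules to control where workloads run and keep them balanced across nodes. The example below requires the scheduler to place the pod on a node labeled disktype=ssd.

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod
    spec:
      affinity:
        nodeAffinity:
          # Hard requirement: schedule only onto nodes labeled disktype=ssd
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      # Placeholder container (a Pod needs at least one); the image is hypothetical
      - name: example
        image: registry.example.com/example:latest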
    

  • Vertical and Horizontal Scaling: Utilize HPA and VPA for scaling applications.
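
To make the horizontal scaling behavior concrete, the sketch below reproduces the core HPA calculation, desired = ceil(currentReplicas × currentMetric / targetMetric). It is a simplification: it ignores the controller's tolerance band and the min/max replica bounds, and the function name hpa_desired_replicas is hypothetical.

```shell
# Compute the replica count the HPA would aim for, using integer
# arithmetic to round up: ceil(a / b) == (a + b - 1) / b.
# Arguments: current replicas, current metric value, target metric value
# (for CPU-based scaling, both metric values are utilization percentages).
hpa_desired_replicas() {
  current_replicas=$1
  current_metric=$2
  target_metric=$3
  echo $(( (current_replicas * current_metric + target_metric - 1) / target_metric ))
}

# Example: 4 replicas at 90% CPU with a 60% target
#   hpa_desired_replicas 4 90 60   # prints 6
#
# An HPA for a deployment (the name myapp is a placeholder) can be
# created with:
#   oc autoscale deployment/myapp --min=2 --max=6 --cpu-percent=75
```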

Network Optimization

  • Cilium Policy Management: Optimize and tune Cilium network policies for performance.

    apiVersion: cilium.io/v2
    kind: CiliumNetworkPolicy
    metadata:
      name: optimized-policy
    spec:
      endpointSelector:
        matchLabels:
          app: myapp
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: trusted
    

Security and Compliance

Regular Security Audits

  • Policy Compliance: Ensure adherence to Kyverno policies and security standards.

    oc get cpol -o yaml
    

  • Vulnerability Scans: Run regular vulnerability assessments on container images and hosts.

Documentation and Reporting

Keeping Documentation Up-to-Date

  • Change Logs: Maintain a changelog for all configurations and updates.

  • Operational Runbooks: Create and update runbooks for standard operations.

Performance and Utilization Reports

  • Use Metrics Dashboards: Build performance and utilization reports from Prometheus metrics and Grafana dashboards.

Conclusion

Following these day-2 operation guidelines helps maintain a stable, secure, and efficient RH OVE environment. Regular monitoring, updates, optimizations, and documentation ensure long-term success and reliability of the platform.