Global Architecture Overview

Overview

The RH OVE ecosystem is designed as a multi-cluster architecture that separates concerns between management operations and application workloads. This design provides scalability, security, and operational efficiency by dedicating specialized clusters for different purposes while maintaining centralized governance and oversight.

Architecture Principles

Separation of Concerns

  • Management Cluster: Centralized control plane for governance, policy, monitoring, and operations

  • Application Clusters: Dedicated workload execution environments for virtual machines and containers

  • Clear Boundaries: Well-defined interfaces and responsibilities between cluster types

Scalability and Growth

  • Horizontal Scaling: Add application clusters as demand grows

  • Regional Distribution: Deploy clusters across different geographic locations

  • Resource Optimization: Right-size clusters based on workload requirements

Security and Compliance

  • Zero Trust Architecture: Network-level security between clusters (a minimal policy sketch follows this list)

  • Centralized Policy Management: Consistent security policies across all clusters

  • Compliance Monitoring: Unified compliance reporting and auditing
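
In practice, the zero-trust principle starts with a default-deny posture on each application cluster. Below is a minimal sketch using the Cilium CNI already in the stack; the namespace and the monitoring-namespace allowance are illustrative assumptions, not part of the reference design.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-deny-allow-monitoring
  namespace: vm-production
spec:
  endpointSelector: {}            # applies to every endpoint in the namespace
  ingress:
  - fromEndpoints:
    - matchLabels:
        # Allow only traffic originating from the monitoring namespace;
        # once any ingress rule exists, all other ingress is denied.
        k8s:io.kubernetes.pod.namespace: monitoring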

Multi-Cluster Topology

graph TB
    subgraph "Management Cluster"
        subgraph "GitOps Platform"
            ARGO[ArgoCD Hub]
            GIT[Git Repositories]
        end

        subgraph "Policy & Security"
            RHACS[Red Hat Advanced Cluster Security]
            POL[Policy Engine - Kyverno]
        end

        subgraph "Multi-Cluster Management"
            RHACM[Red Hat Advanced Cluster Management]
            FLEET[Fleet Management]
        end

        subgraph "Observability Stack"
            PROM[Prometheus Federation]
            GRAF[Grafana Central]
            ALERT[AlertManager]
            LOG[Logging Aggregation]
        end

        subgraph "Backup & DR"
            RUBRIK[Rubrik Management]
            BACKUP[Backup Policies]
        end
    end

    subgraph "Application Cluster 1 - Production"
        subgraph "Virtualization Stack 1"
            OVE1[OpenShift Virtualization]
            VM1[Virtual Machines]
            CDI1[Containerized Data Importer]
        end

        subgraph "Networking 1"
            CIL1[Cilium CNI]
            MULT1[Multus Multi-Network]
            SRIOV1[SR-IOV Networks]
        end

        subgraph "Storage 1"
            CSI1[CSI Drivers]
            PV1[Persistent Volumes]
        end

        subgraph "Local Agents"
            ARGO1[ArgoCD Agent]
            RHACS1[RHACS Agent]
            MON1[Monitoring Agents]
        end
    end

    subgraph "Application Cluster 2 - Staging"
        subgraph "Virtualization Stack 2"
            OVE2[OpenShift Virtualization]
            VM2[Virtual Machines]
            CDI2[Containerized Data Importer]
        end

        subgraph "Networking 2"
            CIL2[Cilium CNI]
            MULT2[Multus Multi-Network]
            SRIOV2[SR-IOV Networks]
        end

        subgraph "Storage 2"
            CSI2[CSI Drivers]
            PV2[Persistent Volumes]
        end

        subgraph "Local Agents"
            ARGO2[ArgoCD Agent]
            RHACS2[RHACS Agent]
            MON2[Monitoring Agents]
        end
    end

    subgraph "Application Cluster N - Development"
        subgraph "Virtualization Stack N"
            OVEN[OpenShift Virtualization]
            VMN[Virtual Machines]
            CDIN[Containerized Data Importer]
        end

        subgraph "Networking N"
            CILN[Cilium CNI]
            MULTN[Multus Multi-Network]
            SRIOVN[SR-IOV Networks]
        end

        subgraph "Storage N"
            CSIN[CSI Drivers]
            PVN[Persistent Volumes]
        end

        subgraph "Local Agents"
            ARGON[ArgoCD Agent]
            RHACSN[RHACS Agent]
            MONN[Monitoring Agents]
        end
    end

    %% Management to Application Connections
    ARGO --> ARGO1
    ARGO --> ARGO2
    ARGO --> ARGON

    RHACM --> ARGO1
    RHACM --> ARGO2
    RHACM --> ARGON

    RHACS --> RHACS1
    RHACS --> RHACS2
    RHACS --> RHACSN

    PROM --> MON1
    PROM --> MON2
    PROM --> MONN

    RUBRIK --> PV1
    RUBRIK --> PV2
    RUBRIK --> PVN

    %% Git to ArgoCD
    GIT --> ARGO

    %% Policy Distribution
    POL --> RHACS1
    POL --> RHACS2
    POL --> RHACSN

Management Cluster Components

Core Management Services

Red Hat Advanced Cluster Management (RHACM)

apiVersion: operator.open-cluster-management.io/v1
kind: MultiClusterHub
metadata:
  name: multiclusterhub
  namespace: open-cluster-management
spec:
  availabilityConfig: High
  enableClusterBackup: true
  overrides:
    components:
    - name: multicluster-observability-operator
      enabled: true
    - name: cluster-lifecycle
      enabled: true
    - name: cluster-permission
      enabled: true

Responsibilities:

  • Cluster lifecycle management
  • Policy distribution and compliance
  • Application deployment coordination
  • Resource optimization across clusters
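
Policy distribution is typically expressed through the RHACM policy framework. The sketch below is a minimal, hypothetical example: it wraps a ConfigurationPolicy that checks for the vm-production namespace and binds it to every cluster labeled environment: production.

apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: require-vm-namespace
  namespace: open-cluster-management
spec:
  disabled: false
  policy-templates:
  - objectDefinition:
      apiVersion: policy.open-cluster-management.io/v1
      kind: ConfigurationPolicy
      metadata:
        name: require-vm-namespace
      spec:
        remediationAction: inform    # report only; switch to enforce to remediate
        severity: medium
        object-templates:
        - complianceType: musthave
          objectDefinition:
            apiVersion: v1
            kind: Namespace
            metadata:
              name: vm-production
---
apiVersion: policy.open-cluster-management.io/v1
kind: PlacementBinding
metadata:
  name: require-vm-namespace-binding
  namespace: open-cluster-management
placementRef:
  name: production-clusters
  kind: PlacementRule
  apiGroup: apps.open-cluster-management.io
subjects:
- name: require-vm-namespace
  kind: Policy
  apiGroup: policy.open-cluster-management.io
---
apiVersion: apps.open-cluster-management.io/v1
kind: PlacementRule
metadata:
  name: production-clusters
  namespace: open-cluster-management
spec:
  clusterSelector:
    matchLabels:
      environment: production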

ArgoCD Hub Configuration

apiVersion: argoproj.io/v1alpha1
kind: ArgoCD
metadata:
  name: argocd-hub
  namespace: argocd
spec:
  server:
    route:
      enabled: true
      tls:
        termination: reencrypt
    replicas: 3
  controller:
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 2
        memory: 4Gi
  dex:
    openShiftOAuth: true
  ha:
    enabled: true
  rbac:
    defaultPolicy: 'role:readonly'
    policy: |
      p, role:admin, applications, *, */*, allow
      p, role:admin, clusters, *, *, allow
      p, role:admin, repositories, *, *, allow
      g, argocd-admins, role:admin

Responsibilities:

  • GitOps workflow orchestration
  • Application deployment to target clusters
  • Configuration drift detection and remediation
  • Multi-cluster application synchronization
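
Multi-cluster synchronization from the hub is commonly driven by an ApplicationSet with the cluster generator, which stamps out one Application per registered cluster. A minimal sketch; the repository URL and path are placeholders.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: vm-workloads
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: production    # target only production clusters
  template:
    metadata:
      name: 'vm-workloads-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/rh-ove/vm-workloads.git   # placeholder
        targetRevision: main
        path: overlays/production
      destination:
        server: '{{server}}'         # filled in per generated cluster
        namespace: vm-production
      syncPolicy:
        automated:
          prune: true
          selfHeal: true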

Security and Compliance

Red Hat Advanced Cluster Security (RHACS)

apiVersion: platform.stackrox.io/v1alpha1
kind: Central
metadata:
  name: stackrox-central-services
  namespace: stackrox
spec:
  central:
    exposure:
      loadBalancer:
        enabled: true
    persistence:
      persistentVolumeClaim:
        claimName: central-db
    resources:
      requests:
        cpu: 1500m
        memory: 4Gi
      limits:
        cpu: 4000m
        memory: 8Gi
  scanner:
    resources:
      requests:
        cpu: 200m
        memory: 200Mi
      limits:
        cpu: 2000m
        memory: 4Gi

Responsibilities:

  • Centralized security policy management
  • Vulnerability scanning across clusters
  • Runtime threat detection
  • Compliance reporting and audit trails
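
Each application cluster connects back to Central through a SecuredCluster resource deployed by the RHACS operator. A minimal sketch, assuming an externally exposed Central endpoint (the hostname is a placeholder):

apiVersion: platform.stackrox.io/v1alpha1
kind: SecuredCluster
metadata:
  name: stackrox-secured-cluster-services
  namespace: stackrox
spec:
  clusterName: app-cluster-production
  centralEndpoint: central.stackrox.example.com:443   # placeholder endpoint
  admissionControl:
    listenOnCreates: true    # evaluate workloads at admission time
    listenOnUpdates: true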

Policy Engine (Kyverno)

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: multi-cluster-vm-policy
spec:
  validationFailureAction: enforce
  background: true
  rules:
  - name: require-vm-labels
    match:
      any:
      - resources:
          kinds:
          - VirtualMachine
    validate:
      message: "VMs must have required labels: environment, owner, backup-policy"
      pattern:
        metadata:
          labels:
            environment: "?*"
            owner: "?*"
            backup-policy: "?*"

Observability and Monitoring

Federated Prometheus Configuration

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-federation
  namespace: monitoring
spec:
  replicas: 3
  retention: 30d
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 500Gi
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      prometheus: federation
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml

Federation Configuration:

- job_name: 'federate-app-clusters'
  scrape_interval: 15s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"kubernetes-.*"}'
      - '{job=~"node-.*"}'
      - '{job=~"kubevirt-.*"}'
  static_configs:
    - targets:
      - 'prometheus-app-cluster-1.monitoring.svc.cluster.local:9090'
      - 'prometheus-app-cluster-2.monitoring.svc.cluster.local:9090'
      - 'prometheus-app-cluster-n.monitoring.svc.cluster.local:9090'
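
Two practical notes apply here. First, the svc.cluster.local targets assume the application-cluster Prometheus endpoints are reachable from the management cluster (for example via Submariner or exposed routes); in many deployments they would instead be externally exposed hostnames. Second, the Prometheus Operator loads this job from the Secret named under additionalScrapeConfigs, so the snippet above would be saved as prometheus-additional.yaml and created with:

kubectl create secret generic additional-scrape-configs \
  --from-file=prometheus-additional.yaml \
  -n monitoring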

Centralized Logging

apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: central-log-forwarder
  namespace: openshift-logging
spec:
  outputs:
  - name: central-elasticsearch
    type: elasticsearch
    url: https://elasticsearch-central.logging.svc.cluster.local:9200
    secret:
      name: elasticsearch-central-secret
  pipelines:
  - name: forward-app-logs
    inputRefs:
    - application
    - infrastructure
    - audit
    outputRefs:
    - central-elasticsearch

Application Cluster Architecture

Cluster Sizing and Resource Allocation

Production Cluster Profile

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-profile-production
data:
  profile: |
    cluster_type: production
    node_count: 12
    master_nodes: 3
    worker_nodes: 9
    storage_nodes: 3

    node_specifications:
      master:
        cpu: 16
        memory: 64Gi
        storage: 500Gi SSD
      worker:
        cpu: 32
        memory: 128Gi
        storage: 1Ti NVMe
      storage:
        cpu: 8
        memory: 32Gi
        storage: 4Ti SSD

    network_configuration:
      cni: cilium
      multi_network: multus
      sr_iov: enabled
      encryption: wireguard

    virtualization:
      kubevirt_version: "v1.1.0"
      nested_virtualization: true
      hugepages: 1Gi
      cpu_pinning: enabled

Staging/Development Cluster Profile

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-profile-staging
data:
  profile: |
    cluster_type: staging
    node_count: 6
    master_nodes: 3
    worker_nodes: 3

    node_specifications:
      master:
        cpu: 8
        memory: 32Gi
        storage: 200Gi SSD
      worker:
        cpu: 16
        memory: 64Gi
        storage: 500Gi SSD

    network_configuration:
      cni: cilium
      multi_network: multus
      sr_iov: optional
      encryption: ipsec

    virtualization:
      kubevirt_version: "v1.1.0"
      nested_virtualization: false
      hugepages: optional
      cpu_pinning: disabled

Virtualization Stack Configuration

OpenShift Virtualization Deployment

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  infra:
    nodePlacement:
      nodeSelector:
        node-role.kubernetes.io/worker: ""
  workloads:
    nodePlacement:
      nodeSelector:
        node-role.kubernetes.io/worker: ""
  featureGates:
    enableCommonBootImageImport: true
    deployTektonTaskResources: true
    enableApplicationAwareQuota: true
  configuration:
    network:
      networkBinding:
        plugins:
          macvtap: {}
          passt: {}
    virtualMachineOptions:
      disableFreePageReporting: false
      disableSerialConsoleLog: false

Multi-Network Configuration

Network Attachment Definitions for Different Environments

# Production Network Configuration
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: prod-management-network
  namespace: vm-production
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "prod-management-network",
      "type": "macvlan",
      "master": "ens192",
      "mode": "bridge",
      "ipam": {
        "type": "static"
      }
    }
---
# Staging Network Configuration
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: staging-management-network
  namespace: vm-staging
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "staging-management-network",
      "type": "macvlan",
      "master": "ens192",
      "mode": "bridge",
      "vlan": 100,
      "ipam": {
        "type": "dhcp"
      }
    }
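
A virtual machine attaches to one of these networks through a Multus interface declared in its template. The following sketch (VM name and resource sizing are illustrative) pairs the default pod network with the production management network defined above:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: mgmt-gateway
  namespace: vm-production
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          interfaces:
          - name: default
            masquerade: {}            # pod network via masquerade binding
          - name: management
            bridge: {}                # secondary interface on the NAD
        resources:
          requests:
            memory: 2Gi
      networks:
      - name: default
        pod: {}
      - name: management
        multus:
          networkName: prod-management-network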

Cluster Lifecycle Management

Cluster Provisioning Workflow

sequenceDiagram
    participant Admin as Platform Admin
    participant RHACM as RHACM Hub
    participant Git as Git Repository
    participant ArgoCD as ArgoCD Hub
    participant Cluster as New Cluster

    Admin->>Git: Commit cluster definition
    Git->>ArgoCD: Webhook trigger
    ArgoCD->>RHACM: Apply cluster manifest
    RHACM->>Cluster: Provision cluster
    Cluster->>RHACM: Registration
    RHACM->>ArgoCD: Cluster ready notification
    ArgoCD->>Cluster: Deploy applications
    Cluster->>Admin: Cluster operational

Cluster Template

apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: app-cluster-{{ .Values.environment }}-{{ .Values.region }}
  labels:
    environment: {{ .Values.environment }}
    region: {{ .Values.region }}
    cluster.open-cluster-management.io/clusterset: {{ .Values.clusterset }}
spec:
  hubAcceptsClient: true
  leaseDurationSeconds: 60
---
apiVersion: agent.open-cluster-management.io/v1
kind: KlusterletAddonConfig
metadata:
  name: app-cluster-{{ .Values.environment }}-{{ .Values.region }}
  namespace: app-cluster-{{ .Values.environment }}-{{ .Values.region }}
spec:
  clusterName: app-cluster-{{ .Values.environment }}-{{ .Values.region }}
  clusterNamespace: app-cluster-{{ .Values.environment }}-{{ .Values.region }}
  clusterLabels:
    environment: {{ .Values.environment }}
    region: {{ .Values.region }}
  applicationManager:
    enabled: true
  policyController:
    enabled: true
  searchCollector:
    enabled: true
  certPolicyController:
    enabled: true
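
Since the template is parameterized with Helm values, onboarding a new cluster amounts to rendering it with environment-specific values and committing the output to Git for ArgoCD to pick up. The chart name and path below are assumptions:

helm template cluster-onboarding ./charts/managed-cluster \
  --set environment=production \
  --set region=eu-west \
  --set clusterset=production-clusters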

Multi-Cluster Networking

Cluster Network Isolation

graph TB
    subgraph "Management Network - 10.0.0.0/16"
        MGT[Management Cluster]
        MGT_API[API Endpoints]
        MGT_MON[Monitoring Services]
    end

    subgraph "Production Network - 10.1.0.0/16"
        PROD[Production Cluster]
        PROD_VM[Production VMs]
        PROD_SVC[Production Services]
    end

    subgraph "Staging Network - 10.2.0.0/16"
        STAGE[Staging Cluster]
        STAGE_VM[Staging VMs]
        STAGE_SVC[Staging Services]
    end

    subgraph "Development Network - 10.3.0.0/16"
        DEV[Development Cluster]
        DEV_VM[Development VMs]
        DEV_SVC[Development Services]
    end

    subgraph "Shared Services Network - 10.254.0.0/16"
        DNS[DNS Services]
        NTP[NTP Services]
        LDAP[LDAP/AD Services]
        BACKUP[Backup Services]
    end

    %% Management connections
    MGT_API -.-> PROD
    MGT_API -.-> STAGE
    MGT_API -.-> DEV
    MGT_MON -.-> PROD
    MGT_MON -.-> STAGE
    MGT_MON -.-> DEV

    %% Shared services connections
    PROD -.-> DNS
    STAGE -.-> DNS
    DEV -.-> DNS
    PROD -.-> BACKUP
    STAGE -.-> BACKUP
    DEV -.-> BACKUP

Service Mesh Integration

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: cross-cluster-vm-service
spec:
  hosts:
  - vm-service.production.svc.cluster.local
  gateways:
  - mesh
  - cross-cluster-gateway
  http:
  - match:
    - headers:
        cluster:
          exact: staging
    route:
    - destination:
        host: vm-service.staging.svc.cluster.local
  - route:
    - destination:
        host: vm-service.production.svc.cluster.local

Disaster Recovery and Business Continuity

Multi-Cluster Backup Strategy

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: multi-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
    - vm-production
    - vm-staging
    - openshift-cnv
    excludedResources:
    - pods
    - replicasets
    snapshotVolumes: true
    ttl: 720h  # 30 days
    hooks:
      resources:
      - name: vm-backup-hook
        includedNamespaces:
        - vm-production
        - vm-staging
        labelSelector:
          matchLabels:
            backup.kubevirt.io/enable: "true"
        pre:
        - exec:
            container: virt-launcher
            command:
            - /bin/bash
            - -c
            - "virtctl freeze --namespace $NAMESPACE $VM_NAME"
        post:
        - exec:
            container: virt-launcher
            command:
            - /bin/bash
            - -c
            - "virtctl unfreeze --namespace $NAMESPACE $VM_NAME"

Cross-Cluster Failover

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: vm-workload-placement
  namespace: vm-production
spec:
  predicates:
  - requiredClusterSelector:
      labelSelector:
        matchLabels:
          environment: production
          region: primary
  - requiredClusterSelector:
      labelSelector:
        matchLabels:
          environment: production
          region: secondary
  numberOfClusters: 2
  prioritizerPolicy:
    mode: Additive
    configurations:
    - scoreCoordinate:
        type: BuiltIn
        builtIn: Steady
      weight: 1
    - scoreCoordinate:
        type: BuiltIn
        builtIn: ResourceAllocatableCPU
      weight: 1
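
The resulting PlacementDecision can be consumed by the ArgoCD hub through the cluster-decision-resource generator, so workloads follow whichever clusters the Placement currently selects. This sketch follows the documented RHACM/ArgoCD integration (the acm-placement ConfigMap is created by that integration); the repository details are placeholders.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: vm-failover-workloads
  namespace: argocd
spec:
  generators:
  - clusterDecisionResource:
      configMapRef: acm-placement
      labelSelector:
        matchLabels:
          cluster.open-cluster-management.io/placement: vm-workload-placement
      requeueAfterSeconds: 180       # re-evaluate placement every 3 minutes
  template:
    metadata:
      name: 'vm-workloads-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/rh-ove/vm-workloads.git   # placeholder
        targetRevision: main
        path: overlays/production
      destination:
        server: '{{server}}'
        namespace: vm-production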

Scalability and Performance

Cluster Auto-Scaling

apiVersion: machine.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-autoscaler
  namespace: openshift-machine-api
spec:
  minReplicas: 3
  maxReplicas: 20
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-machineset
---
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 50
    cores:
      min: 16
      max: 1000
    memory:
      min: 64Gi
      max: 4000Gi
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 10s
    delayAfterFailure: 30s
    unneededTime: 60s

VM Resource Management

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: scalable-vm-template
  namespace: vm-production
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        cpu:
          cores: 4
          sockets: 1
          threads: 1
        memory:
          guest: 8Gi
        resources:
          requests:
            cpu: 2
            memory: 4Gi
          limits:
            cpu: 4
            memory: 8Gi
        devices:
          autoattachPodInterface: false
          autoattachSerialConsole: true
          autoattachGraphicsDevice: true
      evictionStrategy: LiveMigrate
      terminationGracePeriodSeconds: 180
      nodeSelector:
        node-role.kubernetes.io/worker: ""
        vm-workload: "true"
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: vm.kubevirt.io/name
                  operator: Exists
              topologyKey: kubernetes.io/hostname

Operational Procedures

Day-2 Operations Workflow

graph TB
    subgraph "Management Operations"
        PATCH[Security Patches]
        UPDATE[Component Updates]
        SCALE[Capacity Scaling]
        BACKUP[Backup Verification]
    end

    subgraph "Application Operations"
        DEPLOY[VM Deployment]
        MIGRATE[VM Migration]
        MONITOR[Performance Monitoring]
        TROUBLESHOOT[Issue Resolution]
    end

    subgraph "Governance"
        POLICY[Policy Compliance]
        AUDIT[Security Audit]
        REPORT[Reporting]
        REVIEW[Architecture Review]
    end

    PATCH --> UPDATE
    UPDATE --> SCALE
    SCALE --> BACKUP

    DEPLOY --> MIGRATE
    MIGRATE --> MONITOR
    MONITOR --> TROUBLESHOOT

    POLICY --> AUDIT
    AUDIT --> REPORT
    REPORT --> REVIEW

    BACKUP -.-> DEPLOY
    TROUBLESHOOT -.-> POLICY
    REVIEW -.-> PATCH

Monitoring and Alerting

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: multi-cluster-alerts
  namespace: monitoring
spec:
  groups:
  - name: cluster.health
    rules:
    - alert: ClusterDown
      expr: up{job="kubernetes-apiservers"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Cluster {{ $labels.cluster }} is down"
        description: "Cluster {{ $labels.cluster }} has been down for more than 5 minutes"

    - alert: VMHighMemory
      expr: kubevirt_vmi_memory_used_bytes / kubevirt_vmi_memory_available_bytes > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "VM {{ $labels.name }} high memory usage"
        description: "VM {{ $labels.name }} in cluster {{ $labels.cluster }} has high memory usage"

    - alert: VMMigrationFailed
      expr: increase(kubevirt_vm_migration_failed_total[5m]) > 0
      labels:
        severity: critical
      annotations:
        summary: "VM migration failed"
        description: "VM migration failed in cluster {{ $labels.cluster }}"

Best Practices and Recommendations

Cluster Design Guidelines

  1. Resource Planning

     • Size clusters based on workload requirements
     • Plan for 20-30% overhead for system components
     • Consider NUMA topology for high-performance VMs

  2. Network Segmentation

     • Isolate management and data plane traffic
     • Use VLANs for multi-tenant environments
     • Implement east-west encryption

  3. Storage Strategy

     • Use local storage for high-performance workloads
     • Implement storage classes for different performance tiers (see the StorageClass sketch after this list)
     • Plan for backup and disaster recovery

  4. Security Architecture

     • Implement pod security standards
     • Use network policies for microsegmentation
     • Regular security scanning and compliance checks
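
The performance tiers from the storage guideline map naturally onto distinct StorageClasses. A sketch with an assumed CSI driver; the parameters are driver-specific and purely illustrative:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: tier-high-performance
provisioner: csi.example.com        # assumed CSI driver
parameters:
  mediaType: nvme                   # illustrative driver parameter
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: tier-standard
provisioner: csi.example.com        # assumed CSI driver
parameters:
  mediaType: ssd                    # illustrative driver parameter
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true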

Operational Excellence

  1. GitOps Workflow

     • All changes through version control
     • Automated testing and validation
     • Rollback capabilities

  2. Monitoring Strategy

     • Proactive alerting and monitoring
     • Centralized logging and metrics
     • Regular performance reviews

  3. Disaster Recovery

     • Regular backup testing
     • Cross-region replication
     • Documented recovery procedures

This global architecture overview describes how the RH OVE ecosystem scales across multiple clusters while maintaining centralized governance, security, and operational efficiency. The architecture supports growth from small deployments to large-scale multi-region installations while preserving consistent management and security practices.