Skip to content

ADR-007: Monitoring Strategy for RH OVE Ecosystem

Status

Accepted

Date

2024-12-01

Context

For the RH OVE multi-cluster setup, a comprehensive monitoring solution is necessary to ensure operational visibility, performance management, and incident response capability for both containerized and VM-based workloads.

Decision

Implement an integrated monitoring solution using Prometheus and Grafana for metrics collection and visualization, enhanced by Dynatrace for application performance monitoring and Hubble for network observability.

Rationale

Prometheus & Grafana

  1. Scalability: Native Kubernetes support, able to scale for large environments
  2. Flexibility: Customizable dashboards and extensibility with plugins
  3. Community Support: Active ecosystem with numerous exporters and integrations
  4. Real-time Metrics: Capable of handling thousands of unique time-series metrics
  5. Alerting: Integrated alert management with Prometheus Alertmanager

Dynatrace

  1. Full-Stack Monitoring: Covers both infrastructure and application layers
  2. AI-Powered Analytics: Automated anomaly detection and root cause analysis
  3. Cloud-Native Support: Strong support for Kubernetes and container environments
  4. Unified Observability: Centralized insights across microservices and legacy apps

Hubble

  1. eBPF-powered Network Insights: Detailed flow visibility and security audits
  2. High Throughput: Capable of capturing thousands of network flows per second
  3. Deployment Simplicity: Out-of-the-box integration with Cilium

Alternatives Considered

  1. OpenShift Monitoring Stack
  2. Pros: Native solution, well-integrated
  3. Cons: Lacks depth in application performance monitoring
  4. Rejected: Chosen instead for basic cluster health visibility

  5. Elastic Stack

  6. Pros: Full-text search capabilities
  7. Cons: Complexity and resource consumption
  8. Rejected: Simplified requirements focused on metrics

  9. DataDog

  10. Pros: Comprehensive feature set, SaaS model
  11. Cons: Cost concerns for large-scale deployment
  12. Rejected: Cost prohibitive compared to chosen solutions

Implementation Details

Prometheus Configuration

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: global-prometheus
  namespace: monitoring
spec:
  replicas: 3
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: observability
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi

Grafana Setup

  • Dashboards: Pre-configured dashboards for cluster health, application performance, VM metrics
  • Themes: Custom theming for alignment with corporate branding
  • User Access Control: Integrated with OAuth for SSO

Dynatrace Integration

  • Deployment of OneAgent across clusters for full-stack visibility
  • Integration with CI/CD pipelines for real-time performance feedback
  • Automated tagging for dynamic cloud workloads

Hubble Configuration

  • Enable flow aggregation and analysis for detailed network observability
  • Real-time flow filtering and visualization of network policies

Security and Compliance Considerations

  • Data Encryption: All telemetry data encrypted in transit
  • Role-Based Access Control: Segmented access to monitoring data
  • Compliance Monitoring: Automated checks for regulatory compliance
  • Audit Logging: Capture all configuration and access attempts

Consequences

Positive

  • Operational Efficiency: Reduce MTTR with real-time insights and alerting
  • Proactive Performance Management: Identify and resolve issues before impacting users
  • Unified Observability: Single-pane monitoring across clusters and applications

Negative

  • Complexity of Integration: Requires coordination across multiple tools
  • Resource Overhead: Higher costs in terms of storage and compute resources
  • Training Requirements: Teams need to become familiar with monitoring tools

Migration Strategy

Phase 1: Initial Setup and Configuration

  1. Deploy base Prometheus and Grafana setup in the management cluster
  2. Establish Dynatrace integration for application monitoring
  3. Enable Hubble for network flow visibility

Phase 2: Metrics and Dashboard Customization

  1. Design and implement custom dashboards for key performance indicators
  2. Configure alerting thresholds and incident response playbooks
  3. Integrate monitoring data with existing ITSM tools

Phase 3: Continuous Optimization

  1. Conduct regular review of metrics and dashboards for continuous improvement
  2. Leverage Dynatrace AI insights for proactive tuning and capacity planning
  3. Regularly assess network flow policies for efficiency and security

Monitoring and Metrics

Key Performance Indicators

  • CPU, memory, and storage utilization
  • Network latency and throughput
  • Application response times and error rates
  • VM and container health

Alerting Rules

  • Resource exhaustion (CPU, Memory, Storage)
  • Network policy violations
  • Anomalous application behavior

This robust monitoring strategy ensures RH OVE achieves operational excellence, rapid issue resolution, and strategic insight into both infrastructure performance and applications across the multi-cluster environment.