Use Case: End-to-End Application Observability in RH OVE¶
Business Context¶
In the Red Hat OpenShift Virtualization Engine (RH OVE) ecosystem, comprehensive observability is essential for monitoring both containerized applications and virtual machines, understanding performance bottlenecks, troubleshooting issues, and ensuring optimal resource utilization across hybrid workloads. This use case demonstrates two complementary approaches: native OpenShift observability tools and integration with Dynatrace for enterprise-grade observability.
What Developers Need to Expose¶
For effective end-to-end observability, developers must instrument their applications to expose:
Required Metrics¶
- Business Metrics: Transaction counts, success rates, revenue metrics
- Application Metrics: Response times, error rates, throughput
- Resource Metrics: CPU, memory, disk I/O, network usage
- Custom Metrics: Domain-specific KPIs and performance indicators
Required Traces¶
- Request Traces: End-to-end request flow across microservices
- Database Traces: SQL queries and database connection metrics
- External Service Traces: API calls to third-party services
- Async Operations: Message queue operations, background jobs
Required Logs¶
- Structured Logs: JSON formatted with consistent fields
- Error Logs: Exception details with stack traces
- Audit Logs: Security and compliance events
- Performance Logs: Slow queries, long-running operations
Health Endpoints¶
- Liveness Probe: /health/live - application is running
- Readiness Probe: /health/ready - application is ready to serve traffic
- Metrics Endpoint: /metrics - Prometheus-formatted metrics
- Info Endpoint: /info - application version and build information
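These endpoints map directly onto Kubernetes probe and port definitions. The following is a minimal sketch of the pod-template container fragment, assuming the application from the later examples listens on port 8080; the myapp name and image are placeholders.

# Excerpt from the myapp Deployment pod template (names, image, and port are assumptions)
containers:
- name: myapp
  image: registry.example.com/myapp:1.0.0
  ports:
  - name: metrics            # named port later referenced by the ServiceMonitor
    containerPort: 8080
  livenessProbe:
    httpGet:
      path: /health/live
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 10
  readinessProbe:
    httpGet:
      path: /health/ready
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 5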
1. Native OpenShift Observability¶
Infrastructure Requirements¶
- OpenShift 4.12+ with built-in monitoring stack
- OpenShift Data Foundation for persistent storage
- Red Hat OpenShift Logging (based on Loki)
- Red Hat OpenShift distributed tracing (Jaeger)
- Cilium Hubble for network observability
- KubeVirt monitoring for VM workloads
Architecture Overview¶
graph TD
subgraph "RH OVE Application Layer"
CONTAINER_APPS["Container Applications"]
VM_WORKLOADS["VM Workloads"]
HYBRID_APPS["Hybrid Applications"]
end
subgraph "OpenShift Native Observability Stack"
OCP_PROMETHEUS["OpenShift Prometheus"]
OCP_GRAFANA["OpenShift Console & Grafana"]
OCP_JAEGER["Red Hat OpenShift distributed tracing"]
OCP_LOKI["Red Hat OpenShift Logging"]
CILIUM_HUBBLE["Cilium Hubble"]
KUBEVIRT_METRICS["KubeVirt Metrics"]
end
subgraph "Storage & Processing"
ODF_STORAGE["OpenShift Data Foundation"]
ALERTMANAGER["AlertManager"]
end
CONTAINER_APPS --> OCP_PROMETHEUS
VM_WORKLOADS --> KUBEVIRT_METRICS
HYBRID_APPS --> OCP_PROMETHEUS
CONTAINER_APPS --> OCP_JAEGER
HYBRID_APPS --> OCP_JAEGER
CONTAINER_APPS --> OCP_LOKI
VM_WORKLOADS --> OCP_LOKI
HYBRID_APPS --> OCP_LOKI
CILIUM_HUBBLE --> OCP_PROMETHEUS
KUBEVIRT_METRICS --> OCP_PROMETHEUS
OCP_PROMETHEUS --> OCP_GRAFANA
OCP_JAEGER --> OCP_GRAFANA
OCP_LOKI --> OCP_GRAFANA
OCP_PROMETHEUS --> ALERTMANAGER
OCP_PROMETHEUS --> ODF_STORAGE
OCP_LOKI --> ODF_STORAGE
style OCP_PROMETHEUS fill:#f9f,stroke:#333
style OCP_GRAFANA fill:#99f,stroke:#333
style OCP_JAEGER fill:#9ff,stroke:#333
style OCP_LOKI fill:#ff9,stroke:#333
style CILIUM_HUBBLE fill:#f99,stroke:#333
Implementation Steps¶
Step 1: Enable OpenShift Built-in Monitoring¶
Configure User Workload Monitoring¶
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 30d
      volumeClaimTemplate:
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 100Gi
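The ConfigMap above enables user workload monitoring; retention and storage for the user workload Prometheus instance are configured separately in the openshift-user-workload-monitoring namespace. A sketch, with the retention period, storage class, and size chosen as example values:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 30d
      volumeClaimTemplate:
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 50Gi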
Step 2: Application Instrumentation for Container Applications¶
Comprehensive Metrics Configuration (Go Example)¶
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Business metrics
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "myapp_http_requests_total",
			Help: "Total number of HTTP requests by status code and method",
		},
		[]string{"method", "status_code", "endpoint"},
	)

	// Performance metrics
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "myapp_http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)

	// Resource metrics
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "myapp_active_connections",
			Help: "Number of active connections",
		},
	)

	// Custom business metrics
	ordersProcessed = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "myapp_orders_processed_total",
			Help: "Total number of orders processed",
		},
		[]string{"status"},
	)
)

func instrumentHandler(next http.HandlerFunc, endpoint string) http.HandlerFunc {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// Process request
		next.ServeHTTP(w, r)

		// Record metrics (the status code is hardcoded to 200 here for brevity;
		// wrap the ResponseWriter to capture the real status code in production)
		duration := time.Since(start).Seconds()
		httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
		httpRequestsTotal.WithLabelValues(r.Method, "200", endpoint).Inc()
	})
}

// ordersHandler is a minimal stub for the business endpoint.
func ordersHandler(w http.ResponseWriter, r *http.Request) {
	ordersProcessed.WithLabelValues("success").Inc()
	w.Write([]byte(`{"status": "processed"}`))
}

func main() {
	// Health endpoints
	http.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("alive"))
	})

	http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		// Check dependencies (DB, external services)
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ready"))
	})

	// Metrics endpoint
	http.Handle("/metrics", promhttp.Handler())

	// Business endpoints with instrumentation
	http.HandleFunc("/api/orders", instrumentHandler(ordersHandler, "/api/orders"))

	http.ListenAndServe(":8080", nil)
}
Distributed Tracing Configuration (Node.js Example)¶
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

// Configure Jaeger exporter for OpenShift distributed tracing
const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger-collector.openshift-distributed-tracing-system.svc.cluster.local:14268/api/traces',
});

// Initialize OpenTelemetry SDK
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'myapp-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
  }),
  traceExporter: jaegerExporter,
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

const express = require('express');
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const app = express();

// Custom tracing for business operations
app.get('/api/orders/:id', async (req, res) => {
  const tracer = trace.getTracer('myapp');

  await tracer.startActiveSpan('process_order', async (span) => {
    try {
      // Add custom attributes
      span.setAttributes({
        'order.id': req.params.id,
        'user.id': req.headers['user-id'],
        'operation.type': 'order_processing'
      });

      // Simulate database call with tracing
      await tracer.startActiveSpan('database_query', async (dbSpan) => {
        // Database operation
        dbSpan.setAttributes({
          'db.operation': 'SELECT',
          'db.table': 'orders'
        });
        dbSpan.end();
      });

      // Simulate external API call
      await tracer.startActiveSpan('external_api_call', async (apiSpan) => {
        apiSpan.setAttributes({
          'http.method': 'POST',
          'http.url': 'https://payment-service/process'
        });
        apiSpan.end();
      });

      res.json({ orderId: req.params.id, status: 'processed' });
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      res.status(500).json({ error: 'Processing failed' });
    } finally {
      span.end();
    }
  });
});
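The collector endpoint is hardcoded above; in practice it is usually injected through the Deployment so the same image works in every environment. A sketch of the container env, assuming the exporter is constructed without an explicit endpoint so that the standard OTEL_EXPORTER_JAEGER_ENDPOINT variable takes effect (values are examples):

# Deployment container fragment (values are examples)
env:
- name: OTEL_EXPORTER_JAEGER_ENDPOINT
  value: "http://jaeger-collector.openshift-distributed-tracing-system.svc.cluster.local:14268/api/traces"
- name: APP_VERSION
  value: "1.0.0"
- name: NODE_ENV
  value: "production"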
Structured Logging Configuration¶
# Python example with structured logging for OpenShift Logging
import logging
import json
import sys
import traceback
from datetime import datetime

class StructuredLogger:
    def __init__(self, service_name):
        self.service_name = service_name
        self.logger = logging.getLogger(service_name)
        self.logger.setLevel(logging.INFO)
        # Configure JSON formatter for OpenShift Logging
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(self.JsonFormatter())
        self.logger.addHandler(handler)

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            log_entry = {
                'timestamp': datetime.utcnow().isoformat() + 'Z',
                'level': record.levelname,
                'service': record.name,
                'message': record.getMessage(),
            }
            # Add custom fields if present
            if hasattr(record, 'user_id'):
                log_entry['user_id'] = record.user_id
            if hasattr(record, 'trace_id'):
                log_entry['trace_id'] = record.trace_id
            if hasattr(record, 'span_id'):
                log_entry['span_id'] = record.span_id
            return json.dumps(log_entry)

    def info(self, message, **kwargs):
        self.logger.info(message, extra=dict(kwargs))

    def error(self, message, **kwargs):
        self.logger.error(message, extra=dict(kwargs))

# Usage in application
logger = StructuredLogger('myapp-service')

def process_order(order_id, user_id):
    logger.info(
        "Processing order",
        user_id=user_id,
        order_id=order_id,
        operation='order_processing'
    )
    try:
        # Business logic
        result = do_business_logic()
        logger.info(
            "Order processed successfully",
            user_id=user_id,
            order_id=order_id,
            result=result
        )
    except Exception as e:
        logger.error(
            "Order processing failed",
            user_id=user_id,
            order_id=order_id,
            error=str(e),
            stack_trace=traceback.format_exc()
        )
        raise
Step 3: Configure OpenShift Native Observability Components¶
Enable Red Hat OpenShift Logging¶
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Managed
  logStore:
    type: lokistack
    lokistack:
      name: logging-loki
  collection:
    type: vector
    vector:
      resources:
        limits:
          memory: 1Gi
        requests:
          memory: 512Mi
  visualization:
    type: ocp-console
---
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.small
  storage:
    schemas:
    - version: v12
      effectiveDate: '2022-06-01'
    secret:
      name: logging-loki-s3
      type: s3
  storageClassName: ocs-storagecluster-ceph-rbd
  tenants:
    mode: openshift-logging
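With the LokiStack in place, application logs can be routed explicitly with a ClusterLogForwarder. A minimal sketch that forwards all application logs to the default LokiStack store; more selective pipelines can filter by namespace or add labels:

apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  pipelines:
  - name: application-logs
    inputRefs:
    - application
    outputRefs:
    - default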
Deploy Red Hat OpenShift distributed tracing¶
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-production
  namespace: openshift-distributed-tracing-system
spec:
  strategy: production
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      storage:
        storageClassName: ocs-storagecluster-ceph-rbd
        size: 100Gi
      resources:
        requests:
          memory: 4Gi
          cpu: 1
        limits:
          memory: 4Gi
          cpu: 1
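Applications can either send spans directly to the collector service (as in the Node.js example above) or let the Jaeger Operator inject an agent sidecar. A sketch of the sidecar-injection annotation on a Deployment; the myapp names and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: myapp-namespace
  annotations:
    sidecar.jaegertracing.io/inject: "true"   # Jaeger Operator injects an agent sidecar
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.0.0
        ports:
        - containerPort: 8080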
Step 4: Configure Application Monitoring¶
ServiceMonitor for Application Metrics¶
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-metrics
  namespace: myapp-namespace
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    honorLabels: true
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-metrics
  namespace: myapp-namespace
  labels:
    app: myapp
spec:
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
  selector:
    app: myapp
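If the application namespace enforces NetworkPolicies (common in a multi-tenant RH OVE design), the user workload Prometheus must be allowed to reach the metrics port. A sketch, assuming the standard kubernetes.io/metadata.name namespace label and the 8080 metrics port used above:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scraping
  namespace: myapp-namespace
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-user-workload-monitoring
    ports:
    - protocol: TCP
      port: 8080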
PrometheusRule for Custom Alerts¶
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: myapp-namespace
spec:
  groups:
  - name: myapp.rules
    rules:
    - alert: MyAppHighErrorRate
      expr: |
        (
          sum(rate(myapp_http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(myapp_http_requests_total[5m]))
        ) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate detected in MyApp"
        description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"
    - alert: MyAppHighLatency
      expr: |
        histogram_quantile(0.95,
          sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le)
        ) > 1.0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected in MyApp"
        description: "95th percentile latency is {{ $value }}s"
    - alert: MyAppPodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total{namespace="myapp-namespace"}[15m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "MyApp pod is crash looping"
        description: "Pod {{ $labels.pod }} is restarting frequently"
Step 5: VM Workload Monitoring¶
KubeVirt VM Monitoring¶
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubevirt-vm-metrics
  namespace: kubevirt-system
spec:
  selector:
    matchLabels:
      prometheus.kubevirt.io: "true"
  endpoints:
  - port: metrics
    interval: 30s
    honorLabels: true
---
# VM-specific PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vm-alerts
  namespace: vm-workloads
spec:
  groups:
  - name: vm.rules
    rules:
    - alert: VMHighCPUUsage
      expr: rate(kubevirt_vmi_vcpu_seconds_total[5m]) > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "VM {{ $labels.name }} has high CPU usage"
    - alert: VMHighMemoryUsage
      expr: |
        (
          kubevirt_vmi_memory_resident_bytes
          /
          kubevirt_vmi_memory_maximum_bytes
        ) > 0.9
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "VM {{ $labels.name }} has high memory usage"
Step 6: Network Observability with Cilium Hubble¶
Enable Cilium Hubble¶
apiVersion: cilium.io/v2alpha1
kind: CiliumConfig
metadata:
  name: cilium-config
  namespace: cilium-system
spec:
  hubble:
    enabled: true
    metrics:
      enabled:
      - dns:query;ignoreAAAA
      - drop
      - tcp
      - flow
      - icmp
      - http
    relay:
      enabled: true
    ui:
      enabled: true
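Hubble's flow metrics become useful for dashboards and alerting once Prometheus scrapes them. A sketch of a ServiceMonitor for the Hubble metrics service; the label selector, namespace, and port name depend on how Cilium was installed and are assumptions here:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hubble-metrics
  namespace: cilium-system
spec:
  selector:
    matchLabels:
      k8s-app: hubble            # label assumed; verify against the installed chart
  endpoints:
  - port: hubble-metrics         # port name assumed
    interval: 30s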
Best Practices for Native Observability¶
- Consistent Labeling: Use standardized labels across all metrics (service, version, environment)
- Cardinality Management: Avoid high-cardinality labels that can overwhelm Prometheus
- Sampling Strategy: Implement trace sampling for high-traffic applications (1-10% sample rate); see the configuration sketch after this list
- Log Levels: Use appropriate log levels and structured logging with consistent fields
- Resource Limits: Set appropriate resource limits for observability components
- Retention Policies: Configure appropriate retention for metrics (30d) and logs (7d for debug, 30d for info/error)
- Alert Fatigue: Create meaningful alerts with proper thresholds and runbooks
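For the sampling-strategy practice above, the sampling rate can be managed centrally on the Jaeger instance rather than in each application. A sketch that adds a probabilistic default strategy to the production Jaeger CR from Step 3; the 10% rate is an example value:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-production
  namespace: openshift-distributed-tracing-system
spec:
  strategy: production
  sampling:
    options:
      default_strategy:
        type: probabilistic
        param: 0.1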
2. Observability with Dynatrace¶
Infrastructure Requirements¶
- Dynatrace OneAgent deployed on OpenShift nodes
- Dynatrace SaaS or Managed account
- Network access to Dynatrace monitoring endpoints
Architecture Overview¶
graph TD
subgraph "Application Layer"
APP1["Microservice Application"]
end
subgraph "Dynatrace Observability"
ONEAGENT["Dynatrace OneAgent"]
DT_SAAS["Dynatrace SaaS"]
end
APP1 -- data --> ONEAGENT
ONEAGENT --> DT_SAAS
style ONEAGENT fill:#f9f,stroke:#333
style DT_SAAS fill:#99f,stroke:#333
Implementation Steps¶
Step 1: Deploy Dynatrace OneAgent¶
- Use the Dynatrace Operator for OpenShift to deploy OneAgent.
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: dynakube
  namespace: dynatrace
spec:
  apiUrl: "https://<environment-id>.live.dynatrace.com/api"
  # Name of the secret holding the Dynatrace tokens (see the secret sketch below)
  tokens: "api-token"
  oneAgent:
    classicFullStack: {}
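The DynaKube references a secret that holds the Dynatrace API token. A sketch of that secret with a placeholder value; depending on the operator version, additional keys such as dataIngestToken may also be required:

apiVersion: v1
kind: Secret
metadata:
  name: api-token              # must match spec.tokens in the DynaKube above
  namespace: dynatrace
type: Opaque
stringData:
  apiToken: "<dynatrace-api-token>"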
Step 2: Application Configuration¶
- No application code changes are required; OneAgent automatically instruments supported technologies at the process level.
Step 3: Monitor and Analyze¶
- Use Dynatrace dashboards for comprehensive observability and performance analysis.
- Implement AI-driven alerts for proactive issue detection.
Best Practices¶
- Network Connectivity: Verify outbound connectivity from cluster nodes to the Dynatrace environment endpoints (SaaS or Managed).
- Optimize Resource Allocation: Ensure sufficient resources for OneAgent processing.
- Leverage Dynatrace AI: Utilize Dynatrace's AI capabilities for automated root cause analysis.
This comprehensive guide provides both native and third-party observability solutions, enabling holistic insights into application performance and behavior within the RH OVE ecosystem.