Skip to content

System Architecture

This document provides a comprehensive overview of the Temporal.io enterprise deployment system architecture, designed for production Kubernetes environments with enterprise-grade requirements.

Overview

The Temporal.io deployment follows a microservices architecture pattern with clear separation of concerns, high availability, and scalability built-in from the ground up. The system is designed to handle enterprise workloads while maintaining security, observability, and operational excellence.

High-Level Architecture

graph TB
    subgraph "External Layer"
        EXT1[API Manager<br/>Gravitee.io]
        EXT2[Load Balancer<br/>NGINX/HAProxy]
        EXT3[CDN<br/>CloudFlare]
        EXT4[External Monitoring<br/>Datadog/New Relic]
    end

    subgraph "Security Layer"
        SEC1[WAF<br/>Web Application Firewall]
        SEC2[SSO Provider<br/>Authentik]
        SEC3[Secrets Management<br/>HashiCorp Vault]
        SEC4[Certificate Management<br/>cert-manager]
    end

    subgraph "Kubernetes Cluster"
        subgraph "Ingress Layer"
            ING1[Ingress Controller<br/>NGINX/Traefik]
            ING2[Service Mesh<br/>Istio/Linkerd]
        end

        subgraph "temporal-backend Namespace"
            TB1[Temporal Server<br/>Frontend Service]
            TB2[Temporal Server<br/>History Service]
            TB3[Temporal Server<br/>Matching Service]
            TB4[Temporal Server<br/>Worker Service]
            TB5[Temporal Web UI]
            TB6[Admin Tools]
        end

        subgraph "temporal-product Namespace"
            TP1[Business Workers<br/>Python/Go]
            TP2[FastAPI Services<br/>REST APIs]
            TP3[Background Jobs<br/>Schedulers]
        end

        subgraph "Data Layer"
            DB1[PostgreSQL Primary<br/>Persistence Store]
            DB2[PostgreSQL Replica<br/>Read Replicas]
            DB3[Elasticsearch<br/>Visibility Store]
            DB4[Redis<br/>Caching Layer]
        end

        subgraph "Monitoring Layer"
            MON1[Prometheus<br/>Metrics Collection]
            MON2[Grafana<br/>Dashboards]
            MON3[Jaeger<br/>Distributed Tracing]
            MON4[Fluent Bit<br/>Log Collection]
            MON5[OpenTelemetry<br/>Observability]
        end

        subgraph "Infrastructure Services"
            INF1[ArgoCD<br/>GitOps Controller]
            INF2[External Secrets<br/>Secrets Sync]
            INF3[Backup Controller<br/>Velero]
        end
    end

    subgraph "External Dependencies"
        DEP1[GitLab<br/>Source Control & CI/CD]
        DEP2[JFrog Artifactory<br/>Artifact Repository]
        DEP3[External Database<br/>Cloud SQL/RDS]
        DEP4[Object Storage<br/>S3/GCS/Azure Blob]
    end

    EXT1 --> SEC1
    SEC1 --> ING1
    EXT2 --> ING1
    SEC2 --> TB1
    SEC3 --> INF2

    ING1 --> TB5
    ING1 --> TP2

    TB1 --> TB2
    TB1 --> TB3
    TB1 --> TB4

    TP1 --> TB1
    TP2 --> TB1

    TB2 --> DB1
    TB3 --> DB1
    TB4 --> DB1
    TB1 --> DB3

    MON1 --> TB1
    MON1 --> TP1
    MON2 --> MON1
    MON3 --> TB1

    INF1 --> DEP1
    INF2 --> SEC3
    DEP2 --> TP1

Architecture Principles

1. Separation of Concerns

  • Control Plane: Temporal server components (frontend, history, matching, worker)
  • Data Plane: Business applications and workers
  • Infrastructure Plane: Monitoring, security, and operational tools

2. High Availability

  • Multi-node Kubernetes cluster with zone distribution
  • Database replication and failover capabilities
  • Load balancing across all service instances
  • Circuit breakers and retry mechanisms

3. Scalability

  • Horizontal scaling for all Temporal components
  • Auto-scaling based on metrics (CPU, memory, custom metrics)
  • Partitioned databases with sharding support
  • Queue-based task distribution

4. Security by Design

  • Zero-trust network architecture
  • End-to-end encryption (TLS 1.3)
  • Identity and access management integration
  • Secrets management with rotation
  • Network segmentation with policies

5. Observability

  • Comprehensive metrics collection
  • Distributed tracing
  • Structured logging
  • Real-time monitoring and alerting

Component Architecture

Temporal Server Components

Frontend Service

Component: temporal-frontend
Purpose: gRPC API endpoint for client connections
Responsibilities:
  - Client request handling
  - Authentication and authorization
  - Request routing and load balancing
  - Rate limiting and throttling
Scaling: Horizontal (3+ instances)
Dependencies: Database, Elasticsearch

History Service

Component: temporal-history
Purpose: Workflow execution state management
Responsibilities:
  - Workflow state persistence
  - Event history management
  - Decision task processing
  - Timer management
Scaling: Horizontal with sharding (512 shards)
Dependencies: Database (primary dependency)

Matching Service

Component: temporal-matching
Purpose: Task queue management and distribution
Responsibilities:
  - Task queue operations
  - Task routing to workers
  - Load balancing across workers
  - Sticky worker assignments
Scaling: Horizontal (2+ instances)
Dependencies: Database

Worker Service

Component: temporal-worker
Purpose: Internal system operations
Responsibilities:
  - System workflow execution
  - Archival operations
  - Replication tasks
  - System maintenance
Scaling: Horizontal (1+ instances)
Dependencies: Database, Object Storage

Business Application Layer

Temporal Workers

Component: business-workers
Technology: Python/Go applications
Purpose: Execute business workflows and activities
Characteristics:
  - Stateless execution
  - Auto-scaling based on queue depth
  - Circuit breaker patterns
  - Health monitoring
Deployment: Kubernetes Deployment with HPA

API Services

Component: fastapi-services
Technology: Python FastAPI
Purpose: REST API endpoints for business operations
Characteristics:
  - Async/await patterns
  - Database connection pooling
  - Caching layer integration
  - Rate limiting
Deployment: Kubernetes Deployment with Ingress

Data Architecture

Primary Database (PostgreSQL)

Temporal Default Store

-- Core Temporal tables
Tables:
  - executions: Workflow execution state
  - history_tree: Workflow history events
  - tasks: Task queue items
  - timers: Scheduled operations
  - activity_info: Activity execution state
  - child_execution_info: Child workflow tracking

Temporal Visibility Store

-- Search and filtering capabilities
Tables:
  - executions_visibility: Searchable execution data
  - workflow_search_attributes: Custom search fields
Indexes:
  - Execution time ranges
  - Workflow types
  - Custom search attributes

Search Layer (Elasticsearch)

Advanced Visibility

{
  "temporal_visibility_v1_prod": {
    "mappings": {
      "properties": {
        "WorkflowId": {"type": "keyword"},
        "WorkflowType": {"type": "keyword"},
        "StartTime": {"type": "date"},
        "CloseTime": {"type": "date"},
        "ExecutionStatus": {"type": "keyword"},
        "CustomSearchAttributes": {"type": "object"}
      }
    }
  }
}

Network Architecture

Namespace Segmentation

Namespaces:
  temporal-backend:
    purpose: Temporal server components and infrastructure
    network_policy: restricted_ingress_egress
    resources: high_priority

  temporal-product:
    purpose: Business applications and workers
    network_policy: restricted_egress_to_backend
    resources: auto_scaling

  monitoring:
    purpose: Observability stack
    network_policy: metrics_collection_only
    resources: persistent_storage

  security:
    purpose: Security tools and certificate management
    network_policy: cluster_wide_access
    resources: minimal

Service Communication

Internal Communication

  • gRPC: Temporal client-server communication
  • HTTP/REST: Web UI and API services
  • Database Protocol: PostgreSQL native protocol
  • HTTP: Elasticsearch REST API

External Communication

  • HTTPS: All external traffic (TLS 1.3)
  • mTLS: Service-to-service communication
  • gRPC-TLS: Temporal client connections

Deployment Architecture

Multi-Environment Strategy

Environments:
  development:
    cluster_size: 3_nodes
    database: single_instance
    monitoring: basic
    security: development_tls

  staging:
    cluster_size: 6_nodes
    database: replica_setup
    monitoring: full_stack
    security: production_like

  production:
    cluster_size: 12_nodes
    database: ha_cluster
    monitoring: enterprise_grade
    security: zero_trust

Resource Distribution

Node Classification

Node Types:
  control-plane:
    count: 3
    purpose: Kubernetes control plane
    taints: NoSchedule

  temporal-backend:
    count: 4
    purpose: Temporal server components
    labels: tier=backend
    resources: cpu_optimized

  temporal-workers:
    count: 4
    purpose: Business application workers
    labels: tier=workers
    resources: memory_optimized

  data-layer:
    count: 3
    purpose: Database and storage
    labels: tier=data
    resources: storage_optimized

  monitoring:
    count: 2
    purpose: Observability stack
    labels: tier=monitoring
    resources: balanced

Integration Patterns

Event-Driven Architecture

sequenceDiagram
    participant Client
    participant API_Gateway
    participant FastAPI
    participant Temporal_Client
    participant Temporal_Server
    participant Worker
    participant Database

    Client->>API_Gateway: HTTP Request
    API_Gateway->>FastAPI: Authenticated Request
    FastAPI->>Temporal_Client: Start Workflow
    Temporal_Client->>Temporal_Server: gRPC StartWorkflow
    Temporal_Server->>Database: Persist Execution
    Temporal_Server->>Worker: Schedule Activity
    Worker->>Temporal_Server: Complete Activity
    Temporal_Server->>Database: Update State
    Temporal_Server->>Temporal_Client: Workflow Complete
    Temporal_Client->>FastAPI: Result
    FastAPI->>API_Gateway: HTTP Response
    API_Gateway->>Client: Response

Workflow Patterns

Long-Running Processes

@workflow.defn
class OrderProcessingWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> OrderResult:
        # Validate order (Activity)
        validation = await workflow.execute_activity(
            validate_order,
            order_id,
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Process payment (Activity with retry)
        payment = await workflow.execute_activity(
            process_payment,
            validation.payment_info,
            start_to_close_timeout=timedelta(minutes=10),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

        # Wait for fulfillment (Signal/Timer)
        await workflow.wait_condition(
            lambda: self.fulfillment_complete,
            timeout=timedelta(days=7)
        )

        return OrderResult(order_id=order_id, status="completed")

Performance Characteristics

Throughput Specifications

Performance Targets:
  workflow_starts_per_second: 1000+
  activity_executions_per_second: 10000+
  concurrent_workflows: 100000+
  history_events_per_workflow: unlimited

Database Performance:
  read_iops: 10000+
  write_iops: 5000+
  connection_pool_size: 100
  query_timeout: 5s

Network Performance:
  internal_latency: <1ms
  external_latency: <10ms
  throughput: 10Gbps

Scaling Characteristics

Horizontal Scaling Limits

Component Scaling:
  temporal_frontend: 1-20_instances
  temporal_history: 1-50_instances
  temporal_matching: 1-10_instances
  temporal_worker: 1-5_instances
  business_workers: 1-100_instances

Database Scaling:
  postgresql_connections: 100-1000
  elasticsearch_nodes: 3-20
  redis_instances: 1-10

Disaster Recovery Architecture

Backup Strategy

Backup Components:
  database:
    frequency: continuous_wal_streaming
    retention: 30_days
    rto: 15_minutes
    rpo: 1_minute

  elasticsearch:
    frequency: hourly_snapshots
    retention: 7_days
    rto: 30_minutes
    rpo: 1_hour

  kubernetes_state:
    frequency: daily_etcd_backup
    retention: 14_days
    rto: 1_hour
    rpo: 24_hours

Multi-Region Setup

Region Strategy:
  primary_region: us-east-1
  secondary_region: us-west-2

  replication:
    database: async_streaming
    object_storage: cross_region_sync
    kubernetes: independent_clusters

  failover:
    automatic: database_only
    manual: full_stack
    rto: 1_hour
    rpo: 5_minutes

Security Architecture Integration

Zero Trust Implementation

  • All communication encrypted (TLS 1.3)
  • Identity verification for every request
  • Principle of least privilege access
  • Network micro-segmentation
  • Continuous security monitoring

Compliance Requirements

  • SOC 2 Type II compliance
  • GDPR data protection
  • PCI DSS for payment processing
  • HIPAA for healthcare workflows
  • Custom audit logging

Monitoring and Observability

Metrics Collection

Metric Categories:
  business_metrics:
    - workflow_completion_rate
    - activity_success_rate
    - processing_duration

  system_metrics:
    - resource_utilization
    - error_rates
    - response_times

  infrastructure_metrics:
    - node_health
    - network_performance
    - storage_usage

Alerting Strategy

Alert Levels:
  critical:
    - service_unavailable
    - data_loss_risk
    - security_breach

  warning:
    - performance_degradation
    - resource_constraints
    - configuration_drift

  info:
    - deployment_events
    - scaling_operations
    - maintenance_windows

Future Architecture Considerations

Roadmap Items

  1. Multi-tenancy: Namespace isolation per tenant
  2. Edge Computing: Regional Temporal clusters
  3. AI/ML Integration: Workflow optimization
  4. Serverless Workers: FaaS-based activity execution
  5. Advanced Analytics: Real-time business intelligence

Technology Evolution

  • Container Runtime: Docker → containerd → gVisor
  • Service Mesh: Istio → Linkerd → Cilium
  • Database: PostgreSQL → CockroachDB (for global scale)
  • Monitoring: Prometheus → OpenTelemetry native

This system architecture provides a robust foundation for enterprise Temporal.io deployments with built-in scalability, security, and operational excellence.