LiteLLM Proxy Service

LiteLLM Proxy serves as a unified API gateway that provides a standardized OpenAI-compatible interface for 100+ large language model providers. It enables seamless switching between different LLM providers while maintaining consistent API contracts and adding enterprise features like load balancing, rate limiting, and cost tracking.
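
A minimal client-side sketch (assuming the proxy is reachable at the default http://localhost:4000 address documented below, the openai Python SDK v1+, and the master key as the bearer token):

from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at the LiteLLM Proxy instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-your-master-key-here",  # master key or a virtual key issued by the proxy
)

response = client.chat.completions.create(
    model="gpt-4",  # any model_name defined in litellm_config.yaml
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)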

Architecture Overview

graph TB
    A[Client Applications] -->|OpenAI Format| B[LiteLLM Proxy]
    B -->|Route & Transform| C[Request Router]
    C -->|Load Balance| D[Provider Pool]
    D --> E[OpenAI GPT-4]
    D --> F[Anthropic Claude]
    D --> G[Google Gemini]
    D --> H[Cohere Command]
    D --> I[Local Ollama]
    B -->|Log & Monitor| J[Analytics DB]
    B -->|Cache Responses| K[Redis Cache]
    B -->|Rate Limit| L[Rate Limiter]

    %% Styling
    classDef clientClass fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef proxyClass fill:#e3f2fd,stroke:#1565c0,stroke-width:3px,color:#000
    classDef routerClass fill:#fff8e1,stroke:#f9a825,stroke-width:2px,color:#000
    classDef poolClass fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#000
    classDef openaiClass fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#000
    classDef anthropicClass fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef googleClass fill:#e0f2f1,stroke:#00796b,stroke-width:2px,color:#000
    classDef cohereClass fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000
    classDef localClass fill:#ede7f6,stroke:#512da8,stroke-width:2px,color:#000
    classDef analyticsClass fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px,color:#000
    classDef cacheClass fill:#ffebee,stroke:#e91e63,stroke-width:2px,color:#000
    classDef limitClass fill:#e0f7fa,stroke:#00838f,stroke-width:2px,color:#000

    class A clientClass
    class B proxyClass
    class C routerClass
    class D poolClass
    class E openaiClass
    class F anthropicClass
    class G googleClass
    class H cohereClass
    class I localClass
    class J analyticsClass
    class K cacheClass
    class L limitClass

Key Features

Unified API Interface

  • OpenAI-compatible endpoints for all supported providers
  • Consistent request/response format across different models
  • Automatic parameter translation between provider schemas
  • Streaming support for real-time responses
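
The streaming case uses the same client unchanged; a sketch assuming the openai Python SDK and the default proxy address:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-your-master-key-here")

# stream=True returns an iterator of chunks that the proxy relays as they arrive.
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a haiku about API gateways."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()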

Enterprise-Grade Functionality

  • Load balancing across multiple model instances
  • Intelligent failover when providers are unavailable
  • Rate limiting per user, API key, or model
  • Cost tracking and budget enforcement
  • Request/response caching for improved performance

Multi-Provider Support

  • 100+ LLM providers supported out of the box
  • Custom endpoint configuration for proprietary models
  • Model routing strategies (cost-based, performance-based)
  • Provider-specific optimizations and retry logic

Observability & Analytics

  • Real-time usage dashboards with Langfuse integration
  • Detailed cost breakdowns per model and user
  • Performance metrics (latency, throughput, error rates)
  • Audit logs for compliance and debugging

Configuration Schema

Environment Variables

# Core Settings
LITELLM_MASTER_KEY=sk-your-master-key-here
DATABASE_URL=postgresql://postgres:postgres@postgres:5432/litellm

# Redis Configuration
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_PASSWORD=

# UI Authentication
UI_USERNAME=${LITELLM_UI_USERNAME}
UI_PASSWORD=${LITELLM_UI_PASSWORD}

# Provider API Keys
OPENAI_API_KEY=sk-proj-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
GEMINI_API_KEY=your-gemini-key
COHERE_API_KEY=your-cohere-key

# Langfuse Integration
LANGFUSE_PUBLIC_KEY=pk-your-langfuse-public-key
LANGFUSE_SECRET_KEY=sk-your-langfuse-secret-key
LANGFUSE_HOST=http://langfuse:3000

# Logging
LITELLM_LOG=INFO
LITELLM_DEBUG=false

# Telemetry Settings
LITELLM_TELEMETRY=false
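
On the client side, only the proxy URL and an accepted key are needed; the remaining variables are consumed by the proxy itself. A short sketch, assuming LITELLM_MASTER_KEY is also exported in the caller's environment:

import os
from openai import OpenAI

# DATABASE_URL, REDIS_* and the provider keys are read by the proxy, not the client.
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key=os.environ["LITELLM_MASTER_KEY"],
)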

Model Configuration (litellm_config.yaml)

model_list:
  # OpenAI Models
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY
      max_tokens: 4096
      temperature: 0.7

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
      max_tokens: 2048

  # Anthropic Models
  - model_name: claude-3-sonnet
    litellm_params:
      model: anthropic/claude-3-sonnet-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 2048

  # Google Models
  - model_name: gemini-pro
    litellm_params:
      model: gemini/gemini-pro
      api_key: os.environ/GEMINI_API_KEY

  # Local Ollama Models
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://host.docker.internal:11434

# Router Configuration
router_settings:
  routing_strategy: usage-based-routing
  model_group_alias:
    gpt-4-group: ["gpt-4", "gpt-4-turbo"]
    claude-group: ["claude-3-opus", "claude-3-sonnet", "claude-3-haiku"]
  fallbacks:
    - gpt-4: ["gpt-3.5-turbo"]
    - claude-3-opus: ["claude-3-sonnet", "gpt-4"]

# General Settings
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL

  # Cost & Budget Controls
  track_cost_per_model: true
  max_budget: 100.0
  budget_duration: 30d

  # Caching
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    ttl: 600

  # Rate Limiting
  tpm_limit: 10000  # tokens per minute
  rpm_limit: 100    # requests per minute

  # Callbacks for Observability
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

  # Security
  allowed_ips: ["*"]
  blocked_ips: []
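
Because each entry is exposed under its model_name, clients can switch between a hosted model and the local Ollama model without code changes. A sketch assuming the claude-3-sonnet and llama2 entries above are configured and reachable:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-your-master-key-here")

# Same call shape for a hosted Anthropic model and a local Ollama model.
for model in ("claude-3-sonnet", "llama2"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Name one benefit of an API gateway."}],
    )
    print(f"{model}: {reply.choices[0].message.content[:80]}")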

API Request Schema

{
  "model": "gpt-4",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 1000,
  "stream": false
}

API Response Schema

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm doing well, thank you for asking."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 12,
    "total_tokens": 32
  }
}
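
The same request and response shapes can be exercised over raw HTTP; a hedged sketch using the requests library and the master key shown above:

import requests

payload = {
    "model": "gpt-4",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"},
    ],
    "temperature": 0.7,
    "max_tokens": 1000,
    "stream": False,
}

resp = requests.post(
    "http://localhost:4000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-your-master-key-here"},
    json=payload,
    timeout=60,
)
data = resp.json()
print(data["choices"][0]["message"]["content"])
print("total tokens:", data["usage"]["total_tokens"])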

Access

LiteLLM Proxy is accessible at:

http://localhost:4000/

Key Endpoints

  • Chat Completions: POST /v1/chat/completions
  • Completions: POST /v1/completions
  • Models List: GET /v1/models
  • Health Check: GET /health
  • UI Dashboard: GET /ui (admin interface)
  • Metrics: GET /metrics (Prometheus format)
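
For example, the models list endpoint returns the configured model_name values in the standard OpenAI format; a sketch assuming the requests library:

import requests

models = requests.get(
    "http://localhost:4000/v1/models",
    headers={"Authorization": "Bearer sk-your-master-key-here"},
    timeout=10,
).json()
print([m["id"] for m in models["data"]])  # e.g. ["gpt-4", "gpt-3.5-turbo", "claude-3-sonnet", ...]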

Supported Providers

Major Cloud Providers

  • OpenAI: GPT-3.5, GPT-4, GPT-4 Vision, DALL-E
  • Anthropic: Claude 3 (Haiku, Sonnet, Opus)
  • Google: Gemini Pro, Gemini Pro Vision, PaLM
  • Cohere: Command, Command-Light, Embed
  • Azure OpenAI: All OpenAI models via Azure
  • AWS Bedrock: Titan, Claude, Llama models

Open Source & Local

  • Ollama: Local model serving
  • Hugging Face: Transformers models
  • vLLM: High-performance inference
  • Together AI: Open source models
  • Replicate: Community-hosted models

Specialized Providers

  • Stability AI: Stable Diffusion models
  • Mistral AI: Mistral 7B, Mixtral 8x7B
  • AI21: Jurassic models
  • Aleph Alpha: Luminous models

Advanced Features

Load Balancing Strategies

router_settings:
  routing_strategy: "usage-based-routing"  # or "simple-shuffle", "least-busy", "latency-based"
  cooldown_time: 1  # seconds between retries
  retry_policy:
    max_retries: 3
    base_delay: 1
    max_delay: 10

Cost Controls

general_settings:
  max_budget: 100.0
  budget_duration: "30d"
  budget_reset_at: "00:00"
  cost_per_token:
    gpt-4: 0.00006  # per token
    gpt-3.5-turbo: 0.000002
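
As a quick sanity check of how these per-token rates translate into spend, using the 32-token usage block from the response example above (the proxy tracks this automatically; the rates here are only the illustrative values from the snippet):

# 20 prompt + 12 completion tokens at the illustrative gpt-4 rate of $0.00006/token
cost_per_token = {"gpt-4": 0.00006, "gpt-3.5-turbo": 0.000002}
total_tokens = 20 + 12
print(total_tokens * cost_per_token["gpt-4"])  # 0.00192 (USD)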

Rate Limiting

general_settings:
  tpm_limit: 10000  # tokens per minute
  rpm_limit: 100    # requests per minute
  max_parallel_requests: 10
  user_rate_limit:
    "user-1": {"tpm": 5000, "rpm": 50}

Monitoring & Observability

Health Check Response

{
  "status": "healthy",
  "healthy_endpoints": [
    {
      "model": "gpt-4",
      "status": "healthy",
      "latency_ms": 245
    }
  ],
  "unhealthy_endpoints": [],
  "version": "1.0.0",
  "uptime": "2h 15m 30s"
}
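
A small monitoring sketch that reads the fields shown above (healthy_endpoints, unhealthy_endpoints, latency_ms), assuming the requests library and the master key:

import requests

health = requests.get(
    "http://localhost:4000/health",
    headers={"Authorization": "Bearer sk-your-master-key-here"},
    timeout=10,
).json()

for ep in health.get("healthy_endpoints", []):
    print(f"OK   {ep['model']}  {ep.get('latency_ms', '?')} ms")
for ep in health.get("unhealthy_endpoints", []):
    print(f"DOWN {ep.get('model', '?')}")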

Usage Analytics

  • Request volume per model and time period
  • Cost analysis with breakdown by user/model
  • Performance metrics (P95, P99 latency)
  • Error rate tracking with categorization
  • Token usage patterns and optimization insights

Troubleshooting

Common Issues

  1. "Invalid HTTP request received" warnings
  2. Usually caused by invalid API keys or authentication issues
  3. See LiteLLM Troubleshooting guide

  4. Model unavailable errors

  5. Check provider API key configuration
  6. Verify model name spelling and availability
  7. Review rate limiting settings

  8. High latency issues

  9. Enable response caching
  10. Implement load balancing
  11. Check provider geographic regions

Debug Commands

# Check health status
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/health

# List available models
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/v1/models

# View metrics
curl http://localhost:4000/metrics

CLI Integration for Ollama Management

The AI Dev Local CLI provides seamless integration for managing Ollama models with LiteLLM:

# Browse available Ollama models
ai-dev-local ollama list-available
ai-dev-local ollama list-available --category code
ai-dev-local ollama list-available --search llama

# Pull and install Ollama models
ai-dev-local ollama pull llama2:7b
ai-dev-local ollama pull codellama:7b

# Sync installed Ollama models to LiteLLM configuration
ai-dev-local ollama sync-litellm

# Preview changes before applying
ai-dev-local ollama sync-litellm --dry-run

# Restart LiteLLM to apply new configuration
docker-compose restart litellm

Automatic Configuration Management:

  • Detects all installed Ollama models automatically
  • Updates litellm_config.yaml with current Ollama models
  • Removes outdated Ollama model entries
  • Preserves all non-Ollama model configurations
  • Updates router group aliases for proper load balancing
  • Creates timestamped backups before making changes

Typical Workflow:

  1. Start services with Ollama: ai-dev-local start --ollama
  2. Browse available models: ai-dev-local ollama list-available --category code
  3. Install desired models: ai-dev-local ollama pull codellama:7b
  4. Sync to LiteLLM: ai-dev-local ollama sync-litellm
  5. Restart LiteLLM: docker-compose restart litellm
  6. Verify in health check: models should appear as healthy endpoints

See the CLI Reference for complete documentation of all Ollama management commands.

Use Cases

  • Multi-Provider Abstraction: Single API for multiple LLM providers
  • Cost Optimization: Automatic routing to cost-effective models
  • High Availability: Failover between providers for reliability
  • Development & Testing: Easy model comparison and A/B testing
  • Enterprise Deployment: Centralized LLM access with governance
  • Microservices Architecture: Unified LLM gateway for distributed systems

Performance Optimizations

  • Response Caching: Redis-based caching for repeated queries
  • Connection Pooling: Efficient HTTP connection management
  • Batch Processing: Support for batch API requests
  • Streaming Responses: Real-time response streaming
  • Geographic Routing: Route to nearest provider endpoints

LiteLLM Proxy gives organizations a robust, scalable, and cost-effective way to manage multiple LLM providers through a single, standardized interface.