LiteLLM Proxy Service¶
LiteLLM Proxy serves as a unified API gateway that provides a standardized OpenAI-compatible interface for 100+ large language model providers. It enables seamless switching between different LLM providers while maintaining consistent API contracts and adding enterprise features like load balancing, rate limiting, and cost tracking.
Architecture Overview¶
graph TB
A[Client Applications] -->|OpenAI Format| B[LiteLLM Proxy]
B -->|Route & Transform| C[Request Router]
C -->|Load Balance| D[Provider Pool]
D --> E[OpenAI GPT-4]
D --> F[Anthropic Claude]
D --> G[Google Gemini]
D --> H[Cohere Command]
D --> I[Local Ollama]
B -->|Log & Monitor| J[Analytics DB]
B -->|Cache Responses| K[Redis Cache]
B -->|Rate Limit| L[Rate Limiter]
%% Styling
classDef clientClass fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
classDef proxyClass fill:#e3f2fd,stroke:#1565c0,stroke-width:3px,color:#000
classDef routerClass fill:#fff8e1,stroke:#f9a825,stroke-width:2px,color:#000
classDef poolClass fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#000
classDef openaiClass fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#000
classDef anthropicClass fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
classDef googleClass fill:#e0f2f1,stroke:#00796b,stroke-width:2px,color:#000
classDef cohereClass fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000
classDef localClass fill:#ede7f6,stroke:#512da8,stroke-width:2px,color:#000
classDef analyticsClass fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px,color:#000
classDef cacheClass fill:#ffebee,stroke:#e91e63,stroke-width:2px,color:#000
classDef limitClass fill:#e0f7fa,stroke:#00838f,stroke-width:2px,color:#000
class A clientClass
class B proxyClass
class C routerClass
class D poolClass
class E openaiClass
class F anthropicClass
class G googleClass
class H cohereClass
class I localClass
class J analyticsClass
class K cacheClass
class L limitClass
Key Features¶
Unified API Interface¶
- OpenAI-compatible endpoints for all supported providers
- Consistent request/response format across different models
- Automatic parameter translation between provider schemas
- Streaming support for real-time responses (see the client sketch after this list)
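Because the proxy speaks the OpenAI wire format, any OpenAI SDK can be pointed at it by overriding the base URL. Below is a minimal sketch using the official openai Python package, assuming the proxy is reachable at http://localhost:4000 and authenticated with the master key (or a virtual key); the model names are the aliases defined in litellm_config.yaml.
from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy instead of api.openai.com
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-your-master-key-here",  # LITELLM_MASTER_KEY or a generated virtual key
)

# The same request shape works for any configured provider alias
response = client.chat.completions.create(
    model="gpt-4",  # swap to "claude-3-sonnet" or "llama2" without changing the code
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)

# Streaming works the same way
stream = client.chat.completions.create(
    model="claude-3-sonnet",
    messages=[{"role": "user", "content": "Summarize LiteLLM in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)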
Enterprise-Grade Functionality¶
- Load balancing across multiple model instances
- Intelligent failover when providers are unavailable
- Rate limiting per user, API key, or model
- Cost tracking and budget enforcement (a key-provisioning sketch follows this list)
- Request/response caching for improved performance
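Per-key budgets and rate limits are provisioned through the proxy's key-management API. The sketch below calls the /key/generate endpoint with requests; it assumes the proxy is backed by the Postgres database configured later on this page, and the field names should be checked against your LiteLLM version.
import requests

PROXY = "http://localhost:4000"
MASTER_KEY = "sk-your-master-key-here"  # LITELLM_MASTER_KEY

# Create a virtual key limited to specific models, with its own budget and rate limits
resp = requests.post(
    f"{PROXY}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "models": ["gpt-3.5-turbo", "claude-3-sonnet"],
        "max_budget": 10.0,   # USD budget for this key
        "duration": "30d",    # key expiry
        "tpm_limit": 5000,    # tokens per minute
        "rpm_limit": 50,      # requests per minute
    },
    timeout=30,
)
resp.raise_for_status()
virtual_key = resp.json()["key"]
print("Issued key:", virtual_key)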
Multi-Provider Support¶
- 100+ LLM providers supported out of the box
- Custom endpoint configuration for proprietary models
- Model routing strategies (cost-based, performance-based)
- Provider-specific optimizations and retry logic
Observability & Analytics¶
- Real-time usage dashboards with Langfuse integration (see the tracing sketch after this list)
- Detailed cost breakdowns per model and user
- Performance metrics (latency, throughput, error rates)
- Audit logs for compliance and debugging
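Traces can be enriched from the client side by attaching metadata to each request, which the proxy forwards to the configured Langfuse callbacks. A sketch using the openai SDK; the metadata keys shown are the ones commonly used by the Langfuse integration and should be verified against your LiteLLM version.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-your-master-key-here")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain rate limiting in one paragraph."}],
    # extra_body fields are passed through to the proxy alongside the OpenAI payload
    extra_body={
        "metadata": {
            "generation_name": "docs-example",  # appears as the Langfuse generation name
            "trace_user_id": "user-1",          # attribute cost/usage to a user
            "tags": ["ai-dev-local", "example"],
        }
    },
)
print(response.usage)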
Configuration Schema¶
Environment Variables¶
# Core Settings
LITELLM_MASTER_KEY=sk-your-master-key-here
DATABASE_URL=postgresql://postgres:postgres@postgres:5432/litellm
# Redis Configuration
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_PASSWORD=
# UI Authentication
UI_USERNAME=${LITELLM_UI_USERNAME}
UI_PASSWORD=${LITELLM_UI_PASSWORD}
# Provider API Keys
OPENAI_API_KEY=sk-proj-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
GEMINI_API_KEY=your-gemini-key
COHERE_API_KEY=your-cohere-key
# Langfuse Integration
LANGFUSE_PUBLIC_KEY=pk-your-langfuse-public-key
LANGFUSE_SECRET_KEY=sk-your-langfuse-secret-key
LANGFUSE_HOST=http://langfuse:3000
# Logging
LITELLM_LOG=INFO
LITELLM_DEBUG=false
# Telemetry Settings
LITELLM_TELEMETRY=false
Model Configuration (litellm_config.yaml)¶
model_list:
  # OpenAI Models
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY
      max_tokens: 4096
      temperature: 0.7
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
      max_tokens: 2048
  # Anthropic Models
  - model_name: claude-3-sonnet
    litellm_params:
      model: anthropic/claude-3-sonnet-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 2048
  # Google Models
  - model_name: gemini-pro
    litellm_params:
      model: gemini/gemini-pro
      api_key: os.environ/GEMINI_API_KEY
  # Local Ollama Models
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://host.docker.internal:11434

# Router Configuration
router_settings:
  routing_strategy: usage-based-routing
  model_group_alias:
    gpt-4-group: ["gpt-4", "gpt-4-turbo"]
    claude-group: ["claude-3-opus", "claude-3-sonnet", "claude-3-haiku"]
  fallbacks:
    - gpt-4: ["gpt-3.5-turbo"]
    - claude-3-opus: ["claude-3-sonnet", "gpt-4"]

# General Settings
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL

  # Cost & Budget Controls
  track_cost_per_model: true
  max_budget: 100.0
  budget_duration: 30d

  # Caching
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    ttl: 600

  # Rate Limiting
  tpm_limit: 10000  # tokens per minute
  rpm_limit: 100    # requests per minute

  # Callbacks for Observability
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

  # Security
  allowed_ips: ["*"]
  blocked_ips: []
API Request Schema¶
{
  "model": "gpt-4",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 1000,
  "stream": false
}
API Response Schema¶
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm doing well, thank you for asking."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 12,
    "total_tokens": 32
  }
}
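The same request can be issued over plain HTTP. The sketch below posts the request body shown above to /v1/chat/completions and reads the fields from the response schema, assuming the local proxy address and master key used elsewhere on this page.
import requests

resp = requests.post(
    "http://localhost:4000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-your-master-key-here"},
    json={
        "model": "gpt-4",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
        "temperature": 0.7,
        "max_tokens": 1000,
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
body = resp.json()
print(body["choices"][0]["message"]["content"])
print("tokens used:", body["usage"]["total_tokens"])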
Access¶
LiteLLM Proxy is accessible at http://localhost:4000, the address used in the examples throughout this page.
Key Endpoints¶
- Chat Completions: POST /v1/chat/completions
- Completions: POST /v1/completions
- Models List: GET /v1/models (see the example below)
- Health Check: GET /health
- UI Dashboard: GET /ui (admin interface)
- Metrics: GET /metrics (Prometheus format)
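The models endpoint can also be queried through the OpenAI SDK to confirm which aliases from litellm_config.yaml are being served. A small sketch, assuming the local proxy address used above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-your-master-key-here")

# GET /v1/models through the SDK; each entry's id is a model_name from the config
for model in client.models.list():
    print(model.id)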
Supported Providers¶
Major Cloud Providers¶
- OpenAI: GPT-3.5, GPT-4, GPT-4 Vision, DALL-E
- Anthropic: Claude 3 (Haiku, Sonnet, Opus)
- Google: Gemini Pro, Gemini Pro Vision, PaLM
- Cohere: Command, Command-Light, Embed
- Azure OpenAI: All OpenAI models via Azure
- AWS Bedrock: Titan, Claude, Llama models
Open Source & Local¶
- Ollama: Local model serving
- Hugging Face: Transformers models
- vLLM: High-performance inference
- Together AI: Open source models
- Replicate: Community-hosted models
Specialized Providers¶
- Stability AI: Stable Diffusion models
- Mistral AI: Mistral 7B, Mixtral 8x7B
- AI21: Jurassic models
- Aleph Alpha: Luminous models
Advanced Features¶
Load Balancing Strategies¶
router_settings:
  routing_strategy: "usage-based-routing"  # or "simple-shuffle", "least-busy", "latency-based"
  cooldown_time: 1  # seconds a failing deployment is kept out of rotation
  retry_policy:
    max_retries: 3
    base_delay: 1
    max_delay: 10
Cost Controls¶
general_settings:
  max_budget: 100.0
  budget_duration: "30d"
  budget_reset_at: "00:00"
  cost_per_token:
    gpt-4: 0.00006  # per token
    gpt-3.5-turbo: 0.000002
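With a flat per-token rate like the one above, a request's cost is simply its total tokens multiplied by the model's rate. A small worked sketch using the usage block from the earlier response example; real provider pricing usually distinguishes prompt and completion tokens, so treat this as an approximation.
# Rates from the cost_per_token block above (USD per token)
COST_PER_TOKEN = {"gpt-4": 0.00006, "gpt-3.5-turbo": 0.000002}

# Usage block from the example response: 20 prompt + 12 completion tokens
usage = {"prompt_tokens": 20, "completion_tokens": 12, "total_tokens": 32}

cost = usage["total_tokens"] * COST_PER_TOKEN["gpt-4"]
print(f"Estimated request cost: ${cost:.5f}")  # 32 * 0.00006 = $0.00192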
Rate Limiting¶
general_settings:
  tpm_limit: 10000  # tokens per minute
  rpm_limit: 100    # requests per minute
  max_parallel_requests: 10
  user_rate_limit:
    "user-1": {"tpm": 5000, "rpm": 50}
Monitoring & Observability¶
Health Check Response¶
{
  "status": "healthy",
  "healthy_endpoints": [
    {
      "model": "gpt-4",
      "status": "healthy",
      "latency_ms": 245
    }
  ],
  "unhealthy_endpoints": [],
  "version": "1.0.0",
  "uptime": "2h 15m 30s"
}
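The health endpoint is easy to poll from a script or a monitoring job. The sketch below flags any unhealthy endpoints, reading the response fields shown above and assuming the local proxy address.
import requests

resp = requests.get(
    "http://localhost:4000/health",
    headers={"Authorization": "Bearer sk-your-master-key-here"},
    timeout=10,
)
resp.raise_for_status()
health = resp.json()

print("overall status:", health["status"])
for endpoint in health.get("unhealthy_endpoints", []):
    print("UNHEALTHY:", endpoint)
for endpoint in health.get("healthy_endpoints", []):
    print(f"{endpoint['model']}: {endpoint.get('latency_ms', '?')} ms")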
Usage Analytics¶
- Request volume per model and time period
- Cost analysis with breakdown by user/model
- Performance metrics (P95, P99 latency)
- Error rate tracking with categorization
- Token usage patterns and optimization insights
Troubleshooting¶
Common Issues¶
- "Invalid HTTP request received" warnings
- Usually caused by invalid API keys or authentication issues
-
See LiteLLM Troubleshooting guide
-
Model unavailable errors
- Check provider API key configuration
- Verify model name spelling and availability
-
Review rate limiting settings
-
High latency issues
- Enable response caching
- Implement load balancing
- Check provider geographic regions
Debug Commands¶
# Check health status
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/health
# List available models
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/v1/models
# View metrics
curl http://localhost:4000/metrics
CLI Integration for Ollama Management¶
The AI Dev Local CLI includes commands for managing Ollama models and keeping the LiteLLM configuration in sync:
# Browse available Ollama models
ai-dev-local ollama list-available
ai-dev-local ollama list-available --category code
ai-dev-local ollama list-available --search llama
# Pull and install Ollama models
ai-dev-local ollama pull llama2:7b
ai-dev-local ollama pull codellama:7b
# Sync installed Ollama models to LiteLLM configuration
ai-dev-local ollama sync-litellm
# Preview changes before applying
ai-dev-local ollama sync-litellm --dry-run
# Restart LiteLLM to apply new configuration
docker-compose restart litellm
Automatic Configuration Management:
- Detects all installed Ollama models automatically
- Updates litellm_config.yaml with current Ollama models
- Removes outdated Ollama model entries
- Preserves all non-Ollama model configurations
- Updates router group aliases for proper load balancing
- Creates timestamped backups before making changes
Typical Workflow:
1. Start services with Ollama: ai-dev-local start --ollama
2. Browse available models: ai-dev-local ollama list-available --category code
3. Install desired models: ai-dev-local ollama pull codellama:7b
4. Sync to LiteLLM: ai-dev-local ollama sync-litellm
5. Restart LiteLLM: docker-compose restart litellm
6. Verify in health check: Models should appear as healthy endpoints
See the CLI Reference for complete documentation of all Ollama management commands.
Online Resources¶
- GitHub Repository: LiteLLM GitHub
- Official Website: LiteLLM.ai
- Documentation: LiteLLM Docs
- API Reference: API Docs
- Community: Discord Server
- Docker Hub: LiteLLM Images
Use Cases¶
- Multi-Provider Abstraction: Single API for multiple LLM providers
- Cost Optimization: Automatic routing to cost-effective models
- High Availability: Failover between providers for reliability
- Development & Testing: Easy model comparison and A/B testing
- Enterprise Deployment: Centralized LLM access with governance
- Microservices Architecture: Unified LLM gateway for distributed systems
Performance Optimizations¶
- Response Caching: Redis-based caching for repeated queries (see the timing sketch after this list)
- Connection Pooling: Efficient HTTP connection management
- Batch Processing: Support for batch API requests
- Streaming Responses: Real-time response streaming
- Geographic Routing: Route to nearest provider endpoints
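With Redis caching enabled in litellm_config.yaml, an identical request repeated within the TTL should be answered from the cache. A rough way to sanity-check this is to time two identical calls; this is only a sketch, and absolute numbers depend on the provider and network.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-your-master-key-here")
payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "What is LiteLLM?"}],
    "temperature": 0.0,  # requests must be identical for a cache hit
}

for label in ("cold", "cached"):
    start = time.perf_counter()
    client.chat.completions.create(**payload)
    print(f"{label}: {time.perf_counter() - start:.2f}s")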
LiteLLM Proxy is essential for organizations requiring a robust, scalable, and cost-effective solution for managing multiple LLM providers through a single, standardized interface.