My AI Stack: A Unified Gateway Architecture

This post covers the architecture of my AI infrastructure, built around OpenWebUI (OWUI) as a unified API gateway. The core insight is simple: applications should point to a model name/alias, not a specific backend. This lets me swap out the underlying LLM engine without touching any application configuration.

Network Topology

The infrastructure spans multiple servers, each serving a specific role:

Host                         IP            Role
DGX Spark Cluster (2 nodes)  10.0.0.11/12  vLLM (rarely llama.cpp or Ollama)
Unraid                       10.0.0.10     OWUI, Kokoro, on-demand Ollama (dual 5060 Ti 16GB)
Dev Server                   10.0.0.8      LLM dev - single 5060 Ti 16GB & 256GB RAM

Traffic Flow

Traefik → llm.mywebsite.com → OWUI
                     ↓
    ┌─────────────────────────────────────────┐
    │         Workspace Models                │
    │   "coder" → MiniMax-M2.5                │
    │   "agent" → MiniMax-M2.5                │
    │   "micro" → Qwen3.5-27b                 │
    └─────────────────────────────────────────┘
                      ↓
    ┌───────────────────────────────────────────────┐
    │           LLM Engine Backends                 │
    │  vLLM (10.0.0.11, 10.0.0.12)                  │
    │  Ollama (10.0.0.10, 10.0.0.8)                 │
    │  llama.cpp (10.0.0.11, 10.0.0.12)             │
    └───────────────────────────────────────────────┘

Applications (OpenCode, Cline, Roo, Kilo, OpenCLAW, etc.) all point to the same endpoint: llm.mywebsite.com. They specify either the “coder” model or the “agent” model. OWUI handles routing to whichever backend is currently serving that model.

Traefik Entry Point

Traefik handles TLS termination and routes traffic to OWUI. A Service of type ExternalName forwards the traffic to the Unraid host.

apiVersion: v1
kind: Service
metadata:
  name: ai
  namespace: ai
spec:
  type: ExternalName
  externalName: 10.0.0.10
  ports:
    - name: ai
      port: 8080
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ai-ingress
  namespace: ai
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`llm.mywebsite.com`)
      kind: Rule
      services:
        - name: ai
          port: 8080
  tls:
    certResolver: letsencrypt

This exposes OWUI at https://llm.mywebsite.com with automatic TLS from Let’s Encrypt.

Why OWUI?

I chose OWUI because it acts as an API gateway with workspace model routing and API key/token management for authentication.

OWUI: The Unified API Gateway

OWUI serves as a single entry point for all AI APIs. It aggregates:

  • OpenAI-compatible API endpoints
  • Ollama API endpoints
  • Anthropic-compatible endpoints

So regardless of which backend is in use, OWUI is the front end:

  • OpenAI-compatible API at https://llm.mywebsite.com/api/v1
  • Anthropic-compatible API at https://llm.mywebsite.com/api
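As an illustration of the OpenAI-compatible surface, the sketch below builds (but does not send) a chat completion request to the gateway. The hostname and token are the placeholders used throughout this post, and the payload shape is the standard OpenAI chat format; this is my sketch, not OWUI code:

```python
import json
import urllib.request

BASE_URL = "https://llm.mywebsite.com/api/v1"  # OpenAI-compatible surface
TOKEN = "sk-MY-SECRET-TOKEN"                   # placeholder token from this post

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (without sending) an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # a workspace alias such as "coder", not a backend model name
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("coder", "Write a hello-world in Go.")
print(req.full_url)
```

The point is that the application-facing request names only the alias; everything behind `/api/v1` is OWUI's business.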

Workspace Models

The magic is in workspace model configuration. Instead of applications pointing to a specific model like qwen3-coder-next, they point to:

  • coder: For IDE integrations (OpenCode, Claude Code, Cline, Roo, Kilo)
  • agent: For AI agents (OpenCLAW)

Both currently route to MiniMax-M2.5. To switch to a different model, I update the OWUI configuration once, without touching the configuration of each application (OpenCode, Claude Code, Cline, Roo, Kilo).
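Conceptually, the workspace layer is just an alias table. The sketch below is my own illustration (not OWUI internals) of why swapping the backing model is a one-line change on the gateway side:

```python
# Hypothetical illustration of the alias layer; OWUI's real routing is richer.
WORKSPACE_MODELS = {
    "coder": "MiniMax-M2.5",   # served by vLLM on the DGX Sparks
    "agent": "MiniMax-M2.5",
    "micro": "Qwen3.5-27b",
}

def resolve(alias: str) -> str:
    """Map the alias an application requests to the model actually served."""
    return WORKSPACE_MODELS[alias]

print(resolve("coder"))  # applications only ever see the alias

# Swapping the underlying model touches this table, not any application config:
WORKSPACE_MODELS["coder"] = "qwen3-coder-next"
print(resolve("coder"))
```

Every IDE integration keeps asking for "coder"; only the table changed.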

Config Examples

Claude Code

cat ~/.claude/settings.json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://llm.mywebsite.com/api",
    "ANTHROPIC_AUTH_TOKEN": "sk-MY-SECRET-TOKEN",
    "ANTHROPIC_MODEL": "coder",
    "ANTHROPIC_SMALL_FAST_MODEL": "coder"
  }
}

OpenCode

{
  "$schema": "https://opencode.ai/config.json",
  "model": "rush-ai/coder",
  "provider": {
    "rush-ai": {
      "options": {
        "apiKey": "sk-MY-SECRET-TOKEN",
        "baseURL": "https://llm.mywebsite.com/api/v1"
      },
      "models": {
        "coder": {
          "limit": {
            "context": 196608,
            "output": 65536
          }
        }
      }
    }
  }
}

OWUI Configuration

# API Endpoints
OPENAI_API_BASE_URLS: http://10.0.0.11:11434/v1;http://10.0.0.8:11434/v1;http://10.0.0.11:8000/v1
OLLAMA_BASE_URLS: http://10.0.0.10:11435;http://10.0.0.8:11435;http://10.0.0.11:11435
ENABLE_OPENAI_API: true
ENABLE_OLLAMA_API: true
ENABLE_CUSTOM_MODEL_FALLBACK: true
BYPASS_MODEL_ACCESS_CONTROL: true
USER_PERMISSIONS_WORKSPACE_MODELS_ACCESS: true
DEVICE_TYPE: cuda
USE_CUDA_DOCKER: true
ENABLE_SIGNUP: false
ENABLE_API_KEYS: true
WEBUI_URL: https://llm.mywebsite.com

The key piece is the multiple backend URLs in OPENAI_API_BASE_URLS and OLLAMA_BASE_URLS, which let OWUI route each request to the right model and LLM engine.
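OWUI reads these variables as semicolon-separated lists. A small sketch of the parsing (my own illustration, tolerant of stray spaces and a trailing semicolon):

```python
def parse_urls(value: str) -> list[str]:
    """Split a semicolon-separated URL list, dropping empty entries and whitespace."""
    return [u.strip() for u in value.split(";") if u.strip()]

openai_backends = parse_urls(
    "http://10.0.0.11:11434/v1;http://10.0.0.8:11434/v1;http://10.0.0.11:8000/v1"
)
print(openai_backends)
```

Each resulting URL is a candidate backend; OWUI queries them for their model lists and routes accordingly.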

vLLM

The primary inference engines are running on dual DGX Sparks:

  • DGX Spark Cluster (2 nodes) (10.0.0.11/12): vLLM

These provide the bulk of inference capacity. vLLM offers excellent throughput for production workloads and has been far more stable than Ollama and llama.cpp.

Unraid Server (10.0.0.10:11435)

The Unraid server has dual 5060 Ti 16GB GPUs and runs Ollama alongside OWUI:

OLLAMA_HOST: 0.0.0.0:11434
OLLAMA_KEEP_ALIVE: 1800       # 30 minute TTL
OLLAMA_MAX_LOADED_MODELS: 1   # Single model in memory
OLLAMA_CONTEXT_LENGTH: 131072 # 128K context
OLLAMA_FLASH_ATTENTION: 1
OLLAMA_NUM_PARALLEL: 1        # No concurrency, only 1 KV Cache
OLLAMA_SCHED_SPREAD: true

Key settings:

  • 1800 second TTL: Models auto-unload from GPU memory after 30 minutes of inactivity
  • Single model: Only one model loaded at a time to maximize available VRAM
  • Flash attention: Enabled for longer context efficiency
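To see why keeping a single model (and a single KV cache) loaded matters at 128K context, here is a back-of-the-envelope KV-cache estimate. The layer/head/dimension numbers are illustrative assumptions, not the spec of any model named in this post:

```python
# Back-of-the-envelope KV-cache size at full context.
# All model dimensions below are illustrative assumptions.
layers = 48          # transformer layers (assumed)
kv_heads = 8         # grouped-query KV heads (assumed)
head_dim = 128       # per-head dimension (assumed)
dtype_bytes = 2      # fp16/bf16 cache
context = 131072     # OLLAMA_CONTEXT_LENGTH from the config above

# 2x for keys and values:
kv_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes * context
print(f"KV cache at 128K context: {kv_bytes / 2**30:.1f} GiB")
```

With those assumed dimensions the full-context cache alone is 24 GiB, more than a single 16GB 5060 Ti, which is why the config keeps one model with one KV cache and spreads it across both GPUs (OLLAMA_SCHED_SPREAD).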

Dev Server (10.0.0.8:11435)

The dev server has a single 5060 Ti 16GB and 256GB of system RAM. It's where I experiment with other models, inference engines, and configurations.

TTS (Optional)

Kokoro TTS is available for voice conversations through OWUI. When voice mode is enabled in the UI, it routes to a Kokoro container running on the Unraid server.

Flexibility First

The architecture is designed for flexibility.

  1. Multiple engine types: vLLM, Ollama, and llama.cpp are all available from the same API endpoint via OWUI.

  2. On-demand loading: Ollama models load only when needed and unload after 30 minutes, reducing power when not in use.

  3. Application independence: Applications never know or care which backend or model is being used. They simply request the “coder” or “agent” model and OWUI handles the rest.

Summary

This architecture provides:

  • Unified API: Single endpoint for all AI tools
  • Model flexibility: Swap models without touching application configs
  • Resilience: Multiple backends for each engine type
  • Resource efficiency: On-demand Ollama loading with TTL, dedicated vLLM for batch workloads
  • GPU utilization: Dual DGX Sparks for large inference and dual 5060 Ti on Unraid for embeddings and on-demand models

The key insight is decoupling: applications are decoupled from specific models via workspace models, and the gateway is decoupled from specific backends via multiple endpoint configuration. This gives me the flexibility to experiment with new models or scale infrastructure without disrupting the tools I use every day.