Core Relay System
1 Overview
The Core Relay System is the execution backbone of the LLMs x420 Gateway. It orchestrates the full lifecycle of an AI service request, from authentication and request parsing through intelligent routing and upstream communication to response delivery, across diverse LLM providers such as OpenAI, Claude, and Gemini, as well as multimodal services including image and audio generation.
Its mission: to provide a standardized, OpenAI-compatible gateway that dynamically integrates multiple upstream intelligence providers via programmable routing, quota enforcement, and resilient middleware.

2 Architecture Design Principles
The system follows a layered middleware architecture, where each processing layer is responsible for a specific operational concern:
Authentication & Context Loading
Channel Routing & Load Distribution
API Format Normalization
Token Accounting & Quota Enforcement
Fault Tolerance & Retry
Format-Specific Dispatching
Logging & Real-Time Response
This modularity ensures maximum composability, observability, and resilience — ideal for high-throughput agent-driven applications.
Core Design Tenets:
Format Agnosticism: Seamless translation across OpenAI, Claude, Gemini, and custom schemas
Pessimistic Quota Control: Reserve-before-call model to guarantee cost containment
Smart Retry Logic: Built-in failover, adaptive retries, and channel auto-healing
3 Authentication Layer
The gateway supports multiple credential types, including:
Bearer Tokens (OpenAI-compatible)
WebSocket session headers
Claude API keys
Gemini credentials
Midjourney access secrets
Once authenticated, the system loads user metadata (e.g. quota class, allowed models, IP policy) and resolves access rights. This lays the groundwork for:
Dynamic access control
Group-level feature gating
Quota scope enforcement
Billing traceability
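The credential-parsing and context-loading step can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual implementation: `UserContext`, `tokenTable`, and `Authenticate` are hypothetical names, and only the Bearer-token path is shown.

```go
package main

import (
	"errors"
	"strings"
)

// UserContext is a hypothetical view of the metadata loaded after
// authentication: quota class, allowed models, and group membership.
type UserContext struct {
	Key           string
	QuotaClass    string
	AllowedModels map[string]bool
	Group         string
}

// a toy credential store standing in for the gateway's real token table
var tokenTable = map[string]UserContext{
	"sk-demo": {
		Key:           "sk-demo",
		QuotaClass:    "developer",
		AllowedModels: map[string]bool{"gpt-4o": true},
		Group:         "default",
	},
}

// Authenticate parses an OpenAI-style "Bearer <key>" credential and
// resolves the user context that later middleware layers consume.
func Authenticate(authHeader string) (*UserContext, error) {
	key, ok := strings.CutPrefix(authHeader, "Bearer ")
	if !ok {
		return nil, errors.New("unsupported credential format")
	}
	ctx, ok := tokenTable[key]
	if !ok {
		return nil, errors.New("invalid token")
	}
	return &ctx, nil
}
```

The resolved context is what makes the downstream concerns (access control, feature gating, quota scoping, billing) possible without re-authenticating at each layer.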
4 Channel Distribution & Load Routing
The distribution layer handles channel selection based on:
Token configuration (explicit channels)
Weighted round-robin (implicit channels)
Model availability
Group permissions
Real-time channel health scores
During retries, the system excludes previously failed channels and deterministically selects alternatives. This adaptive routing ensures:
Graceful degradation
Intelligent failover
Avoidance of retry loops
Source:
distributor.go:28–123
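The weighted selection with failed-channel exclusion can be sketched as below. This is an illustration of the idea behind distributor.go, not its actual algorithm; `Channel`, `PickChannel`, and the seed-based selection are assumptions.

```go
package main

// Channel is a minimal stand-in for a routable upstream channel.
type Channel struct {
	ID     int
	Weight int
}

// PickChannel performs a deterministic weighted selection over the
// channels that have not already failed for this request. Returning nil
// means every candidate is exhausted and the request should abort.
func PickChannel(channels []Channel, failed map[int]bool, seed int) *Channel {
	total := 0
	for _, c := range channels {
		if !failed[c.ID] {
			total += c.Weight
		}
	}
	if total == 0 {
		return nil // no healthy candidates left
	}
	point := seed % total // deterministic for a given request seed
	for i := range channels {
		c := &channels[i]
		if failed[c.ID] {
			continue
		}
		if point < c.Weight {
			return c
		}
		point -= c.Weight
	}
	return nil
}
```

Because previously failed channels are removed from the weight pool before selection, a retry can never loop back onto the channel that just failed.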
5 Format Parsing & Validation
Supports native parsing for:
OpenAI (chat/completion/embedding)
Claude (messages, system prompts)
Gemini (generation, embedding)
Suno, Midjourney, Audio-to-Text (specialized schemas)
Each format is converted into a unified internal structure: RelayInfo. This holds all relevant metadata:
Model alias
Streaming flags
Token budget
Channel hints
Logging trace ID
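A simplified shape for the unified structure might look like the following. The field names here are illustrative; the gateway's real `RelayInfo` will carry more (and differently named) fields.

```go
package main

// RelayInfo sketches the unified internal request structure that every
// supported format is normalized into before routing.
type RelayInfo struct {
	ModelAlias  string // model name after alias resolution
	IsStream    bool   // streaming flag
	TokenBudget int    // pre-estimated token budget
	ChannelHint int    // explicit channel requested by the token; 0 = none
	TraceID     string // logging trace ID
}

// NewRelayInfo normalizes a parsed request of any upstream format into
// the shared structure the rest of the pipeline operates on.
func NewRelayInfo(model string, stream bool, budget int, traceID string) RelayInfo {
	return RelayInfo{
		ModelAlias:  model,
		IsStream:    stream,
		TokenBudget: budget,
		TraceID:     traceID,
	}
}
```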
6 Token & Cost Computation
The gateway performs pre-consumption estimation using model-specific tokenizers and modality-aware counting:
Text: Tokenized with per-model overhead handling
Images: Counted via dimension × quality heuristics
Audio: Duration-based pricing
Tools: Accounted as extra token blocks
Final cost is determined by:
Model price profile
User multiplier (e.g., developer tier)
Separate rates for prompt vs. completion tokens
This ensures quota safety and billing precision before hitting upstream APIs.
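Combining those factors, the pre-consumption estimate reduces to something like the sketch below. The `PriceProfile` shape and per-token pricing model are illustrative assumptions, not the gateway's actual schema.

```go
package main

// PriceProfile is a hypothetical per-model price profile: prompt and
// completion tokens are billed at separate rates.
type PriceProfile struct {
	PromptPerToken     float64
	CompletionPerToken float64
}

// EstimateCost prices the two token classes separately from the model
// profile, then scales by the per-user multiplier (e.g. developer tier).
func EstimateCost(p PriceProfile, promptTokens, completionTokens int, userMultiplier float64) float64 {
	base := float64(promptTokens)*p.PromptPerToken +
		float64(completionTokens)*p.CompletionPerToken
	return base * userMultiplier
}
```

Because this runs before the upstream call, the quota layer can reserve the estimated amount up front rather than discovering the cost after the fact.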
7 Quota Enforcement
Quota enforcement uses a pessimistic strategy:
Pre-consumption: Deduct quota based on estimation
Deferred Reconciliation: On failure, quota is refunded
Post-call Adjustment: Real usage reconciled post-response
Benefits:
Prevents runaway calls
Ensures billing consistency
Supports granular user-level metering
8 Intelligent Retry Engine
Retries are handled based on error classification:
5xx / 429: retry with a new route
Timeout: retry once
4xx: abort fast
Channel-specific errors trigger:
Logging to channel health registry
Optional automatic disablement
Alternate channel selection on retry
This provides self-healing, high-availability behavior under partial outages.
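The classification table above maps naturally to a small decision function. The constant names and the exact status-code boundaries here are illustrative, not the gateway's actual policy.

```go
package main

// RetryAction classifies an upstream failure into one of the retry
// policies described above.
type RetryAction int

const (
	RetryNewRoute RetryAction = iota // 5xx / 429: retry on a different channel
	RetryOnce                        // timeout: retry once
	AbortFast                        // other 4xx: client error, fail immediately
)

// ClassifyError decides the retry action from the upstream status code
// and whether the failure was a transport timeout.
func ClassifyError(statusCode int, isTimeout bool) RetryAction {
	switch {
	case isTimeout:
		return RetryOnce
	case statusCode == 429 || statusCode >= 500:
		return RetryNewRoute
	default:
		return AbortFast
	}
}
```

Routing 429s to a new route rather than the same one is what lets a rate-limited channel cool down while healthy channels absorb the traffic.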
9 Format-Specific Handlers
Requests are routed to dedicated adaptors:
OpenAI: WebSocket (realtime) + REST
Claude: prompt rewriting, “thinking” mode
Gemini: embedding / generation dispatch
Image: Midjourney relay, callback logic
Audio: Whisper / TTS handler
Adaptors abstract upstream calls and normalize responses into OpenAI format before streaming back to the client.
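The dispatch from request format to dedicated adaptor might be sketched as below. The `Adaptor` interface, the concrete types, and `GetAdaptor` are hypothetical names; real adaptors carry request-conversion and response-normalization methods rather than just an identifier.

```go
package main

import "errors"

// Adaptor is a minimal stand-in for the per-format handler contract:
// each upstream family converts requests out of, and normalizes
// responses back into, the OpenAI shape.
type Adaptor interface {
	Name() string
}

type openAIAdaptor struct{}

func (openAIAdaptor) Name() string { return "openai" }

type claudeAdaptor struct{}

func (claudeAdaptor) Name() string { return "claude" }

// GetAdaptor routes a request format to its dedicated handler; unknown
// formats fail fast rather than falling through to a default upstream.
func GetAdaptor(format string) (Adaptor, error) {
	switch format {
	case "openai":
		return openAIAdaptor{}, nil
	case "claude":
		return claudeAdaptor{}, nil
	default:
		return nil, errors.New("unsupported format: " + format)
	}
}
```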
10 Error Handling & Channel Stability
Error logs include request ID, trace path, payload shape, channel metrics
Automatic disabling thresholds prevent cascading channel failures
Re-enablement requires health check or admin override
Stats feed into performance dashboards and pricing policies
11 Streaming & Logging
Streaming responses are pipelined directly to the client with under 150 ms of added latency
Token, cost, duration, and channel metadata are recorded
Logs are shipped to internal analytics and billing engine for reconciliation
12 Specialized Relay Modes
The system supports asynchronous workflows for advanced use cases:
Midjourney: Image task queuing, status polling, retry
Suno / Video: Task-style async relay with callback registration
Tool Calls: Multimodal request proxying and function chaining
13 Summary
The Core Relay System is more than a gateway — it's an intelligent execution graph that abstracts away multi-provider complexity, enforces quota integrity, and optimizes request outcomes.
Built for production-scale agent networks, its modular middleware enables LLMs x420 to act as a programmable compute marketplace — one where access, pricing, and reliability are enforced at the protocol level.