Core Relay System

1 Overview

The Core Relay System is the execution backbone of the LLMs x420 Gateway. It orchestrates the full lifecycle of an AI service request, from authentication and request parsing through intelligent routing and upstream communication to response delivery, across diverse LLM providers such as OpenAI, Claude, and Gemini, as well as multimodal services including image and audio generation.

Its mission: to provide a standardized, OpenAI-compatible gateway that dynamically integrates multiple upstream intelligence providers via programmable routing, quota enforcement, and resilient middleware.


2 Architecture Design Principles

The system follows a layered middleware architecture, where each processing layer is responsible for a specific operational concern:

  • Authentication & Context Loading

  • Channel Routing & Load Distribution

  • API Format Normalization

  • Token Accounting & Quota Enforcement

  • Fault Tolerance & Retry

  • Format-Specific Dispatching

  • Logging & Real-Time Response

This modularity keeps the pipeline composable, observable, and resilient, which makes it well suited to high-throughput, agent-driven applications.
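
As a rough illustration of this layering, the sketch below composes the concerns as ordinary Go HTTP middleware. The chain helper and the layer names are hypothetical stand-ins, not the gateway's actual code.

```go
package main

import (
	"log"
	"net/http"
)

// Middleware wraps a handler with one operational concern.
type Middleware func(http.Handler) http.Handler

// chain applies middleware in order, outermost first.
func chain(h http.Handler, mw ...Middleware) http.Handler {
	for i := len(mw) - 1; i >= 0; i-- {
		h = mw[i](h)
	}
	return h
}

// layer returns a stub middleware that logs its concern; each stub stands
// in for a real layer such as authentication, routing, quota, or retry.
func layer(name string) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			log.Printf("enter layer: %s", name)
			next.ServeHTTP(w, r)
		})
	}
}

func main() {
	// Innermost handler: where format-specific dispatch would happen.
	relay := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("relayed\n"))
	})
	handler := chain(relay,
		layer("auth"), layer("distribute"), layer("normalize"),
		layer("quota"), layer("retry"),
	)
	log.Fatal(http.ListenAndServe(":8080", handler))
}
```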

Core Design Tenets:

  • Format Agnosticism: Seamless translation across OpenAI, Claude, Gemini, and custom schemas

  • Pessimistic Quota Control: Reserve-before-call model to guarantee cost containment

  • Smart Retry Logic: Built-in failover, adaptive retries, and channel auto-healing


3 Authentication Layer

The gateway supports multiple credential types, including:

  • Bearer Tokens (OpenAI-compatible)

  • WebSocket session headers

  • Claude API keys

  • Gemini credentials

  • Midjourney access secrets

Once authenticated, the system loads user metadata (e.g., quota class, allowed models, IP policy) and resolves access rights. This lays the groundwork for:

  • Dynamic access control

  • Group-level feature gating

  • Quota scope enforcement

  • Billing traceability
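
A minimal sketch of what this layer might look like in Go follows. UserMeta, lookupToken, and the Bearer-only handling are illustrative assumptions; the real gateway also accepts WebSocket session headers and provider-specific keys.

```go
package relay

import (
	"context"
	"net/http"
	"strings"
)

// UserMeta is a hypothetical view of the metadata loaded after authentication.
type UserMeta struct {
	QuotaClass    string
	AllowedModels []string
	Group         string
}

type userKey struct{}

// lookupToken stands in for the real credential store; any non-empty
// token resolves to a demo user here.
func lookupToken(token string) (*UserMeta, bool) {
	if token == "" {
		return nil, false
	}
	return &UserMeta{QuotaClass: "default", AllowedModels: []string{"gpt-4o"}, Group: "free"}, true
}

// authenticate resolves a Bearer token, loads the caller's metadata, and
// attaches it to the request context for the routing and quota layers.
func authenticate(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		meta, ok := lookupToken(token)
		if !ok {
			http.Error(w, "invalid credentials", http.StatusUnauthorized)
			return
		}
		ctx := context.WithValue(r.Context(), userKey{}, meta)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```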


4 Channel Distribution & Load Routing

The distribution layer handles channel selection based on:

  • Token configuration (explicit channels)

  • Weighted round-robin (implicit channels)

  • Model availability

  • Group permissions

  • Real-time channel health scores

During retries, the system excludes previously failed channels and deterministically selects alternatives. This adaptive routing ensures:

  • Graceful degradation

  • Intelligent failover

  • Avoidance of retry loops

Source: distributor.go:28–123
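
The sketch below shows one common way to realize weighted distribution with failed-channel exclusion, here as a weighted random draw over the remaining healthy channels. The types and field names are assumptions; the authoritative logic lives in distributor.go.

```go
package relay

import (
	"errors"
	"math/rand"
)

// Channel is a hypothetical upstream channel with a routing weight.
type Channel struct {
	ID     int
	Weight int
}

// pickChannel draws a channel in proportion to its weight, skipping any
// IDs that already failed during this request's retry sequence.
func pickChannel(channels []Channel, failed map[int]bool) (*Channel, error) {
	total := 0
	candidates := make([]Channel, 0, len(channels))
	for _, c := range channels {
		if failed[c.ID] || c.Weight <= 0 {
			continue // excluded: previously failed or zero-weighted
		}
		candidates = append(candidates, c)
		total += c.Weight
	}
	if total == 0 {
		return nil, errors.New("no healthy channel available")
	}
	// Pick a point in [0, total) and walk the cumulative weights.
	n := rand.Intn(total)
	for i := range candidates {
		if n < candidates[i].Weight {
			return &candidates[i], nil
		}
		n -= candidates[i].Weight
	}
	return nil, errors.New("unreachable")
}
```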


5 Format Parsing & Validation

Supports native parsing for:

  • OpenAI (chat/completion/embedding)

  • Claude (messages, system prompts)

  • Gemini (generation, embedding)

  • Suno, Midjourney, Audio-to-Text (specialized schemas)

Each format is converted into a unified internal structure: RelayInfo. This holds all relevant metadata:

  • Model alias

  • Streaming flags

  • Token budget

  • Channel hints

  • Logging trace ID
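
The struct below is an illustrative reconstruction of such a container, mapping the bullets above onto fields; the field names are assumptions rather than the actual definition.

```go
package relay

// RelayInfo is an illustrative reconstruction; field names are assumptions.
type RelayInfo struct {
	OriginModel   string // model alias as requested by the client
	UpstreamModel string // model name after alias resolution
	IsStream      bool   // streaming flag
	PromptTokens  int    // estimated token budget for the prompt
	MaxTokens     int    // completion budget, if the client set one
	ChannelHint   int    // preferred channel from token configuration (0 = none)
	TraceID       string // logging trace ID threaded through all layers
}
```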


6 Token & Cost Computation

The gateway performs pre-consumption estimation using model-specific tokenizers and modality-aware counting:

  • Text: Tokenized with per-model overhead handling

  • Images: Counted via dimension × quality heuristics

  • Audio: Duration-based pricing

  • Tools: Accounted as extra token blocks

Final cost is determined by:

  • Model price profile

  • User multiplier (e.g., developer tier)

  • Ratio of prompt vs completion tokens

This ensures quota safety and billing precision before hitting upstream APIs.
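
A worked sketch of the pricing arithmetic, assuming a ratio-based profile in which completion tokens are priced relative to prompt tokens and the result is scaled by model and user multipliers. The names and the exact formula shape are assumptions.

```go
package relay

// PriceProfile holds placeholder pricing knobs, not real values.
type PriceProfile struct {
	ModelRatio      float64 // per-model base price multiplier
	CompletionRatio float64 // completion tokens priced relative to prompt tokens
}

// estimateQuota turns token counts into a quota amount:
// (prompt + completion*completionRatio) * modelRatio * userMultiplier.
func estimateQuota(promptTokens, completionTokens int, p PriceProfile, userMultiplier float64) int64 {
	weighted := float64(promptTokens) + float64(completionTokens)*p.CompletionRatio
	return int64(weighted * p.ModelRatio * userMultiplier)
}
```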


7 Quota Enforcement

Quota enforcement uses a pessimistic strategy:

  • Pre-consumption: Quota is deducted up front, based on the estimate

  • Failure Refund: If the upstream call fails, the reserved quota is returned

  • Post-call Adjustment: Actual usage is reconciled against the estimate once the response completes

Benefits:

  • Prevents runaway calls

  • Ensures billing consistency

  • Supports granular user-level metering
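
A minimal in-memory sketch of this reserve-then-settle flow. The real backend is presumably a database; the type and method names here are invented for illustration.

```go
package relay

import (
	"errors"
	"sync"
)

// QuotaStore is a minimal in-memory stand-in for the real quota backend.
type QuotaStore struct {
	mu      sync.Mutex
	balance map[string]int64
}

func NewQuotaStore() *QuotaStore {
	return &QuotaStore{balance: make(map[string]int64)}
}

// Reserve deducts the estimated cost up front, so a call that would
// exceed the balance is rejected before any upstream request is made.
func (s *QuotaStore) Reserve(user string, estimate int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.balance[user] < estimate {
		return errors.New("quota exceeded")
	}
	s.balance[user] -= estimate
	return nil
}

// Settle reconciles after the response: refund the whole reservation on
// failure, otherwise settle the difference between estimate and actual.
func (s *QuotaStore) Settle(user string, estimate, actual int64, failed bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if failed {
		s.balance[user] += estimate
		return
	}
	s.balance[user] += estimate - actual // negative if usage exceeded the estimate
}
```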


8 Intelligent Retry Engine

Retries are handled based on error classification:

  • 5xx / 429: Retry with a new route

  • Timeout: Retry once

  • 4xx: Abort fast

Channel-specific errors trigger:

  • Logging to channel health registry

  • Optional automatic disablement

  • Alternate channel selection on retry

This provides self-healing, high-availability behavior under partial outages.
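
The classification above might be encoded roughly as follows. The timeout flag would come from the HTTP client error; whether a timeout retry keeps the same route is not specified here, so that detail is an assumption.

```go
package relay

import "net/http"

// retryDecision encodes the error classification above.
func retryDecision(status int, isTimeout bool, attempt int) (retry, newRoute bool) {
	switch {
	case isTimeout:
		return attempt == 0, false // retry once, on the same route (assumed)
	case status == http.StatusTooManyRequests || status >= 500:
		return true, true // retry with a new route
	case status >= 400:
		return false, false // client error: abort fast
	default:
		return false, false
	}
}
```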


9 Format-Specific Handlers

Requests are routed to dedicated adaptors:

  • OpenAI: WebSocket (realtime) + REST

  • Claude: Prompt rewriting, “thinking” mode

  • Gemini: Embedding / generation dispatch

  • Image: Midjourney relay, callback logic

  • Audio: Whisper / TTS handler

Adaptors abstract upstream calls and normalize responses into the OpenAI format before streaming back to the client.
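
A sketch of what such an adaptor contract could look like, reusing the hypothetical RelayInfo from Section 5. The interface shape and the pass-through OpenAI adaptor are assumptions, not the gateway's actual API.

```go
package relay

import (
	"bytes"
	"context"
	"io"
	"net/http"
)

// Adaptor is a sketch of the per-format handler contract: build the
// upstream request from the unified metadata, then normalize the
// provider response back into OpenAI-style JSON.
type Adaptor interface {
	BuildRequest(ctx context.Context, info *RelayInfo, body []byte) (*http.Request, error)
	ParseResponse(resp *http.Response, info *RelayInfo) ([]byte, error)
}

// openAIAdaptor can pass requests through nearly unchanged, since the
// gateway's wire format is already OpenAI-compatible.
type openAIAdaptor struct{ upstream string }

func (a *openAIAdaptor) BuildRequest(ctx context.Context, info *RelayInfo, body []byte) (*http.Request, error) {
	url := a.upstream + "/v1/chat/completions"
	return http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
}

func (a *openAIAdaptor) ParseResponse(resp *http.Response, info *RelayInfo) ([]byte, error) {
	defer resp.Body.Close()
	return io.ReadAll(resp.Body) // already in the target format
}
```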


10 Error Handling & Channel Stability

  • Error logs include request ID, trace path, payload shape, channel metrics

  • Automatic disabling thresholds prevent cascading channel failures

  • Re-enablement requires health check or admin override

  • Stats feed into performance dashboards and pricing policies
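
A toy version of a consecutive-failure breaker, illustrating the disabling-threshold idea; the threshold value and field names are placeholders, not the gateway's real settings.

```go
package relay

import "sync/atomic"

const disableThreshold = 5 // placeholder, not the gateway's real setting

// healthCounter trips a channel into the disabled state after a run of
// consecutive failures; re-enablement is left to a health check or admin.
type healthCounter struct {
	consecutiveFails atomic.Int64
	disabled         atomic.Bool
}

func (h *healthCounter) record(success bool) {
	if success {
		h.consecutiveFails.Store(0)
		return
	}
	if h.consecutiveFails.Add(1) >= disableThreshold {
		h.disabled.Store(true)
	}
}
```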


11 Streaming & Logging

  • Streaming responses are pipelined directly to the client with less than 150 ms of delay

  • Token, cost, duration, and channel metadata are recorded

  • Logs are shipped to internal analytics and the billing engine for reconciliation
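
A minimal sketch of the streaming pipeline using Go's http.Flusher, the standard mechanism for pushing server-sent events incrementally; the function name is illustrative.

```go
package relay

import (
	"bufio"
	"io"
	"net/http"
)

// pipeStream forwards upstream SSE lines to the client as they arrive,
// flushing after every line so tokens are visible with minimal delay.
func pipeStream(w http.ResponseWriter, upstream io.Reader) error {
	flusher, ok := w.(http.Flusher)
	if !ok {
		return http.ErrNotSupported
	}
	w.Header().Set("Content-Type", "text/event-stream")
	scanner := bufio.NewScanner(upstream)
	for scanner.Scan() {
		if _, err := w.Write(append(scanner.Bytes(), '\n')); err != nil {
			return err
		}
		flusher.Flush() // push the chunk immediately instead of buffering
	}
	return scanner.Err()
}
```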


12 Specialized Relay Modes

The system supports asynchronous workflows for advanced use cases:

  • Midjourney: Image task queuing, status polling, retry

  • Suno / Video: Task-style async relay with callback registration

  • Tool Calls: Multimodal request proxying and function chaining
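
A sketch of the task-style async pattern these modes share: submit a job, then poll until completion or cancellation. The TaskClient interface is hypothetical.

```go
package relay

import (
	"context"
	"time"
)

// TaskClient is a hypothetical interface over an async provider such as
// Midjourney: Submit returns a task ID, Poll reports progress.
type TaskClient interface {
	Submit(ctx context.Context, payload []byte) (taskID string, err error)
	Poll(ctx context.Context, taskID string) (done bool, result []byte, err error)
}

// relayTask submits a job, then polls until it completes or the context
// is cancelled, mirroring the queue-and-poll flow described above.
func relayTask(ctx context.Context, c TaskClient, payload []byte, every time.Duration) ([]byte, error) {
	id, err := c.Submit(ctx, payload)
	if err != nil {
		return nil, err
	}
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-ticker.C:
			done, result, err := c.Poll(ctx, id)
			if err != nil {
				return nil, err
			}
			if done {
				return result, nil
			}
		}
	}
}
```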


13 Summary

The Core Relay System is more than a gateway — it's an intelligent execution graph that abstracts away multi-provider complexity, enforces quota integrity, and optimizes request outcomes.

Built for production-scale agent networks, its modular middleware enables LLMs x420 to act as a programmable compute marketplace — one where access, pricing, and reliability are enforced at the protocol level.
