
Inference Routing

When a buyer sends an inference request, the INFER platform selects the best available node to handle it. This page explains the routing algorithm and selection criteria.

How Routing Works

The routing system uses a weighted selection algorithm that balances performance, cost, and reliability.

Step 1: Filter by Model

Only nodes that have the requested model loaded are considered. If a buyer requests llama3.1:70b, only nodes with that exact model registered are eligible.
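The model filter can be sketched as a simple predicate over the node pool. The node record shape (an `id` plus a `models` set) is an assumption for illustration; the platform's actual schema may differ.

```python
def filter_by_model(nodes, requested_model):
    """Return only the nodes that have the exact requested model registered."""
    return [node for node in nodes if requested_model in node["models"]]

# Hypothetical node records for illustration
nodes = [
    {"id": "node-a", "models": {"llama3.1:70b", "llama3.1:8b"}},
    {"id": "node-b", "models": {"mistral:7b"}},
]
eligible = filter_by_model(nodes, "llama3.1:70b")
# eligible contains only node-a
```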

Step 2: Filter by Health

Nodes must pass health checks to be eligible:

  • Online status — Node responded to a health check within the last 60 seconds
  • Success rate — Nodes with less than 90% success rate are excluded
  • Endpoint reachable — TLS endpoint must be accessible from the platform
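The three checks above can be combined into a single health gate. This is an illustrative sketch: the field names (`last_seen`, `success_rate`, `endpoint_reachable`) are assumptions, while the 60-second window and 90% threshold come from the rules above.

```python
import time

def is_healthy(node, now=None):
    """Apply Step 2's three health checks to one node record."""
    now = time.time() if now is None else now
    return (
        now - node["last_seen"] <= 60          # online within the last 60 s
        and node["success_rate"] >= 0.90       # at least 90% success rate
        and node["endpoint_reachable"]         # TLS endpoint is accessible
    )
```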

Step 3: Score and Rank

Eligible nodes are scored on three factors:

  Factor    Weight    Description
  Latency   40%       Average response time over the last hour
  Load      35%       Current active request count relative to capacity
  Price     25%       Operator’s price per 1K tokens

Step 4: Weighted Random Selection

The top 3 nodes by score are selected using weighted random choice. This prevents hot-spotting, where a single node would otherwise receive all traffic.

  scores = [0.92, 0.87, 0.83]
  weights = normalize(scores)  # [0.35, 0.33, 0.32]
  selected = weighted_random(top_3, weights)
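The snippet can be made runnable. How the platform actually implements `normalize` and `weighted_random` is not specified here; the versions below (score-proportional weights via `random.choices`) are a reasonable assumption.

```python
import random

def normalize(scores):
    """Scale scores so they sum to 1 and can serve as selection weights."""
    total = sum(scores)
    return [s / total for s in scores]

def weighted_random(nodes, weights):
    """Pick one node, with probability proportional to its weight."""
    return random.choices(nodes, weights=weights, k=1)[0]

top_3 = ["node-a", "node-b", "node-c"]
weights = normalize([0.92, 0.87, 0.83])   # ≈ [0.35, 0.33, 0.32]
selected = weighted_random(top_3, weights)
```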

Failover

If the selected node fails to respond within 10 seconds:

  1. The request is retried with the next-best node
  2. A maximum of 2 retries is attempted
  3. If all retries fail, the buyer receives a 503 Service Unavailable error
  4. The failing node’s health score is penalized
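The failover policy above can be sketched as a retry loop. `send_request` and `penalize` are hypothetical helpers standing in for the platform's transport and health-scoring layers; the 10-second timeout and 2-retry cap come from the rules above.

```python
MAX_RETRIES = 2   # initial attempt plus at most 2 retries
TIMEOUT_S = 10    # per-attempt response deadline

def route_with_failover(ranked_nodes, request, send_request, penalize):
    """Try the best-ranked nodes in order, penalizing each node that times out."""
    attempts = ranked_nodes[: MAX_RETRIES + 1]
    for node in attempts:
        try:
            return send_request(node, request, timeout=TIMEOUT_S)
        except TimeoutError:
            penalize(node)   # lower the failing node's health score
    raise RuntimeError("503 Service Unavailable: all retries failed")
```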

Preferred Providers

Buyers can optionally set a preferred provider through the marketplace. When a preferred provider is set:

  1. If the preferred node is online and has the requested model, it’s selected
  2. If the preferred node is unavailable, standard routing takes over
  3. Preferred routing does not override health checks — unhealthy nodes are still excluded
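The preferred-provider rules can be sketched as a pre-check before standard routing. The node record fields and the simplified `standard_route` (which stands in for the filtering, scoring, and weighted selection described above) are illustrative assumptions.

```python
def standard_route(nodes, request):
    """Placeholder for the normal filter/score/select pipeline."""
    candidates = [n for n in nodes
                  if request["model"] in n["models"] and n["healthy"]]
    return candidates[0] if candidates else None

def route(nodes, request, preferred_id=None):
    """Honor a preferred provider when eligible, else fall back to standard routing."""
    if preferred_id is not None:
        for node in nodes:
            if (node["id"] == preferred_id
                    and request["model"] in node["models"]
                    and node["healthy"]):        # health checks still apply
                return node
    return standard_route(nodes, request)
```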

Model Availability

The platform maintains a real-time index of which models are available on which nodes. This index is updated:

  • When a node registers or updates its model list
  • During periodic health checks (models are re-scanned)
  • When an operator loads or unloads a model
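One way to picture the availability index is as a mapping from model name to the set of nodes serving it, rebuilt for a node whenever any of the events above fires. The class and method names here are illustrative, not the platform's actual API.

```python
from collections import defaultdict

class ModelIndex:
    """Real-time model -> node-id index, updated on registration and re-scans."""

    def __init__(self):
        self._index = defaultdict(set)

    def set_models(self, node_id, models):
        """Replace a node's model list (registration, re-scan, or load/unload)."""
        for node_ids in self._index.values():
            node_ids.discard(node_id)            # drop stale entries first
        for model in models:
            self._index[model].add(node_id)

    def nodes_for(self, model):
        """Return the ids of all nodes currently serving the model."""
        return set(self._index[model])
```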

Latency Considerations

End-to-end latency includes:

  Component                      Typical Latency
  Platform routing overhead      10–30ms
  Network (platform → node)      20–100ms
  LLM inference (first token)    200–2000ms
  LLM inference (per token)      10–50ms

The routing algorithm optimizes for time to first token, which is the most noticeable latency component for streaming responses.
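Taking the midpoint of each range in the table gives a rough back-of-the-envelope estimate of time to first token and of a full streamed response; the midpoints and the 100-token response length are illustrative assumptions.

```python
# Midpoints of the typical-latency ranges above (milliseconds)
routing_ms = 20        # platform routing overhead, 10–30ms
network_ms = 60        # network platform -> node, 20–100ms
first_token_ms = 1100  # LLM inference, first token, 200–2000ms
per_token_ms = 30      # LLM inference, per token, 10–50ms

# Time to first token: everything before the stream starts flowing
ttft_ms = routing_ms + network_ms + first_token_ms        # 1180 ms

# A 100-token streamed response adds 99 more per-token steps
total_100_tokens_ms = ttft_ms + 99 * per_token_ms         # 4150 ms
```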