# Inference Routing
When a buyer sends an inference request, the INFER platform selects the best available node to handle it. This page explains the routing algorithm and selection criteria.
## How Routing Works
The routing system uses a weighted selection algorithm that balances performance, cost, and reliability.
### Step 1: Filter by Model
Only nodes that have the requested model loaded are considered. If a buyer requests `llama3.1:70b`, only nodes with that exact model registered are eligible.
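As a minimal sketch of this filter (the `Node` shape and helper name are assumptions, not the platform's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    models: set[str] = field(default_factory=set)

def filter_by_model(nodes: list[Node], model: str) -> list[Node]:
    # Only exact model-name matches are eligible, e.g. "llama3.1:70b".
    return [n for n in nodes if model in n.models]
```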
### Step 2: Filter by Health
Nodes must pass all of the following health checks to be eligible (a sketch follows the list):
- Online status — Node responded to the last health check (within 60 seconds)
- Success rate — Nodes with a success rate below 90% are excluded
- Endpoint reachable — TLS endpoint must be accessible from the platform
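A hedged sketch of these checks, assuming hypothetical `last_health_check_ts`, `success_rate`, and `endpoint_reachable` fields on each node:

```python
import time

HEALTH_CHECK_WINDOW_S = 60  # node answered a health check within the last 60 s
MIN_SUCCESS_RATE = 0.90     # nodes below a 90% success rate are excluded

def is_healthy(node, now=None):
    now = time.time() if now is None else now
    return (
        now - node.last_health_check_ts <= HEALTH_CHECK_WINDOW_S
        and node.success_rate >= MIN_SUCCESS_RATE
        and node.endpoint_reachable  # TLS endpoint accessible from the platform
    )
```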
### Step 3: Score and Rank
Eligible nodes are scored on three factors (an illustrative scoring function follows the table):
| Factor | Weight | Description |
|---|---|---|
| Latency | 40% | Average response time over the last hour |
| Load | 35% | Current active request count relative to capacity |
| Price | 25% | Operator’s price per 1K tokens |
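The exact normalization is not documented here, so the following scoring function is only illustrative: it assumes each factor is scaled to [0, 1] with higher meaning better, then combined with the weights above (field names are also assumptions):

```python
WEIGHTS = {"latency": 0.40, "load": 0.35, "price": 0.25}

def score(node, max_latency_ms, max_price):
    # Scale each factor so 1.0 is best and 0.0 is worst (assumed scheme).
    latency_score = 1.0 - min(node.avg_latency_ms / max_latency_ms, 1.0)
    load_score = max(0.0, 1.0 - node.active_requests / node.capacity)
    price_score = 1.0 - min(node.price_per_1k_tokens / max_price, 1.0)
    return (
        WEIGHTS["latency"] * latency_score
        + WEIGHTS["load"] * load_score
        + WEIGHTS["price"] * price_score
    )
```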
### Step 4: Weighted Random Selection
One of the top 3 nodes by score is chosen by weighted random selection. This prevents hot-spotting, where a single node would receive all traffic.
```python
import random

top_3 = ["node-a", "node-b", "node-c"]       # top 3 node IDs by score (illustrative)
scores = [0.92, 0.87, 0.83]
weights = [s / sum(scores) for s in scores]  # normalize -> [0.35, 0.33, 0.32]
selected = random.choices(top_3, weights=weights, k=1)[0]
```

## Failover
If the selected node fails to respond within 10 seconds, failover kicks in (sketched after the list):
- The request is retried with the next-best node
- A maximum of 2 retries is attempted
- If all retries fail, the buyer receives a `503 Service Unavailable` error
- The failing node's health score is penalized
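A minimal sketch of that failover loop; `send_request` and `penalize_health` are hypothetical placeholders for the platform's internals:

```python
MAX_RETRIES = 2   # selected node plus up to 2 next-best nodes
TIMEOUT_S = 10    # per-attempt response deadline

class ServiceUnavailable(Exception):
    """Surfaced to the buyer as a 503 once all retries fail."""

def route_with_failover(ranked_nodes, request, send_request, penalize_health):
    for node in ranked_nodes[: MAX_RETRIES + 1]:
        try:
            return send_request(node, request, timeout=TIMEOUT_S)
        except TimeoutError:
            penalize_health(node)  # lower the failing node's health score
    raise ServiceUnavailable("all eligible nodes failed to respond")
```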
## Preferred Providers
Buyers can optionally set a preferred provider through the marketplace. When a preferred provider is set, the following rules apply (sketched in code after the list):
- If the preferred node is online and has the requested model, it’s selected
- If the preferred node is unavailable, standard routing takes over
- Preferred routing does not override health checks — unhealthy nodes are still excluded
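Sketched in code, reusing the `is_healthy` helper from above (`standard_routing` stands in for steps 1–4 and is an assumption):

```python
def select_node(nodes, model, preferred_id=None):
    # Health checks are never overridden, even for a preferred provider.
    eligible = [n for n in nodes if model in n.models and is_healthy(n)]
    if preferred_id is not None:
        for node in eligible:
            if node.node_id == preferred_id:
                return node            # online, healthy, and has the model
    return standard_routing(eligible)  # fall back to score-and-rank routing
```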
## Model Availability
The platform maintains a real-time index of which models are available on which nodes. This index is updated (a toy version is sketched after the list):
- When a node registers or updates its model list
- During periodic health checks (models are re-scanned)
- When an operator loads or unloads a model
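A toy in-process version of such an index; a real deployment would presumably back this with the platform's registry, so treat all names here as assumptions:

```python
from collections import defaultdict

class ModelIndex:
    """Maps model name -> set of node IDs currently serving it."""

    def __init__(self):
        self._index = defaultdict(set)

    def update_node(self, node_id, models):
        # Called on registration, periodic re-scan, or model load/unload.
        for nodes in self._index.values():
            nodes.discard(node_id)
        for model in models:
            self._index[model].add(node_id)

    def nodes_for(self, model):
        return set(self._index.get(model, set()))
```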
## Latency Considerations
End-to-end latency includes:
| Component | Typical Latency |
|---|---|
| Platform routing overhead | 10–30ms |
| Network (platform → node) | 20–100ms |
| LLM inference (first token) | 200–2000ms |
| LLM inference (per token) | 10–50ms |
The routing algorithm optimizes for time to first token, which is the most noticeable latency component for streaming responses.
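As a worked example with mid-range values from the table above: 20 ms of routing overhead, 60 ms of network latency, and a 500 ms first-token time put the first streamed token at roughly 580 ms, after which each additional token adds only 10–50 ms. (These figures are illustrative, not measurements.)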