Architecture
INFER is a decentralized inference marketplace built on a hub-and-spoke architecture. The central platform handles routing, billing, and discovery while node operators run LLM runtimes on their own hardware.
System Overview
```
┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Buyers    │────▶│  INFER Platform  │────▶│  Node Operators │
│ (API Keys)  │◀────│   (Cloud Run)    │◀────│ (Ollama / Exo)  │
└─────────────┘     └──────────────────┘     └─────────────────┘
                             │
                      ┌──────┴──────┐
                      │  Cloud SQL  │
                      │ (PostgreSQL)│
                      └─────────────┘
```

Components
INFER Platform (Cloud Run)
The central platform is a Next.js application deployed on Google Cloud Run. It handles:
- User authentication — Email/password and Google OAuth via NextAuth
- API key management — Create, rotate, and revoke API keys for programmatic access
- Inference routing — Receives requests from buyers and forwards them to the best available node
- Billing & metering — Tracks token usage, deducts balances, and calculates operator earnings
- Node registry — Maintains a directory of all registered nodes with their capabilities
- Marketplace — Browsable provider directory with ratings and reviews
Node Operators
Node operators run LLM inference runtimes on their own hardware. Supported runtimes:
- Ollama — Runs on port 11434, supports a wide range of open models
- Exo — Runs on port 52415, optimized for distributed inference across devices
Operators expose their runtimes through a public endpoint (via TLS reverse proxy) and register with the INFER platform.
Desktop App (Electron)
The macOS desktop app provides a native interface for node operators:
- Auto-detects local LLM runtimes (Ollama, Exo)
- One-click node registration with GPU auto-detection
- System tray status monitoring
- Native notifications for earnings and status changes
- Auto-updates via GCS feed
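The runtime auto-detection above can be sketched as a port probe against each runtime's default local port (11434 for Ollama, 52415 for Exo, both stated earlier in this document). This is a minimal illustration, not the desktop app's actual implementation; the function names and timeout are hypothetical.

```typescript
// Hypothetical sketch of runtime auto-detection: probe each runtime's
// default localhost port and report which ones respond.
type Runtime = { name: "ollama" | "exo"; port: number };

const KNOWN_RUNTIMES: Runtime[] = [
  { name: "ollama", port: 11434 }, // Ollama's default port
  { name: "exo", port: 52415 },    // Exo's default port
];

async function probe(port: number, timeoutMs = 1000): Promise<boolean> {
  try {
    const res = await fetch(`http://localhost:${port}/`, {
      signal: AbortSignal.timeout(timeoutMs), // unreachable ports fail fast
    });
    return res.ok;
  } catch {
    return false; // connection refused or timed out → runtime not running
  }
}

async function detectRuntimes(): Promise<Runtime[]> {
  const results = await Promise.all(KNOWN_RUNTIMES.map((r) => probe(r.port)));
  return KNOWN_RUNTIMES.filter((_, i) => results[i]);
}
```

In practice the app would also query the detected runtime for its installed models before pre-filling the registration form.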
Database (Cloud SQL)
PostgreSQL database storing:
- User accounts and sessions
- API keys (encrypted at rest)
- Node registrations and health status
- Usage records and billing transactions
- Audit logs
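To make the usage-record and billing relationship concrete, here is an illustrative shape for a usage row and a per-1K-token metering calculation. The field names and pricing model are assumptions for the sketch; the real schema is not documented here.

```typescript
// Hypothetical shape of a usage record; the actual columns are not
// documented in this architecture overview.
interface UsageRecord {
  apiKeyId: string;
  nodeId: string;
  promptTokens: number;
  completionTokens: number;
}

// Illustrative metering: price expressed in cents per 1K tokens,
// rounded up so partial thousands are still billed.
function costCents(u: UsageRecord, centsPer1kTokens: number): number {
  const tokens = u.promptTokens + u.completionTokens;
  return Math.ceil((tokens / 1000) * centsPer1kTokens);
}
```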
Data Flow
Inference Request Flow
- Buyer sends a chat completion request to `POST /api/v1/chat/completions` with their API key
- Platform validates the API key and checks the user’s balance
- Router selects the best available node based on model availability, latency, and load
- Platform forwards the request to the selected node’s endpoint
- Node processes the request using its local LLM runtime and streams tokens back
- Platform streams the response to the buyer while counting tokens
- Billing deducts the cost from the buyer’s balance and credits the operator
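Step 1 of the flow above can be sketched from the buyer's side. Only the endpoint path comes from this document; the Bearer authorization scheme and OpenAI-style request body are assumptions for illustration.

```typescript
// Hypothetical buyer-side helper for step 1 of the inference flow.
// The Authorization header format and body shape are assumptions.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  prompt: string,
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: `${baseUrl}/api/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed auth scheme
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
        stream: true, // tokens stream back while the platform meters them
      }),
    },
  };
}
```

Usage would then be `const { url, init } = buildChatRequest(...)` followed by `await fetch(url, init)` and reading the streamed response body.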
Node Registration Flow
- Operator installs a supported LLM runtime and downloads models
- Desktop app auto-detects the runtime and available models
- Operator clicks “Register” — the app pre-fills GPU info, models, and endpoint
- Platform creates the node record and begins health monitoring
- Health checks run every 60 seconds to verify the node is reachable
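The 60-second health check in step 5 might look like the sketch below: poll each registered node's endpoint and mark it healthy only if it answers in time. The types, timeout, and loop structure are illustrative assumptions, not the platform's actual code.

```typescript
// Hypothetical health monitor: mark a node healthy if its endpoint
// answers within the timeout, unhealthy otherwise.
type NodeRecord = { id: string; endpoint: string; healthy: boolean };

async function checkNode(node: NodeRecord, timeoutMs = 5000): Promise<NodeRecord> {
  try {
    const res = await fetch(node.endpoint, {
      signal: AbortSignal.timeout(timeoutMs), // treat slow nodes as down
    });
    return { ...node, healthy: res.ok };
  } catch {
    return { ...node, healthy: false }; // unreachable or timed out
  }
}

function startHealthLoop(getNodes: () => NodeRecord[], report: (n: NodeRecord) => void) {
  return setInterval(async () => {
    const results = await Promise.all(getNodes().map((n) => checkNode(n)));
    results.forEach(report);
  }, 60_000); // 60-second cadence from the registration flow
}
```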
Security
- All API keys are encrypted with AES-256-GCM before storage
- Session tokens use NextAuth with JWT strategy
- Database connections use Cloud SQL Auth Proxy (IAM-based)
- Rate limiting protects all endpoints (Upstash Redis in production)
- CORS restricts API access to authorized origins
- CSP headers prevent XSS and injection attacks
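AES-256-GCM encryption of an API key can be sketched with Node's built-in `crypto` module. This is a minimal illustration of the primitive named above, not the platform's implementation; key management (where the 32-byte key lives) is out of scope.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Minimal AES-256-GCM sketch for encrypting an API key at rest.
// Output layout: base64( IV(12) || auth tag(16) || ciphertext ).
function encryptApiKey(plain: string, key: Buffer): string {
  const iv = randomBytes(12); // 96-bit nonce, standard for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16-byte integrity tag
  return Buffer.concat([iv, tag, ct]).toString("base64");
}

function decryptApiKey(encoded: string, key: Buffer): string {
  const buf = Buffer.from(encoded, "base64");
  const decipher = createDecipheriv("aes-256-gcm", key, buf.subarray(0, 12));
  decipher.setAuthTag(buf.subarray(12, 28)); // verify integrity on final()
  return Buffer.concat([
    decipher.update(buf.subarray(28)),
    decipher.final(),
  ]).toString("utf8");
}
```

GCM's auth tag means a tampered ciphertext fails decryption outright rather than yielding garbage, which is why it is preferred over unauthenticated modes for secrets at rest.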