Architecture
INFER is a decentralized inference marketplace built on a hub-and-spoke architecture. The central platform handles routing, billing, and discovery while node operators run LLM runtimes on their own hardware.
System Overview
```
┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Buyers    │────▶│  INFER Platform  │────▶│  Node Operators │
│ (API Keys)  │◀────│   (Cloud Run)    │◀────│ (Ollama / Exo)  │
└─────────────┘     └──────────────────┘     └─────────────────┘
                             │
                      ┌──────┴──────┐
                      │  Cloud SQL  │
                      │ (PostgreSQL)│
                      └─────────────┘
```

Components
INFER Platform (Cloud Run)
The central platform is a Next.js application deployed on Google Cloud Run. It handles:
- User authentication — Email/password and Google OAuth via NextAuth
- API key management — Create, rotate, and revoke API keys for programmatic access
- Inference routing — Receives requests from buyers and forwards them to the best available node
- Billing & metering — Tracks token usage, deducts balances, and calculates operator earnings
- Node registry — Maintains a directory of all registered nodes with their capabilities
- Marketplace — Browsable provider directory with ratings and reviews
Node Operators
Node operators run LLM inference runtimes on their own hardware. Supported runtimes:
- Ollama — Runs on port 11434, supports a wide range of open models
- Exo — Runs on port 52415, optimized for distributed inference across devices
Operators expose their runtimes through a public endpoint (via TLS reverse proxy) and register with the INFER platform.
Desktop App (Electron)
The macOS desktop app provides a native interface for node operators:
- Auto-detects local LLM runtimes (Ollama, Exo)
- One-click node registration with GPU auto-detection
- System tray status monitoring
- Native notifications for earnings and status changes
- Auto-updates via GCS feed
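The runtime auto-detection above can be sketched as a port probe against each runtime's default local port (11434 for Ollama, 52415 for Exo, both stated earlier in this document). This is a minimal illustration, not the desktop app's actual implementation; the function names and timeout are hypothetical.

```typescript
// Hypothetical sketch of runtime auto-detection: probe each runtime's
// default localhost port and report which ones respond.
type Runtime = { name: "ollama" | "exo"; port: number };

const KNOWN_RUNTIMES: Runtime[] = [
  { name: "ollama", port: 11434 }, // Ollama's default port
  { name: "exo", port: 52415 },    // Exo's default port
];

async function probe(port: number, timeoutMs = 1000): Promise<boolean> {
  try {
    const res = await fetch(`http://localhost:${port}/`, {
      signal: AbortSignal.timeout(timeoutMs), // unreachable ports fail fast
    });
    return res.ok;
  } catch {
    return false; // connection refused or timed out → runtime not running
  }
}

async function detectRuntimes(): Promise<Runtime[]> {
  const results = await Promise.all(KNOWN_RUNTIMES.map((r) => probe(r.port)));
  return KNOWN_RUNTIMES.filter((_, i) => results[i]);
}
```

In practice the app would also query the detected runtime for its installed models before pre-filling the registration form.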
Database (Cloud SQL)
PostgreSQL database storing:
- User accounts and sessions
- API keys (encrypted at rest)
- Node registrations and health status
- Usage records and billing transactions
- Audit logs
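To make the usage-record and billing relationship concrete, here is an illustrative shape for a usage row and a per-1K-token metering calculation. The field names and pricing model are assumptions for the sketch; the real schema is not documented here.

```typescript
// Hypothetical shape of a usage record; the actual columns are not
// documented in this architecture overview.
interface UsageRecord {
  apiKeyId: string;
  nodeId: string;
  promptTokens: number;
  completionTokens: number;
}

// Illustrative metering: price expressed in cents per 1K tokens,
// rounded up so partial thousands are still billed.
function costCents(u: UsageRecord, centsPer1kTokens: number): number {
  const tokens = u.promptTokens + u.completionTokens;
  return Math.ceil((tokens / 1000) * centsPer1kTokens);
}
```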
Data Flow
Inference Request Flow
- Buyer sends a chat completion request to `POST /api/v1/chat/completions` with their API key
- Platform validates the API key and checks the user’s balance
- Router selects the best available node based on model availability, latency, and load
- Platform forwards the request to the selected node’s endpoint
- Node processes the request using its local LLM runtime and streams tokens back
- Platform streams the response to the buyer while counting tokens
- Billing deducts the cost from the buyer’s balance and credits the operator
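Step 1 of the flow above can be sketched from the buyer's side. Only the endpoint path comes from this document; the Bearer authorization scheme and OpenAI-style request body are assumptions for illustration.

```typescript
// Hypothetical buyer-side helper for step 1 of the inference flow.
// The Authorization header format and body shape are assumptions.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  prompt: string,
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: `${baseUrl}/api/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed auth scheme
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
        stream: true, // tokens stream back while the platform meters them
      }),
    },
  };
}
```

Usage would then be `const { url, init } = buildChatRequest(...)` followed by `await fetch(url, init)` and reading the streamed response body.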
Node Registration Flow
- Operator installs a supported LLM runtime and downloads models
- Desktop app auto-detects the runtime and available models
- Operator clicks “Register” — the app pre-fills GPU info, models, and endpoint
- Platform creates the node record and begins health monitoring
- Health checks run every 60 seconds to verify the node is reachable
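The 60-second health check in step 5 might look like the sketch below: poll each registered node's endpoint and mark it healthy only if it answers in time. The types, timeout, and loop structure are illustrative assumptions, not the platform's actual code.

```typescript
// Hypothetical health monitor: mark a node healthy if its endpoint
// answers within the timeout, unhealthy otherwise.
type NodeRecord = { id: string; endpoint: string; healthy: boolean };

async function checkNode(node: NodeRecord, timeoutMs = 5000): Promise<NodeRecord> {
  try {
    const res = await fetch(node.endpoint, {
      signal: AbortSignal.timeout(timeoutMs), // treat slow nodes as down
    });
    return { ...node, healthy: res.ok };
  } catch {
    return { ...node, healthy: false }; // unreachable or timed out
  }
}

function startHealthLoop(getNodes: () => NodeRecord[], report: (n: NodeRecord) => void) {
  return setInterval(async () => {
    const results = await Promise.all(getNodes().map((n) => checkNode(n)));
    results.forEach(report);
  }, 60_000); // 60-second cadence from the registration flow
}
```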
Security
- All API keys are encrypted with AES-256-GCM before storage
- Session tokens use NextAuth with JWT strategy
- Database connections use Cloud SQL Auth Proxy (IAM-based)
- Rate limiting protects all endpoints (Upstash Redis in production)
- CORS restricts API access to authorized origins
- CSP headers prevent XSS and injection attacks
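AES-256-GCM encryption of an API key can be sketched with Node's built-in `crypto` module. This is a minimal illustration of the primitive named above, not the platform's implementation; key management (where the 32-byte key lives) is out of scope.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Minimal AES-256-GCM sketch for encrypting an API key at rest.
// Output layout: base64( IV(12) || auth tag(16) || ciphertext ).
function encryptApiKey(plain: string, key: Buffer): string {
  const iv = randomBytes(12); // 96-bit nonce, standard for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16-byte integrity tag
  return Buffer.concat([iv, tag, ct]).toString("base64");
}

function decryptApiKey(encoded: string, key: Buffer): string {
  const buf = Buffer.from(encoded, "base64");
  const decipher = createDecipheriv("aes-256-gcm", key, buf.subarray(0, 12));
  decipher.setAuthTag(buf.subarray(12, 28)); // verify integrity on final()
  return Buffer.concat([
    decipher.update(buf.subarray(28)),
    decipher.final(),
  ]).toString("utf8");
}
```

GCM's auth tag means a tampered ciphertext fails decryption outright rather than yielding garbage, which is why it is preferred over unauthenticated modes for secrets at rest.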