
Architecture

INFER is a decentralized inference marketplace built on a hub-and-spoke architecture. The central platform handles routing, billing, and discovery, while node operators run LLM inference runtimes on their own hardware.

System Overview

┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Buyers    │────▶│  INFER Platform  │────▶│ Node Operators  │
│ (API Keys)  │◀────│   (Cloud Run)    │◀────│ (Ollama / Exo)  │
└─────────────┘     └──────────────────┘     └─────────────────┘
                             │
                      ┌──────┴──────┐
                      │  Cloud SQL  │
                      │ (PostgreSQL)│
                      └─────────────┘

Components

INFER Platform (Cloud Run)

The central platform is a Next.js application deployed on Google Cloud Run. It handles:

  • User authentication — Email/password and Google OAuth via NextAuth
  • API key management — Create, rotate, and revoke API keys for programmatic access
  • Inference routing — Receives requests from buyers and forwards them to the best available node
  • Billing & metering — Tracks token usage, deducts balances, and calculates operator earnings
  • Node registry — Maintains a directory of all registered nodes with their capabilities
  • Marketplace — Browsable provider directory with ratings and reviews
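The programmatic access described above can be sketched as a plain HTTPS call. Only the /api/v1/chat/completions path appears in these docs; the host name is a placeholder and the OpenAI-style body shape and Bearer auth scheme are assumptions for illustration.

```typescript
// Sketch of a buyer-side request to the platform. The host is a
// placeholder; the Authorization scheme and body shape are assumptions.
function buildCompletionRequest(apiKey: string, model: string, prompt: string) {
  return {
    url: "https://infer.example.com/api/v1/chat/completions", // placeholder host
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // API key from the dashboard
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
        stream: true, // responses are streamed token-by-token
      }),
    },
  };
}
```

The returned object can be passed straight to `fetch(r.url, r.init)`.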

Node Operators

Node operators run LLM inference runtimes on their own hardware. Supported runtimes:

  • Ollama — Runs on port 11434, supports a wide range of open models
  • Exo — Runs on port 52415, optimized for distributed inference across devices

Operators expose their runtimes through a public HTTPS endpoint (typically a TLS-terminating reverse proxy in front of the runtime) and register that endpoint with the INFER platform.

Desktop App (Electron)

The macOS desktop app provides a native interface for node operators:

  • Auto-detects local LLM runtimes (Ollama, Exo)
  • One-click node registration with GPU auto-detection
  • System tray status monitoring
  • Native notifications for earnings and status changes
  • Auto-updates via GCS feed
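The auto-detection step can be sketched as a port probe against the default ports listed above. Ollama's /api/tags endpoint is real; the Exo probe path and the overall shape are assumptions about how the app might work, not its actual implementation.

```typescript
// Candidate runtimes and their default ports (from the docs). Ollama's
// /api/tags endpoint is real; the Exo probe path is an assumption.
const RUNTIMES = [
  { name: "ollama", port: 11434, probePath: "/api/tags" },
  { name: "exo", port: 52415, probePath: "/" },
];

// Probe each candidate port on localhost; a 2xx response means the
// runtime is up. fetchFn is injectable for testing.
async function detectRuntimes(fetchFn: typeof fetch = fetch): Promise<string[]> {
  const found: string[] = [];
  for (const rt of RUNTIMES) {
    try {
      const res = await fetchFn(`http://localhost:${rt.port}${rt.probePath}`, {
        signal: AbortSignal.timeout(1000), // don't hang on a dead port
      });
      if (res.ok) found.push(rt.name);
    } catch {
      // connection refused or timed out: runtime not running on this port
    }
  }
  return found;
}
```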

Database (Cloud SQL)

PostgreSQL database storing:

  • User accounts and sessions
  • API keys (encrypted at rest)
  • Node registrations and health status
  • Usage records and billing transactions
  • Audit logs

Data Flow

Inference Request Flow

  1. Buyer sends a chat completion request to POST /api/v1/chat/completions with their API key
  2. Platform validates the API key and checks the user’s balance
  3. Router selects the best available node based on model availability, latency, and load
  4. Platform forwards the request to the selected node’s endpoint
  5. Node processes the request using its local LLM runtime and streams tokens back
  6. Platform streams the response to the buyer while counting tokens
  7. Billing deducts the cost from the buyer’s balance and credits the operator
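Step 3 above can be sketched as a scoring function over the node registry. The filter on model availability follows the docs; the exact weighting of latency against load is made up for illustration.

```typescript
interface NodeInfo {
  id: string;
  models: string[];   // models the node currently serves
  latencyMs: number;  // recent health-check round-trip time
  load: number;       // 0..1 fraction of capacity in use
}

// Filter to nodes that serve the requested model, then pick the one with
// the lowest combined score. The weighting here is an assumption, not the
// router's documented formula.
function selectNode(nodes: NodeInfo[], model: string): NodeInfo | undefined {
  const score = (n: NodeInfo) => n.latencyMs + n.load * 1000;
  return nodes
    .filter((n) => n.models.includes(model))
    .sort((a, b) => score(a) - score(b))[0];
}
```

Returning `undefined` when no node serves the model lets the platform fail the request early with a clear error instead of forwarding it nowhere.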

Node Registration Flow

  1. Operator installs a supported LLM runtime and downloads models
  2. Desktop app auto-detects the runtime and available models
  3. Operator clicks “Register” — the app pre-fills GPU info, models, and endpoint
  4. Platform creates the node record and begins health monitoring
  5. Health checks run every 60 seconds to verify the node is reachable
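One health-check cycle from step 5 might look like the following sketch. Only the 60-second interval comes from the docs; the 5-second timeout and the result shape are assumptions.

```typescript
interface HealthResult {
  nodeId: string;
  healthy: boolean;
  checkedAt: Date;
}

// Hit the node's public endpoint with a timeout and record whether it
// answered. fetchFn is injectable for testing.
async function checkNode(
  nodeId: string,
  endpoint: string,
  fetchFn: typeof fetch = fetch,
): Promise<HealthResult> {
  let healthy = false;
  try {
    const res = await fetchFn(endpoint, { signal: AbortSignal.timeout(5000) });
    healthy = res.ok;
  } catch {
    healthy = false; // unreachable or timed out
  }
  return { nodeId, healthy, checkedAt: new Date() };
}

// The platform would run this on the documented schedule, e.g.
// setInterval(() => checkNode(id, endpoint), 60_000);
```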

Security

  • All API keys are encrypted with AES-256-GCM before storage
  • Sessions use NextAuth with the JWT strategy
  • Database connections use Cloud SQL Auth Proxy (IAM-based)
  • Rate limiting protects all endpoints (Upstash Redis in production)
  • CORS restricts API access to authorized origins
  • CSP headers prevent XSS and injection attacks