Skip to main content
Decision intelligence needs more than one model. Agent Inference is the model layer beneath HQ: every agent reaches frontier models through one interface, routed to whichever fits the task, alongside open models HQ serves on its own GPUs. This page is part of qOS and complements How HQ works.

Multi-model routing

Every agent reaches frontier models from Anthropic, OpenAI, xAI, and Cerebras behind one interface. Each call goes to the model that fits the task — quality, latency, or cost — so the platform follows the frontier as it moves rather than being wired to whichever vendor was integrated first. Picking a model is part of defining an agent, not a buried platform setting. Give a research agent a frontier reasoning model and a triage agent a fast, small one, and run them side by side in the same workspace.

A model per agent

Give each agent the model that fits its job, and run a fleet where every member is on a different model at once.

Automatic failover

If a provider errors or goes down, the call retries against a healthy one inside the same request.
→ See what’s available via List models, and select one per agent.

Own-GPU inference for sensitive work

There is a quieter problem underneath model choice: most agent platforms ship your data to a third-party API just to embed or rerank it — the highest-volume calls they make. HQ answers this by serving the cheap, constant, and data-sensitive paths on hardware it runs itself.

Open models we serve ourselves

Open-weight models run on our own GPUs, so the high-volume and data-sensitive paths never have to leave our infrastructure.

Embeddings, in-house

The vectors behind memory and search come from an embedding model we host, so indexed text is never sent to an outside API to be turned into numbers.

Reranking, in-house

A self-hosted reranking model does the final relevance pass on memory and corpus lookups, with no third-party round-trip.

A batch path at fleet scale

Large non-interactive jobs run through a batch lane for work that can wait rather than answer in the moment.
The split is deliberate: frontier models for the hard, rare reasoning; owned open models for the cheap, constant, and sensitive work — not a degraded fallback.

Independent of the runtime

The model and the runtime are two separate dials. Each agent runs on a vendor’s official agent loop — the Claude Agent SDK, the OpenAI Agents SDK, or the open OpenCode runtime — and any model can run under any of those runtimes, so each choice is made on its own merits.
The model layer sits inside qOS next to search, cache, and storage, sharing the same hardware, isolation, and audit trail as everything else an agent touches. Adding or swapping a model is a config change, with nothing to alter in the agents on top.