Models and Serving

Unify base, commercial, and fine-tuned models in a single registry. Create client endpoints with token/QPS budgets and API keys. Serve each model with the backend best suited to your hardware, and track metrics and costs.

"Verified compatibility" indicates internal testing until 09/2025. Some routes are in beta and require specific driver/hardware versions.

What it solves

  • One registry for everything: base, commercial, and fine-tuned models (Technology/Domain Packs).
  • Client endpoints with limits and security (token/QPS budgets, API keys, IP allowlist); a request sketch follows this list.
  • Multi-backend serving: we choose the best route (Transformers, vLLM, ONNX Runtime, OpenVINO, etc.) based on your hardware and policy (performance, memory, or cost).
  • Observability and auditing: P95/P99 latency, throughput, estimated cost, and request/response traces.
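
For illustration, here is a minimal sketch of a client call against a provisioned endpoint, assuming an OpenAI-compatible chat/completions route; the URL, model name, and API key below are placeholders, not the platform's actual values.

```python
# Hypothetical call to a provisioned client endpoint.
# URL, model name, and key format are illustrative placeholders.
import requests

ENDPOINT_URL = "https://serving.example.com/v1/chat/completions"  # placeholder
API_KEY = "ep_live_xxxxxxxx"  # per-endpoint API key (placeholder)

resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "support-pack-v3",  # a registered fine-tuned Pack (placeholder)
        "messages": [{"role": "user", "content": "Summarize our returns policy."}],
        "max_tokens": 256,
    },
    timeout=30,
)
resp.raise_for_status()  # a 429 here would indicate an exhausted token/QPS budget
print(resp.json()["choices"][0]["message"]["content"])
```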

Backends and formats (verified compatibility)

  • Transformers (Hugging Face): general purpose (CPU/NVIDIA) · GA
  • vLLM: high throughput for chat/completions (NVIDIA), sketch below · GA
  • ONNX Runtime: portable CPU/GPU · GA
  • OpenVINO: Intel CPU (GA), iGPU/NPU (Beta)
  • TensorRT‑LLM: NVIDIA optimization · Beta
  • Triton Inference Server: production serving and batching · Beta
  • llama.cpp / GGUF: lightweight quantized models · Beta
  • ROCm: AMD GPUs · Beta
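
To make the list concrete, here is a minimal offline-generation sketch using vLLM's Python API, the high-throughput route above; the model id is a placeholder and an NVIDIA GPU is assumed.

```python
# Minimal vLLM generation example (NVIDIA GPU assumed).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM achieves high throughput by continuously batching concurrent requests.
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```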

Serving Architecture

[Architecture diagram: unified model management and serving system with multiple backends]

Client endpoints (SaaS/Hosting and multi-tenant)

  • One endpoint per account, each with its own model/prompt/versioning.
  • Budgets and limits: tokens/month, QPS, rate limits.
  • Security: API keys per endpoint, key rotation, IP allowlist.
  • White-label per client (OEM): domain and branding per account.
  • Usage/billing: CSV/JSON export per endpoint/client (a configuration sketch follows this list).
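
A sketch of what such an endpoint definition could look like; the `EndpointConfig` type and all field names are illustrative assumptions, not the platform's actual schema.

```python
# Hypothetical endpoint definition; all field names are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EndpointConfig:
    account: str                        # tenant/client owning this endpoint
    model: str                          # registered model or Pack
    prompt_version: str = "v1"          # pinned prompt/model version
    tokens_per_month: int = 5_000_000   # monthly token budget
    qps_limit: float = 10.0             # sustained queries per second
    ip_allowlist: List[str] = field(default_factory=list)
    white_label_domain: Optional[str] = None  # OEM branding domain

cfg = EndpointConfig(
    account="acme-corp",
    model="support-pack-v3",
    tokens_per_month=2_000_000,
    qps_limit=5.0,
    ip_allowlist=["203.0.113.0/24"],
    white_label_domain="ai.acme.example",
)
```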

Optimization and quantization

  • Project policies: balanced · performance · memory_efficient · max_performance.
  • Automatic backend selection by hardware/precision (with fallback).
  • Supported precisions where applicable: FP32/BF16/FP16/INT8/INT4.
  • Quantization and efficient fine-tuning methods: PEFT/QLoRA (GA), GPTQ/AWQ (beta); see the INT4 loading sketch after this list.
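
As one concrete instance of the INT4 route, a minimal sketch of QLoRA-style 4-bit (NF4) loading with Hugging Face Transformers and bitsandbytes; the model id is a placeholder and an NVIDIA GPU is assumed.

```python
# QLoRA-style 4-bit NF4 loading via Transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normalized-float 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
)

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
```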

Metrics and auditing

  • Performance: P50/P95/P99 latency, throughput, tokens per second (see the percentile sketch after this list).
  • Estimated cost per 1,000 tokens/query (where provider pricing applies).
  • Health: errors, timeouts, saturation, memory/GPU usage.
  • Auditing: signed traces of inputs/outputs and model/prompt versions.
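
For reference, the percentile and cost figures above can be reproduced from raw samples; a self-contained sketch with made-up numbers:

```python
# Nearest-rank P50/P95/P99 latency and cost-per-1,000-tokens arithmetic.
latencies_ms = sorted([112, 98, 134, 1150, 105, 121, 99, 310, 102, 97])  # made-up

def percentile(sorted_vals, p):
    """Nearest-rank percentile over a pre-sorted list."""
    rank = max(1, min(len(sorted_vals), round(p / 100 * len(sorted_vals))))
    return sorted_vals[rank - 1]

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")

# Estimated cost per 1,000 tokens: window spend divided by tokens, scaled.
total_cost_usd = 0.42   # made-up provider spend for the window
total_tokens = 180_000  # made-up token count
print(f"cost per 1k tokens: ${total_cost_usd / total_tokens * 1000:.4f}")
```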

How to use (typical flow)

1. Register a base/commercial model or your fine-tuned model (Pack).
2. Create an endpoint (optionally per client) and select a policy and preferred backend.
3. Define budgets (tokens/QPS), generate API keys, and configure the IP allowlist.
4. Activate metrics/alerts and review auditing.
5. (SaaS/Hosting) Export usage for billing or integrate with your billing system (see the end-to-end sketch below).
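
The same flow, sketched against a hypothetical admin SDK; every name below (`PlatformClient`, the method signatures, the field names) is an assumption for illustration only, not the actual API.

```python
# Hypothetical end-to-end flow; PlatformClient and all method/field names
# below are illustrative assumptions, not a real SDK.
from my_platform_sdk import PlatformClient  # hypothetical package

client = PlatformClient(api_key="admin_key_xxx")  # placeholder admin key

# 1. Register a fine-tuned model (Pack).
model = client.models.register(name="support-pack-v3", source="packs/support-v3")

# 2. Create a per-client endpoint with a policy and preferred backend.
ep = client.endpoints.create(account="acme-corp", model=model.id,
                             policy="balanced", backend="vllm")

# 3. Budgets, API key, IP allowlist.
client.endpoints.set_limits(ep.id, tokens_per_month=2_000_000, qps=5)
api_key = client.endpoints.create_key(ep.id)
client.endpoints.set_allowlist(ep.id, ["203.0.113.0/24"])

# 4. Alert on P95 latency regressions.
client.alerts.create(endpoint=ep.id, metric="latency_p95_ms", threshold=800)

# 5. Export usage for billing.
client.usage.export(endpoint=ep.id, fmt="csv", path="acme_usage.csv")
```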

Availability

Available today (v0.10 – 10/10/2025)

  • Unified registry and client endpoints with budgets/API keys/allowlist.
  • Backends: Transformers, vLLM, ONNX Runtime and OpenVINO (CPU GA).
  • Basic metrics (latency/cost) and initial auditing.
  • Usage export (CSV/JSON) per endpoint/client.

In development (v0.11 – Q4 2025)

  • Triton and TensorRT‑LLM (hardened beta), more quota/plan controls, expanded tenant panels.

Planned (2026)

  • Distributed multi-node mode, advanced model merges, and a tenant catalog.

Want to consolidate your serving and control costs?