Models and Serving
Unify base, commercial, and your fine-tuned models in one registry. Create client endpoints with token/QPS budgets and API keys. Serve each model through the optimal backend for your hardware, and monitor metrics and costs.
"Verified compatibility" means internally tested as of 09/2025. Some routes are in beta and require specific driver/hardware versions.
What it solves
- One registry for everything: base, commercial and fine-tuned models (Technology/Domain Packs).
- Client endpoints with limits and security (token/QPS, API keys, IP allowlist).
- Multi-backend serving: the platform chooses the best route (Transformers, vLLM, ONNX Runtime, OpenVINO, etc.) based on your hardware and policy (performance, memory, or cost).
- Observability and auditing: P95/P99 latency, throughput, estimated cost and request/response traces.
Backends and formats (verified compatibility)
- Transformers (Hugging Face): general purpose (CPU/NVIDIA) · GA
- vLLM: high throughput for chat/completions (NVIDIA) · GA
- ONNX Runtime: portable CPU/GPU · GA
- OpenVINO: Intel CPU (GA), iGPU/NPU (beta)
- TensorRT‑LLM: NVIDIA optimization · beta
- Triton Inference Server: production serving and batching · beta
- llama.cpp / GGUF: lightweight quantized models · beta
- ROCm: AMD GPUs · beta
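As a rough illustration of how a serving route might be chosen from this compatibility matrix, here is a minimal sketch. The backend names mirror the list above, but the selection order, the `allow_beta` flag, and the hardware labels are assumptions for illustration, not the product's actual logic.

```python
# Hypothetical sketch: pick a serving backend from the compatibility matrix,
# given detected hardware and a project policy. Selection order is assumed.

GA = {"transformers", "vllm", "onnxruntime", "openvino-cpu"}

def select_backend(hardware: str, policy: str, allow_beta: bool = False) -> list:
    """Return an ordered list of candidate backends (first = preferred)."""
    routes = {
        "nvidia": ["vllm", "tensorrt-llm", "transformers"],
        "intel-cpu": ["openvino-cpu", "onnxruntime", "transformers"],
        "amd": ["rocm", "onnxruntime"],
        "cpu": ["onnxruntime", "llama.cpp", "transformers"],
    }
    candidates = routes.get(hardware, ["transformers"])
    if not allow_beta:
        candidates = [b for b in candidates if b in GA]
    if policy == "memory_efficient":
        # Prefer the quantized llama.cpp/GGUF route when memory is the constraint.
        candidates = sorted(candidates, key=lambda b: b != "llama.cpp")
    # Transformers is the general fallback on any hardware.
    if "transformers" not in candidates:
        candidates.append("transformers")
    return candidates

print(select_backend("nvidia", "performance"))   # GA routes only
print(select_backend("amd", "balanced"))         # ROCm is beta, so it falls back
```

Note how a beta route (ROCm, TensorRT‑LLM) is skipped unless explicitly allowed, which matches the GA/beta split in the list above.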
Serving Architecture
[Diagram: unified model management and serving system with multiple backends]
Client endpoints (SaaS/Hosting and multi-tenant)
- One endpoint per account with its own model/prompt/versioning.
- Budgets and limits: tokens/month, QPS, rate limiting.
- Security: per-endpoint API keys, key rotation, IP allowlist.
- White-label per client (OEM): custom domain and branding per account.
- Usage/billing: CSV/JSON export per endpoint/client.
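To make the budget and limit semantics concrete, here is a minimal sketch of how a per-endpoint monthly token budget could combine with a token-bucket QPS limiter. The class and field names are assumptions for illustration; this is not the product's enforcement code.

```python
import time

class EndpointBudget:
    """Hypothetical sketch of per-endpoint limits: a monthly token budget
    plus a token-bucket rate limiter for QPS. Names are assumptions."""

    def __init__(self, tokens_per_month: int, qps: float):
        self.tokens_left = tokens_per_month
        self.qps = qps
        self.bucket = qps              # allows a small initial burst
        self.last = time.monotonic()

    def allow(self, tokens_requested: int) -> bool:
        now = time.monotonic()
        # Refill the bucket proportionally to elapsed time, capped at qps.
        self.bucket = min(self.qps, self.bucket + (now - self.last) * self.qps)
        self.last = now
        if self.bucket < 1 or self.tokens_left < tokens_requested:
            return False               # over QPS or over monthly budget
        self.bucket -= 1
        self.tokens_left -= tokens_requested
        return True

budget = EndpointBudget(tokens_per_month=1_000_000, qps=2.0)
print(budget.allow(500))   # within both budget and rate
```

A request is rejected if either limit is exceeded, which is why the docs list tokens/month and QPS as separate controls.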
Optimization and quantization
- Project policies: balanced · performance · memory_efficient · max_performance.
- Automatic backend selection by hardware/precision (with fallback).
- Supported precisions where applicable: FP32/BF16/FP16/INT8/INT4.
- Quantization and efficient fine-tuning methods: PEFT/QLoRA (GA), GPTQ/AWQ (beta).
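As a back-of-the-envelope illustration of why the memory_efficient policy favors lower precisions, the weight-only footprint scales with bytes per parameter. The bytes-per-parameter figures are standard for these formats; the helper itself is illustrative, not product code.

```python
# Rough illustration: estimated weight memory per supported precision.
# Ignores KV cache, activations, and runtime overhead.

BYTES_PER_PARAM = {"FP32": 4.0, "BF16": 2.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight-only memory in GB for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for p in ("FP16", "INT8", "INT4"):
    print(f"7B @ {p}: {weight_memory_gb(7e9, p):.1f} GB")
```

A 7B model drops from 14 GB at FP16 to 3.5 GB at INT4, which is what makes quantized routes (GPTQ/AWQ, GGUF) attractive on constrained hardware.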
Metrics and auditing
- Performance: P50/P95/P99 latency, throughput, tokens per second.
- Estimated cost per 1,000 tokens/query (where provider pricing applies).
- Health: errors, timeouts, saturation, memory/GPU usage.
- Auditing: signed traces of inputs/outputs and model/prompt versions.
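For reference, here is how the reported percentiles and per-1,000-token cost can be derived from raw request traces. This is illustrative only; the trace values and the nearest-rank percentile method are assumptions, not the platform's internal computation.

```python
# Illustrative only: derive P50/P95 latency and cost per 1k tokens
# from raw request traces. Sample data is made up.

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

latencies_ms = [120, 95, 110, 400, 130, 105, 98, 250, 115, 102]
print("P50:", percentile(latencies_ms, 50), "ms")
print("P95:", percentile(latencies_ms, 95), "ms")

def cost_per_1k_tokens(total_cost: float, total_tokens: int) -> float:
    """Normalize total spend to the cost-per-1,000-tokens metric."""
    return total_cost / total_tokens * 1000

print(f"${cost_per_1k_tokens(0.42, 210_000):.4f} per 1k tokens")
```

The tail percentiles (P95/P99) matter because a single slow request (400 ms here) dominates them while barely moving the median.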
How to use (typical flow)
1. Register a base/commercial model or your fine-tuned model (Pack).
2. Create an endpoint (optionally per client) and select a policy and preferred backend.
3. Define budgets (tokens/QPS), generate API keys, and configure the IP allowlist.
4. Activate metrics/alerts and review auditing.
5. (SaaS/Hosting) Export usage for billing or integrate with your billing system.
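From the client's side, the result of this flow is an authenticated endpoint call. The sketch below only assembles such a request; the URL shape, header names, and payload schema are assumptions, not the product's documented API, and nothing is sent over the network here.

```python
# Hypothetical sketch of what a client call to a provisioned endpoint
# looks like. URL, headers, and payload fields are assumed, not documented.

def build_chat_request(endpoint_url: str, api_key: str, prompt: str) -> tuple:
    """Assemble (url, headers, payload) for an authenticated endpoint call."""
    headers = {
        "Authorization": f"Bearer {api_key}",   # per-endpoint API key
        "Content-Type": "application/json",
    }
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,   # keep responses within the endpoint's token budget
    }
    return endpoint_url, headers, payload

url, headers, payload = build_chat_request(
    "https://api.example.com/v1/endpoints/acme/chat",  # placeholder URL
    "sk-PLACEHOLDER",                                  # placeholder key
    "Hello",
)
```

The request would also have to originate from an allowlisted IP if one is configured in step 3.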
Availability
Available today (v0.10 – 10/10/2025)
- Unified registry and client endpoints with budgets/API keys/allowlist.
- Backends: Transformers, vLLM, ONNX Runtime and OpenVINO (CPU GA).
- Basic metrics (latency/cost) and initial auditing.
- Usage export (CSV/JSON) per endpoint/client.
In development (v0.11 – Q4 2025)
- Triton and TensorRT‑LLM (hardened beta), more quota/plan controls, expanded tenant panels.
Planned (2026)
- Distributed multi-node mode, advanced merges and tenant catalog.