Models and Serving

Unify base, commercial, and fine-tuned models in a single registry. Create client endpoints with token/QPS budgets and API keys. Serve each model with the backend best suited to your hardware, and track metrics and costs.

"Verified compatibility" indicates internal testing until 09/2025. Some routes are in beta and require specific driver/hardware versions.

What it solves

  • One registry for everything: base, commercial, and fine-tuned models (Technology/Domain Packs).
  • Client endpoints with limits and security (token/QPS budgets, API keys, IP allowlist); a request sketch follows this list.
  • Multi-backend serving: we choose the best route (Transformers, vLLM, ONNX Runtime, OpenVINO, etc.) based on your hardware and policy (performance, memory, or cost).
  • Observability and auditing: P95/P99 latency, throughput, estimated cost, and request/response traces.
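
For illustration, here is a minimal sketch of a client call against a provisioned endpoint, assuming an OpenAI-compatible chat/completions route; the URL, model name, and API key below are placeholders, not the platform's actual values.

```python
# Hypothetical call to a provisioned client endpoint.
# URL, model name, and key format are illustrative placeholders.
import requests

ENDPOINT_URL = "https://serving.example.com/v1/chat/completions"  # placeholder
API_KEY = "ep_live_xxxxxxxx"  # per-endpoint API key (placeholder)

resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "support-pack-v3",  # a registered fine-tuned Pack (placeholder)
        "messages": [{"role": "user", "content": "Summarize our returns policy."}],
        "max_tokens": 256,
    },
    timeout=30,
)
resp.raise_for_status()  # a 429 here would indicate an exhausted token/QPS budget
print(resp.json()["choices"][0]["message"]["content"])
```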

Backends and formats (verified compatibility)

  • Transformers (Hugging Face): general purpose (CPU/NVIDIA) · GA
  • vLLM: high throughput for chat/completions (NVIDIA), sketch below · GA
  • ONNX Runtime: portable CPU/GPU · GA
  • OpenVINO: Intel CPU (GA), iGPU/NPU (Beta)
  • TensorRT‑LLM: NVIDIA optimization · Beta
  • Triton Inference Server: production serving and batching · Beta
  • llama.cpp / GGUF: lightweight quantized models · Beta
  • ROCm: AMD GPUs · Beta
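
To make the list concrete, here is a minimal offline-generation sketch using vLLM's Python API, the high-throughput route above; the model id is a placeholder and an NVIDIA GPU is assumed.

```python
# Minimal vLLM generation example (NVIDIA GPU assumed).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM achieves high throughput by continuously batching concurrent requests.
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```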

Serving Architecture

[Architecture diagram: unified model management and serving system with multiple backends]

Client endpoints (SaaS/Hosting and multi-tenant)

  • One endpoint per account, each with its own model/prompt/versioning.
  • Budgets and limits: tokens/month, QPS, rate limits.
  • Security: API keys per endpoint, key rotation, IP allowlist.
  • White-label per client (OEM): domain and branding per account.
  • Usage/billing: CSV/JSON export per endpoint/client (a configuration sketch follows this list).
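
A sketch of what such an endpoint definition could look like; the `EndpointConfig` type and all field names are illustrative assumptions, not the platform's actual schema.

```python
# Hypothetical endpoint definition; all field names are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EndpointConfig:
    account: str                        # tenant/client owning this endpoint
    model: str                          # registered model or Pack
    prompt_version: str = "v1"          # pinned prompt/model version
    tokens_per_month: int = 5_000_000   # monthly token budget
    qps_limit: float = 10.0             # sustained queries per second
    ip_allowlist: List[str] = field(default_factory=list)
    white_label_domain: Optional[str] = None  # OEM branding domain

cfg = EndpointConfig(
    account="acme-corp",
    model="support-pack-v3",
    tokens_per_month=2_000_000,
    qps_limit=5.0,
    ip_allowlist=["203.0.113.0/24"],
    white_label_domain="ai.acme.example",
)
```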

Optimization and quantization

  • Project policies: balanced · performance · memory_efficient · max_performance.
  • Automatic backend selection by hardware/precision (with fallback).
  • Supported precisions where applicable: FP32/BF16/FP16/INT8/INT4.
  • Quantization and efficient fine-tuning methods: PEFT/QLoRA (GA), GPTQ/AWQ (beta); see the INT4 loading sketch after this list.
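
As one concrete instance of the INT4 route, a minimal sketch of QLoRA-style 4-bit (NF4) loading with Hugging Face Transformers and bitsandbytes; the model id is a placeholder and an NVIDIA GPU is assumed.

```python
# QLoRA-style 4-bit NF4 loading via Transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normalized-float 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
)

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
```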

Metrics and auditing

  • Performance: P50/P95/P99 latency, throughput, tokens per second (see the percentile sketch after this list).
  • Estimated cost per 1,000 tokens/query (where provider pricing applies).
  • Health: errors, timeouts, saturation, memory/GPU usage.
  • Auditing: signed traces of inputs/outputs and model/prompt versions.
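
For reference, the percentile and cost figures above can be reproduced from raw samples; a self-contained sketch with made-up numbers:

```python
# Nearest-rank P50/P95/P99 latency and cost-per-1,000-tokens arithmetic.
latencies_ms = sorted([112, 98, 134, 1150, 105, 121, 99, 310, 102, 97])  # made-up

def percentile(sorted_vals, p):
    """Nearest-rank percentile over a pre-sorted list."""
    rank = max(1, min(len(sorted_vals), round(p / 100 * len(sorted_vals))))
    return sorted_vals[rank - 1]

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")

# Estimated cost per 1,000 tokens: window spend divided by tokens, scaled.
total_cost_usd = 0.42   # made-up provider spend for the window
total_tokens = 180_000  # made-up token count
print(f"cost per 1k tokens: ${total_cost_usd / total_tokens * 1000:.4f}")
```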

How to use (typical flow)

1. Register a base/commercial model or your fine-tuned model (Pack).
2. Create an endpoint (optionally per client) and select a policy and preferred backend.
3. Define budgets (tokens/QPS), generate API keys, and configure the IP allowlist.
4. Activate metrics/alerts and review auditing.
5. (SaaS/Hosting) Export usage for billing or integrate with your billing system (see the end-to-end sketch below).
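
The same flow, sketched against a hypothetical admin SDK; every name below (`PlatformClient`, the method signatures, the field names) is an assumption for illustration only, not the actual API.

```python
# Hypothetical end-to-end flow; PlatformClient and all method/field names
# below are illustrative assumptions, not a real SDK.
from my_platform_sdk import PlatformClient  # hypothetical package

client = PlatformClient(api_key="admin_key_xxx")  # placeholder admin key

# 1. Register a fine-tuned model (Pack).
model = client.models.register(name="support-pack-v3", source="packs/support-v3")

# 2. Create a per-client endpoint with a policy and preferred backend.
ep = client.endpoints.create(account="acme-corp", model=model.id,
                             policy="balanced", backend="vllm")

# 3. Budgets, API key, IP allowlist.
client.endpoints.set_limits(ep.id, tokens_per_month=2_000_000, qps=5)
api_key = client.endpoints.create_key(ep.id)
client.endpoints.set_allowlist(ep.id, ["203.0.113.0/24"])

# 4. Alert on P95 latency regressions.
client.alerts.create(endpoint=ep.id, metric="latency_p95_ms", threshold=800)

# 5. Export usage for billing.
client.usage.export(endpoint=ep.id, fmt="csv", path="acme_usage.csv")
```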

Availability

Available today (v0.10 – 10/10/2025)

  • Unified registry and client endpoints with budgets/API keys/allowlist.
  • Backends: Transformers, vLLM, ONNX Runtime and OpenVINO (CPU GA).
  • Basic metrics (latency/cost) and initial auditing.
  • Usage export (CSV/JSON) per endpoint/client.

In development (v0.11 – Q4 2025)

  • Triton and TensorRT‑LLM (hardened beta), more quota/plan controls, expanded tenant panels.

Planned (2026)

  • Distributed multi-node mode, advanced model merges, and a tenant catalog.

Want to consolidate your serving and control costs?