Pulls weight sizes from Hugging Face, estimates KV-cache memory, and suggests a tensor-parallel size and a vllm serve command for each model. Add several models to estimate the total GPU count on your preferred GPU type (each model runs as a separate vLLM instance). Estimates are heuristic; validate on your hardware.
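For scale, the usual KV-cache estimate is 2 (K and V) x layers x KV heads x head dim x dtype bytes per token, times the tokens in flight. A minimal sketch of that heuristic, assuming bf16 (2-byte) weights and illustrative 8B-class config values rather than numbers actually fetched from Hugging Face:

```python
def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Model weights: parameter count times dtype width (bf16 assumed)."""
    return num_params * bytes_per_param

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> float:
    """KV cache: 2 (K and V) * layers * kv_heads * head_dim per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size

# Illustrative 8B-class config (hypothetical values, not from the Hub).
gib = 1024 ** 3
weights = weight_bytes(8e9)                       # ~16 GiB in bf16
kv = kv_cache_bytes(num_layers=32, num_kv_heads=8,
                    head_dim=128, context_len=8192, batch_size=4)
print(f"weights ~{weights / gib:.1f} GiB, KV ~{kv / gib:.1f} GiB")
```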
Each model runs as its own vllm serve process. Planning assumes that a tensor-parallel group does not share GPUs with any other model unless you colocate them manually.
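One plausible way to derive the suggested group size, assuming the tool picks the smallest power-of-two tensor-parallel size whose per-GPU share of weights plus KV cache fits under a memory headroom; suggest_tp and the 8-GPU cap here are hypothetical, though the 0.9 factor mirrors vLLM's default --gpu-memory-utilization:

```python
def suggest_tp(total_model_bytes: float, gpu_mem_bytes: float,
               num_attn_heads: int, max_tp: int = 8,
               utilization: float = 0.9) -> int | None:
    """Smallest power-of-two tensor-parallel size whose per-GPU share
    of the model fits in usable GPU memory. The size must also divide
    the attention head count, which vLLM requires."""
    tp = 1
    while tp <= max_tp:
        fits = total_model_bytes / tp <= gpu_mem_bytes * utilization
        if fits and num_attn_heads % tp == 0:
            return tp
        tp *= 2
    return None  # does not fit even at max_tp; pick a larger GPU

# Example: a ~20 GiB total footprint on an 80 GiB GPU needs no sharding.
gib = 1024 ** 3
print(suggest_tp(20 * gib, 80 * gib, num_attn_heads=32))  # -> 1
```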
Click a GPU to see its full specs. Your preferred GPU, used for the multi-model totals above, is highlighted.
Use a different --port per model when running on the same host. Adjust --tensor-parallel-size if your cluster differs from the suggested layout. See the vLLM docs.
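As an illustration of the separate-instance layout, a small launcher sketch that gives each model its own port; the model IDs and base port are placeholders, while --port and --tensor-parallel-size are the real vLLM flags referenced above:

```python
import subprocess

# Hypothetical plan: (model ID, tensor-parallel size) pairs from the tool.
plan = [
    ("org/model-a", 1),  # placeholder model IDs
    ("org/model-b", 2),
]

procs = []
for i, (model, tp) in enumerate(plan):
    cmd = [
        "vllm", "serve", model,
        "--port", str(8000 + i),  # distinct port per model on one host
        "--tensor-parallel-size", str(tp),
    ]
    print("launching:", " ".join(cmd))
    procs.append(subprocess.Popen(cmd))

# Keep the launcher alive until every server process exits.
for p in procs:
    p.wait()
```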