Pulls weight sizes from Hugging Face, estimates KV-cache memory, and suggests a tensor-parallel size and a vllm serve command for each model. Add several models to estimate the total GPU count on your preferred GPU type (each model runs as a separate vLLM instance). Estimates are heuristic; validate on your hardware.
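For scale, the usual KV-cache estimate is 2 (K and V) x layers x KV heads x head dim x dtype bytes per token, times the tokens in flight. A minimal sketch of that heuristic, assuming bf16 (2-byte) weights and illustrative 8B-class config values rather than numbers actually fetched from Hugging Face:

```python
def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Model weights: parameter count times dtype width (bf16 assumed)."""
    return num_params * bytes_per_param

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> float:
    """KV cache: 2 (K and V) * layers * kv_heads * head_dim per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size

# Illustrative 8B-class config (hypothetical values, not from the Hub).
gib = 1024 ** 3
weights = weight_bytes(8e9)                       # ~16 GiB in bf16
kv = kv_cache_bytes(num_layers=32, num_kv_heads=8,
                    head_dim=128, context_len=8192, batch_size=4)
print(f"weights ~{weights / gib:.1f} GiB, KV ~{kv / gib:.1f} GiB")
```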
Each model runs as its own vllm serve process. Planning assumes that a tensor-parallel group does not share GPUs with any other model unless you colocate them manually.
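One plausible way to derive the suggested group size, assuming the tool picks the smallest power-of-two tensor-parallel size whose per-GPU share of weights plus KV cache fits under a memory headroom; suggest_tp and the 8-GPU cap here are hypothetical, though the 0.9 factor mirrors vLLM's default --gpu-memory-utilization:

```python
def suggest_tp(total_model_bytes: float, gpu_mem_bytes: float,
               num_attn_heads: int, max_tp: int = 8,
               utilization: float = 0.9) -> int | None:
    """Smallest power-of-two tensor-parallel size whose per-GPU share
    of the model fits in usable GPU memory. The size must also divide
    the attention head count, which vLLM requires."""
    tp = 1
    while tp <= max_tp:
        fits = total_model_bytes / tp <= gpu_mem_bytes * utilization
        if fits and num_attn_heads % tp == 0:
            return tp
        tp *= 2
    return None  # does not fit even at max_tp; pick a larger GPU

# Example: a ~20 GiB total footprint on an 80 GiB GPU needs no sharding.
gib = 1024 ** 3
print(suggest_tp(20 * gib, 80 * gib, num_attn_heads=32))  # -> 1
```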
Click a GPU to see its full specs. Your preferred GPU, used for the multi-model totals above, is highlighted.
Use a different --port per model when running on the same host. Adjust --tensor-parallel-size if your cluster differs from the suggested layout. See the vLLM docs.
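As an illustration of the separate-instance layout, a small launcher sketch that gives each model its own port; the model IDs and base port are placeholders, while --port and --tensor-parallel-size are the real vLLM flags referenced above:

```python
import subprocess

# Hypothetical plan: (model ID, tensor-parallel size) pairs from the tool.
plan = [
    ("org/model-a", 1),  # placeholder model IDs
    ("org/model-b", 2),
]

procs = []
for i, (model, tp) in enumerate(plan):
    cmd = [
        "vllm", "serve", model,
        "--port", str(8000 + i),  # distinct port per model on one host
        "--tensor-parallel-size", str(tp),
    ]
    print("launching:", " ".join(cmd))
    procs.append(subprocess.Popen(cmd))

# Keep the launcher alive until every server process exits.
for p in procs:
    p.wait()
```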