⚙️ Data & Analytics

vLLM Inference

Serve open-weight LLMs locally with high throughput via vLLM.

4.6 rating
4,600 installs
mlops/inference/vllm
Max required

About this skill

vLLM Inference wraps the vLLM serving engine so you can run an open-weight LLM (Llama, Mistral, Qwen, DeepSeek) on your own hardware with continuous batching, paged attention, and an OpenAI-compatible HTTP API. Use it to avoid per-token API costs on high-volume workloads, keep sensitive prompts on local infrastructure, or serve a fine-tuned model alongside your main assistant.
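Because the server speaks the OpenAI API shape, any OpenAI-style client can talk to it. A minimal sketch, assuming a server started with something like `vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000` (model name, host, and port are placeholders for your deployment):

```python
# Query a locally running vLLM server through its OpenAI-compatible endpoint.
# Only standard-library modules are used; the actual HTTP call is left
# commented out so the sketch stands alone without a running server.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default local address (assumption)

def build_chat_request(model: str, user_prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic output suits tooling workloads
    }

def send(payload: dict) -> dict:
    """POST the payload to the local chat-completions route."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Say hello.")
# reply = send(payload)  # uncomment once the server is running
```

Because the wire format matches OpenAI's, pointing an existing client at `BASE_URL` is usually all the migration a tool needs.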

What it does

  • High-throughput batched inference
  • OpenAI-compatible HTTP endpoint
  • Supports Llama, Mistral, Qwen, DeepSeek, and most other open-weight architectures
  • Paged attention for long contexts
  • Runs on CUDA, ROCm, or CPU backends

Use cases

  • Serve a fine-tuned model for internal tools
  • Offload high-volume classification to local hardware
  • Keep sensitive prompts out of third-party APIs
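For the high-volume classification use case, a sketch of one convenient pattern: the OpenAI-style completions route accepts a list of prompts, which the engine batches server-side. The model name, label set, and prompt template below are illustrative placeholders, not part of the skill itself:

```python
# Build one batched classification request for a local vLLM server.
# Sending a list under "prompt" lets the server batch all items together
# instead of issuing one HTTP round-trip per text.
LABELS = ["positive", "negative", "neutral"]

def build_classification_batch(model: str, texts: list[str]) -> dict:
    """Wrap many texts into a single completions payload."""
    prompts = [
        f"Classify the sentiment as one of {', '.join(LABELS)}.\nText: {t}\nLabel:"
        for t in texts
    ]
    return {
        "model": model,
        "prompt": prompts,   # list of prompts -> one batched request
        "max_tokens": 2,     # labels are short; keep generation tight
        "temperature": 0.0,  # deterministic labels
    }

batch = build_classification_batch(
    "mistralai/Mistral-7B-Instruct-v0.3",
    ["Great service!", "Never again."],
)
# POST batch to http://localhost:8000/v1/completions once the server is up
```

Keeping `max_tokens` small and temperature at zero makes throughput and label stability predictable, which is where local batched serving pays off against per-token API pricing.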