⚙️ Data & Analytics

vLLM Inference

Serve open-weight LLMs locally with high throughput via vLLM.

4.6 rating
4,600 installs
mlops/inference/vllm
Max required

About this skill

vLLM Inference wraps the vLLM serving engine so you can run an open-weight LLM (Llama, Mistral, Qwen, DeepSeek) on your own hardware with continuous batching, paged attention, and an OpenAI-compatible HTTP API. Use it to avoid per-token API costs on high-volume workloads, keep sensitive prompts on local infrastructure, or serve a fine-tuned model alongside your main assistant.
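Because the server speaks the OpenAI API shape, any OpenAI-style client can talk to it. A minimal sketch, assuming a server started with something like `vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000` (model name, host, and port are placeholders for your deployment):

```python
# Query a locally running vLLM server through its OpenAI-compatible endpoint.
# Only standard-library modules are used; the actual HTTP call is left
# commented out so the sketch stands alone without a running server.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default local address (assumption)

def build_chat_request(model: str, user_prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic output suits tooling workloads
    }

def send(payload: dict) -> dict:
    """POST the payload to the local chat-completions route."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Say hello.")
# reply = send(payload)  # uncomment once the server is running
```

Because the wire format matches OpenAI's, pointing an existing client at `BASE_URL` is usually all the migration a tool needs.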

What it does

  • High-throughput batched inference
  • OpenAI-compatible HTTP endpoint
  • Supports Llama, Mistral, Qwen, DeepSeek, and most other open-weight architectures
  • Paged attention for long contexts
  • Runs on CUDA, ROCm, or CPU backends

Use cases

  • Serve a fine-tuned model for internal tools
  • Offload high-volume classification to local hardware
  • Keep sensitive prompts out of third-party APIs
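For the high-volume classification use case, a sketch of one convenient pattern: the OpenAI-style completions route accepts a list of prompts, which the engine batches server-side. The model name, label set, and prompt template below are illustrative placeholders, not part of the skill itself:

```python
# Build one batched classification request for a local vLLM server.
# Sending a list under "prompt" lets the server batch all items together
# instead of issuing one HTTP round-trip per text.
LABELS = ["positive", "negative", "neutral"]

def build_classification_batch(model: str, texts: list[str]) -> dict:
    """Wrap many texts into a single completions payload."""
    prompts = [
        f"Classify the sentiment as one of {', '.join(LABELS)}.\nText: {t}\nLabel:"
        for t in texts
    ]
    return {
        "model": model,
        "prompt": prompts,   # list of prompts -> one batched request
        "max_tokens": 2,     # labels are short; keep generation tight
        "temperature": 0.0,  # deterministic labels
    }

batch = build_classification_batch(
    "mistralai/Mistral-7B-Instruct-v0.3",
    ["Great service!", "Never again."],
)
# POST batch to http://localhost:8000/v1/completions once the server is up
```

Keeping `max_tokens` small and temperature at zero makes throughput and label stability predictable, which is where local batched serving pays off against per-token API pricing.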