Nano-Vllm: lightweight vLLM implementation built from scratch

5 days ago 1

A lightweight vLLM implementation built from scratch.

  • 🚀 Fast offline inference - Comparable inference speeds to vLLM
  • 📖 Readable codebase - Clean implementation in ~ 1,200 lines of Python code
  • Optimization Suite - Prefix caching, Tensor Parallelism, Torch compilation, CUDA graph, etc.
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method.

from nanovllm import LLM, SamplingParams llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1) sampling_params = SamplingParams(temperature=0.6, max_tokens=256) prompts = ["Hello, Nano-vLLM."] outputs = llm.generate(prompts, sampling_params) outputs[0]["text"]

See bench.py for benchmark.

Test Configuration:

  • Hardware: RTX 4070 Laptop (8GB)
  • Model: Qwen3-0.6B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100–1024 tokens
  • Output Length: Randomly sampled between 100–1024 tokens

Performance Results:

Inference Engine Output Tokens Time (s) Throughput (tokens/s)
vLLM 133,966 98.37 1361.84
Nano-vLLM 133,966 93.41 1434.13

Star History Chart

Read Entire Article