Nano-vLLM: lightweight vLLM implementation built from scratch


A lightweight vLLM implementation built from scratch.

🚀 Fast offline inference - Comparable inference speeds to vLLM
📖 Readable codebase - Clean implementation in ~1,200 lines of Python code
⚡ Optimization Suite - Prefix caching, Tensor Parallelism, Torch compilation, CUDA graph, etc.
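
To make the prefix caching item concrete, here is a minimal sketch of the general idea, assuming hash-chained, fixed-size token blocks. It is not taken from the Nano-vLLM source: `BLOCK_SIZE`, `block_hash`, and `PrefixCache` are hypothetical names used only for illustration, and a real engine would also track reference counts, eviction, and the actual KV tensors.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # hypothetical number of tokens per KV-cache block


def block_hash(prev_hash: Optional[str], token_ids: List[int]) -> str:
    """Hash one full block, chained with the hash of the block before it.

    Chaining means two blocks only share a hash when their entire
    preceding prefix is identical, not just their own tokens.
    """
    h = hashlib.sha256()
    if prev_hash is not None:
        h.update(prev_hash.encode())
    h.update(",".join(map(str, token_ids)).encode())
    return h.hexdigest()


class PrefixCache:
    """Toy block table: maps chained block hashes to block ids (illustrative only)."""

    def __init__(self) -> None:
        self.hash_to_block: Dict[str, int] = {}
        self.next_block_id = 0

    def allocate(self, token_ids: List[int]) -> Tuple[List[int], int]:
        """Return (block ids for the sequence, number of tokens served from cache)."""
        block_ids: List[int] = []
        cached_tokens = 0
        prev_hash: Optional[str] = None
        for start in range(0, len(token_ids), BLOCK_SIZE):
            chunk = token_ids[start:start + BLOCK_SIZE]
            # Only full blocks are hashed; a trailing partial block is never shared.
            h = block_hash(prev_hash, chunk) if len(chunk) == BLOCK_SIZE else None
            block_id = self.hash_to_block.get(h) if h is not None else None
            if block_id is None:
                # Cache miss: hand out a fresh block and register it if it is full.
                block_id = self.next_block_id
                self.next_block_id += 1
                if h is not None:
                    self.hash_to_block[h] = block_id
            else:
                # Cache hit: this block's KV entries could be reused as-is.
                cached_tokens += BLOCK_SIZE
            block_ids.append(block_id)
            prev_hash = h
        return block_ids, cached_tokens


cache = PrefixCache()
first = cache.allocate([1, 2, 3, 4, 5, 6, 7, 8, 9])
second = cache.allocate([1, 2, 3, 4, 5, 6, 7, 8, 10, 11])
print(first)   # ([0, 1, 2], 0)
print(second)  # ([0, 1, 3], 8)
```

The second request shares its first two full blocks with the first, so only its trailing tokens would need fresh computation. Because the hashes are chained, a sequence that differs in its first block shares nothing, even if later blocks contain the same tokens.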

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method.

from nanovllm import LLM, SamplingParams
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
outputs[0]["text"]

See bench.py for the benchmark.

Test Configuration: