Optimization Levels

Overview

vLLM provides four optimization levels (-O0, -O1, -O2, -O3) that let users trade startup time for performance:

  • -O0: No optimization. Fastest startup time, but lowest performance.
  • -O1: Fast optimization. Simple compilation, fast fusions, and PIECEWISE cudagraphs.
  • -O2: Full optimization (default). Additional compilation ranges, additional fusions, and FULL_AND_PIECEWISE cudagraphs.
  • -O3: Aggressive optimization. Currently equal to -O2, but may include additional time-consuming or experimental optimizations in the future.

All optimization level defaults can be achieved by manually setting the underlying flags. User-set flags take precedence over optimization level defaults.
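
The precedence rule can be pictured as a dictionary merge in which user-set values are applied last. The sketch below is a toy model under that assumption, not vLLM's implementation: `LEVEL_DEFAULTS` and `resolve_flags` are hypothetical names, and the table holds only the cudagraph and compilation modes listed on this page.

```python
# Toy model of "user-set flags take precedence over optimization level defaults".
# LEVEL_DEFAULTS and resolve_flags are hypothetical names, not vLLM APIs.

LEVEL_DEFAULTS = {
    0: {"cudagraph_mode": "NONE", "mode": "NONE"},
    1: {"cudagraph_mode": "PIECEWISE", "mode": "VLLM_COMPILE"},
    2: {"cudagraph_mode": "FULL_AND_PIECEWISE", "mode": "VLLM_COMPILE"},
}
# -O3 is currently equal to -O2.
LEVEL_DEFAULTS[3] = dict(LEVEL_DEFAULTS[2])


def resolve_flags(level, user_flags=None):
    """Return the effective flags: level defaults, overridden by user-set flags."""
    effective = dict(LEVEL_DEFAULTS[level])  # copy so the table is not mutated
    effective.update(user_flags or {})       # user values win on conflict
    return effective


if __name__ == "__main__":
    # -O2 with an explicit cudagraph mode: the explicit value is kept.
    print(resolve_flags(2, {"cudagraph_mode": "PIECEWISE"}))
```

In this model, a flag the user does not set keeps the level default, so overriding one setting leaves the rest of the level intact.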

Level Summaries and Usage Examples

# CLI usage
python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O1

# Python API usage
from vllm.entrypoints.llm import LLM

llm = LLM(
    model="RedHatAI/Llama-3.2-1B-FP8",
    optimization_level=2 # equivalent to -O2
)

-O0: No Optimization

Start up as fast as possible: no autotuning, no compilation, and no cudagraphs. This level is good for the initial phases of development and debugging.

Settings:

  • -cc.cudagraph_mode=NONE
  • -cc.mode=NONE (also resulting in -cc.custom_ops=["none"])
  • -cc.pass_config.fuse_...=False (all fusions disabled)
  • --kernel-config.enable_flashinfer_autotune=False

-O1: Fast Optimization

Prioritize fast startup, but still enable basic optimizations like compilation and cudagraphs. This level is a good balance for most development scenarios, where you want faster startup but still want to make sure your code does not break cudagraphs or compilation.

Settings:

  • -cc.cudagraph_mode=PIECEWISE
  • -cc.mode=VLLM_COMPILE
  • --kernel-config.enable_flashinfer_autotune=True

Fusions:

  • -cc.pass_config.fuse_norm_quant=True*
  • -cc.pass_config.fuse_act_quant=True*
  • -cc.pass_config.fuse_act_padding=True
  • -cc.pass_config.fuse_rope_kvcache=True† (will be moved to O2)

* These fusions are only enabled when either op is using a custom kernel; otherwise, Inductor fusion is better.
† This fusion is ROCm-only and requires AITER.
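
The footnotes make some of the -O1 fusion defaults conditional. The following sketch models that logic as plain booleans. It is an illustration only: `o1_fusion_defaults` and its parameters are hypothetical, "either op" is read here as either of the two ops being fused, and how vLLM actually detects custom kernels, ROCm, or AITER is not covered.

```python
# Toy model of the conditional -O1 fusion defaults described by the footnotes.
# The function and its parameters are hypothetical, not part of vLLM.

def o1_fusion_defaults(norm_custom=False, act_custom=False, quant_custom=False,
                       is_rocm=False, has_aiter=False):
    """Return the -O1 fusion defaults for one hypothetical platform description."""
    return {
        # (*) enabled only when either of the fused ops uses a custom kernel
        "fuse_norm_quant": norm_custom or quant_custom,
        "fuse_act_quant": act_custom or quant_custom,
        # no footnote: enabled unconditionally at -O1
        "fuse_act_padding": True,
        # (†) ROCm-only, and AITER must be available
        "fuse_rope_kvcache": is_rocm and has_aiter,
    }


if __name__ == "__main__":
    # Custom quant kernel on a non-ROCm platform.
    print(o1_fusion_defaults(quant_custom=True))
```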

-O2: Full Optimization (Default)

Prioritize performance at the expense of additional startup time. This level is recommended for production workloads and is therefore the default. Fusions at this level may take longer to compile because of the additional compile ranges.

Settings (on top of -O1):

  • -cc.cudagraph_mode=FULL_AND_PIECEWISE
  • -cc.pass_config.fuse_allreduce_rms=True
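
"On top of -O1" means -O2 inherits every -O1 setting and changes only the entries listed above. A minimal sketch of that layering, assuming the levels can be represented as plain dictionaries (all names below are hypothetical, not vLLM APIs, and the -O1 fusions are omitted for brevity):

```python
# Toy model of "-O2 settings on top of -O1". Names are hypothetical, not vLLM APIs.

O1_SETTINGS = {
    "cudagraph_mode": "PIECEWISE",
    "mode": "VLLM_COMPILE",
    "enable_flashinfer_autotune": True,
}

# Only the settings that -O2 changes or adds relative to -O1.
O2_DELTA = {
    "cudagraph_mode": "FULL_AND_PIECEWISE",
    "fuse_allreduce_rms": True,
}

# Later entries win, so the delta overrides the inherited -O1 values.
O2_SETTINGS = {**O1_SETTINGS, **O2_DELTA}


def changed_keys(base, derived):
    """Return the sorted keys that are new in `derived` or differ from `base`."""
    return sorted(k for k, v in derived.items() if k not in base or base[k] != v)


if __name__ == "__main__":
    print(changed_keys(O1_SETTINGS, O2_SETTINGS))
```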

-O3: Aggressive Optimization

This level is currently the same as -O2, but may include additional optimizations in the future that are more time-consuming or experimental.

Troubleshooting

Common Issues

  1. Startup Time Too Long: Use -O0 or -O1 for faster startup
  2. Compilation Errors: Use debug_dump_path for additional debugging information
  3. Performance Issues: Make sure -O2 is used for production