vllm.model_executor.layers.quantization.utils.mps_dequant ¶
MPS (Metal) dequantization utilities for AWQ, GPTQ, and GGUF models.
Uses Metal kernel packages when available, with pure PyTorch/numpy fallbacks for environments where the kernels aren't installed.
_get_metal_dequant ¶
Try to import Metal dequant kernel package (cached).
Source code in vllm/model_executor/layers/quantization/utils/mps_dequant.py
_get_metal_dequant_gguf ¶
Try to import Metal dequant_gguf kernel package (cached).
Source code in vllm/model_executor/layers/quantization/utils/mps_dequant.py
_pytorch_dequant_awq ¶
Pure PyTorch AWQ dequantization — bitwise unpack + scale.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| qweight | Tensor | [in_features, out_features/8] packed int32 | required |
| scales | Tensor | [num_groups, out_features] float16 | required |
| qzeros | Tensor | [num_groups, out_features/8] packed int32 | required |
| group_size | int | quantization group size | required |
Returns:
| Type | Description |
|---|---|
| Tensor | [in_features, out_features] float16 weight matrix |
Source code in vllm/model_executor/layers/quantization/utils/mps_dequant.py
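The unpack-and-rescale can be illustrated with a minimal sketch. It assumes 4-bit values packed eight per int32 in sequential nibble order; real AWQ additionally interleaves nibbles within each word, which this sketch omits, and the function names here are illustrative, not the module's actual helpers.

```python
import torch


def _unpack_nibbles(packed: torch.Tensor) -> torch.Tensor:
    """[rows, cols] int32 -> [rows, cols*8] values in 0..15."""
    shifts = torch.arange(0, 32, 4, dtype=torch.int32)
    nib = (packed.unsqueeze(-1) >> shifts) & 0xF
    return nib.reshape(packed.shape[0], -1)


def dequant_awq_sketch(qweight, scales, qzeros, group_size):
    w = _unpack_nibbles(qweight).to(torch.float16)   # [in, out]
    z = _unpack_nibbles(qzeros).to(torch.float16)    # [groups, out]
    # Expand per-group scales/zeros to one row per input feature.
    s = scales.repeat_interleave(group_size, dim=0)
    z = z.repeat_interleave(group_size, dim=0)
    return (w - z) * s                               # [in, out] float16
```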
_pytorch_dequant_gguf ¶
_pytorch_dequant_gguf(
W: Tensor,
quant_type: int,
m: int,
n: int,
dtype: dtype | None = None,
) -> Tensor
Fallback GGUF dequantization using the gguf Python library.
This does a GPU→CPU→GPU round-trip via numpy, so it's slow but correct.
Source code in vllm/model_executor/layers/quantization/utils/mps_dequant.py
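The round-trip pattern can be illustrated with a hand-rolled decoder for GGUF's Q8_0 format (each block is a 2-byte float16 scale followed by 32 int8 quants, dequantized as scale * q); the real fallback delegates per-type decoding to the gguf library, and the names here are hypothetical.

```python
import numpy as np
import torch


def _dequant_q8_0(raw: np.ndarray) -> np.ndarray:
    """Decode Q8_0 blocks: 2-byte float16 scale + 32 int8 quants each."""
    blocks = raw.reshape(-1, 34)
    d = blocks[:, :2].copy().view(np.float16).astype(np.float32)   # [n_blocks, 1]
    qs = blocks[:, 2:].copy().view(np.int8).astype(np.float32)     # [n_blocks, 32]
    return (d * qs).reshape(-1)


def gguf_dequant_fallback(W: torch.Tensor, m: int, n: int,
                          dtype: torch.dtype = torch.float16) -> torch.Tensor:
    """Hypothetical fallback: tensor -> CPU numpy -> decode -> back to device."""
    raw = W.detach().cpu().numpy()   # device -> CPU hop
    dense = _dequant_q8_0(raw)       # stand-in for the gguf library's decoder
    return torch.from_numpy(dense).to(device=W.device, dtype=dtype).reshape(m, n)
```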
_pytorch_dequant_gptq ¶
_pytorch_dequant_gptq(
qweight: Tensor,
scales: Tensor,
qzeros: Tensor,
g_idx: Tensor,
group_size: int,
use_v2_format: bool = False,
) -> Tensor
Pure PyTorch GPTQ dequantization — bitwise unpack + scale.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| qweight | Tensor | [in_features/8, out_features] packed int32 | required |
| scales | Tensor | [num_groups, out_features] float16 | required |
| qzeros | Tensor | [num_groups, out_features/8] packed int32 | required |
| g_idx | Tensor | [in_features] int32 group index (empty if no desc_act) | required |
| group_size | int | quantization group size | required |
| use_v2_format | bool | if True, use v2 zero-point convention (no offset). v1 (default): stored_zero = true_zero - 1, so add 1 back. | False |
Returns:
| Type | Description |
|---|---|
| Tensor | [in_features, out_features] float16 weight matrix |
Source code in vllm/model_executor/layers/quantization/utils/mps_dequant.py
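A minimal sketch of the GPTQ unpack-and-rescale, assuming 4-bit values packed eight per int32 in sequential nibble order (rows packed in `qweight`, columns packed in `qzeros`); function names are illustrative, not the module's actual helpers.

```python
import torch


def _unpack_rows(packed: torch.Tensor) -> torch.Tensor:
    """[in/8, out] int32 -> [in, out] values in 0..15 (rows packed 8-per-word)."""
    shifts = torch.arange(0, 32, 4, dtype=torch.int32)
    nib = (packed.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
    return nib.reshape(-1, packed.shape[1])


def _unpack_cols(packed: torch.Tensor) -> torch.Tensor:
    """[groups, out/8] int32 -> [groups, out] values in 0..15 (cols packed)."""
    shifts = torch.arange(0, 32, 4, dtype=torch.int32)
    nib = (packed.unsqueeze(-1) >> shifts) & 0xF
    return nib.reshape(packed.shape[0], -1)


def dequant_gptq_sketch(qweight, scales, qzeros, g_idx, group_size,
                        use_v2_format=False):
    w = _unpack_rows(qweight).to(torch.float16)   # [in, out]
    z = _unpack_cols(qzeros).to(torch.float16)    # [groups, out]
    if not use_v2_format:
        z = z + 1                                 # v1: stored_zero = true_zero - 1
    if g_idx.numel() > 0:                         # desc_act: per-row group lookup
        s = scales[g_idx.long()]
        z = z[g_idx.long()]
    else:                                         # contiguous groups
        s = scales.repeat_interleave(group_size, dim=0)
        z = z.repeat_interleave(group_size, dim=0)
    return (w - z) * s
```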
awq_dequant_matmul ¶
Dequantize AWQ weights and perform matmul on MPS.
Uses the Metal kernel when available and falls back to pure PyTorch otherwise.
Source code in vllm/model_executor/layers/quantization/utils/mps_dequant.py
gguf_dequant_on_mps ¶
gguf_dequant_on_mps(
W: Tensor,
quant_type: int,
m: int,
n: int,
dtype: dtype | None = None,
) -> Tensor
Dequantize GGUF weights on MPS.
Uses the Metal kernel when available for all standard GGUF types and falls back to the gguf library (numpy) for unsupported types (IQ*).
Source code in vllm/model_executor/layers/quantization/utils/mps_dequant.py
gptq_dequant_matmul ¶
gptq_dequant_matmul(
x: Tensor,
layer: Any,
bias: Tensor | None,
quant_config: Any,
use_v2_format: bool = False,
) -> Tensor
Dequantize GPTQ weights and perform matmul on MPS.
Uses the Metal kernel when available and falls back to pure PyTorch otherwise.
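The kernel-or-fallback dispatch shared by the matmul entry points can be sketched generically; `kernels` and `dequant_fn` stand in for the Metal module and the PyTorch fallback, and every name here is illustrative rather than the module's actual API.

```python
import torch


def dequant_matmul_sketch(x, quant_tensors, dequant_fn, bias=None, kernels=None):
    """Dispatch: fused Metal kernel when present, else dequantize then matmul."""
    if kernels is not None:
        # Fused path: the kernel dequantizes and multiplies in one launch.
        return kernels.dequant_matmul(x, *quant_tensors, bias)
    # Fallback: materialize the dense fp16 weight, then a plain matmul.
    W = dequant_fn(*quant_tensors)   # [in_features, out_features]
    out = x.to(W.dtype) @ W
    return out if bias is None else out + bias
```

The fallback trades memory for simplicity: it materializes the full dense weight on every call, which is why the Metal path is preferred when the kernel package imports.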