Model Quantization: A Hands-On Guide to INT8 and INT4 Quantization
Model quantization is a key technique for model compression and acceleration: it can dramatically reduce memory usage and inference latency while largely preserving model accuracy.
What Is Model Quantization?
Model quantization converts floating-point weights into low-precision integers (such as INT8 or INT4):
FP32 (32-bit) → INT8 (8-bit) → INT4 (4-bit)
Benefits of Quantization
- Lower memory usage: INT8 quantization cuts weight memory by 75% compared to FP32 (see the sketch after this list)
- Faster inference: INT8 inference is typically 2-4x faster
- Lower power consumption: friendlier for deployment on mobile and edge devices
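As a quick check on the memory claim above, here is a small sketch (pure arithmetic, no framework needed) that computes the relative weight-memory savings:

def memory_reduction(bits_from, bits_to):
    """Fraction of weight memory saved when moving between bit widths."""
    return 1 - bits_to / bits_from

print(f"FP32 -> INT8: {memory_reduction(32, 8):.0%} saved")   # 75%
print(f"FP32 -> INT4: {memory_reduction(32, 4):.0%} saved")   # 88%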
INT8 Quantization
Quantization Formula
import torch

def quantize_to_int8(tensor, scale):
    """
    Quantize an FP32 tensor to INT8.
    """
    # Scale into the INT8 range
    scaled = tensor / scale
    # Clip to [-128, 127]
    clipped = torch.clamp(scaled, -128, 127)
    # Round to the nearest integer
    quantized = torch.round(clipped).to(torch.int8)
    return quantized, scale

def dequantize_from_int8(quantized, scale):
    """
    Dequantize an INT8 tensor back to FP32.
    """
    return quantized.float() * scale
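A quick usage sketch of the two helpers above on a random tensor, to see how large the round-trip error is:

import torch

# Round-trip a random FP32 tensor through INT8 and measure the error
weight = torch.randn(4, 8)
scale = weight.abs().max() / 127.0          # symmetric per-tensor scale
q, s = quantize_to_int8(weight, scale)
recovered = dequantize_from_int8(q, s)
print("max abs error:", (weight - recovered).abs().max().item())  # roughly scale / 2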
Calibration Methods
Static Quantization
def calibrate_model(model, calibration_data):
    model.eval()
    # Run calibration batches to collect activation statistics
    # (this simplified version derives the scales below from the weights only)
    for data in calibration_data:
        with torch.no_grad():
            _ = model(data)
    # Compute quantization parameters
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            # Symmetric per-tensor scale from the weight range
            weight_max = module.weight.detach().abs().max()
            scale = weight_max / 127.0
            # Quantize the weights
            quantized_weight, _ = quantize_to_int8(
                module.weight.data, scale
            )
            module.register_buffer('quantized_weight', quantized_weight)
            module.register_buffer('scale', scale)
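A usage sketch with a toy model and random calibration batches (both are assumptions for illustration):

import torch

toy_model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)
calibration_data = [torch.randn(8, 16) for _ in range(10)]
calibrate_model(toy_model, calibration_data)

# Every Linear now carries INT8 weights plus a scale buffer
for m in toy_model.modules():
    if isinstance(m, torch.nn.Linear):
        print(m.quantized_weight.dtype, m.scale.item())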
Dynamic Quantization
import torch
import torch.quantization as quantization

# Dynamic quantization: weights are quantized ahead of time,
# activations are quantized on the fly at runtime
model_dynamic = quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8
)
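A usage sketch, assuming a PyTorch build with quantized CPU kernels (e.g. fbgemm):

import torch
import torch.quantization as quantization

fp32_model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
int8_model = quantization.quantize_dynamic(
    fp32_model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(int8_model(x).shape)   # same interface as the FP32 model
print(int8_model[0])         # the Linear is now a dynamically quantized module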
INT4 Quantization
GPTQ Quantization
GPTQ is a post-training quantization method that works particularly well for large language models. The full algorithm also compensates quantization error using approximate second-order (Hessian) information; the sketch below shows only the group-wise rounding part in simplified form:
def gptq_quantize(weight, bits=4):
    """
    Simplified group-wise INT4 quantization (GPTQ-style grouping only;
    the error-compensation step of full GPTQ is omitted).
    """
    group_size = 128
    quantized_weight = []
    for i in range(0, weight.shape[0], group_size):
        group = weight[i:i+group_size]
        # Symmetric scale for this group
        scale = group.abs().max() / (2 ** (bits - 1) - 1)
        # Quantize and clip to the signed INT4 range
        quantized = torch.round(group / scale).clamp(
            -(2 ** (bits - 1)),
            2 ** (bits - 1) - 1
        )
        quantized_weight.append((quantized, scale))
    return quantized_weight
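To sanity-check the grouping scheme above, here is a small illustration that reconstructs the weight from the returned (quantized, scale) pairs; dequantize_groups is a helper introduced only for this example:

import torch

def dequantize_groups(groups):
    # Rebuild the FP32 weight from per-group (quantized, scale) pairs
    return torch.cat([q * s for q, s in groups], dim=0)

weight = torch.randn(512, 512)
groups = gptq_quantize(weight, bits=4)
recovered = dequantize_groups(groups)
print("mean abs error:", (weight - recovered).abs().mean().item())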
AWQ Quantization
AWQ (Activation-aware Weight Quantization) takes the activation distribution into account: weight channels that see large activations matter more for output quality. The simplified sketch below illustrates the idea with a per-channel importance threshold; the full method instead protects salient channels by rescaling them before quantization:
def awq_quantize(weight, activation, bits=4):
    """
    Simplified activation-aware quantization: important channels
    get the full (finer) quantization range.
    """
    # Per-input-channel importance from the activation magnitudes
    importance = activation.abs().mean(dim=0)
    threshold = importance.mean()
    quantized_weight = []
    for i in range(weight.shape[1]):
        col_importance = importance[i]
        # Important channels use the full range; less important
        # channels get a coarser effective range
        if col_importance > threshold:
            scale = weight[:, i].abs().max() / (2 ** (bits - 1) - 1)
        else:
            scale = weight[:, i].abs().max() / (2 ** (bits - 2) - 1)
        quantized = torch.round(weight[:, i] / scale).clamp(
            -(2 ** (bits - 1)),
            2 ** (bits - 1) - 1
        )
        quantized_weight.append((quantized, scale))
    return quantized_weight
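A usage sketch with random tensors, where activation stands in for a batch of calibration inputs seen by this layer:

import torch

weight = torch.randn(256, 64)        # [out_features, in_features]
activation = torch.randn(32, 64)     # [batch, in_features]
cols = awq_quantize(weight, activation, bits=4)
print(len(cols), cols[0][0].shape)   # one (quantized column, scale) pair per input channel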
Quantization-Aware Training (QAT)
Fake Quantization Nodes
import torch
import torch.nn as nn

class FakeQuantize(nn.Module):
    def __init__(self, bits=8):
        super().__init__()
        self.bits = bits
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        # Forward pass: quantize + dequantize so downstream layers see
        # the quantization error ("fake" quantization)
        quantized, _ = quantize_to_int8(x.detach(), self.scale.detach())
        dequantized = dequantize_from_int8(quantized, self.scale.detach())
        # Backward pass: straight-through estimator, i.e. gradients bypass
        # the non-differentiable rounding and flow to x unchanged
        return x + (dequantized - x.detach())
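A quick sketch of how the straight-through estimator behaves: the rounding itself is not differentiable, yet gradients still reach the input:

import torch

fq = FakeQuantize(bits=8)
x = torch.randn(4, 4, requires_grad=True)
y = fq(x * 10)            # scale up the input so rounding error is visible
y.sum().backward()
print(x.grad)             # a tensor of 10s: gradients passed straight through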
QAT Training Loop
# 1. Insert fake-quantization nodes (insert_fake_quantize is sketched below)
model = insert_fake_quantize(model)
# 2. Train as usual
for epoch in range(num_epochs):
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
# 3. Convert to a real quantized model
quantized_model = convert_to_quantized(model)
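insert_fake_quantize and convert_to_quantized above are placeholders for your own helpers. As one concrete possibility, here is a minimal sketch of insert_fake_quantize that appends a FakeQuantize node after every Linear layer:

import torch.nn as nn

def insert_fake_quantize(model, bits=8):
    # Wrap each Linear so its output passes through a FakeQuantize node
    for name, child in list(model.named_children()):
        if isinstance(child, nn.Linear):
            setattr(model, name, nn.Sequential(child, FakeQuantize(bits)))
        else:
            insert_fake_quantize(child, bits)   # recurse into submodules
    return model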
Inference Optimization
A Quantized Inference Engine
import torch
import torch.nn.functional as F

class QuantizedInference:
    def __init__(self, quantized_module, use_int8_kernel=False):
        # quantized_module: a Linear layer carrying the `quantized_weight`
        # and `scale` buffers produced during calibration
        self.model = quantized_module
        self.use_int8_kernel = use_int8_kernel

    def forward(self, x):
        if not self.use_int8_kernel:
            # Fallback path: dequantize the weight and run an FP32 matmul
            weight_fp32 = dequantize_from_int8(
                self.model.quantized_weight,
                self.model.scale
            )
            output = F.linear(x, weight_fp32)
        else:
            # Use the INT8 weights directly (requires hardware/kernel support;
            # quantized_linear stands in for such an INT8 GEMM kernel)
            output = quantized_linear(x, self.model.quantized_weight)
        return output
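A usage sketch that wires the pieces together, quantizing a single Linear layer by hand (as calibrate_model does) and running it through the engine's FP32 fallback path:

import torch

layer = torch.nn.Linear(16, 32)
scale = layer.weight.detach().abs().max() / 127.0
q, _ = quantize_to_int8(layer.weight.data, scale)
layer.register_buffer('quantized_weight', q)
layer.register_buffer('scale', scale)

engine = QuantizedInference(layer)
print(engine.forward(torch.randn(4, 16)).shape)   # torch.Size([4, 32])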
Performance Comparison
Results
Quantization results on a 7B model:
| Method | Memory | Inference speed | Accuracy loss |
|---|---|---|---|
| FP32 | 14GB | 1x | 0% |
| INT8 | 4GB | 3.2x | <1% |
| INT4 | 2GB | 4.5x | <2% |
Accuracy-Speed Trade-offs
- INT8: the sweet spot for most scenarios
- INT4: maximum compression, for resource-constrained deployments
- Mixed precision: keep critical layers in FP16 and quantize the rest to INT8 (see the sketch after this list)
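One simple way to express such a mixed-precision policy, reusing quantize_to_int8 from earlier; the layer-name list is a hypothetical example:

import torch

def mixed_precision_quantize(model, sensitive_layers=("lm_head",)):
    # Quantize every Linear except those whose name matches the sensitive
    # list, which stay in their original (higher) precision
    for name, module in model.named_modules():
        if not isinstance(module, torch.nn.Linear):
            continue
        if any(key in name for key in sensitive_layers):
            continue
        scale = module.weight.detach().abs().max() / 127.0
        q, _ = quantize_to_int8(module.weight.data, scale)
        module.register_buffer('quantized_weight', q)
        module.register_buffer('scale', scale)
    return model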
Best Practices
- Calibration data: calibrate with data that is representative of production inputs
- Layer-wise quantization: keep sensitive layers at higher precision
- Accuracy monitoring: always validate accuracy after quantization (see the sketch after this list)
- Hardware fit: check which quantized kernels the target hardware actually supports
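As a minimal form of accuracy monitoring, one can compare the FP32 and quantized outputs on a few held-out batches; the cosine-similarity metric here is just one reasonable choice:

import torch
import torch.nn.functional as F

def check_quantization_accuracy(fp32_model, quant_model, val_batches):
    # Average cosine similarity between FP32 and quantized outputs
    fp32_model.eval()
    quant_model.eval()
    sims = []
    with torch.no_grad():
        for x in val_batches:
            ref = fp32_model(x).flatten(1)
            out = quant_model(x).flatten(1)
            sims.append(F.cosine_similarity(ref, out, dim=1).mean().item())
    return sum(sims) / len(sims)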
Summary
Model quantization is a key technique for model deployment; used well, it greatly improves inference efficiency while preserving accuracy. Choosing the right quantization method and precision level is a trade-off that depends on the target application.
I hope these notes are useful to developers who are optimizing model inference performance!