Quantizing LLMs: Quick Code Snippets to Cut Memory Use and Inference Cost
2026-03-11

Practical, up-to-date code to safely quantize LLMs (8‑bit/int8) with PyTorch, HuggingFace/bitsandbytes and ONNX + benchmarks for memory and latency tradeoffs.

Cut model memory and inference cost with safe, repeatable quantization

If your LLM projects are eating GPU/host RAM and driving up cloud bills, you need reproducible ways to trade a little accuracy for major memory savings and latency improvements. This article gives practical, copy-paste-ready code for the three most common production paths in 2026: PyTorch (dynamic quantization), HuggingFace + bitsandbytes (8-bit GPU inference), and ONNX Runtime (int8 CPU/GPU inference). Each snippet shows how to quantize safely, measure memory and latency, and decide where to keep floating point for critical layers.

Why quantization matters in 2026

By late 2025 and early 2026, memory prices and availability have become a real constraint for AI teams. As noted during CES 2026:

Memory chip scarcity is driving up prices for laptops and PCs

For teams shipping models at scale, that means two things: (1) hardware costs matter again; and (2) squeezing models to run in less memory without bespoke retraining is a huge win. Quantization is the practical lever: convert weights and sometimes activations from fp32/fp16 to int8/8-bit formats to reduce memory footprint and move from high-cost GPU slots to cheaper CPU resources or denser GPU packing.

Key principles before you start

  • Start with a goal: disk size, peak GPU memory, or tokens/sec? Benchmark against that baseline.
  • Test with representative inputs: realistic prompts and batch sizes—accuracy loss is input-dependent.
  • Quantize selectively: avoid quantizing LayerNorm, softmax or small embeddings unless you've validated them.
  • Use per-channel weight quantization: it preserves accuracy for transformer linear layers.
  • Warm up and measure properly: steady-state latency matters; include warmup runs, compute p50/p95.
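
The last bullet is easy to get wrong, so here is a small, framework-agnostic helper that does the warmup and reports p50/p95 from sorted samples. `latency_stats` is my own name, not a library function; the callable you pass in would wrap your model's generate/forward call:

```python
import time
import statistics

def latency_stats(fn, runs=50, warmup=10):
    """Call fn() repeatedly and return steady-state p50/p95 latency in ms.

    Warmup runs are excluded so JIT compilation and graph building
    don't pollute the numbers.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples) * 1000,
        "p95_ms": samples[int(0.95 * (len(samples) - 1))] * 1000,
    }

# usage sketch: latency_stats(lambda: model.generate(**inputs, max_new_tokens=16))
```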

1) HuggingFace + bitsandbytes: fast path for 8-bit GPU inference

bitsandbytes (bnb) has become the de facto way to run large transformers in 8-bit on GPUs without retraining. This path is usually the fastest to get running on existing HF models and gives excellent memory reduction with a small accuracy drop.

Install

Install the essentials (run in your environment):

pip install transformers accelerate bitsandbytes

Load an 8-bit model and measure memory & latency

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
import time

MODEL = "gpt2"  # replace with your model (e.g. Llama-2 or another HF causal model)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# transformers now prefers an explicit BitsAndBytesConfig over the older
# load_in_8bit=True flag; both require bitsandbytes to be installed
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=bnb_config, device_map="auto")

# helper to measure
def measure_latency(prompt, runs=50, warmup=10):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # warmup
    for _ in range(warmup):
        _ = model.generate(**inputs, max_new_tokens=16)
    # timed runs
    t0 = time.perf_counter()
    for _ in range(runs):
        _ = model.generate(**inputs, max_new_tokens=16)
    t = (time.perf_counter() - t0) / runs
    return t

print("Device:", model.device)
# check GPU memory usage if on CUDA
if torch.cuda.is_available():
    print("CUDA memory allocated:", torch.cuda.memory_allocated() / 1024**2, "MB")

lat = measure_latency("Hello world")
print(f"avg latency: {lat*1000:.1f} ms")

Notes:

  • device_map="auto" uses accelerate to place layers—works well for mixed GPU/CPU setups.
  • bitsandbytes uses 8-bit quantized weight matrices with runtime kernels—no retraining required.
  • If your model is huge, use low_cpu_mem_usage=True when calling from_pretrained.
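
To apply the "quantize selectively" principle on this path, BitsAndBytesConfig accepts an llm_int8_skip_modules list to keep named submodules out of int8. A config sketch — "lm_head" is the typical output-head name for causal LMs, but check your own model's module names:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize everything to 8-bit except the output projection.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],  # model-specific; verify via model.named_modules()
)
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", quantization_config=bnb_config, device_map="auto"
)
```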

2) PyTorch dynamic quantization: cost-effective CPU inference (int8)

When your deployment target is CPU, torch.quantization.quantize_dynamic is a simple, reliable option. It converts Linear layers to int8 on the fly and often halves model size while speeding up CPU inference.

Example: DistilBERT classification on CPU

import torch
import time
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.ao.quantization import quantize_dynamic  # formerly torch.quantization
import os

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

# save baseline size
torch.save(model.state_dict(), "model_fp32.pth")
size_fp32 = os.path.getsize("model_fp32.pth") / 1024**2
print(f"FP32 state_dict: {size_fp32:.1f} MB")

# quantize (affects Linear modules)
model_q = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
torch.save(model_q.state_dict(), "model_int8.pth")
size_int8 = os.path.getsize("model_int8.pth") / 1024**2
print(f"INT8 state_dict: {size_int8:.1f} MB")

# quick latency test on CPU
inputs = tokenizer("I love this!", return_tensors="pt")
with torch.no_grad():
    # warmup
    for _ in range(10):
        _ = model_q(**inputs)
    t0 = time.perf_counter()
    for _ in range(50):
        _ = model_q(**inputs)
    print("avg latency (ms):", (time.perf_counter()-t0)/50*1000)

Important points:

  • Quantize on CPU only: PyTorch dynamic quant works best for CPU endpoints.
  • No calibration data required: dynamic quantization chooses scales at runtime.
  • Check accuracy: compare logits or task metrics after quantization and keep floating versions of sensitive layers when needed.
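
For the logit comparison in the last bullet, a tiny helper is enough. `max_abs_logit_diff` is an illustrative name, not a library function, and the tolerance is something you'd tune per task:

```python
def max_abs_logit_diff(logits_fp32, logits_int8, tol=0.1):
    """Compare two flat sequences of logits element-wise.

    Returns (max_diff, ok): a large max_diff after quantization is an
    early warning of numerical trouble, before task metrics degrade.
    """
    diff = max(abs(a - b) for a, b in zip(logits_fp32, logits_int8))
    return diff, diff <= tol

# usage sketch with the HF models above (flatten tensors to plain lists):
#   ref = model(**inputs).logits.flatten().tolist()
#   qnt = model_q(**inputs).logits.flatten().tolist()
#   diff, ok = max_abs_logit_diff(ref, qnt)
```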

3) ONNX + ONNX Runtime quantization: static & dynamic int8 for CPU/GPU

ONNX gives portability and mature production runtimes (ONNX Runtime, TensorRT via ONNX, OpenVINO). There are two common ONNX quant flows:

  • quantize_dynamic — quick, works without calibration, good for weights.
  • quantize_static — uses calibration dataset to quantize activations and weights; higher accuracy.

Export to ONNX and apply dynamic quantization

Below is a compact flow for a causal model (use a small model for testing). For bigger models, use HuggingFace optimum export helpers.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType
import onnxruntime as ort
import os
import time

MODEL = "distilgpt2"
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# export a tiny ONNX with torch.onnx (real exports usually go through
# optimum; adjust opset, input names, and dynamic axes for your model)
input_ids = tokenizer("Hello world", return_tensors="pt").input_ids
torch.onnx.export(model, (input_ids,), "model.onnx", opset_version=15, do_constant_folding=True,
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq"}})
print("ONNX size:", os.path.getsize("model.onnx")/1024**2, "MB")

# quantize weights to int8
quantize_dynamic("model.onnx", "model_q.onnx", weight_type=QuantType.QInt8)
print("Quantized ONNX size:", os.path.getsize("model_q.onnx")/1024**2, "MB")

# benchmark with ONNX Runtime (CPU) — use CUDAExecutionProvider for GPU
session = ort.InferenceSession("model_q.onnx", providers=["CPUExecutionProvider"]) 
inputs = {session.get_inputs()[0].name: input_ids.numpy()}
# warmup
for _ in range(10):
    _ = session.run(None, inputs)
# timed
t0 = time.perf_counter()
for _ in range(100):
    _ = session.run(None, inputs)
print("avg latency (ms):", (time.perf_counter()-t0)/100*1000)

For production, use quantize_static with a calibration dataset and per-channel quantization on weight tensors. ONNX Runtime and the optimum toolkit have improved static workflows in 2025–2026, including better CUDA execution provider support for int8.
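
A static-quant sketch: quantize_static takes a calibration reader object that exposes get_next(). The class below implements that interface standalone so the contract is clear; in real use you would subclass onnxruntime.quantization.CalibrationDataReader, and the commented feed-building lines assume a tokenizer that can return NumPy tensors:

```python
class PromptCalibrationReader:
    """Minimal calibration reader: yields one feed dict per get_next()
    call, then None when exhausted — the contract ONNX Runtime's static
    quantizer expects. In real use, subclass
    onnxruntime.quantization.CalibrationDataReader."""

    def __init__(self, feeds):
        self._it = iter(feeds)

    def get_next(self):
        return next(self._it, None)

# usage sketch (assumes onnxruntime is installed and model.onnx exists):
#   from onnxruntime.quantization import quantize_static, QuantType
#   feeds = [{"input_ids": tokenizer(p, return_tensors="np").input_ids}
#            for p in representative_prompts]
#   quantize_static("model.onnx", "model_static_q.onnx",
#                   PromptCalibrationReader(feeds),
#                   per_channel=True, weight_type=QuantType.QInt8)
```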

Safety checklist: keep inference accurate

  • Run a small held-out validation set and compute task metrics (e.g., perplexity, F1) before and after quantization.
  • Compare logits with fp32 at a few inputs; large differences signal numerical issues.
  • If accuracy drops, try: per-channel weight quantization, finer-grained layer selection (quantize only feed-forward layers), or static calibration with representative data.
  • Use SmoothQuant or similar transforms when activations have large dynamic ranges—move some scaling into weights to stabilize int8 results.

How to benchmark memory & latency methodically

Below is a short protocol you can copy into CI to track regressions.

  1. Choose representative prompts and batch sizes used in production.
  2. Measure baseline: disk size, peak memory, and latency (p50/p95) for the float model.
  3. Apply quantization variant (bitsandbytes, dynamic, ONNX) and repeat measurements.
  4. Log per-request CPU/GPU memory and wall time. Use pynvml for GPU memory and psutil for host memory.
  5. Track quality metrics—tokens-per-second is useless without task accuracy checks.

Example: measuring GPU memory use with pynvml

import pynvml
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print("GPU used (MB):", info.used/1024**2)
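
For host memory (step 4 above mentions psutil), a small helper with a stdlib fallback; `host_memory_mb` is my own name for this sketch:

```python
import sys

def host_memory_mb():
    """Host memory of this process in MB: current RSS via psutil if
    available, else peak RSS from the stdlib resource module (Unix)."""
    try:
        import psutil
        return psutil.Process().memory_info().rss / 1024**2
    except ImportError:
        import resource
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is kilobytes on Linux but bytes on macOS
        return peak / (1024**2 if sys.platform == "darwin" else 1024)

print("Host memory (MB):", host_memory_mb())
```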

Real-world tradeoffs — practical rules of thumb

  • bitsandbytes 8-bit: typically 2–4x memory reduction on GPU with minor quality loss for many generative tasks—fastest to adopt for large HF models in 2026.
  • PyTorch dynamic (CPU): 2x size reduction, lower cost to run on CPU; best for classification and small models.
  • ONNX quant & static int8: best production portability and often the best accuracy when calibration is used; slightly more engineering effort to export and calibrate.
  • When strict accuracy is required: prefer per-channel/int8 static with calibration or keep critical layers in fp16/fp32.

Advanced strategies you should consider in 2026

  • Hybrid quantization: keep attention softmax and LayerNorm in fp16, quantize only Linear/MatMul weight matrices.
  • Layer-wise profiling: identify biggest layers by memory and quantize them first—often FFN projection matrices are the largest.
  • Automated quant search: run a small search that quantizes different layer groups and measures quality vs. size to find the best balance.
  • Use GPTQ/AWQ for very large models: for sub-4-bit deployments these methods (mature by 2025) can produce usable 4/3-bit models—complex but good for extreme memory limits.
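
The layer-wise profiling bullet is nearly a one-liner in practice. This framework-agnostic sketch (`largest_params` is an illustrative name) ranks parameters by element count; with a PyTorch model you'd feed it `(n, p.numel()) for n, p in model.named_parameters()`:

```python
def largest_params(named_numels, top=5):
    """Return the top-N (name, n_elements) pairs by size — quantizing
    these first usually buys the biggest memory win."""
    return sorted(named_numels, key=lambda kv: kv[1], reverse=True)[:top]

# usage with a PyTorch model (assumed):
#   largest_params((n, p.numel()) for n, p in model.named_parameters())
```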

Common pitfalls and how to avoid them

  • Forgetting warmup: first inference often includes JIT/graph building—exclude it from latency metrics.
  • Quantizing the tokenizer or embeddings without testing: small components can cause outsized accuracy loss.
  • Not validating edge cases: outlier prompts may expose numerical instability—add them to the calibration/validation set.
  • Assuming GPU quant always helps: some GPU kernels are optimized for fp16; always benchmark on target hardware.

Quick checklist to ship quantized models

  1. Choose quant path (bitsandbytes, PyTorch dynamic, ONNX static/dynamic).
  2. Run representative validation and record baseline metrics.
  3. Apply quantization and run the same tests (latency, p50/p95, memory, task metrics).
  4. If the accuracy drop is unacceptable, fall back to fp16/fp32 for specific layers or use static per-channel quantization with calibration.
  5. Automate the benchmark script in CI and store artifacts for auditing.

Final takeaways — what to do next

In 2026, quantization is a practical necessity: rising memory costs and denser model deployment demands mean you must extract more from less. Use bitsandbytes for the fastest GPU wins, PyTorch dynamic for cost-effective CPU inference, and ONNX when you need portability and the best static int8 accuracy. Always validate with representative data and automate benchmarks so quality and cost stay predictable.

Actionable next steps

  • Try the bitsandbytes snippet on a small HF model and measure the GPU memory difference.
  • Export one hot-path model to ONNX and run quantize_dynamic; compare disk and runtime size.
  • Automate the benchmarking protocol above in your CI and add per-release checks for accuracy and p95 latency.

Want the full toolkit? I maintain a small repo of benchmark scripts for HuggingFace, PyTorch quant, and ONNX quant that follow the patterns in this article—drop a line or subscribe for the scripts and CI examples.

Call-to-action

Start with a single model and one quant path today. Quantize, benchmark, and iterate—then scale your winning configuration across the fleet. If you want the exact benchmarking scripts I use in production, sign up for our developer kit or ask for the examples and I’ll send the repo and CI templates you can drop into your pipeline.
