/ɪnˈteɪt/
n. “small numbers, absolute certainty.”
INT8 is an 8-bit two's complement integer type ranging from -128 to +127, widely used for quantized neural network inference: FP32 weights and activations are scaled and rounded to 8-bit integers, and well-calibrated models typically retain accuracy close to their FP32 baseline. Post-training quantization (PTQ) or quantization-aware training (QAT) converts FP32 networks to INT8, cutting weight memory by 4x and raising throughput substantially on edge TPUs and similar accelerators, while zero-point offsets handle asymmetric activation ranges.
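As a rough illustration of post-training quantization, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization; the weight values and the helper name `quantize_symmetric_int8` are made up for this example, not taken from any particular framework:

```python
import numpy as np

def quantize_symmetric_int8(w_fp32: np.ndarray):
    """Symmetric per-tensor post-training quantization to INT8.

    The scale maps the largest-magnitude weight to +/-127;
    zero_point is 0 for symmetric weight quantization.
    """
    scale = np.abs(w_fp32).max() / 127.0
    w_int8 = np.clip(np.round(w_fp32 / scale), -128, 127).astype(np.int8)
    return w_int8, scale

# Hypothetical FP32 weights from a trained layer
w = np.array([-3.2, -0.7, 0.01, 1.5, 2.8], dtype=np.float32)
w_q, s = quantize_symmetric_int8(w)
w_deq = w_q.astype(np.float32) * s   # dequantize to inspect round-trip error
print(w_q, s, np.abs(w - w_deq).max())
```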
Key characteristics of INT8 include:
- Range: -128 to +127 (signed, two's complement encoding); 0 to 255 for the unsigned UINT8 variant.
- Quantization: q = round(x / scale) + zero_point; for symmetric weight quantization, scale = max|w| / 127 and zero_point = 0.
- Throughput: up to 4x MAC throughput versus FP32 on the same silicon (e.g., an NVIDIA A100 peaks at 624 TOPS INT8 on Tensor Cores versus 156 TFLOPS TF32 and 19.5 TFLOPS standard FP32).
- Dequantization: x ≈ scale × (q - zero_point), applied to activations before they feed an FP32 layer.
- Mixed Precision: INT8 multiplies accumulate into INT32 (or FP32) partial sums to prevent overflow, then results are rescaled back to INT8 or FP16/FP32 (see the sketch after this list).
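The accumulation point above can be sketched in a few lines of NumPy; the activation and weight values and the per-tensor scales below are illustrative assumptions:

```python
import numpy as np

# Hypothetical quantized activation and weight vectors (symmetric, zero_point = 0)
a_int8 = np.array([120, -45,  7, -128], dtype=np.int8)
w_int8 = np.array([-33, 101, 64,   12], dtype=np.int8)
a_scale, w_scale = 0.02, 0.025   # assumed per-tensor scales

# INT8 x INT8 products are summed in INT32 so partial sums cannot overflow
acc_int32 = np.dot(a_int8.astype(np.int32), w_int8.astype(np.int32))

# Rescale the integer accumulator back to a real-valued FP32 result
out_fp32 = acc_int32 * (a_scale * w_scale)
print(acc_int32, out_fp32)
```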
A conceptual example of INT8 quantization flow:
1. Analyze the FP32 conv layer: weights span [-3.2, +2.8] → scale ≈ 0.025, zero_point = 0.
2. Quantize: w_int8 = round(w_fp32 / 0.025) → values in [-128, +112].
3. Inference: INT8 dot products accumulate into INT32.
4. Requantize activations: act_int8 = round(act_fp32 / act_scale).
5. Dequantize for the next layer: act_fp32 = act_scale × (act_int8 - act_zero_pt).
6. Payoff: 624 TOPS INT8 versus 19.5 TFLOPS FP32 peak on an NVIDIA A100.
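The numbered flow above can be sketched end to end as below; the layer shape, random weight/activation values, and variable names are assumptions for illustration, not a specific framework API:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Analyze an FP32 layer: derive scales from observed ranges (symmetric, zero_point = 0)
w_fp32 = rng.uniform(-3.2, 2.8, size=(4, 8)).astype(np.float32)   # hypothetical weights
x_fp32 = rng.uniform(-1.0, 1.0, size=8).astype(np.float32)        # hypothetical input activations
w_scale = np.abs(w_fp32).max() / 127.0
x_scale = np.abs(x_fp32).max() / 127.0

# 2. Quantize weights and activations to INT8
w_q = np.clip(np.round(w_fp32 / w_scale), -128, 127).astype(np.int8)
x_q = np.clip(np.round(x_fp32 / x_scale), -128, 127).astype(np.int8)

# 3. Integer matrix-vector product with INT32 accumulation
acc = w_q.astype(np.int32) @ x_q.astype(np.int32)

# 4./5. Requantize the accumulator to INT8 for the next layer, then dequantize a view for FP32 consumers
y_real = acc * (w_scale * x_scale)                  # accumulator mapped back to real values
y_scale = np.abs(y_real).max() / 127.0
y_q = np.clip(np.round(y_real / y_scale), -128, 127).astype(np.int8)
y_fp32 = y_q.astype(np.float32) * y_scale

print(np.abs(y_fp32 - w_fp32 @ x_fp32).max())       # quantization error vs the FP32 reference
```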
Conceptually, INT8 is like compressing a high-resolution photo into a thumbnail: it discards fine precision that is largely imperceptible to the consumer (here, the network's accuracy), shrinking a 32 MB FP32 model to 8 MB for on-device mobile inference and trading a fraction of a percent of accuracy for substantial savings in memory and battery life.
In essence, INT8 powers edge AI from keyword-spotting RNNs to mobile vision models, executed through SIMD and Tensor Core dot-product instructions on-device, while datacenter accelerators mix INT8 with FP16/FP32 for high-throughput inference serving.