
# Uniform Quantization with Fine-Tuning

A uniform "fake" quantization method supports an arbitrary number of bits (>=2) which is used to represent weights and activations. The method performs differentiable sampling of the continuous signal (for example, activations or weights) during forward pass, simulating inference with integer arithmetic.

## Common Quantization Formula

Quantization is parametrized by the clamping range and the number of quantization levels. The sampling formula is the following:

$output = \frac{\left\lfloor (clamp(input; input\_low, input\_high) - input\_low) \cdot s \right\rceil}{s} + input\_low$

$clamp(input; input\_low, input\_high) = min(max(input, input\_low), input\_high)$

$s = \frac{levels - 1}{input\_high - input\_low}$

Here, $input\_low$ and $input\_high$ represent the quantization range, and $\left\lfloor \cdot \right\rceil$ denotes rounding to the nearest integer.
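As an illustration, here is a minimal NumPy sketch of this formula; the function name and the example values are ours, not part of NNCF:

```python
import numpy as np

def fake_quantize(x, input_low, input_high, levels):
    """Uniform 'fake' quantization: the output stays floating-point, but only
    `levels` distinct values inside [input_low, input_high] can be produced."""
    s = (levels - 1) / (input_high - input_low)       # quantization scale
    clamped = np.clip(x, input_low, input_high)       # clamp(input; input_low, input_high)
    return np.round((clamped - input_low) * s) / s + input_low

# 3-bit example: 8 levels over the range [-1, 1]
x = np.array([-1.2, -0.33, 0.0, 0.41, 0.97])
print(fake_quantize(x, input_low=-1.0, input_high=1.0, levels=8))
```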

## Symmetric Quantization

During training, we optimize the scale parameter that represents the range $[input\_low, input\_high]$ of the original signal using gradient descent:

$input\_low = scale \cdot \frac{level\_low}{level\_high}$

$input\_high = scale$

In the formula above, $level\_low$ and $level\_high$ represent the range of the discrete signal.

- For weights:

  $level\_low = -2^{bits-1} + 1$

  $level\_high = 2^{bits-1} - 1$

  $levels = 2^{bits} - 1$ (255 for 8 bits)

- For unsigned activations:

  $level\_low = 0$

  $level\_high = 2^{bits} - 1$

  $levels = 2^{bits}$ (256 for 8 bits)

- For signed activations:

  $level\_low = -2^{bits-1}$

  $level\_high = 2^{bits-1} - 1$

  $levels = 2^{bits}$ (256 for 8 bits)

For all the cases listed above, the common quantization formula is simplified after substitution of $input\_low$, $input\_high$, and $levels$:

$output = \left\lfloor clamp(input \cdot \frac{level\_high}{scale}, level\_low, level\_high) \right\rceil \cdot \frac{scale}{level\_high}$
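To see where the $\frac{level\_high}{scale}$ factor comes from, substitute the symmetric definitions of $input\_low$ and $input\_high$ into the scale expression and use $levels - 1 = level\_high - level\_low$:

$$
s = \frac{levels - 1}{input\_high - input\_low} = \frac{level\_high - level\_low}{scale - scale \cdot \frac{level\_low}{level\_high}} = \frac{level\_high}{scale}
$$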

Use the `num_init_steps` parameter from the `initializer` group to initialize the values of `scale` and to determine, from the statistics collected during the given number of steps, which activations should be signed or unsigned.

## Asymmetric Quantization

During training, we optimize the `input_low` and `input_range` parameters using gradient descent:

$input\_high = input\_low + input\_range$

$levels = 2^{bits}$ (256 for 8 bits)

$level\_low = 0$

$level\_high = 2^{bits} - 1$

For better accuracy, the floating-point zero should be within the quantization range and strictly mapped onto a quant (without rounding). Therefore, the following scheme is applied to the ranges of weights and activations before quantization:

${input\_low}' = min(input\_low, 0)$

${input\_high}' = max(input\_high, 0)$

$ZP = \left\lfloor \frac{-{input\_low}' \cdot (levels - 1)}{{input\_high}' - {input\_low}'} \right\rceil$

${input\_high}'' = \frac{ZP - levels + 1}{ZP} \cdot {input\_low}'$

${input\_low}'' = \frac{ZP}{ZP - levels + 1} \cdot {input\_high}'$

$$
\{input\_low, input\_high\} =
\begin{cases}
\{{input\_low}', {input\_high}'\}, & ZP \in \{0, levels - 1\} \\
\{{input\_low}', {input\_high}''\}, & {input\_high}'' - {input\_low}' > {input\_high}' - {input\_low}'' \\
\{{input\_low}'', {input\_high}'\}, & {input\_high}'' - {input\_low}' \le {input\_high}' - {input\_low}''
\end{cases}
$$

You can use the `num_init_steps` parameter from the `initializer` group to initialize the values of `input_low` and `input_range` from the statistics collected during the given number of steps.
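A sketch of this range-adjustment scheme in plain Python may make the selection rule easier to follow; the helper name and example values are illustrative:

```python
def adjust_asymmetric_range(input_low, input_high, levels=256):
    """Adjust (input_low, input_high) so that floating-point zero maps exactly
    onto an integer quant, following the scheme above."""
    low = min(input_low, 0.0)                          # input_low'
    high = max(input_high, 0.0)                        # input_high'
    zp = round(-low * (levels - 1) / (high - low))     # nearest-integer zero point
    if zp == 0 or zp == levels - 1:                    # zero already lands on a boundary quant
        return low, high
    high_ext = (zp - levels + 1) / zp * low            # input_high'': keep low, stretch high
    low_ext = zp / (zp - levels + 1) * high            # input_low'': keep high, stretch low
    # Keep the wider of the two candidates so the original range stays covered
    if high_ext - low > high - low_ext:
        return low, high_ext
    return low_ext, high

# Example: for the range [-1.0, 2.1], zero would map to 82.26 out of 255, so
# input_high is stretched slightly and zero then maps exactly onto quant 82.
print(adjust_asymmetric_range(-1.0, 2.1))
```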

## Quantization Implementation

In our implementation, we use a slightly transformed formula. It is equivalent, in terms of the order of floating-point operations, to the simplified symmetric formula and to the asymmetric one. The small differences are the addition of a small positive number eps to prevent division by zero, and taking the absolute value of the range, since it may become negative during the backward pass:

$output = \frac{\left\lfloor clamp((input - input\_low^{*}) \cdot s, level\_low, level\_high) \right\rceil}{s} + input\_low^{*}$

$s = \frac{level\_high}{|input\_range^{*}| + eps}$

For asymmetric quantization: $input\_low^{*} = input\_low$, $input\_range^{*} = input\_range$

For symmetric quantization: $input\_low^{*} = 0$, $input\_range^{*} = scale$
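The transformed formula can thus be expressed as a single code path for both modes. The NumPy sketch below is illustrative (the eps value is ours, not NNCF's):

```python
import numpy as np

EPS = 1e-16  # small positive constant to avoid division by zero; value is illustrative

def quantize_impl(x, input_low_star, input_range_star, level_low, level_high):
    """Transformed formula used as a common code path for both modes.
    Symmetric:  input_low* = 0,         input_range* = scale
    Asymmetric: input_low* = input_low, input_range* = input_range"""
    # abs() keeps the scale valid if the learned range becomes negative on backward
    s = level_high / (np.abs(input_range_star) + EPS)
    q = np.round(np.clip((x - input_low_star) * s, level_low, level_high))
    return q / s + input_low_star

x = np.array([-0.7, 0.1, 0.9])
# Symmetric 8-bit weights: scale = 0.5
print(quantize_impl(x, 0.0, 0.5, -127, 127))
# Asymmetric 8-bit: range [-0.2, 0.8]
print(quantize_impl(x, -0.2, 1.0, 0, 255))
```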

## Mixed Precision Quantization

Quantization to lower precisions (e.g. 6, 4, or 2 bits) is an efficient way to accelerate inference of neural networks. Although NNCF supports quantization with an arbitrary number of bits to represent weight and activation values, choosing an ultra-low bitwidth can noticeably affect the model's accuracy. A good trade-off between accuracy and performance is achieved by assigning different precisions to different layers. NNCF uses the HAWQ-v2 method to automatically choose the optimal mixed-precision configuration by taking into account the sensitivity of each layer, i.e. how much lower-bit quantization of each layer decreases the accuracy of the model. The most sensitive layers are kept at higher precision. The sensitivity of the i-th layer is calculated by multiplying the average Hessian trace with the L2 norm of the quantization perturbation:

$\overline{Tr}(H_{i}) \cdot \left\| Q(W_{i}) - W_{i} \right\|^2_2$

The sensitivities of the layers are summed to form a metric that is used to determine the specific bit-precision configuration. The optimal configuration is found by calculating this metric for all possible bitwidth settings and selecting the median one. To reduce the exponential search space, the following restriction is used: layers with a small value of average Hessian trace are quantized to lower bits, and vice versa.
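The hypothetical helper below restates this metric for one candidate configuration in PyTorch; it is not NNCF's internal API:

```python
import torch

def config_metric(avg_traces, weights, quantizers):
    """HAWQ metric of one bitwidth configuration: the sum over layers of the mean
    Hessian trace times the squared L2 norm of the quantization perturbation."""
    total = 0.0
    for trace, w, quantize in zip(avg_traces, weights, quantizers):
        perturbation = quantize(w) - w              # Q(W_i) - W_i
        total += trace * torch.sum(perturbation ** 2).item()
    return total
```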

The Hessian trace is estimated with the randomized Hutchinson algorithm. Given a Rademacher-distributed random vector $v$, the trace of a symmetric matrix $H$ is equal to the expectation of the quadratic form:

$Tr(H) = \mathbb{E}[v^T H v]$

The randomized algorithm approximates this expectation via Monte Carlo: sampling $v$ from its distribution, evaluating the quadratic term, and averaging:

$Tr(H) \approx \frac{1}{m}\sum_{i=1}^{m}[v_i^T H v_i]$

The quadratic term is evaluated by computing $Hv$, the product of the Hessian matrix with a given random vector $v$, without explicitly forming the Hessian operator. For the gradient of the loss with respect to the i-th block $g_i$ and a random vector $v$ that is independent of $W_i$, we have the equation:

$\frac{\partial(g_i^T v)}{\partial W_i} = H_i v$

where $H_i$ is the Hessian matrix of the loss with respect to $W_i$. Hence, $Hv$ can be computed with two backpropagation passes: the first with respect to the loss, and the second with respect to the product of the gradients and the random vector.
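The following PyTorch sketch illustrates this double-backpropagation trick; the function and its arguments are illustrative rather than NNCF's internal API:

```python
import torch

def estimate_hessian_trace(loss, weight, num_samples=200):
    """Hutchinson estimate of Tr(H) for the Hessian of `loss` w.r.t. `weight`."""
    # First backward pass: gradient w.r.t. the weight, keeping the graph for reuse
    grad = torch.autograd.grad(loss, weight, create_graph=True)[0]
    samples = []
    for _ in range(num_samples):
        # Rademacher-distributed random vector v with entries +1 or -1
        v = torch.randint_like(weight, 2) * 2.0 - 1.0
        # Second backward pass: d(g^T v)/dW = H v, without forming H explicitly
        hv = torch.autograd.grad(grad, weight, grad_outputs=v, retain_graph=True)[0]
        samples.append(torch.sum(v * hv))           # v^T H v
    return torch.stack(samples).mean()
```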

Automatic mixed-precision selection can be enabled by specifying `"type": "hawq"` in the `precision` group within the `initializer` section of the quantization algorithm. Manual mode is also available by explicitly setting the number of bits per layer through the `bitwidth_per_scope` parameter.

Quantization configuration file parameters:

```
{
    "algorithm": "quantization",
    "initializer": {
        "range": {
            "num_init_steps": 5, // Number of batches from the training dataset to consume as sample model inputs for purposes of setting initial minimum and maximum quantization ranges
            "type": "minmax" // Type of the initializer - determines which statistics gathered during initialization will be used to initialize the quantization ranges
        },
        "precision": {
            "type": "hawq", // Type of precision initialization - either "manual" or "hawq". With "manual", precisions are defined explicitly via "bitwidth_per_scope". With "hawq", these are determined automatically using the HAWQ algorithm.
            "bits": [4, 8], // A list of bitwidth to choose from when performing precision initialization.",
            "num_data_points": 200, // Number of data points to iteratively estimate Hessian trace, 200 by default.
            "iter_number": 200, // Maximum number of iterations of Hutchinson algorithm to estimate Hessian trace, 200 by default
            "tolerance": 1e-5, //  Minimum relative tolerance for stopping the Hutchinson algorithm. It's calculated  between mean average trace from previous iteration and current one. 1e-5 by default
            "bitwidth_per_scope": [ // Manual settings for the quantizer bitwidths. Scopes are used to identify the weight quantizers. The same number of bits is assigned to adjacent activation quantizers. By default bitwidth is taken from global quantization parameters from `weights` and `activations` sections above
                [
                    4,
                    "MobileNetV2/Sequential[features]/InvertedResidual[8]/Sequential[conv]/NNCFConv2d[0]/ModuleDict[pre_ops]/UpdateWeight[0]/AsymmetricQuantizer[op]"
                ], // A tuple of a bitwidth and a scope
                [
                    4,
                    "ModuleDict/AsymmetricQuantizer[MobileNetV2/Sequential[features]/InvertedResidual[15]/Sequential[conv]/ReLU6[5]/hardtanh_0]"
                ]
            ]
        }
    },
    "weights": { // Constraints to be applied to model weights quantization only.Overrides higher-level settings.
        "mode": "symmetric", // Mode of quantization
        "bits": 8, // Bitwidth to quantize to.
        "signed": true, // Whether to use signed or unsigned input/output values for quantization. If specified as unsigned and the input values during initialization have differing signs, will reset to performing signed quantization instead.
        "per_channel": false, // Whether to quantize inputs per channel (i.e. per 0-th dimension for weight quantization,and per 1-st dimension for activation quantization)

        // A list of model control flow graph node scopes to be ignored for this operation - functions as a 'blacklist'. Optional.
        "ignored_scopes": []

        // A list of model control flow graph node scopes to be considered for this operation - functions as a 'whitelist'. Optional.
        // "target_scopes": []
    },
    "activations": { // Constraints to be applied to model activations quantization only. Overrides higher-level settings.
        "mode": "symmetric", // Mode of quantization
        "bits": 4, // Bitwidth to quantize to.
        "signed": true, // Whether to use signed or unsigned input/output values for quantization. If specified as unsigned and the input values during initialization have differing signs, will reset to performing signed quantization instead.
        "per_channel": false, // Whether to quantize inputs per channel (i.e. per 0-th dimension for weight quantization,and per 1-st dimension for activation quantization)

        // A list of model control flow graph node scopes to be ignored for this operation - functions as a 'blacklist'. Optional.
        "ignored_scopes": []

        // A list of model control flow graph node scopes to be considered for this operation - functions as a 'whitelist'. Optional.
        // "target_scopes": []
    },
    "quantize_inputs": true, // Whether the model inputs should be immediately quantized prior to any other model operations."
    "quantizable_subgraph_patterns": [ // Each sub-list in this list will correspond to a sequence of operations in the model control flow graph that will have a quantizer appended at the end of the sequence
        [
            "cat",
            "batch_norm"
        ],
        [
            "h_swish"
        ]
    ],
    "scope_overrides": { // This option is used to specify overriding quantization constraints for a specific scope, e.g. in case you need to quantize a single operation differently than the rest of the model.
        "{re}.*InvertedResidual.*": {
            "mode": "symmetric", // Mode of quantization
            "bits": 4, // Bitwidth to quantize to.
            "signed": true, // Whether to use signed or unsigned input/output values for quantization. If specified as unsigned and the input values during initialization have differing signs, will reset to performing signed quantization instead.
            "per_channel": false // Whether to quantize inputs per channel (i.e. per 0-th dimension for weight quantization,and per 1-st dimension for activation quantization)
        }
    },

    // A list of model control flow graph node scopes to be ignored for this operation - functions as a 'blacklist'. Optional.
    "ignored_scopes": [],

    // A list of model control flow graph node scopes to be considered for this operation - functions as a 'whitelist'. Optional.
    // "target_scopes": [],

    // Determines how the additional quantization operations should be exported into the ONNX format. Set this to false for export to OpenVINO-supported FakeQuantize ONNX, or to true for export to ONNX standard QuantizeLinear-DequantizeLinear node pairs (8-bit quantization only in the latter case). Default: false
    "export_to_onnx_standard_ops": false
}
```
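For reference, a configuration like the one above is typically consumed through NNCF's Python API. The exact import paths and the surrounding config structure (e.g. the `"compression"` and `"input_info"` sections of a full NNCF config file) depend on the NNCF version, so treat this as a sketch with assumed names (`model`, `init_loader`, the config file name):

```python
from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

# `model` is an ordinary torch.nn.Module; `init_loader` is a DataLoader used for
# the range/precision initialization described above (both assumed to exist).
nncf_config = NNCFConfig.from_json("nncf_quantization_config.json")
nncf_config = register_default_init_args(nncf_config, init_loader)
compression_ctrl, quantized_model = create_compressed_model(model, nncf_config)
```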