Extend weight compression with INT8 symmetric scheme (openvinotoolkit#2288)

### Changes

Added `INT8_SYM` compression mode

### Reason for changes

`INT8_SYM` mode can provide better performance and is required for dynamic quantization

### Related tickets

124823

### Tests

Updated tests/openvino/native/quantization/test_weights_compression.py
l-bat committed Dec 7, 2023
1 parent a6e4928 commit 6d08f52
Showing 11 changed files with 379 additions and 84 deletions.
32 changes: 20 additions & 12 deletions docs/compression_algorithms/CompressWeights.md
@@ -8,22 +8,30 @@ The Weights Compression algorithm is aimed at compressing the weights of the mod

#### Supported modes

By default, weights are compressed to 8-bit integer data type - "INT8" mode.
By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
OpenVINO backend also supports 3 modes of mixed-precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is unsigned 4-bit integer and weights are quantized to it [symmetrically](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) with a fixed zero point equal to 8. In case of INT4_ASYM mode - also unsigned 4-bit integer, but weights are quantized to it [asymmetrically](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point.
All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
All embeddings and last linear layers are always compressed to 8-bit integer data type.
The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter. E.g. ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to 8-bit integer data type.
The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter. E.g. ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to 8-bit asymmetric integer data type.
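
For intuition, below is a rough NumPy sketch of the grouped symmetric 4-bit scheme described above. It is an illustration only, not NNCF's actual implementation: the scale formula and example shapes are assumptions, and the channel dimension is assumed to be divisible by the group size.

```python
import numpy as np

def quantize_int4_sym_grouped(weight, group_size=128):
    # Unsigned 4-bit levels [0, 15] with a fixed zero point of 8, i.e. signed levels [-8, 7].
    out_ch, in_ch = weight.shape
    groups = weight.reshape(out_ch, in_ch // group_size, group_size)
    # One scale per group of `group_size` weights along the channel dimension.
    scale = np.max(np.abs(groups), axis=-1, keepdims=True) / 7
    scale = np.maximum(scale, np.finfo(np.float32).eps)  # guard against all-zero groups
    quantized = np.clip(np.round(groups / scale) + 8, 0, 15).astype(np.uint8)
    return quantized, scale

weight = np.random.randn(256, 512).astype(np.float32)
quantized, scale = quantize_int4_sym_grouped(weight)
# Dequantization restores an approximation of the original weights: (q - 8) * scale.
restored = ((quantized.astype(np.float32) - 8) * scale).reshape(weight.shape)
```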

#### User guide

- Compress weights to 8-bit integer data type.
- Compress weights asymmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
compressed_model = compress_weights(model)
```

- Compress weights symmetrically to 4-bit integer data type with group size = 128, except embeddings and last linear layers - they are compressed to 8-bit integer data type.
- Compress weights symmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT8_SYM)
```

- Compress weights symmetrically to 4-bit integer data type with group size = 128, except embeddings and last linear layers - they are compressed asymmetrically to 8-bit integer data type.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM)
```

@@ -36,7 +44,7 @@
If the accuracy or perplexity is still not satisfactory, there are 2 more hyper-parameters to tune: `group_size` and `ratio`.
A lower group size and a lower ratio of 4-bit layers usually improve accuracy at the cost of inference speed.
Below is an example of how to compress the weights of 90% of layers to 4-bit integer asymmetrically with group size 64, and
the rest of the layers to 8-bit integer data type. The same parametrization is applicable for `INT4_SYM` mode.
the rest of the layers to 8-bit asymmetric integer data type. The same parametrization is applicable for `INT4_SYM` mode.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=64, ratio=0.9)
```

- `NF4` mode can be considered for improving accuracy, but currently models quantized to nf4 should not be faster than models
quantized to 8-bit integer. Here's an example of how to compress weights to nf4 data type with group size = 128.
quantized to 8-bit asymmetric integer. Here's an example of how to compress weights to nf4 data type with group size = 128.
Different `group_size` and `ratio` are also supported.
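
A minimal sketch of such a call, assuming the NF4 mode goes through the same `compress_weights` API as the INT4 examples above (passing `group_size=128` explicitly is an assumption to match the description):

```python
from nncf import compress_weights
from nncf import CompressWeightsMode

# model is an OpenVINO model, as in the examples above.
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4, group_size=128)
```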

@@ -79,7 +87,7 @@ Here is the perplexity and model size before and after weight compression for di
</tr>
<tr>
<td class="tg-0pky">databricks/dolly-v2-3b</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">5.07</td>
<td class="tg-0pky">0.05</td>
<td class="tg-0pky">2.6</td>
@@ -107,7 +115,7 @@
</tr>
<tr>
<td class="tg-0pky">facebook/opt-6.7b</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">4.27</td>
<td class="tg-0pky">0.01</td>
<td class="tg-0pky">6.2</td>
@@ -135,7 +143,7 @@
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-7b-chat-hf</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">3.29</td>
<td class="tg-0pky">0.01</td>
<td class="tg-0pky">6.3</td>
@@ -163,7 +171,7 @@
</tr>
<tr>
<td class="tg-0pky">togethercomputer/RedPajama-INCITE-7B-Instruct</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">4.17</td>
<td class="tg-0pky">0.02</td>
<td class="tg-0pky">6.4</td>
@@ -191,7 +199,7 @@
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-13b-chat-hf</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">int8_asym</td>
<td class="tg-0pky">2.91</td>
<td class="tg-0pky">0</td>
<td class="tg-0pky">12.1</td>
@@ -218,7 +226,7 @@
- The algorithm is supported for OpenVINO and PyTorch models.
- The compression applies in-place.
- The compressed model is not trainable.
- INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection are available for the OpenVINO backend only.
- INT8_SYM, INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection are available for the OpenVINO backend only.
- NF4 support is experimental - models quantized to nf4 should not be faster than models quantized to 8-bit integer.

#### Additional resources
14 changes: 11 additions & 3 deletions nncf/parameters.py
@@ -62,20 +62,28 @@ class DropType(Enum):
class CompressWeightsMode(Enum):
"""
Defines a mode for weight compression.
:param INT8: Stands for 8-bit integer quantization of all weights.
:param INT8_SYM: Stands for 8-bit integer symmetric quantization of all weights.
Weights are quantized symmetrically with a fixed zero point equal to 128.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization
:param INT8_ASYM: The same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
:param INT4_SYM: Stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
by default. All others are quantized either to 4-bit integer or to a backup precision, depending on
criteria and the given ratio.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization
:param INT4_ASYM: The same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
:param NF4: The same as INT4_SYM mode, but the primary precision is NF4 data type without zero point.
:param INT8: Mode is deprecated and will be removed in future releases. Please use `INT8_ASYM` instead.
"""

INT8 = "int8"
INT8_SYM = "int8_sym"
INT8_ASYM = "int8_asym"
INT4_SYM = "int4_sym"
INT4_ASYM = "int4_asym"
NF4 = "nf4"
INT8 = "int8" # Deprecated mode
9 changes: 6 additions & 3 deletions nncf/quantization/algorithms/weight_compression/algorithm.py
@@ -54,17 +54,20 @@ def __init__(
):
"""
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
Weights are quantized symmetrically with a fixed zero point equal to 128.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
by default. All others are quantized either to 4-bit integer or to a backup precision, depending on
criteria and the given ratio.
INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
NF4 is the same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
:param ratio: the ratio between baseline and backup precisions (e.g. 0.9 means 90% of layers quantized to NF4
and the rest to INT8).
and the rest to INT8_ASYM).
:param group_size: number of weights (e.g. 128) in the channel dimension
that share quantization parameters (scale). The value -1 means no grouping.
:param ignored_scope: An ignored scope that defines the list of model control
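
For completeness, a hedged sketch of how these constructor parameters surface through the public `compress_weights` entry point; the `IgnoredScope(names=...)` usage and the layer name are assumptions for illustration and are not taken from this diff:

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
from nncf import IgnoredScope

compressed_model = compress_weights(
    model,  # OpenVINO model, as in the documentation examples
    mode=CompressWeightsMode.INT4_SYM,
    ratio=0.9,       # ~90% of eligible layers go to 4-bit, the rest to the 8-bit backup precision
    group_size=128,  # 128 weights per channel share one scale
    ignored_scope=IgnoredScope(names=["lm_head"]),  # hypothetical node name excluded from compression
)
```
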
16 changes: 11 additions & 5 deletions nncf/quantization/algorithms/weight_compression/backend.py
@@ -47,10 +47,13 @@ def validate_params(mode: CompressWeightsMode, ignored_scope: Optional[IgnoredSc
parameters. Should be called on early algorithm steps to prevent execution of time-consuming operations.
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
Weights are quantized symmetrically with a fixed zero point equal to 128.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
by default. All others are quantized either to 4-bit integer or to a backup precision, depending on
criteria and the given ratio.
INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
@@ -77,17 +80,20 @@ def do_compression(
:param nodes_to_compress: List of nodes in the model's graph,
corresponding to the layers for weight compression.
:param mode: Defines a mode for weight compression.
INT8 stands for 8-bit integer quantization of all weights.
INT8_SYM stands for 8-bit integer symmetric quantization of all weights.
Weights are quantized symmetrically with a fixed zero point equal to 128.
INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
INT4_SYM stands for a mixed-precision weights quantization with 4-bit integer as a primary precision.
Weights are quantized to a primary precision symmetrically with a fixed zero point equal to 8.
All embeddings and the last layer are always compressed to a backup precision, which is 8-bit integer,
All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM,
by default. All others are quantized either to 4-bit integer or to a backup precision, depending on
criteria and the given ratio.
INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to a primary precision asymmetrically
with a typical non-fixed zero point.
NF4 is the same as INT4_SYM mode, but primary precision is NF4 data type without zero point.
:param ratio: The ratio between baseline and backup precisions (e.g. 0.9 means 90% of layers quantized to NF4
and the rest to INT8).
and the rest to INT8_ASYM).
:param group_size: Number of weights (e.g. 128) in the channel dimension
that share quantization parameters (scale). The value -1 means no grouping.
:return: A resulting model with compressed weights.
