Quantized ONNX Model Still Has Float32 Input/Output Tensors #21138
Comments
This is the QDQ representation of the ONNX model. To perform integer-only arithmetic, you have to quantize your model to the QOperator representation. For more detail, see https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html.

def quantize_static(
    model_input: Union[str, Path, onnx.ModelProto],
    model_output: Union[str, Path],
    calibration_data_reader: CalibrationDataReader,
    quant_format=QuantFormat.QDQ,  # Change this to QuantFormat.QOperator
    op_types_to_quantize=None,
    per_channel=False,
    reduce_range=False,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    nodes_to_quantize=None,
    nodes_to_exclude=None,
    use_external_data_format=False,
    calibrate_method=CalibrationMethod.MinMax,
    extra_options=None,
):
    ... |
@hoangtv2000 Thank you for the comment!
I tried your suggestion to perform quantization using QOperator. However, the quantized model's input and output remain float32. Do you have any idea regarding this? |
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details. |
I'm facing the same problem, have you solved that? |
Describe the issue
After quantization, the output ONNX model has faster inference and a smaller file size, but why are the input and output tensors still float32?
I expected them to be uint8, since the output ONNX file is around one fourth of the original size. I also tried onnxruntime 1.12.0, 1.13.1, and 1.18.0, and in every case the input and output tensors remain float32.
To reproduce
onnxruntime: 1.14.1
torch: 2.3.0
torchvision: 0.18.0
Follow the official example and the results can be reproduced.
https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/notebooks/imagenet_v2/mobilenet.ipynb
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.14.1
ONNX Runtime API
Python
Architecture
X86
Execution Provider
Default CPU
Execution Provider Library Version
No response