
Inference regression DML 1.10.1->1.11 and higher #483

Closed
divideconcept opened this issue Jul 15, 2023 · 8 comments
divideconcept commented Jul 15, 2023

I noticed an inference regression between DML 1.10.1 (and earlier) and DML 1.11 (and later), which causes the inference results to be completely off with some models. I'm not sure which node exactly causes the issue, but here's a complete repro, step by step:

  1. Download model3.onnx.
  2. Install ONNX Runtime for Python with DirectML (1.12) support: pip install onnxruntime-directml.
  3. Launch Python and run the following block, which shows the results for CPU inference (ground truth):
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model3.onnx", providers=['CPUExecutionProvider'])
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)
outputs = session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])
```

This will show the following output:

[[[[ 2.9624936e-01  4.4031662e-01  4.9660692e-01 ...  4.6317300e-01
     4.7017282e-01  4.8335472e-01]
   [ 1.3661800e-01  3.4651098e-01  4.5258337e-01 ...  1.9679888e-01
     1.7567760e-01  1.5068966e-01]
   [ 8.6605281e-02  2.3559587e-01  2.5511262e-01 ...  1.8966110e-01
     1.6827986e-01  1.8628305e-01]
   ...
   [ 2.0687398e-01  1.7956746e-01  1.2259285e-01 ...  4.0398946e-01
     2.9999584e-01  2.3229304e-01]
   [ 2.1055967e-01  2.8771651e-01  1.5513927e-01 ...  2.2960059e-01
     9.8949686e-02  1.3984089e-01]
   [ 3.5148939e-01  5.4730177e-01  4.9234924e-01 ...  5.0844795e-01
     1.0927881e-01  8.2973397e-01]]

  [[ 2.2016920e-03  2.0630960e-03 -5.2188965e-04 ...  2.2417619e-03
    -1.7067563e-05 -1.7323773e-03]
   [ 1.7242560e-03  1.4197731e-03 -3.6929462e-03 ...  1.9717988e-02
     5.6085708e-03 -1.5628221e-04]
   [ 1.1211790e-03  2.6711330e-03 -1.9482106e-03 ...  2.7553817e-02
     2.2175895e-02 -1.0396730e-03]
   ...
   [ 3.3056617e-04  4.6207048e-03  2.2537552e-03 ... -1.4263104e-02
    -6.7468719e-03 -1.0978156e-03]
   [ 4.8008582e-04  3.7944347e-03  2.9231098e-03 ... -1.3265806e-02
     5.2854721e-03 -2.5849973e-03]
   [-1.1614685e+00  1.9571548e-02  9.1706263e-03 ... -3.7664244e-01
    -3.9552280e-01 -3.7509048e-01]]]]

In practice, those values are the expected output.

  4. Now run the following block, which shows the results for DML (1.12) inference:
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model3.onnx", providers=['DmlExecutionProvider'])
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)
outputs = session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])
```

This will show the following output:

[[[[ 1.2931919   1.2133068   0.92959875 ...  0.7315138   0.790178
     0.7206028 ]
   [ 0.72345215  0.75116944  0.73152304 ...  0.22733253  0.23081687
     0.25107992]
   [ 0.2179307   0.2573861   0.21388349 ...  0.39436132  0.4016724
     0.40784192]
   ...
   [ 0.08995652  0.19412872  0.17644493 ... -0.49595234 -0.5870603
    -0.6045674 ]
   [-0.7070265  -0.48113438 -0.59548837 ... -0.2593547  -0.13349809
    -0.4097974 ]
   [-0.30988705 -0.24486321 -0.37750056 ... -0.09476887 -0.10245383
     1.6842717 ]]

  [[ 1.908289    1.825213    1.5819263  ...  0.5750079   0.5997568
     0.5894706 ]
   [ 0.5852979   0.5780628   0.59584457 ...  0.08814019  0.10299458
     0.10440087]
   [ 0.11373571  0.09142485  0.08090715 ... -0.07507971 -0.05238473
    -0.07834692]
   ...
   [ 0.52536386  0.3986077   0.40048116 ... -2.4888163  -2.5107079
    -2.5601194 ]
   [-2.683926   -2.500065   -2.5673628  ... -2.6227055  -2.2424307
    -3.2082942 ]
   [-3.0153034  -2.7083983  -3.0444543  ... -2.6381683  -2.937557
    -0.8014389 ]]]]

Notice the results are completely different (and in practice, no usable values are produced).
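Rather than eyeballing the printed tensors, the two runs can be compared numerically. Here's a minimal sketch; `outputs_match` and `max_abs_diff` are just illustrative helpers, not part of ONNX Runtime, and the commented usage assumes the same model3.onnx and input feed as above:

```python
import numpy as np

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two arrays."""
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))

def outputs_match(a, b, atol=1e-4):
    """True if two provider outputs agree within the given tolerance."""
    return bool(np.allclose(a, b, atol=atol))

# Usage sketch against the two runs above (requires onnxruntime-directml):
# cpu = ort.InferenceSession("model3.onnx", providers=['CPUExecutionProvider']).run(None, feed)[0]
# dml = ort.InferenceSession("model3.onnx", providers=['DmlExecutionProvider']).run(None, feed)[0]
# print(outputs_match(cpu, dml), max_abs_diff(cpu, dml))
```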

  5. Exit Python, open the Lib\site-packages\onnxruntime\capi subfolder of your Python installation, and rename DirectML.dll to DirectML.bak.
  6. Download DirectML 1.10.1 (the latest known version without the regression), extract DirectML.dll from the NuGet package (with 7zip or NanaZip, for instance, or by opening DirectML 1.10.1 in NuGet Package Explorer and downloading DirectML.dll), and place it next to DirectML.bak.
  7. Launch Python again and rerun the DML code:
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model3.onnx", providers=['DmlExecutionProvider'])
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)
outputs = session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])
```

This will show the following output:

[[[[ 2.96219379e-01  4.40306246e-01  4.96606797e-01 ...  4.63159323e-01
     4.70169514e-01  4.83349711e-01]
   [ 1.36600077e-01  3.46489936e-01  4.52565104e-01 ...  1.96821779e-01
     1.75700665e-01  1.50690824e-01]
   [ 8.65869373e-02  2.35554934e-01  2.55067706e-01 ...  1.89643666e-01
     1.68269396e-01  1.86276674e-01]
   ...
   [ 2.06858933e-01  1.79518670e-01  1.22561574e-01 ...  4.04009968e-01
     2.99959868e-01  2.32160479e-01]
   [ 2.10545927e-01  2.87681311e-01  1.55097321e-01 ...  2.59109288e-01
     1.26482189e-01  1.55689940e-01]
   [ 3.51458400e-01  5.47263682e-01  4.92354274e-01 ...  5.09262800e-01
     1.08978793e-01  8.29321682e-01]]

  [[ 2.20179441e-03  2.06306414e-03 -5.22566261e-04 ...  2.24243640e-03
    -1.67239050e-05 -1.73229771e-03]
   [ 1.72400963e-03  1.42002921e-03 -3.69322696e-03 ...  1.97228286e-02
     5.61172329e-03 -1.56276918e-04]
   [ 1.12099492e-03  2.67143198e-03 -1.94829446e-03 ...  2.75527705e-02
     2.21788641e-02 -1.04018056e-03]
   ...
   [ 3.30423441e-04  4.62117651e-03  2.25326000e-03 ... -1.42651405e-02
    -6.74304273e-03 -1.09608530e-03]
   [ 4.79940762e-04  3.79496464e-03  2.92291958e-03 ...  9.46796732e-04
     1.33081703e-02 -2.03865883e-03]
   [-1.16166353e+00  1.95743497e-02  9.17264353e-03 ... -3.78102601e-01
    -3.96833807e-01 -3.75986814e-01]]]]

Notice the results are very close, almost identical to the CPU output, and in practice the values correspond to what is expected.

With this model, correct values are produced with DirectML 1.8, 1.9, 1.10, and 1.10.1, and bad values are produced with DirectML 1.11 and 1.12. Note that the final tensor shape is correct; only the values are wrong.

My DirectML device is a GeForce RTX 3090.

divideconcept (Author) commented Jul 15, 2023

After further investigation, I've identified the node where values are miscalculated.

Opening model3.onnx in netron.app, around the end of the model there's a Transpose node:
[screenshot: the Transpose node shown in Netron]

This Transpose node has the following attribute: perm: 1,0,2.
And that's where the results from CPU/DirectML 1.10.1 (or earlier) inference and DirectML 1.11 (or later) inference diverge.

Here's the code to repro (this requires the onnx package to be installed; also make sure DirectML.dll 1.11 or 1.12 is active):

```python
import onnx
import onnxruntime as ort
import numpy as np

# add all node outputs of the model as graph outputs
session = ort.InferenceSession("model3.onnx")
outputs = [x.name for x in session.get_outputs()]
model = onnx.load("model3.onnx")
for node in model.graph.node:
    for output in node.output:
        if output not in outputs:
            model.graph.output.extend([onnx.ValueInfoProto(name=output)])
# prepare inputs
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)
# run inference on CPU
session = ort.InferenceSession(model.SerializeToString(), providers=['CPUExecutionProvider'])
outputs = session.run(["1603"], {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0].shape, outputs[0])
# run inference with DML
session = ort.InferenceSession(model.SerializeToString(), providers=['DmlExecutionProvider'])
outputs = session.run(["1603"], {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0].shape, outputs[0])
```
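The same per-node trick can be turned into a small utility that reports the first graph output (in graph order) where the two providers disagree. A sketch under the same assumptions; the function itself only needs NumPy, and the commented lines show how the output dicts would be obtained with ONNX Runtime:

```python
import numpy as np

def first_divergence(names, cpu_outs, dml_outs, atol=1e-4):
    """Return the name of the first output where the CPU and DML
    results disagree beyond atol, or None if all of them match."""
    for name in names:
        if not np.allclose(cpu_outs[name], dml_outs[name], atol=atol):
            return name
    return None

# Usage sketch (after extending model.graph.output as above):
# names = [o.name for o in cpu_session.get_outputs()]
# cpu_outs = dict(zip(names, cpu_session.run(names, feed)))
# dml_outs = dict(zip(names, dml_session.run(names, feed)))
# print(first_divergence(names, cpu_outs, dml_outs))
```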

Output from CPU:

(257, 202, 384) [[[-2.24244222e-02 -1.81161650e-02  1.04738630e-01 ... -2.62294384e-03
   -5.24457321e-02 -8.63133818e-02]
  [ 4.34092991e-02 -1.91172038e-03  1.83144584e-01 ... -1.39332749e-02
   -2.91350372e-02 -9.12476424e-03]
  [ 1.75927863e-01 -3.56033891e-02  7.00066835e-02 ... -4.62932549e-02
   -2.47223936e-02  1.75747147e-03]
  ...
  [-6.26550429e-03 -5.27066961e-02  7.54785836e-02 ... -1.69345154e-03
   -2.18201317e-02  5.16798394e-03]
  [ 9.01989988e-05 -1.04649691e-03 -3.37222181e-02 ... -1.02277445e-04
   -1.39540777e-01  3.56663688e-04]
  [ 1.04166304e-06  7.25851278e-04 -4.34977338e-02 ... -1.32037560e-03
   -1.24461785e-01 -2.93780504e-05]]

 [[-2.19303407e-02 -2.03029327e-02  8.90306607e-02 ... -2.52002873e-03
   -5.03902100e-02 -9.85104367e-02]
  [ 4.52997573e-02 -4.39030398e-03  1.53681174e-01 ... -3.83538716e-02
   -3.50315198e-02 -4.82412241e-03]
  [ 1.83614999e-01 -4.65713479e-02  5.76112531e-02 ... -2.95043159e-02
   -2.42901165e-02  2.83370763e-02]
  ...
  [-1.99852079e-01  2.50023929e-03  6.05125464e-02 ... -7.08830729e-03
   -2.81588733e-02 -3.04497313e-03]
  [ 3.59401945e-03  1.02978829e-05 -4.89478633e-02 ... -1.61632473e-04
   -1.64531782e-01  1.38141841e-05]
  [ 2.19431549e-05  9.33293253e-03 -5.97366728e-02 ... -3.52595677e-03
   -1.37281522e-01 -3.99660552e-03]]

 [[-2.07985602e-02 -1.83043443e-02  7.67841786e-02 ... -2.12177704e-03
   -5.34841903e-02 -1.09880999e-01]
  [ 6.78566024e-02 -6.06867205e-03  1.08080104e-01 ... -6.17316738e-02
   -3.18331569e-02  6.95146620e-03]
  [ 1.86760351e-01 -4.21980396e-02  3.38125788e-02 ... -3.79635952e-02
   -2.70707775e-02  2.43895222e-02]
  ...
  [-7.53476471e-02 -8.28410790e-04  2.53375527e-02 ... -1.54487870e-03
   -2.48027314e-02 -5.08362474e-03]
  [ 4.07979684e-03  2.41013331e-05 -4.57204618e-02 ... -1.08828162e-05
   -1.78377748e-01 -2.01429557e-05]
  [ 2.72689631e-05  1.45370024e-03 -1.13494202e-01 ...  6.66758139e-03
   -1.26200467e-01 -1.68970018e-03]]

 ...

 [[-3.46809463e-03 -5.06885955e-03  5.44518270e-02 ... -2.32961698e-04
   -3.23662795e-02 -4.65176739e-02]
  [ 1.25713602e-01 -7.99832924e-04  6.82542399e-02 ... -1.07500739e-02
   -3.50132622e-02 -4.65782505e-04]
  [ 5.00956237e-01 -3.48149613e-02  8.15351084e-02 ... -8.87240767e-02
   -3.10592409e-02 -2.77274661e-03]
  ...
  [-2.32463349e-02  8.44998285e-05  1.38259038e-01 ... -2.73333775e-04
   -2.28560623e-02 -1.55033249e-05]
   -1.23722441e-01 -2.60449979e-05]]

 [[-3.60833388e-03 -3.31441197e-03  3.89608592e-02 ... -2.54207203e-04
   -2.99281497e-02 -4.30685245e-02]
  [ 1.27522528e-01 -6.38367143e-04  4.20251973e-02 ... -1.07803959e-02
   -3.88976373e-02 -4.55969333e-04]
  [ 4.80060756e-01 -3.22379060e-02  6.08608425e-02 ... -8.77213329e-02
   -3.32298428e-02 -2.54026917e-03]
  ...
  [ 1.57818496e-01 -5.59169974e-04  1.00797571e-01 ...  8.11835518e-04
   -1.58398803e-02  1.24038830e-02]
  [ 1.41852035e-03 -4.16653165e-06 -2.36224048e-02 ...  9.30286405e-05
   -8.67753774e-02  2.50721257e-03]
  [ 1.24601347e-05  2.43027398e-06  6.66262060e-02 ...  3.42243846e-04
   -1.85495743e-03 -2.45646333e-05]]

 [[ 1.05812935e-04  1.44522940e-03  5.93114123e-02 ... -3.53662654e-06
   -3.12188845e-02 -1.85762458e-02]
  [ 5.15261769e-01  2.65544280e-03  5.88782430e-02 ... -2.02777869e-06
    6.93067670e-01 -2.88051833e-05]
  [ 8.76773731e-04  2.29678684e-04  2.79930890e-01 ...  1.00064881e-06
    1.59177646e-01 -7.48374872e-03]
  ...
  [-4.68075825e-07  8.12394774e-06  1.00143395e-01 ...  5.46613244e-08
    9.41725254e-01 -4.87099271e-07]
  [ 9.36946366e-04  2.75643231e-07  2.74980843e-01 ...  8.97245727e-08
    9.84934926e-01  1.19077333e-06]
  [-5.14064450e-04  1.32889836e-05  1.29662991e-01 ...  4.32575931e-08
    9.92943466e-01  4.11752978e-08]]]

Output from DirectML:

(257, 202, 384) [[[-2.24237777e-02 -1.81159042e-02  1.04737826e-01 ... -2.62334966e-03
   -5.24464287e-02 -8.63163546e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[-2.19296645e-02 -2.03028359e-02  8.90300572e-02 ... -2.52034538e-03
   -5.03907092e-02 -9.85143781e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[-2.07979754e-02 -1.83043629e-02  7.67838955e-02 ... -2.12199963e-03
   -5.34842797e-02 -1.09886073e-01]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 ...

 [[-3.46806739e-03 -5.06879063e-03  5.44530377e-02 ... -2.32965860e-04
   -3.23669985e-02 -4.65166308e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[-3.60834715e-03 -3.31466878e-03  3.89620177e-02 ... -2.54210085e-04
   -2.99288891e-02 -4.30673063e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[ 1.06376880e-04  1.44572917e-03  5.93050644e-02 ... -3.53716496e-06
   -3.12189274e-02 -1.85762532e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]]

If you check the input of this Transpose (node 1602), both CPU and DirectML values match.

So for some reason, Transpose with perm 1,0,2 doesn't work with DirectML 1.11 and higher. The output shape is correct, but the values are all messed up.
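For reference, the expected semantics of Transpose with perm 1,0,2 (what the CPU EP and DirectML 1.10.1 and earlier compute) match NumPy's np.transpose: axis i of the output is axis perm[i] of the input, so the first two axes are swapped. A quick sanity check on a small tensor:

```python
import numpy as np

# A small stand-in for the node's input; the real tensor goes
# (202, 257, 384) -> (257, 202, 384) under perm = 1,0,2.
x = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
y = np.transpose(x, (1, 0, 2))  # swap the first two axes

assert y.shape == (3, 2, 4)
# Every element must satisfy y[j, i, k] == x[i, j, k]
assert y[0, 1, 3] == x[1, 0, 3]
```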

@divideconcept divideconcept changed the title Inference regression DML 10.1->11 and higher Inference regression DML 1.10.1->1.11 and higher Jul 16, 2023
jstoecker (Contributor) commented:

Thanks for reporting this, we'll try to take a look soon. @martinb35

fdwr (Contributor) commented Aug 8, 2023

@smk2007 has a pending fix for an upcoming patch release in a few weeks. ⏳

adtsai (Contributor) commented Aug 25, 2023

@divideconcept this issue has been fixed in DirectML 1.12.1. This fix will also be incorporated into the upcoming onnxruntime-directml 1.16 which is expected to release in the coming weeks.

fdwr (Contributor) commented Sep 18, 2023

@divideconcept I'm curious if DirectML.dll 1.12.1 solved it? (appears the ORT 1.16 release is still delayed...)

divideconcept (Author) commented:

@fdwr @adtsai I confirm DirectML 1.12.1 fixed the issue!

fdwr (Contributor) commented Sep 18, 2023

Thanks Sheil for fixing and Robin for verifying. Closing ✅.

@fdwr fdwr closed this as completed Sep 18, 2023