
Inference regression DML 1.10.1->1.11 and higher #483

Closed
divideconcept opened this issue Jul 15, 2023 · 8 comments
divideconcept commented Jul 15, 2023

I noticed an inference regression between DML 1.10.1 (and earlier) and DML 1.11 (and later), which causes the inference results to be completely off with some models. I'm not sure which node exactly causes the issue, but here's a complete repro, step by step:

  1. Download model3.onnx.
  2. Install ONNX Runtime for Python with DirectML (1.12) support: pip install onnxruntime-directml.
  3. Launch Python and run the following block, which shows the results for CPU inference (ground truth):
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model3.onnx", providers=['CPUExecutionProvider'])
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)
outputs = session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])
```

This will show the following output:

[[[[ 2.9624936e-01  4.4031662e-01  4.9660692e-01 ...  4.6317300e-01
     4.7017282e-01  4.8335472e-01]
   [ 1.3661800e-01  3.4651098e-01  4.5258337e-01 ...  1.9679888e-01
     1.7567760e-01  1.5068966e-01]
   [ 8.6605281e-02  2.3559587e-01  2.5511262e-01 ...  1.8966110e-01
     1.6827986e-01  1.8628305e-01]
   ...
   [ 2.0687398e-01  1.7956746e-01  1.2259285e-01 ...  4.0398946e-01
     2.9999584e-01  2.3229304e-01]
   [ 2.1055967e-01  2.8771651e-01  1.5513927e-01 ...  2.2960059e-01
     9.8949686e-02  1.3984089e-01]
   [ 3.5148939e-01  5.4730177e-01  4.9234924e-01 ...  5.0844795e-01
     1.0927881e-01  8.2973397e-01]]

  [[ 2.2016920e-03  2.0630960e-03 -5.2188965e-04 ...  2.2417619e-03
    -1.7067563e-05 -1.7323773e-03]
   [ 1.7242560e-03  1.4197731e-03 -3.6929462e-03 ...  1.9717988e-02
     5.6085708e-03 -1.5628221e-04]
   [ 1.1211790e-03  2.6711330e-03 -1.9482106e-03 ...  2.7553817e-02
     2.2175895e-02 -1.0396730e-03]
   ...
   [ 3.3056617e-04  4.6207048e-03  2.2537552e-03 ... -1.4263104e-02
    -6.7468719e-03 -1.0978156e-03]
   [ 4.8008582e-04  3.7944347e-03  2.9231098e-03 ... -1.3265806e-02
     5.2854721e-03 -2.5849973e-03]
   [-1.1614685e+00  1.9571548e-02  9.1706263e-03 ... -3.7664244e-01
    -3.9552280e-01 -3.7509048e-01]]]]

In practice, those values are the expected output.

  4. Now run the following block, which shows the results for DML (1.12) inference:
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model3.onnx", providers=['DmlExecutionProvider'])
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)
outputs = session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])
```

This will show the following output:

[[[[ 1.2931919   1.2133068   0.92959875 ...  0.7315138   0.790178
     0.7206028 ]
   [ 0.72345215  0.75116944  0.73152304 ...  0.22733253  0.23081687
     0.25107992]
   [ 0.2179307   0.2573861   0.21388349 ...  0.39436132  0.4016724
     0.40784192]
   ...
   [ 0.08995652  0.19412872  0.17644493 ... -0.49595234 -0.5870603
    -0.6045674 ]
   [-0.7070265  -0.48113438 -0.59548837 ... -0.2593547  -0.13349809
    -0.4097974 ]
   [-0.30988705 -0.24486321 -0.37750056 ... -0.09476887 -0.10245383
     1.6842717 ]]

  [[ 1.908289    1.825213    1.5819263  ...  0.5750079   0.5997568
     0.5894706 ]
   [ 0.5852979   0.5780628   0.59584457 ...  0.08814019  0.10299458
     0.10440087]
   [ 0.11373571  0.09142485  0.08090715 ... -0.07507971 -0.05238473
    -0.07834692]
   ...
   [ 0.52536386  0.3986077   0.40048116 ... -2.4888163  -2.5107079
    -2.5601194 ]
   [-2.683926   -2.500065   -2.5673628  ... -2.6227055  -2.2424307
    -3.2082942 ]
   [-3.0153034  -2.7083983  -3.0444543  ... -2.6381683  -2.937557
    -0.8014389 ]]]]

Notice the results are completely different (and in practice, no usable values are produced).
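Rather than eyeballing the printed tensors, the two runs can be compared numerically. Here's a minimal sketch; `outputs_match` and `max_abs_diff` are just illustrative helpers, not part of ONNX Runtime, and the commented usage assumes the same model3.onnx and input feed as above:

```python
import numpy as np

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two arrays."""
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))

def outputs_match(a, b, atol=1e-4):
    """True if two provider outputs agree within the given tolerance."""
    return bool(np.allclose(a, b, atol=atol))

# Usage sketch against the two runs above (requires onnxruntime-directml):
# cpu = ort.InferenceSession("model3.onnx", providers=['CPUExecutionProvider']).run(None, feed)[0]
# dml = ort.InferenceSession("model3.onnx", providers=['DmlExecutionProvider']).run(None, feed)[0]
# print(outputs_match(cpu, dml), max_abs_diff(cpu, dml))
```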

  5. Exit Python, open the Lib\site-packages\onnxruntime\capi subfolder of your Python installation, and rename DirectML.dll to DirectML.bak.
  6. Download DirectML 1.10.1 (the latest known version without the regression), extract DirectML.dll from the NuGet package (with 7zip or NanaZip, for instance, or by opening DirectML 1.10.1 in NuGet Package Explorer and downloading DirectML.dll), and place it next to DirectML.bak.
  7. Launch Python again and rerun the DML code:
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model3.onnx", providers=['DmlExecutionProvider'])
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)
outputs = session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])
```

This will show the following output:

[[[[ 2.96219379e-01  4.40306246e-01  4.96606797e-01 ...  4.63159323e-01
     4.70169514e-01  4.83349711e-01]
   [ 1.36600077e-01  3.46489936e-01  4.52565104e-01 ...  1.96821779e-01
     1.75700665e-01  1.50690824e-01]
   [ 8.65869373e-02  2.35554934e-01  2.55067706e-01 ...  1.89643666e-01
     1.68269396e-01  1.86276674e-01]
   ...
   [ 2.06858933e-01  1.79518670e-01  1.22561574e-01 ...  4.04009968e-01
     2.99959868e-01  2.32160479e-01]
   [ 2.10545927e-01  2.87681311e-01  1.55097321e-01 ...  2.59109288e-01
     1.26482189e-01  1.55689940e-01]
   [ 3.51458400e-01  5.47263682e-01  4.92354274e-01 ...  5.09262800e-01
     1.08978793e-01  8.29321682e-01]]

  [[ 2.20179441e-03  2.06306414e-03 -5.22566261e-04 ...  2.24243640e-03
    -1.67239050e-05 -1.73229771e-03]
   [ 1.72400963e-03  1.42002921e-03 -3.69322696e-03 ...  1.97228286e-02
     5.61172329e-03 -1.56276918e-04]
   [ 1.12099492e-03  2.67143198e-03 -1.94829446e-03 ...  2.75527705e-02
     2.21788641e-02 -1.04018056e-03]
   ...
   [ 3.30423441e-04  4.62117651e-03  2.25326000e-03 ... -1.42651405e-02
    -6.74304273e-03 -1.09608530e-03]
   [ 4.79940762e-04  3.79496464e-03  2.92291958e-03 ...  9.46796732e-04
     1.33081703e-02 -2.03865883e-03]
   [-1.16166353e+00  1.95743497e-02  9.17264353e-03 ... -3.78102601e-01
    -3.96833807e-01 -3.75986814e-01]]]]

Notice the results are very close, almost identical to the CPU output, and in practice the values correspond to what is expected.

With this model, correct values are produced with DirectML 1.8, 1.9, 1.10, and 1.10.1, and bad values are produced with DirectML 1.11 and 1.12. Note that the final tensor shape is correct; only the values are wrong.

My DirectML device is a GeForce RTX 3090.

divideconcept (Author) commented Jul 15, 2023

After further investigation, I've identified the node where values are miscalculated.

Opening model3.onnx in netron.app, around the end of the model there's a Transpose node:
[screenshot: the Transpose node shown in Netron]

This Transpose node has the following attribute: perm: 1,0,2.
And that's where the results from CPU/DirectML 1.10.1 (or earlier) inference and DirectML 1.11 (or later) inference diverge.

Here's the code to repro (this requires the onnx package to be installed; also make sure DirectML.dll 1.11 or 1.12 is active):

```python
import onnx
import onnxruntime as ort
import numpy as np

# add all node outputs of the model as graph outputs
session = ort.InferenceSession("model3.onnx")
outputs = [x.name for x in session.get_outputs()]
model = onnx.load("model3.onnx")
for node in model.graph.node:
    for output in node.output:
        if output not in outputs:
            model.graph.output.extend([onnx.ValueInfoProto(name=output)])
# prepare inputs
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)
# run inference on CPU
session = ort.InferenceSession(model.SerializeToString(), providers=['CPUExecutionProvider'])
outputs = session.run(["1603"], {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0].shape, outputs[0])
# run inference with DML
session = ort.InferenceSession(model.SerializeToString(), providers=['DmlExecutionProvider'])
outputs = session.run(["1603"], {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0].shape, outputs[0])
```
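The same per-node trick can be turned into a small utility that reports the first graph output (in graph order) where the two providers disagree. A sketch under the same assumptions; the function itself only needs NumPy, and the commented lines show how the output dicts would be obtained with ONNX Runtime:

```python
import numpy as np

def first_divergence(names, cpu_outs, dml_outs, atol=1e-4):
    """Return the name of the first output where the CPU and DML
    results disagree beyond atol, or None if all of them match."""
    for name in names:
        if not np.allclose(cpu_outs[name], dml_outs[name], atol=atol):
            return name
    return None

# Usage sketch (after extending model.graph.output as above):
# names = [o.name for o in cpu_session.get_outputs()]
# cpu_outs = dict(zip(names, cpu_session.run(names, feed)))
# dml_outs = dict(zip(names, dml_session.run(names, feed)))
# print(first_divergence(names, cpu_outs, dml_outs))
```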

Output from CPU:

(257, 202, 384) [[[-2.24244222e-02 -1.81161650e-02  1.04738630e-01 ... -2.62294384e-03
   -5.24457321e-02 -8.63133818e-02]
  [ 4.34092991e-02 -1.91172038e-03  1.83144584e-01 ... -1.39332749e-02
   -2.91350372e-02 -9.12476424e-03]
  [ 1.75927863e-01 -3.56033891e-02  7.00066835e-02 ... -4.62932549e-02
   -2.47223936e-02  1.75747147e-03]
  ...
  [-6.26550429e-03 -5.27066961e-02  7.54785836e-02 ... -1.69345154e-03
   -2.18201317e-02  5.16798394e-03]
  [ 9.01989988e-05 -1.04649691e-03 -3.37222181e-02 ... -1.02277445e-04
   -1.39540777e-01  3.56663688e-04]
  [ 1.04166304e-06  7.25851278e-04 -4.34977338e-02 ... -1.32037560e-03
   -1.24461785e-01 -2.93780504e-05]]

 [[-2.19303407e-02 -2.03029327e-02  8.90306607e-02 ... -2.52002873e-03
   -5.03902100e-02 -9.85104367e-02]
  [ 4.52997573e-02 -4.39030398e-03  1.53681174e-01 ... -3.83538716e-02
   -3.50315198e-02 -4.82412241e-03]
  [ 1.83614999e-01 -4.65713479e-02  5.76112531e-02 ... -2.95043159e-02
   -2.42901165e-02  2.83370763e-02]
  ...
  [-1.99852079e-01  2.50023929e-03  6.05125464e-02 ... -7.08830729e-03
   -2.81588733e-02 -3.04497313e-03]
  [ 3.59401945e-03  1.02978829e-05 -4.89478633e-02 ... -1.61632473e-04
   -1.64531782e-01  1.38141841e-05]
  [ 2.19431549e-05  9.33293253e-03 -5.97366728e-02 ... -3.52595677e-03
   -1.37281522e-01 -3.99660552e-03]]

 [[-2.07985602e-02 -1.83043443e-02  7.67841786e-02 ... -2.12177704e-03
   -5.34841903e-02 -1.09880999e-01]
  [ 6.78566024e-02 -6.06867205e-03  1.08080104e-01 ... -6.17316738e-02
   -3.18331569e-02  6.95146620e-03]
  [ 1.86760351e-01 -4.21980396e-02  3.38125788e-02 ... -3.79635952e-02
   -2.70707775e-02  2.43895222e-02]
  ...
  [-7.53476471e-02 -8.28410790e-04  2.53375527e-02 ... -1.54487870e-03
   -2.48027314e-02 -5.08362474e-03]
  [ 4.07979684e-03  2.41013331e-05 -4.57204618e-02 ... -1.08828162e-05
   -1.78377748e-01 -2.01429557e-05]
  [ 2.72689631e-05  1.45370024e-03 -1.13494202e-01 ...  6.66758139e-03
   -1.26200467e-01 -1.68970018e-03]]

 ...

 [[-3.46809463e-03 -5.06885955e-03  5.44518270e-02 ... -2.32961698e-04
   -3.23662795e-02 -4.65176739e-02]
  [ 1.25713602e-01 -7.99832924e-04  6.82542399e-02 ... -1.07500739e-02
   -3.50132622e-02 -4.65782505e-04]
  [ 5.00956237e-01 -3.48149613e-02  8.15351084e-02 ... -8.87240767e-02
   -3.10592409e-02 -2.77274661e-03]
  ...
  [-2.32463349e-02  8.44998285e-05  1.38259038e-01 ... -2.73333775e-04
   -2.28560623e-02 -1.55033249e-05]
   -1.23722441e-01 -2.60449979e-05]]

 [[-3.60833388e-03 -3.31441197e-03  3.89608592e-02 ... -2.54207203e-04
   -2.99281497e-02 -4.30685245e-02]
  [ 1.27522528e-01 -6.38367143e-04  4.20251973e-02 ... -1.07803959e-02
   -3.88976373e-02 -4.55969333e-04]
  [ 4.80060756e-01 -3.22379060e-02  6.08608425e-02 ... -8.77213329e-02
   -3.32298428e-02 -2.54026917e-03]
  ...
  [ 1.57818496e-01 -5.59169974e-04  1.00797571e-01 ...  8.11835518e-04
   -1.58398803e-02  1.24038830e-02]
  [ 1.41852035e-03 -4.16653165e-06 -2.36224048e-02 ...  9.30286405e-05
   -8.67753774e-02  2.50721257e-03]
  [ 1.24601347e-05  2.43027398e-06  6.66262060e-02 ...  3.42243846e-04
   -1.85495743e-03 -2.45646333e-05]]

 [[ 1.05812935e-04  1.44522940e-03  5.93114123e-02 ... -3.53662654e-06
   -3.12188845e-02 -1.85762458e-02]
  [ 5.15261769e-01  2.65544280e-03  5.88782430e-02 ... -2.02777869e-06
    6.93067670e-01 -2.88051833e-05]
  [ 8.76773731e-04  2.29678684e-04  2.79930890e-01 ...  1.00064881e-06
    1.59177646e-01 -7.48374872e-03]
  ...
  [-4.68075825e-07  8.12394774e-06  1.00143395e-01 ...  5.46613244e-08
    9.41725254e-01 -4.87099271e-07]
  [ 9.36946366e-04  2.75643231e-07  2.74980843e-01 ...  8.97245727e-08
    9.84934926e-01  1.19077333e-06]
  [-5.14064450e-04  1.32889836e-05  1.29662991e-01 ...  4.32575931e-08
    9.92943466e-01  4.11752978e-08]]]

Output from DirectML:

(257, 202, 384) [[[-2.24237777e-02 -1.81159042e-02  1.04737826e-01 ... -2.62334966e-03
   -5.24464287e-02 -8.63163546e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[-2.19296645e-02 -2.03028359e-02  8.90300572e-02 ... -2.52034538e-03
   -5.03907092e-02 -9.85143781e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[-2.07979754e-02 -1.83043629e-02  7.67838955e-02 ... -2.12199963e-03
   -5.34842797e-02 -1.09886073e-01]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 ...

 [[-3.46806739e-03 -5.06879063e-03  5.44530377e-02 ... -2.32965860e-04
   -3.23669985e-02 -4.65166308e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[-3.60834715e-03 -3.31466878e-03  3.89620177e-02 ... -2.54210085e-04
   -2.99288891e-02 -4.30673063e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[ 1.06376880e-04  1.44572917e-03  5.93050644e-02 ... -3.53716496e-06
   -3.12189274e-02 -1.85762532e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]]

If you check the input of this Transpose (node 1602), both CPU and DirectML values match.

So for some reason, Transpose with perm 1,0,2 doesn't work with DirectML 1.11 and higher. The output shape is correct, but the values are all messed up.
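For reference, the expected semantics of Transpose with perm 1,0,2 (what the CPU EP and DirectML 1.10.1 and earlier compute) match NumPy's np.transpose: axis i of the output is axis perm[i] of the input, so the first two axes are swapped. A quick sanity check on a small tensor:

```python
import numpy as np

# A small stand-in for the node's input; the real tensor goes
# (202, 257, 384) -> (257, 202, 384) under perm = 1,0,2.
x = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
y = np.transpose(x, (1, 0, 2))  # swap the first two axes

assert y.shape == (3, 2, 4)
# Every element must satisfy y[j, i, k] == x[i, j, k]
assert y[0, 1, 3] == x[1, 0, 3]
```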

@divideconcept divideconcept changed the title Inference regression DML 10.1->11 and higher Inference regression DML 1.10.1->1.11 and higher Jul 16, 2023
jstoecker (Contributor) commented:

Thanks for reporting this, we'll try to take a look soon. @martinb35

fdwr (Contributor) commented Aug 8, 2023

@smk2007 has a pending fix for an upcoming patch release in a few weeks. ⏳

adtsai (Contributor) commented Aug 25, 2023

@divideconcept this issue has been fixed in DirectML 1.12.1. This fix will also be incorporated into the upcoming onnxruntime-directml 1.16 which is expected to release in the coming weeks.

fdwr (Contributor) commented Sep 18, 2023

@divideconcept I'm curious if DirectML.dll 1.12.1 solved it? (appears the ORT 1.16 release is still delayed...)

divideconcept (Author) commented:

@fdwr @adtsai I confirm DirectML 1.12.1 fixed the issue!

fdwr (Contributor) commented Sep 18, 2023

Thanks Sheil for fixing and Robin for verifying. Closing ✅.

@fdwr fdwr closed this as completed Sep 18, 2023