Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Closes #2027) fix for ArrayMixin._effective_shape bug #2685

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

arporter
Copy link
Member

@arporter arporter commented Aug 7, 2024

We had a bug which would cause a crash if we had an array access like a(b) where b is actually an array.

In testing this, I also realised that the ArrayAssignment2LoopsTrans transformation would incorrectly convert something like:

c(:) = a(b)

to

do idx = 1, SIZE(c,1)
   c(idx) = a(b)
end do

This should in fact be:

do idx = 1, SIZE(c,1)
  c(idx) = a(b(idx))
end do

This also means we have a bug in examples/nemo/scripts/utils.py because it should have converted a(b) to a(b(:)).

@arporter arporter marked this pull request as draft August 7, 2024 10:40
@arporter arporter self-assigned this Aug 7, 2024
@arporter arporter added NG-ARCH Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH PSyIR Core PSyIR functionality bug labels Aug 7, 2024
Copy link

codecov bot commented Aug 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.85%. Comparing base (6aad1d4) to head (eb4f8c8).
Report is 10 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2685      +/-   ##
==========================================
- Coverage   99.86%   99.85%   -0.02%     
==========================================
  Files         354      354              
  Lines       49112    49123      +11     
==========================================
+ Hits        49048    49052       +4     
- Misses         64       71       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@arporter
Copy link
Member Author

arporter commented Aug 7, 2024

The good news: the integration tests pass.
Bad news:
image
i.e. the performance of NEMO with OMP offload and OMP CPU threading has taken a big hit.

@arporter
Copy link
Member Author

arporter commented Aug 7, 2024

I've diffed the generated OMP code with that from master and the only differences are in a comment where we don't sort the order of 'impure' routine names. Perhaps the machine just happened to be busy? Oh, there are a couple of differences in some routine calls:

diff psycloned-openmp_cpu-master/icbdyn.f90 psycloned-openmp_cpu/icbdyn.f90
108c108
<       pt%lon = icb_utl_bilin_x(glamt,pt%xi,pt%yj)
---
>       pt%lon = icb_utl_bilin_x(glamt(:,:),pt%xi,pt%yj)
diff psycloned-openmp_cpu-master/icbrst.f90 psycloned-openmp_cpu/icbrst.f90
341c341
<         nret = nf90_put_var(ncid,nsiceid,griddata,nstrt3,nlngth3)
---
>         nret = nf90_put_var(ncid,nsiceid,griddata,nstrt3(:),nlngth3(:))

but they look reassuring (as in, something has actually changed) and shouldn't affect performance at all.

@arporter
Copy link
Member Author

arporter commented Aug 7, 2024

I re-ran the NEMO tests and got:
image
so it may just be machine noise. I'll check the generated OMP-offload code against that from master too.

@arporter
Copy link
Member Author

arporter commented Aug 8, 2024

The only diffs between master and this branch in the generated OMP-offoad version of NEMO4 are:

$ diff psycloned-openmp_gpu{,-2027}
diff psycloned-openmp_gpu/icbdyn.f90 psycloned-openmp_gpu-2027/icbdyn.f90
111c111
<       pt%lon = icb_utl_bilin_x(glamt,pt%xi,pt%yj)
---
>       pt%lon = icb_utl_bilin_x(glamt(:,:),pt%xi,pt%yj)
diff psycloned-openmp_gpu/icbrst.f90 psycloned-openmp_gpu-2027/icbrst.f90
365c365
<         nret = nf90_put_var(ncid,nsiceid,griddata,nstrt3,nlngth3)
---
>         nret = nf90_put_var(ncid,nsiceid,griddata,nstrt3(:),nlngth3(:))

Therefore I declare this ready for review. One for @sergisiso, @AidanChalk or @hiker.

@arporter arporter marked this pull request as ready for review August 8, 2024 08:16
@sergisiso
Copy link
Collaborator

I brought the PR to master in order to run the integration tests again (it needed to use updated modules). With an empty system this PR shows no performance degradation 👍

Copy link
Collaborator

@sergisiso sergisiso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @arporter, I suggested cases that may still break the logic of this functionality, but if fixing them is complicated feel free to defer them to follow up PRs.

Comment on lines +249 to +257
if hasattr(reference, "indices"):
# Look at array-index expressions too.
for exprn in reference.indices:
if (isinstance(exprn, Reference) and
isinstance(exprn.symbol, DataSymbol)):
try:
Reference2ArrayRangeTrans().apply(exprn)
except TransformationError:
pass
Copy link
Collaborator

@sergisiso sergisiso Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we instead remove the stop_type=Reference) above?

This may still fail to recurse in cases where the top reference does not have an indices, e.g.:
my_struct%my_field%my_array_field(another_array)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless, this was not causing a bug, right? It is just an improvement?

(The ArrayAssignment2LoopsTrans is/was already calling the same recursive Reference2ArrayRangeTrans so it should have been expanded there anyway)

Comment on lines +699 to +701
shape = routine.children[child_idx].lhs._get_effective_shape()
assert len(shape) == 1
assert "SIZE(a)" in shape[0].debug_string()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you test if we do the right thing if we don't know the type of the symbols:
b(scalarval, arrayval) = 1

And I see this method is not overriden by ArrayMember nor ArrayOfStructuresMixing. So we could also add some tests with these constructs here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't but we do now. Added tests.

if num_of_ranges > 0:
if not found:
count = len([x for x in accessor.indices if isinstance(x, Range)])
if count:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the name changes, but for clarity I suggest still doing "if count > 0", even if technically it is the same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# array indices must be known (because otherwise they themselves
# might be arrays, e.g.
# a(:) = b(c) where `c` can be a scalar or an array.
if isinstance(reference, ArrayMixin):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, do we still have a problem if the ArrayMixin is not in the top-level reference? e.g.
a(:) = b%field(c)
(maybe it should walk for inner arraymixins?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've extended the test in arrayassignment2loops_trans_test for this situation and it works!

Comment on lines 379 to 380
# If the number of ranges in this access is not the same as
# on the LHS then we may or may not have a scalar.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add: "To disambiguate we need to know that the Reference2ArrayRangeTrans in the apply will provide the remaining ranges, and for this it will need to know the type of each index expression"? (if this is the right reasoning here)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I've extended the comment - see what you think.

Comment on lines 382 to 383
if not isinstance(exprn, Reference):
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that's right, for example in the new "test_apply_indirect_indexing" if in
ishtSi(5:8,jf) = ishtSi( iwewe,jf )
if iwewe was not a Reference and instead a codeblock:
ishtSi(5:8,jf) = ishtSi( (/ jpwe,jpea,jpwe,jpea /) , jf )
it should not validate.

Similarly, if it was a function call that is not elemental but we don't know if it returns a scalar or an array.

By contrast, I think we will probably do the right thing for expressions, e.g.:
ishtSi(5:8,jf) = ishtSi( iwewe + 1, jf )
as expr.datatype with bubble up the UnresolvedType if any part of the expression is unresolved.

Please add tests for all these.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contrary to what you expected, we already refused to validate your first two examples. However, for the third one we haven't yet implemented support for finding the datatype when there are expressions in the array indices (#1799). I've added tests for all of these along with an associated TODO for the last one.

@sergisiso
Copy link
Collaborator

I am finding this same issue in lbclnk.f90 of NEMOv5 in my two open PRs, merging this branch into them fixes it.

@arporter
Copy link
Member Author

Unfortunately, my tightening-up of checks on the types of expressions appearing in array-indices now means that the WHERE handling is refusing some code that is/was OK, e.g.:

  WHERE( picefr(:,:) > 1.e-10 )
    zevap_ice(:,:,1) = snow(:,3,:) * frcv(jpr_ievp)%z3(:,:,1) / picefr(:,:)

Here, we don't know the type of jpr_ievp (because it is imported) and therefore we refuse to find the shape of frcv(jpr_ievp)%z3(:,:,1). However, we can see that the whole expression already contains two ranges (:) and therefore we can conclude that jpr_ievp must be a scalar. This is going to be hard to deal with and also will conflict with @LonelyCat124 's work in #2687. I therefore don't really want to touch it until that PR is on (unless I can find a simple fix).

@sergisiso
Copy link
Collaborator

@arporter The WHERE branch has been merged, so this PR can probably continue now. Also note that a "lbclnk.f90", # TODO #2685: effective shape bug was added to examples/nemo/scripts/omp_cpu_trans.py as this was affecting NEMOv5. We want to make sure this PR resolves it and we can delete the exclusion.

@sergisiso
Copy link
Collaborator

Unfortunately, my tightening-up of checks on the types of expressions appearing in array-indices now means that the WHERE handling is refusing some code that is/was OK
Here, we don't know the type of jpr_ievp (because it is imported) and therefore we refuse to find the shape of

By the way, @LonelyCat124 also tightened the WHEREs with imported symbols and I believe he had to bring "jpr_ievp" and other symbols as locals in some tests. We also seen in the integration tests that this did not affect NEMO performance. So, your issue may be resolved now.

@arporter
Copy link
Member Author

arporter commented Oct 1, 2024

@arporter The WHERE branch has been merged, so this PR can probably continue now. Also note that a "lbclnk.f90", # TODO #2685: effective shape bug was added to examples/nemo/scripts/omp_cpu_trans.py as this was affecting NEMOv5. We want to make sure this PR resolves it and we can delete the exclusion.

In what way did the bug manifest with this file? I've removed this exclusion and re-run the integration tests which were fine apart from the race condition affecting the OMP CPU results.

@sergisiso
Copy link
Collaborator

It was a pyclone exception while processing the file. I knew this branch resolved it because I tried merging it even before the first review and I saw it was fixed here, that's why I added the TODO. It should be good now.

@arporter
Copy link
Member Author

arporter commented Oct 2, 2024

I have some indirect coverage changes that I need to address.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug NG-ARCH Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH PSyIR Core PSyIR functionality reviewed with actions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants