v4.0.x: REF6976 Silent failure of OMPI over OFI with large messages sizes #7005

Merged
merged 1 commit into from
Sep 24, 2019

@mwheinz commented Sep 23, 2019

INTERNAL: STL-59403

The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.

This patch adds this missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.

(cherry-picked from commit 3aca4af)
Change-Id: Ie50445e5edfb0f30916de0836db0edc64ecf7c60
Signed-off-by: Michael Heinz <michael.william.heinz@intel.com>
Reviewed-by: Adam Goldman <adam.goldman@intel.com>
Reviewed-by: Brendan Cunningham <brendan.cunningham@intel.com>
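
For readers without the diff at hand, the kind of check described above can be sketched roughly as follows. This is an illustration only, not the code from commit 3aca4af: `sketch_ofi_module_t`, `sketch_ofi_send`, and the `SKETCH_*` return codes are hypothetical stand-ins, and the sketch assumes the limit was already copied from the provider's `fi_info` (libfabric reports it as `ep_attr->max_msg_size`) when the module was set up.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the relevant part of mca_ofi_module_t. */
typedef struct {
    /* Assumed to be filled in at init time from the provider's fi_info. */
    size_t max_msg_size;
} sketch_ofi_module_t;

/* Hypothetical return codes; the real MTL uses Open MPI's own constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_TOO_LARGE = -1 };

/* Reject a send that exceeds the provider limit instead of letting it
 * fail silently further down the stack. */
int sketch_ofi_send(const sketch_ofi_module_t *module,
                    const void *buf, size_t length)
{
    (void)buf; /* the payload itself is not inspected in this sketch */

    if (length > module->max_msg_size) {
        fprintf(stderr,
                "message of %zu bytes exceeds provider limit of %zu bytes\n",
                length, module->max_msg_size);
        return SKETCH_ERR_TOO_LARGE;
    }

    /* A real send routine would hand the buffer to libfabric here. */
    return SKETCH_SUCCESS;
}
```

With a module whose `max_msg_size` is 1024, a 1024-byte send passes the check and a 1025-byte send returns the error code. Note that this only turns a silent failure into a visible one; it does nothing to make larger messages work.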

@mwheinz changed the title from "REF6976 Silent failure of OMPI over OFI with large messages sizes" to "v4.0.x: REF6976 Silent failure of OMPI over OFI with large messages sizes" Sep 23, 2019
@jsquyres added this to the v4.0.2 milestone Sep 23, 2019
@bwbarrett
Member

I don't think this is the right fix; we need to discuss more (see #7004).

@gpaulsen
Member

@jsquyres added WIP-DNM until we could discuss at today's web-ex.

We were in agreement that the above PR is not "complete", as it does nothing to support larger message sizes (and MPI should be able to support all message sizes). It was noted that this PR does improve the current situation by failing with an error message rather than silently.

@hppritcha and I (as release managers for v4.0.x) have agreed to merge this for now; if a better solution comes along, we will consider it at that time.
