
Open MPI appears to ignore the max_msg_size and related fields reported by OFI. #6976

Closed
hppritcha opened this issue Sep 13, 2019 · 14 comments

@hppritcha
Member

hppritcha commented Sep 13, 2019

This issue tracks a discussion on the users mailing list:

https://www.mail-archive.com/users@lists.open-mpi.org//msg33397.html

The test case works with the ob1 PML, fails with a PSM2 error when using the PSM2 MTL, and fails silently when using the OFI MTL (most likely with the PSM2 provider).
Test case:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

long failed_offset = 0;

size_t chunk_size = 1 << 16;
size_t nchunks = (1 << 16) + 1;

int main(int argc, char * argv[])
{
    if (argc >= 2) chunk_size = atol(argv[1]);
    if (argc >= 3) nchunks = atol(argv[2]);

    MPI_Init(&argc, &argv);
    /*
     * This function returns:
     *  0 on success.
     *  a non-zero MPI Error code if MPI_Allgather returned one.
     *  -1 if no MPI Error code was returned, but the result of Allgather
     *  was wrong.
     *  -2 if memory allocation failed.
     *
     * (note that the MPI document guarantees that MPI error codes are
     * positive integers)
     */

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int err;

    char * check_text;
    int rc = asprintf(&check_text, "MPI_Allgather, %d nodes, 0x%zx chunks of 0x%zx bytes, total %d * 0x%zx bytes", size, nchunks, chunk_size, size, chunk_size * nchunks);
    if (rc < 0) abort();

    if (!rank) printf("%s: ...\n", check_text);

    MPI_Datatype mpi_ft;
    MPI_Type_contiguous(chunk_size, MPI_BYTE, &mpi_ft);
    MPI_Type_commit(&mpi_ft);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    void * data = malloc(nchunks * size * chunk_size);
    int alloc_ok = data != NULL;
    MPI_Allreduce(MPI_IN_PLACE, &alloc_ok, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    if (alloc_ok) {
        /* zero the whole buffer, then mark our own contribution with 0x42 */
        memset(data, 0, nchunks * size * chunk_size);
        memset(((char*)data) + nchunks * chunk_size * rank, 0x42, nchunks * chunk_size);
        err = MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                data, nchunks,
                mpi_ft, MPI_COMM_WORLD);
        if (err == 0) {
            void * p = memchr(data, 0, nchunks * size * chunk_size);
            if (p != NULL) {
                /* We found a zero, we shouldn't ! */
                err = -1;
                failed_offset = ((char*)p)-(char*)data;
            }
        }
    } else {
        err = -2;
    }
    if (data) free(data);
    MPI_Type_free(&mpi_ft);

    if (!rank) {
        printf("%s: %s\n", check_text, err == 0 ? "ok" : "NOK");
    }
    if (err == -2) {
        puts("Could not allocate memory buffer");
    } else if (err != 0) {
        int someone_has_minusone = (err == -1);
        MPI_Allreduce(MPI_IN_PLACE, &someone_has_minusone, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
        if (someone_has_minusone) {
            long * offsets = malloc(size * sizeof(long));
            offsets[rank] = failed_offset;
            MPI_Gather(&failed_offset, 1, MPI_LONG,
                    offsets, 1, MPI_LONG, 0, MPI_COMM_WORLD);
            if (!rank) {
                for(int i = 0 ; i < size ; i++) {
                    printf("node %d failed_offset = 0x%lx\n", i, offsets[i]);
                }
            }
            free(offsets);
        }

        if (!rank) {
            if (err > 0) { /* return an MPI Error if we've got one. */
                /* we often get MPI_ERR_OTHER... mostly useless */
                char error[1024];
                int errorlen = sizeof(error);
                MPI_Error_string(err, error, &errorlen);
                printf("MPI error returned:\n%s\n", error);
            }
        }
    }
    free(check_text);
    MPI_Finalize();
}
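
For reference, with the default parameters each rank's Allgather contribution is chunk_size * nchunks = 0x10000 * 0x10001 = 4295032832 bytes, i.e. just past 2^32, which is exactly the message size the provider errors later in this thread complain about. A quick standalone check of that arithmetic (not part of the test case):

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t chunk_size = 1 << 16;        /* 65536 bytes per chunk */
    uint64_t nchunks = (1 << 16) + 1;     /* 65537 chunks per rank */
    uint64_t per_rank = chunk_size * nchunks;

    /* 65536 * 65537 = 4295032832 = 0x100010000 */
    printf("per-rank contribution: %" PRIu64 " (0x%" PRIx64 ") bytes\n",
           per_rank, per_rank);
    printf("exceeds UINT32_MAX? %s\n", per_rank > UINT32_MAX ? "yes" : "no");
    return 0;
}
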
@emmanuelthome

[...] fails silently when using the OFI MTL (most likely with the PSM2 provider).

Yes, output_401_mtl_ofi.txt contains:

[node1.localdomain:09250] ../../../../../openmpi-4.0.1/ompi/mca/mtl/ofi/mtl_ofi_component.c:347: mtl:ofi:prov: psm2

@hppritcha
Member Author

This is likely a PSM2 / OFI PSM2 provider issue. It works with the OB1 PML for me, and with the CM PML if I use the gni provider. These runs were done on the NERSC Cori system.

hpp@nid02314:~/issue_6976>which mpirun 
/global/common/software/m3169/openmpi/4.0.1/gnu/bin/mpirun
hpp@nid02314:~/issue_6976>mpirun -np 2 ./test
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: ...
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: ok
hpp@nid02314:~/issue_6976>export OMPI_MCA_pml=cm
hpp@nid02314:~/issue_6976>mpirun -np 2 ./test
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: ...
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: ok

@mwheinz

mwheinz commented Sep 16, 2019

On it.

@mwheinz

mwheinz commented Sep 18, 2019

The actual issue is in the OFI transport, so I opened an issue there (ofiwg/libfabric#5287), but the proposed patch doesn't produce the expected result. I posted my current patch and asked for advice.

@mwheinz mwheinz changed the title silent failure for large allgather Open MPI appears to ignore the max_msg_size and related fields reported by OFI. Sep 19, 2019
@mwheinz

mwheinz commented Sep 19, 2019

Okay, after investigating further, it now appears that the problem is unrelated to the PSM2 transport and may affect OFI in general. OFI reports to the client application the largest message size the selected transport can support, but I haven't found any code in OMPI that uses that information. (I could certainly be wrong, but that's how it looks right now.)
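
For context, the limit in question is the endpoint attribute that libfabric returns with each fi_info entry from fi_getinfo(). A minimal sketch of the kind of guard that appears to be missing on the OMPI side (the function and variable names below are illustrative, not the actual Open MPI code):

#include <stdio.h>
#include <rdma/fabric.h>

/* Illustrative only: remember the provider's advertised limit at init time. */
static size_t cached_max_msg_size;

static void cache_provider_limit(const struct fi_info *prov)
{
    /* prov is the fi_info entry selected during component init */
    cached_max_msg_size = prov->ep_attr->max_msg_size;
}

/* ...and reject (or eventually fragment) any single send that exceeds it. */
static int check_send_length(size_t length)
{
    if (length > cached_max_msg_size) {
        fprintf(stderr,
                "Message size %zu bigger than supported by selected transport. Max = %zu\n",
                length, cached_max_msg_size);
        return -1;   /* the MTL would map this to an MPI error class */
    }
    return 0;
}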

@mwheinz

mwheinz commented Sep 23, 2019

Sample output:

Verbs Run: Mon Sep 23 13:58:20 EDT 2019 NP=4, chunk_size=, nchunks=

hdsmpriv02.hd.intel.com: is alive
MPI_Allgather, 4 nodes, 0x10001 chunks of 0x10000 bytes, total 4 * 0x100010000 bytes: ...
Message size 4295032832 bigger than supported by selected transport. Max = 2147483648
Message size 4295032832 bigger than supported by selected transport. Max = 2147483648
Message size 4295032832 bigger than supported by selected transport. Max = 2147483648
Message size 4295032832 bigger than supported by selected transport. Max = 2147483648
MPI_Allgather, 4 nodes, 0x10001 chunks of 0x10000 bytes, total 4 * 0x100010000 bytes: NOK
MPI error returned:
MPI_ERR_OTHER: known error not in list

PSM2 Run: Mon Sep 23 13:58:27 EDT 2019 NP=4, chunk_size=, nchunks=

hdsmpriv02.hd.intel.com: is alive
MPI_Allgather, 4 nodes, 0x10001 chunks of 0x10000 bytes, total 4 * 0x100010000 bytes: ...
Message size 4295032832 bigger than supported by selected transport. Max = 4294967295
Message size 4295032832 bigger than supported by selected transport. Max = 4294967295
Message size 4295032832 bigger than supported by selected transport. Max = 4294967295
Message size 4295032832 bigger than supported by selected transport. Max = 4294967295
MPI_Allgather, 4 nodes, 0x10001 chunks of 0x10000 bytes, total 4 * 0x100010000 bytes: NOK
MPI error returned:
MPI_ERR_OTHER: known error not in list
[RHEL7.5 hdsmpriv01 20190923_1358 mpi_apps]# ./STL-59403.sh 65536 32768

@hppritcha
Member Author

I checked the UCX PML with this test and it passes.

@hppritcha
Member Author

Per discussions at the 9/24/19 devel call, we decided to implement a short-term fix for master and the release branches: #7003, #7004, #7005, #7006

Longer term, we need to either improve OFI providers' support for messages longer than the max_msg_size a given provider reports, or somehow select only providers that report at least a certain minimum max_msg_size. The former will require some method of fragmenting messages that are too long to be sent by the selected OFI provider.
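
For the fragmentation option, the core loop would look roughly like the sketch below; the hard part in practice is tracking per-fragment completions and preserving MPI matching semantics, which the sketch ignores. provider_send() is a stand-in for the provider-level send, not a real libfabric function:

#include <stddef.h>

/* Stand-in for the provider-level send; a real MTL would post fi_tsend()
 * or similar and track the completion of each fragment. */
static int provider_send(const char *buf, size_t len)
{
    (void) buf;
    (void) len;
    return 0;
}

/* Split one logical message into fragments no larger than max_msg_size. */
static int fragmented_send(const char *buf, size_t total, size_t max_msg_size)
{
    size_t offset = 0;
    while (offset < total) {
        size_t len = total - offset;
        if (len > max_msg_size)
            len = max_msg_size;
        int rc = provider_send(buf + offset, len);
        if (rc != 0)
            return rc;
        offset += len;
    }
    return 0;
}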

@hppritcha hppritcha self-assigned this Sep 25, 2019
@emmanuelthome

Thanks guys for handling this.

@shefty

shefty commented Oct 2, 2019

Checking on this, is this a problem in OMPI not checking/using the reported max message size?

If OMPI has a minimum required size, it could provide that in the hints to filter out providers that can't meet that minimum. However, that would likely result in dropping providers that OMPI would ideally like to use.

@mwheinz

mwheinz commented Oct 2, 2019

Checking on this, is this a problem in OMPI not checking/using the reported max message size?

Correct. The OFI MTL wasn't checking the reported value at all. This patch causes the MTL to report an error when the max message size is exceeded; we are discussing a longer-term fix to allow the MTL to break up messages that would exceed the limit.

@jsquyres
Member

jsquyres commented Oct 2, 2019

@shefty Ya, as @mwheinz said: this was 100% an OMPI problem -- not a libfabric problem. I don't think we need to add a "minimum largest size" kind of attribute in libfabric.

@shefty

shefty commented Oct 2, 2019

@jsquyres - libfabric should have this support already. If you set the hints->ep_attr->max_msg_size = X, only those providers that support max_msg_size >= X should respond.
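
For completeness, that filtering looks roughly like this from the libfabric API side (the requested size and API version below are placeholders):

#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    /* Ask only for providers whose endpoints support at least this
     * message size; the 4 GiB value here is just an example. */
    hints->ep_attr->max_msg_size = (size_t) 4 << 30;

    /* API version is illustrative. */
    if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info) == 0) {
        for (struct fi_info *p = info; p != NULL; p = p->next)
            printf("%s: max_msg_size = %zu\n",
                   p->fabric_attr->prov_name, p->ep_attr->max_msg_size);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return 0;
}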

@jsquyres
Member

jsquyres commented Oct 4, 2019

This is now fixed on all release branches: v3.0.x, v3.1.x, v4.0.x.
