
embedded hwloc 2.7.1 is too old for Intel processors with Ponte Vecchio accelerators #11246

Closed
hppritcha opened this issue Dec 23, 2022 · 23 comments

Comments

@hppritcha
Member

hppritcha commented Dec 23, 2022

The hwloc 2.7.1 embedded in main and 5.0.x is too old for Intel processors with Ponte Vecchio accelerators. The lstopo built with this or an older version of hwloc segfaults when run on such processors, and prterun hits a similar segfault when run with the embedded hwloc 2.7.1.

The solution is to use hwloc 2.8 or newer.

@jsquyres
Member

This should also be noted in the docs and/or in the configury test for the minimum hwloc version.

@rhc54
Contributor

rhc54 commented Dec 23, 2022

Isn't that going to be a significant issue for the distros? I believe their defaults are quite a bit older, aren't they?

@hppritcha changed the title from "embedded hwloc 2.7.1 is too old for Intel ZE processors" to "embedded hwloc 2.7.1 is too old for Intel processors with Ponte Vecchio accelerators" Dec 23, 2022
@hppritcha
Member Author

It might be hard to do a configury check for this at the moment: front-end nodes will likely be vanilla Intel XE, and only the back-end nodes will have the Ponte Vecchio accelerators that seem to cause the issue.

Where in the docs would you recommend writing a blurb about this?

@rhc54
Contributor

rhc54 commented Dec 23, 2022

Ah, so you want to do a runtime check of the version? I guess we can do that simply enough - would have to be in PMIx so we can cover both mpirun and direct launch modes. Would you mind opening an issue over there so we don't forget?

@jsquyres Is this going to be an issue re default hwloc versions on distros (I'm thinking of Amazon here, so @bwbarrett )? I don't know of any other solution, frankly, though I wonder if it wouldn't segfault if we asked it to not include those devices. We can do that if we pass the appropriate flags - @hppritcha has that been tried?

@hppritcha
Member Author

I guess a runtime check in PMIx would be nice, although the segfault was happening inside prte, I think. Anyway, the easy workaround at ANL is to use the hwloc they installed in a Spack build-out.

@jsquyres
Member

@hppritcha Ah, you edited the description of this issue -- I think I understand better now: it's a run-time error with older hwloc on Intel with PV accelerators. In this case, is it easy to add a run-time check to see a) if we're on a machine with Intel PV accelerators, and b) the version of hwloc?

If we can detect both of these things at run time, then we should probably show_help() a warning that advises the user of the issue and that they might need to re-build Open MPI with an hwloc >= v2.8.

@rhc54
Contributor

rhc54 commented Dec 23, 2022

Pretty sure there is a "get_version" function in hwloc - will fiddle with it in pmix as that is the base layer that provides the topology.
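
(For reference, a minimal sketch of such a check; the function name and warning text below are illustrative. Note that hwloc_get_api_version() reports the runtime API version, e.g. 0x00020800 for the 2.8 series, not the patch level, so it can tell 2.7.x from 2.8+ but cannot distinguish 2.7.1 from a fixed 2.7.2.)

  #include <stdio.h>
  #include <hwloc.h>

  /* warn if the hwloc we are running against predates the 2.8 API;
   * hwloc_get_api_version() returns the API version only, so a fixed
   * 2.7.2 cannot be told apart from a broken 2.7.1 this way */
  static void warn_if_old_hwloc(void)
  {
      unsigned version = hwloc_get_api_version();
      if (version < 0x00020800) {
          fprintf(stderr,
                  "hwloc runtime API version 0x%x predates 2.8 and may "
                  "segfault on nodes with Ponte Vecchio accelerators\n",
                  version);
      }
  }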

@jsquyres
Member

Pretty sure there is a "get_version" function in hwloc - will fiddle with it in pmix as that is the base layer that provides the topology.

K. @hppritcha is there an easy way to tell that we're on a platform with Intel PV accelerators? Perhaps the presence of some file in /sys or something?

@rhc54
Contributor

rhc54 commented Dec 23, 2022

Good point - we don't want to blanket-block hwloc versions less than 2.8 for all platforms.

@hppritcha
Member Author

@rhc54 says he knows a way - see openpmix/openpmix#2893.
Nothing obvious from lspci or the kernel modules jumps out at me.

@rhc54
Contributor

rhc54 commented Dec 24, 2022

Let's ask @bgoglin - is there a way for us to detect that Intel PV accelerators are present on a system prior to using HWLOC? Please see above discussion as to why that is an important question. Any guidance would be appreciated!

@bgoglin
Contributor

bgoglin commented Dec 25, 2022

I don't have access to a PV server; I am going to ask. I'd assume we'd need to look for specific PCI vendor:device IDs.

@bgoglin
Contributor

bgoglin commented Dec 30, 2022

PCI vendor:device is indeed a good way to identify PV devices, and the list at https://pci-ids.ucw.cz/read/PC/8086 looks correct where it reports device ID = 0x0db[05-9ab] for Ponte Vecchio.

However, the hwloc bug is related to devices with multiple "levelzero subdevices", not to PV specifically, and there are some non-PV devices with multiple subdevices. But those are likely rare for now (discrete GPUs), at least when used in HPC. One solution would be to query Level Zero subdevices instead of looking for the PCI ID. I am not aware of any way to query subdevices from sysfs or anything else outside the L0 library.
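
(For illustration, a minimal sketch of the PCI vendor:device scan via sysfs; the path layout is standard Linux sysfs, the helper name is made up, and the device-ID mask is only an approximation of the 0x0db[05-9ab] list above.)

  #include <dirent.h>
  #include <stdio.h>

  /* illustrative sketch: look for an Intel (0x8086) PCI device whose device
   * ID falls in the 0x0dbx range noted above; the exact set of Ponte Vecchio
   * IDs should be taken from the pci-ids database */
  static int sysfs_has_ponte_vecchio(void)
  {
      DIR *dir = opendir("/sys/bus/pci/devices");
      struct dirent *ent;
      int found = 0;

      if (NULL == dir)
          return 0;
      while (!found && NULL != (ent = readdir(dir))) {
          char path[512];
          unsigned int vendor = 0, device = 0;
          FILE *fp;

          if ('.' == ent->d_name[0])
              continue;

          snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/vendor", ent->d_name);
          if (NULL == (fp = fopen(path, "r")))
              continue;
          if (1 != fscanf(fp, "%x", &vendor))
              vendor = 0;
          fclose(fp);
          if (0x8086 != vendor)
              continue;

          snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/device", ent->d_name);
          if (NULL == (fp = fopen(path, "r")))
              continue;
          if (1 != fscanf(fp, "%x", &device))
              device = 0;
          fclose(fp);
          if (0x0db0 == (device & 0xfff0))
              found = 1;
      }
      closedir(dir);
      return found;
  }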

By the way, the hwloc fix (762fdc4bfb8fc33b304511149a532a122ade395f) is basically the only change in 2.7.x after 2.7.1. I can release a 2.7.2 if needed.

@rhc54
Contributor

rhc54 commented Dec 31, 2022

Hmmm... that sounds like we need to simply block everything below 2.7.2? It's an ugly "fix", but I cannot think of any other way to protect against the segfault - can you?

Adding an L0 dependency is possible, I suppose, but given the history we've had with that library, not something I'm wild about doing. If someone wants to contribute it (it would go in the PMIx pgpu/intel component), I'm willing to look at it. Otherwise, the HWLOC cutoff is the only solution I can see.

@bgoglin
Contributor

bgoglin commented Jan 4, 2023

I released hwloc 2.7.2 yesterday.

@rhc54
Contributor

rhc54 commented Jan 4, 2023

Thanks @bgoglin! Have you received any response to the question regarding how we detect that PV is present prior to invoking HWLOC? Requiring hwloc 2.7.2 and above seems pretty onerous - is working thru L0 really the only way to reliably do it? If so, what are the chances of someone providing that code?

@rhc54
Contributor

rhc54 commented Jan 4, 2023

Borrowing liberally from the hwloc code in 2.8.0, what if we did something like this:

  /* sketch, completed so it compiles standalone: returns 1 if any Level Zero
   * device reports subdevices, 0 if not, -1 on error */
  #include <stdlib.h>
  #include <level_zero/ze_api.h>

  static int check_for_ze_subdevices(void)
  {
    ze_result_t res;
    ze_driver_handle_t *drh;
    uint32_t nbdrivers, i;
    int found = 0;

    res = zeInit(0);
    if (res != ZE_RESULT_SUCCESS) {
      return 0; // we are okay
    }

    nbdrivers = 0;
    res = zeDriverGet(&nbdrivers, NULL);
    if (res != ZE_RESULT_SUCCESS || !nbdrivers)
      return 0; // we are okay
    drh = malloc(nbdrivers * sizeof(*drh));
    if (NULL == drh)
      return -1; // error
    res = zeDriverGet(&nbdrivers, drh);
    if (res != ZE_RESULT_SUCCESS) {
      free(drh);
      return -1; // error
    }

    for (i = 0; i < nbdrivers && !found; i++) {
      uint32_t nbdevices, j;
      ze_device_handle_t *dvh;

      nbdevices = 0;
      res = zeDeviceGet(drh[i], &nbdevices, NULL);
      if (res != ZE_RESULT_SUCCESS || !nbdevices)
        continue;

      dvh = malloc(nbdevices * sizeof(*dvh));
      if (!dvh)
        continue;
      res = zeDeviceGet(drh[i], &nbdevices, dvh);
      if (res != ZE_RESULT_SUCCESS) {
        free(dvh);
        continue;
      }

      for (j = 0; j < nbdevices; j++) {
        uint32_t nr_subdevices = 0;

        res = zeDeviceGetSubDevices(dvh[j], &nr_subdevices, NULL);
        /* returns ZE_RESULT_ERROR_INVALID_ARGUMENT if there are no subdevices */
        if (res != ZE_RESULT_ERROR_INVALID_ARGUMENT || nr_subdevices > 0) {
          /* indicates presence of subdevices -
           * error out with message if hwloc version is less than 2.7.2 */
          found = 1;
          break;
        }
      }
      free(dvh);
    }
    free(drh);
    return found;
  }
@bgoglin
Contributor

bgoglin commented Jan 4, 2023

Thanks @bgoglin! Have you received any response to the question regarding how we detect that PV is present prior to invoking HWLOC? Requiring hwloc 2.7.2 and above seems pretty onerous - is working thru L0 really the only way to reliably do it? If so, what are the chances of someone providing that code?

My reply from 5 days ago is what I got from Intel, plus some ideas of my own.

Your code above might be good. I don't have PVC access to test it, but I have a way to simulate L0 on a couple of different Intel GPU servers if you provide a standalone C program.

@rhc54
Contributor

rhc54 commented Jan 5, 2023

I started looking at what it would take to enable this check, and I'm beginning to question whether it is worth it. If someone wants to run on a machine that falls into this trap, then they are going to have to use an appropriate HWLOC version. I suppose it would be a little nicer if we could warn them of the HWLOC issue instead of just segfaulting, but it would necessitate adding a Level Zero dependency to PMIx that it otherwise doesn't need (at least so far - and I'm not aware of any plans to change that).

So I'm wondering if we should just add this to an FAQ somewhere and call it a day? If the user didn't configure PMIx --with-level-zero, then I couldn't warn them of the problem anyway - which feels like a gaping hole in the logic.

@hppritcha
Member Author

I'd be okay with this solution - documenting it somewhere. Maybe someone from Intel has a different opinion, but my "job" here is just to get Open MPI working on Aurora, and they have an hwloc 2.8.0 that works fine.

@rhc54
Contributor

rhc54 commented Jan 5, 2023

We discussed this a bit on the PMIx biweekly telecon today. As one participant noted, there are a number of codes that use HWLOC (including the RM) that would also break, and quite likely fail before PMIx does. So there really is no useful purpose served by trying to have PMIx provide a warning.

@jsquyres
Member

jsquyres commented Jan 8, 2023

Yeah, we really don't want to add anything complicated here -- I was hoping for a simple "if /sys/blah/blah/blah/ponte-vecchio file exists and hwloc run-time version == 2.7.1, emit this warning" kind of thing. If that's not possible, so be it.

@jsquyres
Member

Per discussion on the 10 Jan 2023 webex, this has likely turned into a documentation issue. It should be noted in the v5.0.x docs somewhere. Per #11290, we can at least disregard the effects of Open MPI's internal hwloc being too old for v6.x.

@jsquyres jsquyres added this to the v5.0.0 milestone Jan 10, 2023
hppritcha added a commit to hppritcha/ompi that referenced this issue Mar 16, 2023
hppritcha added a commit to hppritcha/ompi that referenced this issue Mar 21, 2023
hppritcha added a commit to hppritcha/ompi that referenced this issue Mar 21, 2023
hppritcha added a commit to hppritcha/ompi that referenced this issue Mar 21, 2023
hppritcha added a commit to hppritcha/ompi that referenced this issue Mar 23, 2023 (cherry picked from commit c1b5e6e)
boi4 pushed a commit to boi4/ompi that referenced this issue Mar 23, 2023
boi4 pushed a commit to boi4/ompi that referenced this issue Mar 23, 2023
yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024