
FORCE-TERMINATE AT base/plm_base_launch_support.c #5048

Closed
angainor opened this issue Apr 10, 2018 · 29 comments
Labels
RTE (Issue likely is in RTE or PMIx areas), State-Awaiting developer information

Comments

@angainor commented Apr 10, 2018

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Open MPI repo revision: v3.1.0rc3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

  Configure command line: '--with-knem=/cluster/software/hpcx/2.0/knem' '--with-mxm=/cluster/software/hpcx/2.0/mxm' '--with-hcoll=/cluster/software/hpcx/2.0/hcoll' '--with-ucx=/cluster/software/hpcx/2.0/ucx' '--with-platform=contrib/platform/mellanox/optimized' '--with-pmix=/usr' '--with-hwloc=/usr' '--with-libevent=/usr'

  C compiler: gcc
  C compiler version: 7.2.0
  pmix-1.2.3 
  hwloc-libs-1.11.2
  libevent-2.0.21

Please describe the system on which you are running

  • Operating system/version: CentOS Linux release 7.4.1708 (Core)

  • Computer hardware: Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz

  • Network type: Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]


Details of the problem

mpirun seems to have problems starting workers. I run mpirun ls from within a SLURM allocation:

shell$ mpirun ls
[c11-1:139811] PMIX ERROR: BAD-PARAM in file src/dstore/pmix_esh.c at line 1185
[c11-1:139811] [[5046,0],0] ORTE_ERROR_LOG: Not found in file base/odls_base_default_fns.c at line 172
[c11-1:139811] [[5046,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 550
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[5046,0],0] FORCE-TERMINATE AT (null):1 - error base/plm_base_launch_support.c(551)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

The above works with v3.0.1. Note that I'm compiling against a locally installed PMIx 1.2.3; is this the problem?

@ggouaillardet (Contributor)

Note: you should configure with the external components (though that is unlikely to cause the issue):

configure '--with-pmix=external' '--with-hwloc=external' '--with-libevent=external' ...

Did you configure SLURM with a PMIx plugin that was built against the same PMIx version? There have been some cross-version compatibility fixes in the v1.2 series, so I'd rather suggest you rebuild PMIx 1.2.5 and Open MPI on top of that.

@rhc54 (Contributor) commented Apr 11, 2018

Nah, this has nothing to do with Slurm - the problem here is the external PMIx v1.2.3. OMPI v3.1.0 is based on PMIx v2.1, and I suspect the OPAL "glue" to the older PMIx library is having a problem. Changing to PMIx v1.2.5 might help; going to PMIx v2.1.1 would be a better option, though that won't work with direct launch against the Slurm PMIx plugin (you could, however, still use the Slurm PMI2 or PMI1 support).

@ggouaillardet (Contributor)

Yep, the issue is in the glue, and PMIx v1.2.5 does not help here.

In orte_odls_base_default_get_add_procs_data, we call opal_pmix.get(ORTE_PROC_MY_NAME, NULL, NULL, &val) (note the key is NULL). We end up in _esh_fetch(), which explicitly does not support that use case:

    if (NULL == key) {
        PMIX_OUTPUT_VERBOSE((7, pmix_globals.debug_output,
                             "dstore: Does not support passed parameters"));
        rc = PMIX_ERR_BAD_PARAM;
        PMIX_ERROR_LOG(rc);
        return rc;
    }

At first glance, I could not find a way to fetch all data for a specific rank in PMIx v1.2.5 (something like an _esh_fetchall() or pmix_dstore_fetchall()).
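
For reference, a minimal sketch of the call pattern at issue (the opal_pmix.get signature as it appears in the code above; passing a NULL key is the glue's way of requesting all stored data for a rank):

    /* sketch: "fetch everything for this rank", as done in
     * orte_odls_base_default_get_add_procs_data; with the PMIx v1.2.x
     * dstore enabled this lands in _esh_fetch() and fails BAD_PARAM */
    opal_value_t *val = NULL;
    int rc = opal_pmix.get(ORTE_PROC_MY_NAME, NULL /* key */, NULL /* info */, &val);
    if (OPAL_SUCCESS != rc || NULL == val) {
        /* this is where the reported BAD-PARAM / ORTE_ERROR_LOG cascade starts */
    }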

I tried rebuilding PMIx 1.2.5 with --disable-dstore and ran into a different issue: the hash store is fine with a NULL key, but the rank cannot be found in the namespace. It seems the data is stored in ns->internal, but the get command tries to find it in ns->modex, which has zero entries at that time.

@rhc54 (Contributor) commented Apr 11, 2018

Yeah, that's what I kind of expected. The problem is that the rank probably needs to be "wildcard" for v1.2.5 - but it would be hard to know when to use that value vs. any other one. I suspect you should just kill off the ext1x component, as there are bound to be more problems.
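
A hedged sketch of what that wildcard substitution might look like in the ext1x glue (pmix_proc_t, PMIX_MAX_NSLEN and PMIX_RANK_WILDCARD are PMIx definitions; whether and when the substitution is correct is exactly the open question above):

    /* hypothetical workaround in the ext1x glue: map a NULL-key request
     * to the wildcard rank and hope the store resolves it -- the hard
     * part is knowing when the wildcard is the right choice */
    pmix_proc_t p;
    (void)strncpy(p.nspace, my_proc.nspace, PMIX_MAX_NSLEN);
    p.rank = PMIX_RANK_WILDCARD;   /* instead of the caller's vpid */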

I'm told that Slurm will work with the PMIx v2.x series (only a few APIs were implemented in the Slurm plugin, and they didn't change), so the best solution is to advise Slurm admins to build against PMIx v2.1.1 and things should just work.

@angainor (Author)

@rhc54 @ggouaillardet Last time I checked (in August), SLURM did not compile with PMIx 2. But it seems that the newest version does, so using PMIx 2 might be an option from now on. I'll talk to our expert here and see if I can get it tested.

One problem with this, of course, is that we need to rebuild our entire Open MPI stack so that it can use the new PMIx library. Better now than later, I guess. Do you mean that PMIx 1 will no longer be supported in Open MPI 3?

@rhc54 (Contributor) commented Apr 11, 2018

It's up to them - it could be done, but it might take a bit of work. Someone would have to look at the OMPI v2.x series to see how certain calls were made and then alter the glue code in OMPI v3 to make the required conversions. Not impossible, and there are probably not that many places requiring it - but I don't know if folks will want to invest their time that way.

Note that OMPI v2 code (built against PMIx v1.2.5) should run just fine against Slurm with PMIx v2.1.1, so you don't have to rebuild the older OMPI releases (assuming they are linked against PMIx v1.2.5).

@angainor (Author)

Well, at the time of deployment we installed the then-available PMIx 1.2.3, so I guess we do need to rebuild?

Rebuilding itself is not a big problem; I'm just wondering about potential compatibility issues for users. The earliest version we have is Open MPI 2.0.2, so if that works fine with PMIx 2.1.1, then I guess it sounds safe?

Thanks for your help!

@ggouaillardet (Contributor) commented Apr 11, 2018

@bwbarrett any opinion? Is it OK to drop support for PMIx v1.2 mid-series, or should we try to fix this issue? v3.0.x might be affected as well.

@kawashima-fj (Member)

This problem also exists with the combination of OMPI 3.1.0rc3 + PMIx 2.0.3:

[fh01-009:13372] [[61696,0],0] ORTE_ERROR_LOG: Not found in file /home/tkawa/src/openmpi-3.1.0rc3/orte/mca/odls/base/odls_base_default_fns.c at line 172
[fh01-009:13372] [[61696,0],0] ORTE_ERROR_LOG: Not found in file /home/tkawa/src/openmpi-3.1.0rc3/orte/mca/plm/base/plm_base_launch_support.c at line 550
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[61696,0],0] FORCE-TERMINATE AT (null):1 - error /home/tkawa/src/openmpi-3.1.0rc3/orte/mca/plm/base/plm_base_launch_support.c(551)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

@ggouaillardet (Contributor)

Back to ext1x: it could not even build on master, and I pushed 37e7bca to fix that.

That being said, it fails at runtime ...

I noted some stuff was not backported into v3.1.x; here is a snapshot of my work so far:

I will resume the v1-related work when/if we decide we do not simply want to drop this from the v3.1.x branch.

From 6202a1211157c41d393b6a243f4a749051c5af90 Mon Sep 17 00:00:00 2001
From: Gilles Gouaillardet <gilles@rist.or.jp>
Date: Thu, 12 Apr 2018 13:16:43 +0900
Subject: [PATCH] pmix: backport the legacy_get callback

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

(back-ported from commit open-mpi/ompi@9fb80bd239860a3d5e571f425cff3d5ebc09dd62)
(back-ported from commit open-mpi/ompi@187352eb3daba9357e53b2e02581824fdeef0539)
(back-ported from commit open-mpi/ompi@e9cd7fd7e6cb90ff1ce1f62fb9f057d14e6fc8c2)
---
 examples/hello_c.c                          |    4 +
 opal/mca/pmix/ext1x/pmix1x.c                |    7 +-
 opal/mca/pmix/ext1x/pmix1x_client.c         |    6 +-
 opal/mca/pmix/pmix.h                        |    3 +
 opal/mca/pmix/pmix2x/pmix2x.c               |   10 +-
 opal/mca/pmix/pmix2x/pmix2x.h               |    1 +
 opal/mca/pmix/pmix2x/pmix2x_client.c        |   19 ++-
 opal/mca/pmix/pmix2x/pmix2x_component.c     |    4 +-
 orte/mca/grpcomm/direct/grpcomm_direct.c    |   58 +++++--
 orte/mca/odls/base/odls_base_default_fns.c  |    5 +-
 orte/mca/plm/base/plm_base_launch_support.c |   72 +---------
 orte/mca/state/dvm/state_dvm.c              |  221 ++++++++++++++++-----------
 orte/orted/orted_main.c                     |   94 ++++++++----
 13 files changed, 287 insertions(+), 217 deletions(-)

diff --git a/examples/hello_c.c b/examples/hello_c.c
index e44f684..e038065 100644
--- a/examples/hello_c.c
+++ b/examples/hello_c.c
@@ -8,13 +8,17 @@
  */
 
 #include <stdio.h>
+#include <poll.h>
+
 #include "mpi.h"
 
 int main(int argc, char* argv[])
 {
     int rank, size, len;
+    volatile int _dbg = 1;
     char version[MPI_MAX_LIBRARY_VERSION_STRING];
 
+    while (_dbg) poll(NULL, 0, 1);
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &size);
diff --git a/opal/mca/pmix/ext1x/pmix1x.c b/opal/mca/pmix/ext1x/pmix1x.c
index fbc6025..410c7c7 100644
--- a/opal/mca/pmix/ext1x/pmix1x.c
+++ b/opal/mca/pmix/ext1x/pmix1x.c
@@ -1,6 +1,6 @@
 /* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
 /*
- * Copyright (c) 2014-2017 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014-2018 Intel, Inc. All rights reserved.
  * Copyright (c) 2014-2017 Research Organization for Information Science
  *                         and Technology (RIST). All rights reserved.
  * Copyright (c) 2014-2015 Mellanox Technologies, Inc.
@@ -48,8 +48,13 @@
 
 static const char *pmix1_get_nspace(opal_jobid_t jobid);
 static void pmix1_register_jobid(opal_jobid_t jobid, const char *nspace);
+static bool legacy_get(void)
+{
+    return true;
+}
 
 const opal_pmix_base_module_t opal_pmix_ext1x_module = {
+    .legacy_get = legacy_get,
     /* client APIs */
     .init = pmix1_client_init,
     .finalize = pmix1_client_finalize,
diff --git a/opal/mca/pmix/ext1x/pmix1x_client.c b/opal/mca/pmix/ext1x/pmix1x_client.c
index 3d45d35..741cae3 100644
--- a/opal/mca/pmix/ext1x/pmix1x_client.c
+++ b/opal/mca/pmix/ext1x/pmix1x_client.c
@@ -232,8 +232,10 @@ int pmix1_store_local(const opal_process_name_t *proc, opal_value_t *val)
             }
         }
         if (NULL == job) {
-            OPAL_ERROR_LOG(OPAL_ERR_NOT_FOUND);
-            return OPAL_ERR_NOT_FOUND;
+            job = OBJ_NEW(opal_pmix1_jobid_trkr_t);
+            (void)opal_snprintf_jobid(job->nspace, PMIX_MAX_NSLEN, proc->jobid);
+            job->jobid = proc->jobid;
+            opal_list_append(&mca_pmix_ext1x_component.jobids, &job->super);
         }
         (void)strncpy(p.nspace, job->nspace, PMIX_MAX_NSLEN);
         p.rank = proc->vpid;
diff --git a/opal/mca/pmix/pmix.h b/opal/mca/pmix/pmix.h
index 53e0457..a4936af 100644
--- a/opal/mca/pmix/pmix.h
+++ b/opal/mca/pmix/pmix.h
@@ -867,10 +867,13 @@ typedef int (*opal_pmix_base_process_monitor_fn_t)(opal_list_t *monitor,
                                                    opal_list_t *directives,
                                                    opal_pmix_info_cbfunc_t cbfunc, void *cbdata);
 
+typedef bool (*opal_pmix_base_legacy_get_fn_t)(void);
+
 /*
  * the standard public API data structure
  */
 typedef struct {
+    opal_pmix_base_legacy_get_fn_t                          legacy_get;
     /* client APIs */
     opal_pmix_base_module_init_fn_t                         init;
     opal_pmix_base_module_fini_fn_t                         finalize;
diff --git a/opal/mca/pmix/pmix2x/pmix2x.c b/opal/mca/pmix/pmix2x/pmix2x.c
index 34bc3d7..3f38835 100644
--- a/opal/mca/pmix/pmix2x/pmix2x.c
+++ b/opal/mca/pmix/pmix2x/pmix2x.c
@@ -1,6 +1,6 @@
 /* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
 /*
- * Copyright (c) 2014-2017 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014-2018 Intel, Inc. All rights reserved.
  * Copyright (c) 2014-2017 Research Organization for Information Science
  *                         and Technology (RIST). All rights reserved.
  * Copyright (c) 2014-2015 Mellanox Technologies, Inc.
@@ -50,7 +50,7 @@
 
 /* These are functions used by both client and server to
  * access common functions in the embedded PMIx library */
-
+static bool legacy_get(void);
 static const char *pmix2x_get_nspace(opal_jobid_t jobid);
 static void pmix2x_register_jobid(opal_jobid_t jobid, const char *nspace);
 static void register_handler(opal_list_t *event_codes,
@@ -72,6 +72,7 @@ static void pmix2x_log(opal_list_t *info,
                        opal_pmix_op_cbfunc_t cbfunc, void *cbdata);
 
 const opal_pmix_base_module_t opal_pmix_pmix2x_module = {
+    .legacy_get = legacy_get,
     /* client APIs */
     .init = pmix2x_client_init,
     .finalize = pmix2x_client_finalize,
@@ -126,6 +127,11 @@ const opal_pmix_base_module_t opal_pmix_pmix2x_module = {
     .register_jobid = pmix2x_register_jobid
 };
 
+static bool legacy_get(void)
+{
+    return mca_pmix_pmix2x_component.legacy_get;
+}
+
 static void opcbfunc(pmix_status_t status, void *cbdata)
 {
     pmix2x_opcaddy_t *op = (pmix2x_opcaddy_t*)cbdata;
diff --git a/opal/mca/pmix/pmix2x/pmix2x.h b/opal/mca/pmix/pmix2x/pmix2x.h
index 19683d0..86eb009 100644
--- a/opal/mca/pmix/pmix2x/pmix2x.h
+++ b/opal/mca/pmix/pmix2x/pmix2x.h
@@ -48,6 +48,7 @@ BEGIN_C_DECLS
 
 typedef struct {
   opal_pmix_base_component_t super;
+  bool legacy_get;
   opal_list_t jobids;
   bool native_launch;
   size_t evindex;
diff --git a/opal/mca/pmix/pmix2x/pmix2x_client.c b/opal/mca/pmix/pmix2x/pmix2x_client.c
index 7b8c897..e32f9ef 100644
--- a/opal/mca/pmix/pmix2x/pmix2x_client.c
+++ b/opal/mca/pmix/pmix2x/pmix2x_client.c
@@ -1,6 +1,6 @@
 /* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
 /*
- * Copyright (c) 2014-2017 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014-2018 Intel, Inc. All rights reserved.
  * Copyright (c) 2014-2017 Research Organization for Information Science
  *                         and Technology (RIST). All rights reserved.
  * Copyright (c) 2014-2017 Mellanox Technologies, Inc.
@@ -400,7 +400,6 @@ int pmix2x_store_local(const opal_process_name_t *proc, opal_value_t *val)
 
     PMIX_VALUE_CONSTRUCT(&kv);
     pmix2x_value_load(&kv, val);
-
     /* call the library - this is a blocking call */
     rc = PMIx_Store_internal(&p, val->key, &kv);
     PMIX_VALUE_DESTRUCT(&kv);
@@ -596,10 +595,11 @@ int pmix2x_get(const opal_process_name_t *proc, const char *key,
         return OPAL_ERR_NOT_INITIALIZED;
     }
 
-    if (NULL == proc) {
+    if (NULL == proc && NULL != key) {
         /* if they are asking for our jobid, then return it */
         if (0 == strcmp(key, OPAL_PMIX_JOBID)) {
             (*val) = OBJ_NEW(opal_value_t);
+            (*val)->key = strdup(key);
             (*val)->type = OPAL_UINT32;
             (*val)->data.uint32 = OPAL_PROC_MY_NAME.jobid;
             OPAL_PMIX_RELEASE_THREAD(&opal_pmix_base.lock);
@@ -608,6 +608,7 @@ int pmix2x_get(const opal_process_name_t *proc, const char *key,
         /* if they are asking for our rank, return it */
         if (0 == strcmp(key, OPAL_PMIX_RANK)) {
             (*val) = OBJ_NEW(opal_value_t);
+            (*val)->key = strdup(key);
             (*val)->type = OPAL_INT;
             (*val)->data.integer = pmix2x_convert_rank(my_proc.rank);
             OPAL_PMIX_RELEASE_THREAD(&opal_pmix_base.lock);
@@ -642,6 +643,9 @@ int pmix2x_get(const opal_process_name_t *proc, const char *key,
     rc = PMIx_Get(&p, key, pinfo, sz, &pval);
     if (PMIX_SUCCESS == rc) {
         ival = OBJ_NEW(opal_value_t);
+        if (NULL != key) {
+            ival->key = strdup(key);
+        }
         if (OPAL_SUCCESS != (ret = pmix2x_value_unload(ival, pval))) {
             rc = pmix2x_convert_opalrc(ret);
         } else {
@@ -663,6 +667,9 @@ static void val_cbfunc(pmix_status_t status,
 
     OPAL_ACQUIRE_OBJECT(op);
     OBJ_CONSTRUCT(&val, opal_value_t);
+    if (NULL != op->nspace) {
+        val.key = strdup(op->nspace);
+    }
     rc = pmix2x_convert_opalrc(status);
     if (PMIX_SUCCESS == status && NULL != kv) {
         rc = pmix2x_value_unload(&val, kv);
@@ -702,6 +709,7 @@ int pmix2x_getnb(const opal_process_name_t *proc, const char *key,
         if (0 == strcmp(key, OPAL_PMIX_JOBID)) {
             if (NULL != cbfunc) {
                 val = OBJ_NEW(opal_value_t);
+                val->key = strdup(key);
                 val->type = OPAL_UINT32;
                 val->data.uint32 = OPAL_PROC_MY_NAME.jobid;
                 cbfunc(OPAL_SUCCESS, val, cbdata);
@@ -713,6 +721,7 @@ int pmix2x_getnb(const opal_process_name_t *proc, const char *key,
         if (0 == strcmp(key, OPAL_PMIX_RANK)) {
             if (NULL != cbfunc) {
                 val = OBJ_NEW(opal_value_t);
+                val->key = strdup(key);
                 val->type = OPAL_INT;
                 val->data.integer = pmix2x_convert_rank(my_proc.rank);
                 cbfunc(OPAL_SUCCESS, val, cbdata);
@@ -726,7 +735,9 @@ int pmix2x_getnb(const opal_process_name_t *proc, const char *key,
     op = OBJ_NEW(pmix2x_opcaddy_t);
     op->valcbfunc = cbfunc;
     op->cbdata = cbdata;
-
+    if (NULL != key) {
+        op->nspace = strdup(key);
+    }
     if (NULL == proc) {
         (void)strncpy(op->p.nspace, my_proc.nspace, PMIX_MAX_NSLEN);
         op->p.rank = pmix2x_convert_rank(PMIX_RANK_WILDCARD);
diff --git a/opal/mca/pmix/pmix2x/pmix2x_component.c b/opal/mca/pmix/pmix2x/pmix2x_component.c
index 03246c1..cdcdb7d 100644
--- a/opal/mca/pmix/pmix2x/pmix2x_component.c
+++ b/opal/mca/pmix/pmix2x/pmix2x_component.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2014-2017 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014-2018 Intel, Inc. All rights reserved.
  * Copyright (c) 2014-2015 Research Organization for Information Science
  *                         and Technology (RIST). All rights reserved.
  * Copyright (c) 2016 Cisco Systems, Inc.  All rights reserved.
@@ -21,6 +21,7 @@
 #include "opal/constants.h"
 #include "opal/class/opal_list.h"
 #include "opal/util/proc.h"
+#include "opal/util/show_help.h"
 #include "opal/mca/pmix/pmix.h"
 #include "pmix2x.h"
 
@@ -74,6 +75,7 @@ mca_pmix_pmix2x_component_t mca_pmix_pmix2x_component = {
             MCA_BASE_METADATA_PARAM_CHECKPOINT
         }
     },
+    .legacy_get = true,
     .native_launch = false
 };
 
diff --git a/orte/mca/grpcomm/direct/grpcomm_direct.c b/orte/mca/grpcomm/direct/grpcomm_direct.c
index 8711d2c..530e2ce 100644
--- a/orte/mca/grpcomm/direct/grpcomm_direct.c
+++ b/orte/mca/grpcomm/direct/grpcomm_direct.c
@@ -275,7 +275,7 @@ static void xcast_recv(int status, orte_process_name_t* sender,
     size_t inlen, cmplen;
     uint8_t *packed_data, *cmpdata;
     int32_t nvals, i;
-    opal_value_t *kv;
+    opal_value_t kv, *kval;
     orte_process_name_t dmn;
 
     OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_framework.framework_output,
@@ -461,33 +461,57 @@ static void xcast_recv(int status, orte_process_name_t* sender,
                         OBJ_CONSTRUCT(&wireup, opal_buffer_t);
                         opal_dss.load(&wireup, bo->bytes, bo->size);
                         /* decode it, pushing the info into our database */
-                        cnt=1;
-                        while (OPAL_SUCCESS == (ret = opal_dss.unpack(&wireup, &dmn, &cnt, ORTE_NAME))) {
-                            cnt = 1;
-                            if (ORTE_SUCCESS != (ret = opal_dss.unpack(&wireup, &nvals, &cnt, OPAL_INT32))) {
+                        if (opal_pmix.legacy_get()) {
+                            OBJ_CONSTRUCT(&kv, opal_value_t);
+                            kv.key = OPAL_PMIX_PROC_URI;
+                            kv.type = OPAL_STRING;
+                            cnt=1;
+                            while (OPAL_SUCCESS == (ret = opal_dss.unpack(&wireup, &dmn, &cnt, ORTE_NAME))) {
+                                cnt = 1;
+                                if (ORTE_SUCCESS != (ret = opal_dss.unpack(&wireup, &kv.data.string, &cnt, OPAL_STRING))) {
+                                    ORTE_ERROR_LOG(ret);
+                                    break;
+                                }
+                                if (OPAL_SUCCESS != (ret = opal_pmix.store_local(&dmn, &kv))) {
+                                    ORTE_ERROR_LOG(ret);
+                                    free(kv.data.string);
+                                    break;
+                                }
+                                free(kv.data.string);
+                                kv.data.string = NULL;
+                            }
+                            if (ORTE_ERR_UNPACK_READ_PAST_END_OF_BUFFER != ret) {
                                 ORTE_ERROR_LOG(ret);
-                                break;
                             }
-                            for (i=0; i < nvals; i++) {
+                        } else {
+                           cnt=1;
+                           while (OPAL_SUCCESS == (ret = opal_dss.unpack(&wireup, &dmn, &cnt, ORTE_NAME))) {
+                               cnt = 1;
+                               if (ORTE_SUCCESS != (ret = opal_dss.unpack(&wireup, &nvals, &cnt, OPAL_INT32))) {
+                                   ORTE_ERROR_LOG(ret);
+                                   break;
+                               }
+                               for (i=0; i < nvals; i++) {
                                 cnt = 1;
-                                if (ORTE_SUCCESS != (ret = opal_dss.unpack(&wireup, &kv, &cnt, OPAL_VALUE))) {
+                                if (ORTE_SUCCESS != (ret = opal_dss.unpack(&wireup, &kval, &cnt, OPAL_VALUE))) {
                                     ORTE_ERROR_LOG(ret);
                                     break;
                                 }
                                 OPAL_OUTPUT_VERBOSE((5, orte_grpcomm_base_framework.framework_output,
-                                                    "%s STORING MODEX DATA FOR PROC %s KEY %s",
-                                                    ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
-                                                    ORTE_NAME_PRINT(&dmn), kv->key));
-                                if (OPAL_SUCCESS != (ret = opal_pmix.store_local(&dmn, kv))) {
+                                                     "%s STORING MODEX DATA FOR PROC %s KEY %s",
+                                                     ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                                                     ORTE_NAME_PRINT(&dmn), kval->key));
+                                if (OPAL_SUCCESS != (ret = opal_pmix.store_local(&dmn, kval))) {
                                     ORTE_ERROR_LOG(ret);
-                                    OBJ_RELEASE(kv);
+                                    OBJ_RELEASE(kval);
                                     break;
                                 }
-                                OBJ_RELEASE(kv);
+                                OBJ_RELEASE(kval);
+                            }
+                            }
+                            if (ORTE_ERR_UNPACK_READ_PAST_END_OF_BUFFER != ret) {
+                                ORTE_ERROR_LOG(ret);
                             }
-                        }
-                        if (ORTE_ERR_UNPACK_READ_PAST_END_OF_BUFFER != ret) {
-                            ORTE_ERROR_LOG(ret);
                         }
                         /* done with the wireup buffer - dump it */
                         OBJ_DESTRUCT(&wireup);
diff --git a/orte/mca/odls/base/odls_base_default_fns.c b/orte/mca/odls/base/odls_base_default_fns.c
index c178c4a..7b30f20 100644
--- a/orte/mca/odls/base/odls_base_default_fns.c
+++ b/orte/mca/odls/base/odls_base_default_fns.c
@@ -152,8 +152,9 @@ int orte_odls_base_default_get_add_procs_data(opal_buffer_t *buffer,
 
     /* if we haven't already done so, provide the info on the
      * capabilities of each node */
-    if (!orte_node_info_communicated ||
-        orte_get_attribute(&jdata->attributes, ORTE_JOB_LAUNCHED_DAEMONS, NULL, OPAL_BOOL)) {
+    if (1 < orte_process_info.num_procs &&
+        (!orte_node_info_communicated ||
+         orte_get_attribute(&jdata->attributes, ORTE_JOB_LAUNCHED_DAEMONS, NULL, OPAL_BOOL))) {
         flag = 1;
         opal_dss.pack(buffer, &flag, 1, OPAL_INT8);
         if (ORTE_SUCCESS != (rc = orte_regx.encode_nodemap(buffer))) {
diff --git a/orte/mca/plm/base/plm_base_launch_support.c b/orte/mca/plm/base/plm_base_launch_support.c
index b39b348..f9c0af4 100644
--- a/orte/mca/plm/base/plm_base_launch_support.c
+++ b/orte/mca/plm/base/plm_base_launch_support.c
@@ -38,6 +38,7 @@
 
 #include "opal/hash_string.h"
 #include "opal/util/argv.h"
+#include "opal/util/opal_environ.h"
 #include "opal/class/opal_pointer_array.h"
 #include "opal/dss/dss.h"
 #include "opal/mca/hwloc/hwloc-internal.h"
@@ -681,18 +682,7 @@ void orte_plm_base_post_launch(int fd, short args, void *cbdata)
                              ORTE_JOBID_PRINT(jdata->jobid)));
         goto cleanup;
     }
-    /* if it was a dynamic spawn, and it isn't an MPI job, then
-     * it won't register and we need to send the response now.
-     * Otherwise, it is an MPI job and we should wait for it
-     * to register */
-    if (!orte_get_attribute(&jdata->attributes, ORTE_JOB_NON_ORTE_JOB, NULL, OPAL_BOOL) &&
-        !orte_get_attribute(&jdata->attributes, ORTE_JOB_DVM_JOB, NULL, OPAL_BOOL)) {
-        OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
-                             "%s plm:base:launch job %s is MPI",
-                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
-                             ORTE_JOBID_PRINT(jdata->jobid)));
-        goto cleanup;
-    }
+
     /* prep the response */
     rc = ORTE_SUCCESS;
     answer = OBJ_NEW(opal_buffer_t);
@@ -743,10 +733,7 @@ void orte_plm_base_post_launch(int fd, short args, void *cbdata)
 
 void orte_plm_base_registered(int fd, short args, void *cbdata)
 {
-    int ret, room, *rmptr;
-    int32_t rc;
     orte_job_t *jdata;
-    opal_buffer_t *answer;
     orte_state_caddy_t *caddy = (orte_state_caddy_t*)cbdata;
 
     ORTE_ACQUIRE_OBJECT(caddy);
@@ -770,61 +757,8 @@ void orte_plm_base_registered(int fd, short args, void *cbdata)
         return;
     }
     /* update job state */
-    caddy->jdata->state = caddy->job_state;
-
-    /* if this isn't a dynamic spawn, just cleanup */
-    if (ORTE_JOBID_INVALID == jdata->originator.jobid ||
-        orte_get_attribute(&jdata->attributes, ORTE_JOB_NON_ORTE_JOB, NULL, OPAL_BOOL) ||
-        orte_get_attribute(&jdata->attributes, ORTE_JOB_DVM_JOB, NULL, OPAL_BOOL)) {
-        OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
-                             "%s plm:base:launch job %s is not a dynamic spawn",
-                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
-                             ORTE_JOBID_PRINT(jdata->jobid)));
-        goto cleanup;
-    }
-
-    /* if it was a dynamic spawn, send the response */
-    rc = ORTE_SUCCESS;
-    answer = OBJ_NEW(opal_buffer_t);
-    if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &rc, 1, OPAL_INT32))) {
-        ORTE_ERROR_LOG(ret);
-        ORTE_FORCED_TERMINATE(ORTE_ERROR_DEFAULT_EXIT_CODE);
-        OBJ_RELEASE(caddy);
-        return;
-    }
-    if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &jdata->jobid, 1, ORTE_JOBID))) {
-        ORTE_ERROR_LOG(ret);
-        ORTE_FORCED_TERMINATE(ORTE_ERROR_DEFAULT_EXIT_CODE);
-        OBJ_RELEASE(caddy);
-        return;
-    }
-    /* pack the room number */
-    rmptr = &room;
-    if (orte_get_attribute(&jdata->attributes, ORTE_JOB_ROOM_NUM, (void**)&rmptr, OPAL_INT)) {
-        if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &room, 1, OPAL_INT))) {
-            ORTE_ERROR_LOG(ret);
-            ORTE_FORCED_TERMINATE(ORTE_ERROR_DEFAULT_EXIT_CODE);
-            OBJ_RELEASE(caddy);
-            return;
-        }
-    }
-    OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
-                         "%s plm:base:launch sending dyn release of job %s to %s",
-                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
-                         ORTE_JOBID_PRINT(jdata->jobid),
-                         ORTE_NAME_PRINT(&jdata->originator)));
-    if (0 > (ret = orte_rml.send_buffer_nb(orte_mgmt_conduit,
-                                           &jdata->originator, answer,
-                                           ORTE_RML_TAG_LAUNCH_RESP,
-                                           orte_rml_send_callback, NULL))) {
-        ORTE_ERROR_LOG(ret);
-        OBJ_RELEASE(answer);
-        ORTE_FORCED_TERMINATE(ORTE_ERROR_DEFAULT_EXIT_CODE);
-        OBJ_RELEASE(caddy);
-        return;
-    }
+    jdata->state = caddy->job_state;
 
-  cleanup:
    /* if this wasn't a debugger job, then need to init_after_spawn for debuggers */
     if (!ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_DEBUGGER_DAEMON)) {
         ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_READY_FOR_DEBUGGERS);
diff --git a/orte/mca/state/dvm/state_dvm.c b/orte/mca/state/dvm/state_dvm.c
index 98ef551..6ae8e16 100644
--- a/orte/mca/state/dvm/state_dvm.c
+++ b/orte/mca/state/dvm/state_dvm.c
@@ -257,119 +257,156 @@ static void vm_ready(int fd, short args, void *cbdata)
 
     /* if this is my job, then we are done */
     if (ORTE_PROC_MY_NAME->jobid == caddy->jdata->jobid) {
-        /* send the daemon map to every daemon in this DVM - we
-         * do this here so we don't have to do it for every
-         * job we are going to launch */
-        buf = OBJ_NEW(opal_buffer_t);
-        opal_dss.pack(buf, &command, 1, ORTE_DAEMON_CMD);
-        /* if we couldn't provide the allocation regex on the orted
-         * cmd line, then we need to provide all the info here */
-        if (!orte_nidmap_communicated) {
-            if (ORTE_SUCCESS != (rc = orte_regx.nidmap_create(orte_node_pool, &nidmap))) {
-                ORTE_ERROR_LOG(rc);
-                OBJ_RELEASE(buf);
-                return;
+        /* if there is only one daemon in the job, then there
+         * is just a little bit to do */
+        if (1 == orte_process_info.num_procs) {
+            if (!orte_nidmap_communicated) {
+                if (ORTE_SUCCESS != (rc = orte_regx.nidmap_create(orte_node_pool, &orte_node_regex))) {
+                    ORTE_ERROR_LOG(rc);
+                    return;
+                }
+                orte_nidmap_communicated = true;
             }
-            orte_nidmap_communicated = true;
         } else {
-            nidmap = NULL;
-        }
-        opal_dss.pack(buf, &nidmap, 1, OPAL_STRING);
-        if (NULL != nidmap) {
-            free(nidmap);
-        }
-        /* provide the info on the capabilities of each node */
-        if (!orte_node_info_communicated) {
-            flag = 1;
-            opal_dss.pack(buf, &flag, 1, OPAL_INT8);
-            if (ORTE_SUCCESS != (rc = orte_regx.encode_nodemap(buf))) {
-                ORTE_ERROR_LOG(rc);
-                OBJ_RELEASE(buf);
-                return;
-            }
-            orte_node_info_communicated = true;
-            /* get wireup info for daemons */
-            jptr = orte_get_job_data_object(ORTE_PROC_MY_NAME->jobid);
-            wireup = OBJ_NEW(opal_buffer_t);
-            for (v=0; v < jptr->procs->size; v++) {
-                if (NULL == (dmn = (orte_proc_t*)opal_pointer_array_get_item(jptr->procs, v))) {
-                    continue;
+            /* send the daemon map to every daemon in this DVM - we
+             * do this here so we don't have to do it for every
+             * job we are going to launch */
+            buf = OBJ_NEW(opal_buffer_t);
+            opal_dss.pack(buf, &command, 1, ORTE_DAEMON_CMD);
+            /* if we couldn't provide the allocation regex on the orted
+             * cmd line, then we need to provide all the info here */
+            if (!orte_nidmap_communicated) {
+                if (ORTE_SUCCESS != (rc = orte_regx.nidmap_create(orte_node_pool, &nidmap))) {
+                    ORTE_ERROR_LOG(rc);
+                    OBJ_RELEASE(buf);
+                    return;
                 }
-                val = NULL;
-                if (OPAL_SUCCESS != (rc = opal_pmix.get(&dmn->name, NULL, NULL, &val)) || NULL == val) {
+                orte_nidmap_communicated = true;
+            } else {
+                nidmap = NULL;
+            }
+            opal_dss.pack(buf, &nidmap, 1, OPAL_STRING);
+            if (NULL != nidmap) {
+                free(nidmap);
+            }
+            /* provide the info on the capabilities of each node */
+            if (!orte_node_info_communicated) {
+                flag = 1;
+                opal_dss.pack(buf, &flag, 1, OPAL_INT8);
+                if (ORTE_SUCCESS != (rc = orte_regx.encode_nodemap(buf))) {
                     ORTE_ERROR_LOG(rc);
                     OBJ_RELEASE(buf);
-                    OBJ_RELEASE(wireup);
                     return;
-                } else {
-                    /* pack the name of the daemon */
-                    if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &dmn->name, 1, ORTE_NAME))) {
-                        ORTE_ERROR_LOG(rc);
-                        OBJ_RELEASE(buf);
-                        OBJ_RELEASE(wireup);
-                        return;
-                    }
-                    /* the data is returned as a list of key-value pairs in the opal_value_t */
-                    if (OPAL_PTR != val->type) {
-                        ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
-                        OBJ_RELEASE(buf);
-                        OBJ_RELEASE(wireup);
-                        return;
-                    }
-                    modex = (opal_list_t*)val->data.ptr;
-                    numbytes = (int32_t)opal_list_get_size(modex);
-                    if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &numbytes, 1, OPAL_INT32))) {
-                        ORTE_ERROR_LOG(rc);
-                        OBJ_RELEASE(buf);
-                        OBJ_RELEASE(wireup);
-                        return;
+                }
+                orte_node_info_communicated = true;
+                /* get wireup info for daemons */
+                jptr = orte_get_job_data_object(ORTE_PROC_MY_NAME->jobid);
+                wireup = OBJ_NEW(opal_buffer_t);
+                for (v=0; v < jptr->procs->size; v++) {
+                    if (NULL == (dmn = (orte_proc_t*)opal_pointer_array_get_item(jptr->procs, v))) {
+                        continue;
                     }
-                    OPAL_LIST_FOREACH(kv, modex, opal_value_t) {
-                        if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &kv, 1, OPAL_VALUE))) {
+                    val = NULL;
+                    if (opal_pmix.legacy_get()) {
+                        if (OPAL_SUCCESS != (rc = opal_pmix.get(&dmn->name, OPAL_PMIX_PROC_URI, NULL, &val)) || NULL == val) {
+                            ORTE_ERROR_LOG(rc);
+                            OBJ_RELEASE(buf);
+                            OBJ_RELEASE(wireup);
+                            return;
+                        } else {
+                            /* pack the name of the daemon */
+                            if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &dmn->name, 1, ORTE_NAME))) {
+                                ORTE_ERROR_LOG(rc);
+                                OBJ_RELEASE(buf);
+                                OBJ_RELEASE(wireup);
+                                return;
+                            }
+                            /* pack the URI */
+                           if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &val->data.string, 1, OPAL_STRING))) {
+                                ORTE_ERROR_LOG(rc);
+                                OBJ_RELEASE(buf);
+                                OBJ_RELEASE(wireup);
+                                return;
+                            }
+                            OBJ_RELEASE(val);
+                        }
+                    } else {
+                        if (OPAL_SUCCESS != (rc = opal_pmix.get(&dmn->name, NULL, NULL, &val)) || NULL == val) {
                             ORTE_ERROR_LOG(rc);
                             OBJ_RELEASE(buf);
                             OBJ_RELEASE(wireup);
                             return;
+                        } else {
+                            /* pack the name of the daemon */
+                            if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &dmn->name, 1, ORTE_NAME))) {
+                                ORTE_ERROR_LOG(rc);
+                                OBJ_RELEASE(buf);
+                                OBJ_RELEASE(wireup);
+                                return;
+                            }
+                            /* the data is returned as a list of key-value pairs in the opal_value_t */
+                            if (OPAL_PTR != val->type) {
+                                ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
+                                OBJ_RELEASE(buf);
+                                OBJ_RELEASE(wireup);
+                                return;
+                            }
+                            modex = (opal_list_t*)val->data.ptr;
+                            numbytes = (int32_t)opal_list_get_size(modex);
+                            if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &numbytes, 1, OPAL_INT32))) {
+                                ORTE_ERROR_LOG(rc);
+                                OBJ_RELEASE(buf);
+                                OBJ_RELEASE(wireup);
+                                return;
+                            }
+                            OPAL_LIST_FOREACH(kv, modex, opal_value_t) {
+                                if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &kv, 1, OPAL_VALUE))) {
+                                    ORTE_ERROR_LOG(rc);
+                                    OBJ_RELEASE(buf);
+                                    OBJ_RELEASE(wireup);
+                                    return;
+                                }
+                            }
+                            OPAL_LIST_RELEASE(modex);
+                            OBJ_RELEASE(val);
                         }
                     }
-                    OPAL_LIST_RELEASE(modex);
-                    OBJ_RELEASE(val);
                 }
+                /* put it in a byte object for xmission */
+                opal_dss.unload(wireup, (void**)&bo.bytes, &numbytes);
+                /* pack the byte object - zero-byte objects are fine */
+                bo.size = numbytes;
+                boptr = &bo;
+                if (ORTE_SUCCESS != (rc = opal_dss.pack(buf, &boptr, 1, OPAL_BYTE_OBJECT))) {
+                    ORTE_ERROR_LOG(rc);
+                    OBJ_RELEASE(wireup);
+                    OBJ_RELEASE(buf);
+                    return;
+                }
+                /* release the data since it has now been copied into our buffer */
+                if (NULL != bo.bytes) {
+                    free(bo.bytes);
+                }
+                OBJ_RELEASE(wireup);
+            } else {
+                flag = 0;
+                opal_dss.pack(buf, &flag, 1, OPAL_INT8);
             }
-            /* put it in a byte object for xmission */
-            opal_dss.unload(wireup, (void**)&bo.bytes, &numbytes);
-            /* pack the byte object - zero-byte objects are fine */
-            bo.size = numbytes;
-            boptr = &bo;
-            if (ORTE_SUCCESS != (rc = opal_dss.pack(buf, &boptr, 1, OPAL_BYTE_OBJECT))) {
+
+            /* goes to all daemons */
+            sig = OBJ_NEW(orte_grpcomm_signature_t);
+            sig->signature = (orte_process_name_t*)malloc(sizeof(orte_process_name_t));
+            sig->signature[0].jobid = ORTE_PROC_MY_NAME->jobid;
+            sig->signature[0].vpid = ORTE_VPID_WILDCARD;
+            if (ORTE_SUCCESS != (rc = orte_grpcomm.xcast(sig, ORTE_RML_TAG_DAEMON, buf))) {
                 ORTE_ERROR_LOG(rc);
-                OBJ_RELEASE(wireup);
                 OBJ_RELEASE(buf);
+                OBJ_RELEASE(sig);
+                ORTE_FORCED_TERMINATE(ORTE_ERROR_DEFAULT_EXIT_CODE);
                 return;
             }
-            /* release the data since it has now been copied into our buffer */
-            if (NULL != bo.bytes) {
-                free(bo.bytes);
-            }
-            OBJ_RELEASE(wireup);
-        } else {
-            flag = 0;
-            opal_dss.pack(buf, &flag, 1, OPAL_INT8);
-        }
-
-        /* goes to all daemons */
-        sig = OBJ_NEW(orte_grpcomm_signature_t);
-        sig->signature = (orte_process_name_t*)malloc(sizeof(orte_process_name_t));
-        sig->signature[0].jobid = ORTE_PROC_MY_NAME->jobid;
-        sig->signature[0].vpid = ORTE_VPID_WILDCARD;
-        if (ORTE_SUCCESS != (rc = orte_grpcomm.xcast(sig, ORTE_RML_TAG_DAEMON, buf))) {
-            ORTE_ERROR_LOG(rc);
             OBJ_RELEASE(buf);
-            OBJ_RELEASE(sig);
-            ORTE_FORCED_TERMINATE(ORTE_ERROR_DEFAULT_EXIT_CODE);
-            return;
         }
-        OBJ_RELEASE(buf);
         /* notify that the vm is ready */
         fprintf(stdout, "DVM ready\n");
         OBJ_RELEASE(caddy);
diff --git a/orte/orted/orted_main.c b/orte/orted/orted_main.c
index fec5082..75906ab 100644
--- a/orte/orted/orted_main.c
+++ b/orte/orted/orted_main.c
@@ -230,6 +230,7 @@ int orte_daemon(int argc, char *argv[])
 #if OPAL_ENABLE_FT_CR == 1
     char *tmp_env_var = NULL;
 #endif
+    opal_value_t val;
 
     /* initialize the globals */
     memset(&orted_globals, 0, sizeof(orted_globals));
@@ -460,6 +461,20 @@ int orte_daemon(int argc, char *argv[])
     }
     ORTE_PROC_MY_DAEMON->jobid = ORTE_PROC_MY_NAME->jobid;
     ORTE_PROC_MY_DAEMON->vpid = ORTE_PROC_MY_NAME->vpid;
+    OBJ_CONSTRUCT(&val, opal_value_t);
+    val.key = OPAL_PMIX_PROC_URI;
+    val.type = OPAL_STRING;
+    val.data.string = orte_process_info.my_daemon_uri;
+    if (OPAL_SUCCESS != (ret = opal_pmix.store_local(ORTE_PROC_MY_NAME, &val))) {
+        ORTE_ERROR_LOG(ret);
+        val.key = NULL;
+        val.data.string = NULL;
+        OBJ_DESTRUCT(&val);
+        goto DONE;
+    }
+    val.key = NULL;
+    val.data.string = NULL;
+    OBJ_DESTRUCT(&val);
 
     /* if I am also the hnp, then update that contact info field too */
     if (ORTE_PROC_IS_HNP) {
@@ -668,7 +683,6 @@ int orte_daemon(int argc, char *argv[])
                                   &orte_parent_uri);
     if (NULL != orte_parent_uri) {
         orte_process_name_t parent;
-        opal_value_t val;
 
         /* set the contact info into our local database */
         ret = orte_rml_base_parse_uris(orte_parent_uri, &parent, NULL);
@@ -684,6 +698,8 @@ int orte_daemon(int argc, char *argv[])
         val.data.string = orte_parent_uri;
         if (OPAL_SUCCESS != (ret = opal_pmix.store_local(&parent, &val))) {
             ORTE_ERROR_LOG(ret);
+            val.key = NULL;
+            val.data.string = NULL;
             OBJ_DESTRUCT(&val);
             goto DONE;
         }
@@ -758,52 +774,76 @@ int orte_daemon(int argc, char *argv[])
 
         /* get any connection info we may have pushed */
         {
-            opal_value_t *val = NULL, *kv;
+            opal_value_t *vptr = NULL, *kv;
             opal_list_t *modex;
             int32_t flag;
 
-            if (OPAL_SUCCESS != (ret = opal_pmix.get(ORTE_PROC_MY_NAME, NULL, NULL, &val)) || NULL == val) {
-                /* just pack a marker indicating we don't have any to share */
-                flag = 0;
-                if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &flag, 1, OPAL_INT32))) {
-                    ORTE_ERROR_LOG(ret);
-                    OBJ_RELEASE(buffer);
-                    goto DONE;
-                }
-            } else {
-                /* the data is returned as a list of key-value pairs in the opal_value_t */
-                if (OPAL_PTR == val->type) {
-                    modex = (opal_list_t*)val->data.ptr;
-                    flag = (int32_t)opal_list_get_size(modex);
+            if (opal_pmix.legacy_get()) {
+                if (OPAL_SUCCESS != (ret = opal_pmix.get(ORTE_PROC_MY_NAME, OPAL_PMIX_PROC_URI, NULL, &vptr)) || NULL == vptr) {
+                    /* just pack a marker indicating we don't have any to share */
+                    flag = 0;
                     if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &flag, 1, OPAL_INT32))) {
                         ORTE_ERROR_LOG(ret);
                         OBJ_RELEASE(buffer);
                         goto DONE;
                     }
-                    OPAL_LIST_FOREACH(kv, modex, opal_value_t) {
-                        if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &kv, 1, OPAL_VALUE))) {
-                            ORTE_ERROR_LOG(ret);
-                            OBJ_RELEASE(buffer);
-                            goto DONE;
-                        }
-                    }
-                    OPAL_LIST_RELEASE(modex);
                 } else {
-                    opal_output(0, "VAL KEY: %s", (NULL == val->key) ? "NULL" : val->key);
-                    /* single value */
                     flag = 1;
                     if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &flag, 1, OPAL_INT32))) {
                         ORTE_ERROR_LOG(ret);
                         OBJ_RELEASE(buffer);
                         goto DONE;
                     }
-                    if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &val, 1, OPAL_VALUE))) {
+                    if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &vptr, 1, OPAL_VALUE))) {
                         ORTE_ERROR_LOG(ret);
                         OBJ_RELEASE(buffer);
                         goto DONE;
                     }
+                    OBJ_RELEASE(vptr);
+                }
+            } else {
+                if (OPAL_SUCCESS != (ret = opal_pmix.get(ORTE_PROC_MY_NAME, NULL, NULL, &vptr)) || NULL == vptr) {
+                    /* just pack a marker indicating we don't have any to share */
+                    flag = 0;
+                    if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &flag, 1, OPAL_INT32))) {
+                        ORTE_ERROR_LOG(ret);
+                        OBJ_RELEASE(buffer);
+                        goto DONE;
+                    }
+                } else {
+                    /* the data is returned as a list of key-value pairs in the opal_value_t */
+                    if (OPAL_PTR == vptr->type) {
+                        modex = (opal_list_t*)vptr->data.ptr;
+                        flag = (int32_t)opal_list_get_size(modex);
+                        if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &flag, 1, OPAL_INT32))) {
+                            ORTE_ERROR_LOG(ret);
+                            OBJ_RELEASE(buffer);
+                            goto DONE;
+                        }
+                        OPAL_LIST_FOREACH(kv, modex, opal_value_t) {
+                            if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &kv, 1, OPAL_VALUE))) {
+                                ORTE_ERROR_LOG(ret);
+                                OBJ_RELEASE(buffer);
+                                goto DONE;
+                            }
+                        }
+                        OPAL_LIST_RELEASE(modex);
+                    } else {
+                        /* single value */
+                        flag = 1;
+                        if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &flag, 1, OPAL_INT32))) {
+                            ORTE_ERROR_LOG(ret);
+                            OBJ_RELEASE(buffer);
+                            goto DONE;
+                        }
+                        if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &vptr, 1, OPAL_VALUE))) {
+                            ORTE_ERROR_LOG(ret);
+                            OBJ_RELEASE(buffer);
+                            goto DONE;
+                        }
+                        OBJ_RELEASE(vptr);
+                    }
                 }
-                OBJ_RELEASE(val);
             }
         }
 
-- 
1.7.1

@ggouaillardet (Contributor)

@kawashima-fj the previous patch plus the one below (which should land on master first) seems to fix the issue for me with PMIx v2.0.3.

Can you please give it a try?

I am now testing PMIx v1.2.5.

diff --git a/orte/mca/odls/base/odls_base_default_fns.c b/orte/mca/odls/base/odls_base_default_fns.c
index 7b30f20..8d178a4 100644
--- a/orte/mca/odls/base/odls_base_default_fns.c
+++ b/orte/mca/odls/base/odls_base_default_fns.c
@@ -15,8 +15,8 @@
  *                         All rights reserved.
  * Copyright (c) 2011-2017 Cisco Systems, Inc.  All rights reserved
  * Copyright (c) 2013-2018 Intel, Inc.  All rights reserved.
- * Copyright (c) 2014-2017 Research Organization for Information Science
- *                         and Technology (RIST). All rights reserved.
+ * Copyright (c) 2014-2018 Research Organization for Information Science
+ *                         and Technology (RIST).  All rights reserved.
  * Copyright (c) 2017      Mellanox Technologies Ltd. All rights reserved.
  * Copyright (c) 2017      IBM Corporation. All rights reserved.
  * $COPYRIGHT$
@@ -169,38 +169,60 @@ int orte_odls_base_default_get_add_procs_data(opal_buffer_t *buffer,
         wireup = OBJ_NEW(opal_buffer_t);
         /* always include data for mpirun as the daemons can't have it yet */
         val = NULL;
-        if (OPAL_SUCCESS != (rc = opal_pmix.get(ORTE_PROC_MY_NAME, NULL, NULL, &val)) || NULL == val) {
-            ORTE_ERROR_LOG(rc);
-            OBJ_RELEASE(wireup);
-            return rc;
-        } else {
-            /* the data is returned as a list of key-value pairs in the opal_value_t */
-            if (OPAL_PTR != val->type) {
-                ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
-                OBJ_RELEASE(wireup);
-                return ORTE_ERR_NOT_FOUND;
-            }
-            if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, ORTE_PROC_MY_NAME, 1, ORTE_NAME))) {
+        if (opal_pmix.legacy_get()) {
+            if (OPAL_SUCCESS != (rc = opal_pmix.get(ORTE_PROC_MY_NAME, OPAL_PMIX_PROC_URI, NULL, &val)) || NULL == val) {
                 ORTE_ERROR_LOG(rc);
                 OBJ_RELEASE(wireup);
                 return rc;
+            } else {
+                /* pack the name of the daemon */
+                if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, ORTE_PROC_MY_NAME, 1, ORTE_NAME))) {
+                    ORTE_ERROR_LOG(rc);
+                    OBJ_RELEASE(wireup);
+                    return rc;
+                }
+                /* pack the URI */
+               if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &val->data.string, 1, OPAL_STRING))) {
+                    ORTE_ERROR_LOG(rc);
+                    OBJ_RELEASE(wireup);
+                    return rc;
+                }
+                OBJ_RELEASE(val);
             }
-            modex = (opal_list_t*)val->data.ptr;
-            numbytes = (int32_t)opal_list_get_size(modex);
-            if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &numbytes, 1, OPAL_INT32))) {
+        } else {
+            if (OPAL_SUCCESS != (rc = opal_pmix.get(ORTE_PROC_MY_NAME, NULL, NULL, &val)) || NULL == val) {
                 ORTE_ERROR_LOG(rc);
                 OBJ_RELEASE(wireup);
                 return rc;
-            }
-            OPAL_LIST_FOREACH(kv, modex, opal_value_t) {
-                if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &kv, 1, OPAL_VALUE))) {
+            } else {
+                /* the data is returned as a list of key-value pairs in the opal_value_t */
+                if (OPAL_PTR != val->type) {
+                    ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
+                    OBJ_RELEASE(wireup);
+                    return ORTE_ERR_NOT_FOUND;
+                }
+                if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, ORTE_PROC_MY_NAME, 1, ORTE_NAME))) {
+                    ORTE_ERROR_LOG(rc);
+                    OBJ_RELEASE(wireup);
+                    return rc;
+                }
+                modex = (opal_list_t*)val->data.ptr;
+                numbytes = (int32_t)opal_list_get_size(modex);
+                if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &numbytes, 1, OPAL_INT32))) {
                     ORTE_ERROR_LOG(rc);
                     OBJ_RELEASE(wireup);
                     return rc;
                 }
+                OPAL_LIST_FOREACH(kv, modex, opal_value_t) {
+                    if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &kv, 1, OPAL_VALUE))) {
+                        ORTE_ERROR_LOG(rc);
+                        OBJ_RELEASE(wireup);
+                        return rc;
+                    }
+                }
+                OPAL_LIST_RELEASE(modex);
+                OBJ_RELEASE(val);
             }
-            OPAL_LIST_RELEASE(modex);
-            OBJ_RELEASE(val);
         }
         /* if we didn't rollup the connection info, then we have
          * to provide a complete map of connection info */
@@ -210,41 +232,66 @@ int orte_odls_base_default_get_add_procs_data(opal_buffer_t *buffer,
                     continue;
                 }
                 val = NULL;
-                if (OPAL_SUCCESS != (rc = opal_pmix.get(&dmn->name, NULL, NULL, &val)) || NULL == val) {
-                    ORTE_ERROR_LOG(rc);
-                    OBJ_RELEASE(buffer);
-                    return rc;
-                } else {
-                    /* the data is returned as a list of key-value pairs in the opal_value_t */
-                    if (OPAL_PTR != val->type) {
-                        ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
-                        OBJ_RELEASE(buffer);
-                        return ORTE_ERR_NOT_FOUND;
-                    }
-                    if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &dmn->name, 1, ORTE_NAME))) {
+                if (opal_pmix.legacy_get()) {
+                    if (OPAL_SUCCESS != (rc = opal_pmix.get(&dmn->name, OPAL_PMIX_PROC_URI, NULL, &val)) || NULL == val) {
                         ORTE_ERROR_LOG(rc);
                         OBJ_RELEASE(buffer);
                         OBJ_RELEASE(wireup);
                         return rc;
+                    } else {
+                        /* pack the name of the daemon */
+                        if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &dmn->name, 1, ORTE_NAME))) {
+                            ORTE_ERROR_LOG(rc);
+                            OBJ_RELEASE(buffer);
+                            OBJ_RELEASE(wireup);
+                            return rc;
+                        }
+                        /* pack the URI */
+                       if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &val->data.string, 1, OPAL_STRING))) {
+                            ORTE_ERROR_LOG(rc);
+                            OBJ_RELEASE(buffer);
+                            OBJ_RELEASE(wireup);
+                            return rc;
+                        }
+                        OBJ_RELEASE(val);
                     }
-                    modex = (opal_list_t*)val->data.ptr;
-                    numbytes = (int32_t)opal_list_get_size(modex);
-                    if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &numbytes, 1, OPAL_INT32))) {
+                } else {
+                    if (OPAL_SUCCESS != (rc = opal_pmix.get(&dmn->name, NULL, NULL, &val)) || NULL == val) {
                         ORTE_ERROR_LOG(rc);
                         OBJ_RELEASE(buffer);
-                        OBJ_RELEASE(wireup);
                         return rc;
-                    }
-                    OPAL_LIST_FOREACH(kv, modex, opal_value_t) {
-                        if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &kv, 1, OPAL_VALUE))) {
+                    } else {
+                        /* the data is returned as a list of key-value pairs in the opal_value_t */
+                        if (OPAL_PTR != val->type) {
+                            ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
+                            OBJ_RELEASE(buffer);
+                            return ORTE_ERR_NOT_FOUND;
+                        }
+                        if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &dmn->name, 1, ORTE_NAME))) {
                             ORTE_ERROR_LOG(rc);
                             OBJ_RELEASE(buffer);
                             OBJ_RELEASE(wireup);
                             return rc;
                         }
+                        modex = (opal_list_t*)val->data.ptr;
+                        numbytes = (int32_t)opal_list_get_size(modex);
+                        if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &numbytes, 1, OPAL_INT32))) {
+                            ORTE_ERROR_LOG(rc);
+                            OBJ_RELEASE(buffer);
+                            OBJ_RELEASE(wireup);
+                            return rc;
+                        }
+                        OPAL_LIST_FOREACH(kv, modex, opal_value_t) {
+                            if (ORTE_SUCCESS != (rc = opal_dss.pack(wireup, &kv, 1, OPAL_VALUE))) {
+                                ORTE_ERROR_LOG(rc);
+                                OBJ_RELEASE(buffer);
+                                OBJ_RELEASE(wireup);
+                                return rc;
+                            }
+                        }
+                        OPAL_LIST_RELEASE(modex);
+                        OBJ_RELEASE(val);
                     }
-                    OPAL_LIST_RELEASE(modex);
-                    OBJ_RELEASE(val);
                 }
             }
         }
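
In short, the patch above packs the wireup data in one of two formats selected by opal_pmix.legacy_get(): with PMIx < v2.1, each entry is the daemon name followed by a single URI string; with newer PMIx, it is the name, a count of key-value pairs (numbytes actually holds the list size, not a byte count), then the pairs themselves. For illustration, here is a minimal sketch of a matching unpack loop under that assumption (hypothetical code, not the actual receive path in orted):

    /* Illustrative only -- not the actual OMPI receive path.  Assumes the
     * pack format produced above; "legacy" mirrors opal_pmix.legacy_get(). */
    orte_process_name_t dname;
    opal_value_t *kv;
    char *uri;
    int32_t i, nkvs, cnt = 1;
    int rc;

    while (OPAL_SUCCESS == opal_dss.unpack(wireup, &dname, &cnt, ORTE_NAME)) {
        if (legacy) {
            /* PMIx < v2.1: one URI string per daemon */
            uri = NULL;
            cnt = 1;
            if (OPAL_SUCCESS != (rc = opal_dss.unpack(wireup, &uri, &cnt, OPAL_STRING))) {
                ORTE_ERROR_LOG(rc);
                break;
            }
            /* hand the URI to the OOB so we can route to that daemon */
            free(uri);
        } else {
            /* PMIx >= v2.1: a counted list of key-value pairs per daemon */
            cnt = 1;
            if (OPAL_SUCCESS != (rc = opal_dss.unpack(wireup, &nkvs, &cnt, OPAL_INT32))) {
                ORTE_ERROR_LOG(rc);
                break;
            }
            for (i = 0; i < nkvs; i++) {
                cnt = 1;
                if (OPAL_SUCCESS != (rc = opal_dss.unpack(wireup, &kv, &cnt, OPAL_VALUE))) {
                    ORTE_ERROR_LOG(rc);
                    break;
                }
                /* store the pair for that daemon, then release it */
                OBJ_RELEASE(kv);
            }
        }
        cnt = 1;
    }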

@ggouaillardet
Contributor

ext1x has another issue: it cannot register the daemon URI, since the daemon lives in a namespace that is unknown to PMIx at that time. FWIW, PMIx v2 automatically creates non-existing namespaces, but v1 simply fails.
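
Conceptually, the difference looks like this (lookup_nspace() and create_nspace() are hypothetical stand-ins for PMIx's internal bookkeeping, not real PMIx functions):

    /* Hypothetical sketch -- not the actual PMIx source */
    ns = lookup_nspace(pmix_globals.nspaces, target_nspace);
    if (NULL == ns) {
        /* PMIx v1.x: the request simply fails */
        return PMIX_ERR_NOT_FOUND;
    }

    /* ... whereas PMIx v2.x effectively does: */
    ns = lookup_nspace(pmix_globals.nspaces, target_nspace);
    if (NULL == ns) {
        /* create the namespace on first reference */
        ns = create_nspace(pmix_globals.nspaces, target_nspace);
    }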

Since Open MPI v3.1.0 has not been released yet, should we simply drop the pmix/ext1x component?
Otherwise, some extra development will be required.

@kawashima-fj
Member

@ggouaillardet OK. I'll try. (may be next week)

@kawashima-fj
Member

@ggouaillardet I had time to try your patch (with OMPI 3.1.0rc3 + PMIx 2.0.3) today.

The original issue was an abend (abnormal termination) of orterun. That abend disappeared, but now orted abends, probably with a segmentation fault in the orte_daemon function.
I'll look into it next week.

[fh04-005:18801] *** Process received signal ***
[fh04-005:18801] Signal: Segmentation fault (11)
[fh04-005:18801] Signal code: Address not mapped (1)
[fh04-005:18801] Failing at address: (nil)
[fh04-005:18801] [ 0] /lib64/libpthread.so.0(+0x11d88)[0xffffffff00f71d88]
[fh04-005:18801] [ 1] [0x0]
[fh04-005:18801] [ 2] orted[0x1008e0]
[fh04-005:18801] [ 3] /lib64/libc.so.6(__libc_start_main+0x1c4)[0xffffffff010bd9d8]
[fh04-005:18801] [ 4] orted[0x10078c]
[fh04-005:18801] *** End of error message ***

@ggouaillardet
Contributor

How did you test that? I will try again too.

@kawashima-fj
Member

Manually build using gcc 4.4.7 on Fujitsu PRIMEHPC FX100 (sparc64):

  • libevent 2.0.22
  • PMIx 2.0.3
  • Open MPI 3.1.0rc3 + your two patches
  • examples/ring_c.c in OMPI tree

Run on Fujitsu PRIMEHPC FX100 (sparc64):

mpiexec -n 512 \
        --hostfile $hostfile \
        --npernode 32 \
        --mca orte_tmpdir_base /dev/shm \
        --mca plm_rsh_agent /bin/pjrsh \
        ./ring_c

/bin/pjrsh is an alternative to ssh on Fujitsu systems.

I didn't try on x86.

@ggouaillardet
Contributor

@kawashima-fj FWIW, I cannot reproduce this on x86_64 nor on FX10 (I could only test up to 24 nodes with 8 tasks per node, though).

@kawashima-fj
Member

@ggouaillardet Something was wrong in my previous try. Today I refreshed my installation and built Open MPI cleanly. It worked fine.

@ggouaillardet
Contributor

@kawashima-fj thanks, I will push the fix to master and then PR it against the v3.1.x branch tomorrow

@jjhursey
Member

Do we know if v3.0.x is impacted by this?

@ggouaillardet
Contributor

I do not think so. You can grep for opal_pmix.legacy_get in the v3.0.x tree; if that returns nothing, it means v3.0.x is safe.

@ggouaillardet
Contributor

As I wrote earlier, this patch is necessary but not sufficient for PMIx v1.2.

The first issue I faced is in orte_ess_base_app_setup, when storing the daemon URI locally:

        OBJ_CONSTRUCT(&val, opal_value_t);
        val.key = OPAL_PMIX_PROC_URI;
        val.type = OPAL_STRING;
        val.data.string = orte_process_info.my_daemon_uri;
        if (OPAL_SUCCESS != (ret = opal_pmix.store_local(ORTE_PROC_MY_DAEMON, &val))) {
            /* ... error handling elided ... */
        }

This fails since there is no opal_pmix1_jobid_trkr_t object for the daemon's namespace, and even if we create one, the namespace still does not exist inside PMIx itself (in pmix_globals.nspaces).

There is no such issue in v3.0.x since PMIx is not involved there; instead we call

orte_rml.set_contact_info(orte_process_info.my_daemon_uri);

FWIW, I tried the following hack to force the creation of the missing objects, but I ended up with a deadlock when running with 2 MPI tasks or more:

diff --git a/opal/mca/pmix/ext1x/pmix1x_client.c b/opal/mca/pmix/ext1x/pmix1x_client.c
index 480b206..1e663dd 100644
--- a/opal/mca/pmix/ext1x/pmix1x_client.c
+++ b/opal/mca/pmix/ext1x/pmix1x_client.c
@@ -232,8 +232,16 @@ int pmix1_store_local(const opal_process_name_t *proc, opal_value_t *val)
             }
         }
         if (NULL == job) {
-            OPAL_ERROR_LOG(OPAL_ERR_NOT_FOUND);
-            return OPAL_ERR_NOT_FOUND;
+            opal_list_t info;
+            opal_value_t * val;
+            job = OBJ_NEW(opal_pmix1_jobid_trkr_t);
+            (void)opal_snprintf_jobid(job->nspace, PMIX_MAX_NSLEN, proc->jobid);
+            job->jobid = proc->jobid;
+            opal_list_append(&mca_pmix_ext1x_component.jobids, &job->super);
+            /* force PMIx to create the internal namespace */
+            OBJ_CONSTRUCT(&info, opal_list_t);
+            pmix1_get(proc, NULL, &info, &val);
+            OBJ_DESTRUCT(&info);
         }
         (void)strncpy(p.nspace, job->nspace, PMIX_MAX_NSLEN);
         p.rank = proc->vpid;

At this stage, I can only see two options:

  • restoring some old code and adding an opal_pmix.legacy_store_local() (same idea as the existing opal_pmix.legacy_get() used to support PMIx < v2.1); a rough sketch follows this list
  • simply removing the pmix/ext1x component from both master and v3.1.x
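
To make the first option concrete, here is a hypothetical sketch: opal_pmix.legacy_store_local() does not exist today, and the RML fallback assumes the removed set_contact_info() code is restored.

    /* Hypothetical sketch of option 1 -- none of this exists in the tree.
     * Like legacy_get(), legacy_store_local() would report that the active
     * PMIx component is a v1.x one that cannot store data for a foreign
     * namespace, in which case we fall back to the old RML path. */
    if (opal_pmix.legacy_store_local()) {
        /* requires restoring orte_rml.set_contact_info() */
        orte_rml.set_contact_info(orte_process_info.my_daemon_uri);
    } else {
        OBJ_CONSTRUCT(&val, opal_value_t);
        val.key = OPAL_PMIX_PROC_URI;
        val.type = OPAL_STRING;
        val.data.string = orte_process_info.my_daemon_uri;
        if (OPAL_SUCCESS != (ret = opal_pmix.store_local(ORTE_PROC_MY_DAEMON, &val))) {
            ORTE_ERROR_LOG(ret);
        }
        val.data.string = NULL;   /* the URI string is owned by orte_process_info */
        OBJ_DESTRUCT(&val);
    }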

So this is both a technical issue (@rhc54, could you please share some insights?) and an RM issue (@bwbarrett, any opinion on this?).

@bwbarrett
Member

@ggouaillardet, pretend I have like 10 minutes a week to spend on this issue. Can you summarize it in an actionable way? Earlier, you asked about removing support for something mid-version. That's clearly not OK, but it's unclear what broke and when (did it ever work in the series, or were we claiming support for something that was always broken)?

@bwbarrett bwbarrett removed their assignment Apr 23, 2018
@ggouaillardet
Contributor

@bwbarrett so here is the status (short version):

  • PMIx v1.2 does not work on either the master or the v3.1.x branch (the support lives in the pmix/ext1x component)
  • 3.1.0 has already been released with the broken PMIx v1.2 support
  • fixing this will require some effort

so I can see two ways of looking at this:

  1. this is a bug, and we will spend the time and effort to fix it
  2. pmix/ext1x should never have been released, and we will fix that by removing the component in 3.1.1

Long(er) story

  • I suspect PMIx v1.2 support was broken in master by b225366 in July 2017, which happened before the v3.1.x branch was forked
  • the error occurred when orte_ess_base_app_setup was updated to ask PMIx for something (e.g. opal_pmix.store_local(ORTE_PROC_MY_DAEMON, ...)) that is not supported by PMIx v1.2. Per the commit message of b225366:

Remove the no-longer-required get_contact_info and set_contact_info from the RML layer.

My current understanding is that these subroutines are still needed if we want to support PMIx v1.2.

  • pmix/ext1x did not even build on master for a while; I fixed that in 37e7bca, and I suspect 9f472d8 had broken it two weeks earlier
  • all of this could indicate that no one tested and/or cared about PMIx v1.2 support on the master or v3.1.x branches
  • at the time, I wrote about removing support for something mid-version; that was not correct since 3.1.0 had not been released yet (but it has been released since ...)

Bottom line: PMIx v1.2 support is now broken in both the master and v3.1.x branches, and it has never worked in the v3.1.x branch (which includes the already released Open MPI 3.1.0). So I believe it is fair to ask whether we should fix or remove PMIx v1.2 support in both master and v3.1.x via the pmix/ext1x component.

@bwbarrett
Member

3.1.0 has not been released. However, as a minor version bump, it shouldn’t remove features from 3.0.x. So what does 3.0.x look like?

@ggouaillardet
Contributor

Indeed, 3.1.0 has not been released. v3.0.x does (correctly) support PMIx v1.2: the changes were not backported, and v3.0.x still uses the {get,set}_contact_info() subroutines.

@jjhursey
Member

I missed the discussion on this ticket from the call. Were there action items to move forward?

@rhc54 rhc54 added the RTE Issue likely is in RTE or PMIx areas label Jun 26, 2018
@rhc54
Contributor

rhc54 commented Sep 11, 2018

@ggouaillardet Did these fixes get into the release branches? If so, can we close this?

@rhc54
Contributor

rhc54 commented Apr 22, 2021

PMIx v1.2 is now outside its support window.

@rhc54 rhc54 closed this as completed Apr 22, 2021