
btl tcp: Use reachability and graph solving for global interface matching #7134

Merged
merged 1 commit into open-mpi:master on Jan 27, 2020

Conversation

wckzhang
Contributor

@wckzhang wckzhang commented Nov 1, 2019

Previously we used a fairly simple algorithm in
mca_btl_tcp_proc_insert() to pair local and remote modules. This was a
point in time solution rather than a global optimization problem (where
global means all modules between two peers). The selection logic would
often fail due to pairing interfaces that are not routable for traffic.
The complexity of the selection logic was Θ(n^n), which was expensive.
Due to poor scalability, this logic was only used when the number of
interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More
details can be found in this ticket:
https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates
in the ticket do not match what I calculated from the function)
As a fallback, when interfaces surpassed this threshold, a brute force
O(n^2) double for loop was used to match interfaces.

This commit solves two problems. First, the point-in-time solution is
turned into a global optimization solution. Second, the reachability
framework was used to create a more realistic reachability map. We
switched from using IP/netmask to using the reachability framework,
which supports route lookup. This will help many corner cases as well as
utilize any future development of the reachability framework.

The solution implemented in this commit has a complexity mainly derived
from the bipartite assignment solver. If the local and remote peer both
have the same number of interfaces (n), the complexity of matching will
be O(n^5).

With the decrease in complexity to O(n^5), I calculated and tested
that initialization costs would be 5000 microseconds with 30 interfaces
per node (likely close to the maximum realistic number of interfaces we
will encounter). For additional data points, up to 300 interfaces (a very
unrealistic number) were simulated. Up to 150
interfaces, the matching costs will be less than 1 second, climbing to
10 seconds with 300 interfaces. Reflecting on these results, I removed
the suboptimal O(n^2) fallback logic, as it no longer seems necessary.

Data was gathered comparing the scaling of initialization costs with
ranks. For a low number of interfaces, the impact of initialization is
negligible. At an interface count of 7-8, the new code has slightly
faster initialization costs. At an interface count of 15, the new code
has slower initialization costs. However, all initialization costs
scale linearly with the number of ranks.

In order to use the reachable function, we populate local and remote
lists of interfaces. We then convert the interface matching problem
into a graph problem. We create a bipartite graph with the local and
remote interfaces as vertices and use negative reachability weights as
costs. Using the bipartite assignment solver, we generate the matches
for the graph. To ensure that both the local and remote process have
the same output, we ensure we mirror their respective inputs for the
graphs. Finally, we store the endpoint matches that we created earlier
in a hash table. This is stored with the btl_index as the key and a
struct mca_btl_tcp_addr_t* as the value. This is then retrieved during
insertion time to set the endpoint address.

Signed-off-by: William Zhang wilzhang@amazon.com

@wckzhang
Contributor Author

wckzhang commented Nov 1, 2019

Here are some graphs and data that I described in the commit message. The before/after graphs were run on 64 nodes with 36 ranks each, with up to 15 interfaces attached. Note that the gaps in the lines are where no route to host was found (a bug in the "before" code). I decided to keep the gaps rather than rerun the tests, since I believe they provide useful information. The interface counts were selected to show the difference between the minimum, the maximum, and the old code's threshold switch from 8 to 9 interfaces (MAX_PERMUTATION_INTERFACES):

Screen Shot 2019-09-04 at 4 49 23 PM

Simulated additional init cost between two ranks with up to 300 interfaces.

Screen Shot 2019-11-01 at 11 44 41 AM
Average total mpi init costs before and after this patch, with 1, 8, 9, and 15 interfaces.

Screen Shot 2019-11-01 at 11 47 08 AM
Min, max, and average mpi init costs for each rank with 1 interface per node.

Screen Shot 2019-11-01 at 11 47 53 AM
Min, max, and average mpi init costs for each rank with 8 interfaces per node.

Screen Shot 2019-11-01 at 11 48 58 AM
Min, max, and average mpi init costs for each rank with 9 interfaces per node.

Screen Shot 2019-11-01 at 11 50 01 AM
Min, max, and average mpi init costs for each rank with 15 interfaces per node.

@markalle
Contributor

markalle commented Nov 4, 2019

I opened a different issue at
https://github.ibm.com/smpi/ibm-ompi/pull/569
and was told this PR might be related.

So far I don't think this PR addresses my bind() mismatch but wanted to get your thoughts since this PR is in the same code. My concern with the design is how mca_btl_tcp_create() walks opal_if_list creating one btl for each interface and storing its selection in btl->tcp_ifaddr.

But it looks to me like the endpoint selection here is associated with a specific btl but it's considering all the local interfaces so it might pick something that's incompatible with the btl's btl->tcp_ifaddr. Then if there is a mismatch possible, it would show up in the bind() in btl_tcp_endpoint.c

Overall I think I like having the btl look over all the interfaces, so I'm wondering

  • why are we opening one tcp btl per local interface in the first place?
  • at connect-time should we be binding to something else if I'm right that btl->tcp_ifaddr doesn't represent the reality of what's being used?

@wckzhang
Contributor Author

wckzhang commented Nov 4, 2019

Hello, I cannot open the ibm github PR, since I don't have an account.

So far I don't think this PR addresses my bind() mismatch but wanted to get your thoughts since this PR is in the same code.

I think it does address the mismatch. If you are referring to the issue in #7115, the issue looks like either a reachability issue or the address that one side selects does not match the address that the other side selects. Both issues are addressed by this patch, due to using the reachability framework as well as moving the matching into a global matching problem. If possible, can you run with my patch and see if you still see the same problem?

My concern with the design is how mca_btl_tcp_create() walks opal_if_list creating one btl for each interface and storing its selection in btl->tcp_ifaddr.

This is okay because we do intend for each btl to represent one interface.

But it looks to me like the endpoint selection here is associated with a specific btl but it's considering all the local interfaces so it might pick something that's incompatible with the btl's btl->tcp_ifaddr. Then if there is a mismatch possible, it would show up in the bind() in btl_tcp_endpoint.c

All local interfaces are considered here because it's a global matching problem. We match all local interfaces with all remote interfaces and store this information in the global component structure. This isn't done for each btl separately, so we need to consider all local interfaces. This is required for optimization of pairing. There shouldn't be anything incompatible with the btl's btl->tcp_ifaddr, that is what the reachability component is supposed to address. After we have created a reachability map, we then solve for interface pairing by turning it into a graph problem with a sink and flow using negative reachability for weights. It should technically not be possible to match something that is not reachable, as long as the reachability framework is functioning correctly.

Overall I think I like having the btl look over all the interfaces, so I'm wondering why are we opening one tcp btl per local interface in the first place?

I'm not sure what the original reasoning for one tcp btl per local interface was, but I don't see much wrong with the design.

at connect-time should we be binding to something else if I'm right that btl->tcp_ifaddr doesn't represent the reality of what's being used?

I'm not sure what you mean...it looks like it binds using the btl_endpoint->endpoint_btl->tcp_ifaddr and connects using the btl_endpoint->endpoint_addr.

@jjhursey
Member

jjhursey commented Nov 4, 2019

Maybe Mark meant to point to #7115

@markalle
Contributor

markalle commented Nov 4, 2019

Ah thanks, I totally had a cut-and-paste error in my question, ignore that reference I made to github.ibm.com.

Thanks for the explanations, it sounds plausible. So far I've only browsed the code. It's definitely possible that the matching results in the btl picking endpoints that match what it's going to bind to. I'll try downloading and building it next.

@wckzhang
Contributor Author

wckzhang commented Nov 4, 2019

bot:retest

@markalle
Contributor

markalle commented Nov 5, 2019

I got a chance to test it: I think I agree now it does address my bind() mismatch in addition to doing the rest of what it does. I like that it has one TCP btl per interface and each is using the interface it's set up with. I think previously each tcp btl was potentially spanning all the interfaces with no particular pattern.

I haven't really looked at how the load gets distributed over the multiple BTLs. I have maybe a slight concern that the old code could have lucked into a better distribution of traffic if everything went over btl[0], which seems like it might be the case at least for small messages.

I do have a failure though when I went from -np 2 to -np 3 (--host hostA:2,hostB:1)
First I get a non-fatal failure as

[1,0]:[hostA:12345] OPAL ERROR: Bad parameter in file bipartite_graph.c at line 891
but it still seems to run to completion.

The above was with
-mca pml ob1 -mca btl tcp,vader,self

If I instead switch to
-mca pml ob1 -mca btl tcp,self still at -np 3 --host hostA:2,hostB:1
it fails to run, reporting

At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated

@gpaulsen
Member

gpaulsen commented Nov 5, 2019

@wckzhang Can you please take a look at the error @markalle found?

@wckzhang
Contributor Author

wckzhang commented Nov 5, 2019

Looking into it, thanks.

@wckzhang
Contributor Author

wckzhang commented Nov 5, 2019

@markalle Can you paste your whole mpirun command? (I can't reproduce the issue - how many interfaces do you have on your devices?)

@markalle
Contributor

markalle commented Nov 5, 2019

The "bad parameter" non-fatal error message comes from this command
% mpirun --tag-output -np 3 --host hostA:2,hostB:1 -mca pml ob1 -mca btl tcp,vader,self -mca btl_tcp_if_include ib0 ./a2a.x

The fatal command is
% mpirun --tag-output -np 3 --host hostA:2,hostB:1 -mca pml ob1 -mca btl tcp,self -mca btl_tcp_if_include ib0 ./a2a.x
so just using one interface.

@wckzhang
Contributor Author

wckzhang commented Nov 5, 2019

I still couldn't reproduce it - ran the following with no issues

mpirun --tag-output -hostfile /shared/hostfile -mca pml ob1 -mca btl tcp,vader,self -np 3 -mca btl_tcp_if_include eth0 ./osu_allgather

mpirun --tag-output -hostfile /shared/hostfile -mca pml ob1 -mca btl tcp,self -np 3 -mca btl_tcp_if_include eth0 ./osu_allgather

@markalle
Contributor

markalle commented Nov 5, 2019

Also, the OPAL_ERROR didn't change much when I set GRAPH_DEBUG; the message just became

[1,0]:bipartite_to_flow failed[hostA:12345] OPAL ERROR: Bad parameter in file bipartite_graph.c at line 891

I switched to your command line and looped over each interface individually for
-mca btl_tcp_if_include [interface]
and they all (enP5p1s0f0 enP5p1s0f1 ib0 ib1 ib2 ib3) gave the same result: With vader included they passed but gave the "bad parameter" error message. With vader removed they fail.

I don't know why I didn't just try single host before but that fails too with vader removed:
% mpirun --tag-output -mca pml ob1 -mca btl tcp,self -np 2 -mca btl_tcp_if_include ib0 ./a2a.x

[1,0]:[f3n17:75535] OPAL ERROR: Bad parameter in file bipartite_graph.c at line 891

At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated

@wckzhang
Contributor Author

wckzhang commented Nov 5, 2019

Still can't reproduce with:

mpirun --tag-output -mca pml ob1 -mca btl tcp,self -np 2 -mca btl_tcp_if_include eth1 ./osu_allgather

I think it's probably this line that's happening:

    /* it doesn't make sense to extend this graph with a source and sink        
     * unless */                                                                
    if (num_right == 0 || num_left == 0) {                                      
        return OPAL_ERR_BAD_PARAM;                                              
    }

Can you verify in mca_btl_tcp_proc_create_interface_graph the size of local_list, remote_list, and how many vertices and edges are created?

@jsquyres
Member

jsquyres commented Nov 5, 2019

@markalle Github pro tip: Use three backticks on a line by themselves to start and end a verbatim section of text. See https://help.github.com/en/github/writing-on-github/basic-writing-and-formatting-syntax#quoting-code

@jsquyres
Member

jsquyres commented Nov 5, 2019

@wckzhang Github pro tip: I think you heavily edited your last comment after creating it. The email we got didn't have nearly as much detail as you now have in your comment -- so @markalle may not have seen your question to him.

@markalle
Contributor

markalle commented Nov 6, 2019

I printed the size of both lists, they were both 1.

Then over in the failure location the return value from opal_bp_graph_bipartite_to_flow() is -5 and the gx structure is

(gdb) p *gx
$3 = {num_vertices = 4, vertices = {super = {obj_class = 0x2000006e0470 <opal_pointer_array_t_class>, 
      obj_reference_count = 1}, lock = {super = {obj_class = 0x2000006e0a78 <opal_mutex_t_class>, 
        obj_reference_count = 1}, m_lock_pthread = {__data = {__lock = 0, __count = 0, __owner = 0, 
          __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
        __size = '\000' <repeats 39 times>, __align = 0}, m_lock_atomic = {u = {lock = 0, 
          sparc_lock = 0 '\000', padding = "\000\000\000"}}}, lowest_free = 4, number_free = 28, 
    size = 32, max_size = 2147483647, block_size = 32, free_bits = 0x3c483f50, addr = 0x3c4893f0}, 
  source_idx = 2, sink_idx = 3, v_data_cleanup_fn = 0x0, e_data_cleanup_fn = 0x0}

if that helps.

@markalle
Contributor

markalle commented Nov 6, 2019

My first printout was using opal_list_get_size(local_list) and opal_list_get_size(remote_list). I just now printed results->num_local and results->num_remote and they were also 1.

And the call to opal_bp_graph_add_edge() never happens in that x,y loop because the cost it sees is 0.

@markalle
Contributor

markalle commented Nov 6, 2019

So maybe wherever it's getting its data is suspect for my architecture? If I artificially change the cost to 1 in that location it passes.

@wckzhang
Contributor Author

wckzhang commented Nov 6, 2019

That means that the reachability framework is incorrectly identifying your interfaces as unreachable. This is a bug with the reachability framework then. It sets the weight as 0 when there is no connection available.

@markalle
Contributor

markalle commented Nov 6, 2019

In the netlink get_weights() I added a print for local_if and remote_if and they both have the right IP as 10.100.3.17, but I did notice local_if->if_name is set to "ib0" while remote_if->if_name is an empty string. Maybe it's not needed, but it made me wonder if it has all the fields it needs set up.

opal_reachable_netlink_rt_lookup() is then returning 113. What can I print that would be useful there? After

    /* recieve results */
    NL_RECVMSGS(unlsk->nlh, arg, EHOSTUNREACH, err, out);
I have
(gdb) p arg
$1 = {oif = 4, found = 0, has_gateway = 0, replied = 0, unlsk = 0x3e7a3f70}

@markalle
Contributor

markalle commented Nov 6, 2019

I think what's happening is it's being told to look for a route using "ib0" for on-host traffic but it's only seeing routes that use "lo".

In the code we end up with the callback opal_reachable_netlink_rt_raw_parse_cb() being triggered with
nla_get_u32(tb[RTA_OIF]) = 1, which I think is associated with interface "lo", vs
lookup_arg->oif = 4, which I think is my "ib0".

So it probably only sees "lo" as routing on-host but is trying to find something using "ib0".

In the past I've always thought
-mca pml ob1 -mca btl tcp,self -mca btl_tcp_if_include ib0
was a valid combination of options, but perhaps it has to be
-mca pml ob1 -mca btl tcp,self -mca btl_tcp_if_include ib0,lo
maybe? The above is working now, because with "lo" in the include list it's able to see that there is a route back to localhost.

What do you think? If vader is out of the btl list and tcp is being used on-host is it invalid to then use an interface list that leaves out "lo"? I tend to think "lo" is a loopback but not the only loopback and it should have worked with just "ib0". Thoughts?

@wckzhang
Contributor Author

wckzhang commented Nov 7, 2019

In the netlink get_weights() I added a print for local_if and remote_if and they both have the right IP as 10.100.3.17, but I did notice local_if->if_name is set to "ib0" while remote_if->if_name is an empty string. Maybe it's not needed, but it made me wonder if it has all the fields it needs set up.

The remote_if->if_name being empty was intentional. I actually changed the documentation of the reachable function to specify this (since there was no prior specification).

/* Build reachability matrix between local and remote ethernet                  
 * interfaces                                                                   
 *                                                                              
 * @param local_ifs (IN)     Local list of opal_if_t objects                    
 *                           The opal_if_t objects must be fully populated      
 * @param remote_ifs (IN)    Remote list of opal_if_t objects                   
 *                           The opal_if_t objects must have the following fields populated:
 *                              uint16_t                 af_family;             
 *                              struct sockaddr_storage  if_addr;               
 *                              uint32_t                 if_mask;               
 *                              uint32_t                 if_bandwidth;          
 * @return opal_reachable_t  The reachability matrix was successfully created   
 * @return NULL              The reachability matrix could not be constructed   
 *                                                                              
 * Given a list of local interfaces and remote interfaces from a                
 * single peer, build a reachability matrix between the two peers.              
 * This function does not select the best pairing of local and remote           
 * interfaces, but only a (comparable) reachability between any pair            
 * of local/remote interfaces.                                                  
 *                                                                              
 *                                                                              
 */                                                                             
typedef opal_reachable_t*                                                       
(*opal_reachable_base_module_reachable_fn_t)(opal_list_t *local_ifs,            
                                             opal_list_t *remote_ifs); 

opal_reachable_netlink_rt_lookup() is then returning 113.

This looks like no route has been found from netlink. I'm guessing it may return that when a non-loopback interface is trying to use loopback. Not sure if this is a bug or an intended feature.

Try using the weighted reachability component instead of netlink and see if that returns successful without lo.

What do you think? If vader is out of the btl list and tcp is being used on-host is it invalid to then use an interface list that leaves out "lo"? I tend to think "lo" is a loopback but not the only loopback and it should have worked with just "ib0". Thoughts?

I'm not sure what the valid operation should be - hopefully @bwbarrett or @jsquyres can chime in on this. In my perspective, vader should be used for loopback, but in the case it isn't, if ib0 is capable of loopback, it should be deemed reachable.

@wckzhang
Contributor Author

During the telecon meeting, we discussed the path forward here. I will change the reachability framework documentation to say that an interface will always be shown as reachable to itself. Then I will change the netlink component to show reachability if two interfaces are identical.

@markalle
Contributor

That sounds great to me. I agree with the earlier comment that vader is the logical thing to use for loopback, and on-host TCP isn't something we generally do. I just like having things like that working for debugging purposes.

Member

@bosilca bosilca left a comment


Please address my two comments before merging.

@@ -118,7 +114,7 @@ int mca_btl_tcp_add_procs( struct mca_btl_base_module_t* btl,
}

tcp_endpoint->endpoint_btl = tcp_btl;
rc = mca_btl_tcp_proc_insert(tcp_proc, tcp_endpoint);
rc = mca_btl_tcp_proc_insert(tcp_proc, tcp_endpoint, tcp_btl);
Member

the BTL is already stored in the endpoint (look at the line above).

Contributor Author

Okay thanks, I'll change that

OBJ_DESTRUCT(&tcp_proc->proc_lock);
}

static inline int mca_btl_tcp_proc_is_proc_left(opal_process_name_t a,
Member

What would be the qualifier for being a proc_left?

Contributor Author

proc_left is just so that when two ranks, a and b are trying to match interfaces, the graphs created will be identical. Thus, if on rank A, the bipartite flows from A -> B, then on rank B, the bipartite flows from A -> B as well. Thus, A is the left proc. It has no significance other than allowing these two processes to create identical graphs.

@wckzhang
Contributor Author

Mellanox failure seems unrelated.

@wckzhang
Contributor Author

Does anyone know how to retrigger Mellanox CI?

@rhc54
Contributor

rhc54 commented Jan 22, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@wckzhang
Contributor Author

wckzhang commented Jan 22, 2020

Can someone look into why Mellanox CI is hitting a git checkout issue?

@gpaulsen gpaulsen added this to the v5.0.0 milestone Jan 23, 2020
@gpaulsen
Member

bot:retest

@gpaulsen
Member

Hmmm, a retest 1 hour ago didn't even restart the Mellanox CI (Details on it still says: failed 15 hours ago in 1m 14s).

@rhc54
Contributor

rhc54 commented Jan 23, 2020

The bot command hasn't been updated to use the new Mellanox Azure CI

@rhc54
Contributor

rhc54 commented Jan 23, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jsquyres
Member

I've asked Mellanox if they can support the bot level commands (i.e., retest and mellanox:retest).

@artemry-nv

artemry-nv commented Jan 23, 2020

Regarding triggering Mellanox CI with the old keywords: Azure Pipelines by default supports only /azp run and some other similar comment triggers. I'm checking whether it is possible to trigger CI via webhooks and bind it to the old keywords.
Regarding the sporadic git checkout failures, I'm also investigating; it seems to be caused by some strange merge commits (e.g. 561f96b)

CC @amaslenn

@bwbarrett bwbarrett merged commit fc8c7a5 into open-mpi:master Jan 27, 2020
@jsquyres
Member

@artemry-mlnx Any news on being able to support the Open MPI-usual bot commands?

@artemry-nv

@jsquyres

  1. Regarding the old PR comment trigger support: I haven't found a way to customize the default Azure Pipelines ones (it seems webhook triggers in Azure Pipelines aren't implemented yet). I've asked our Microsoft Azure DevOps support contact for help on this.
  2. Regarding the sporadic git checkout failures: it seems to be a known Azure Pipelines problem which is not yet fixed.

I'm tracking both issues and will inform you of any progress.

@jsquyres
Member

Can we make a new issue to track this? (rather than commenting on an old/merged PR)

FWIW: the git clone error "reference is not a tree" is likely because AZP is trying to clone before Github has created the merge hash. There's a flag in the JSON that they're supposed to check to see whether Github has the merge hash ready yet. If it's not set, the clone will fail with this error. The Jenkins Github plugin had this problem for a while, too.

@artemry-nv

artemry-nv commented Jan 28, 2020

@jsquyres
Sure, submitted #7347 and #7348 - feel free to assign them to me for the investigation.
