ib_send_bw hangs #7

Closed
patrickmacarthur opened this issue Aug 22, 2016 · 3 comments
Comments

@patrickmacarthur

ib_send_bw hangs when run against the latest master (commit 6731fa60c32c9d4a73a27e0737a4fc99fe48d7c4) under kernel 3.17.8, using perftest-2.4-1.el7 on Scientific Linux 7.2. The hang is purely in userspace.

Stack trace on server:

#0  0x00007ffff665752e in siw_poll_cq_mapped () from /lib64/libsiw-rdmav2.so
#1  0x00000000004127b4 in ibv_poll_cq (wc=0x624e50, num_entries=16, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1277
#2  run_iter_bw_server (ctx=ctx@entry=0x7fffffffdd70, user_param=user_param@entry=0x7fffffffde90) at src/perftest_resources.c:2699
#3  0x0000000000403677 in main (argc=<optimized out>, argv=<optimized out>) at src/send_bw.c:429

Stack trace on client:

#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:26
#1  0x00007ffff665752e in siw_poll_cq_mapped (ibcq=0x624a90, num_entries=<optimized out>, wc=0x7fffffffdb60) at src/siw_uverbs.c:478
#2  0x00000000004050d0 in ibv_poll_cq (wc=0x7fffffffdb60, num_entries=1, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1277
#3  rdma_read_keys (rem_dest=rem_dest@entry=0x625e60, comm=comm@entry=0x7fffffffdc40) at src/perftest_communication.c:407
#4  0x00000000004068c3 in ctx_hand_shake (comm=comm@entry=0x7fffffffdc40, my_dest=my_dest@entry=0x624b50, 
    rem_dest=rem_dest@entry=0x625e60) at src/perftest_communication.c:1103
#5  0x00000000004036f0 in main (argc=<optimized out>, argv=<optimized out>) at src/send_bw.c:440

This is reproducible about 90% of the time.

Please let me know if you need any more information to reproduce the issue.

@patrickmacarthur
Author

After some more digging: this issue only occurs for message sizes of about 32 bytes or less.

@BernardMetzler
Member

So far, I cannot reproduce it. The sender might be overrunning the receiver with SENDs if the receiver cannot keep up with pre-posting RECEIVEs. A SEND to an empty RQ would break the connection. Do you see any 'RX ERROR' messages in dmesg?
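
To make the overrun hypothesis above concrete, here is a toy model in plain C. It is only a sketch, not siw or perftest code: the function name `first_overrun`, the tick-based timing, and all parameters are hypothetical and chosen for illustration. It starts with a fully pre-posted receive queue, lets the sender issue a fixed number of SENDs per tick, lets the receiver re-post a fixed number of RECEIVEs per tick, and reports the first SEND that would find the RQ empty.

```c
#include <stddef.h>

/*
 * Toy model of a receive queue (RQ) being drained by incoming SENDs.
 * Hypothetical illustration only; it does not call libibverbs.
 *
 * rq_depth         : RECEIVEs pre-posted before traffic starts (also the cap)
 * sends_per_tick   : SENDs arriving per time step
 * reposts_per_tick : RECEIVEs the receiver manages to re-post per time step
 * total_sends      : number of SENDs in the whole run
 *
 * Returns the zero-based index of the first SEND that arrives while the
 * RQ is empty, or -1 if every SEND finds a posted RECEIVE.
 */
long first_overrun(unsigned rq_depth, unsigned sends_per_tick,
                   unsigned reposts_per_tick, unsigned total_sends)
{
    unsigned posted = rq_depth; /* RECEIVEs currently available */
    unsigned sent = 0;          /* SENDs consumed so far */

    if (sends_per_tick == 0)
        return -1;              /* nothing is ever sent */

    while (sent < total_sends) {
        /* Sender side: each SEND consumes one posted RECEIVE. */
        for (unsigned i = 0; i < sends_per_tick && sent < total_sends; i++) {
            if (posted == 0)
                return (long)sent;  /* SEND hits an empty RQ */
            posted--;
            sent++;
        }

        /* Receiver side: re-post, but never beyond the queue depth. */
        unsigned room = rq_depth - posted;
        posted += (reposts_per_tick < room) ? reposts_per_tick : room;
    }
    return -1;
}
```

In this model, `first_overrun(4, 2, 2, 100)` never overruns because the receiver keeps pace, while `first_overrun(4, 3, 1, 100)` reports SEND number 5 as the first one to hit an empty RQ. Whether anything like this happens in the reported hang is not established here; checking dmesg for 'RX ERROR' is the way to find out.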

@patrickmacarthur
Author

I looked at dmesg and realized that there appears to be a firmware bug in the underlying NIC. I was able to work around the bug by disabling the relevant offload feature on the NIC, and the test now runs fine.

There appears to be a different issue with the RDMA READ bandwidth test but I don't have time to debug it now. I will open a new ticket when I am able to gather more information.
