Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not add std::chrono::milliseconds to std::chrono::time_point #40

Merged
merged 1 commit into from
Oct 10, 2017

Conversation

shekharhimanshu
Copy link
Contributor

@shekharhimanshu shekharhimanshu commented Sep 29, 2017

std::chrono::time_point is unit less in itself and adding that to std::chrono milliseconds is undefined.
std::chrono::duration_cast should be used when comparing with std::chrono::milliseconds.

In my tests, there was a case when the SDK would sometimes hang for 20-30 mins due to this bug.

std::chrono::time_point is unit less in itself and adding that to std::chrono milliseconds is undefined.
std::chrono::duration_cast should be used when comparing.
@chaurah
Copy link
Contributor

chaurah commented Oct 2, 2017

Hi @shekharhimanshu,
Thank you for pointing this out. I am going to run our internal testing on this commit to make sure everything else is working as expected before merging. Could you possibly also include the test that was causing the SDK to hang? If you're unable to add code, a description of what you are doing in your tests would be helpful as well. We are working on significantly improving the testing for this SDK and as always, your suggestions will be extremely useful.

Rahul

@shekharhimanshu
Copy link
Contributor Author

We found this bug while doing scenario testing, so I dont have a test case ready. But the gist is as follows. Lets see the loop inside MbedTLSConnection::ReadInternal:

....
int ret;
size_t total_read_length = 0;
size_t remaining_bytes_to_read = size_bytes_to_read;
auto timeout = std::chrono::system_clock::now() + tls_read_timeout_;
do {
    // This read will timeout after IOT_SSL_READ_TIMEOUT if there's no data to be read
    ret = mbedtls_ssl_read(&ssl_, &buf[buf_read_offset], remaining_bytes_to_read);
    if (ret > 0) {
        buf_read_offset += ret;
        total_read_length += ret;
        remaining_bytes_to_read -= ret;
        size_read_bytes_out = total_read_length;
    } else if (ret != MBEDTLS_ERR_SSL_WANT_READ && ret != MBEDTLS_ERR_SSL_WANT_WRITE
        && ret != MBEDTLS_ERR_SSL_TIMEOUT) {
        return ResponseCode::NETWORK_SSL_READ_ERROR;
    }
} while (remaining_bytes_to_read > 0 && timeout < std::chrono::system_clock::now());     
... 

What we found was sometimes, this loop would take forever to finish (in our test 20-30 mins). The terminating condition for this loop is either all data is read (remaining_bytes_to_read <= 0) or timeout occurs.
Our test tried to simulate a condition where MQTT client's disconnect and then connect is called while data is already being read from some other thread. In such a case the connect would hang and return only after 20 - 30 mins (way more than our tls_read/write/connect timeout of 60 sec).

So whats happening is that timeout < std::chrono::system_clock::now() condition is valid for a longer time than the assigned tls_read_timeout_ since timeout here is calculated incorrectly (std::chrono::system_clock::now() + tls_read_timeout_) and remaining_bytes_to_read would never decrease. The latter might be due the fact that disconnect calls mbedtls_ssl_close_notify and mbedtls_ssl_read after that returns negative ret value causing the value of remaining_bytes_to_read to not change.

I wonder if we should improve the error handing as well by maybe catching MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY. But thats the story of some other PR.

@chaurah
Copy link
Contributor

chaurah commented Oct 5, 2017

That is an interesting find. Maybe it makes more sense to implement the timeout differently using a Condition variable or something. I have been debating re-writing the entire Network Layer in a different way to be honest. I don't like the polling system we have going here. Makes more sense to have the Network layer in a separate thread that calls into the SDK. I just hesitate to break what is a relatively stable SDK.

Thank for the information, this is very helpful.

@vareddy-zz vareddy-zz merged commit 19d815a into aws:master Oct 10, 2017
@vareddy-zz
Copy link
Contributor

The fix has been merged into master.

@shekharhimanshu
Copy link
Contributor Author

Thanks!

@shekharhimanshu shekharhimanshu deleted the fix-mbedtls-timeout branch October 11, 2017 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants