Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable postgres SSL to fix random query timeout #9144

Conversation

xin-hedera
Copy link
Collaborator

@xin-hedera xin-hedera commented Aug 24, 2024

Description:

This PR fixes the random readonly query timeout issue in citus

  • Disable SSL
  • Enable postgres-util container
  • Reduce postgresql work_mem to 24MB to lower memory pressue
  • Add TEST_REPORTS_DIR env variable to all testkube tests

Related issue(s):

Fixes #8748
Fixes #9143

Notes for reviewer:

After much effort to diagnose and experiment, found disabling SSL solves the issue. Though I'm not clear about why SSL connection (via unix socket) can cause such issue, the changes here are much needed so we can move on to other tasks.

If needed, we can open a ticket upstream, however, the configuration itself (coordinator -> worker connection pgbouncer pooling) isn't supported by stackgres.

The changes which enable postgres-util container is a must to gain pgbouncer info. Without it, it's impossible to view a lot of pgbouner stats. The problem is pgbouner can only be connected via unix socket with psql command, psql -p 6432 -U pgbouncer pgbouncer. We can do so in the patroni container. However a lot of commands will just fail to show results, e.g., show clients, complaining about more is not available as it's not installed in the patroni container.

Verified for the upgrade path from <= 0.112.0-rc3 helm charts with SSL defaults to enabled, to charts in this PR, no manual steps are needed.

Checklist

  • Documented (Code comments, README, etc.)
  • Tested (unit, integration, etc.)

Signed-off-by: Xin Li <xin@swirldslabs.com>
Signed-off-by: Xin Li <xin@swirldslabs.com>
@xin-hedera xin-hedera added this to the 0.113.0 milestone Aug 24, 2024
@xin-hedera xin-hedera self-assigned this Aug 24, 2024
@xin-hedera xin-hedera linked an issue Aug 24, 2024 that may be closed by this pull request
@xin-hedera xin-hedera marked this pull request as draft August 24, 2024 02:53
@xin-hedera xin-hedera added bug Type: Something isn't working enhancement Type: New feature performance database Area: Database labels Aug 24, 2024
Copy link

codecov bot commented Aug 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.54%. Comparing base (ca7e24b) to head (c7a37ee).
Report is 27 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #9144      +/-   ##
============================================
+ Coverage     92.14%   92.54%   +0.40%     
+ Complexity     7713     6995     -718     
============================================
  Files           933      901      -32     
  Lines         30823    29525    -1298     
  Branches       3762     3723      -39     
============================================
- Hits          28402    27325    -1077     
+ Misses         1575     1436     -139     
+ Partials        846      764      -82     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jascks
jascks previously approved these changes Aug 24, 2024
Copy link
Contributor

@jascks jascks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@jnels124 jnels124 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In previous pr to disable postgres node exporter container, we left configuration that allocates I beleive 100m cpu. Can we remove that and add in specific resource constraints for postgres-util?

@xin-hedera
Copy link
Collaborator Author

In previous pr to disable postgres node exporter container, we left configuration that allocates I beleive 100m cpu. Can we remove that and add in specific resource constraints for postgres-util?

My recollection of when I enabled it manually, the container had no resources section at all. Perhaps we should double check and give a reasonable cpu limit. I’d prefer getting it in a separate PR still.

@xin-hedera xin-hedera marked this pull request as ready for review August 26, 2024 14:39
@xin-hedera xin-hedera requested a review from a team August 26, 2024 14:40
edwin-greene
edwin-greene previously approved these changes Aug 26, 2024
Copy link
Contributor

@edwin-greene edwin-greene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

jnels124
jnels124 previously approved these changes Aug 26, 2024
Copy link
Contributor

@jnels124 jnels124 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

steven-sheehy
steven-sheehy previously approved these changes Aug 26, 2024
mgoelswirlds
mgoelswirlds previously approved these changes Aug 26, 2024
@xin-hedera
Copy link
Collaborator Author

In previous pr to disable postgres node exporter container, we left configuration that allocates I beleive 100m cpu. Can we remove that and add in specific resource constraints for postgres-util?

My recollection of when I enabled it manually, the container had no resources section at all. Perhaps we should double check and give a reasonable cpu limit. I’d prefer getting it in a separate PR still.

Ticket here: #9166

executor_slow_start_interval

Signed-off-by: Xin Li <xin@swirldslabs.com>
Signed-off-by: Xin Li <xin@swirldslabs.com>
Copy link

sonarcloud bot commented Aug 28, 2024

Copy link
Contributor

@jascks jascks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@steven-sheehy steven-sheehy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xin-hedera xin-hedera merged commit a56eae8 into main Aug 28, 2024
37 checks passed
@xin-hedera xin-hedera deleted the 9143-some-readonly-queries-time-out-in-citus-with-pgbouncer-pooling branch August 28, 2024 15:24
xin-hedera added a commit that referenced this pull request Aug 28, 2024
- Disable SSL
- Enable postgres-util container
- Reduce postgresql work_mem to 24MB to lower memory pressue
- Add TEST_REPORTS_DIR env variable to all testkube tests

Signed-off-by: Xin Li <xin@swirldslabs.com>
@steven-sheehy steven-sheehy added the citus Area: Citus label Aug 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Type: Something isn't working citus Area: Citus database Area: Database enhancement Type: New feature performance
Projects
None yet
6 participants