Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Tokio Console support #517

Merged
merged 14 commits into from
May 22, 2024

Conversation

b-yap
Copy link
Contributor

@b-yap b-yap commented Apr 29, 2024

This does not completely close #512, but it helps with tracking.

But if we can see which threads have been running the longest; which ones are idle for a long time, it could help determine where/when a vault gets stuck.

The tokio-console is described as:

a debugging and profiling tool for asynchronous Rust applications, which collects and displays in-depth diagnostic data on the asynchronous tasks, resources, and operations in an application.

^ From using tokio-console, a zombie task was found.
When the vault restarts, it shuts down all running tasks except for 1:

tokio::spawn(async move {
let me_clone = Arc::clone(&me);
loop {
pause_process_in_secs(interval_in_seconds).await;
me_clone._resubmit_transactions_from_cache().await;
}
});



Aside from this, I found out that the stream also gets stuck:
match connector.tcp_stream.read(&mut buff_for_reading).await {

Requires a timeout to trigger reconnection.


How to begin the review:

I have added comments to help with the review.

  1. all affected toml files:
    • tracing feature of tokio is necessary for tokio-console
    • console-subscriber dependency is necessary for tokio-console
  2. clients/wallet/src/stellar_wallet.rs:
    • I needed a channel to stop the task, the reason for a new field of StellarWallet:
            /// a sender to 'stop' a scheduled resubmission task
        pub(crate) resubmission_end_signal: Option<mpsc::Sender<()>>,
  3. clients/wallet/src/resubmissions.rs:
    • Added a fn stop_periodic_resubmission_of_transactions(&mut self)
    • Update inside fn start_periodic_resubmission_of_transactions_from_cache(...):
      • create a new channel and store the sender in StellarWallet's resubmission_end_signal field.
      • inside the spawned task, break the loop once the receiver receives a shutdown signal
             loop {
           // a shutdown message was sent. Stop the loop.
           if let Some(_) = receiver.recv().await {
           	break;
           }
        ...		
  4. clients/vault/src/system.rs:
    • calls the fn stop_periodic_resubmission_of_transactions(...) from the wallet field of VaultService:
           /// shuts down the resubmission task running in the background
       async fn shutdown_wallet(&self) {
        ...
        let mut wallet = self.stellar_wallet.write().await;
        wallet.stop_periodic_resubmission_of_transactions().await;
       ...
      }
    • calls the fn shutdown_wallet(...) when the service stops.
          	async fn start(&mut self) -> Result<(), ServiceError<Error>> {
        let result = self.run_service().await;
      
        self.shutdown_wallet().await;
        ...
  5. clients/stellar-relay-lib/src/connection/connector/message_reader.rs:
    • introduce READ_TIMEOUT_IN_SECS; how long to wait on reading from the stream.
    • add a timeout to reading the stream
           timeout(
        	 Duration::from_secs(READ_TIMEOUT_IN_SECS),
        	 connector.tcp_stream.read(&mut buff_for_reading),
            ).await
  6. clients/README.md:
    • documentation on how to use tokio-console

@b-yap b-yap force-pushed the 512-investigate-occasional-high-cpu-usage-of-vault-clients branch from 74c01ca to 16cc8e1 Compare May 9, 2024 08:55
b-yap added 5 commits May 13, 2024 14:57
add `tracing` feature of tokio, and the console-subscriber
add necessary logs;
add timeout in `read_message_from_stellar` because it gets stuck
@b-yap b-yap force-pushed the 512-investigate-occasional-high-cpu-usage-of-vault-clients branch from db588a5 to 87ad4d5 Compare May 13, 2024 06:57
@@ -7,7 +7,7 @@ edition = "2021"
[dependencies]
clap = { version = "4.0.17", features = ["derive"]}
hex = "0.4.3"
tokio = { version = "1.8", features = ["rt-multi-thread", "macros", "time"] }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latest tokio vresion is 1.37.

@@ -31,7 +31,7 @@ log = "0.4.0"
serde = { version = "1.0.136", features = ["derive"] }
serde_json = "1.0.71"
thiserror = "1.0"
tokio = { version = "1.0", features = ["full"] }
tokio = { version = "1.37", features = ["full"] }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this is the "lost its waker" issue in tokio, fixed in the newer versions.

@@ -16,7 +16,7 @@ wallet = { path = "../wallet", features = ["testing-utils"] }

[dependencies]
hex = "0.4.3"
log = {version = "0.4.14"}
tracing = { version = "0.1", features = ["log"] }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

helps with tokio-console and instrumentation:

the console-subscriber crate in this repository contains an implementation of the instrumentation-side API as a tracing-subscriber Layer, for projects using Tokio and tracing.

@@ -0,0 +1,2 @@
[build]
rustflags = ["--cfg", "tokio_unstable"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make tokio-console work::

in order to collect task data from Tokio, the tokio_unstable cfg must be enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the Rust stable version, there hasn't been any problems. The tokio-console command displays just fine, although I was not able to see the histogram.

The worst it could do would be returning a blank view.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work for me, unfortunately I have to explicitly pass the RUSTFLAGS="--cfg tokio_unstable" when trying to run or build the client.

I assume it's because I do have a ~/.cargo/config.toml file where I already specify some rust flags

[target.aarch64-apple-darwin]
rustflags = ["-C", "link-arg=-fuse-ld=/opt/homebrew/opt/llvm/bin/ld64.lld"]

and this might override the workspace configuration.

I confirmed this by removing my 'global' config.toml and then recompiling. This worked.

Generally I would prefer if we could enable this tokio-console thing with an extra feature flag and have it disabled by default. Do you think this would be possible @b-yap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See e77d7ac

I tried to set it through [env] .cargo/config.toml, and it didn't work. All RUSTFLAGS, CARGO_ENCODED_RUSTFLAGS and CARGO_BUILD_FLAGS won't work.

I also tried to do it through a build script and didn't work:

fn main(){
    cfg_if::cfg_if! {
        if #[cfg(feature = "allow-debugger")] {
            println!("cargo:rustc-env=RUSTFLAGS=\"--cfg tokio_unstable\"");
        }
    }
}

and
println!("cargo:rustc-flags=\"--cfg tokio_unstable\"");
as explained in rust-lang/cargo#10141.

In the end I updated the doc to tell the user to set the rustflags themselves.

"verify_auth(): remote sequence: {}, auth message sequence: {}",
remote_info.sequence(),
auth_msg.sequence
);

let auth_msg_xdr = auth_msg.to_base64_xdr();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to remove this because the xdr was huge and may not be helpful at all.

other => {
log::trace!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are only interested on messages that will be processed by the vault.
Logging should be done over there than here.

@@ -147,6 +147,7 @@ async fn start() -> Result<(), ServiceError<Error>> {

#[tokio::main]
async fn main() {
console_subscriber::init();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -679,15 +682,6 @@ impl VaultService {

tasks.append(&mut replace_tasks);

tasks.push((
Copy link
Contributor Author

@b-yap b-yap May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱 removing this duplicate, as it is also found in

(
"Parachain Block Listener",
run(active_block_listener(
self.spacewalk_parachain.clone(),
issue_event_tx,
replace_event_tx,
)),
),

inside fn create_initial_tasks(...


Originally it runs after Replace Cancellation Scheduler and before Redeem Request Listener:

),
(
"Parachain Block Listener",
run(active_block_listener(
self.spacewalk_parachain.clone(),
issue_event_tx.clone(),
replace_event_tx.clone(),
)),
),
(
"Redeem Request Listener",
run(listen_for_redeem_requests(
self.shutdown.clone(),
self.spacewalk_parachain.clone(),
self.vault_id_manager.clone(),
self.config.payment_margin_minutes,
oracle_agent.clone(),
)),
),
(
"Bridge Metrics Listener",

@b-yap b-yap marked this pull request as ready for review May 13, 2024 10:56
@b-yap b-yap requested a review from a team May 13, 2024 10:56
Copy link
Member

@ebma ebma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me overall, nice job and great feature @b-yap 🚀
I left some comments.

clients/vault/src/system.rs Outdated Show resolved Hide resolved
clients/vault/src/system.rs Outdated Show resolved Hide resolved
clients/wallet/src/resubmissions.rs Outdated Show resolved Hide resolved
clients/wallet/src/resubmissions.rs Outdated Show resolved Hide resolved
@@ -0,0 +1,2 @@
[build]
rustflags = ["--cfg", "tokio_unstable"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work for me, unfortunately I have to explicitly pass the RUSTFLAGS="--cfg tokio_unstable" when trying to run or build the client.

I assume it's because I do have a ~/.cargo/config.toml file where I already specify some rust flags

[target.aarch64-apple-darwin]
rustflags = ["-C", "link-arg=-fuse-ld=/opt/homebrew/opt/llvm/bin/ld64.lld"]

and this might override the workspace configuration.

I confirmed this by removing my 'global' config.toml and then recompiling. This worked.

Generally I would prefer if we could enable this tokio-console thing with an extra feature flag and have it disabled by default. Do you think this would be possible @b-yap?

@b-yap b-yap requested a review from ebma May 21, 2024 05:13
Copy link
Member

@ebma ebma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested it locally again, can confirm it works with the new feature flag.

Great changes, thanks a lot @b-yap 👌

@b-yap b-yap merged commit 93ddbce into main May 22, 2024
2 checks passed
@b-yap b-yap deleted the 512-investigate-occasional-high-cpu-usage-of-vault-clients branch May 22, 2024 11:52
@b-yap
Copy link
Contributor Author

b-yap commented May 22, 2024

@ebma Would be great if we have the vaults running with this feature. I bet it'll look very different (and interesting).

@ebma
Copy link
Member

ebma commented May 22, 2024

True, but I'm not sure if we can easily connect to the clients with tokio-console if they are running in a docker container.

@zoveress in this PR we added support for a profiling tool that lets us investigate resource consumption of tasks within our process. As you can see here, it's fairly simple to run. As far as I understand, the changes we made to the client will make it expose a new endpoint (by default) on port 6669 that is used by the tokio-console tool for accessing the debug information. Using this setup would be very easy to do in the EC2 instances but maybe not in our Kubernetes cluster. @zoveress do you have an idea how we could be able to access it also on Kubernetes?

This is not super important but would be nice to have.

@zoveress
Copy link

Is the endpoint password secured?

@ebma
Copy link
Member

ebma commented May 22, 2024

No it's not and I think it's only supposed to be used locally. Can we expose it to the GitLab runner only and access it from there somehow?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate occasional high CPU usage of vault clients
3 participants