
Reported completed simulation time is 0s when cancelling simulation being run in serial #599

deanchester opened this issue Mar 26, 2021 · 1 comment

@deanchester

When you cancel SST (running without MPI or threads) during a run, it reports that the simulation is complete and that the simulated time is 0 s, even though the simulated time is not 0, like so:

 ^CEMERGENCY SHUTDOWN (0,0)!
 # Simulated time:                  1.06299 s
 EMERGENCY SHUTDOWN Complete (0,0)!
 Simulation is complete, simulated time: 0 s

I am using SST-Core commit 1c68395 on OS X (High Sierra) compiled with Clang and OpenMPI 4.

I tried 3 different run configurations (MPI, Serial, Threaded); here is the output when exiting mid-simulation:

MPI:

deangchester@Deans-MBP-17:models-sst-11/exascale_applications ‹master*›$ mpirun -np 4 sst tealeaf_problem3_HE_64.py
36
Allreduce: ranks 288, loop 1, 1 double(s), latency 16926.795 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.262 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
^C[Deans-MBP-17:17799] *** An error occurred in MPI_Allreduce
[Deans-MBP-17:17799] *** reported by process [4218290177,1]
[Deans-MBP-17:17799] *** on communicator MPI_COMM_WORLD
[Deans-MBP-17:17799] *** MPI_ERR_TRUNCATE: message truncated
[Deans-MBP-17:17799] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Deans-MBP-17:17799] ***    and potentially your MPI job)
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.492 us
2DHalo: total time 6504.262 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
EMERGENCY SHUTDOWN (1,0)!
# Simulated time:                  252.551 ms

Serial:

deangchester@Deans-MBP-17:models-sst-11/exascale_applications ‹master*›$ sst tealeaf_problem3_HE_64.py
36
Allreduce: ranks 288, loop 1, 1 double(s), latency 16926.795 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.262 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.492 us
2DHalo: total time 6504.262 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
^CEMERGENCY SHUTDOWN (0,0)!
# Simulated time:                  1.06299 s
EMERGENCY SHUTDOWN Complete (0,0)!
Simulation is complete, simulated time: 0 s

Threaded:

deangchester@Deans-MBP-17:models-sst-11/exascale_applications ‹master*›$ sst -n 4 tealeaf_problem3_HE_64.py
36
Allreduce: ranks 288, loop 1, 1 double(s), latency 16926.795 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.262 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
^CEMERGENCY SHUTDOWN (0,0)!
# Simulated time:                  148.404 ms
^C

In the case of the threaded run it didn't exit, and I had to issue another kill interrupt to SST-Core to get it to exit - I left it for approximately 3 minutes before issuing the second interrupt.

This isn't a major problem; I just thought I'd raise it in case any other users come across a similar issue.

@feldergast
Contributor

What's happening here is that the serial run is able to run through all the "post-run" code, whereas the parallel runs are not able to do that. The simulated time up to the point of the interrupt is printed immediately, then the simulation tries to finish. Only the serial job gets that far and prints the reported end-of-simulation time, which is not reported correctly because of the interrupt. You'll notice that a couple of lines above "Simulation is complete, simulated time: 0 s" you have "# Simulated time: 1.06299 s". That's the time at interrupt.
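
For anyone wanting a picture of that flow, here is a minimal, self-contained C++ sketch (not SST-Core's actual code; every name in it is made up for illustration) of how an emergency-shutdown path can print the time reached at the interrupt while the post-run code prints a separate end-of-run value that was never recorded because the normal completion path was skipped:

#include <atomic>
#include <chrono>
#include <csignal>
#include <cstdio>
#include <thread>

// Hypothetical stand-ins for the two values being reported; these are
// illustrative only and do not correspond to real SST-Core symbols.
static std::atomic<bool> interrupted{false};
static double current_time = 0.0; // simulated time advanced by the main loop
static double end_time     = 0.0; // only written on normal completion

void handle_sigint(int) { interrupted.store(true); } // just set a flag

static void run_simulation() {
    for (long step = 0; step < 1000000; ++step) {
        if (interrupted.load()) {
            // Emergency path: report the time reached so far and return
            // before the normal end-of-run bookkeeping happens.
            std::printf("EMERGENCY SHUTDOWN!\n");
            std::printf("# Simulated time: %g s\n", current_time);
            return;
        }
        current_time += 1e-6; // pretend each step advances 1 us of simulated time
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
    end_time = current_time; // normal completion is the only place this is set
}

int main() {
    std::signal(SIGINT, handle_sigint);
    run_simulation();
    // Post-run code: always prints end_time. After a Ctrl-C it was never
    // recorded, so this line reads 0 even though the emergency-shutdown
    // message above showed the real time at the interrupt.
    std::printf("Simulation is complete, simulated time: %g s\n", end_time);
    return 0;
}

Run it and hit Ctrl-C part-way through: the shutdown message shows a nonzero time while the final line reports 0 s, matching the serial output above.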
