This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

[Proposal] TensorBoard fixes to discuss #266

Closed
wants to merge 2 commits into from


QuentinDuval
Contributor

This PR is a follow-up to a few issues discussed in #264 regarding TensorBoard:

  1. The current implementation creates a SummaryWriter in every process, even though only the primary process logs for the time being. The unused writers leave empty files behind.
  2. The current implementation logs to the checkpoint directory instead of the directory set in the configuration dedicated to TensorBoard.
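A minimal sketch of the idea behind point 1, not the code in this PR: only the primary process builds a writer, so the other ranks never create an event file. The helper name `make_writer_on_primary` and the `writer_factory` parameter are hypothetical, and treating rank 0 as the primary process is an assumption.

```python
from typing import Any, Callable, Optional


def make_writer_on_primary(
    rank: int,
    log_dir: str,
    writer_factory: Optional[Callable[[str], Any]] = None,
) -> Optional[Any]:
    """Return a TensorBoard writer on the primary rank, None elsewhere.

    Assumption: rank 0 is the primary process. `writer_factory` is a
    hypothetical hook that makes the function testable without torch.
    """
    if rank != 0:
        # Non-primary ranks never instantiate a writer, so they cannot
        # leave empty event files behind.
        return None
    if writer_factory is None:
        # Imported lazily so that only the primary rank needs TensorBoard.
        from torch.utils.tensorboard import SummaryWriter

        writer_factory = SummaryWriter
    return writer_factory(log_dir)
```

A hook would then skip logging whenever the returned writer is `None`.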

Point 2 is not so simple, however. To distinguish two jobs running on the same configuration, I appended the SLURM JOB_ID to the TensorBoard directory. This ID is unique (it distinguishes jobs), stable (it stays the same after pre-emption), and useful for tracing which job produced which log.

But I do not know whether an equivalent trick exists for other job schedulers. The advantage of using the checkpoint directory was that it was unique to a job and stable, so TB knew where to log even after pre-emption. I am therefore a bit unsure about the consequences of the second point.
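The directory naming discussed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `tensorboard_log_dir` is a hypothetical helper, the job ID is read from the `SLURM_JOB_ID` environment variable that SLURM sets for a job, and falling back to the bare configured directory on other schedulers is an assumption that leaves the open question unresolved.

```python
import os
from typing import Mapping, Optional


def tensorboard_log_dir(
    base_dir: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Build a log directory that is stable across pre-emption.

    `base_dir` stands for the directory from the TensorBoard configuration.
    `env` defaults to the process environment; passing a mapping makes the
    function easy to test.
    """
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID")
    if job_id:
        # The job ID stays the same when SLURM requeues a pre-empted job,
        # so the resumed run writes to the same directory.
        return os.path.join(base_dir, job_id)
    # Assumption: without a known job ID, use the configured directory as-is.
    # Two jobs sharing a configuration would then share a directory.
    return base_dir
```

For example, with `SLURM_JOB_ID=123` and a configured directory of `tb_logs`, the writer would log to `tb_logs/123`, both before and after pre-emption.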

…the checkpoint folder and append the SLURM job_id when scheduling on SLURM to avoid overriding the same job
…ent TB hook will be instantiated in several workers, leading to empty TB files
@facebook-github-bot added the CLA Signed label Apr 1, 2021
@QuentinDuval changed the title from "TensorBoard fixes to discuss" to "[Proposal] TensorBoard fixes to discuss" Apr 1, 2021
@QuentinDuval deleted the tb_hook branch July 19, 2021 14:02