
Hotstart overwrites previous output #56

Open
saltynexus opened this issue Oct 10, 2019 · 15 comments

Comments

@saltynexus

Bug description

During hotstart, GPUSPH fails to write output to the specified directory.

Summary

I'm currently running GPUSPH on a cluster, which uses SLURM scheduling. The cluster scheduling is configured to give priority to certain users. In one instance, my job was killed during execution. I therefore attempted to resume the job using a hotstart file. GPUSPH successfully read the hotstart file and the simulation carried on as expected.

After the job finished, I checked my output directory and noticed that there was no output generated following the hotstart. The only output present was that associated with the initial simulation, prior to the job being killed.

This is the command that I executed in the initial job submission

./GPUSPH --deltap 0.005 --dir /home/user/nfs_fs02/high_res

This is the command that I executed after the job was killed to resume

./GPUSPH --deltap 0.005 --dir /home/user/nfs_fs02/high_res --resume /home/user/nfs_fs02/high_res/data/hot_00082.bin

The simulation is a modified version of the "WaveTank" example test case provided with the GPUSPH source code downloaded from here (github master branch). The only thing I changed was removing the slope from the experiment. I've run it in the past and it works as intended, so I'm 99.9% sure it has nothing to do with the specific application.

I suspect that the bug might be related to me specifying the output directory (non default). Somewhere in the hotstart procedure, it fails to properly identify that output is requested and where it is to be generated.

Details

Here is my error log

WARNING: dt 5e-05 will be used only for the first iteration because adaptive dt is enabled
Successfully restored hot start file 1 / 1
HotFile( version=1, pc=17086279, bc=1)
Restarting from t=8.20035, iteration=139290, dt=5.90909e-05
WARNING: simulation has rigid bodies and/or moving boundaries, resume will not give identical results

and here is my output log

 * No devices specified, falling back to default (0)...
GPUSPH version v5.0+custom
Release version without fastmath for compute capability 7.5
Chrono : enabled
HDF5   : enabled
MPI    : disabled
Catalyst : disabled
Compiled for problem "MY_WaveTank"
[Network] rank 0 (1/1), host 
 tot devs = 1 (1 * 1)

paddle_amplitude (radians): 0.218669
Info stream: GPUSPH-776718
Initializing...
Water level not set, autocomputed: 0.4525
Max particle speed not set, autocomputed from max fall: 2.97136
Expected maximum shear rate: 3076.92 1/s
dt = 5e-05 (CFL conditions from soundspeed: 6.5e-05, from gravity 0.00514816, from viscosity 5.28125)
Using computed max neib list size 128
Using computed neib bound pos 127
Artificial viscosity epsilon is not set, using default value: 4.225000e-07
Problem calling set grid params
Influence radius / neighbor search radius / expected cell side	: 0.013 / 0.013 / 0.013
Autocomputed SPS Smagorinsky factor 3.6e-07 from C_s = 0.12, ∆p = 0.005
Autocomputed SPS isotropic factor 1.1e-07 from C_i = 0.0066, ∆p = 0.005
 - World origin: 0 , 0 , 0
 - World size:   12 x 1.2 x 1
 - Cell size:    0.0130011 x 0.0130435 x 0.0131579
 - Grid size:    923 x 92 x 76 (6,453,616 cells)
 - Cell linearization: y,z,x
 - Dp:   0.005
 - R0:   0.005
Generating problem particles...
Hot starting from /home/user/nfs_fs02/high_res/data/hot_00082.bin...
VTKWriter will write every 0.1 (simulated) seconds
HotStart checkpoints every 0.1 (simulated) seconds
	will keep the last 8 checkpoints
Allocating shared host buffers...
Numbodies : 1
Numforcesbodies : 0
numOpenBoundaries : 0
  allocated 1.27 GiB on host for 17,086,280 particles (17,086,279 active)
read buffer header: Position
read buffer header: Velocity
read buffer header: Info
read buffer header: Hash
Restoring body #0 ...
RB First/Last Index:
Preparing the problem...
Body: 0
	 Cg grid pos: 13 46 25
	 Cg pos: -0.00144029 -0.00652174 0.00613915
 - device at index 0 has 17,086,279 particles assigned and offset 0
Integrator predictor/corrector instantiated.
Starting workers...
number of forces rigid bodies particles = 0
thread 0x2b93acd3c700 device idx 0: CUDA device 0/1, PCI device 0000:1b:00.0: GeForce RTX 2080 Ti
Device idx 0: free memory 10821 MiB, total memory 10989 MiB
Estimated memory consumption: 400B/particle
Device idx 0 (CUDA: 0) allocated 0 B on host, 6.1 GiB on device
  assigned particles: 17,086,279; allocated: 17,086,280
GPUSPH: initialized
Performing first write...
Letting threads upload the subdomains...
Thread 0 uploading 17086279 Position items (260.72 MiB) on device 0 from position 0
Thread 0 uploading 17086279 Velocity items (260.72 MiB) on device 0 from position 0
Thread 0 uploading 17086279 Info items (130.36 MiB) on device 0 from position 0
Thread 0 uploading 17086279 Hash items (65.18 MiB) on device 0 from position 0
Entering the main simulation cycle
Simulation time t=8.200351e+00s, iteration=139,290, dt=5.909090e-05s, 17,086,279 parts (0, cum. 0 MIPPS), maxneibs 83+0
Simulation time t=8.300006e+00s, iteration=140,977, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 91+0
Simulation time t=8.400047e+00s, iteration=142,670, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 91+0
Simulation time t=8.500029e+00s, iteration=144,362, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 91+0
Simulation time t=8.600003e+00s, iteration=146,054, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 92+0
Simulation time t=8.700042e+00s, iteration=147,747, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=8.800022e+00s, iteration=149,439, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=8.900055e+00s, iteration=151,134, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.000036e+00s, iteration=152,826, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.100010e+00s, iteration=154,518, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.200050e+00s, iteration=156,211, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.300029e+00s, iteration=157,903, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.400006e+00s, iteration=159,595, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.500047e+00s, iteration=161,288, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Simulation time t=9.600018e+00s, iteration=162,980, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Simulation time t=9.700022e+00s, iteration=164,674, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Simulation time t=9.800039e+00s, iteration=166,367, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Simulation time t=9.900003e+00s, iteration=168,059, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Simulation time t=1.000004e+01s, iteration=169,752, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Elapsed time of simulation cycle: 3.7e+04s
Peak particle speed was ~2.30357 m/s at 9.50005 s -> can set maximum vel 2.5 for this problem
Simulation end, cleaning up...
Deallocating...

The "git_branch.txt" output is

v5.0+custom
* master ec5e7b1 [origin/master] Further do generation fixes

The "make_show.txt" output is

GPUSPH version:  v5.0+custom
Platform:        Linux
Architecture:    x86_64
Current dir:     /home/user/gpusph
This Makefile:   /home/user/gpusph/Makefile
Used Makefiles:   Makefile Makefile.conf Makefile.local dep/command_type.d dep/HDF5SphReader.d dep/pugixml.d dep/simframework.d dep/GPUWorker.d dep/Synchronizer.d dep/VTUReader.d dep/ProblemCore.d dep/mai
n.d dep/Writer.d dep/base64.d dep/ParticleSystem.d dep/Options.d dep/GPUSPH.d dep/vector_print.d dep/Reader.d dep/Integrator.d dep/buffer_traits.d dep/debugflags.d dep/predcorr_alloc_policy.d dep/XYZReade
r.d dep/cuda/cudautil.d dep/geometries/Cube.d dep/geometries/Torus.d dep/geometries/STLMesh.d dep/geometries/Cylinder.d dep/geometries/Point.d dep/geometries/TopoCube.d dep/geometries/Object.d dep/geometr
ies/Vector.d dep/geometries/Disk.d dep/geometries/EulerParameters.d dep/geometries/Cone.d dep/geometries/Sphere.d dep/geometries/Rect.d dep/geometries/Plane.d dep/integrators/RepackingIntegrator.d dep/int
egrators/PredictorCorrectorIntegrator.d dep/problem_api/ProblemAPI_1.d dep/writers/UDPWriter.d dep/writers/CustomTextWriter.d dep/writers/CallbackWriter.d dep/writers/CommonWriter.d dep/writers/HotFile.d 
dep/writers/VTKWriter.d dep/writers/VTKLegacyWriter.d dep/writers/TextWriter.d dep/writers/HotWriter.d dep/NetworkManager.d dep/problems/BuoyancyTest.d dep/problems/ProblemExample.d dep/problems/WaveTank.
d dep/problems/user/MY_WaveTank.d dep/BuoyancyTest.gen.d dep/ProblemExample.gen.d dep/WaveTank.gen.d dep/MY_WaveTank.gen.d
Problem:         
Linearization:   yzx
Snapshot file:   ./GPUSPH-v5.0+custom-2019-06-13.tgz
Last problem:    MY_WaveTank
Sources dir:     src src/adaptors src/cuda src/geometries src/integrators src/problem_api src/problems src/writers
Options dir:     options
Objects dir:     build build/adaptors build/cuda build/geometries build/integrators build/problem_api build/problems build/problems/user build/writers
Scripts dir:     scripts
Docs dir:        docs
Doxygen conf:    
Verbose:         
Debug:           0
CXX:             g++
CXX version:     g++ (GCC) 6.3.0
MPICXX:          g++
nvcc:            /opt/apps/software/system/CUDA/10.1.105/bin/nvcc -ccbin=g++
nvcc version:    10.1
LINKER:          /opt/apps/software/system/CUDA/10.1.105/bin/nvcc -ccbin=g++
Compute cap.:    75
Fastmath:        0
USE_MPI:         0
USE_HDF5:        1
USE_CHRONO:      1
default paths:   /home/user/gpusph/as /home/user/gpusph/it /home/user/gpusph/is /home/user/gpusph/a /home/user/gpusph/non-system /home/user/gpusph/directory /home/user/gpusph/t
hat /home/user/gpusph/duplicates /home/user/gpusph/a /home/user/gpusph/system /home/user/gpusph/directory /home/user/gpusph/as /home/user/gpusph/it /home/user/gpusph/is /home/t
royheit/gpusph/a /home/user/gpusph/non-system /home/user/gpusph/directory /home/user/gpusph/that /home/user/gpusph/duplicates /home/user/gpusph/a /home/user/gpusph/system /home/tro
yheit/gpusph/directory /opt/apps/software/data/HDF5/1.10.5-iimpi-2018.4.274/include /opt/apps/software/tools/Szip/2.1.1-GCCcore-6.3.0/include /opt/apps/software/lib/zlib/1.2.11/include /opt/apps/software/
mpi/impi/2018.4.274-iccifort-2018.5.274-GCC-6.3.0-2.26/include64 /opt/apps/software/lib/libfabric/1.7.1/include /opt/apps/software/compiler/ifort/2018.5.274-GCC-6.3.0-2.26/include /opt/apps/software/compi
ler/icc/2018.5.274-GCC-6.3.0-2.26/compilers_and_libraries_2018.5.274/linux/tbb/include /opt/apps/software/tools/binutils/2.26-GCCcore-6.3.0/include /opt/apps/software/mpi/OpenMPI/3.1.2-GCC-8.2.0-2.31.1/in
clude /opt/apps/software/system/hwloc/1.11.11-GCCcore-8.2.0/include /opt/apps/software/system/libpciaccess/0.14-GCCcore-8.2.0/include /opt/apps/software/lib/libxml2/2.9.8-GCCcore-8.2.0/include/libxml2 /op
t/apps/software/lib/libxml2/2.9.8-GCCcore-8.2.0/include /opt/apps/software/tools/XZ/5.2.4-GCCcore-8.2.0/include /opt/apps/software/tools/numactl/2.0.12-GCCcore-8.2.0/include /opt/apps/software/system/CUDA
/10.1.105/nvvm/include /opt/apps/software/system/CUDA/10.1.105/extras/CUPTI/include /opt/apps/software/system/CUDA/10.1.105/include /opt/apps/software/devel/ncurses/6.1-GCCcore-7.3.0/include /opt/apps/sof
tware/math/Eigen/3.3.7/include /opt/apps/software/compiler/GCCcore/6.3.0/include/c++/6.3.0 /opt/apps/software/compiler/GCCcore/6.3.0/include/c++/6.3.0/x86_64-pc-linux-gnu /opt/apps/software/compiler/GCCco
re/6.3.0/include/c++/6.3.0/backward /opt/apps/software/compiler/GCCcore/6.3.0/lib/gcc/x86_64-pc-linux-gnu/6.3.0/include /opt/apps/software/compiler/GCCcore/6.3.0/include /opt/apps/software/compiler/GCCcor
e/6.3.0/lib/gcc/x86_64-pc-linux-gnu/6.3.0/include-fixed /usr/include
INCPATH:          -Isrc -Isrc/adaptors -Isrc/cuda -Isrc/geometries -Isrc/integrators -Isrc/problem_api -Isrc/problems -Isrc/writers -Isrc/problems -Isrc/problems/user -Ioptions -isystem /home/user/chr
ono/include -isystem /home/user/chrono/include -isystem /home/user/chrono/include/chrono -isystem /home/user/chrono/include/chrono/collision/bullet
LIBPATH:          -L/usr/local/lib -L/opt/apps/software/system/CUDA/10.1.105/lib64 -L/home/user/chrono/lib
LIBS:             -lcudart -L/opt/apps/software/data/HDF5/1.10.5-iimpi-2018.4.274/lib -lhdf5 -lsz -lz   -lpthread -lrt -lChronoEngine
LDFLAGS:          --linker-options -rpath,/home/user/chrono/lib  -L/usr/local/lib -L/opt/apps/software/system/CUDA/10.1.105/lib64 -L/home/user/chrono/lib -arch=sm_75
CPPFLAGS:          -Isrc -Isrc/adaptors -Isrc/cuda -Isrc/geometries -Isrc/integrators -Isrc/problem_api -Isrc/problems -Isrc/writers -Isrc/problems -Isrc/problems/user -Ioptions -isystem /home/user/ch
rono/include -isystem /home/user/chrono/include -isystem /home/user/chrono/include/chrono -isystem /home/user/chrono/include/chrono/collision/bullet -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MAC
ROS -D_GLIBCXX_USE_C99_MATH -DUSE_HDF5=1 -I/opt/apps/software/data/HDF5/1.10.5-iimpi-2018.4.274/include   -D__COMPUTE__=75
CXXFLAGS:         -m64 -std=c++11   -O3
CUFLAGS:          -arch=sm_75 --generate-line-info -std=c++11 --compiler-options -m64,-O3


The "summary.txt" output is

Simulation parameters:
 deltap = 0.005
 sfactor = 1.3
 slength = 0.0065
 kerneltype: 3 (Wendland)
 kernelradius = 2
 influenceRadius = 0.013
 SPH formulation: 1 (F1)
 multi-fluid support: disabled
 Rheology: Newtonian
	Turbulence model: Sub-particle scale
	Computational viscosity type: Kinematic
	Viscous model operator: Morris 1997
	Viscous averaging operator: Harmonic
	(constant viscosity optimizations)
 periodicity: 0 (none)
 initial dt = 5e-05
 simulation end time = 10
 neib list construction every 10 iterations
 Shepard filter every 20 iterations
 adaptive time stepping enabled
    safety factor for adaptive time step = 0.2
 internal energy computation disabled
 XSPH correction disabled
 moving bodies disabled
 open boundaries disabled
 water depth computation disabled
 time-dependent gravity disabled
 geometric boundaries: 
   DEM: disabled
   planes: enabled, 6 defined

Physical parameters:
 gravity = (0, 0, -9.81) [9.81] fixed
 numFluids = 1
 rho0[ 0 ] = 1000
 B[ 0 ] = 57142.9
 gamma[ 0 ] = 7
 sscoeff[ 0 ] = 20
 sspowercoeff[ 0 ] = 3
 sound speed[ 0 ] = 2.00601e+10
 partsurf = 0
 Lennard-Jones boundary parameters:
	r0 = 0.005
	d = 22.0725
	p1 = 12
	p2 = 6
Newtonian rheology with Sub-particle scale turbulence model. Parameters:
	Smagfactor = 3.6e-07
	kSPSfactor = 1.1e-07
	kinematicvisc[ 0 ] = 1e-06 (m^2/s)
	visc_consistency[ 0 ] = 0.001 (Pa^n s)
	visccoeff[ 0 ] = 1e-06 (m^2/s)

Comman-line options:
 problem: MY_WaveTank
 dem: 
 dir: /home/user/nfs_fs02/high_res
 deltap: 0.005
 tend: nan
 dt: nan
 hosts: 0
 saving enabled
 GPUDirect disabled
 striping disabled
 async network transfers disabled
 Other options:

@Oblomov
Contributor

Oblomov commented Oct 11, 2019

I believe the actual issue in this case is that specifying an output directory that is the same as the directory where the resume file is located leads to the new writes overwriting the old ones, since the file counters aren't kept across resumes, so they restart at 0. This should be relatively easy to check, by comparing the write timestamps of the files with the corresponding simulated times in the index files. Could you verify that, please?

@saltynexus
Author

Ahhh... yes. I believe you are correct. I checked my output files and PART_00000.vtp through PART_00018.vtp are all timestamped at a later date. Since I started at hot_00082.bin, and there are 100 output files, there would be 18 files left to create. I can also confirm this by looking at the last hotstart file, which is hot_00018.bin.

@Oblomov changed the title Hotstart fails to write output → Hotstart overwrites previous output Oct 12, 2019
@Oblomov
Contributor

Oblomov commented Oct 12, 2019

Thanks. I've taken the liberty to rename the issue title.

I've been thinking about the possible approaches to solve this. I can think of three, some of which have more far-reaching implications:

  1. refuse to write when the target directory exists, regardless of whether we're resuming or not; this is the simplest to implement and extremely safe, and we may provide a command-line option such as --overwrite to restore the current behavior;
  2. when resuming with a target directory that is the same where the hotfile is stored, try to restart the output file numbering; this is not as simple as it sounds: it would require a breaking change in the hotfile format, and we need to check how hard or easy it would be for all writers to resume numbering from an arbitrary number (especially when ancillary metadata files are present, this may be quite complex);
  3. something in-between in terms of complexity would be to append a suffix to the target directory when resuming (if it exists already): e.g. <targetdir>/resume<N> where N is the number of times we have had to resume. This isn't as clean and complete as 2., but it would at least avoid data loss, and make it clear which data files were created after the resume (could be useful for debugging future issues).

In fact, 3. could probably be extended for standard (non-resume) writes, where <targetdir>.<N> is used if <targetdir> is found (and no --overwrite is specified).
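A minimal sketch of how option 3 (and its non-resume extension) might pick the next directory, assuming subdirectories named resumeN; this is illustrative only, not GPUSPH's actual code:

```python
import os
import re


def next_resume_dir(target_dir):
    """Return the first available 'resumeN' subdirectory of target_dir.

    Scans existing entries matching 'resumeN' (N a positive integer) and
    returns a path with N one past the highest found, starting at resume1.
    """
    highest = 0
    if os.path.isdir(target_dir):
        for entry in os.listdir(target_dir):
            m = re.fullmatch(r"resume(\d+)", entry)
            if m:
                highest = max(highest, int(m.group(1)))
    return os.path.join(target_dir, "resume%d" % (highest + 1))
```

Using the highest existing N plus one (rather than the first gap) keeps the directories in chronological order even if an intermediate resume directory is deleted.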

Suggestions and opinions welcome.

@Narcolessico
Contributor

The bin file is by default in save-dir/data; what about just checking if the parent of the hotfile dir does not match the save dir?
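The check described above could be sketched like this (illustrative only; the function name is hypothetical, and GPUSPH's actual option handling may differ):

```python
import os


def resuming_into_same_dir(resume_file, save_dir):
    """True if the hotfile lives in save_dir/data, i.e. a resume that
    would overwrite the previous output."""
    hotfile_dir = os.path.dirname(os.path.abspath(resume_file))  # .../data
    return os.path.dirname(hotfile_dir) == os.path.abspath(save_dir)
```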

@Oblomov
Contributor

Oblomov commented Oct 14, 2019

The question is not how to detect the situation, but what to do about it. Should we just abort, or resume but put the output in a different subdir?

@saltynexus
Author

saltynexus commented Oct 14, 2019

I would prefer the solution be found with automation in mind. If my job is running with SLURM and gets killed, it will be queued for automatic restart. I've already modified GPUSPH to handle "auto resume", but clearly the internal counter status is not saved.

  1. Aborting is not the answer. If I wanted that functionality, I would just use the default directory (no --dir option). As stated above, GPUSPH should not have to ask me for another directory.

  2. The best option, without question, is to save the counter status with the hotstart file. In the meantime, maybe there is a compromise here. Can you just parse the given file name for the counter number and add that to the internally generated filename? For example, parsing "hot_00082.bin" gives you "82". If you don't touch the internal code for the counter, it will start at filename "00000". The only part of the code you need to touch is the output filename: "00000" + "00082" = "00082" (new filename). Done! In this perspective, the problem fix is the filename (I don't know the internal workings of GPUSPH well enough to comment on saving the counter in the hotstart file or restarting the counter in code).

  3. Yeah, subdirs are a compromise and will work, but I'm not excited about it. You're just dumping more work onto the end user, and things could get messy. What if I need to resume 3-4 times? You're asking each individual user to come up with their own way to manage/merge data in post-processing. Way more chance for human error.
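The filename-parsing idea in point 2 could be sketched as follows; the hot_NNNNN.bin pattern is taken from the logs in this thread, the helper name is hypothetical, and the offset would still have to be threaded into every writer's numbering:

```python
import os
import re


def hotfile_counter(resume_path):
    """Extract the checkpoint counter from a hotfile name like 'hot_00082.bin'."""
    m = re.fullmatch(r"hot_(\d+)\.bin", os.path.basename(resume_path))
    if m is None:
        raise ValueError("not a hotfile name: %s" % resume_path)
    return int(m.group(1))


# A writer would then offset its internal counter when building names, e.g.:
# filename = "PART_%05d.vtp" % (internal_counter + offset)
```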

@Oblomov
Contributor

Oblomov commented Oct 14, 2019

Hello @saltynexus,

the issue with your 2nd proposal is that different writers will have different counters, and trying to map simulated time to counter is unreliable (off by one depending on the actual time-step, extra saves requested by the problem, etc). This is particularly relevant when resuming from any hotfile but the last (which might be the case e.g. if the last hotfile is corrupt), but also for the last hotfile case (e.g. if GPUSPH got terminated abruptly during or right after a VTKWRITER checkpoint, but before the next HOTWRITER checkpoint).

This is in addition to the fact that several writers don't take very well to resumes anyway, because they either have a single output file (e.g. the common writer produces most data as “append only”) or they have additional metadata that would have to be reloaded to be kept in sync. This makes the implementation of the counter resume non-trivial in coding terms (i.e. it will take longer to get to a reliably working solution).

I'm not particularly happy about the subdir/altdir solution either, but it could be a stopgap to avoid data loss until the proper counter resume is implemented.

@saltynexus
Author

saltynexus commented Oct 14, 2019

@Oblomov OK, so we both agree that the subdir option is a temporary fix and that the real solution lies within the workings of the counter. If I knew more how GPUSPH works in this regard, I could offer advice, but I'm a new user and still learning.

As for the subdir option, I know you said "<targetdir>/resume<N> where N is the number of times we have had to resume.", but can you comment on my outline below (which reflects my understanding)? Am I missing something or is this the same vision you're proposing?

  1. begin simulation with options --dir /home/user/directory
  2. simulation is killed with the last hotstart file being "hot_00082.bin"
  3. resume simulation with options --dir /home/user/directory --resume /home/user/directory/data/hot_00082.bin
  4. GPUSPH sees the directory exists and creates "/home/user/directory/resume1"
  5. GPUSPH writes output in new directory, as usual, beginning with suffix "00000"
  6. For the sake of understanding, suppose the simulation is hypothetically killed again 5 outputs later. The process returns to step 3, in which we resume the simulation with options --dir /home/user/directory --resume /home/user/directory/resume1/hot_00004.bin. This time the new directory is "/home/user/directory/resume2". How did GPUSPH know to call it "2"? I'm assuming it uses some file system functions to read directory names and parse them. Can we pad the suffix with a few zeros to help with sorting, i.e. "resume001" or "resume_001"?
  7. The above process is repeated until simulation finished, in which we have N "resume" directories, as you said prior.

I'm totally fine with this temporary solution. However, can you advise on how to perform the post-processing? Will GPUSPH stitch the output files together, or are we responsible for this? Maybe it's not necessary? Would renaming the files in order and then placing them all in one directory work? Again, I'm asking because I'm a new user of GPUSPH, and all I know is to point ParaView to the "data" directory and load the VTK group files.

@Oblomov
Contributor

Oblomov commented Oct 15, 2019

Hello @saltynexus,

that would indeed be the general idea. Padding the name of the resume directory is a good idea if we assume a worst-case scenario where more than 10 resumes are needed.

For data visualization and post-processing, rather than opening the VTK file groups directly (which wouldn't work out-of-the-box on resume due to the counter restart), something that should work almost out of the box would be to open the VTUinp.pvd file: this works as an index to the VTK files, but also includes simulated time information (in fact, I usually work with the index file because of that).

I think it should also be relatively easy to write a post-processing script that takes the index files from all the resume directories and builds a new index file, and possibly symlinks all the data files (reindexing them as appropriate) into a new 'recovery' directory, where at least the VTKWriter output can be perused as if it was the usual data/ directory output.
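A rough sketch of such a script, assuming the standard ParaView PVD layout (a Collection of DataSet entries with timestep and file attributes); GPUSPH's index files may carry extra attributes, which this leaves untouched. It rewrites the file paths rather than symlinking the data files, which is simpler but equivalent in effect:

```python
import os
import xml.etree.ElementTree as ET


def merge_pvd(index_files, output_pvd):
    """Merge several VTUinp.pvd index files into one.

    index_files must be given in simulation order (original run first,
    then each resumeN). Each DataSet's 'file' attribute is rewritten to be
    relative to the merged index's directory, so the original .vtp files
    are referenced in place. On overlapping simulated times, later runs win.
    """
    out_dir = os.path.dirname(os.path.abspath(output_pvd))
    by_time = {}
    for pvd in index_files:
        pvd_dir = os.path.dirname(os.path.abspath(pvd))
        for ds in ET.parse(pvd).getroot().iter("DataSet"):
            t = float(ds.get("timestep"))
            # point at the data file where it already lives
            target = os.path.join(pvd_dir, ds.get("file"))
            ds.set("file", os.path.relpath(target, out_dir))
            by_time[t] = ds  # later index files overwrite earlier timesteps
    root = ET.Element("VTKFile", type="Collection", version="0.1")
    coll = ET.SubElement(root, "Collection")
    for t in sorted(by_time):
        coll.append(by_time[t])
    ET.ElementTree(root).write(output_pvd, xml_declaration=True)
```

With the resumeN layout discussed above, a call might look like merge_pvd(["high_res/data/VTUinp.pvd", "high_res/resume1/data/VTUinp.pvd"], "high_res/merged.pvd") (paths illustrative), producing a single index loadable in ParaView.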

@saltynexus
Author

Awesome...sounds like a plan!

@saltynexus
Author

@Oblomov I'm sure you have a lot on your plate, but I just wanted to see if you can provide an estimated timeline on when this temporary solution will become available?

giuseppebilotta pushed a commit that referenced this issue Oct 28, 2019
Before this change, when resuming into the same subdir that we resumed
from, the older data files would be overwritten, because we do not track
the writer's file indices, and resuming for some writers would be very
non-trivial (metadata handling etc).

To avoid data loss, following the discussion around issue #56 on GitHub,
we enact the following policy:

1. nothing changes if we resume into a new directory;
2. we prevent resuming into an existing directory that is _not_ the same
   we are resuming from;
3. when resuming into the same directory we resume from, the actual
   problem dir is shifted into the first available `resumeN` subdir.
@Oblomov
Contributor

Oblomov commented Oct 28, 2019

Hello @saltynexus , I've just pushed to the next branch an implementation of the resumeN dir approach (no script to merge the VTUs yet, but at least it should prevent data loss). I'm keeping this issue open until we have the merge script, though.

@saltynexus
Author

@Oblomov Thank you for your support with this!

@saltynexus
Author

@Oblomov
I just wanted to inform you that the resumeN patch is looking good. I've been working with it recently and have not found any problems thus far.

As far as the "merge the VTUs" stuff goes, I reached out to the ParaView group and got some good advice (https://discourse.paraview.org/t/how-to-merge-multiple-pvd-and-vtp-files/3506/6). I haven't tried testing or developing the script yet but it's a good start. I'll probably work on it early next week and if I make any progress I'll share with you.

@Oblomov
Contributor

Oblomov commented Feb 7, 2020

@saltynexus that's very good news. The proposed idea for merging the PVDs also looks very promising and surprisingly simple. Excellent. Thanks for looking into this.
