
Out of Memory At Unlinearizing Exposure Step #727

Open
eas342 opened this issue Sep 16, 2021 · 15 comments

@eas342
Collaborator

eas342 commented Sep 16, 2021

I am using grism_tso_simulator.py and running into memory problems at the unlinearize step. I allocated 128 GB for the job and it got killed by an out-of-memory handler. The script got through 3 of 6 segments before running out of memory, so perhaps the previously finished segments are not being cleared from memory? FYI, each _linear.fits file is 6.1 GB.

2021-09-15 20:23:53,121 - mirage.ramp_generator.obs_generator - INFO - Final linearized exposure saved to:
2021-09-15 20:23:53,121 - stpipe - INFO - Final linearized exposure saved to:
2021-09-15 20:23:53,121 - stpipe - INFO - Final linearized exposure saved to:
2021-09-15 20:23:53,121 - mirage.ramp_generator.obs_generator - INFO - /xdisk/eas342/eas342/mirage_runs/mirage_output_004_manatee/sim_data/jw01185002001_01101_00002-seg004_nrca5_linear.fits
2021-09-15 20:23:53,121 - stpipe - INFO - /xdisk/eas342/eas342/mirage_runs/mirage_output_004_manatee/sim_data/jw01185002001_01101_00002-seg004_nrca5_linear.fits
2021-09-15 20:23:53,121 - stpipe - INFO - /xdisk/eas342/eas342/mirage_runs/mirage_output_004_manatee/sim_data/jw01185002001_01101_00002-seg004_nrca5_linear.fits
2021-09-15 20:23:53,122 - mirage.ramp_generator.obs_generator - INFO - Unlinearizing exposure.
2021-09-15 20:23:53,122 - stpipe - INFO - Unlinearizing exposure.
2021-09-15 20:23:53,122 - stpipe - INFO - Unlinearizing exposure.
/var/spool/slurm/d/job2070773/slurm_script: line 41: 50547 Killed                  python NIRCam_grism_tso_wrapper.py mirage_input_004_manatee/source_params_wasp80.yaml
Sep 15 20:27:49.972842 50504 slurmstepd   0x2ada3b547340: error: Detected 1 oom-kill event(s) in StepId=2070773.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
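
For a rough sense of scale, here is a back-of-envelope estimate of how much memory a single linearized segment occupies. The segment shape below is an illustrative guess chosen to roughly match the 6.1 GB file size, not the actual observation parameters:

# Rough estimate of the in-memory size of one linearized ramp segment.
# The shape here is an illustrative guess, not the actual segment dimensions.
nints, ngroups, ny, nx = 65, 6, 2048, 2048   # hypothetical full-frame segment
bytes_per_pixel = 4                          # float32
size_gb = nints * ngroups * ny * nx * bytes_per_pixel / 1024**3
print(f"~{size_gb:.1f} GB per segment")      # ~6.1 GB for these numbers

Any temporary copies made during linearization or unlinearization multiply that footprint, which is why peak memory use can be several times the on-disk file size.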

@eas342
Collaborator Author

eas342 commented Sep 17, 2021

I allocated 256 GB of memory and was able to get farther before running into a different issue.

@bhilbert4
Collaborator

I've got a memory test running that will (hopefully) shed a little light on this. It's slow going for testing though. Within obs_generator.py, everything is in a loop (over dark current files), so if you made it through 3 segments before having trouble, then there must be a memory leak somewhere.

The unlinearize functions are clunky and memory intensive, but all the memory they use should be freed up after they finish each time.

@eas342
Collaborator Author

eas342 commented Oct 1, 2021

OK, thanks for investigating! Now that I run Mirage TSO sims on a high-performance computer and allocate 256 GB, I have not run into any problems, but I imagine some people will eventually want to run this on a desktop or laptop with 16 GB to 32 GB. If it's of any use to you, I have watched the machine's memory usage at times, and it sometimes goes up to ~170 GB**, then down to something smaller like ~130 GB**, so it is probably freeing up some of the memory but not all.

**Working from my brain's memory here, so it could be overloaded/leaking too and I don't recall the exact values.

@trevorfoote

Hi, I was wondering if there were any results from the memory testing? I am trying to run this program, but the best computer setup I currently have access to is capped at ~60 GB RAM, so I'm hoping a leak was found that might let me run the TSO with my current resources.

@bhilbert4
Collaborator

I haven't found any obvious memory leaks yet in my testing, unless Python isn't cleaning up after functions complete. I'll keep digging into it.

@nespinoza

nespinoza commented Nov 5, 2021

Hi folks,

Just coming to this discussion to tell you that we (@hover2pi and I) initially found similar problems with SOSS, even before merging awesimsoss into Mirage, and we believe it's not really a memory leak per se (as in "forgetting to free things"), but a problem with how Python decides to free up memory, which is somewhat unpredictable and hard to control for JWST simulations in general. For large numbers of integrations (>50), there is no easy way around it, because one is creating large arrays in memory. A ~500-integration exposure with 6 groups on a 2048 x 256 subarray (which was our typical "problem" with SOSS) is about ~10 GB of memory on its own, so it's no wonder @eas342 is seeing those spikes, considering dark files, seed images, reference files, PSFs, etc. are also being loaded. This is especially troublesome if there is any kind of parallelization in place (which was the particular problem we had; I'm not sure whether NIRCam has any parallelization being performed within Mirage).

What we ended up doing in the SOSS simulator is generating integrations in reasonable chunks (~50 integrations, < 1 GB each), and then explicitly deleting the most memory-intensive arrays. That works for us now, and we are able to generate thousands of integrations on a common laptop. Perhaps the same ought to be implemented for the NIRCam simulator.
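
As a rough illustration of that pattern (hypothetical function and variable names, not the actual awesimsoss/Mirage code), a chunked loop with explicit deletion could look like this:

import gc
import numpy as np

def simulate_in_chunks(n_ints, chunk_size=50, ngroups=6, ny=256, nx=2048):
    """Sketch: generate integrations in small chunks and free each chunk explicitly."""
    outputs = []
    for start in range(0, n_ints, chunk_size):
        n = min(chunk_size, n_ints - start)
        chunk = make_chunk(n, ngroups, ny, nx)   # large temporary array, < 1 GB per chunk
        outputs.append(process_chunk(chunk))     # keep only the reduced/written result
        del chunk                                # drop the big array immediately
        gc.collect()                             # nudge Python to release the memory now
    return outputs

def make_chunk(n, ngroups, ny, nx):
    # stand-in for the real seed-image / ramp generation for one chunk
    return np.zeros((n, ngroups, ny, nx), dtype=np.float32)

def process_chunk(chunk):
    # stand-in for writing the chunk to disk (or appending to the output file)
    return float(chunk.mean())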

Just a thought!

N.

@bhilbert4
Collaborator

Thanks @nespinoza! Mirage does do something similar for TSO simulations, where the data are broken into multiple segments once the number of groups/ints gets too large. I had been assuming that, when looping over the segments, all of the memory in the re-used variables would be freed at each iteration of the loop. I'll try following SOSS's lead and explicitly delete those variables to see if it helps.

@emma-miles

Hi everyone,

I am currently experiencing a similar issue when running the NIRISS SOSS example. I have found that clearing my cache and restarting my laptop helped the simulation run a little further, but not enough for it to complete. I have also updated all of my Python packages to meet the requirements.

The error that comes up each time that I run the code is as follows:

[screenshot of the error output]

I have also attached the log file in case anyone notices something that I have missed.

mirage_latest(unlinearizedexposures).log

And these are my laptop's specifications:
[screenshot of laptop specifications]

Is it possible that my Mac simply cannot run Mirage because of a memory issue? Or is it an issue with Python forgetting to free up memory? Are there any suggestions for how I might resolve this?

@bhilbert4
Collaborator

It looks like the failure is happening when trying to unlinearize the exposure. @hover2pi have you ever run into this when creating a SOSS exposure?
One thing to try would be to reduce the size of the data chunks that the exposure is broken up into. Go into Mirage's utils/constants.py file and decrease the value of FILE_SPLITTING_LIMIT.

To find the location of this file:
import mirage
print(mirage.__file__)

Then the constants file will be in the utils subdirectory under the path listed there.
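
For example, something along these lines should print the file's location and the current value (assuming the constant lives in mirage.utils.constants, per the file mentioned above):

import os
import mirage
from mirage.utils import constants

# Path to the constants file that contains FILE_SPLITTING_LIMIT
print(os.path.join(os.path.dirname(mirage.__file__), 'utils', 'constants.py'))

# Current value; lowering it makes Mirage split the exposure into smaller chunks
print(constants.FILE_SPLITTING_LIMIT)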

Another option that might be useful, depending on exactly what data you are looking for, would be to work from the linearized data and skip the unlinearization. That can be done by going into the yaml file and changing the Output:datatype entry to 'linear'.
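
A minimal sketch of that yaml change (key names as described above; the rest of the Output block stays unchanged):

Output:
  datatype: linear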

In the meantime, I'll keep trying to figure out why the memory use seems to get so large for these large observations. Sorry about this. I haven't had much time in the last couple months to delve into Mirage.

@hover2pi
Collaborator

hover2pi commented Mar 23, 2022

Yeah, the line is right here:

max_frames = 50

I have it hardcoded to 50 frames but I can definitely make that a SossSim.create keyword argument so users can adjust it to their needs. 50 worked for large numbers of integrations on my laptop but we definitely didn't test it for all possible scenarios. My guess is that this would fix it.

I'll open a PR to fix it and we can see if that works for @emma-miles' use case.

@hover2pi
Collaborator

PR is here: #775!

@emma-miles

@bhilbert4 @hover2pi thank you both for getting back to me on this!

I have taken all of your suggestions into consideration. The memory issue I was experiencing previously no longer appears, but another issue has arisen.

I changed the yaml file to only take in the linearized dark files, but for some reason it expects a raw output for the seed image after segment 1 is complete. I was trying to read through the soss_simulator.py code where the error appears, but I can't quite make sense of why this might be happening.

[screenshot of the error traceback]

I'm not sure if this was an issue previously as the code has never run this far for me. Is there any way around this?

Cheers,
Emma

@hover2pi
Collaborator

I think I know what this is. I'll check it out and push a fix for it ASAP.

@hover2pi
Collaborator

This should be fixed when #775 is merged.

@emma-miles

@hover2pi I have incorporated your fix but the same error seems to be occurring. Do you have any other solutions for this?

[screenshot of the error]
