Hi everyone, I am new to running deep learning models on a GPU. My model runs smoothly for the first several thousand iterations, but then it stops with an out-of-memory error. I have attached the error file to this message: mpi_job.o6092769. I have tried reducing the batch size and also cutting down the input data (to just one quarter of the total training data), but the problem persists. My understanding is that the GPU has trouble releasing memory and fails at a certain iteration step, even though each iteration should use roughly the same amount of memory. Please let me know if you have run into a similar problem and how you solved it. Thank you very much!
I would suggest adding the following commands before initializing your Keras model, so that TensorFlow allocates GPU memory as needed rather than reserving it all up front:
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving the whole device up front.
gpus = tf.config.get_visible_devices("GPU")
for device in gpus:
    tf.config.experimental.set_memory_growth(device, True)
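Note that memory growth has to be set before any GPU has been initialized (i.e., before you build the model or run any op), otherwise TensorFlow raises a RuntimeError. As a minimal sketch of the ordering, with a hypothetical toy Keras model standing in for your own architecture:

import tensorflow as tf

# Set memory growth first, before the GPU is touched by any op.
for device in tf.config.get_visible_devices("GPU"):
    tf.config.experimental.set_memory_growth(device, True)

# Hypothetical model for illustration only -- replace with your own network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

If memory usage still climbs across iterations after this change, it is worth checking that nothing in your training loop accumulates tensors or rebuilds parts of the model each step.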