flaky GPU memory issues

8 views (last 30 days)
Rodrigo on 9 Feb 2012
Edited: Cedric on 8 Oct 2013
We have a GTX 580 with 3 GB of RAM in a Linux machine (Ubuntu Lucid with a Natty backported kernel) running R2011b, and I find myself fighting seemingly random crashes due to memory allocation on the GPU. The first thing I noticed is that overwriting a variable that lives on the GPU does not always give me back the RAM the old variable occupied minus the size of the new data, so I have to clear the variable instead of overwriting it. Is there a collection of best practices for avoiding wasted memory in cases like this?
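Here is a minimal sketch of the pattern I mean (sizes arbitrary; I re-query gpuDevice each time to read FreeMemory):
    % Sketch: overwrite vs. clear, watching FreeMemory from gpuDevice.
    g = gpuDevice;
    fprintf('free at start: %g bytes\n', g.FreeMemory);
    A = gpuArray(rand(2000));   % ~32 MB of doubles on the device
    A = gpuArray(rand(2000));   % overwrite: old buffer not always returned right away
    g = gpuDevice;
    fprintf('free after overwrite: %g bytes\n', g.FreeMemory);
    clear A                     % an explicit clear does hand the memory back
    g = gpuDevice;
    fprintf('free after clear: %g bytes\n', g.FreeMemory);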
I also find that a calculation that has been running for hours, and that has successfully completed before, will sometimes crash with an "unexpected error" that seems to correlate with running close to maximum memory capacity. Since the program had completed before, I am left assuming that some other program interfered with memory allocation on the GPU and killed my task. Is there a way to prevent this from happening? Maybe running the server headless, or putting in another, smaller video card to drive the display?
Thanks

Accepted Answer

Edric Ellis on 9 Feb 2012
In your first observation about overwriting variables on the GPU, I presume you're using the output of "gpuDevice" to check the amount of free memory on the GPU. You're quite right that overwriting an array may not necessarily cause the old memory to be freed immediately; however, it will be freed automatically if necessary to prevent running out of memory.
It's not clear what the 'unexpected error' might be; it is not something I've seen here at The MathWorks on our test machines. Do these errors show up in similar places each time? That is, does there seem to be a particular gpuArray operation that causes this?
One final thing to note: like CPU memory, GPU memory can become fragmented over time, and it's possible that this might cause you to run out of GPU memory earlier than you might otherwise anticipate. However, I would not normally expect this to result in 'unexpected errors' - rather, I'd expect to see failed allocations.
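One habit that can reduce fragmentation, assuming your working arrays keep the same size from pass to pass, is to allocate the device buffer once and reuse it rather than creating a fresh array every iteration. A sketch (the update function is hypothetical):
    % Sketch: preallocate one device buffer and reuse it each iteration,
    % giving the allocator less opportunity to fragment.
    buf = parallel.gpu.GPUArray.zeros(4096, 4096, 'single');
    for k = 1:nIter
        buf = myKernelStep(buf);   % hypothetical step; same size in, same size out
    end
    result = gather(buf);          % copy the final answer back to the host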
  8 comments
Rodrigo on 12 Feb 2012
I still have no idea what this command does, but it seems to have solved the problem. I still have about a month of back-to-back calculations to run -- so it may be too early to tell -- but whatever black voodoo this feature command does seems to work.
Rodrigo on 13 Apr 2012
So this fix seems to break in R2012a. Any ideas for how to unbreak it?


More Answers (3)

Walter Roberson on 9 Feb 2012
It is not safe to assume that some other program interfered with the memory allocation. Instead, you have to take into account that your program might have corrupted memory in a way that does not always cause a crash but does sometimes -- for example, if the corrupted memory block does not happen to be needed again until a lot of memory is in use...
  2 comments
Rodrigo on 9 Feb 2012
I see. So is there a way to periodically flush the GPU memory to avoid this corruption? Right now the full computation takes about 24 hours, and having it crash at the 23rd hour stings. I suppose I can dump the partial results to disk and try to recover after a crash, but since I don't know what the "unexpected error" actually is, I have a hard time adjusting my programs to avoid it.
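Something like this checkpoint loop is what I have in mind (the file names and step function are hypothetical):
    % Sketch: gather partial results and save them every so often, so a
    % crash near the end only costs the work since the last checkpoint.
    for k = 1:nSteps
        state = computeStep(state);        % hypothetical GPU-side update
        if mod(k, 50) == 0
            partial = gather(state);       % copy device data back to the host
            save(sprintf('checkpoint_%04d.mat', k), 'partial', 'k');
        end
    end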
In case a MathWorks engineer is reading, posting a set of best practices and common "unexpected errors" would be really helpful.
Walter Roberson on 9 Feb 2012
If you _do_ have a memory corruption problem from your code (or from something in MathWorks' implementation), then releasing all memory _or_ using all memory could trigger the problem. However, releasing the GPU from operations could, depending on the implementation, have the effect of simply throwing away all of the memory without bothering to put the fragments back together.
It would not be impossible for a memory allocator to offer an "ignore everything known about the current state of memory and just re-initialize back to the starting state" call. I do not recall ever encountering a memory allocation library that offered that as a user call, however.
I have not examined the memory allocation system used for the GPU routines; I am reflecting back to my past experiences [redacted] years ago, using [redacted] on [redacted] (redactions to protect my delusions that I am not _that_ old...)
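For what it's worth, the Parallel Computing Toolbox does expose something close to that re-initialize-everything call in newer releases (R2012a, if I recall correctly): reset on the device object discards every gpuArray and returns the device to its starting state. A sketch:
    % Sketch (assuming R2012a or later): reset wipes ALL device memory and
    % re-initializes the GPU, so gather anything you still need first.
    g = gpuDevice;                 % handle to the currently selected device
    keep = gather(importantData);  % hypothetical array worth saving to the host
    reset(g);                      % back to a clean slate on the device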



Ruby Fu on 10 Feb 2012
Hi Rodrigo, Edric and Walter, it is great that I found this post just when I needed it! I have the exact same problem as Rodrigo. My experience has been this:
1. My program runs perfectly fine on a smaller-resolution problem, meaning smaller matrices and less memory allocation.
2. When I try to run the program at higher resolution, it complains that there is not enough memory.
3. So naturally I clear several intermediate matrices at each iteration once they are done being useful; they get updated at the next iteration anyway (see the sketch after this list).
4. Now I test-run the new program (with memory cleared at each iteration) on the _small_ resolution problem, just to make sure I did not accidentally clear some useful variables.
5. And I get:
Error using parallel.gpu.GPUArray/fft
MATLAB encountered an unexpected error in evaluation on the GPU.
Coincidentally, this error occurred at an fft operation. However, that is also the first function call in the program.
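Here is roughly what step 3 looks like (variable and function names hypothetical):
    % Sketch: free device intermediates as soon as they go stale, so the
    % buffers are handed back before the next iteration allocates again.
    for it = 1:nIterations
        tmp = someKernel(state);        % intermediate gpuArray
        state = combine(state, tmp);    % hypothetical update
        clear tmp                       % release the device buffer now
    end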
Do you think a bigger GPU will solve the problem? I have a GTX 580 as well, and it only comes with 1.5 GB. Would a 6 GB Tesla solve this, or is there something else we are missing here?
Edric, I have the latest CUDA driver, so that should not be an issue.
Thank you! Ruby
  1 comment
Edric Ellis on 13 Feb 2012
The error message you are getting is due to CUFFT (NVIDIA's FFT library) running out of memory. Unfortunately, it sometimes reports this out-of-memory condition back to us as an "unexpected error", which we then report to you. This sort of unpredictable behaviour can sometimes be helped by the "feature" command I suggested to Rodrigo -- but if you're that close to running out of memory, you may still have problems. A card with more memory would almost certainly help you.
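As a rough guard, you can compare FreeMemory against the transform size before calling fft; a sketch (the factor of four is only my guess at CUFFT's workspace, not a documented figure):
    % Sketch: crude headroom check before an FFT on the device.
    % The 4x multiplier is a guess at CUFFT scratch space, not documented.
    g = gpuDevice;
    bytesNeeded = 4 * numel(x) * 8;    % x complex single: 8 bytes per element
    if g.FreeMemory < bytesNeeded
        warning('Likely not enough free GPU memory for this fft.');
    end
    y = fft(x);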



Max W.K. Law on 9 May 2013
I got the same error while trying to ifftn (complex-to-complex) a 256*256*516 complex-single 3D array. That is a 258 MB chunk of data, and it fails on my 4 GB GTX 680 card. Yes, if this is about running short of memory, that means 4 GB of memory couldn't take a 258 MB data chunk without giving the error "MATLAB encountered an unexpected error in evaluation on the GPU."
There is some other data on the GPU that may cause fragmentation. The code that produces this error is just "temp=ifftn(temp);". Please, is there any way to enforce an in-place transform?
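I wonder whether transforming one dimension at a time would shrink the workspace CUFFT asks for per call; an untested sketch of what I mean:
    % Untested sketch: an N-D inverse FFT is separable, so applying 1-D
    % inverse FFTs along each dimension in turn is mathematically
    % equivalent to ifftn, and each call may need a smaller workspace.
    temp = ifft(temp, [], 1);
    temp = ifft(temp, [], 2);
    temp = ifft(temp, [], 3);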
Here is the output of gpuDevice():
Name: 'GeForce GTX 680'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 5
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.147483647000000e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 4.294639616000000e+09
FreeMemory: 1.676050432000000e+09
MultiprocessorCount: 8
ClockRateKHz: 1163000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
