If you look at the magnitude of the difference relative to the magnitude of the values, it is pretty small:
>> max(max(gather(abs(k-kim)))) / max(max(abs(k)))
ans = 1.0144e-016
In fact, it is similar to the limit of accuracy (machine epsilon) for any double-precision calculation on the input:
>> eps
ans = 2.2204e-016
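(The two are not measuring the same thing: the first is a relative error across the whole result, the second is the spacing of double-precision numbers near 1. Hopefully you get the idea.)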
The CPU and GPU implementations of FFT/FFT2 are necessarily quite different in order to take best advantage of the hardware. The GPU version needs to be massively parallel. I believe that the difference you are seeing is well within what one should expect from any two different implementations of FFT/FFT2.
Whilst we can aspire to identical results from the CPU and GPU versions, sometimes it simply isn't possible without massive compromises on speed.
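If you need to check programmatically that the two results agree, a tolerance-based comparison is usually more appropriate than testing for exact equality. The following is only a rough sketch: the test input x, the variable names relErr and agree, and the tolerance of 100*eps are my own assumptions rather than recommended values, and it requires Parallel Computing Toolbox with a supported GPU.

```matlab
% Sketch: compare CPU and GPU fft2 results using a relative tolerance.
% x, tol and the factor of 100 are illustrative assumptions, not fixed rules.
x   = rand(512);               % hypothetical double-precision test input
k   = fft2(x);                 % CPU result
kim = fft2(gpuArray(x));       % GPU result (returned as a gpuArray)

% Largest absolute difference, scaled by the largest magnitude in the result
relErr = max(max(abs(k - gather(kim)))) / max(max(abs(k)));

% Assumed tolerance: a small multiple of machine epsilon for the input class
tol = 100 * eps(class(x));

fprintf('relative error = %g, tolerance = %g\n', relErr, tol);
agree = relErr <= tol;         % true when the results match to within tol
```

A suitable tolerance depends on the transform size and on whether the data is single or double precision, so treat the factor above as a starting point to adjust for your own data.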