This creates a double array.
ar = ar + single(AccelDouble);
Here AccelDouble is converted to a single at first, which must be converted back to a double, because it is added to a double. This is a waste of time.
Either convert all values to singles from the beginning. Or stay at doubles.
mi ~= 0 is excluded twice. If would be more efficient, to remove all m==0 before the loops.
"sign(mi)"? Do you mean "single(mi)"?
AccelDouble = zeros(3,'double') ;
This produces a [3 x 3] array. Because it is overwritten ineach iteration, this is not a pre-allocation, but a waste of time. It does not matter, that it is overwritten by a [1 x 3] array, but confusing only.
I assume, your code can be accelerated. What are the used sizes of the input?
The integration scheme based of delta(v) = delta(t)*a is very basic. There are integrations methods on a higher order, which allow to increase the step sizes, which reduces the run time.
[EDITED] Vectorizing the inner loop and using SINGLE for all data reduces the runtim from 1.64 sec to 0.06 sec for 2e3 objetcs:
heavy = rand(2e3, 7) .* [1e9, 1e6, 1e6, 1e6, 1e3, 1e3, 1e3];
heavy = Core1(heavy);
heavyS = single(heavy)
heavy = Core1(heavyS);
function heavy = Core3(heavy)
timeStep = 0.1;
G = 6.67430e-11;
M = heavy(:, 1);
X = heavy(:, 2:4);
V = heavy(:, 5:7);
n = numel(M);
MG = M * G;
for i = 1:n
mi = M(i);
D = X - X(i, :);
r2 = sum(D .* D, 2);
F = (D ./ sqrt(r2)) .* (MG * mi ./ r2);
F(i, :) = 0;
Ft = sum(F, 1);
V(i, :) = V(i, :) + Ft * timeStep;
X(i, :) = X(i, :) + timeStep * V(i, :);
heavy(:, 2:4) = X;
heavy(:, 5:7) = V;
I've implemented this with the assumption that your "sign(mi)" was thought as "single(mi)".
But note this: In the current implementation the problem is O(n^2). Doubling the input size does increase the runtime by a factor 4. I've tested the code with 2e3 rows. For 1e6 rows this will be more than 4 hours per time step.
How many timesteps do you want to calculate? I assume, there is no way to perform this efficiently in Matlab with this simple apporach. Even if you convert this to a C-mex method and increase the speed by a factor 2 (a pure guess only!), 2 hours per time step is still far away from allowing a serious number of time steps.
Bjorn has mentioned, that the force matrix is symmetrical, but it is not trivial to exploit this for 1e6 elements without exhausting the the memory. Using a sophisticated intergator is essential to reduce the number of timesteps for a certain accuracy.
Professional simulations use e.g. the leap-frog algorithm. This simplifies the computation by combining a neighborhood of elements to their center of mass: If we simulate the trajectories of the stars in the galaxy, the distribution of other stars in a spatial segment far away is in first approximation the force of its center of mass containing the sum of the masses. This can reduce the computing time by a factor 100 without loosing too much accruracy.