# How to quickly identify a list of nearby stations?


I have a total of 1 million stations numbered from 1 to 1 million. For each station, how do I quickly identify all other stations (by station number) that are within 2 km of it? I wrote the program below, but it is extremely slow. Any recommendations would be greatly appreciated. Many thanks!

```matlab
%% Calculate the distance between the stations:
m2 = numel(Sta); % The list of profile numbers
B0 = zeros(m2*m2, 6);
n0 = 0;
for i = 1:m2
    for j = 1:m2
        % Wrap longitudes into the range [-180, 180]:
        if Sta{i}.lon <= 180
            Lon1 = Sta{i}.lon;
        else
            Lon1 = Sta{i}.lon - 360;
        end
        if Sta{j}.lon <= 180
            Lon2 = Sta{j}.lon;
        else
            Lon2 = Sta{j}.lon - 360;
        end
        n0 = n0 + 1;
        B0(n0,:) = [i, j, Sta{i}.lat, Lon1, Sta{j}.lat, Lon2];
    end
end
Dis = distance(B0(:,3), B0(:,4), B0(:,5), B0(:,6), 6371); % Earth radius 6371 km, so Dis is in km.

D = cell(1, m2); % Cell array storing the list of stations close to each profile:
for i = 1:m2
    Ind1 = B0(:,1) == i;          % Rows of B0 / Dis for station i.
    B = [B0(Ind1,2), Dis(Ind1)];  % Two columns: list of profiles, distance.
    Ind = B(:,2) <= 2;            % Distance is at most 2 km.
    if sum(Ind) >= 2              % At least two profiles identified; one of them is station i itself.
        D{i} = setdiff(B(Ind,1), i); % Profiles close to this station (excluding itself).
    else
        D{i} = [];
    end
end
```


### Accepted Answer

Torsten
on 2 Oct 2022

Edited: Torsten
on 2 Oct 2022

And for your double loop, you should replace

```matlab
for i = 1:m2
    for j = 1:m2
```

by

```matlab
for i = 1:m2
    for j = i+1:m2
```

to save computation time.
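Since the distance from station i to station j equals the distance from j to i, each unordered pair only needs to be visited once. A minimal sketch of the halved loop (this assumes plain vectors lat and lon rather than the question's cell array Sta, and it is still O(m2^2), so only a constant-factor saving):

```matlab
% Sketch: visit each unordered pair (i, j) with i < j exactly once
% (assumes plain double vectors lat and lon, in degrees).
for i = 1:m2
    for j = i+1:m2
        d = distance(lat(i), lon(i), lat(j), lon(j), 6371); % distance in km
        if d <= 2
            D{i}(end+1) = j; % computed once, recorded in both lists
            D{j}(end+1) = i;
        end
    end
end
```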


### More Answers (3)

Image Analyst
on 2 Oct 2022

Well, let's see why it's slow. You have a million stations, so an all-pairs array is a million by a million doubles at 8 bytes each: 8000 gigabytes (8 terabytes)! Now maybe you have an 8-terabyte drive in your computer for swap space, but it will still be slow. I don't even know if MATLAB can handle an array that big even if you had infinite memory.

The second reason it is slow is the use of cell arrays. Cell arrays are very slow and memory inefficient. Put all your locations into standard double arrays, not cell arrays.

Now I don't know what this distance function you're using is (there are several) but it might be slow because it is general purpose with lots of overhead or maybe it takes the square root or even some more complicated formula. That could be avoided by using the simple Euclidean distance formula but not taking the square root. If you want the distance to be less than 2 km, that is the same as wanting the squared distance to be less than 4. Hence just square the distances and compare to 4 but don't take the square root. That is just an extra, unnecessary operation that will slow it down.

So why do I think the squared distance will probably be better? Well, if you're trying to find stations that are within 2 km of each other, you're probably not considering stations placed all over the globe. You're probably not worried about whether a station in Beijing is within 2 km of one in Miami; you wouldn't need that kind of precision. You're probably considering a much smaller area, like just the stations in Belgium or some country or state. Though why you'd have a million stations in a small area, I don't know. What is a station anyway? A radio station? A train station? An antenna location?

So if you're using a small area of the globe, you can convert the lats and lons to linear (x,y) locations. But you'll say "oh, but it won't be accurate" and I'll say "who cares?" Certainly not us/you. It will be accurate enough to find stations within 2 km, since there is no earth-curvature effect over such a small distance, and if it's inaccurate for stations separated by 500 km, who cares? You don't want to identify a pair of stations like that (very distant) anyway, so who cares if it says the distance is 480 km instead of 500 km? It doesn't matter. In the end, all you care about are stations where the squared distance is less than 4, not the absolutely precise distance down to the nearest millimeter.

So just have a 2-D array where you compute the squared distance of every station to every other station using the Pythagorean theorem and your (x,y) coordinates (not cell arrays). Try meshgrid. But you still have the problem of having a trillion distances to compute; you're going to have to somehow reduce that down to something more manageable, because a trillion distances is not realistic. Once you have the squared distances in your array, finding the ones within 2 km (squared distance at most 4) is trivial:

```matlab
mask = (squaredDistances > 0) & (squaredDistances <= 4);
% Find station indexes
[stationID1, stationID2] = find(mask);
```
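The whole idea can be sketched as follows; the 111.32 km-per-degree scale, the equirectangular projection, and the variable names lat and lon (plain double vectors, not the question's cell array) are assumptions, and implicit expansion stands in for meshgrid:

```matlab
% Sketch: squared-distance search on a small region (assumes plain
% double vectors lat, lon in degrees; a rough equirectangular projection).
kmPerDegLat = 111.32;                  % approximate km per degree of latitude
lat0 = mean(lat);                      % reference latitude for the projection
x = lon .* kmPerDegLat .* cosd(lat0);  % east-west position in km
y = lat .* kmPerDegLat;                % north-south position in km

% Pairwise squared distances via implicit expansion (N-by-N matrix,
% so this only works for a manageable number of stations at a time).
dx = x(:) - x(:).';
dy = y(:) - y(:).';
squaredDistances = dx.^2 + dy.^2;

% Squared distance at most 4 means distance at most 2 km; the > 0
% test drops each station's zero distance to itself.
mask = (squaredDistances > 0) & (squaredDistances <= 4);
[stationID1, stationID2] = find(mask);
```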

Walter Roberson
on 2 Oct 2022

First, divide the stations into subsets:

- stations close enough to the North Pole that the smaller distance between degrees of longitude could start substantially affecting the calculation of which stations are close to each other. For the moment, call this NorthStations
- A subset of NorthStations that is potentially within 2 km of the southern boundary of NorthStations. Call this NorthBorderStations for the moment
- Likewise, create SouthStations and SouthBorderStations
- Take the remaining stations that are not in NorthStations or SouthStations, and add copies of NorthBorderStations and SouthBorderStations to that. Call this OtherStations for the moment
- Take the subset of OtherStations that is potentially within 2 km of -180 degrees of longitude (remember that OtherStations is clipped 2 km North/South of the boundary you decided on earlier, so the curvature at the North/South boundary will be limited). Add 360 to those longitudes and add that subset back into OtherStations

Now NorthStations and SouthStations are substantially reduced subsets of the original data. You can use the haversine distance formula within each subset, perhaps using pdist(). This compares every point in each subset to every other point in that subset, but the subsets should include only the locations where the longitude convergence introduces a notable risk of unexpected connections. The distances calculated within each subset will be exact on the first pass.

Now take OtherStations and use knnsearch() and Euclidean distance, with the radius set to the largest Euclidean distance that maps to 2 km right at the border you used for dividing into the North / South subsets. For example, if you choose to put the boundary at acosd(1/3) (about 70.5 degrees), then use a radius corresponding to 2/(1/3) = 6 km. knnsearch with these settings will build a k-d tree and search it for nearby points. This will, if I recall, take roughly N*log(N) time, where N is the number of points in the subset.

The list of neighbours you get will be an overestimate, but now you can take the points and their neighbours and apply the pairwise haversine formula to get exact distances.

The higher/lower you put the latitude boundaries, the more you reduce the size of the NorthStations and SouthStations subsets that have to be compared pairwise, but the more you overpopulate the search in OtherStations (because the more extreme the boundary, the larger you need to make the Euclidean search radius so as to be able to find the matches right at the boundary).

I guess perhaps this search strategy could end up with some duplicates, if you happen to have pairs that lie exactly within the NorthBorderStations or SouthBorderStations, or entirely within the band that is duplicated at -180, so be sure to do a final filtering pass.
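For the radius query on OtherStations, MATLAB's rangesearch() (Statistics and Machine Learning Toolbox) performs the build-a-tree-and-search step described above. This is only a sketch; the xy matrix and the 6 km radius are assumptions carried over from the acosd(1/3) boundary choice:

```matlab
% Sketch: radius search over projected coordinates for OtherStations.
% Assumes xy is an N-by-2 matrix of [x y] positions in km.
searchRadiusKm = 6; % inflated radius chosen for the acosd(1/3) boundary
idx = rangesearch(xy, xy, searchRadiusKm); % idx{i} = candidates near point i

% The candidate lists are an overestimate; confirm each pair with an
% exact great-circle (haversine) distance before keeping it.
```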


Walter Roberson
on 2 Oct 2022

If you have the Mapping Toolbox, then use distance().

If you do not have that toolbox, then see the File Exchange: https://www.mathworks.com/matlabcentral/fileexchange/27785-distance-calculation-using-haversine-formula
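If neither is available, the haversine formula is short enough to write directly; a minimal, vectorised sketch (inputs in degrees, output in km):

```matlab
function d = haversineKm(lat1, lon1, lat2, lon2)
% Great-circle distance in km between points given in degrees (haversine).
R = 6371; % mean Earth radius in km
dLat = (lat2 - lat1) * pi/180;
dLon = (lon2 - lon1) * pi/180;
a = sin(dLat/2).^2 + cosd(lat1) .* cosd(lat2) .* sin(dLon/2).^2;
d = 2 * R * asin(sqrt(a));
end
```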

