How to quickly identify a list of nearby stations?

Question

Leon el 2 de Oct. de 2022

0
Enlazar

Enlace directo a esta pregunta

https://es.mathworks.com/matlabcentral/answers/1815800-how-to-quickly-identify-a-list-of-nearby-stations

Respondida: Walter Roberson el 2 de Oct. de 2022

I have a total of 1 million stations numbered from 1 to 1 million. For each of them, how do I quickly identify the rest of the stations (Station numbers) that are within 2 km of distance? I wrote the below program, but it was extremely slow. Any recommendations would be greatly appreciated. Many thanks!

%% calculate the distance between the stations:
m2 = numel(Sta); % The list of profile numbers 
B0 = zeros(m2*m2, 6);
n0 = 0;
for i=1:m2
    for j=1:m2
        
        if Sta{i}.lon <= 180
            Lon1 = Sta{i}.lon; 
        else
            Lon1 = Sta{i}.lon-360; 
        end
        
        if Sta{j}.lon <= 180
            Lon2 = Sta{j}.lon; 
        else
            Lon2 = Sta{j}.lon-360; 
        end
        
        n0 = n0+1;
        B0(n0,:) = [i, j, Sta{i}.lat, Lon1, Sta{j}.lat, Lon2];
        
    end
end
Dis = distance(B0(:,3), B0(:,4), B0(:,5), B0(:,6), 6371); % The radius of the earth is 6371km.
%units: km.
D = []; % cell array storing the list of stations that are close to this profile:
for i=1:m2
    
    Ind1 = B0(:,1)==i; % Index of the distance values for Station i.
    B   = [B0(Ind1,2), Dis(Ind1)]; % Two columns: List of profiles, Distance.
    Ind = B(:,2) <= 2; % distance is less than 2km. 
    
    if sum(Ind) >= 2 % at least two profiles identified, one of them is itself.
        C0 = setdiff( B(Ind,1), i ); % list of profiles that are close to this station (excluding itself)
        D{i} = C0;
    else    
        D{i} = [];
    end
    
end

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Torsten el 2 de Oct. de 2022

0
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1815800-how-to-quickly-identify-a-list-of-nearby-stations#answer_1065040

Editada: Torsten el 2 de Oct. de 2022

Abrir en MATLAB Online

https://de.mathworks.com/help/stats/pdist.html

And for your double loop, you should replace

for i=1:m2
    for j=1:m2

by

for i=1:m2
    for j=i+1:m2

to save computation time.

7 comentarios
Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos

Leon el 2 de Oct. de 2022

Many thanks! I forgot everything I learned in my Advanced Mathematics class :-)

Image Analyst el 2 de Oct. de 2022

You accepted this answer but we don't know how you solved the problem of needing 8000 GB of memory to hold your B0 array for your trillion distances. How did you do it?

Iniciar sesión para comentar.

Answer 2

Image Analyst el 2 de Oct. de 2022

1
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1815800-how-to-quickly-identify-a-list-of-nearby-stations#answer_1065090

Abrir en MATLAB Online

Well let's see why it's slow. You have a million elements and you create an array that is a million by a million by 8 bytes. So that is 8000 gigabytes! Now maybe you have an 8 terabyte drive in your computer for swap space, but it will still be slow. I don't even know if MATLAB can handle an array that big even if you had infinite memory.

The second reason it is slow is the use of cell arrays. Cell arrays are very slow and memory inefficient. Put all your locations into standard double arrays, not cell arrays.

Now I don't know what this distance function you're using is (there are several) but it might be slow because it is general purpose with lots of overhead or maybe it takes the square root or even some more complicated formula. That could be avoided by using the simple Euclidean distance formula but not taking the square root. If you want the distance to be less than 2 km, that is the same as wanting the squared distance to be less than 4. Hence just square the distances and compare to 4 but don't take the square root. That is just an extra, unnecessary operation that will slow it down.

So why do I think the square distance will probably be better? Well if you're trying to find stations that are within 2 km of each other you're probably not considering stations placed all over the globe. You're probably not worried about whether a station in Beijing is within 2 km of one in Miami. You wouldn't need that kind of precision. You're probably considering a much smaller area, like just stations in Belgium or some country or state. Though why you'd have a million stations in a small area, I don't know. What is a station anyway? A radio station? A train station? An antenna location? So if you're using a small area of the globe you can convert the lats and lons to linear (x,y) locations. But you'll say "oh but it won't be accurate" and I'll say "who cares?" Certainly not us/you. It will be accurate enough to find stations within 2 km since there is no earth curvature effect over such a small distance, and if it's inaccurate for stations separated by 500 km, who cares? You don't want to identify a pair of stations like that (very distant) anyway so who cares if it says the distance is 480 km instead of 500 km? It doesn't matter. In the end all you care about are stations where the squared distance is less than 4, not the absolutely precise distance down to the nearest millimeter.

So just have a 2-D array where you compute the squared distance of every station to every other station using the Pythagorean theorem and your (x,y) coordinates (not cell arrays). Try meshgrid. But you still have the problem of having a trillion distances to compute. You're just going to somehow reduce that down to something more manageable - a trillion distances is not realistic. So now once you have the square distances in your array, finding the ones closer than 4 is trivial

mask = (squaredDistances > 0) & (squaredDistances <= 4);
% Find station indexes
[stationID1, stationID2] = find(mask);

2 comentarios
Mostrar NingunoOcultar Ninguno

Image Analyst el 2 de Oct. de 2022

You can speed this up even further by not computing the squared distance. Just compute the absolute value of the delta X and delta Y. Then you can throw out any that are not within 2. That will narrow it down immensely. However you're still going to have a trillion delta x distances and a trillion delta y distances initially. Plus you're avoiding the two squaring operations and a summing operation - that's a total of 3 trillion mathematical operations you're avoiding.

Now, on that smaller subset (maybe a few million or so), you can compute the more accurate distances using a more precise method if you wish.

Leon el 2 de Oct. de 2022

Wow, thank you so much for the very insightful comments! I find it very helpful.

Iniciar sesión para comentar.

Answer 3

Walter Roberson el 2 de Oct. de 2022

1
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1815800-how-to-quickly-identify-a-list-of-nearby-stations#answer_1065200

First, divide the stations into subsets:

stations close enough to the North Pole that the smaller distance between degrees of longitude could start substantially affecting the calculation of which stations are close to each other. For the moment, call this NorthStations
A subset of NorthStations that is potentially within 2 km of the sourthern boundary of the NorthStations. Call this NorthBorderStations for the moment
Likewise, create SouthStations and SouthBorderStations
Take the remaining stations that are not in NorthStations or in SouthStations, and add copies of NorthBorderStations and SouthBorderStations to that. Call this OtherStations at the moment
Take a subset of OtherStations that are potentially within 2km of -180 degrees (remember that OtherStations is clipped at 2km North/South of the boundary you decided earlier so the curve at the North/South boundary will be limited.) Add 360 to those longitudes and add that subset back into OtherStations

Now NorthStations and SouthStations are substantially reduced subsets of the original data. You can use the Haversine distance formula within each subset, perhaps using pdist(). This will compare every point in each of the subsets to every other point within the subsets, but the subsets should only be including the locations where the curve introduces notable risk of unexpected connections. The distances calculated within each of the subsets will be exact on the first pass for each of the two subsets.

Now take OtherStations and use knnsearch and Euclidean distance, with the radius set to the largest euclidean distance that maps to 2km right at the border you used for dividing into the North / South subsets. For example if you choose to put the boundary at acosd(1/3) (70.5 degrees) then use a radius corresponding to 1/(1/3)*2 = 6 km. knnsearch with these settings with create a quadtree and search it for nearby points. This will, if I recall, take roughly N * log(N) time where N is the number of points in the subset.

The list of neighbours you get will be an overestimate -- but now you can take the points and their neighbours and do pairwise havershine formulas to get exact distances.

The higher/lower you put the latitude boundaries the more you reduce the size of the NorthStations and SouthStations subsets that have to be compared pairwise, but the more you overpopulate the search in OtherStations (because the more extreme the boundary, the larger you need to make the euclidean search radius so as to be able to find the matches right at the boundary.

I guess perhaps this search strategy could end up with some duplicates, if you happen to have pairs that lie exactly within the NorthBorderStations or SouthBorderStations, or entirely within the band that is duplicated at -180, so be sure to do a final filtering pass.

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Answer 4

Walter Roberson el 2 de Oct. de 2022

0
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1815800-how-to-quickly-identify-a-list-of-nearby-stations#answer_1064960

If you have the Mapping Toolbox then use distance

If you do not have that toolbox then see the File Exchange https://www.mathworks.com/matlabcentral/fileexchange/27785-distance-calculation-using-haversine-formula

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Walter Roberson el 2 de Oct. de 2022

How close to the poles do you get?

Leon el 2 de Oct. de 2022

Editada: Leon el 2 de Oct. de 2022

The stations can be anywhere in the global ocean. Some of them will be close to the poles.

Iniciar sesión para comentar.

How to quickly identify a list of nearby stations?

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

7 comentarios
Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos

Más respuestas (3)

2 comentarios
Mostrar NingunoOcultar Ninguno

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

How to quickly identify a list of nearby stations?

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

7 comentarios Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos

Más respuestas (3)

2 comentarios Mostrar NingunoOcultar Ninguno

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

3 comentarios Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

7 comentarios
Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos

2 comentarios
Mostrar NingunoOcultar Ninguno

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo