Random sampling of elements from an array based on a target condition in MATLAB

Question

0 votos

I have an array (let's call it ElmInfo) of size Nx2 representing a geometry. In that array the element number and element volume are on the column 1 and column 2 respectively. The volume of elements largely vary. The sum of the volume of all elements leads to a value V which can be obtained in MATLAB as:

V=sum(ElmInfo(:,2));

I want to randomly sample elements from the array ElmInfo in such a way that the volume of sampled elements (with no repetition) will lead to a target volume V1. Note: V1 is less than V. So I don't know the number of elements to be sampled. I am giving an example. For a sampling case number of sampled element can be '10' whereas in other sampling number of sampled element can be '15'.

There is no straightforward MATLAB in-built function to meet the target condition. How can I implement the code in MATLAB?

0 comentarios
Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Follow Question

Answer 1

Richard Brown el 5 de Nov. de 2019

Abrir en MATLAB Online

1 voto

If your process is just to sample until you are within a tolerance of V_1 and you want things to be truly random, how about the following algorithm:

Randomly shuffle the entries of V
Take a cumulative sum
If there is an entry within a certain tolerance of V_1, keep the corresponding entries and exit
Otherwise, repeat

This is a completely naive brute-force approach, which won't work well at all if your tolerance is small. But if your tolerance is generous, the following might do the job well enough, and will actually sample things at random.

n = 300;
v = rand(n, 1);
V1 = 17;
tol = 1e-2;
sample = [];
maxits = 10000;
for count = 1:maxits
    p = randperm(n);
    s = cumsum(v(p));
    k = find(abs(s - V1) < tol);
    if ~isempty(k)
        sample_indices = p(1:k(1));
        sample = v(sample_indices);
        fprintf('sample found after %d iterations\n', count);
        break
    end
end

1 comentario
Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

Rudraprasad Bhattacharyya el 5 de Nov. de 2019

@Richard, Thanks a lot.

This works for me. I have tried with a tolerance of 1e-5.

Iniciar sesión para comentar.

Answer 2

John D'Errico el 4 de Nov. de 2019

Editada: John D'Errico el 4 de Nov. de 2019

1 voto

Are you looking for an exact match to V1? An approximate sum equal to V1? If so, then how close of a tolerance do you need?

Is there anything already in MATLAB to do this exactly? Absolutely not that I know of. Certainly not in MATLAB proper. You might find something on the file exchange, depending on whether you need an exact match to V1 or one with a tolerance.

Essentially, I would call this a random boolean (or 0-1) knapsack problem. An element from your set either appears in the knapsack, or it does not, but not more than one copy of any element. The problem is there may well be many possible subsets that fit your goal,

https://en.wikipedia.org/wiki/Knapsack_problem

I could offer a solution that uses intlinprog, since that is a classical way to solve such a problem, although much depends on whether you need an exact solution or not.

8 comentarios
Mostrar 6 comentarios más antiguos Ocultar 6 comentarios más antiguos

Richard Brown el 5 de Nov. de 2019

Abrir en MATLAB Online

The integer linear programming problem will find a solution, if one exists, but it doesn't really constitute "random sampling", i.e. choosing one random solution out of the set of all possible solutions. I'm not sure there is even a way to achieve this outcome at all. Depending on your application, you could just sample elements one at a time and stop when your sample exceeds V_1 - this might do the job if your elements are quite small compared with V_1.

As far as finding a sample goes, I'll save John the typing ? - here's the integer linear programming solution:

You can pose a very similar problem as a maximisation problem, where you're looking for the subset whose sum is largest, and less than V1_target (or V1_target + tolerance). It depends on your application as to which is the better choice.

you want to

So this is pretty much in exactly the right form to pass to intlinprog. Entries of the vector x will be 1 if an element is included, 0 otherwise.

n = numel(V)
tol = 1e-4;
lb = zeros(1, n);
ub = ones(1, n);
intcon = 1:n;
A = V.';
b = V_1 + tol;
[x, fval, exitflag, output] = intlinprog(-V, intcon, A, b, [], [], lb, ub)

Also, first time I've been on MATLAB Answers for a few years, good to see you're still going strong @John D'Errico!

John D'Errico el 5 de Nov. de 2019

Editada: John D'Errico el 5 de Nov. de 2019

Abrir en MATLAB Online

Actually, there is a way to use ILP, and have it generate a random solution. People doubt me. Probably for good reasons. ;-)

But the approach I was thinking of when I wrote my anwer was...

Assume a vector V of length 300. I'll pick some arbitrary numbers here.

Vols = rand(1,300);
Vtarget = 25;
Vtol = 0.1;
n = numel(Vols);
lb = zeros(1,n);
ub = ones(1,n);
intcon = 1:n;
A = [Vols;-Vols];
b = [Vtarget + Vtol;-Vtarget + Vtol];
f = randn(n,1); % a random orientation for the objective!
[x, fval, exitflag, output] = intlinprog(f, intcon, A, b, [], [], lb, ub);

As you can see, it solved the problem, with 79 elements chosen this time.

Vols*x
ans =
       25.098
       
sum(x)
ans =
           79

If I chose a different random vector f, it will find a different random solution, based on f. So on a second try, we see:

sum(x)
ans =
           75
Vols*x
ans =
       25.056
       
find(x).'
ans =
  Columns 1 through 24
     6    11    15    21    23    26    28    35    36    37    38    39    52    62    69    70    71    77    78    81    82    86    87    91
  Columns 25 through 48
    92    93    97    98   101   102   104   108   109   115   116   123   130   144   147   148   153   154   159   161   163   175   179   182
  Columns 49 through 72
   183   187   188   189   191   192   193   198   205   207   208   213   221   224   228   231   235   253   257   262   264   276   282   284
  Columns 73 through 75
   289   292   296

I won't debate the distribution of the solution, because that would seem difficult to resolve. It is pretty fast, taking around 0.09 seconds of time to run. That seems eminently reasonable to me, given the work it is doing.

Note that for some tolerances and sets of volumes, it is possible that no solution exists. Such a problem is trivially created.

Vols = 1:10;
Vtarget = 12.5;
Vtol = 0.1;

I won't say I am still going strong, but just still going. Over 30 years at it now.

John D'Errico el 6 de Nov. de 2019

Editada: John D'Errico el 6 de Nov. de 2019

What I cannot say is anything about the distribution, at least not without some thought. So here are some vague thoughts.

For example, the restriction about the sum reduces the search space to a convex polytope in 300 dimensions. If it was an exact equality constraint, the polytope would actually lie in a hyperplane of dimension 299. But since there is a tolerance, we actually have a polytope in 300 dimensions, containing non-zero volume.

Any solutions that exist (and depending on the tolerance there may potentially be a gazillion of them in 300 dimensions) will lie somewhere in that polytope. Some possibly on the surface, some inside. The integer constraint ensures that all solutions lie on the surface of that polytope, at least in this case, since the problem is a boolean one, and the search space is a subset of the nodes of a 300 boolean dimensional hypercube.

That means that all possible solutions to the problem are accessible to this approach. If the search space were larger, for example, we could have as many as k(for k>1) copies of any sub-volume element, then some solutions would lie inside the polytope. But that is not the case for a boolean problem.

So all possible solutions are accessible.The pertinent question is, is each solution equally likely? That depends on the vector f chosen by the call to randn. randn is a circularly symmetric random variable. Note that I carefully did not suggest using rand to generate the vector f, as rand is very non-circularly symmetric.

You can think of the random vector f as choosing a random rotation of the polytope. Then the solver finds a specific nodal solution of the problem from the set of possible solutions, the lowest, in a sense. I will conjecture the circular symmetry of the implicit rotation suggests that all solutions are equally possible. But that is merely a wild conjecture at this point, said without any true conviction. But what you want to have happen is indeed that all possible solutions would be equally likely.

So now, let me return to the solution polytope itself. Were the problem a continuous one, we could decribe it as a virtually paper thin one, composed of two parallel hyper-planes intersecting the R^300 unit hypercube, and the space between them. The integer constraint actually reduces that to the convex hull of all possible solutions. So it is still a very thin thing in 300 dimensions, and still a convex polytope. Now we introduce a circularly symmetric random rotation on that polytope, and after the rotation, we choose the "lowest" polytope vertex in the rotated space. Is that choice now an equal choice among the candidates? Here, I worry that might not be since the shape of the polytope is dependent on the set of volumes in the original vector of sub-volumes. For example, what happens if one of those sub-volumes was zero, or almost nearly so? My intuition is lagging a bit on this now (it is very early in the am as I write this.)

I will stand by the claim that all possible solutions are accessible though.

Richard Brown el 6 de Nov. de 2019

Editada: Richard Brown el 6 de Nov. de 2019

Abrir en MATLAB Online

I fully agree that all possible solutions are accessible this way.

I was giving myself a headache trying to figure out whether each solution was equally likely (in the sense that the probability of each possible solution is equal), but ended up managing to find a counterexample numerically.

Consider the vector of volumes

, where we want to find a selection of volumes that add to 4. (Going for an exact problem, rather than a tolerance one). In this case we can just write down the solutions, generate random objectives and see which one wins. I've ordered the solutions so that the one with all ones is first, the ones involving a 2 and 2 ones are next, then the ones involving the three are last.

V = [1; 2; 1; 1; 3; 1];
X = [1 0 1 1 0 1;
     1 1 1 0 0 0;
     1 1 0 1 0 0;
     1 1 0 0 0 1;
     0 1 1 1 0 0;
     0 1 1 0 0 1;
     0 1 0 1 0 1;
     1 0 0 0 1 0;
     0 0 1 0 1 0;
     0 0 0 1 1 0;
     0 0 0 0 1 1]';
n = numel(V);
nsols = size(X, 2);
f = randn(n, 100000);
[~, I] = max(f'*X, [], 2);
hist(I, 1:nsols)

I find that the solution involving all four ones is more likely than the others, followed by the solution involving the three, then those involving the twos. Note n is large enough that the pattern is evident.

John D'Errico el 6 de Nov. de 2019

Editada: John D'Errico el 6 de Nov. de 2019

Abrir en MATLAB Online

Yes, but that is a really bad example in my opinion. No solution method can distinguish between the four ones in that vector. If the goal is a binary solve, so any element can appear no more than once, then if your vector has replicates in it, you confuse the issue because you are not properly treating the goal that any element cannot appear more than once. This confuses the question of what uniform sampling means. I'm not even sure what one might mean about uniform sampling in the case of replicate elements. There are certainly multiple ways you could argue it.

Your example (with no tolerance) is a variant of an integer partitions problem, solved by my partitions code (found on the FEX.)

For example, given the set 1:8, and a target sum of 20.

sols = partitions(20,1:8,1)*diag(1:8)
sols =
   2     3     4     5     6     0     0
   0     3     4     5     0     7     0
   2     0     4     0     6     7     0
   0     3     4     0     6     7     0
   2     0     0     5     6     7     0
   2     0     4     5     0     0     8
   0     3     4     5     0     0     8
   2     3     0     0     6     0     8
   2     0     4     0     6     0     8
   0     0     0     5     6     0     8
   2     3     0     0     0     7     8
   0     0     4     0     0     7     8
   0     0     0     5     0     7     8
     
sum(sols,2)'
ans =
  20    20    20    20    20    20    20    20    20    20    20    20
    
size(sols)
ans =
   8

There are 13 distinct ways to generate that target sum. The rows of that matrix represent the entire set of possible solutions.

Are each of them equally likely, if I use the random projection scheme I described? Your scheme should work as a Monte Carlo simulation to see how well it does, but you NEED to be more careful with hist!!!!!!!!!!!!!!!

[~,ind] = min(sols*randn(8,10000000),[],1);
hist(ind,100)

It does not look that bad, but it is admittedly non-uniform, in the sense that each possible event was not equally likely. Not too far off though. I'd probably accept it as a sampling scheme, given the alternatives. For example, the random scheme you proposed has no reason to be any better in terms of "uniformity".

However, if I use hist like this:

hist(ind)

See that hist has edge problems. So your histogram is likely invalid, as you generated it. Never use hist as you did, allowing it to choose the bins. Note that you got the same kind of behavior when you used hist, thinking it showed a highly non-uniform sampling. Hist is confusing the issue when you use it like that.

Richard Brown el 7 de Nov. de 2019

Editada: Richard Brown el 7 de Nov. de 2019

My histogram is fine :)

If you look a little closer, you'll see I specified the bin centres (1:nsols) -- you can see that the 11 bars correspond to the 11 solutions. And the bars are exactly as expected - all solutions of the same class (e.g. a two and two ones) have around the same height, which is why there are three different heights evident, corresponding to the different ways of choosing (1,1,1,1), (2, 1, 1), and (3, 1)

While my toy example involving integers is admittedly contrived, it makes the point I was trying to make that each possible solution (where replicates are considered distinct) is not equally likely to be chosen! No more, no less!

And yes, the random scheme I proposed (before you suggested your elegant random objective orientation) is no better ... no arguments there

Iniciar sesión para comentar.

Random sampling of elements from an array based on a target condition in MATLAB

0 comentarios
Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

Respuesta aceptada

1 comentario
Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

Más respuestas (1)

8 comentarios
Mostrar 6 comentarios más antiguos Ocultar 6 comentarios más antiguos

Categorías

Etiquetas

Community Treasure Hunt

Random sampling of elements from an array based on a target condition in MATLAB

0 comentarios Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

Respuesta aceptada

1 comentario Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

Más respuestas (1)

8 comentarios Mostrar 6 comentarios más antiguos Ocultar 6 comentarios más antiguos

Categorías

Etiquetas

Ver también

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

1 comentario
Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

8 comentarios
Mostrar 6 comentarios más antiguos Ocultar 6 comentarios más antiguos