Random sampling of elements from an array based on a target condition in MATLAB
Mostrar comentarios más antiguos
I have an array (let's call it ElmInfo) of size Nx2 representing a geometry. In that array the element number and element volume are on the column 1 and column 2 respectively. The volume of elements largely vary. The sum of the volume of all elements leads to a value V which can be obtained in MATLAB as:
V=sum(ElmInfo(:,2));
I want to randomly sample elements from the array ElmInfo in such a way that the volume of sampled elements (with no repetition) will lead to a target volume V1. Note: V1 is less than V. So I don't know the number of elements to be sampled. I am giving an example. For a sampling case number of sampled element can be '10' whereas in other sampling number of sampled element can be '15'.
There is no straightforward MATLAB in-built function to meet the target condition. How can I implement the code in MATLAB?
Respuesta aceptada
Más respuestas (1)
John D'Errico
el 4 de Nov. de 2019
Editada: John D'Errico
el 4 de Nov. de 2019
1 voto
Are you looking for an exact match to V1? An approximate sum equal to V1? If so, then how close of a tolerance do you need?
Is there anything already in MATLAB to do this exactly? Absolutely not that I know of. Certainly not in MATLAB proper. You might find something on the file exchange, depending on whether you need an exact match to V1 or one with a tolerance.
Essentially, I would call this a random boolean (or 0-1) knapsack problem. An element from your set either appears in the knapsack, or it does not, but not more than one copy of any element. The problem is there may well be many possible subsets that fit your goal,
I could offer a solution that uses intlinprog, since that is a classical way to solve such a problem, although much depends on whether you need an exact solution or not.
8 comentarios
Rudraprasad Bhattacharyya
el 5 de Nov. de 2019
Richard Brown
el 5 de Nov. de 2019
The integer linear programming problem will find a solution, if one exists, but it doesn't really constitute "random sampling", i.e. choosing one random solution out of the set of all possible solutions. I'm not sure there is even a way to achieve this outcome at all. Depending on your application, you could just sample elements one at a time and stop when your sample exceeds V_1 - this might do the job if your elements are quite small compared with V_1.
As far as finding a sample goes, I'll save John the typing ? - here's the integer linear programming solution:
You can pose a very similar problem as a maximisation problem, where you're looking for the subset whose sum is largest, and less than V1_target (or V1_target + tolerance). It depends on your application as to which is the better choice.
you want to
So this is pretty much in exactly the right form to pass to intlinprog. Entries of the vector x will be 1 if an element is included, 0 otherwise.
n = numel(V)
tol = 1e-4;
lb = zeros(1, n);
ub = ones(1, n);
intcon = 1:n;
A = V.';
b = V_1 + tol;
[x, fval, exitflag, output] = intlinprog(-V, intcon, A, b, [], [], lb, ub)
Also, first time I've been on MATLAB Answers for a few years, good to see you're still going strong @John D'Errico!
John D'Errico
el 5 de Nov. de 2019
Editada: John D'Errico
el 5 de Nov. de 2019
Actually, there is a way to use ILP, and have it generate a random solution. People doubt me. Probably for good reasons. ;-)
But the approach I was thinking of when I wrote my anwer was...
Assume a vector V of length 300. I'll pick some arbitrary numbers here.
Vols = rand(1,300);
Vtarget = 25;
Vtol = 0.1;
n = numel(Vols);
lb = zeros(1,n);
ub = ones(1,n);
intcon = 1:n;
A = [Vols;-Vols];
b = [Vtarget + Vtol;-Vtarget + Vtol];
f = randn(n,1); % a random orientation for the objective!
[x, fval, exitflag, output] = intlinprog(f, intcon, A, b, [], [], lb, ub);
As you can see, it solved the problem, with 79 elements chosen this time.
Vols*x
ans =
25.098
sum(x)
ans =
79
If I chose a different random vector f, it will find a different random solution, based on f. So on a second try, we see:
sum(x)
ans =
75
Vols*x
ans =
25.056
find(x).'
ans =
Columns 1 through 24
6 11 15 21 23 26 28 35 36 37 38 39 52 62 69 70 71 77 78 81 82 86 87 91
Columns 25 through 48
92 93 97 98 101 102 104 108 109 115 116 123 130 144 147 148 153 154 159 161 163 175 179 182
Columns 49 through 72
183 187 188 189 191 192 193 198 205 207 208 213 221 224 228 231 235 253 257 262 264 276 282 284
Columns 73 through 75
289 292 296
I won't debate the distribution of the solution, because that would seem difficult to resolve. It is pretty fast, taking around 0.09 seconds of time to run. That seems eminently reasonable to me, given the work it is doing.
Note that for some tolerances and sets of volumes, it is possible that no solution exists. Such a problem is trivially created.
Vols = 1:10;
Vtarget = 12.5;
Vtol = 0.1;
I won't say I am still going strong, but just still going. Over 30 years at it now.
Richard Brown
el 6 de Nov. de 2019
Ha, never would have thought of that! I like it
Forgive me my unbelief ;-)
John D'Errico
el 6 de Nov. de 2019
Editada: John D'Errico
el 6 de Nov. de 2019
What I cannot say is anything about the distribution, at least not without some thought. So here are some vague thoughts.
For example, the restriction about the sum reduces the search space to a convex polytope in 300 dimensions. If it was an exact equality constraint, the polytope would actually lie in a hyperplane of dimension 299. But since there is a tolerance, we actually have a polytope in 300 dimensions, containing non-zero volume.
Any solutions that exist (and depending on the tolerance there may potentially be a gazillion of them in 300 dimensions) will lie somewhere in that polytope. Some possibly on the surface, some inside. The integer constraint ensures that all solutions lie on the surface of that polytope, at least in this case, since the problem is a boolean one, and the search space is a subset of the nodes of a 300 boolean dimensional hypercube.
That means that all possible solutions to the problem are accessible to this approach. If the search space were larger, for example, we could have as many as k(for k>1) copies of any sub-volume element, then some solutions would lie inside the polytope. But that is not the case for a boolean problem.
So all possible solutions are accessible.The pertinent question is, is each solution equally likely? That depends on the vector f chosen by the call to randn. randn is a circularly symmetric random variable. Note that I carefully did not suggest using rand to generate the vector f, as rand is very non-circularly symmetric.
You can think of the random vector f as choosing a random rotation of the polytope. Then the solver finds a specific nodal solution of the problem from the set of possible solutions, the lowest, in a sense. I will conjecture the circular symmetry of the implicit rotation suggests that all solutions are equally possible. But that is merely a wild conjecture at this point, said without any true conviction. But what you want to have happen is indeed that all possible solutions would be equally likely.
So now, let me return to the solution polytope itself. Were the problem a continuous one, we could decribe it as a virtually paper thin one, composed of two parallel hyper-planes intersecting the R^300 unit hypercube, and the space between them. The integer constraint actually reduces that to the convex hull of all possible solutions. So it is still a very thin thing in 300 dimensions, and still a convex polytope. Now we introduce a circularly symmetric random rotation on that polytope, and after the rotation, we choose the "lowest" polytope vertex in the rotated space. Is that choice now an equal choice among the candidates? Here, I worry that might not be since the shape of the polytope is dependent on the set of volumes in the original vector of sub-volumes. For example, what happens if one of those sub-volumes was zero, or almost nearly so? My intuition is lagging a bit on this now (it is very early in the am as I write this.)
I will stand by the claim that all possible solutions are accessible though.
Richard Brown
el 6 de Nov. de 2019
Editada: Richard Brown
el 6 de Nov. de 2019
I fully agree that all possible solutions are accessible this way.
I was giving myself a headache trying to figure out whether each solution was equally likely (in the sense that the probability of each possible solution is equal), but ended up managing to find a counterexample numerically.
Consider the vector of volumes
, where we want to find a selection of volumes that add to 4. (Going for an exact problem, rather than a tolerance one). In this case we can just write down the solutions, generate random objectives and see which one wins. I've ordered the solutions so that the one with all ones is first, the ones involving a 2 and 2 ones are next, then the ones involving the three are last.
V = [1; 2; 1; 1; 3; 1];
X = [1 0 1 1 0 1;
1 1 1 0 0 0;
1 1 0 1 0 0;
1 1 0 0 0 1;
0 1 1 1 0 0;
0 1 1 0 0 1;
0 1 0 1 0 1;
1 0 0 0 1 0;
0 0 1 0 1 0;
0 0 0 1 1 0;
0 0 0 0 1 1]';
n = numel(V);
nsols = size(X, 2);
f = randn(n, 100000);
[~, I] = max(f'*X, [], 2);
hist(I, 1:nsols)
I find that the solution involving all four ones is more likely than the others, followed by the solution involving the three, then those involving the twos. Note n is large enough that the pattern is evident.

John D'Errico
el 6 de Nov. de 2019
Editada: John D'Errico
el 6 de Nov. de 2019
Yes, but that is a really bad example in my opinion. No solution method can distinguish between the four ones in that vector. If the goal is a binary solve, so any element can appear no more than once, then if your vector has replicates in it, you confuse the issue because you are not properly treating the goal that any element cannot appear more than once. This confuses the question of what uniform sampling means. I'm not even sure what one might mean about uniform sampling in the case of replicate elements. There are certainly multiple ways you could argue it.
Your example (with no tolerance) is a variant of an integer partitions problem, solved by my partitions code (found on the FEX.)
For example, given the set 1:8, and a target sum of 20.
sols = partitions(20,1:8,1)*diag(1:8)
sols =
0 2 3 4 5 6 0 0
1 0 3 4 5 0 7 0
1 2 0 4 0 6 7 0
0 0 3 4 0 6 7 0
0 2 0 0 5 6 7 0
1 2 0 4 5 0 0 8
0 0 3 4 5 0 0 8
1 2 3 0 0 6 0 8
0 2 0 4 0 6 0 8
1 0 0 0 5 6 0 8
0 2 3 0 0 0 7 8
1 0 0 4 0 0 7 8
0 0 0 0 5 0 7 8
sum(sols,2)'
ans =
20 20 20 20 20 20 20 20 20 20 20 20 20
size(sols)
ans =
13 8
There are 13 distinct ways to generate that target sum. The rows of that matrix represent the entire set of possible solutions.
Are each of them equally likely, if I use the random projection scheme I described? Your scheme should work as a Monte Carlo simulation to see how well it does, but you NEED to be more careful with hist!!!!!!!!!!!!!!!
[~,ind] = min(sols*randn(8,10000000),[],1);
hist(ind,100)

It does not look that bad, but it is admittedly non-uniform, in the sense that each possible event was not equally likely. Not too far off though. I'd probably accept it as a sampling scheme, given the alternatives. For example, the random scheme you proposed has no reason to be any better in terms of "uniformity".
However, if I use hist like this:
hist(ind)

See that hist has edge problems. So your histogram is likely invalid, as you generated it. Never use hist as you did, allowing it to choose the bins. Note that you got the same kind of behavior when you used hist, thinking it showed a highly non-uniform sampling. Hist is confusing the issue when you use it like that.
Richard Brown
el 7 de Nov. de 2019
Editada: Richard Brown
el 7 de Nov. de 2019
My histogram is fine :)
If you look a little closer, you'll see I specified the bin centres (1:nsols) -- you can see that the 11 bars correspond to the 11 solutions. And the bars are exactly as expected - all solutions of the same class (e.g. a two and two ones) have around the same height, which is why there are three different heights evident, corresponding to the different ways of choosing (1,1,1,1), (2, 1, 1), and (3, 1)
While my toy example involving integers is admittedly contrived, it makes the point I was trying to make that each possible solution (where replicates are considered distinct) is not equally likely to be chosen! No more, no less!
And yes, the random scheme I proposed (before you suggested your elegant random objective orientation) is no better ... no arguments there
Categorías
Más información sobre Direct Search en Centro de ayuda y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!