MeX implementation of Tensorproduct multiplications

Question

ConvexHull el 6 de Dic. de 2020

0
Enlazar

Enlace directo a esta pregunta

https://es.mathworks.com/matlabcentral/answers/680483-mex-implementation-of-tensorproduct-multiplications

Editada: ConvexHull el 23 de Ag. de 2021

Dear all,

I am searching for a more efficient way to calculate tensorproduct multiplications of arbitrary sizes. First of all, i have already searched the web to get inspired by different solutions, e.g. here Tensorproduct at stackexchange:

1.) External Libraries, e.g. Eigen C++.

2.) Pure Matlab solution based in reshape and shiftdim.

3.) Matlab Tensor Toolbox.

However, so far i ended up with a MeX-file implementation with OpenMP parallelization.

--

In the following, i will give a short introduction of the math behind the implementation:

Suppose you have a matrix vector multiplication, where a matrix C is constructed by a Kronecker product of matrices A with size (n x m) and B with size (p x q). More informations can be found here Wikipedia.

.

In two dimensions this operation can be performed with O(npq+qnm) operations instead of O(mqnp) operations.

--

So far so good. In following example the sizes of the arrays are

matrix_xi = 4x4,

matrix_eta = 4x4,

vector = 16x500000x5,

(new version: vector = 16x2500000)

and may be initialized with ones or zeros. Moreover, rather than calculating only one Kronecker matrix vector multiplication, i want to calculate e.g. 500000x5 operations with one call (see the size of vector).

--

My current MeX-C file implementation:

/*************************************************
 * CALLING:
 *
 * out = tensorproduct(A,B,vector)     
 * 
 *************************************************/
#include "mex.h"
#include "omp.h"
#define PP_A      prhs[0]
#define PP_B      prhs[1]
#define PP_vector prhs[2]
#define PP_out    plhs[0]
#define n 4 
#define m 4 
#define p 4 
#define q 4
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
  const mwSize   *dim;
  int            i,j,k,s,l;
  double         temp[n*q];
  double *A      = mxGetPr(PP_A);
  double *B      = mxGetPr(PP_B);
  double *vector = mxGetPr(PP_vector);
  dim            = mxGetDimensions(PP_vector);
  l              = dim[1];
  PP_out         = mxCreateDoubleMatrix(l*n*p,1,mxREAL);
  double *out    = mxGetPr(PP_out);
            
  #pragma omp parallel for private(i,j,k,s,temp)
  for(k=0; k<l; k++){
    for(i=0; i<n; i++){
      for(j=0; j<q; j++){
        temp[i+j*n]=0;
      }
    }
    for(s=0; s<m; s++){
      for(i=0; i<n; i++){
        for(j=0; j<m; j++){
          temp[i+s*n]+=A[i+j*n]*vector[j+s*m+k*m*p];
        }
      }
    }
    for(s=0; s<n; s++){
      for(i=0; i<p; i++){
        for(j=0; j<q; j++){
           out[i*n+s+k*n*p]+=B[i+j*p]*temp[j*n+s];
        }
      }
    }
  }
} 

--

The code can be compiled with:

mex FFLAGS='$FFLAGS             -fopenmp -Ofast -funroll-loops' ...    
    CFLAGS='$CFLAGS             -fopenmp -Ofast -funroll-loops' ...    
    LDFLAGS='$LDFLAGS           -fopenmp -Ofast -funroll-loops' ...    
    COPTIMFLAGS='$COPTIMFLAGS   -fopenmp -Ofast -funroll-loops' ...    
    LDOPTIMFLAGS='$LDOPTIMFLAGS -fopenmp -Ofast -funroll-loops' ...    
    DEFINES='$DEFINES           -fopenmp -Ofast -funroll-loops' ... 
    tensorproduct.c 

--

Currently on my Notebook: (Ubuntu 18.04, GCC 7.5.0, 4 Cores)

matrix_xi  = ones(4,4);
matrix_eta = ones(4,4);
vector     = ones(16,500000,5);
n          = 50;
tic
for i=1:n
    vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc
% Elapsed time is 8.036490 seconds.

--

A quite simple Matlab implementation without the use of the Kronecker property gives

matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
    vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc
% Elapsed time is 10.489606 seconds.

--

With a different size of n,m,p,q = 7 the difference surprisingly does not change and gets worse (here you have to change the constants in the C-file)

matrix_xi  = ones(7,7);
matrix_eta = ones(7,7);
vector     = ones(49,500000,5);
n          = 50;
tic
for i=1:n
    vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc
% Elapsed time is 31.436541 seconds.
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
    vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc
% Elapsed time is 35.829957 seconds.

Luckily I am still faster than Matlab.

--

Note that only n,m,p,q are constants during compile time. To make the code useable for arbitrary sizes i have written a Macro to generate many C-files with different const n,m,p,q which will be included later with a header and a main C-file. This will help the compiler to unrole and inline the code.

This is the most efficient way i have achieved so far. I hope there are some experts here in this forum who can help to optimize these few lines. I am also open for other solutions, e.g. Matlab based or based on external libraries.

Thank you!

4 comentarios
Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos

ConvexHull el 6 de Dic. de 2020

Editada: ConvexHull el 6 de Dic. de 2020

Abrir en MATLAB Online

I refere to matrix vector multiplication, where matrix C is build up by kron. Commonly it is called "vec trick".

C=kron(A,B);

With N ≡ n=m=p=q you can perform a matrix vector product with O(2*N*N*N) instead of O(N*N*N*N).

Here:

C = matrix

B = matrix_xi.';

A = matrix_eta;

vec(X) = vector (500000x5 times).

You can think of it in a simple way by applying matrix vector multiplications linewise in each direction. Here i have visualized X in matrix representation in order to apply the Tensorproduct multiplication linewise. Without the Kronecker property X would be handled as vector.

A horizontal arrow represents a matrix vector multiplication with e.g. "matrix_xi" on a small set of X.
A vertical arrow represents a matrix vector multiplication with e.g. "matrix_eta" on a small set of X.
The naive implementation would be to unfold X in a vector representation and apply "matrix" on whole X.

Are you familiar with Tensorproducts and their properties?

What exactly is unclear? I thought i have defined all stuff quite sufficient.

Regards

Bruno Luong el 6 de Dic. de 2020

Editada: Bruno Luong el 6 de Dic. de 2020

Abrir en MATLAB Online

Yes I do know tensor and kron. But your notation is confusing. Actually your second statement

A = matrix_xi
B = matrix_eta

is wrong. Actually it's

B = matrix_xi.';
A = matrix_eta;

And your vector consists of a set of multiple X matrices.

Each reshape(vector(:,p,q), size(A,2), size(B,1)) is an X.

Your description is very confusing.

I knew what you want to compute. For the general reader, the description is careless and confusing.

ConvexHull el 6 de Dic. de 2020

Editada: ConvexHull el 6 de Dic. de 2020

Sorry, I am a little bit inaccurate here. I hope you got it.

Summarizing, i simply want to use the Kronecker property to apply many (500000*5) efficient matrix vector multiplications using O(npq+qnm) with one call.

Thank you for your help.

Edit: My main concern is a more efficient C implementation.

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Bruno Luong el 6 de Dic. de 2020

0
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/680483-mex-implementation-of-tensorproduct-multiplications#answer_566778

Editada: Bruno Luong el 22 de Ag. de 2021

Abrir en MATLAB Online

I propose two other methods of MATLAB.

The first simply replaces (:,:) by RESHAPE. The first doing some data copy, the second do not. It seems to be the fatest.

The second method uses MATLAB pagemtimes introduced in R2020b. (Edit: Jan corrects the release)

function benchtensor
matrix_xi  = rand(7,7);
matrix_eta = rand(7,7);
vector     = rand(49,500000,5);
n          = 50;
tic
for i=1:n
    vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc % Elapsed time is 19.429296 seconds.
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
    vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc % Elapsed time is 28.08963 seconds.
tic
for i=1:n
    vector_out = reshape(matrix*reshape(vector,49,[]),size(vector));
end
toc % Elapsed time is 14.30025 seconds.
B = matrix_xi.';
A = matrix_eta;
tic
for i=1:n
    X = reshape(vector, size(A,2), size(B,1), []);
    CX = pagemtimes(pagemtimes(A, X), B); % Reshape CX if needed
end
toc % Elapsed time is 29.296709 seconds.
end

17 comentarios
Mostrar 15 comentarios más antiguosOcultar 15 comentarios más antiguos

Bruno Luong el 22 de Ag. de 2021

Thanks Jan I fix it

ConvexHull el 23 de Ag. de 2021

Editada: ConvexHull el 23 de Ag. de 2021

Note that i merged k/l loops in the original post as suggested. For the C-code you have to use vector=size(49x2500000) instead of vector=size(49x500000x5) .

The other two hints did not give me any speed up.

Iniciar sesión para comentar.

MeX implementation of Tensorproduct multiplications

4 comentarios
Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos

Respuesta aceptada

17 comentarios
Mostrar 15 comentarios más antiguosOcultar 15 comentarios más antiguos

Más respuestas (0)

Ver también

Categorías

Etiquetas

Community Treasure Hunt

MeX implementation of Tensorproduct multiplications

4 comentarios Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos

Respuesta aceptada

17 comentarios Mostrar 15 comentarios más antiguosOcultar 15 comentarios más antiguos

Más respuestas (0)

Ver también

Categorías

Etiquetas

Community Treasure Hunt

4 comentarios
Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos

17 comentarios
Mostrar 15 comentarios más antiguosOcultar 15 comentarios más antiguos