Optimized GPU implementation of strided, batched matrix multiply with add operation


D = gpucoder.stridedMatrixMultiplyAdd(A,B,C) performs strided matrix-matrix multiplication and addition on a batch of matrices. The input matrices A, B, and C for each instance of the batch are located at fixed address offsets from their addresses in the previous instance. gpucoder.stridedMatrixMultiplyAdd performs matrix-matrix multiplication of the form:

D = αAB + βC

where α and β are scalar multiplication factors, and A, B, C, and D are matrices with dimensions m-by-k, k-by-n, m-by-n, and m-by-n respectively. A and B can optionally be transposed or Hermitian-conjugated. By default, α and β are set to one and the matrices are not transposed. Use Name,Value pair arguments to specify different scalar multiplication factors and to specify transpose operations on the input matrices.

All the batches passed to the gpucoder.stridedMatrixMultiplyAdd function must be uniform. That is, all instances must have the same dimensions m, n, and k.
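The per-instance arithmetic and the fixed-offset layout described above can be sketched in plain MATLAB. This is only an illustration of the math, not GPU Coder code: the variable names, the batch size p, and the choice to store the batch as pages of 3-D arrays are all assumptions made for this sketch, and it says nothing about which array shapes the function itself accepts.

```matlab
% Hypothetical uniform batch: p instances, each with the same m, n, and k.
m = 2; k = 3; n = 4; p = 3;
A = rand(m,k,p);   % instance i of A is the page A(:,:,i)
B = rand(k,n,p);
C = rand(m,n,p);
alpha = 0.3;
beta  = 0.6;

% MATLAB stores arrays in column-major order, so consecutive instances
% start a fixed number of elements (the stride) apart.
strideA = m*k;
strideB = k*n;
strideC = m*n;

D = zeros(m,n,p);
for i = 1:p
    % Locate instance i purely from its offset relative to the first instance.
    Ai = reshape(A((i-1)*strideA + (1:strideA)), m, k);
    Bi = reshape(B((i-1)*strideB + (1:strideB)), k, n);
    Ci = reshape(C((i-1)*strideC + (1:strideC)), m, n);

    % D = alpha*A*B + beta*C for this instance.
    D(:,:,i) = alpha*Ai*Bi + beta*Ci;
end
```

Because every instance shares the same m, n, and k, a single stride per operand is enough to locate any instance, which is why the batch must be uniform.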


___ = gpucoder.stridedMatrixMultiplyAdd(___,Name,Value) performs the batched matrix multiply and add operation by using the options specified by one or more Name,Value pair arguments.


This example performs a simple batched matrix-matrix multiplication with add and uses the gpucoder.stridedMatrixMultiplyAdd function to generate CUDA® code that calls appropriate cublas<t>gemmStridedBatched APIs.

In one file, write an entry-point function myStridedMatMulAdd that accepts matrix inputs A, B, and C. The input matrices are not transposed, so use the 'nn' option.

function [D] = myStridedMatMulAdd(A,B,C,alpha,beta)

[D] = gpucoder.stridedMatrixMultiplyAdd(A,B,C,'alpha',alpha,...
    'beta',beta,'transpose','nn');

end

Use the coder.newtype function to create a type for a matrix of doubles for use in code generation.

A = coder.newtype('double',[12,14],[0 0]);
B = coder.newtype('double',[14,16],[0 0]);
C = coder.newtype('double',[12,16],[0 0]);
alpha = 0.3;
beta = 0.6;
inputs = {A,B,C,alpha,beta};

Use the codegen function to generate a CUDA library.

cfg = coder.gpuConfig('lib');
cfg.GpuConfig.EnableCUBLAS = true;
cfg.GpuConfig.EnableCUSOLVER = true;
cfg.GenerateReport = true;
codegen -config cfg -args inputs myStridedMatMulAdd

The generated CUDA code contains kernels named myStridedMatMulAdd_kernelNN that initialize the input and output matrices. It also contains calls to the cublasDgemmStridedBatched API of the cuBLAS library. The following is a snippet of the generated code.

// File:

void myStridedMatMulAdd(const double A[168], const double B[224], const double
  C[192], double alpha, double beta, double D[192])
  double alpha1;
  cudaMemcpy(gpu_C, (void *)&C[0], 1536UL, cudaMemcpyHostToDevice);
  myStridedMatMulAdd_kernel1<<<dim3(1U, 1U, 1U), dim3(192U, 1U, 1U)>>>(*gpu_C,
  cudaMemcpy(gpu_alpha1, &alpha1, 8UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_A, (void *)&A[0], 1344UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_B, (void *)&B[0], 1792UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_beta1, &beta1, 8UL, cudaMemcpyHostToDevice);
  cublasDgemmStridedBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N,
    12, 16, 14, (double *)gpu_alpha1, (double *)&(*gpu_A)[0], 12, 0, (double *)
    &(*gpu_B)[0], 14, 0, (double *)gpu_beta1, (double *)&(*gpu_D)[0], 12, 192, 1);
  cudaMemcpy(&D[0], gpu_D, 1536UL, cudaMemcpyDeviceToHost);
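
The array lengths, copy sizes, and dimension arguments in this snippet follow directly from the input types created with coder.newtype. The short check below reproduces that arithmetic. The variable names are chosen for this sketch, and reading the three integers after CUBLAS_OP_N, CUBLAS_OP_N as m, n, and k assumes the usual cuBLAS gemm argument order.

```matlab
% Sizes taken from the coder.newtype calls in this example.
sizeA = [12 14];   % m-by-k
sizeB = [14 16];   % k-by-n
sizeC = [12 16];   % m-by-n
bytesPerDouble = 8;

% Element counts, as in "const double A[168]", "B[224]", and "C[192]".
numelA = prod(sizeA);   % 168
numelB = prod(sizeB);   % 224
numelC = prod(sizeC);   % 192

% Byte counts passed to the cudaMemcpy calls.
bytesA = numelA*bytesPerDouble;   % 1344
bytesB = numelB*bytesPerDouble;   % 1792
bytesC = numelC*bytesPerDouble;   % 1536

% Dimensions passed to cublasDgemmStridedBatched (12, 16, 14).
m = sizeA(1);
n = sizeB(2);
k = sizeA(2);

% Leading dimensions of the column-major A and B (12 and 14).
lda = sizeA(1);
ldb = sizeB(1);
```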

Input Arguments

A, B, C: Operands, specified as vectors or matrices. A, B, and C must be 2-D arrays. The number of columns in A must be equal to the number of rows in B. The number of rows in A must be equal to the number of rows in C. The number of columns in B must be equal to the number of columns in C.

Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Complex Number Support: Yes

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: D = gpucoder.stridedMatrixMultiplyAdd(A,B,C,'alpha',0.3,'beta',0.6,'transpose','CC');

'alpha': Value of the scalar used for multiplication with A. The default value is one.

'beta': Value of the scalar used for multiplication with C. The default value is one.

'transpose': Character vector or string composed of two characters, indicating the operation performed on the matrices A and B prior to matrix multiplication. Possible values for each character are normal ('N'), transposed ('T'), or complex conjugate transpose ('C'). By default, the matrices are not transposed.
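
To make the effect of the transpose option concrete, here is a plain MATLAB reference for a single instance. It is a sketch, not the GPU Coder implementation: the helper names ops, applyOp, and refMultiplyAdd are hypothetical, and the sketch assumes that the first character applies to A and the second to B, following the cuBLAS gemm convention.

```matlab
% Map each option character to the operation it names.
ops = struct('N', @(X) X, ...     % normal
             'T', @(X) X.', ...   % transpose
             'C', @(X) X');       % complex conjugate transpose

% Apply one option character to a matrix (case-insensitive).
applyOp = @(X,op) feval(ops.(upper(op)), X);

% Reference for D = alpha*op(A)*op(B) + beta*C.
% t is assumed to be a two-character char vector such as 'TN'.
refMultiplyAdd = @(A,B,C,alpha,beta,t) ...
    alpha*applyOp(A,t(1))*applyOp(B,t(2)) + beta*C;

A = [1 2; 3 4];
B = eye(2);
C = ones(2);
D = refMultiplyAdd(A,B,C,2,3,'TN');   % 2*A.' + 3, that is [5 9; 7 11]
```

When an operand is transposed, its stored dimensions swap: with 'TN', for example, A must be supplied as k-by-m so that the transposed operand is m-by-k.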

Output Arguments

D: Product, returned as a scalar, vector, or matrix. Array D has the same number of rows as input A and the same number of columns as input B.

Introduced in R2020a