Optimized GPU implementation of strided, batched matrix multiply with add operation

## Syntax

``D = gpucoder.stridedMatrixMultiplyAdd(A,B,C)``
``___ = gpucoder.stridedMatrixMultiplyAdd(___,Name,Value)``

## Description

`D = gpucoder.stridedMatrixMultiplyAdd(A,B,C)` performs strided matrix-matrix multiplication and addition on a batch of matrices. The input matrices `A`, `B`, and `C` for each instance of the batch are located at fixed address offsets from their addresses in the previous instance. `gpucoder.stridedMatrixMultiplyAdd` performs matrix-matrix multiplication of the form $D=\alpha AB+\beta C$, where $\alpha$ and $\beta$ are scalar multiplication factors and `A`, `B`, `C`, and `D` are matrices with dimensions `m`-by-`k`, `k`-by-`n`, `m`-by-`n`, and `m`-by-`n`, respectively. `A` and `B` can optionally be transposed or Hermitian-conjugated. By default, $\alpha$ and $\beta$ are set to one and the matrices are not transposed. Use the `Name,Value` pair arguments to specify different scalar multiplication factors and to specify transpose operations on the input matrices.

All the batches passed to the `gpucoder.stridedMatrixMultiplyAdd` function must be uniform. That is, all instances must have the same dimensions `m`, `n`, and `k`.
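To make the formula concrete, the following is a minimal reference sketch in plain MATLAB of what one batch instance computes. It does not call GPU Coder; the matrix values and the variable names `alpha` and `beta` are illustrative assumptions chosen so the result is easy to check by hand.

```matlab
% Reference computation for a single batch instance (plain MATLAB, no GPU).
% All values below are illustrative assumptions.
A = [1 2; 3 4; 5 6];        % m-by-k = 3-by-2
B = [1 0 2 0; 0 1 0 2];     % k-by-n = 2-by-4
C = ones(3,4);              % m-by-n = 3-by-4
alpha = 0.5;                % scalar applied to the product A*B
beta  = 2;                  % scalar applied to C

D = alpha*(A*B) + beta*C;   % m-by-n result

% With the default factors (alpha = beta = 1) the formula reduces to A*B + C.
Ddefault = A*B + C;
```

Comparing generated code against a plain expression like this is one way to sanity-check results on small inputs.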


`___ = gpucoder.stridedMatrixMultiplyAdd(___,Name,Value)` performs the batched matrix multiply and add operation using the options specified by one or more `Name,Value` pair arguments.

## Examples


This example performs a simple batched matrix-matrix multiplication with add and uses the `gpucoder.stridedMatrixMultiplyAdd` function to generate CUDA® code that calls appropriate `cublas<t>gemmStridedBatched` APIs.

In one file, write an entry-point function `myStridedMatMulAdd` that accepts matrix inputs `A`, `B`, and `C`. The input matrices are not transposed, therefore use the `'nn'` option.

```
function [D] = myStridedMatMulAdd(A,B,C,alpha,beta)

[D] = gpucoder.stridedMatrixMultiplyAdd(A,B,C,'alpha',alpha,...
    'beta',beta,'transpose','nn');

end
```

Use the `coder.newtype` function to create a type for a matrix of doubles for use in code generation.

```
A = coder.newtype('double',[12,14],[0 0]);
B = coder.newtype('double',[14,16],[0 0]);
C = coder.newtype('double',[12,16],[0 0]);
alpha = 0.3;
beta = 0.6;
inputs = {A,B,C,alpha,beta};
```

Use the `codegen` function to generate a CUDA library.

```
cfg = coder.gpuConfig('lib');
cfg.GpuConfig.EnableCUBLAS = true;
cfg.GpuConfig.EnableCUSOLVER = true;
cfg.GenerateReport = true;
codegen -config cfg -args inputs myStridedMatMulAdd
```

The generated CUDA code contains kernels: `myStridedMatMulAdd_kernelNN` for initializing the input and output matrices. It also contains the `cublasDgemmStridedBatched` API calls to the cuBLAS library. The following is a snippet of the generated code.

```
//
// File: myStridedMatMulAdd.cu
...
void myStridedMatMulAdd(const double A, const double B, const double C,
                        double alpha, double beta, double D)
{
  double alpha1;
  ...
  cudaMemcpy(gpu_C, (void *)&C, 1536UL, cudaMemcpyHostToDevice);
  myStridedMatMulAdd_kernel1<<<dim3(1U, 1U, 1U), dim3(192U, 1U, 1U)>>>(*gpu_C,
    *gpu_D);
  cudaMemcpy(gpu_alpha1, &alpha1, 8UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_A, (void *)&A, 1344UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_B, (void *)&B, 1792UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_beta1, &beta1, 8UL, cudaMemcpyHostToDevice);
  cublasDgemmStridedBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N,
    12, 16, 14, (double *)gpu_alpha1, (double *)&(*gpu_A), 12, 0,
    (double *)&(*gpu_B), 14, 0, (double *)gpu_beta1, (double *)&(*gpu_D),
    12, 192, 1);
  cudaMemcpy(&D, gpu_D, 1536UL, cudaMemcpyDeviceToHost);
  ...
}
```
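The numeric constants in the snippet can be traced back to the input types used in this example. The following arithmetic sketch assumes 8-byte doubles and MATLAB's column-major storage, and reads the trailing integer arguments according to the `cublas<t>gemmStridedBatched` parameter order (leading dimension, then stride, per matrix, followed by the batch count). The variable names are illustrative only.

```matlab
% Dimensions from the coder.newtype calls: A is 12-by-14, B is 14-by-16,
% and C and D are 12-by-16.
m = 12; k = 14; n = 16;
bytesPerDouble = 8;              % assumption: IEEE double precision

bytesA = m*k*bytesPerDouble;     % size of the cudaMemcpy for A
bytesB = k*n*bytesPerDouble;     % size of the cudaMemcpy for B
bytesC = m*n*bytesPerDouble;     % size of the cudaMemcpy for C and D

elemsD = m*n;                    % elements in D: kernel block size and D stride

% Leading dimensions in column-major storage are the row counts.
ldA = m;   % 12
ldB = k;   % 14
ldD = m;   % 12
```

These values match the `1344UL`, `1792UL`, and `1536UL` copy sizes, the `192U` thread-block dimension, and the `12`, `14`, `12`, and `192` arguments in the cuBLAS call.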

## Input Arguments


`A`, `B`, `C`: Operands, specified as vectors or matrices. `A`, `B`, and `C` must be 2-D arrays. The number of columns in `A` must be equal to the number of rows in `B`. The number of rows in `A` must be equal to the number of rows in `C`. The number of columns in `B` must be equal to the number of columns in `C`.

Data Types: `double` | `single` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`
Complex Number Support: Yes
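The size rules above can be checked before calling the function. The following is a minimal sketch for the untransposed case; the variable `sizesOk` and the error text are hypothetical, and the zero-filled matrices only stand in for real operands.

```matlab
% Placeholder operands with the sizes used in the example on this page.
A = zeros(12,14);
B = zeros(14,16);
C = zeros(12,16);

% Each condition mirrors one sentence of the operand description.
sizesOk = ismatrix(A) && ismatrix(B) && ismatrix(C) ...  % 2-D arrays
    && size(A,2) == size(B,1) ...                        % cols(A) == rows(B)
    && size(A,1) == size(C,1) ...                        % rows(A) == rows(C)
    && size(B,2) == size(C,2);                           % cols(B) == cols(C)

assert(sizesOk, 'Operand sizes are not compatible with D = alpha*A*B + beta*C.');
```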

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: ```D = gpucoder.stridedMatrixMultiplyAdd(A,B,C,'alpha',0.3,'beta',0.6,'transpose','CC');```

`'alpha'`: Value of the scalar used for multiplication with `A`. The default value is one.

`'beta'`: Value of the scalar used for multiplication with `C`. The default value is one.

`'transpose'`: Character vector or string composed of two characters, indicating the operations performed on the matrices `A` and `B` prior to matrix multiplication. Possible values for each character are normal (`'N'`), transposed (`'T'`), or complex conjugate transpose (`'C'`).
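As an illustration of the three operations, the sketch below writes out the equivalent plain MATLAB expressions for a small complex matrix. It assumes that the first character of the option applies to `A` and the second to `B`, which follows the `transa`, `transb` convention of cuBLAS but is not stated explicitly here. The names `D_NN`, `D_TN`, and `D_CN` are illustrative.

```matlab
% Illustrative operands; B is the identity so each result shows op(A) directly.
A = [1+2i, 3; 4, 5-1i];
B = eye(2);
C = zeros(2);
alpha = 1; beta = 1;

D_NN = alpha*(A   * B) + beta*C;   % 'NN': A used as is
D_TN = alpha*(A.' * B) + beta*C;   % 'TN': A transposed, no conjugation
D_CN = alpha*(A'  * B) + beta*C;   % 'CN': A complex conjugate transposed
```

For real-valued inputs, `'T'` and `'C'` produce the same result, because conjugation has no effect.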

## Output Arguments


`D`: Product, returned as a scalar, vector, or matrix. Array `D` has the same number of rows as input `A` and the same number of columns as input `B`.