I think I found a relevant MATLAB example, "Train Network on Image and Feature Data", that could help me: https://www.mathworks.com/help/deeplearning/ug/train-network-on-image-and-feature-data.html
In the example, the training data are converted into datastores via arrayDatastore and then combined into dsTrain, as seen in the picture below.

It seems that the order of the combined datastores is the same as the order of inputs required by the neural net, as seen below:

dsTrain = combine(dsX1Train,dsX2Train,dsTTrain);
That is: dsX1Train (image input), dsX2Train (rotation angle), dsTTrain (output).
Am I correct?
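
To check the order myself, I put together the minimal sketch below. It builds the combined datastore from dummy data and reads one observation to see which column holds what. The array sizes, numObs, and the random data are my own assumptions for illustration and are not taken from the example; only arrayDatastore, combine, and read are actual MATLAB functions.

```matlab
% Hypothetical dummy data, used only to inspect the column order.
numObs  = 5;
X1Train = rand(28,28,1,numObs);            % images, observations along dimension 4
X2Train = rand(numObs,1);                  % one feature (e.g. rotation angle) per row
TTrain  = categorical(randi(3,numObs,1));  % made-up class labels

% Wrap each array in a datastore; images iterate over the 4th dimension.
dsX1Train = arrayDatastore(X1Train,'IterationDimension',4);
dsX2Train = arrayDatastore(X2Train);
dsTTrain  = arrayDatastore(TTrain);

% Combine in the order: image input, feature input, output.
dsTrain = combine(dsX1Train,dsX2Train,dsTTrain);

% Read a single observation: one cell per underlying datastore,
% in the same left-to-right order as the arguments to combine.
sample = read(dsTrain);
disp(size(sample))       % number of columns returned
disp(size(sample{1}))    % column 1: image
disp(size(sample{2}))    % column 2: feature
disp(class(sample{3}))   % column 3: output
```

If I understand the documentation correctly, the first columns should follow the order of the network's inputs and the last column should be the output, but I have not confirmed that.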
However, an answer from an experienced user or a MathWorker would help a lot :D