Getting Jumps in mini-batch loss when training YoloV2

ohad a on 2 May 2019
Answered: Zahra Moayed on 5 Aug 2019
Hello.
I'm trying to train YOLO v2 on my person-detector data set.
For some reason I get large jumps in the training loss in the middle of training. I can also see that the temporary checkpoint model files shrink dramatically in size (e.g., from 59 MB to 1.5 MB).
I'm using about 170 pictures with 1 to 6 bounding boxes each.
Here is the code:
% Define the image input size.
imageSize = [450 800 3];
% Define the number of object classes to detect.
numClasses = width(personDataSet)-1;
anchorBoxes = [
    76 43
    208 147
    103 68
    158 106
    198 137
    129 81
    73 40
    ];
baseNetwork = resnet50;
% Specify the feature extraction layer.
featureLayer = 'activation_49_relu';
analyzeNetwork(baseNetwork);
%reorgLayer = 'activation_47_relu';
% Create the YOLO v2 object detection network.
% lgraph = yolov2Layers(imageSize,numClasses,anchorBoxes,baseNetwork,featureLayer,'ReorglayerSource',reorgLayer);
lgraph = yolov2Layers(imageSize,numClasses,anchorBoxes,baseNetwork,featureLayer);
% Configure the training options.
% * Lower the learning rate to 1e-3 to stabilize training.
% * Set CheckpointPath to save detector checkpoints to a temporary
% location. If training is interrupted due to a system failure or
% power outage, you can resume training from the saved checkpoint.
options = trainingOptions('sgdm', ...
    'MiniBatchSize', 34, ...
    'InitialLearnRate', 1e-3, ...
    'MaxEpochs', 30, ...
    'VerboseFrequency', 2, ...
    'CheckpointPath', tempdir);
    %'LearnRateSchedule','piecewise', ...
    %'LearnRateDropPeriod',10 , ...
    %'Shuffle','every-epoch');
% Train YOLO v2 detector.
[detector,info] = trainYOLOv2ObjectDetector(trainingData,lgraph,options);
As the commented-out lines show, I also tried 'LearnRateSchedule' and 'Shuffle', as well as different learning rates, batch sizes, and numbers of epochs, and got the same results.
This is example output from the code above:
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 8).
Training on single CPU.
|========================================================================================|
| Epoch | Iteration | Time Elapsed (hh:mm:ss) | Mini-batch RMSE | Mini-batch Loss | Base Learning Rate |
|========================================================================================|
| 1 | 1 | 00:00:37 | 8.56 | 73.2 | 0.0010 |
| 1 | 2 | 00:01:14 | 3.55 | 12.6 | 0.0010 |
| 1 | 4 | 00:02:27 | 2.15 | 4.6 | 0.0010 |
| 2 | 6 | 00:03:44 | 2.81 | 7.9 | 0.0010 |
| 2 | 8 | 00:04:57 | 2.89 | 8.4 | 0.0010 |
| 2 | 10 | 00:06:10 | 2.91 | 8.5 | 0.0010 |
| 3 | 12 | 00:07:26 | 2.80 | 7.8 | 0.0010 |
| 3 | 14 | 00:08:39 | 2.65 | 7.0 | 0.0010 |
| 4 | 16 | 00:09:55 | 2.18 | 4.7 | 0.0010 |
| 4 | 18 | 00:11:08 | 2.23 | 5.0 | 0.0010 |
| 4 | 20 | 00:12:21 | 2.32 | 5.4 | 0.0010 |
| 5 | 22 | 00:13:37 | 2.40 | 5.8 | 0.0010 |
| 5 | 24 | 00:14:50 | 2.42 | 5.9 | 0.0010 |
| 6 | 26 | 00:16:06 | 2.53 | 6.4 | 0.0010 |
| 6 | 28 | 00:17:18 | 2.59 | 6.7 | 0.0010 |
| 6 | 30 | 00:18:31 | 2.37 | 5.6 | 0.0010 |
| 7 | 32 | 00:19:47 | 2.29 | 5.2 | 0.0010 |
| 7 | 34 | 00:20:59 | 2.34 | 5.5 | 0.0010 |
| 8 | 36 | 00:22:15 | 2.24 | 5.0 | 0.0010 |
| 8 | 38 | 00:23:28 | 2.69 | 7.2 | 0.0010 |
| 8 | 40 | 00:24:41 | 2.86 | 8.2 | 0.0010 |
| 9 | 42 | 00:25:56 | 1.63 | 2.7 | 0.0010 |
| 9 | 44 | 00:27:09 | 1.71 | 2.9 | 0.0010 |
| 10 | 46 | 00:28:25 | 1.65 | 2.7 | 0.0010 |
| 10 | 48 | 00:29:37 | 1.68 | 2.8 | 0.0010 |
| 10 | 50 | 00:30:50 | 1.65 | 2.7 | 0.0010 |
| 11 | 52 | 00:32:07 | 1.68 | 2.8 | 0.0010 |
| 11 | 54 | 00:33:20 | 1.71 | 2.9 | 0.0010 |
| 12 | 56 | 00:34:35 | 1.65 | 2.7 | 0.0010 |
| 12 | 58 | 00:35:47 | 1.63 | 2.7 | 0.0010 |
| 12 | 60 | 00:36:58 | 1.62 | 2.6 | 0.0010 |
| 13 | 62 | 00:38:13 | 1.70 | 2.9 | 0.0010 |
| 13 | 64 | 00:39:25 | 1.79 | 3.2 | 0.0010 |
| 14 | 66 | 00:40:40 | 1.66 | 2.8 | 0.0010 |
| 14 | 68 | 00:41:52 | 1.66 | 2.7 | 0.0010 |
| 14 | 70 | 00:43:04 | 2.08 | 4.3 | 0.0010 |
| 15 | 72 | 00:44:19 | 4.30 | 18.5 | 0.0010 |
| 15 | 74 | 00:45:30 | 9.76 | 95.2 | 0.0010 |
| 16 | 76 | 00:46:42 | 9.08 | 82.5 | 0.0010 |
| 16 | 78 | 00:47:54 | 8.59 | 73.8 | 0.0010 |
| 16 | 80 | 00:49:05 | 8.25 | 68.1 | 0.0010 |
| 17 | 82 | 00:50:17 | 8.10 | 65.6 | 0.0010 |
| 17 | 84 | 00:51:30 | 7.86 | 61.7 | 0.0010 |
| 18 | 86 | 00:52:41 | 7.09 | 50.2 | 0.0010 |
| 18 | 88 | 00:53:52 | 6.51 | 42.3 | 0.0010 |
| 18 | 90 | 00:55:04 | 6.66 | 44.4 | 0.0010 |
| 19 | 92 | 00:56:16 | 6.70 | 45.0 | 0.0010 |
| 19 | 94 | 00:57:27 | 6.65 | 44.2 | 0.0010 |
| 20 | 96 | 00:58:39 | 6.18 | 38.3 | 0.0010 |
| 20 | 98 | 00:59:50 | 5.88 | 34.6 | 0.0010 |
| 20 | 100 | 01:01:01 | 6.15 | 37.8 | 0.0010 |
| 21 | 102 | 01:02:13 | 5.88 | 34.5 | 0.0010 |
| 21 | 104 | 01:03:25 | 6.09 | 37.0 | 0.0010 |
| 22 | 106 | 01:04:37 | 6.14 | 37.7 | 0.0010 |
| 22 | 108 | 01:05:48 | 5.12 | 26.2 | 0.0010 |
| 22 | 110 | 01:06:59 | 5.99 | 35.9 | 0.0010 |
| 23 | 112 | 01:08:10 | 5.95 | 35.4 | 0.0010 |
| 23 | 114 | 01:09:21 | 6.21 | 38.6 | 0.0010 |
| 24 | 116 | 01:10:33 | 6.07 | 36.9 | 0.0010 |
| 24 | 118 | 01:11:44 | 5.80 | 33.7 | 0.0010 |
| 24 | 120 | 01:12:55 | 6.30 | 39.7 | 0.0010 |
| 25 | 122 | 01:14:07 | 5.90 | 34.9 | 0.0010 |
| 25 | 124 | 01:15:18 | 6.17 | 38.0 | 0.0010 |
| 26 | 126 | 01:16:31 | 5.85 | 34.2 | 0.0010 |
| 26 | 128 | 01:17:42 | 5.53 | 30.6 | 0.0010 |
| 26 | 130 | 01:18:53 | 5.91 | 35.0 | 0.0010 |
| 27 | 132 | 01:20:05 | 5.88 | 34.6 | 0.0010 |
| 27 | 134 | 01:21:16 | 6.14 | 37.8 | 0.0010 |
| 28 | 136 | 01:22:28 | 6.03 | 36.4 | 0.0010 |
| 28 | 138 | 01:23:40 | 5.26 | 27.6 | 0.0010 |
| 28 | 140 | 01:24:53 | 5.90 | 34.8 | 0.0010 |
| 29 | 142 | 01:26:04 | 5.86 | 34.3 | 0.0010 |
| 29 | 144 | 01:27:16 | 6.14 | 37.7 | 0.0010 |
| 30 | 146 | 01:28:28 | 5.60 | 31.3 | 0.0010 |
| 30 | 148 | 01:29:40 | 5.76 | 33.2 | 0.0010 |
| 30 | 150 | 01:30:52 | 5.89 | 34.7 | 0.0010 |
|========================================================================================|
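
To pin down where the jump starts, the logged loss values can be checked against the lowest loss seen so far. The following is only a minimal sketch: the values are copied from iterations 60 to 80 of the log above, and `jumpFactor` is a hypothetical threshold chosen for illustration, not a recommended setting. In a real run the same check could presumably be applied to the loss history in the second output of trainYOLOv2ObjectDetector (assumed here to be available as `info.TrainingLoss`).

```matlab
% Mini-batch loss values copied from the log above (iterations 60 to 80,
% every second iteration, as printed with 'VerboseFrequency' set to 2).
iteration = 60:2:80;
loss = [2.6 2.9 3.2 2.8 2.7 4.3 18.5 95.2 82.5 73.8 68.1];

% Hypothetical threshold: flag a jump when the loss exceeds this multiple
% of the lowest loss seen so far.
jumpFactor = 3;

% Running minimum of the loss up to and including each logged iteration.
runningMin = cummin(loss);

% Logical mask of logged iterations whose loss exceeds the threshold.
isJump = loss > jumpFactor * runningMin;
firstJump = find(isJump, 1, 'first');

if isempty(firstJump)
    disp('No loss jump detected.');
else
    fprintf('Loss jump first flagged at iteration %d (loss %.1f).\n', ...
        iteration(firstJump), loss(firstJump));
end
```

Comparing the flagged iteration with the timestamps of the checkpoint files may help identify the last checkpoint saved before the divergence.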

Answers (2)

ping.jiang on 13 Jun 2019
So, what is your question?

Zahra Moayed on 5 Aug 2019
I had the same issue, but when I switched to [224 224 3], which is the input size of ResNet-50, and resized the anchor boxes to match, it finally worked. However, it only worked with a single class.
I also used a MiniBatchSize of 16 and 'Shuffle' set to 'every-epoch', but the main change was the input size.
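
The answer does not say how the anchor boxes were resized. One possible way, shown here only as a sketch, is to scale them by the ratio of the new input size to the old one. It assumes that each anchor row in the question is [height width] in pixels of the original [450 800] input; `oldSize`, `newSize`, `scale`, and `scaledAnchors` are names introduced for illustration.

```matlab
% Original network input size and anchor boxes from the question.
oldSize = [450 800];   % [height width]
newSize = [224 224];   % [height width], the ResNet-50 input size
anchorBoxes = [
    76 43
    208 147
    103 68
    158 106
    198 137
    129 81
    73 40
    ];

% Assumption: each anchor row is [height width] in pixels of the old size.
% Per-dimension scale factors from the old input size to the new one.
scale = newSize ./ oldSize;

% Scale the height and width columns, round to whole pixels, and keep
% every dimension at least 1 pixel.
scaledAnchors = max(round(anchorBoxes .* scale), 1);

disp(scaledAnchors);
```

Because the two scale factors differ, the aspect ratio of the boxes changes. The training images and their ground-truth bounding boxes would need to be resized by the same factors for the scaled anchors to stay meaningful.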

Release

R2019a