Create U-Net layers for semantic segmentation


lgraph = unetLayers(imageSize,numClasses)
lgraph = unetLayers(imageSize,numClasses,Name,Value)



lgraph = unetLayers(imageSize,numClasses) returns a U-Net network. unetLayers includes a pixelClassificationLayer to predict the categorical label for every pixel in an input image.

Use unetLayers to create the network architecture for U-Net. You must train the network using the Deep Learning Toolbox™ function trainNetwork.

lgraph = unetLayers(imageSize,numClasses,Name,Value) specifies options using one or more name-value pairs. Enclose each property name in quotes. For example, unetLayers(imageSize,numClasses,'NumOutputChannels',64) additionally sets the number of output channels to 64 for the first encoder subsection.



Create U-Net layers with an encoder/decoder depth of 3.

imageSize = [480 640 3];
numClasses = 5;
encoderDepth = 3;
lgraph = unetLayers(imageSize,numClasses,'EncoderDepth',encoderDepth)
lgraph = 
  LayerGraph with properties:

         Layers: [46x1 nnet.cnn.layer.Layer]
    Connections: [48x2 table]

Display the network.

plot(lgraph)


Load training images and pixel labels.

dataSetDir = fullfile(toolboxdir('vision'),'visiondata','triangleImages');
imageDir = fullfile(dataSetDir,'trainingImages');
labelDir = fullfile(dataSetDir,'trainingLabels');

Create an imageDatastore holding the training images.

imds = imageDatastore(imageDir);

Define the class names and their associated label IDs.

classNames = ["triangle","background"];
labelIDs   = [255 0];

Create a pixelLabelDatastore holding the ground truth pixel labels for the training images.

pxds = pixelLabelDatastore(labelDir,classNames,labelIDs);
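As an aside, the mapping from label IDs to class names can be illustrated with base MATLAB alone. This is only a sketch of the idea, not a description of how pixelLabelDatastore works internally; labelImage is a hypothetical 2-by-3 label image made up for the illustration.

```matlab
% Sketch: map numeric pixel label IDs to class names.
classNames = ["triangle","background"];
labelIDs   = [255 0];

% Hypothetical label image: 255 marks triangle pixels, 0 marks background.
labelImage = [255 0 0; 0 255 0];

% Each pixel value is matched against labelIDs and given the class name
% at the same position in classNames.
C = categorical(labelImage, labelIDs, classNames);

% Number of pixels per class, in the order given by classNames.
pixelCounts = countcats(C(:));
```

Here every pixel with value 255 becomes the category triangle and every pixel with value 0 becomes background.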

Create U-Net.

imageSize = [32 32];
numClasses = 2;
lgraph = unetLayers(imageSize, numClasses)
lgraph = 
  LayerGraph with properties:

         Layers: [58×1 nnet.cnn.layer.Layer]
    Connections: [61×2 table]

Create data source for training a semantic segmentation network.

ds = pixelLabelImageDatastore(imds,pxds);

Set up training options.

options = trainingOptions('sgdm','InitialLearnRate',1e-3, ...
    'MaxEpochs',20,'VerboseFrequency',10);

Train the network.

net = trainNetwork(ds,lgraph,options)
Training on single CPU.
Initializing image normalization.
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |   Accuracy   |     Loss     |      Rate       |
|       1 |           1 |       00:00:04 |        5.21% |      15.1044 |          0.0010 |
|      10 |          10 |       00:00:43 |       96.09% |       0.4845 |          0.0010 |
|      20 |          20 |       00:01:25 |       94.38% |       0.7715 |          0.0010 |
net = 
  DAGNetwork with properties:

         Layers: [58×1 nnet.cnn.layer.Layer]
    Connections: [61×2 table]

Input Arguments


Network input image size, specified as a:

  • 2-element vector in the format [height, width].

  • 3-element vector in the format [height, width, depth]. depth is the number of image channels. Set depth to 3 for RGB images, 1 for grayscale images, or to the number of channels for multispectral and hyperspectral images.


Each encoder section has a 2x2 maxPooling2dLayer that halves the image size. The height and width of the input image must be a multiple of 2^D, where D is the value of EncoderDepth.
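The size constraint can be checked before building the network. The following is a minimal sketch in base MATLAB; the image size and depth are illustrative values, and compatibleSize is a hypothetical variable name for the nearest larger size that satisfies the constraint.

```matlab
% Sketch: check whether an image size is compatible with an encoder depth.
imageSize = [500 650 3];      % illustrative size, chosen to fail the check
encoderDepth = 4;             % D

% Height and width must each be a multiple of 2^D.
factor = 2^encoderDepth;
isValid = all(mod(imageSize(1:2), factor) == 0);

% Smallest compatible height and width that are at least as large as the
% input, for example as a target when padding or resizing images.
compatibleSize = ceil(imageSize(1:2) / factor) * factor;
```

With a depth of 4 the factor is 16, so a 500-by-650 image fails the check and the nearest larger compatible size is 512-by-656.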

Number of classes in the semantic segmentation, specified as an integer greater than 1.

Name-Value Pair Arguments

Example: 'EncoderDepth',3

Encoder depth, specified as a positive integer. U-Net is composed of an encoder and corresponding decoder subnetwork. The depth of these networks determines the number of times the input image is downsampled or upsampled as it is processed. The encoder network downsamples the input image by a factor of 2^D, where D is the value of EncoderDepth. The decoder network upsamples the encoder network output by a factor of 2^D.
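To make the scaling concrete, the spatial size after each encoder section can be tabulated. This is a sketch under the assumption that each section exactly halves the height and width; the variable names are illustrative.

```matlab
% Sketch: spatial size after each encoder section for a given depth.
inputSize = [480 640];
encoderDepth = 3;                         % D

% Cumulative downsampling factors 1, 2, 4, ..., 2^D as a column vector.
scale = (2 .^ (0:encoderDepth))';

% Row k+1 holds the [height width] after k encoder sections.
stageSizes = inputSize ./ scale;          % implicit expansion (R2016b or later)

% The encoder output is smaller by 2^D; the decoder restores the input size.
encoderOutputSize = stageSizes(end, :);
decoderOutputSize = encoderOutputSize * 2^encoderDepth;
```

For a 480-by-640 input and a depth of 3, the encoder output is 60-by-80 and the decoder output is again 480-by-640.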

Number of output channels for the first subsection in the U-Net encoder network, specified as a positive integer or vector of positive integers. Each subsequent encoder subsection doubles the number of output channels. unetLayers sets the number of output channels in the decoder sections to match the corresponding encoder sections.
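The doubling rule can be written out for the scalar case. This is a sketch only; it assumes a scalar first-section channel count and lists the decoder sections from deepest to shallowest, and the variable names are illustrative.

```matlab
% Sketch: output channels per encoder section when each section doubles
% the channel count of the previous one.
numOutputChannels = 64;    % channels in the first encoder section
encoderDepth = 4;          % number of encoder sections

encoderChannels = numOutputChannels * 2 .^ (0:encoderDepth-1);

% Decoder sections mirror the encoder, listed here from deepest to shallowest.
decoderChannels = fliplr(encoderChannels);
```

With 64 channels in the first section and a depth of 4, the encoder sections have 64, 128, 256, and 512 output channels.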

Convolutional layer filter size, specified as a positive odd integer or a 2-element row vector of positive odd integers. Typical values are in the range [3, 7].

  • Scalar: the filter is square.

  • 2-element row vector: the filter has the size [height width].

Output Arguments


Layers that represent the U-Net network architecture, returned as a layerGraph object.


  • The sections within the U-Net encoder subnetworks consist of two sets of convolutional and ReLU layers, followed by a 2x2 max pooling layer. The decoder subnetworks consist of a transposed convolution layer for upsampling, followed by two sets of convolutional and ReLU layers.

  • Convolutional layers in unetLayers use 'same' padding, which retains the data size from input to output and enables a broad set of input image sizes. The original version by Ronneberger et al. [1] does not use padding and is constrained to a smaller set of input image sizes.

  • The bias term of all convolutional layers is initialized to zero.

  • Convolution layer weights in the encoder and decoder subnetworks are initialized using the 'He' weight initialization method [2].

  • Networks produced by unetLayers support GPU code generation for deep learning once they are trained with trainNetwork. See Deep Learning Code Generation (Deep Learning Toolbox) for details and examples.
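The effect of padding on the allowed image sizes can be traced with simple arithmetic. The sketch below follows the section structure described above (two convolutions per section, 2x2 pooling in the encoder, 2x upsampling in the decoder) and additionally assumes a two-convolution bridge between encoder and decoder, as in the architecture of [1]. It is an illustration of the size bookkeeping, not a description of the layers that unetLayers creates; all variable names are hypothetical.

```matlab
% Sketch: trace the spatial size of a square image through a depth-4
% U-Net-style network, once without padding and once with 'same' padding.
D = 4;
inputs      = [572 576];   % input side length for each case
lossPerConv = [2 0];       % pixels removed per 3x3 convolution:
                           % 2 without padding, 0 with 'same' padding
outputs = zeros(1, 2);

for c = 1:2
    s = inputs(c);
    for k = 1:D
        s = s - 2*lossPerConv(c);   % two convolutions in the encoder section
        s = s / 2;                  % 2x2 max pooling halves the size
    end
    s = s - 2*lossPerConv(c);       % assumed bridge with two convolutions
    for k = 1:D
        s = 2*s;                    % transposed convolution doubles the size
        s = s - 2*lossPerConv(c);   % two convolutions in the decoder section
    end
    outputs(c) = s;
end
```

Under these assumptions, the unpadded network shrinks a 572-pixel input to 388 pixels, while the 'same'-padded network returns a 576-pixel input at its original size. Without padding, an input is usable only if every intermediate size before pooling is even, which is why the unpadded design accepts fewer input sizes.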


[1] Ronneberger, O., P. Fischer, and T. Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." Medical Image Computing and Computer-Assisted Intervention (MICCAI). Vol. 9351, 2015, pp. 234–241.

[2] He, K., X. Zhang, S. Ren, and J. Sun. "Delving Deep Into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1026–1034.

Introduced in R2018b