
Speech Command Recognition by Using FPGA

This example shows how to deploy a custom pretrained series network that detects the presence of speech commands in audio to a Xilinx™ Zynq® UltraScale+™ MPSoC ZCU102 Evaluation Kit. This example uses the pretrained network that was trained by using the Speech Commands Dataset [1]. To create the pretrained network, see Speech Command Recognition Using Deep Learning.

Prerequisites

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

  • Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC

  • Audio Toolbox™

  • Xilinx™ Zynq® UltraScale+™ MPSoC ZCU102 Evaluation Kit

Load Speech Commands Data Set

This example uses the Google Speech Commands Dataset [1]. Download the data set and unzip the downloaded file. Set dataFolder to the location of the data.

url = 'https://ssd.mathworks.com/supportfiles/audio/google_speech.zip';
downloadFolder = tempdir;
dataFolder = fullfile(downloadFolder,'google_speech');

if ~exist(dataFolder,'dir')
    disp('Downloading data set (1.4 GB) ...')
    unzip(url,downloadFolder)
end

Load Pretrained Speech Recognition Network

The pretrained network trainedAudioNet is a simple series network made up of 24 layers. The network uses max pooling layers to downsample the feature maps "spatially" (that is, in time and frequency) and a final max pooling layer that pools the input feature map globally over time. This enforces (approximate) time-translation invariance in the input spectrograms, allowing the network to perform the same classification independent of the exact position of the speech in time. Global pooling also significantly reduces the number of parameters in the final fully connected layer. To reduce the possibility of the network memorizing specific features of the training data, the network applies a small amount of dropout to the input of the last fully connected layer.

The network is small, as it has only five convolutional layers with few filters. numF controls the number of filters in the convolutional layers. To increase the accuracy of the network, try increasing the network depth by adding identical blocks of convolutional, batch normalization, and ReLU layers. You can also try increasing the number of convolutional filters by increasing numF.
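Because the network is loaded from trainedAudioNet.mat, its layer definitions do not appear in this example. As a reference only, the following sketch reconstructs the architecture from the layer list that the compile step prints later in this example (the exact training code is in the Speech Command Recognition Using Deep Learning example); classWeights is the class-weight vector described in the next paragraph:

numF = 12;  % number of filters in the first convolutional layer
layers = [
    imageInputLayer([98 50 1])

    convolution2dLayer(3,numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,'Stride',2,'Padding','same')

    convolution2dLayer(3,2*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,'Stride',2,'Padding','same')

    convolution2dLayer(3,4*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,'Stride',2,'Padding','same')

    convolution2dLayer(3,4*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer

    convolution2dLayer(3,4*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer([13 1])   % pool globally over the time dimension

    dropoutLayer(0.2)
    fullyConnectedLayer(12)     % 10 commands + unknown + background
    softmaxLayer
    weightedClassificationLayer(classWeights)];  % custom layer, see next paragraph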

The network uses a weighted cross-entropy classification loss. The weightedClassificationLayer(classWeights) function creates a custom classification layer that calculates the cross-entropy loss with observations weighted by classWeights. Specify the class weights in the same order as the classes appear in categories(YTrain). To give each class equal total weight in the loss, use class weights that are inversely proportional to the number of training examples in each class. When you use the Adam optimizer to train the network, the training algorithm is independent of the overall normalization of the class weights. Load the pretrained network trainedAudioNet.

load('trainedAudioNet.mat');
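
The class weights are already baked into the loaded network, so you do not need them for deployment. As a hedged illustration of the inverse-frequency weighting described above (YTrain is the categorical label vector that this example creates later, in the Compute Auditory Spectrograms section), the weights could be computed like this:

classCounts = countcats(YTrain);                 % examples per class, in categories(YTrain) order
classWeights = 1./classCounts;                   % inversely proportional to class frequency
classWeights = classWeights/mean(classWeights);  % overall scale does not matter with Adam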

Create Training and Validation Datastores

Use audioDatastore (Audio Toolbox) to create datastores that point to the training and validation data sets.

ads = audioDatastore(fullfile(dataFolder, 'train'), ...
    'IncludeSubfolders',true, ...
    'FileExtensions','.wav', ...
    'LabelSource','foldernames')
ads = 
  audioDatastore with properties:

                       Files: {
                              ' ...\AppData\Local\Temp\google_speech\train\bed\00176480_nohash_0.wav';
                              ' ...\AppData\Local\Temp\google_speech\train\bed\004ae714_nohash_0.wav';
                              ' ...\AppData\Local\Temp\google_speech\train\bed\004ae714_nohash_1.wav'
                               ... and 40393 more
                              }
                     Folders: {
                              'C:\Users\skapali\AppData\Local\Temp\google_speech\train'
                              }
                      Labels: [bed; bed; bed ... and 40393 more categorical]
    AlternateFileSystemRoots: {}
              OutputDataType: 'double'
      SupportedOutputFormats: ["wav"    "flac"    "ogg"    "mp4"    "m4a"]
         DefaultOutputFormat: "wav"

Specify the words that you want your model to recognize as commands. Label words that are not commands as unknown. Labeling words that are not commands as unknown creates a group of words that approximates the distribution of all words other than the commands. The network uses this group to learn the difference between commands and all other words.

To reduce the class imbalance between the known and unknown words and speed up processing, include only a fraction of the unknown words in the training set.

To create a datastore that contains only the commands and the subset of unknown words, use subset (Audio Toolbox). Count the number of examples belonging to each category.

commands = categorical(["yes","no","up","down","left","right","on","off","stop","go"]);

isCommand = ismember(ads.Labels,commands);
isUnknown = ~isCommand;

includeFraction = 0.2;
mask = rand(numel(ads.Labels),1) < includeFraction;
isUnknown = isUnknown & mask;
ads.Labels(isUnknown) = categorical("unknown");

adsTrain = subset(ads,isCommand|isUnknown);
countEachLabel(adsTrain)
ans=11×2 table
     Label     Count
    _______    _____

    down       1842 
    go         1861 
    left       1839 
    no         1853 
    off        1839 
    on         1864 
    right      1852 
    stop       1885 
    unknown    4390 
    up         1843 
    yes        1860 

ads = audioDatastore(fullfile(dataFolder, 'validation'), ...
    'IncludeSubfolders',true, ...
    'FileExtensions','.wav', ...
    'LabelSource','foldernames')
ads = 
  audioDatastore with properties:

                       Files: {
                              ' ...\AppData\Local\Temp\google_speech\validation\bed\026290a7_nohash_0.wav';
                              ' ...\AppData\Local\Temp\google_speech\validation\bed\060cd039_nohash_0.wav';
                              ' ...\AppData\Local\Temp\google_speech\validation\bed\060cd039_nohash_1.wav'
                               ... and 5457 more
                              }
                     Folders: {
                              'C:\Users\skapali\AppData\Local\Temp\google_speech\validation'
                              }
                      Labels: [bed; bed; bed ... and 5457 more categorical]
    AlternateFileSystemRoots: {}
              OutputDataType: 'double'
      SupportedOutputFormats: ["wav"    "flac"    "ogg"    "mp4"    "m4a"]
         DefaultOutputFormat: "wav"

isCommand = ismember(ads.Labels,commands);
isUnknown = ~isCommand;

includeFraction = 0.2;
mask = rand(numel(ads.Labels),1) < includeFraction;
isUnknown = isUnknown & mask;
ads.Labels(isUnknown) = categorical("unknown");

adsValidation = subset(ads,isCommand|isUnknown);
countEachLabel(adsValidation)
ans=11×2 table
     Label     Count
    _______    _____

    down        264 
    go          260 
    left        247 
    no          270 
    off         256 
    on          257 
    right       256 
    stop        246 
    unknown     618 
    up          260 
    yes         261 

To use the entire data set and achieve the highest possible accuracy, set reduceDataset to false. To run this example quickly, set reduceDataset to true.

reduceDataset = false;
if reduceDataset
    numUniqueLabels = numel(unique(adsTrain.Labels));
    % Reduce the dataset by a factor of 20
    adsTrain = splitEachLabel(adsTrain,round(numel(adsTrain.Files) / numUniqueLabels / 20));
    adsValidation = splitEachLabel(adsValidation,round(numel(adsValidation.Files) / numUniqueLabels / 20));
end

Compute Auditory Spectrograms

To prepare the data for efficient training of a convolutional neural network, convert the speech waveforms to auditory-based spectrograms.

Define the parameters of the feature extraction. The segmentDuration variable is the duration of each speech clip (in seconds). The frameDuration variable is the duration of each frame for spectrum calculation. The hopDuration variable is the time step between each spectrum. numBands is the number of filters in the auditory spectrogram.

To perform the feature extraction, create an audioFeatureExtractor (Audio Toolbox) object.

fs = 16e3; % Known sample rate of the data set.

segmentDuration = 1;
frameDuration = 0.025;
hopDuration = 0.010;

segmentSamples = round(segmentDuration*fs);
frameSamples = round(frameDuration*fs);
hopSamples = round(hopDuration*fs);
overlapSamples = frameSamples - hopSamples;

FFTLength = 512;
numBands = 50;

afe = audioFeatureExtractor( ...
    'SampleRate',fs, ...
    'FFTLength',FFTLength, ...
    'Window',hann(frameSamples,'periodic'), ...
    'OverlapLength',overlapSamples, ...
    'barkSpectrum',true);
setExtractorParams(afe,'barkSpectrum','NumBands',numBands,'WindowNormalization',false);

Read a file from the dataset. Training a convolutional neural network requires input to be a consistent size. Some files in the data set are less than 1 second long. Apply zero-padding to the front and back of the audio signal so that it is of length segmentSamples.

x = read(adsTrain);

numSamples = size(x,1);

numToPadFront = floor( (segmentSamples - numSamples)/2 );
numToPadBack = ceil( (segmentSamples - numSamples)/2 );

xPadded = [zeros(numToPadFront,1,'like',x);x;zeros(numToPadBack,1,'like',x)];

To extract audio features, call extract. The output is a Bark spectrum with time across rows.

features = extract(afe,xPadded);
[numHops,numFeatures] = size(features)
numHops = 98
numFeatures = 50
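
For reference, these dimensions follow from the framing parameters: assuming the standard framing formula, a 1-second (16,000-sample) clip with 400-sample frames and 160-sample hops yields floor((16000 - 400)/160) + 1 = 98 frames, and each frame produces numBands = 50 Bark-spectrum values.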

In this example, you post-process the auditory spectrogram by applying a logarithm. Taking a log of small numbers can lead to roundoff error. To avoid this, add a small offset to the spectrogram values before taking the logarithm, as shown later in this example.

To speed up processing, you can distribute the feature extraction across multiple workers by using parfor.

First, determine the number of partitions for the dataset. If you do not have Parallel Computing Toolbox™, use a single partition.

if ~isempty(ver('parallel')) && ~reduceDataset
    pool = gcp;
    numPar = numpartitions(adsTrain,pool);
else
    numPar = 1;
end
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 6).

For each partition, read from the datastore, zero-pad the signal, and then extract the features.

parfor ii = 1:numPar
    subds = partition(adsTrain,numPar,ii);
    XTrain = zeros(numHops,numBands,1,numel(subds.Files));
    for idx = 1:numel(subds.Files)
        x = read(subds);
        xPadded = [zeros(floor((segmentSamples-size(x,1))/2),1);x;zeros(ceil((segmentSamples-size(x,1))/2),1)];
        XTrain(:,:,:,idx) = extract(afe,xPadded);
    end
    XTrainC{ii} = XTrain;
end

Convert the output to a four-dimensional array that has auditory spectrograms along the fourth dimension.

XTrain = cat(4,XTrainC{:});

[numHops,numBands,numChannels,numSpec] = size(XTrain)
numHops = 98
numBands = 50
numChannels = 1
numSpec = 22928

To obtain data that has a smoother distribution, take the logarithm of the spectrograms by using a small offset.

epsil = 1e-6;
XTrain = log10(XTrain + epsil);

Perform the feature extraction steps described above for the validation set.

if ~isempty(ver('parallel'))
    pool = gcp;
    numPar = numpartitions(adsValidation,pool);
else
    numPar = 1;
end
parfor ii = 1:numPar
    subds = partition(adsValidation,numPar,ii);
    XValidation = zeros(numHops,numBands,1,numel(subds.Files));
    for idx = 1:numel(subds.Files)
        x = read(subds);
        xPadded = [zeros(floor((segmentSamples-size(x,1))/2),1);x;zeros(ceil((segmentSamples-size(x,1))/2),1)];
        XValidation(:,:,:,idx) = extract(afe,xPadded);
    end
    XValidationC{ii} = XValidation;
end
XValidation = cat(4,XValidationC{:});
XValidation = log10(XValidation + epsil);

Isolate the training and validation labels. Remove empty categories.

YTrain = removecats(adsTrain.Labels);
YValidation = removecats(adsValidation.Labels);

Add Background Noise Data

The network must be able to recognize different spoken words and also to detect if the input contains silence or background noise.

To create one-second clips of background noise, use the audio files in the background folder. Create an equal number of background clips from each background noise file. You can also create your own recordings of background noise and add them to the background folder. Before calculating the spectrograms, the code rescales each audio clip by a factor sampled from a log-uniform distribution in the range given by volumeRange.

adsBkg = audioDatastore(fullfile(dataFolder, 'background'))
adsBkg = 
  audioDatastore with properties:

                       Files: {
                              ' ...\AppData\Local\Temp\google_speech\background\doing_the_dishes.wav';
                              ' ...\skapali\AppData\Local\Temp\google_speech\background\dude_miaowing.wav';
                              ' ...\skapali\AppData\Local\Temp\google_speech\background\exercise_bike.wav'
                               ... and 3 more
                              }
                     Folders: {
                              'C:\Users\skapali\AppData\Local\Temp\google_speech\background'
                              }
    AlternateFileSystemRoots: {}
              OutputDataType: 'double'
                      Labels: {}
      SupportedOutputFormats: ["wav"    "flac"    "ogg"    "mp4"    "m4a"]
         DefaultOutputFormat: "wav"

numBkgClips = 4000;
if reduceDataset
    numBkgClips = numBkgClips/20;
end
volumeRange = log10([1e-4,1]);

numBkgFiles = numel(adsBkg.Files);
numClipsPerFile = histcounts(1:numBkgClips,linspace(1,numBkgClips,numBkgFiles+1));
Xbkg = zeros(size(XTrain,1),size(XTrain,2),1,numBkgClips,'single');
bkgAll = readall(adsBkg);
ind = 1;

for count = 1:numBkgFiles
    bkg = bkgAll{count};
    idxStart = randi(numel(bkg)-fs,numClipsPerFile(count),1);
    idxEnd = idxStart+fs-1;
    gain = 10.^((volumeRange(2)-volumeRange(1))*rand(numClipsPerFile(count),1) + volumeRange(1));
    for j = 1:numClipsPerFile(count)

        x = bkg(idxStart(j):idxEnd(j))*gain(j);

        x = max(min(x,1),-1);

        Xbkg(:,:,:,ind) = extract(afe,x);

        if mod(ind,1000)==0
            disp("Processed " + string(ind) + " background clips out of " + string(numBkgClips))
        end
        ind = ind + 1;
    end
end
Processed 1000 background clips out of 4000
Processed 2000 background clips out of 4000
Processed 3000 background clips out of 4000
Processed 4000 background clips out of 4000
Xbkg = log10(Xbkg + epsil);

Split the spectrograms of background noise between the training and validation sets. Because the background folder contains only about five and a half minutes of background noise, the background samples in the two sets are highly correlated. To increase the variation in the background noise, you can create your own background files and add them to the folder. To increase the robustness of the network to noise, you can also try mixing background noise into the speech files (see the sketch after the following code).

numTrainBkg = floor(0.85*numBkgClips);
numValidationBkg = floor(0.15*numBkgClips);

XTrain(:,:,:,end+1:end+numTrainBkg) = Xbkg(:,:,:,1:numTrainBkg);
YTrain(end+1:end+numTrainBkg) = "background";

XValidation(:,:,:,end+1:end+numValidationBkg) = Xbkg(:,:,:,numTrainBkg+1:end);
YValidation(end+1:end+numValidationBkg) = "background";
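
The following is a minimal sketch of the noise-mixing suggestion above. It is an assumption, not part of the original example; it reuses bkg, idxStart, fs, xPadded, afe, and epsil from the earlier code, and the 0.1 mixing gain is arbitrary:

noiseSegment = bkg(idxStart(1):idxStart(1)+fs-1);          % one-second background segment
mixedClip = max(min(xPadded + 0.1*noiseSegment,1),-1);     % mix and clip to [-1, 1]
mixedSpectrogram = log10(extract(afe,mixedClip) + epsil);  % extract features as before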

Create Target Object

Create a target object for your target device that has a vendor name and an interface to connect your target device to the host computer. Interface options are JTAG (default) and Ethernet. Vendor options are Intel or Xilinx. Use the installed Xilinx Vivado Design Suite over an Ethernet connection to program the device.

hT = dlhdl.Target('Xilinx', Interface = 'Ethernet');
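
This example uses Ethernet. If you want to use the default JTAG interface instead, a minimal alternative is:

% Alternative (not used in the rest of this example): connect over JTAG,
% which is the default interface.
% hT = dlhdl.Target('Xilinx');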

Create Workflow Object

Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify the saved pretrained series network trainedAudioNet as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Zynq UltraScale+ MPSoC ZCU102 board. The bitstream uses a single data type.

hW = dlhdl.Workflow(Network = trainedNet, Bitstream = 'zcu102_single', Target = hT);

Compile trainedAudioNet Network

To compile the trainedAudioNet series network, run the compile function of the dlhdl.Workflow object.

compile(hW)
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### The network includes the following layers:
     1   'imageinput'    Image Input             98×50×1 images with 'zerocenter' normalization                (SW Layer)
     2   'conv_1'        Convolution             12 3×3×1 convolutions with stride [1  1] and padding 'same'   (HW Layer)
     3   'batchnorm_1'   Batch Normalization     Batch normalization with 12 channels                          (HW Layer)
     4   'relu_1'        ReLU                    ReLU                                                          (HW Layer)
     5   'maxpool_1'     Max Pooling             3×3 max pooling with stride [2  2] and padding 'same'         (HW Layer)
     6   'conv_2'        Convolution             24 3×3×12 convolutions with stride [1  1] and padding 'same'  (HW Layer)
     7   'batchnorm_2'   Batch Normalization     Batch normalization with 24 channels                          (HW Layer)
     8   'relu_2'        ReLU                    ReLU                                                          (HW Layer)
     9   'maxpool_2'     Max Pooling             3×3 max pooling with stride [2  2] and padding 'same'         (HW Layer)
    10   'conv_3'        Convolution             48 3×3×24 convolutions with stride [1  1] and padding 'same'  (HW Layer)
    11   'batchnorm_3'   Batch Normalization     Batch normalization with 48 channels                          (HW Layer)
    12   'relu_3'        ReLU                    ReLU                                                          (HW Layer)
    13   'maxpool_3'     Max Pooling             3×3 max pooling with stride [2  2] and padding 'same'         (HW Layer)
    14   'conv_4'        Convolution             48 3×3×48 convolutions with stride [1  1] and padding 'same'  (HW Layer)
    15   'batchnorm_4'   Batch Normalization     Batch normalization with 48 channels                          (HW Layer)
    16   'relu_4'        ReLU                    ReLU                                                          (HW Layer)
    17   'conv_5'        Convolution             48 3×3×48 convolutions with stride [1  1] and padding 'same'  (HW Layer)
    18   'batchnorm_5'   Batch Normalization     Batch normalization with 48 channels                          (HW Layer)
    19   'relu_5'        ReLU                    ReLU                                                          (HW Layer)
    20   'maxpool_4'     Max Pooling             13×1 max pooling with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    21   'dropout'       Dropout                 20% dropout                                                   (HW Layer)
    22   'fc'            Fully Connected         12 fully connected layer                                      (HW Layer)
    23   'softmax'       Softmax                 softmax                                                       (HW Layer)
    24   'classoutput'   Classification Output   Weighted cross entropy                                        (SW Layer)
                                                                                                             
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'imageinput' of type 'ImageInputLayer' is split into 'imageinput' and 'imageinput_norm'.
### Notice: The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'weightedClassificationLayer' is implemented in software.
### Compiling layer group: conv_1>>relu_5 ...
### Compiling layer group: conv_1>>relu_5 ... complete.
### Compiling layer group: maxpool_4 ...
### Compiling layer group: maxpool_4 ... complete.
### Compiling layer group: fc ...
### Compiling layer group: fc ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "4.0 MB"        
    "OutputResultOffset"        "0x00400000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x00800000"     "4.0 MB"        
    "SystemBufferOffset"        "0x00c00000"     "28.0 MB"       
    "InstructionDataOffset"     "0x02800000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x02c00000"     "4.0 MB"        
    "FCWeightDataOffset"        "0x03000000"     "4.0 MB"        
    "EndOffset"                 "0x03400000"     "Total: 52.0 MB"

### Network compilation complete.
ans = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
        constantData: {{}  [4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 4.4809 0 0 0 … ]}

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Zynq® UltraScale+™ MPSoC ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. The function also downloads the network weights and biases. The deploy function verifies the Xilinx Vivado tool and the supported tool version. It then starts programming the FPGA device by using the bitstream, displays progress messages, and reports the time it takes to deploy the network.

deploy(hW)
### Programming FPGA Bitstream using Ethernet...
Downloading target FPGA device configuration over Ethernet to SD card ...
# Copied /tmp/hdlcoder_rd to /mnt/hdlcoder_rd
# Copying Bitstream hdlcoder_system.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/hdlcoder_system.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'

Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot for persistent changes to take effect.


System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 11-Nov-2021 15:15:18
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 11-Nov-2021 15:15:18

Run Prediction on Audio Files

Classify five inputs from the validation data set and compare the prediction results to the classification results from Deep Learning Toolbox™. YPred is the classification result from Deep Learning Toolbox™. The fpga_prediction variable is the classification result from the FPGA.

numtestFrames = size(XValidation,4);
numView = 5;
listIndex = randperm(numtestFrames,numView);
testDataBatch = XValidation(:,:,:,listIndex);
YPred = classify(trainedNet,testDataBatch);
[scores,speed] = predict(hW,testDataBatch, Profile ='on');
### Finished writing input activations.
### Running in multi-frame mode with 5 inputs.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     353130                  0.00161                       5            1573112            699.3
    imageinput_norm          52668                  0.00024 
    conv_1                   21136                  0.00010 
    maxpool_1                47686                  0.00022 
    conv_2                   37475                  0.00017 
    maxpool_2                45278                  0.00021 
    conv_3                   21260                  0.00010 
    maxpool_3                38857                  0.00018 
    conv_4                   16171                  0.00007 
    conv_5                   27011                  0.00012 
    maxpool_4                27632                  0.00013 
    fc                       17923                  0.00008 
 * The clock frequency of the DL processor is: 220MHz
[~,idx] = max(scores,[],2);
fpga_prediction = trainedNet.Layers(end).Classes(idx);

Compare the prediction results from Deep Learning Toolbox™ and the FPGA side by side. The prediction results from the FPGA match the prediction results from Deep Learning Toolbox™. In this table, the ground truth prediction is the Deep Learning Toolbox™ prediction.

fprintf('%12s %24s\n','Ground Truth','FPGA Prediction');
for i = 1:size(fpga_prediction,1)
    fprintf('%s %24s\n',YPred(i),fpga_prediction(i));
end
Ground Truth          FPGA Prediction
no                       no
unknown                  unknown
yes                      yes
no                       no
yes                      yes

References

[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017. Available at http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/licenses/by/4.0/legalcode.

