Voice Activity Detection in Noise Using Deep Learning

In this example, you perform batch and streaming voice activity detection (VAD) in a low SNR environment using a pretrained deep learning model. For details about the model and how it was trained, see Train Voice Activity Detection in Noise Model Using Deep Learning.

Load and Inspect Data

Read in an audio file that consists of words spoken with pauses in between. Use resample to convert the signal to a 16 kHz sample rate, normalize it so that its peak absolute value is 1, and listen to it. Use detectSpeech on the clean signal to determine the ground-truth speech regions.

fs = 16e3;
[speech,fileFs] = audioread("Counting-16-44p1-mono-15secs.wav");
speech = resample(speech,fs,fileFs);
speech = speech./max(abs(speech));

sound(speech,fs)

detectSpeech(speech,fs,Window=hamming(0.04*fs,"periodic"),MergeDistance=round(0.5*fs))
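
When detectSpeech is called with an output argument, it returns the speech regions as an N-by-2 matrix of start and end sample indices rather than plotting them. The following minimal sketch shows one way to turn such a matrix into a per-sample mask. The index values and signal length are hypothetical and chosen only for illustration.

```matlab
% Hypothetical region boundaries (start and end sample indices), in the
% N-by-2 layout that detectSpeech returns when called with an output.
idx = [3 5; 9 10];
numSamples = 12;

% Build a per-sample logical mask that is true inside any speech region.
mask = false(numSamples,1);
for k = 1:size(idx,1)
    mask(idx(k,1):idx(k,2)) = true;
end

% Fraction of samples labeled as speech.
speechFraction = mean(mask);
```

A mask in this form can be compared sample by sample against the decisions of another detector.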

Load a noise signal and resample to the audio sample rate.

[noise,fileFs] = audioread("WashingMachine-16-8-mono-200secs.mp3");
noise = resample(noise,fs,fileFs);

Use the supporting function mixSNR to corrupt the clean speech signal with washing machine noise at a desired SNR level in dB. Listen to the corrupted audio. The network was trained under -10 dB SNR conditions.

SNR = -10;
noisySpeech = mixSNR(speech,noise,SNR);

sound(noisySpeech,fs)
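
The scaling behind an SNR mix can be checked numerically. The sketch below uses random stand-in signals (not the example audio) and the same norm-based gain that the mixSNR supporting function uses, then confirms that the achieved SNR matches the target.

```matlab
rng(0)                          % reproducible stand-in data
signal = randn(1000,1);         % stand-in for the clean speech
noise = 0.3*randn(1000,1);      % stand-in for the noise
targetSNR = -10;                % desired SNR in dB

% Scale the noise so that 20*log10(norm(signal)/norm(scaledNoise))
% equals targetSNR.
gain = norm(signal)/(norm(noise)*10^(targetSNR/20));
scaledNoise = gain*noise;
noisy = signal + scaledNoise;

% Verify the SNR of the mixture components.
achievedSNR = 20*log10(norm(signal)/norm(scaledNoise));
```

At -10 dB, the scaled noise has roughly 3.16 times the norm of the signal, which is why the speech is hard to hear in the mixture.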

The algorithm-based VAD, detectSpeech, fails under these noisy conditions.

detectSpeech(noisySpeech,fs,Window=hamming(0.04*fs,"periodic"),MergeDistance=round(0.5*fs))

Download Pretrained Network

Download and load a pretrained network and a configured audioFeatureExtractor object. The network was trained to detect speech in low SNR environments given features output from the audioFeatureExtractor object.

downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","vadbilsmtnet.zip");
dataFolder = tempdir;
netFolder = fullfile(dataFolder,"vadbilsmtnet");
unzip(downloadFolder,netFolder)
pretrainedNetwork = load(fullfile(netFolder,"voiceActivityDetectionExample.mat"));

afe = pretrainedNetwork.afe;
net = pretrainedNetwork.speechDetectNet;

The audioFeatureExtractor object is configured to extract features from 256-sample windows with 128 samples overlap between windows. At a 16 kHz sample rate, features are extracted from 16 ms windows with 8 ms overlap. From each window, the audioFeatureExtractor object extracts nine features: spectral centroid, spectral crest, spectral entropy, spectral flux, spectral kurtosis, spectral rolloff point, spectral skewness, spectral slope, and harmonic ratio.

afe
afe = 
  audioFeatureExtractor with properties:

   Properties
                     Window: [256×1 double]
              OverlapLength: 128
                 SampleRate: 16000
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'
        FeatureVectorLength: 9

   Enabled Features
     spectralCentroid, spectralCrest, spectralEntropy, spectralFlux, spectralKurtosis, spectralRolloffPoint
     spectralSkewness, spectralSlope, harmonicRatio

   Disabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
     mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralDecrease, spectralFlatness
     spectralSpread, pitch, zerocrossrate, shortTimeEnergy


   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.
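
The window and hop arithmetic above can be reproduced directly. This sketch assumes that only complete windows produce a feature vector (no padding of a trailing partial window) and uses a hypothetical signal of exactly 15 seconds.

```matlab
fs = 16e3;                      % sample rate in Hz
windowLength = 256;             % samples per analysis window
overlapLength = 128;            % samples shared by adjacent windows
hopLength = windowLength - overlapLength;

windowDuration = 1e3*windowLength/fs;   % 16 ms
hopDuration = 1e3*hopLength/fs;         % 8 ms

% Number of complete analysis windows in a hypothetical 15 s signal,
% assuming a trailing partial window is dropped.
numSamples = 15*fs;
numFrames = floor((numSamples - overlapLength)/hopLength);   % 1874
```

Under these assumptions the extractor yields one 9-element feature vector every 8 ms.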

The network consists of two bidirectional LSTM layers, each with 200 hidden units, followed by a fully connected layer with two outputs and a softmax layer. The two output scores correspond to class 0 (no voice activity detected) and class 1 (voice activity detected).

net.Layers
ans = 
  5×1 Layer array with layers:

     1   'sequenceinput'   Sequence Input    Sequence input with 9 dimensions
     2   'biLSTM_1'        BiLSTM            BiLSTM with 200 hidden units
     3   'biLSTM_2'        BiLSTM            BiLSTM with 200 hidden units
     4   'fc'              Fully Connected   2 fully connected layer
     5   'softmax'         Softmax           softmax
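
To get a feel for the size of this architecture, the number of learnable parameters can be estimated by hand. The sketch below assumes the standard LSTM parameterization (four gates per direction, each with input weights, recurrent weights, and a bias) and that the second BiLSTM layer receives the concatenated forward and backward outputs of the first. It is an estimate from the layer sizes shown above, not a value read from the pretrained network.

```matlab
numFeatures = 9;     % input dimension
numHidden = 200;     % hidden units per direction
numClasses = 2;      % outputs of the fully connected layer

% Learnables of one BiLSTM layer:
% 2 directions x 4 gates x H x (inputSize + H + 1 bias).
bilstmParams = @(inputSize,H) 2*4*H*(inputSize + H + 1);

p1 = bilstmParams(numFeatures,numHidden);      % 336000
p2 = bilstmParams(2*numHidden,numHidden);      % 961600
pfc = 2*numHidden*numClasses + numClasses;     % 802

totalParams = p1 + p2 + pfc;                   % 1298402
```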

Perform Voice Activity Detection

Extract features from the speech data and then standardize them. Orient the features so that time is across columns.

features = extract(afe,noisySpeech);
features = (features - mean(features,1))./std(features,[],1);
features = features';
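
The standardization step subtracts the per-feature mean and divides by the per-feature standard deviation, both computed over time. A minimal sketch on random stand-in data (not real audio features) shows the effect:

```matlab
rng(0)                           % reproducible stand-in data
X = 5 + 2*randn(100,3);          % 100 frames by 3 hypothetical features

% Standardize each column (feature) across rows (time).
Xz = (X - mean(X,1))./std(X,[],1);

columnMeans = mean(Xz,1);        % approximately zero
columnStds = std(Xz,[],1);       % approximately one
```

Because the statistics are computed over the whole signal, this form only suits batch processing. A streaming system has to estimate them incrementally.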

Pass the features through the speech detection network to classify each feature vector as belonging to a frame of speech or not. The call transposes the features again, so the network receives one feature vector per row.

scores = predict(net,features.');
decisionsCategorical = scores2label(scores,categorical([0 1]));
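
Conceptually, turning scores into labels amounts to picking the class with the highest score in each frame. The following sketch does this by hand for a few hypothetical score rows. It illustrates the idea and is not a description of how scores2label is implemented.

```matlab
% Hypothetical softmax scores: one row per frame, columns ordered as
% [class 0 (no speech), class 1 (speech)].
scores = [0.9 0.1
          0.2 0.8
          0.4 0.6];

% Index of the larger score in each row (1 or 2).
[~,classIdx] = max(scores,[],2);

% Shift to 0/1 decisions so that 1 means speech.
decisions = classIdx - 1;
```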

Each decision corresponds to an analysis window analyzed by the audioFeatureExtractor. Replicate the decisions so that they are in one-to-one correspondence with the audio samples. Use detectSpeech with no output arguments to plot the ground truth. Use signalMask and plotsigroi to plot the predicted VAD.

decisions = (double(decisionsCategorical) - 1);
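% Pad the first half window with the first decision, then repeat each
% decision for one hop.
decisionsPerSample = [repelem(decisions(1),round(numel(afe.Window)/2),1);repelem(decisions,numel(afe.Window)-afe.OverlapLength,1)];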
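
The per-sample expansion is easier to follow at a small scale. This sketch uses hypothetical sizes (an 8-sample window with 4 samples of overlap) and pads the first half window by repeating the first decision, which is one reasonable choice among several.

```matlab
windowLength = 8;
overlapLength = 4;
hopLength = windowLength - overlapLength;
numSamples = 20;

% Four complete windows fit in 20 samples with a hop of 4.
numFrames = floor((numSamples - overlapLength)/hopLength);
decisions = [0;1;1;0];           % hypothetical frame decisions

% Repeat each decision for one hop, and cover the first half window
% with the first decision so the lengths match.
perSample = [repelem(decisions(1),windowLength/2,1)
             repelem(decisions,hopLength,1)];
```

The result has 20 elements, one per audio sample, so it can be used directly as a mask on the signal.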

tiledlayout(2,1)

nexttile
detectSpeech(speech,fs,Window=hamming(0.04*fs,"periodic"),MergeDistance=round(0.5*fs))
title("Ground Truth VAD")
xlabel("")

nexttile
mask = signalMask(decisionsPerSample,SampleRate=fs,Categories="Activity");
plotsigroi(mask,noisySpeech,true)
title("Predicted VAD")

Perform Streaming Voice Activity Detection

The audioFeatureExtractor object is intended for batch processing and does not retain state between calls. Use generateMATLABFunction to create a streaming-friendly feature extractor. To use the trained VAD network in a streaming context, call predict (Deep Learning Toolbox) with a second output to get the updated network state, and assign that state back to the network before processing the next frame.

generateMATLABFunction(afe,"featureExtractor",IsStreaming=true)
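
Carrying the network state from one call to the next can be sketched on a small, randomly initialized network. Everything in this block is hypothetical: the layer sizes, the data, and the use of a unidirectional lstmLayer, chosen because chunked processing with carried state should then reproduce the full-sequence output. With bidirectional layers, the backward pass only sees the current chunk, so streaming outputs may differ from batch outputs. The sketch requires Deep Learning Toolbox.

```matlab
rng(0)                                   % reproducible initialization

% Small hypothetical recurrent network, not the pretrained VAD network.
layers = [
    sequenceInputLayer(3)
    lstmLayer(4)
    fullyConnectedLayer(2)
    softmaxLayer];
net = dlnetwork(layers);

% Ten time steps of 3-channel data, labeled channel-by-time.
X = dlarray(randn(3,10),"CT");

% Batch reference: process the whole sequence at once.
yBatch = extractdata(predict(net,X));

% Streaming: process two chunks of five steps, carrying the state.
streamNet = resetState(net);
yStream = zeros(2,10,"like",yBatch);
for first = 1:5:10
    cols = first:first+4;
    [y,state] = predict(streamNet,X(:,cols));
    streamNet.State = state;             % keep the state for the next chunk
    yStream(:,cols) = extractdata(y);
end

maxDiff = max(abs(yStream - yBatch),[],"all");
```

If the state assignment is omitted, each chunk starts from the initial state and the streaming output no longer follows the batch result.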

To simulate a streaming environment, save the speech and noise signals as WAV files. To simulate streaming input, you will use dsp.AudioFileReader to read frames from the files and mix them at a desired SNR. You can also use audioDeviceReader so that your microphone is the speech source.

audiowrite("Speech.wav",speech,fs)
audiowrite("Noise.wav",noise,fs)

Define parameters for the streaming voice activity detection in noise demonstration:

  • signal - Signal source, specified as either the name of the speech file you saved previously or "Microphone" to use your microphone.

  • noise - Noise source, specified as a noise sound file to mix with the signal.

  • SNR - Signal-to-noise ratio to mix the signal and noise, specified in dB.

  • testDuration - Test duration, specified in seconds.

  • playbackSource - Playback source, specified as "clean" (the original clean signal), "noisy" (the noisy signal), or "detectedSpeech" (the noisy signal with non-speech regions set to zero). An audioDeviceWriter object plays the audio through your speakers.

signal = "Speech.wav";
noise = "Noise.wav";
SNR = -10; % dB

testDuration = 20; % seconds
playbackSource = "noisy";

Call the supporting function streamingDemo to observe the performance of the VAD network on streaming audio. Changing the parameters above does not affect a demo that is already running. After the streaming demo completes, you can modify the parameters and then run the demo again.

streamingDemo(net,afe, ...
    signal,noise,SNR, ...
    testDuration,playbackSource);

Supporting Functions

Streaming Demo

function streamingDemo(net,afe,signal,noise,SNR,testDuration,playbackSource)
% streamingDemo(net,afe,signal,noise,SNR,testDuration,playbackSource) runs
% a real-time VAD demo.

% Create dsp.AudioFileReader objects to read speech and noise files frame
% by frame. If the speech signal is specified as Microphone, use an
% audioDeviceReader as the source.
if strcmpi(signal,"Microphone")
    speechReader = audioDeviceReader(afe.SampleRate);
else
    speechReader = dsp.AudioFileReader(signal,PlayCount=inf);
end
noiseReader = dsp.AudioFileReader(noise,PlayCount=inf,SamplesPerFrame=speechReader.SamplesPerFrame);
fs = speechReader.SampleRate;

% Create a dsp.MovingStandardDeviation object and a dsp.MovingAverage
% object. You will use these to determine the standard deviation and mean
% of the audio features for standardization. The statistics should improve
% over time.
movSTD = dsp.MovingStandardDeviation(Method="Exponential weighting",ForgettingFactor=1);
movMean = dsp.MovingAverage(Method="Exponential weighting",ForgettingFactor=1);

% Create a dsp.MovingMaximum object. You will use it to scale the audio by
% its running peak value.
movMax = dsp.MovingMaximum(SpecifyWindowLength=false);

% Create a dsp.MovingRMS object. You will use this to determine the signal
% and noise mix at the desired SNR. This object is only useful for example
% purposes where you are artificially adding noise.
movRMS = dsp.MovingRMS(Method="Exponential weighting",ForgettingFactor=1);

% Create three dsp.AsyncBuffer objects. One to buffer the input audio, one
% to buffer the extracted features, and one to buffer the output audio so
% that VAD decisions correspond to the audio signal. The output buffer is
% only necessary for visualizing the decisions in real time.
audioInBuffer = dsp.AsyncBuffer(2*speechReader.SamplesPerFrame);
featureBuffer = dsp.AsyncBuffer(ceil(2*speechReader.SamplesPerFrame/(numel(afe.Window)-afe.OverlapLength)));
audioOutBuffer = dsp.AsyncBuffer(2*speechReader.SamplesPerFrame);

% Create a time scope to visualize the original speech signal, the noisy
% signal that the network is applied to, and the decision output from the
% network.
scope = timescope(SampleRate=fs, ...
    TimeSpanSource="property", ...
    TimeSpan=3, ...
    BufferLength=fs*3*3, ...
    TimeSpanOverrunAction="Scroll", ...
    AxesScaling="updates", ...
    MaximizeAxes="on", ...
    AxesScalingNumUpdates=20, ...
    NumInputPorts=3, ...
    LayoutDimensions=[3,1], ...
    ChannelNames=["Noisy Speech","Clean Speech (Original)","Detected Speech"], ...
    ...
    ActiveDisplay = 1, ...
    ShowGrid=true, ...
    ...
    ActiveDisplay = 2, ...
    ShowGrid=true, ...
    ...
    ActiveDisplay=3, ...
    ShowGrid=true); 
setup(scope,{1,1,1})

% Create an audioDeviceWriter object to play either the original or noisy
% audio from your speakers.
deviceWriter = audioDeviceWriter(SampleRate=fs);

% Initialize variables used in the loop.
windowLength = numel(afe.Window);
hopLength = windowLength - afe.OverlapLength;

% Run the streaming demonstration.
loopTimer = tic;
while toc(loopTimer) < testDuration

    % Read a frame of the speech signal and a frame of the noise signal
    speechIn = speechReader();
    noiseIn = noiseReader();

    % Mix the speech and noise at the specified SNR
    energy = movRMS([speechIn,noiseIn]);
    noiseGain = 10^(-SNR/20) * energy(end,1) / energy(end,2);
    noisyAudio = speechIn + noiseGain*noiseIn;

    % Update a running max to scale the audio
    myMax = movMax(abs(noisyAudio));
    noisyAudio = noisyAudio/myMax(end);

    % Write the noisy audio and speech to buffers
    write(audioInBuffer,[noisyAudio,speechIn]);

    % If enough samples are in the audio buffer to calculate a feature
    % vector, read one analysis window (overlapping the previous window),
    % extract its feature vector, and write it to the features buffer.
    while (audioInBuffer.NumUnreadSamples >= hopLength)
        x = read(audioInBuffer,numel(afe.Window),afe.OverlapLength);
        write(audioOutBuffer,x(end-hopLength+1:end,:));
        noisyAudio = x(:,1);
        features = featureExtractor(noisyAudio);
        write(featureBuffer,features');
    end

    if featureBuffer.NumUnreadSamples >= 1
        % Read the audio data corresponding to the number of unread
        % feature vectors.
        audioHop = read(audioOutBuffer,featureBuffer.NumUnreadSamples*hopLength);

        % Read all unread feature vectors.
        features = read(featureBuffer);

        % Use only the new features to update the standard deviation and
        % mean. Normalize the features.
        rmean = movMean(features);
        rstd = movSTD(features);
        features = (features - rmean(end,:)) ./ rstd(end,:);

        % Network inference
        [score,state] = predict(net,features); 
        net.State = state; 
        [~,decision] = max(score,[],2);
        decision = decision-1;

        % Convert the decisions per feature vector to decisions per sample
        decision = repelem(decision,hopLength,1);

        % Apply a mask to the noisy speech for visualization
        vadResult = audioHop(:,1);
        vadResult(decision==0) = 0;

        % Listen to the speech or speech+noise
        switch playbackSource
            case "clean"
                deviceWriter(audioHop(:,2));
            case "noisy"
                deviceWriter(audioHop(:,1));
            case "detectedSpeech"
                deviceWriter(vadResult);
        end

        % Visualize the speech+noise, the original speech, and the voice
        % activity detection.
        scope(audioHop(:,1),audioHop(:,2),vadResult)

    end
end
end
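
The streaming demo standardizes features with running statistics that use exponential weighting and a forgetting factor of 1. The sketch below implements a running mean with the recursion w(n) = lambda*w(n-1) + 1 and m(n) = (1 - 1/w(n))*m(n-1) + x(n)/w(n). This recursion is an assumption about how the exponential weighting method behaves, written in plain MATLAB rather than taken from the dsp.MovingAverage source. With lambda equal to 1, it reduces to the cumulative mean of all samples seen so far, which is why the statistics improve over time.

```matlab
rng(0)                           % reproducible stand-in data
x = randn(50,1);                 % hypothetical stream of one feature
lambda = 1;                      % forgetting factor

w = 0;                           % running weight
runningMean = 0;
m = zeros(size(x));
for n = 1:numel(x)
    w = lambda*w + 1;
    runningMean = (1 - 1/w)*runningMean + x(n)/w;
    m(n) = runningMean;
end

% Reference: cumulative mean over all samples up to n.
cumulativeMean = cumsum(x)./(1:numel(x)).';
maxDiff = max(abs(m - cumulativeMean));
```

A forgetting factor below 1 would weight recent samples more heavily, which could help when the noise conditions drift during a session.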

Mix SNR

function [noisySignal,requestedNoise] = mixSNR(signal,noise,ratio)
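% [noisySignal,requestedNoise] = mixSNR(signal,noise,ratio) returns a noisy
% version of the signal, noisySignal. The noisy signal has been mixed with
% noise at the specified ratio in dB. requestedNoise is the scaled noise
% that was added to the signal, before the final peak normalization.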

numSamples = size(signal,1);

% Convert noise to mono
noise = mean(noise,2);

% Trim or expand noise to match signal size
if size(noise,1)>=numSamples
    % Choose a random starting index such that you still have numSamples
    % after indexing the noise.
    start = randi(size(noise,1) - numSamples + 1);
    noise = noise(start:start+numSamples-1);
else
    numReps = ceil(numSamples/size(noise,1));
    temp = repmat(noise,numReps,1);
    start = randi(size(temp,1) - numSamples + 1);
    noise = temp(start:start+numSamples-1);
end

signalNorm = norm(signal);
noiseNorm = norm(noise);

goalNoiseNorm = signalNorm/(10^(ratio/20));
factor = goalNoiseNorm/noiseNorm;

requestedNoise = noise.*factor;
noisySignal = signal + requestedNoise;

noisySignal = noisySignal./max(abs(noisySignal));
end
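
When the noise is shorter than the signal, it is tiled and then cropped at a random offset. The sketch below isolates that step with small hypothetical sizes. A tiled vector of length L contains L - numSamples + 1 valid starting positions for a segment of numSamples samples.

```matlab
rng(0)                                   % reproducible random offset
numSamples = 10;                         % hypothetical signal length
noise = (1:4).';                         % hypothetical short noise

% Tile the noise until it is at least as long as the signal.
numReps = ceil(numSamples/size(noise,1));        % 3 repetitions
tiled = repmat(noise,numReps,1);                 % 12 samples

% Pick a random valid start and crop to the signal length.
numStarts = size(tiled,1) - numSamples + 1;      % 3 valid starts
start = randi(numStarts);
segment = tiled(start:start+numSamples-1);
```

When the tiled length equals the signal length, numStarts is 1 and the only valid start is the first sample.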