detectspeechnn

Detect boundaries of speech in audio signal using AI

Since R2023a

collapse all in page

Syntax

roi = detectspeechnn(audioIn,fs)

roi = detectspeechnn(audioIn,fs,Name=Value)

[roi,probs] = detectspeechnn(___)

detectspeechnn(___)

Description

roi = detectspeechnn(audioIn,fs) returns indices corresponding to the beginning and end of speech within the audio signal.

example

roi = detectspeechnn(audioIn,fs,Name=Value) specifies options using one or more name-value arguments. For example, detectspeechnn(audioIn,fs,MergeThreshold=0.5) merges speech regions that are separated by 0.5 seconds or less.

example

[roi,probs] = detectspeechnn(___) also returns the probability of voice activity per sample in the input audio signal.

example

detectspeechnn(___) with no output arguments plots the input signal and the detected speech regions.

This function requires both Audio Toolbox™ and Deep Learning Toolbox™.

example

Examples

collapse all

Detect Speech in Audio Signal

This example uses:

Open Live Script

Read in an audio signal containing speech and music and listen to the sound.

[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
sound(audioIn,fs)

Call detectspeechnn on the signal to obtain the regions of interest (ROIs), in samples, containing speech.

roi = detectspeechnn(audioIn,fs)

roi = 2×2

           1       63120
       83600      150000

Convert the ROIs from samples to seconds.

roiSeconds = (roi-1)/fs

roiSeconds = 2×2

         0    3.9449
    5.2249    9.3749

Plot the audio waveform with the speech regions.

detectspeechnn(audioIn,fs)

Figure contains an axes object. The axes object with title Detected Speech, xlabel Time (s), ylabel Amplitude contains 8 objects of type line, constantline, patch.

Refine Speech Regions with Energy-Based VAD

This example uses:

Open Live Script

Read in an audio signal containing a speaker repeating the phrase "volume up".

[audioIn,fs] = audioread("MaleVolumeUp-16-mono-6secs.ogg");

Compare detected speech regions by calling detectspeechnn with and without the application of an energy-based voice activity detector (VAD) in postprocessing.

tiledlayout(2,1)
nexttile()
detectspeechnn(audioIn,fs)
nexttile()
detectspeechnn(audioIn,fs,ApplyEnergyVAD=true)

Figure contains 2 axes objects. Axes object 1 with title Detected Speech, xlabel Time (s), ylabel Amplitude contains 4 objects of type line, constantline, patch. Axes object 2 with title Detected Speech, xlabel Time (s), ylabel Amplitude contains 16 objects of type line, constantline, patch.

Adjust Postprocessing Parameters for Detecting Speech

This example uses:

Open Live Script

Read in an audio signal.

[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");

Call detectspeechnn with no output arguments to display a plot of the detected speech regions.

detectspeechnn(audioIn,fs);

Figure contains an axes object. The axes object with title Detected Speech, xlabel Time (s), ylabel Amplitude contains 8 objects of type line, constantline, patch.

Modify the parameters used in the postprocessing algorithm and see how they affect the detected speech regions. For more information about the VAD postprocessing algorithm, see Postprocessing.

mergeThreshold = 1.3 ; % seconds

lengthThreshold = 0.25; % seconds

activationThreshold = 0.5; % probability

deactivationThreshold = 0.25 ; % probability

applyEnergyVAD = false ;

detectspeechnn(audioIn,fs,MergeThreshold=mergeThreshold, ...
    LengthThreshold=lengthThreshold, ...
    ActivationThreshold=activationThreshold, ...
    DeactivationThreshold=deactivationThreshold)

Figure contains an axes object. The axes object with title Detected Speech, xlabel Time (s), ylabel Amplitude contains 5 objects of type line, constantline, patch.

Get Probability of Voice Activity per Sample of Audio

Open Live Script

Read in an audio signal containing speech and music.

[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");

Call detectspeechnn with an additional output variable to get the probabilities of speech in each sample of the signal.

[roi,probs] = detectspeechnn(audioIn,fs);

Plot the audio signal along with the voice activity probability.

t = (0:length(audioIn)-1)/fs;
plot(t,audioIn,t,probs,"r")
legend("Audio signal","Probability of speech",Location="best")
xlabel("Time (s)")
title("Voice Activity Probability")

Figure contains an axes object. The axes object with title Voice Activity Probability, xlabel Time (s) contains 2 objects of type line. These objects represent Audio signal, Probability of speech.

Detect Speech in Streaming Audio

This example uses:

Open Live Script

Use detectspeechnn to detect the presence of speech in a streaming audio signal.

Create a dsp.AudioFileReader object to stream an audio file for processing. Set the SamplesPerFrame property to read 100 ms nonoverlapping chunks from the signal.

afr = dsp.AudioFileReader("MaleVolumeUp-16-mono-6secs.ogg");
analysisDuration = 0.1; % seconds
afr.SamplesPerFrame = floor(analysisDuration*afr.SampleRate);

The neural network architecture of detectspechnn does not retain state between calls, and it performs best when analyzing larger chunks of audio signals. When you use detectspeechnn in a streaming scenario, specific application requirements of accuracy, computational efficiency, and latency dictate the analysis duration and whether to overlap analysis chunks.

Create a timescope object to plot the audio signal and the detected speech regions. Create an audioDeviceWriter to play the audio as you stream it.

scope = timescope(NumInputPorts=2, ...
    SampleRate=afr.SampleRate, ...
    TimeSpanSource="property",TimeSpan=5, ...
    YLimits=[-1.2,1.2], ...
    ShowLegend=true,ChannelNames=["Audio","Detected Speech"]);
adw = audioDeviceWriter(afr.SampleRate);

In a streaming loop:

Read in a 100 ms chunk from the audio file.
Use detectspeechnn to detect any regions of speech in the frame. Use sigroi2binmask to convert the region indices to a binary mask.
Plot the audio signal and the detected speech.
Play the audio with the device writer.

while ~isDone(afr)
    audioIn = afr();
    segments = detectspeechnn(audioIn,afr.SampleRate,LengthThreshold=0.01);
    mask = sigroi2binmask(segments,afr.SamplesPerFrame);
    scope(audioIn,mask)
    adw(audioIn);
end

Input Arguments

collapse all

`audioIn` — Audio input
column vector

Audio input signal, specified as a column vector (single channel).

Data Types: single | double

`fs` — Sample rate (Hz)
positive scalar

Sample rate in Hz, specified as a positive scalar.

Data Types: single | double

Name-Value Arguments

collapse all

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: detectspeechnn(audioIn,fs,ApplyEnergyVAD=true)

`MergeThreshold` — Merge threshold
`0.25` (default) | nonnegative scalar

Merge threshold in seconds, specified as a nonnegative scalar. The function merges speech regions that are separated by a duration less than or equal to the specified threshold. Set the threshold to Inf to not merge any detected regions.

`LengthThreshold` — Length threshold
`0.25` (default) | nonnegative scalar

Length threshold in seconds, specified as a nonnegative scalar. The function does not return speech regions that have a duration less than or equal to the specified threshold.

`ActivationThreshold` — Probability threshold to start a speech segment
`0.5` (default) | scalar in the range [0, 1]

Probability threshold to start a speech segment, specified as a scalar in the range [0, 1].

Data Types: single | double

`DeactivationThreshold` — Probability threshold to end a speech segment
`0.25` (default) | scalar in the range [0, 1]

Probability threshold to end a speech segment, specified as a scalar in the range [0, 1].

Data Types: single | double

`ApplyEnergyVAD` — Apply energy-based voice activity detector
`false` (default) | `true`

Apply energy-based voice activity detector (VAD) to the speech regions detected by the neural network, specified as true or false.

Data Types: logical

Output Arguments

collapse all

`roi` — Speech regions
N-by-2 matrix

Speech regions, returned as an N-by-2 matrix of indices into the input signal, where N is the number of individual speech regions detected. The first column contains the index of the start of a speech region, and the second column contains the index of the end of a region.

`probs` — Probability of speech per sample
column vector

Probability of speech per sample of the input audio signal, returned as a column vector with the same size as the input signal.

Algorithms

collapse all

Preprocessing

The detectspeechnn function preprocesses the audio data using the following steps.

Resample the audio to 16kHz.
Compute a centered short-time Fourier transform (STFT) using a 25 ms periodic Hamming window and 10 ms hop length. Pad the signal so that the first window is centered at 0 s.
Convert the STFT to a power spectrogram.
Apply a mel filter bank with 40 bands to obtain a mel spectrogram.
Convert the mel spectrogram to a log scale.
Standardize each of the mel bands to have zero mean and standard deviation of 1.

Neural Network Inference

The preprocessed data is passed to a pretrained VAD neural network. The network outputs represent the probability of speech in each frame of audio in the input spectrogram.

The neural network is a ported version of the vad-crdnn-libriparty pretrained model provided by SpeechBrain [1], which combines convolutional, recurrent, and fully connected layers.

Postprocessing

The detectspeechnn function postprocesses the VAD network output using the following steps.

Apply activation and deactivation thresholds to posterior probabilities to determine candidate speech regions.
Optionally, apply energy-based VAD to refine the detected speech regions.
Merge speech regions that are close to each other according to the merge threshold.
Remove speech regions that are shorter than or equal to the length threshold.

References

[1] Ravanelli, Mirco, et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv, 8 June 2021. arXiv.org, http://arxiv.org/abs/2106.04624

Extended Capabilities

expand all

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Usage notes and limitations:

Variable-size input is not supported.
The sample rate fs must be constant.

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).

Version History

Introduced in R2023a

expand all

R2024b: Output probabilities of voice activity

Use an additional output argument to get the per-sample probabilities of voice activity in a signal.

detectspeechnn

Syntax

Description

Examples

Detect Speech in Audio Signal

Refine Speech Regions with Energy-Based VAD

Adjust Postprocessing Parameters for Detecting Speech

Get Probability of Voice Activity per Sample of Audio

Detect Speech in Streaming Audio

Input Arguments

`audioIn` — Audio input
column vector

`fs` — Sample rate (Hz)
positive scalar

Name-Value Arguments

`MergeThreshold` — Merge threshold
`0.25` (default) | nonnegative scalar

`LengthThreshold` — Length threshold
`0.25` (default) | nonnegative scalar

`ActivationThreshold` — Probability threshold to start a speech segment
`0.5` (default) | scalar in the range [0, 1]

`DeactivationThreshold` — Probability threshold to end a speech segment
`0.25` (default) | scalar in the range [0, 1]

`ApplyEnergyVAD` — Apply energy-based voice activity detector
`false` (default) | `true`

Output Arguments

`roi` — Speech regions
N-by-2 matrix

`probs` — Probability of speech per sample
column vector

Algorithms

Preprocessing

Neural Network Inference

Postprocessing

References

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

Version History

R2024b: Output probabilities of voice activity

See Also

Functions

Objects

Blocks

Topics

detectspeechnn

Syntax

Description

Examples

Detect Speech in Audio Signal

Refine Speech Regions with Energy-Based VAD

Adjust Postprocessing Parameters for Detecting Speech

Get Probability of Voice Activity per Sample of Audio

Detect Speech in Streaming Audio

Input Arguments

audioIn — Audio input column vector

fs — Sample rate (Hz) positive scalar

Name-Value Arguments

MergeThreshold — Merge threshold 0.25 (default) | nonnegative scalar

LengthThreshold — Length threshold 0.25 (default) | nonnegative scalar

ActivationThreshold — Probability threshold to start a speech segment 0.5 (default) | scalar in the range [0, 1]

DeactivationThreshold — Probability threshold to end a speech segment 0.25 (default) | scalar in the range [0, 1]

ApplyEnergyVAD — Apply energy-based voice activity detector false (default) | true

Output Arguments

roi — Speech regions N-by-2 matrix

probs — Probability of speech per sample column vector

Algorithms

Preprocessing

Neural Network Inference

Postprocessing

References

Extended Capabilities

C/C++ Code Generation Generate C and C++ code using MATLAB® Coder™.

GPU Arrays Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

Version History

R2024b: Output probabilities of voice activity

See Also

Functions

Objects

Blocks

Topics

`audioIn` — Audio input
column vector

`fs` — Sample rate (Hz)
positive scalar

`MergeThreshold` — Merge threshold
`0.25` (default) | nonnegative scalar

`LengthThreshold` — Length threshold
`0.25` (default) | nonnegative scalar

`ActivationThreshold` — Probability threshold to start a speech segment
`0.5` (default) | scalar in the range [0, 1]

`DeactivationThreshold` — Probability threshold to end a speech segment
`0.25` (default) | scalar in the range [0, 1]

`ApplyEnergyVAD` — Apply energy-based voice activity detector
`false` (default) | `true`

`roi` — Speech regions
N-by-2 matrix

`probs` — Probability of speech per sample
column vector

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.