Audio Event Classification Using TensorFlow Lite on Raspberry Pi
This example demonstrates audio event classification using a pretrained deep neural network, YAMNet, from the TensorFlow™ Lite library on Raspberry Pi™. You load the TensorFlow Lite model and predict the class of a given audio frame on Raspberry Pi using a processor-in-the-loop (PIL) workflow. To generate code for Raspberry Pi, you use Embedded Coder®, MATLAB® Support Package for Raspberry Pi Hardware, and Deep Learning Toolbox Interface for TensorFlow Lite. For more details on the YAMNet model, refer to Audio Classification and yamnet classification.
Third-Party Prerequisites
Raspberry Pi hardware
TensorFlow Lite library (on the target ARM® hardware)
Pretrained TensorFlow Lite Model
Download YAMNet
Download and unzip the yamnet (Audio Toolbox) model.
component = "audio"; filename = "yamnet.zip"; localfile = matlab.internal.examples.downloadSupportFile(component,filename); downloadFolder = fileparts(localfile); if exist(fullfile(downloadFolder,"yamnet"),"dir") ~= 7 unzip(localfile,downloadFolder) end addpath(fullfile(downloadFolder,"yamnet"))
Read Audio Data and Classify the Sounds
Use audioread to read the audio file data and listen to it using the sound function.
[audioIn, fs] = audioread("multipleSounds-16-16-mono-18secs.wav");
sound(audioIn,fs)
Call classifySound (Audio Toolbox) to detect the different sounds present in the given audio.
detectedSounds = classifySound(audioIn,fs)
detectedSounds = 1×5 string
"Stream" "Machine gun" "Snoring" "Bark" "Meow"
You have detected the different sounds in the prerecorded audio in offline mode. The later sections of this example demonstrate audio event classification in a real-time scenario where you process one audio frame at a time.
Load TensorFlow Lite Model and Audio Event Classes
Load the TFLite YAMNet model using loadTFLiteModel. As described on the TFLiteModel page, set the Mean and StandardDeviation properties of the TFLite model to 0 and 1, respectively, because the input to YAMNet is not already normalized.
modelFileName = "lite-model_yamnet_classification_tflite_1.tflite";
modelFullPath = fullfile(downloadFolder,"yamnet",modelFileName);
TFLiteYAMNet = loadTFLiteModel(modelFullPath);
TFLiteYAMNet.Mean = 0;
TFLiteYAMNet.StandardDeviation = 1;
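Optionally, you can inspect the loaded model object before configuring the rest of the pipeline. This quick check is not required by the workflow; it simply displays the model properties, such as InputSize, Mean, and StandardDeviation, that the next steps rely on.
% Display the properties of the loaded TFLite model object.
disp(TFLiteYAMNet)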
Use yamnetGraph (Audio Toolbox) to load all the audio event classes supported by YAMNet as a string array.
[~, audioEventClasses] = yamnetGraph;
Set the sample rate (in Hz), the input audio frame length, and the frame duration (in seconds) supported by YAMNet.
modelSamplingRate = 16000;
frameDimension = TFLiteYAMNet.InputSize{1};
frameLength = frameDimension(2);
frameDuration = frameLength/modelSamplingRate;
Set the classificationRate, that is, the number of classifications per second. Because the number of hops per second must equal the classification rate, set the hopDuration to the reciprocal of classificationRate.
classificationRate = 10;
hopDuration = 1/classificationRate;
hopLength = floor(modelSamplingRate*hopDuration);
overlapLength = frameLength - hopLength;
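For reference, with the 1-by-15600 YAMNet input frame at 16000 Hz and a classification rate of 10, these parameters work out to a 0.975-second frame, a 1600-sample (0.1 second) hop, and a 14000-sample overlap. You can confirm this with a quick check:
% Display the derived frame parameters for this configuration.
fprintf("frameLength   = %d samples (%.3f s)\n",frameLength,frameDuration)
fprintf("hopLength     = %d samples (%.3f s)\n",hopLength,hopDuration)
fprintf("overlapLength = %d samples\n",overlapLength)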
Read Input Audio
Use a dropdown control to select from the different input audio files. Use dsp.AudioFileReader (DSP System Toolbox) to read the audio file data.
afr = dsp.AudioFileReader("multipleSounds-16-16-mono-18secs.wav");
audioInSamplingRate = afr.SampleRate;
audioFileInfo = audioinfo(afr.Filename);
Set the SamplesPerFrame property to the number of samples corresponding to one hop.
audioInFrameLength = floor(audioInSamplingRate*hopDuration);
afr.SamplesPerFrame = audioInFrameLength;
Set Up the FIFO Buffers
Create two dsp.AsyncBuffer (DSP System Toolbox) objects, audioBufferYamnet and audioClassBuffer, to buffer the resampled audio samples and the indices of the predicted audio classes. Set the length of audioClassBuffer to correspond to predictedAudioClassesDuration seconds. Initialize audioClassBuffer with the index corresponding to the Silence audio class.
predictedAudioClassesDuration = 1;
audioClassBufferLength = floor(predictedAudioClassesDuration*classificationRate);
audioClassBuffer = dsp.AsyncBuffer(audioClassBufferLength);
audioBufferYamnet = dsp.AsyncBuffer(2*frameLength);
indexOfSilenceAudioClass = find(audioEventClasses == "Silence");
write(audioClassBuffer,ones(audioClassBufferLength,1)*indexOfSilenceAudioClass);
Create a timescope (DSP System Toolbox) object to visualize the audio.
timeScope = timescope("SampleRate",modelSamplingRate, ...
    "YLimits",[-1 1], ...
    "Name","Audio Event Classification Using TensorFlow Lite YAMNet", ...
    "TimeSpanSource","Property", ...
    "TimeSpan",audioFileInfo.Duration);
Run TFLite YAMNet in MATLAB to Perform Audio Event Classification
Set up a dsp.SampleRateConverter (DSP System Toolbox) System object to convert the sampling rate of the input audio to 16000 Hz, because YAMNet is trained on audio signals sampled at 16000 Hz.
src = dsp.SampleRateConverter('InputSampleRate',audioInSamplingRate, ...
    'OutputSampleRate',modelSamplingRate, ...
    'Bandwidth',10000);
You feed one audio frame at a time to represent the system as it would be deployed in a real-time embedded system. In the streaming loop, you first read one hop of audio samples and feed them to the dsp.SampleRateConverter (DSP System Toolbox) object to convert the sampling rate to 16000 Hz. The resampled frame is written to the FIFO buffer audioBufferYamnet. You then read an overlapping frame of length frameLength from this buffer and feed it to YAMNet. The TensorFlow Lite YAMNet model outputs a score vector that contains a score for each audio event class. You find the index of the maximum score and write it to the FIFO buffer audioClassBuffer. The predicted index is the statistical mode of the contents of audioClassBuffer, and the predicted audio event class is the value of the audioEventClasses array at that index. You visualize the resampled audio frame in the time scope and display the predicted audio event class as the title of the time scope.
while ~isDone(afr)
    audioInFrame = afr();
    resampledAudioInFrame = src(audioInFrame);
    write(audioBufferYamnet,resampledAudioInFrame);
    audioInYamnetFrame = read(audioBufferYamnet,frameLength,overlapLength);
    scoresTFLite = TFLiteYAMNet.predict(audioInYamnetFrame');
    [~, audioClassIndex] = max(scoresTFLite);
    write(audioClassBuffer,audioClassIndex);
    predictedSoundClass = audioEventClasses(mode(audioClassBuffer.peek(audioClassBufferLength)));
    timeScope(resampledAudioInFrame);
    timeScope.Title = char(predictedSoundClass);
    drawnow
end
hide(timeScope)
reset(timeScope)
reset(afr)
Prepare MATLAB Code for Deployment
You prepare a MATLAB function, predictAudioClassUsingYAMNET, that performs audio class prediction for the input audio frames. It buffers the indices of the predicted audio classes in a FIFO buffer. The predicted audio class index is the statistical mode of the contents of this FIFO buffer.
type predictAudioClassUsingYAMNET.m
function predictedAudioClassIndex = predictAudioClassUsingYAMNET(audioIn,audioClassHistoryBufferLength,indexSilenceAudioClass)
% predictAudioClassUsingYAMNET Predicts the audio class of input audio by
% using a pretrained TensorFlow Lite YAMNET model.
%
% Input Arguments:
% audioIn                       - Audio frame of length 1x15600 with
%                                 sampling rate of 16000 samples per
%                                 second
% audioClassHistoryBufferLength - Length of the audio class FIFO buffer
%                                 that contains the predicted audio class
%                                 indices. The index of the predicted
%                                 audio class is the statistical mode
%                                 of the contents of this buffer.
% indexSilenceAudioClass        - Index of the 'Silence' audio class,
%                                 used to initialize the FIFO buffer.
%
% Output Arguments:
% predictedAudioClassIndex      - Index of the predicted audio class.
%
% Copyright 2022 The MathWorks, Inc.

%#codegen

persistent TFLiteYAMNETModel AudioClassBuffer

if isempty(TFLiteYAMNETModel)
    TFLiteYAMNETModel = loadTFLiteModel("lite-model_yamnet_classification_tflite_1.tflite");
    TFLiteYAMNETModel.NumThreads = 4;
    TFLiteYAMNETModel.Mean = 0;
    TFLiteYAMNETModel.StandardDeviation = 1;

    % Create and initialize a FIFO buffer with the index of the 'Silence' class.
    AudioClassBuffer = dsp.AsyncBuffer(audioClassHistoryBufferLength);
    write(AudioClassBuffer,ones(audioClassHistoryBufferLength,1)*indexSilenceAudioClass);
end

scores = predict(TFLiteYAMNETModel,audioIn);
[~, audioClassIndex] = max(scores);
write(AudioClassBuffer,audioClassIndex);
predictedAudioClassHistory = peek(AudioClassBuffer,audioClassHistoryBufferLength);
predictedAudioClassIndex = mode(predictedAudioClassHistory);
end
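Optionally, before generating code, you can exercise the entry-point function directly in MATLAB as a quick sanity check. The snippet below is not part of the deployment workflow; it uses a frame of pink noise as an arbitrary test signal and maps the returned index to its class label.
% Optional in-MATLAB check of the entry-point function (arbitrary test input).
testFrame = pinknoise(1,15600,"single");
classIndex = predictAudioClassUsingYAMNET(testFrame,audioClassBufferLength,indexOfSilenceAudioClass);
disp(audioEventClasses(classIndex))
% Clear the function to reset its persistent TFLite model and FIFO buffer.
clear predictAudioClassUsingYAMNET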
Generate Code for Audio Event Classifier on Raspberry Pi
Create Code Generation Configuration
cfg = coder.config("lib","ecoder",true);
cfg.TargetLang = 'C++';
cfg.VerificationMode = "PIL";
Set Up Connection with Raspberry Pi
Use the Raspberry Pi Support Package function raspi to create a connection to your Raspberry Pi. In the following code, replace:
raspiname with the name of your Raspberry Pi
pi with your user name
password with your password
if ~(exist("r","var"))
    r = raspi("raspiname","pi","password");
end
Configure Code Generation Hardware Parameters for Raspberry Pi
Create a coder.hardware
(MATLAB Coder) object for Raspberry Pi and attach it to the code generation configuration object.
hw = coder.hardware("Raspberry Pi");
cfg.Hardware = hw;
Specify the build folder on Raspberry Pi.
buildDir = "~/remoteBuildDir";
cfg.Hardware.BuildDir = buildDir;
Copy TensorFlow Lite Model to the Target Hardware and the Current Directory
Copy the TensorFlow Lite model to the Raspberry Pi board. On the hardware board, set the environment variable TFLITE_MODEL_PATH to the location of the TensorFlow Lite model. For more information on setting environment variables, see Prerequisites for Deep Learning with TensorFlow Lite Models.
Use the putFile method of the raspi object to copy the TFLite model to the Raspberry Pi.
putFile(r,char(modelFullPath),'/home/pi')
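One possible way to set the TFLITE_MODEL_PATH environment variable on the board is to append an export statement to the pi user's shell startup file over the existing raspi connection. This is only a sketch; the /home/pi location and the use of ~/.bashrc are assumptions, so follow the procedure in Prerequisites for Deep Learning with TensorFlow Lite Models if your setup differs.
% Sketch: define TFLITE_MODEL_PATH for subsequent shell sessions on the board.
% Assumes the model was copied to /home/pi (see putFile above) and that the
% target application picks up variables exported from ~/.bashrc.
system(r,'grep -q TFLITE_MODEL_PATH ~/.bashrc || echo "export TFLITE_MODEL_PATH=/home/pi" >> ~/.bashrc');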
Copy the model to the current directory because codegen (MATLAB Coder) requires it during code generation.
copyfile(modelFullPath)
Generate PIL MEX
Use coder.Constant (MATLAB Coder) to make the constant input arguments compile-time constants in the generated code. Run the codegen (MATLAB Coder) command to generate a PIL MEX function, predictAudioClassUsingYAMNET_pil.
codegen -config cfg predictAudioClassUsingYAMNET -args {ones(1,15600,"single"), coder.Constant(audioClassBufferLength), coder.Constant(indexOfSilenceAudioClass)} -silent
### Connectivity configuration for function 'predictAudioClassUsingYAMNET': 'Raspberry Pi'
Predict Audio Class on Raspberry Pi Using PIL Workflow
You call the generated PIL function predictAudioClassUsingYAMNET_pil to stream one audio frame at a time to represent the system as it would be deployed in a real-time embedded system.
show(timeScope)
while ~isDone(afr)
    audioInFrame = afr();
    resampledAudioInFrame = src(audioInFrame);
    write(audioBufferYamnet,resampledAudioInFrame);
    audioInYamnetFrame = read(audioBufferYamnet,frameLength,overlapLength);
    predictedSoundClassIndex = predictAudioClassUsingYAMNET_pil(single(audioInYamnetFrame'),audioClassBufferLength,indexOfSilenceAudioClass);
    predictedSoundClass = audioEventClasses(predictedSoundClassIndex);
    timeScope(resampledAudioInFrame)
    timeScope.Title = char(predictedSoundClass);
    drawnow
end
### Starting application: 'codegen\lib\predictAudioClassUsingYAMNET\pil\predictAudioClassUsingYAMNET.elf' To terminate execution: clear predictAudioClassUsingYAMNET_pil ### Launching application predictAudioClassUsingYAMNET.elf...
hide(timeScope)
Terminate the PIL execution.
clear predictAudioClassUsingYAMNET_pil
### Host application produced the following standard output (stdout) and standard error (stderr) messages:
Evaluate Raspberry Pi Execution Time
You use the PIL workflow to profile the predictAudioClassUsingYAMNET function. Enable profiling in the code generation configuration and generate the PIL function that keeps a log of the execution profile.
cfg.CodeExecutionProfiling = true;
codegen -config cfg predictAudioClassUsingYAMNET -args {ones(1,15600,"single"), coder.Constant(audioClassBufferLength), coder.Constant(indexOfSilenceAudioClass)} -silent
### Connectivity configuration for function 'predictAudioClassUsingYAMNET': 'Raspberry Pi'
You call the generated PIL function multiple times to get the average execution time.
numCalls = 100;
for k = 1:numCalls
    x = pinknoise(1,15600,"single");
    scores = predictAudioClassUsingYAMNET_pil(x,audioClassBufferLength,indexOfSilenceAudioClass);
end
### Starting application: 'codegen\lib\predictAudioClassUsingYAMNET\pil\predictAudioClassUsingYAMNET.elf' To terminate execution: clear predictAudioClassUsingYAMNET_pil ### Launching application predictAudioClassUsingYAMNET.elf... Execution profiling data is available for viewing. Open Simulation Data Inspector. Execution profiling report available after termination.
Terminate the PIL execution.
clear predictAudioClassUsingYAMNET_pil
### Host application produced the following standard output (stdout) and standard error (stderr) messages: Execution profiling report: coder.profile.show(getCoderExecutionProfile('predictAudioClassUsingYAMNET'))
Generate an execution profile report to evaluate execution time.
executionProfile = getCoderExecutionProfile('predictAudioClassUsingYAMNET');
report(executionProfile, ...
    'Units','Seconds', ...
    'ScaleFactor','1e-03', ...
    'NumericFormat','%0.4f');
In the code execution profiling report, you find that the average execution time taken by predictAudioClassUsingYAMNET is 24.29 ms, which is within the budget of 100 ms. You calculate the budget as the reciprocal of the classification rate. The performance is measured on a Raspberry Pi 3 Model B Plus Rev 1.2.
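As a quick check, you can compute this budget from the classificationRate defined earlier:
% The per-frame time budget is the reciprocal of the classification rate.
budgetInMilliseconds = 1000/classificationRate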
Release the buffers, the time scope, and the other System objects used in this example.
release(audioBufferYamnet)
release(audioClassBuffer)
release(timeScope)
release(src)
release(afr)
See Also
yamnet (Audio Toolbox) | classifySound (Audio Toolbox)