Make sure you have the latest version of the packages:
install.packages("bioacoustics") # The bioacoustics package may also be installed from GitHub using devtools as follows: install.packages("devtools") devtools::install_github("wavx/bioacoustics") # For the latest, unstable version install.packages("warbleR") install.packages("randomForest")
# Load the packages library(warbleR) library(bioacoustics) library(tools) library(randomForest)
We can use the
quer_xc() function from warbleR to download bird vocalizations from Xeno-Canto: https://www.xeno-canto.org/
We are going to choose calls from Catharus bicknelli, songs from Passerella iliaca and Setophaga magnolia recorded in the United States and Canada.
We will filter only “A” quality recordings, then, pick up only the first nine, and merge all the metadata into a single “df” data frame. This data frame will be used to download MP3 files in your working directory directly from Xeno-Canto with the
df1 = quer_xc(qword ='Catharus bicknelli type:call cnt:"United States"', download = FALSE) df1 = df1[df1$Vocalization_type=="call",] df1 = df1[df1$Quality=="A",] df1 = df1[1:9,]
df2 = quer_xc(qword ='Setophaga magnolia type:song cnt:"Canada"', download = FALSE) df2 = df2[df2$Quality=="A",] df2 = df2[1:9,]
df3 = quer_xc(qword ='Passerella iliaca type:song cnt:"Canada"', download = FALSE) df3 = df3[df3$Vocalization_type=="song",] df3 = df3[df3$Quality %in% c("A", "B"),] df3 = df3[1:9,] df = rbind(df1,df2,df3) rm(df1,df2,df3)
# Visualize your data frame View(df) # We will work in the R temp directory wd <- tempdir() # Create a data directory if it does not exist data_dir <- file.path(wd, "data") if(!dir.exists(data_dir)) dir.create(data_dir) # Download the MP3 files into your data directory quer_xc(X = df, download = TRUE, path = data_dir)
Now that we have recordings stored in the data directory, we can read one of them to look at its structure and content. Let's use the
read_audio() function in bioacoustics to manually read a recording of Catharus bicknelli:
CATBIC <- read_audio(file.path(data_dir, "Catharus-bicknelli-54864.mp3")) CATBIC
We can see that the MP3 file has been converted into a Wave object with 7551360 samples, a duration of 157.32 seconds, a sampling rate of 48000 Hz, a bit depth of 16 bits, and that it contains one channel (mono, stereo being two channels).
Remember that you just have to divide the number of samples by the sampling rate to retrieve the duration (s) of an audio file.
GUANO stands for the “Grand Unified Acoustic Notation Ontology” and means to be a universal, extensible, open metadata format for bat (ultrasonic) and non-bat (audible, infrasonic) acoustic recordings more here. GUANO format is now embeded directly in the WAV files generated from most Pettersson, Wildlife Acoustics, Titley Scientific acoustic recorders. It is possible to extract the metadata and GUANO embedded in the WAV file by using the
Now that we have explored a WAV file, we will use the Fast Fourrier Transform (FFT) to compute a frequency-time representation of the recording, called a spectrogram. A spectrogram being the representation of the spectrum of frequencies in a recording as they vary through time. This representation, although not optimal, is still commonly used to detect animal vocalizations and extract acoustic features useful for classification with the purpose of animal identification.
There are several options to display animal vocalizations in audio files with R. You can use both
fspec() functions to generate spectrograms with bioacoustics.
fspec() generates only a matrix of the spectrogram, and thus has to be used with the
image() function to display the spectrogram. It is also possible to use the
spectro() function in Seewave.
Next, we will search manually and display an audio event (here, a bird vocalization) from a recording of Catharus bicknelli. To display a Region Of Interest (ROI) of the recording we will use temporal and frequency filters. Let's start with a temporal slice from 1 to 10 secs and a FFT size of 512 samples.
# Set plot margins to 0 par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0)) # Display with spectro() ticks <- c(from = 1000, to = 20000, by = 1000) # frequency tick marks from 1 to # 20 kHz, and steps at each 1 kHz temp_slice <- c(from = 1, to = 10) # in seconds spectro(CATBIC, tlim = temp_slice, FFT_size = 512, ticks_y = ticks)
Let's display spectrograms with various time / frequency limits (with
flim= arguments). You can also play with other arguments in
fspec() functions such as the percent of overlap between two FFTs (with
FFT_overlap=) and various FFT resolutions (with
FFT_size=). Note that the arguments are briefly explained in the documentation of each function:
# Access the arguments of the spectro function ?spectro ?fspec
First, let's shorten the temporal axis from 2 to 3.5 secs to work on a shorter time window and compare the outputs from
fspec() functions. Note that spectrograms can also be generated automatically while using the detection functions from bioacoustics. We will explore that in details in section 4.1.
# Set plot margins to 0 par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0)) # Display the spectrogram with spectro() ticks <- c(from = 1000, to = 20000, by = 1000) # frequency tick marks from 1 to # 20 kHz, 1 kHz steps temp_slice <- c(from = 2, to = 3.5) # in seconds spectro(CATBIC, tlim = temp_slice, FFT_size = 512, ticks_y = ticks) # fspec() gives you the spectrogram matrix with energy values (dB) spec_mx <- fspec(CATBIC, tlim = temp_slice, FFT_size = 512, rotate = TRUE) # You can display the spectrogram with image() image(spec_mx, xaxt = "n", yaxt = "n")
The tick marks on the (frequency) y-axis were defined in the
spectro() function starting from 1 to 20 kHz with an interval at each 1 kHz. The FFT size was 512 samples with an overlap between two FFT windows set by default at 0.875. Now try these settings: FTT size = 256, 1024 and 2048; FFT overlap = 0.3, 0.6, 0.9…
Another interesting thing to perform with the fspec outputs, is to implement your own set of filters. Let's try to reduce the background noise from a spectrogram with a narrower time and frequency bandwidth:
temp_slice <- c(from = 2.5, to = 3.5) freq_slice <- c(from = 1500, to = 20000) spec_o <- fspec(CATBIC, tlim = temp_slice, flim = freq_slice, FFT_size = 512, rotate = TRUE) ## min and max (range) dB intensity range(spec_o) # -120 (min) to 0 dB (max) # Note that the tolerance of your recorders depends on the number of bits. # 16-bit recorders offer only around -96 dB tolerance and sound pressure above # this level is clipped to 0 dB. ## Let's try a filter by mean + sd intensity spec_f <- fspec(CATBIC, tlim = temp_slice, flim = freq_slice, FFT_size = 512, rotate = TRUE) spec_f[spec_f < mean(spec_f) + sd(spec_f)] <- -120 # Works well with high intensity audio events, but leads to # false negatives (missed events) otherwise. par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0)) image(spec_o, xaxt="n", yaxt="n") image(spec_f, xaxt="n", yaxt="n")
The functions used to detect and extract audio events in a recording also rely on “generic” filters based on frequency and duration, along with other “specific” filters. Let's take a quick look at the generic filters available in the
LPF=arguments in the
TBE=arguments in both
Let's start with the
threshold_detection() function on a recording containing calls from Catharus bicknelli. This function is an amplitude threshold detector that picks up audio events above the Signal to Noise Ratio (SNR). It combines several algorithms for detection, filtering and audio feature extraction. We will play with the arguments of this function to understand their implication in the detection and extraction of audio events (here, calls of Catharus bicknelli).
# Access the arguments of the threshold_detection function ?threshold_detection
# Set each argument according to the targeted audio events TD <- threshold_detection( CATBIC, # Either a path to an audio file (see ?read_audio), or a Wave object threshold = 12, # 12 dB SNR sensitivity for the detection algorithm time_exp = 1, # Time expansion factor of 1. Only needed for bat recordings. min_dur = 140, # Minimum duration threshold of 140 milliseconds (ms) max_dur = 440, # Maximum duration threshold of 440 ms min_TBE = 10, # Minimum time window between two audio events of 10 milliseconds max_TBE = 5000, # Maximum time window between two audio events, here 5 seconds EDG = 0.996, # Temporal masking with Exponential Decay Gain from 0 to 1 LPF = 10000, # Low-Pass Filter of 10 kHz HPF = 1000, # High-Pass Filter of 1 kHz FFT_size = 256, # Size of the Fast Fourrier Transform (FFT) window FFT_overlap = 0.875, # Percentage of overlap between two FFT windows start_thr = 25, # 25 dB threshold at the start of the audio event end_thr = 30, # 30 dB threshold at the end of the audio event SNR_thr = 10, # 10 dB SNR threshold at which the extraction of the audio event stops angle_thr = 45, # 45° of angle at which the extraction of the audio event stops duration_thr = 440, # Noise estimation is resumed after 440 ms NWS = 1000, # Time window length of 1 s used for background noise estimation KPE = 1e-05, # Process Error parameter of the Kalman filter (for smoothing) KME = 1e-04, # Measurement Error parameter of the Kalman filter (for smoothing) settings = FALSE, # Save on a list the above parameters set with this function acoustic_feat = TRUE, # Extracts the acoustic and signal quality parameters metadata = FALSE, # Extracts on a list the metadata embedded with the Wave file spectro_dir = file.path(tempdir(), "Spectros"), # Directory where to save the spectrograms time_scale = 1, # Time resolution of 2 ms for spectrogram display ticks = TRUE # Tick marks and their intervals are drawn on the y-axis (frequencies) ) # Get the number of extracted audio events nrow(TD$data$event_data)
Let the HTML page open with the 57 spectrograms (each representing an extracted audio event). These settings will be our benchmark for the number of audio events that can be extracted with the
threshold_detection() function. In the following exercise, you will try to reach or beat this number by exploring different combinations of parameters for each argument of the function.
# Let's try various settings, starting with 1024 FFT size instead of 256. TD <- threshold_detection( CATBIC, threshold = 12, time_exp = 1, min_dur = 140, max_dur = 440, min_TBE = 10, max_TBE = 5000, EDG = 0.996, LPF = 10000, HPF = 1000, FFT_size = 1024, FFT_overlap = 0.875, start_thr = 25, end_thr = 30, SNR_thr = 10, angle_thr = 45, duration_thr = 440, NWS = 1000, KPE = 1e-05, KME = 1e-04, settings = FALSE, acoustic_feat = TRUE, metadata = FALSE, spectro_dir = file.path(tempdir(), "Spectros"), time_scale = 1, ticks = c(1000, 10000, 1000) # Tick marks from 1 to 10 kHz with 1 kHz interval ) # Take a look at the spectrograms and compare them with the previous extraction. nrow(TD$data$event_data) # Only three audio events!
We will play with various detection thresholds: end_thr, SNR_thr, angle_thr, KPE and KME parameters. Try to reach 66 spectrograms extracted with a contour (i.e., Kalman curve) that best matches the audio event (answer below). The FFT size will be set at 256 samples.
CATBIC <- read_audio(file.path(data_dir, "Catharus-bicknelli-54864.mp3")) TD <- threshold_detection( CATBIC, threshold = 12, time_exp = 1, min_dur = 140, max_dur = 440, min_TBE = 10, max_TBE = Inf, EDG = 0.996, LPF = 10000, HPF = 1000, FFT_size = 256, FFT_overlap = 0.875, start_thr = 22, end_thr = 30, SNR_thr = 10, angle_thr = 125, duration_thr = 440, NWS = 1000, KPE = 1e-05, KME = 1e-05, settings = FALSE, acoustic_feat = TRUE, metadata = FALSE )
Let's take a look at the extracted audio features. Note that all the features are described and explained in the package vignette (
# Acoustic features are stored in a data frame called event_data, # stored by order of detection. View(TD$data$event_data) # Contains the filename and the time of detection in the # recording, and 26 extracted features.
The location (in number of samples) of the audio event in the recording is saved in a list.
# Start and end of the 5th extracted audio event (in samples) c(TD$data$event_start[], TD$data$event_end[]) # Remember you just have to divide by the sample rate to retrieve the time (s) c(TD$data$event_start[], TD$data$event_end[]) / slot(CATBIC, "samp.rate")
The amplitude (dB) and frequency (Hz) tracks (or bins) are also saved in a list. These can be used to build your own acoustic features.
par(mar = c(1,1, 1, 1), oma = c(1, 1, 1, 1)) # Amplitude track of the 5th audio event plot(TD$data$amp_track[], type = "l") # Frequency track of the 5th audio event plot(TD$data$freq_track[], type = "l")
The whole energy and frequency content can also be used to classify audio events instead of using acoustic features that may result in a loss of information. We will get there soon, but first, let's discover another detection function, here applied on echolocation calls of bats.
blob_detection() function will be used on a recording containing 10 bat echolocation calls from the Myotis genus. This function combines several image processing, filtering and image feature extraction. A blur and contrast boost is applied after mean background subtraction to increase the SNR of the audio event. The blob detection algorithm is applied on the processed spectrogram to detect the ROI (i.e., each preprocessed audio event). The blob detector simultaneously labels the connected FFT values and their contours in the spectrogram. Labelling is done in a single pass over the spectrogram, while contour points are revisited more than once and up to four times (see Chang et al., 2004). We will play with the arguments of this function to extract bat echolocation calls.
# Access the arguments of the blob_detection function ?blob_detection
# Use the bat recording stored in the package data(myotis) # Set each argument according to the targeted audio events BD <- blob_detection( myotis, # Either a path to an audio file (see ?read_audio), or a Wave object time_exp = 10, # Time expansion factor of 10 for time expanded recordings. min_dur = 1.5, # Minimum duration threshold of 1.5 milliseconds (ms) max_dur = 80, # Maximum duration threshold of 80 ms min_area = 40, # minimum number of 40 pixels in the blob min_TBE = 20, # Minimum time window between two audio events of 20 milliseconds EDG = 0.996, # Temporal masking with Exponential Decay Gain from 0 to 1 LPF = slot(myotis, "samp.rate") * 10 / 2, # Low-Pass Filter at the Nyquist frequency HPF = 16000, # High-Pass Filter of 16 kHz FFT_size = 256, # Size of the Fast Fourrier Transform (FFT) window FFT_overlap = 0.875, # Percentage of overlap between two FFT windows blur = 2, # Gaussian smoothing function with a factor of 2 for blurring the spectrogram bg_substract = 20, # Foreground extraction with a mean filter applied on the spectrogram contrast_boost = 20, # Edge contrast enhancement filter of the spectrogram contour settings = FALSE, # Save on a list the above parameters set with this function acoustic_feat = TRUE, # Extracts the acoustic and signal quality parameters metadata = FALSE, # Extracts on a list the metadata embedded with the Wave file spectro_dir = file.path(tempdir(), "Spectros"), # HTML page with spectrograms by order of detection time_scale = 0.1, # Time resolution of 2 ms for spectrogram display ticks = TRUE # Tick marks and their intervals are drawn on y-axis (frequencies) ) # Get the number of extracted audio events nrow(BD$data$event_data)
Do not close the HTML page and tune the FFT size at 512. Let's play with the blur, contrast boost and background subtraction parameters to retrieve a number of 10 extracted echolocation calls.
# Let's try various settings, starting with 512 FFT size instead of 256. BD <- blob_detection( myotis, time_exp = 10, FFT_size = 512, settings = FALSE, acoustic_feat = TRUE, metadata = FALSE, spectro_dir = file.path(tempdir(), "Spectros"), time_scale = 0.1, ticks = TRUE ) # Take a look at the spectrograms and compare them with the previous extraction. nrow(BD$data$event_data) # Only 6 audio events!
Let's take a look at the extracted audio features. All the features are described and explained in the package vignette.
# Acoustic features head(BD$data)
This data frame is, for now, the only available set of acoustic features with the
blob_detection() function. However, it combines well with the
fspec() to make image analysis.
Now that we have played with both detection functions with bird and bat vocalizations, let's go back to birds to explore batch analysis (i.e., with several recordings) and audio event classification.
In this section, we will learn how to analyze several recordings at the same time and train a simple classifier (with training set) that will be used to classify new data (i.e., the test set).
We will work with 27 recordings of Catharus-bicknelli (n = 9), Passerella iliaca (n = 9), and Setophaga-magnolia (n = 9). We will split the extracted audio events in a 70 % training set (called “Train”) and 30 % test set (called “Test”).
Our target audio events are calls of Catharus-bicknelli. We will use the threshold detector previously configured for this species (see section 4.1.1).
# Get the filepath for each MP3 file files <- dir(data_dir, recursive = TRUE, full.names = TRUE, pattern = "[.]mp3$") # Detect and extract audio events TDs <- setNames( lapply( files, threshold_detection, threshold = 12, min_dur = 140, max_dur = 440, min_TBE = 50, max_TBE = Inf, LPF = 8000, HPF = 1500, FFT_size = 256, start_thr = 30, end_thr = 20, SNR_thr = 10, angle_thr = 125, duration_thr = 400, spectro_dir = NULL, NWS = 2000, KPE = 0.00001, time_scale = 2, EDG = 0.996 ), basename(file_path_sans_ext(files)) ) # Keep only files with data in it TDs <- TDs[lapply(TDs, function(x) length(x$data)) > 0] # Keep the extracted feature and merge in a single data frame for further analysis Event_data <- do.call("rbind", c(lapply(TDs, function(x) x$data$event_data), list(stringsAsFactors = FALSE))) nrow(Event_data) # 355 audio events extracted # Compute the number of extracted CATBIC calls sum(startsWith(Event_data$filename, "Cat")) # Add a "Class" column: "CATBIC" vs. other species of birds "OTHERS" classes <- as.factor(ifelse(startsWith(Event_data$filename, "Cat"), "CATBIC", "OTHERS")) Event_data <- cbind(data.frame(Class = classes), Event_data) # Get rid of the filename and time in the recording Event_data$filename <- Event_data$starting_time <- NULL
We now have the necessary dataset to train a classifier: we will train a Random Forest on the training set and validate the results on the test set.
# Split the data in 60% Training / 40% Test sets train <- sample(1:nrow(Event_data), round(nrow(Event_data) * .6)) Train <- Event_data[train,] test <- setdiff(1:nrow(Event_data), train) Test <- Event_data[test,] # Train a random forest classifier set.seed(666) rf <- randomForest(Class ~ duration + freq_max_amp + freq_max + freq_min + bandwidth + freq_start + freq_center + freq_end + freq_knee + fc + freq_bw_knee_fc + bin_max_amp + pc_freq_max_amp + pc_freq_max + pc_freq_min + pc_knee + temp_bw_knee_fc + slope + kalman_slope + curve_neg + curve_pos_start + curve_pos_end + mid_offset + smoothness + snr + hd + smoothness, data = Train, importance = FALSE, proximity = FALSE, replace = TRUE, ntree = 4000, mtry = 4) # Look at the confusion matrix of the training set rf$confusion # looks good, but... # Let's make predictions with our classifier on a test set table(Test[,1], predict(rf, Test[,-1], type = "response")) # not bad! # To look at the predictions head(predict(rf, Test[,-1], type = "prob"))
We are now able to use this simple, but proven robust, classifier to detect new calls of your target species.
We will use Keras in R which requires to install several packages in Python Guidelines to install Keras properly in R are available here
Let's now explore a ConvNet approach available on Keras. We will follow the approach of Hatami et al. (2017) to analyze time series as images with 2D ConvNets. The difference is that we will only perform max pooling at the last layer before activation and add batch normalization with dropouts at each layer.
# Run if keras is installed on your machine library(keras) # Build the training set Y_train <- to_categorical(as.integer(Train[,1]) - 1) # One hot encoding # X as matrix X_train <- as.matrix(Train[,-1]) # Build the test set Y_test <- to_categorical(as.integer(Test[,1]) - 1) Y_test <- Y_test[,-1] X_test <- as.matrix(Test[,-1]) # Build the sequential model mod0 <- keras_model_sequential() mod0 %>% # Input shape layer = c(samples, rows, cols, channels) layer_reshape(input_shape=ncol(X_train),target_shape=c(1,1,ncol(X_train))) %>% # First conv 2d layer with 128 neurons, kernel size of 8 x 8 and stride of 1 x 1 layer_conv_2d(128, c(8,8), c(1,1), padding='same') %>% layer_batch_normalization() %>% layer_activation("relu") %>% layer_dropout(0.2) %>% # Second conv 2d layer with 256 neurons, kernel size of 5 x 5 and stride of 1 x 1 layer_conv_2d(256, c(5,5), c(1,1), padding='same') %>% layer_batch_normalization() %>% layer_activation("relu") %>% layer_dropout(0.2) %>% # Third conv 2d layer with 128 neurons, kernel size of 3 x 3 and stride of 1 x 1 layer_conv_2d(128, c(3,3), c(1,1), padding='same') %>% layer_batch_normalization() %>% layer_activation("relu") %>% layer_dropout(0.2) %>% # Average pooling layer layer_global_average_pooling_2d() %>% # Activation output layer with 2 classes layer_dense(units = ncol(Y_train), activation='softmax') # Model compile mod0 %>% compile(loss = 'categorical_crossentropy', optimizer = "adam", metrics = "categorical_accuracy") # Add a callback to reduce the learning rate when reaching the plateau reduce_lr <- callback_reduce_lr_on_plateau(monitor = 'loss', factor = 0.5, patience = 50, min_lr = 0.0001) # Start learning mod0 %>% fit(X_train, Y_train, batch_size = 32, epochs = 50, validation_data = list(X_test, Y_test), verbose = 1, callbacks = reduce_lr) # Score on the test set score <- mod0 %>% evaluate(X_test, Y_test, batch_size = 32) score
Let's work a bit with the output to build a confusion matrix and use the predict function on the test set.
# Look at predictions and build a confusion matrix Pred <- as.factor(predict_classes(mod0, X_test, batch_size = 32, verbose = 1)) table(Y_test[,2], Pred) # To look at the prediction values Prob <- round(predict_proba(mod0, X_test, batch_size = 32, verbose = 1), 2)
We obtained a val_loss < 0.2 and val_categorical_accuracy > 0.94 which is acceptable, but not better than the simplest RF approach we used in section 3.2. Using only 26 acoustic features as model inputs instead of the whole spectrogram content (energy and frequency distribution, and harmonics) probably reduced the performances of the CNN model.
This tutorial is now complete. Comments and feedback are welcome:
Francois: [email protected]
Jean: [email protected]