Published Sep 28, 2018
Speech data available for model development usually falls into one of two categories: 1) a small amount of meticulously annotated speech (often not publicly available) and 2) a large amount of mostly unlabeled speech (sometimes publicly available). My experiment below explores the use of Voxforge’s publicly available, free speech data, including its simple annotations of that speech.
Voxforge hosts several large, publicly available speech databases for multiple languages. Each database includes annotation files associated with the individual recordings. I had already downloaded several speech databases with the aim of exploring how to train neural networks on MFCC data, so I decided to see how I could apply the annotations to the training data for an LSTM neural network. Could I build a speech-to-text model?
Generally speaking, I would like to see how speech recognition models can be built without aligned annotations, meaning annotations that do not exactly correspond to the timing of speech sounds. Models today are built either on relatively small amounts of professionally annotated speech data or on large amounts (i.e. big data) of unannotated speech data. Both have their role in model development and can be used together to build more complex and telling models (this blog explores this topic in regards to text analysis; this paper by Schuller, 2015 reviews the data available today and ways to best analyze them).
I would like to know whether speech data with only approximate annotations contributes anything useful to the analysis of speech. If so, this could increase the value of speech data that has not been professionally annotated. That was the purpose of my experiment with Voxforge’s English speech database. While I would ultimately like to build multilingual models, I first explored whether I could develop working models with one language. Click here for the repository.
The first step in collecting the data I needed was to understand the architecture of the database(s), specifically where the annotations and corresponding wave files were stored. Once I determined a pattern, I wrote a script that unpacked each archive individually, extracted the data I needed, and then deleted the unpacked contents. I stored the annotations and the MFCC data in the same database but in different tables, using the archive and wave file names as keys to match each annotation with the right set of MFCC values.
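To give a concrete picture of that collection step, here is a minimal sketch of walking a folder of Voxforge .tgz archives, unpacking each one, locating the prompt and wave files, and cleaning up afterwards. The folder name and the ‘etc/PROMPTS’ / ‘wav/’ layout are assumptions for illustration; the actual functions live in ‘collect_speechdata.py’.

```python
import glob
import os
import shutil
import tarfile

# Hypothetical layout: a folder of Voxforge .tgz archives, each containing
# an 'etc/PROMPTS' annotation file and a 'wav/' folder of recordings.
ARCHIVE_DIR = "voxforge_archives"
TMP_DIR = "tmp_extract"

for tgz_path in glob.glob(os.path.join(ARCHIVE_DIR, "*.tgz")):
    with tarfile.open(tgz_path, "r:gz") as tar:
        tar.extractall(TMP_DIR)  # unpack one archive at a time

    # Find the annotation file and the wave files inside the unpacked archive.
    prompts = glob.glob(os.path.join(TMP_DIR, "*", "etc", "PROMPTS"))
    wavs = glob.glob(os.path.join(TMP_DIR, "*", "wav", "*.wav"))

    # ... extract the annotations and MFCC data here, keyed by archive
    # and wave file name so they can be matched up later in SQL ...
    print(os.path.basename(tgz_path), len(wavs), "recordings")

    shutil.rmtree(TMP_DIR)  # delete the unpacked contents before the next archive
```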
The data I extracted from these files, which I will explain further below, included 1) international phonetic alphabet (IPA) representations of the annotations and 2) MFCC data representing the audio data.
As I extracted the written annotation of each recording, I transcribed it into characters of the IPA using Espeak, an open source speech synthesizer. I wanted the transcripts in IPA because the IPA was developed to represent the sounds of all languages. In a language’s own alphabet, a single sound is often represented by several letters. For example, in German, sch - pronounced sh as in shoe - and dsch - pronounced j as in jungle - are sounds spelled with many letters that IPA can represent with fewer characters: ʃ and dʒ, respectively.
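As a rough illustration, the transcription step could look something like the snippet below, which shells out to espeak. The --ipa flag is available in recent espeak/espeak-ng builds, and the exact symbols returned vary with the voice and version.

```python
import subprocess

def text_to_ipa(text, lang="en"):
    """Return espeak's IPA transcription of `text` (no audio, just the string).
    Assumes an espeak/espeak-ng build that supports the --ipa flag."""
    result = subprocess.run(
        ["espeak", "-q", "--ipa", "-v", lang, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Exact symbols vary with the voice and espeak version.
print(text_to_ipa("forever young"))
```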
I also extracted MFCC data from each recording. While extracting, I applied my voice activity detection (VAD) function to avoid collecting MFCC data during the beginning and ending silences of recordings. Note: I extracted the MFCC data twice, once with added noise, as I found that aided model generalizability, and once without added noise, mainly for comparison.
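Here is a minimal sketch of that extraction, using librosa for the MFCCs; librosa.effects.trim stands in for my own VAD function, and the noise level is just an illustrative value.

```python
import librosa
import numpy as np

def wav_to_mfcc(wav_path, n_mfcc=40, add_noise=False, noise_level=0.005):
    """Return the MFCC rows (frames x n_mfcc) for one recording.
    librosa.effects.trim stands in here for the VAD function, which
    skipped the silences at the beginning and end of each recording."""
    y, sr = librosa.load(wav_path, sr=16000)
    y, _ = librosa.effects.trim(y)                      # drop leading/trailing silence
    if add_noise:
        y = y + noise_level * np.random.randn(len(y))   # light white noise for generalizability
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                       # one row per frame, n_mfcc features per row
```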
The data were collected with Python scripts and saved in a SQL database. Because this experiment was meant to serve as a test round, I collected only 100 recordings, which worked out to approximately 5 minutes of speech data. For the relevant code, please see ‘collect_speechdata.py’ for the functions, ‘sql_data.py’ for how I saved/loaded data, and ‘collect_IPA_MFCC_English_tgz.py’ for the main module.
The IPA and MFCC data needed to be paired so that the trained model could classify the MFCC data as particular IPA characters. Given my experience with both simple artificial (ANN) and long short-term memory (LSTM) neural networks, I knew I wanted to feed the data to an LSTM network, so I prepared the data to be fed to the network in sequences. The model would be more likely to learn IPA characters if it trained on short sequences of MFCC data, each capturing a small stretch of speech sound. For further clarification on this topic, please refer to this post.
Furthermore, for good practice, I randomly assigned each recording sample/annotation to the training, validation, or test dataset, with a ratio of 6-2-2, respectively.
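A simple way to make that assignment (the seed and thresholds here are illustrative):

```python
import random

def assign_dataset(sample_ids, seed=40):
    """Randomly assign each recording/annotation pair to a dataset with a 6-2-2 ratio."""
    random.seed(seed)
    assignments = {}
    for sample_id in sample_ids:
        draw = random.random()
        if draw < 0.6:
            assignments[sample_id] = "train"
        elif draw < 0.8:
            assignments[sample_id] = "validation"
        else:
            assignments[sample_id] = "test"
    return assignments
```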
In order to align the IPA and MFCC data for each recording, I counted the IPA characters, which should better represent the length of the utterance than the original spelling would. I then took the number of MFCC rows linked to the annotation and divided it by the number of IPA characters. This served as a guideline for how many MFCC rows would represent each IPA character. Note: one dataset included only the IPA letters (characters that signify the sounds themselves: ‘k’, ‘ᵻ’, ‘ʃ’, ‘ɔ’, etc.) and another included the IPA letters as well as the stress markers (e.g. ‘ ˌ ‘, ‘ ˈ ‘, ‘ ː ‘, which designate secondary stress, primary stress, and long vowel length, respectively). I wanted to compare which worked better for classifying speech sounds as IPA characters.
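The guideline calculation itself is simple; here is a sketch of it, with the frame count made up for illustration:

```python
def mfcc_rows_per_ipa_char(num_mfcc_rows, ipa_string, keep_stress=True):
    """Rough alignment guideline: how many MFCC rows correspond to one IPA character.
    One condition counted the stress/length markers, the other dropped them."""
    markers = {"ˌ", "ˈ", "ː"}
    chars = [c for c in ipa_string.replace(" ", "")
             if keep_stress or c not in markers]
    return num_mfcc_rows / len(chars), chars

# 120 is a made-up frame count for one short recording.
guideline, chars = mfcc_rows_per_ipa_char(120, "fəˈrɛvər jʌŋ")
print(round(guideline, 2), len(chars))   # 10.91 11
```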
Additionally, I had to decide how many MFCC samples I would use to train the network to classify IPA characters. I decided that classifying three IPA characters at a time would be ideal: speech sounds are not produced in isolation, and each is influenced by the sounds produced before and after it. (To see for yourself, notice how the letter j (dʒ in IPA) changes when you say joke (dʒəʊk) versus judge (dʒʌdʒ). Your lips are already rounding for the j in joke, while they stay more relaxed in the latter.) This meant I would feed the LSTM network samples in sequences of MFCCs that corresponded to three IPA characters. Not every speaker speaks at the same speed, so sometimes more MFCC samples were needed to represent three IPA characters and sometimes fewer. On average, 6 MFCC samples represented three IPA characters, so I capped the MFCC sequences at 20 and zero-padded the sequences (which most were) that were shorter than that.
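Here is a sketch of that fixed-length padding step, assuming sequences of at most 20 frames and 40 features per frame:

```python
import numpy as np

def pad_mfcc_sequence(mfcc_chunk, max_len=20, n_features=40):
    """Zero-pad (or truncate) the MFCC rows for one 3-character IPA label,
    so every LSTM input sequence has the shape (max_len, n_features)."""
    padded = np.zeros((max_len, n_features))
    rows = min(len(mfcc_chunk), max_len)
    padded[:rows, :] = mfcc_chunk[:rows, :]
    return padded
```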
For the data to be processed by an LSTM network, I had to turn the label data (i.e. three IPA characters) from strings into integers. To do so, I generated a list of all possible 3-character IPA combinations, including repeated characters, which worked out to a total of 704,881 combinations (a lot of potential class labels!). I then used the index of each combination in that list as the label for the corresponding MFCC sequence.
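The mechanism looked roughly like the snippet below; the character inventory here is a tiny made-up one, whereas the real list came from the full set of IPA characters in the collected transcriptions.

```python
import itertools

def build_label_index(ipa_chars):
    """Map every possible 3-character IPA combination (repeats allowed) to an integer.
    The real character inventory came from the collected transcriptions; this one is made up."""
    combos = ["".join(c) for c in itertools.product(sorted(ipa_chars), repeat=3)]
    return {combo: idx for idx, combo in enumerate(combos)}

label_index = build_label_index(["f", "ə", "ˈ", "r", "ɛ", "v", "j", "ʌ", "ŋ"])
print(len(label_index), label_index["fəˈ"])   # 9 characters -> 729 possible labels
```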
I prepared the scripts to allow for adjustment of two key variables: 1) the IPA window shift/overlap of sounds and 2) which IPA characters were excluded from analysis. I built a total of eight models, trained on data with every combination of the following options:
1) with or without added noise to the MFCC data
2) with or without IPA stress symbols: e.g. ˌ ˈ ː
3) with or without overlap
To give you an idea of what the IPA stress symbol and window shift/overlap conditions look like, let’s take the phrase forever young as an example. Here is what the IPA transcription looks like:
Original transcription (i.e. with a space between the words):
fəˈrɛvər jʌŋ
Then with the space removed (this is how the condition with stress characters would look):
fəˈrɛvərjʌŋ
And finally without the stress character(s) (how the condition without stress markers would look):
fərɛvərjʌŋ
Let’s have a look at how the IPA labels would look with overlap vs. no overlap (for the condition with IPA stress).
‘3-letter labels’ with overlap:
fəˈ, əˈr, ˈrɛ, rɛv, ɛvə, vər, ərj, rjʌ, jʌŋ
Without overlap:
fəˈ, rɛv, ərj
Every label required a full 3-character representation; if three characters were not available at the end of the utterance, those characters and their MFCC data were not included in the training (hence the disappearance of ʌŋ in the non-overlap example above).
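A minimal sketch of that windowing logic (the real version is in ‘batch_prep.py’ and may differ in details):

```python
def three_char_labels(ipa_string, overlap=True):
    """Slice an IPA transcription into 3-character labels, with or without overlap.
    Trailing characters that cannot fill a full 3-character window are dropped."""
    ipa = ipa_string.replace(" ", "")
    step = 1 if overlap else 3
    return [ipa[i:i + 3] for i in range(0, len(ipa) - 2, step)]

print(three_char_labels("fəˈrɛvər jʌŋ", overlap=True))
# ['fəˈ', 'əˈr', 'ˈrɛ', 'rɛv', 'ɛvə', 'vər', 'ərj', 'rjʌ', 'jʌŋ']
print(three_char_labels("fəˈrɛvər jʌŋ", overlap=False))
# ['fəˈ', 'rɛv', 'ərj']
```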
To see the code I wrote to prepare the data, please refer to ‘batch_prep.py’ for the functions and ‘combine_align_ipa_mfcc_data.py’ for the main module.
Even though I only used 5 minutes of speech data, that worked out to be 26,280 samples of MFCC data for the condition without overlap and 76,640 samples for the condition with overlap. Keep in mind that, because I used 40 MFCCs, each sample had 40 features. I could not feed the data to the LSTM model all at once, so I developed a generator function that fed the network 20 sequences of MFCC data at a time, each sequence corresponding to the MFCC samples for one 3-character IPA label (see Table 3). To see the code relevant to building the LSTM, please refer to the script ‘batch2model.py’, which imports the ‘sql_data.py’ and ‘batch_prep.py’ scripts mentioned earlier.
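For orientation, here is a rough sketch of what a generator-plus-LSTM setup could look like in Keras; the layer size, batch size, and training arguments are placeholders, and the actual model is defined in ‘batch2model.py’.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

def batch_generator(mfcc_sequences, labels, batch_size=20):
    """Yield small batches of fixed-length MFCC sequences and their integer IPA labels,
    so the whole dataset never has to sit in memory at once."""
    while True:
        for start in range(0, len(mfcc_sequences), batch_size):
            x = np.array(mfcc_sequences[start:start + batch_size])   # (batch, 20, 40)
            y = np.array(labels[start:start + batch_size])           # integer label indices
            yield x, y

num_labels = 704881                              # every possible 3-character IPA combination
model = Sequential()
model.add(LSTM(120, input_shape=(20, 40)))       # 20 MFCC frames x 40 features per sequence
model.add(Dense(num_labels, activation="softmax"))   # very large output layer; one reason to consider an embedding later
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
# model.fit_generator(batch_generator(train_x, train_y), steps_per_epoch=..., epochs=...)
```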
At this stage of experimentation, I used only the training and test datasets. If I continue experimenting with this type of data, I will also make use of the validation dataset.
The most striking differences in accuracy were between the models trained with the stress characters and those trained without them. Otherwise, the accuracy rates remained at approximately the same levels.
Clearly, 27-28% accuracy does not look very high; however, when you consider how many labels the models had to choose from when classifying the MFCC samples (704,881, to be exact), that percentage looks a lot more impressive. It is interesting to compare the accuracy of the models trained on MFCC data aligned with IPA characters including stress markers against the accuracy of the models aligned with IPA characters only, especially since the stress markers themselves were not included in the alignment stage. To better understand what this accuracy means, I would like to apply the models to brand-new data and see how they classify that speech.
I would be curious to further explore training models with these loosely aligned speech and annotation data; however, the computational cost of such development is high. To apply this technique to all of the available English data on Voxforge, the MFCC extraction alone would take at least 9 hours, and the subsequent steps of MFCC and IPA alignment and LSTM training would likely need several days. Memory costs would also be very high. I would first rerun this experiment using only the 2nd-13th coefficients as features instead of all 40 MFCCs, as these are the most relevant to speech sounds. Cutting the number of coefficients from 40 to 12 would likely reduce both computation and memory costs.
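Selecting that subset is a one-line slice on the MFCC rows; a small example, with a random stand-in array:

```python
import numpy as np

mfcc_rows = np.random.randn(120, 40)   # stand-in for one recording's 40-coefficient MFCC rows
reduced = mfcc_rows[:, 1:13]           # keep only the 2nd-13th coefficients (12 features per frame)
print(reduced.shape)                   # (120, 12)
```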
If I did explore a model further, I would likely choose one trained on MFCC data with added noise and aligned with IPA characters including stress markers. In past experience with neural networks, I have found that models trained on data with added noise generalize better to real-world data. Also, even though there was a slight increase in accuracy for the models trained with overlapping MFCC data, that increase is not large enough to justify more than doubling the amount of data required.
In sum, the findings of this small experiment suggest that IPA characters - when used with their suprasegmental markers - can be used to create loose alignments with speech productions and result in potentially reliable training data. The findings also speak to the reliability of Espeak’s software, which ‘translated’ the English text into IPA characters. My next steps include developing an application to collect new English speech and applying these models to see which handles real-world data best. Additionally, I will explore whether adding an embedding layer might decrease memory/computation costs and improve model performance.
Blog on NLP model development with small annotated datasets vs. big data.
Schuller, B.W. (2015). Speech Analysis in the Big Data Era. TSD.
Very helpful tutorial on using Keras to build LSTM neural networks.