Teaching computers to find speech sounds

Our system records your voice and uses several different tools to get information from it. For some of the courses, we look at how vowels are pronounced by measuring their formant values. For others, we measure the pitch.

To do all of these things, we need to find where in the recording each individual sound is, which we do using a tool called a forced aligner. There are several available, but the one our system uses is the Penn Forced Aligner, created by the Linguistics Department at the University of Pennsylvania. Forced aligners take recordings of speech and a written transcript of the recording, and mark out where each word and each sound in the word occurs in the recording, like this (click on the images to hear your audio):

[Image with audio: your recording with each word and sound marked by the aligner]

Finding individual sounds and words is something your brain does all the time, without any effort on your part. In fact, once you know a language, you can't choose to hear speech in that language without understanding it! But for a computer, this is a surprisingly hard task, and it has taken many years of research to get it to work.

How does it work?

A forced aligner has two main parts. First, it has a dictionary, which maps written words to sound symbols. Here is an excerpt from the dictionary for our aligner:

SPEECH S P IY1 CH
SPEECHES S P IY1 CH AH0 Z
SPEECHLESS S P IY1 CH L AH0 S
SPEECHWRITER S P IY1 CH R AY2 T ER0
SPEECHWRITERS S P IY1 CH R AY2 T ER0 Z
SPEED S P IY1 D
SPEEDBOAT S P IY1 D B OW2 T
SPEEDBOATS S P IY1 D B OW2 T S

Each sound symbol stands for a sound, or phoneme, of English. The number after a vowel marks its stress: 1 for primary stress, 2 for secondary stress, and 0 for no stress. The first thing the aligner does is "translate" the written transcript into a list of sounds. For example, the transcript for one of our word lists begins like this:

peace
hoot
tool
hit

It is converted to:

P IY1 S
HH UW1 T
T UW1 L
HH IH1 T
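The translation step can be sketched in a few lines of Python. This is only an illustration, not the aligner's own code: the function names load_dictionary and transcript_to_sounds are made up for this example, and the small dictionary holds just three entries copied from the excerpt above.

```python
# Illustrative only: a tiny pronunciation dictionary in the same
# "WORD SOUND SOUND ..." layout as the dictionary excerpt.
DICTIONARY_TEXT = """\
SPEECH S P IY1 CH
SPEED S P IY1 D
SPEEDBOAT S P IY1 D B OW2 T
"""


def load_dictionary(text):
    """Map each upper-case word to its list of sound symbols."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            entries[parts[0]] = parts[1:]
    return entries


def transcript_to_sounds(transcript, dictionary):
    """Look up every word of the transcript, in order."""
    sounds = []
    for word in transcript.split():
        key = word.upper()
        if key not in dictionary:
            # A word with no dictionary entry cannot be translated.
            raise KeyError(f"{word!r} is not in the dictionary")
        sounds.append(dictionary[key])
    return sounds


dictionary = load_dictionary(DICTIONARY_TEXT)
for symbols in transcript_to_sounds("speed speech", dictionary):
    print(" ".join(symbols))
```

Raising an error for an unknown word is a choice made for this sketch. It reflects the fact that a word missing from the dictionary leaves the aligner with no sounds to look for.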

The aligner knows that sometimes there are pauses between words and sometimes there aren't, so between each word it considers either possibility and chooses the one that seems to match the recording best.
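To make the "pause or no pause" choice concrete, here is a small Python sketch that lists every candidate sound sequence for two words. The label "sp" for a pause and the function name candidate_sequences are assumptions made for this example. A real aligner presumably weighs these options during its search rather than writing them all out.

```python
from itertools import product

PAUSE = "sp"  # assumed label for a pause; not taken from the aligner


def candidate_sequences(words):
    """Yield every sound sequence with a pause or no pause at each word gap."""
    if not words:
        return
    gaps = len(words) - 1
    for choices in product([False, True], repeat=gaps):
        sequence = list(words[0])
        for use_pause, word in zip(choices, words[1:]):
            if use_pause:
                sequence.append(PAUSE)
            sequence.extend(word)
        yield sequence


# "peace" followed by "tool": one gap, so two candidates.
words = [["P", "IY1", "S"], ["T", "UW1", "L"]]
for sequence in candidate_sequences(words):
    print(" ".join(sequence))
```

The number of candidates doubles with every extra word gap, which is one reason listing them all is only practical for a toy example like this one.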

Once it has the list of sounds (including pauses) to look for, the aligner uses a hidden Markov model tool from HTK (the Hidden Markov Model Toolkit) to generate a best guess for where each sound begins and ends. Because it's a forced aligner, it's not allowed to give up and decide that a given sound is missing. If it's on the list, it has to be marked!
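The "forced" part can be illustrated with a toy search in Python. This is not HTK's code, and it leaves out the trained speech models a real aligner relies on. It is a simplified dynamic-programming sketch of the same idea, assuming we already have a made-up table of scores saying how well each short slice (frame) of the recording matches each sound on the list. The name force_align and the numbers are invented for this example.

```python
def force_align(scores):
    """Split frames among sounds, in order, maximizing the total score.

    scores[t][i] says how well frame t matches sound i (higher is better).
    Returns one (first_frame, last_frame) pair per sound. Every sound
    gets at least one frame, even if it matches nothing well.
    """
    n_frames, n_sounds = len(scores), len(scores[0])
    if n_frames < n_sounds:
        raise ValueError("need at least one frame per sound")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_sounds for _ in range(n_frames)]
    moved = [[False] * n_sounds for _ in range(n_frames)]
    best[0][0] = scores[0][0]  # the first frame must belong to the first sound
    for t in range(1, n_frames):
        for i in range(n_sounds):
            stay = best[t - 1][i]  # frame t continues the same sound
            advance = best[t - 1][i - 1] if i > 0 else neg_inf  # sound i starts here
            if advance > stay:
                best[t][i] = advance + scores[t][i]
                moved[t][i] = True
            else:
                best[t][i] = stay + scores[t][i]
    # Walk backwards from the last frame of the last sound.
    spans = []
    i, end = n_sounds - 1, n_frames - 1
    for t in range(n_frames - 1, 0, -1):
        if moved[t][i]:
            spans.append((t, end))
            end = t - 1
            i -= 1
    spans.append((0, end))
    spans.reverse()
    return spans


# Made-up scores for six frames and the three sounds P, IY1, S.
scores = [
    [-1, -5, -5],
    [-1, -5, -5],
    [-5, -1, -5],
    [-5, -1, -5],
    [-5, -1, -5],
    [-5, -5, -1],
]
for sound, (first, last) in zip(["P", "IY1", "S"], force_align(scores)):
    print(sound, "frames", first, "to", last)
```

Because the search can only stay on the current sound or move on to the next one, no sound can be skipped. That is why a forced aligner will still mark a sound you never actually said.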

This particular aligner writes these guesses to a file called a TextGrid, which can be opened in the free program Praat.
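A TextGrid is a plain text file, so it can also be read by a short program. The Python sketch below pulls the start time, end time, and label out of each marked interval. It assumes the file uses Praat's long text layout, with "xmin = ", "xmax = " and "text = " lines. Praat can also save TextGrids in other layouts, and this sketch would need changes to read those. The sample contents and the name read_intervals are made up for illustration.

```python
import re

# Made-up sample in the assumed long text layout (one tier only).
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.9
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "phone"
        xmin = 0
        xmax = 0.9
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.2
            text = "P"
        intervals [2]:
            xmin = 0.2
            xmax = 0.7
            text = "IY1"
        intervals [3]:
            xmin = 0.7
            xmax = 0.9
            text = "S"
'''

# One numbered interval: its start time, end time, and label.
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([\d.]+)\s*'
    r'xmax = ([\d.]+)\s*'
    r'text = "([^"]*)"'
)


def read_intervals(textgrid_text):
    """Return (start, end, label) for every interval found in the text."""
    return [
        (float(start), float(end), label)
        for start, end, label in INTERVAL.findall(textgrid_text)
    ]


for start, end, label in read_intervals(SAMPLE):
    print(f"{label}: {end - start:.2f} s")
```

Printing the length of each interval is a quick way to spot suspicious marks, such as a vowel that lasts only a few hundredths of a second.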

Here are a few pictures and recordings that the aligner marked for your list. How did it do?

[Four images with audio: parts of your recording as marked by the aligner]

Check it out for yourself

Download the sound file and the TextGrid for your recording. Open them in Praat and check the alignment. What did it do well on? Where did it break?