Trans Voice Visualized: How to Read Spectrograms

What are they?

Spectrograms look like the following:

In essence, they are a graphical way of representing sound. They have several properties which make them useful for voice training and for exploring different techniques. Have a spectrogram app open and running while you read; it will make the rest of this guide easier to follow.

Apps

Each platform has several different spectrogram apps. These have been recommended:

  • For Android: Spectroid is a good choice with reasonable default settings. It also works on Android smartwatches, with the caveat that the menu button is in the top right corner and won’t be accessible if you have a round watchface.
  • For iOS: Spectrogram Pro ($3) or Visual Audio (free).
  • For PC (GNU/Linux, macOS, Windows): Friture has a widget called “2D spectrogram” which is what you want.
  • Special mention: inFormant, also known as Clo’s app. With a good microphone and a quiet background it shows the first few formants. Modifying the location of these is one way to shift the perceived gender of the voice.

Colors and lines!

The picture below shows the spectrogram in Friture. When it is running, you see the colored part moving to the left and new colors being generated on the right. In some spectrograms the display moves from left to right, from bottom to top, or from top to bottom (e.g. in Spectroid). To understand what this all means you need to understand each of the three axes shown: Frequency, Time, and Power Spectral Density, or PSD for short.

In essence the spectrogram tells you how much energy (PSD) there is at a given frequency at a given time. In the picture above the black region on the left corresponds to the time before the spectrogram was activated. The part with purple and pink haze represents the background noise from my device (a laptop with noisy fans), fridge, wind and other noise that reached the microphone. Then, the orange / yellow lines were produced by pronouncing a vowel. I did a pitch slide, starting with a low pitch (in the middle of the picture) and raising it and then keeping the pitch fixed high.
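If you like to tinker, the idea of "energy at a given frequency at a given time" can be sketched in a few lines of Python. This is only a toy illustration, not how Friture or Spectroid actually compute their display: real apps use a fast Fourier transform with overlapping, windowed slices. The names `SAMPLE_RATE`, `FRAME`, `dft_power`, `spectrogram` and `tone` are made up for this example.

```python
import cmath
import math

SAMPLE_RATE = 4000  # samples per second (assumed value for this sketch)
FRAME = 200         # samples per time slice, giving 4000 / 200 = 20 Hz per row

def dft_power(frame):
    """Energy at each frequency row for one time slice (naive DFT)."""
    n = len(frame)
    powers = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        powers.append(abs(acc) ** 2)
    return powers

def spectrogram(samples, frame=FRAME):
    """One column of energies per time slice; slices do not overlap here."""
    return [dft_power(samples[start:start + frame])
            for start in range(0, len(samples) - frame + 1, frame)]

def tone(freq_hz, n_samples):
    """A pure sine wave of the given frequency."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]

# A test sound: 200 Hz for 0.1 s, then 400 Hz for 0.1 s.
signal = tone(200, 400) + tone(400, 400)
columns = spectrogram(signal)

# The brightest row in each column, converted back to Hz.
loudest = [col.index(max(col)) * SAMPLE_RATE / FRAME for col in columns]
print(loudest)  # [200.0, 200.0, 400.0, 400.0]
```

Each entry of `columns` is one vertical strip of the picture: the list index is the time, the position inside the strip is the frequency, and the number stored there is what the app turns into a color.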

Frequency

This represents how many times per second an element producing sound vibrates. Each sound, no matter how simple or complicated, can be split up into frequencies, just as a picture can be split up into pixels. A sound that consists of a single pure frequency is a sine wave. Such a wave of frequency 360Hz sounds like this. A speaker making that sound would have its element going back and forth 360 times a second in a pattern given by the sine function. Believe it or not, any sound, including voices, can be split into sine waves of different frequencies and loudness.
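As a rough illustration of "splitting a sound into sine waves", the Python sketch below builds a pure 360Hz tone and then asks how much of a few candidate frequencies it contains. The sample rate and the helper names `sine_wave` and `strength_at` are assumptions chosen for the example, not part of any spectrogram app.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed value for this sketch)

def sine_wave(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Samples of a pure tone with amplitude 1."""
    n = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def strength_at(samples, freq_hz, sample_rate=SAMPLE_RATE):
    """Estimate the amplitude of the freq_hz component inside `samples`."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n

tone = sine_wave(360, 0.1)
for candidate in (180, 360, 720):
    print(candidate, "Hz:", round(strength_at(tone, candidate), 3))
```

Only the 360Hz candidate comes out with a strength close to 1; the others are essentially zero. A spectrogram performs this kind of check for many frequencies at once, many times per second.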

Time

The time-axis at the bottom simply tells how long ago something happened. The right side of the spectrogram in Friture is the only part that shows what’s happening NOW. The rest is what happened before. The time-axis of Friture is a bit wonky: 10 seconds on the axis corresponds to the sound that’s being recorded currently, and so 6 seconds on the axis represents what was recorded 4 seconds ago. In the picture, you can see that I started speaking at roughly 4.5 seconds, which means it was 10-4.5=5.5 seconds before I took the screenshot.
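The conversion from a position on Friture's time axis to "how long ago" is just a subtraction. A tiny Python sketch, assuming the 10-second window from the picture (the function name `seconds_ago` is made up for this example):

```python
def seconds_ago(axis_value_s, window_s=10.0):
    """How long ago a sound happened, given its position on the time axis.

    window_s is the axis value that corresponds to "now" (10 s in the picture).
    """
    return window_s - axis_value_s

print(seconds_ago(6.0))  # 4.0
print(seconds_ago(4.5))  # 5.5
```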

PSD (power spectral density)

The various colors in the spectrogram tell how loud a given frequency is at a given time. Usually the colder colors (black, blue, purple, dark pink) represent sounds that are less loud than those represented by warmer colors (bright pink, red, orange, yellow, white). This is shown in the axis on the right. Loudness is measured in decibels, dB. The higher the number, the louder the sound. However, negative values are often used, so pay attention: -20dB is louder than -50dB.
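The negative numbers make more sense once you know that decibels compare a power to some reference level on a logarithmic scale: anything weaker than the reference comes out negative. The Python sketch below uses the standard formula for power ratios; the reference value of 1.0 is an assumption for illustration, since each app picks its own reference.

```python
import math

def power_to_db(power, reference=1.0):
    """Express a power as decibels relative to `reference` (assumed to be 1.0)."""
    return 10 * math.log10(power / reference)

print(round(power_to_db(1.0), 1))      # 0.0, same as the reference
print(round(power_to_db(0.00001), 1))  # -50.0, a hundred thousand times weaker
print(round(power_to_db(0.01), 1))     # -20.0, louder than -50 dB
```

Every step of 10dB corresponds to a factor of 10 in power, which is why a small change in color can mean a big change in energy.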

So, how to read the spectrogram?

Once you know the meaning of the three axes you can understand what each individual pixel in the spectrogram means at any instant. Let us look at what happens at the pixel where the cursor is located in the picture below. I was doing some pitch slides up and down and up and down…

The label kindly tells us that the cursor is located at 9.65s on the time-axis and 275Hz on the frequency axis. The vertical and horizontal lines show these also. Most spectrogram apps display these helpful lines, at least for frequency. Then there’s the third axis: the color. It looks like it’s orange, so maybe around -50dB on the color axis displayed to the right. This means that the frequency 275Hz had loudness -50dB at time 9.65s (which is 0.35 seconds ago because the spectrogram slides to the left in Friture).

The actual value of -50dB is not important. This loudness value depends on the type of microphone you use, how far from the microphone you are talking and whether you are talking towards the microphone or away from it. Instead you should compare it to the other colors around it! The color below the cursor is orange, which represents a much higher loudness than the black or purple of the background. This means that my voice is doing something at 275Hz!

What is all of this good for?

Reading spectrograms is not about looking at the individual pixels, much like when watching a video you don’t think: “Great! The pixel at 500×750 was red 24 seconds after the movie started!” To get anything useful you need to focus on the various shapes the spectrogram shows. Here are a few voice parameters that can be read from a spectrogram.

Pitch

The pitch of the voice is given by the middle of the lowest line that your voice makes. The picture below shows my voice at 124Hz, 180Hz, and, lastly, 254Hz. Note that the lines become thicker along the frequency axis as the frequency becomes lower. This is normal, and is caused by the logarithmic frequency axis. The lines would all have the same thickness under a linear axis. Experiment and choose whichever axis type you prefer in the settings.
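To see why a logarithmic axis makes low lines look thicker, here is a small Python sketch. It assumes, purely for illustration, that every line is smeared by the same 10Hz on each side, and computes how much room that smear takes up on a logarithmic axis. The function name and the 10Hz value are made up for this example.

```python
import math

def thickness_on_log_axis(center_hz, half_width_hz=10.0):
    """On-screen height (in decades) of a line spanning center +/- half_width."""
    return math.log10((center_hz + half_width_hz) / (center_hz - half_width_hz))

for pitch in (124, 180, 254):
    print(pitch, "Hz:", round(thickness_on_log_axis(pitch), 4))
```

The printed thickness shrinks as the pitch goes up, even though the line is 20Hz wide in every case. On a linear axis all three would take up the same height.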

Harmonics and breathiness

The pitch of a voice is often called the fundamental frequency or the 1st harmonic. The 2nd harmonic is the line right above the 1st harmonic, and its frequency is twice the frequency of the 1st harmonic. The 3rd harmonic is the third line and so forth. Harmonic number N has frequency N times that of the fundamental frequency. This applies to any “periodic” sound, i.e. one produced by a vibrating or rotating element, e.g. the vocal folds. Background noise, the wind, and breath are not periodic. That’s why you don’t see these lines of concentrated energy or loudness in those sounds.
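The rule "harmonic N is N times the fundamental" is easy to turn into a quick calculation. The Python sketch below lists where the first few lines should appear for a voice at 124Hz; the function name `harmonics` is made up for this example.

```python
def harmonics(fundamental_hz, count):
    """Frequencies of harmonics 1..count for a given fundamental."""
    return [n * fundamental_hz for n in range(1, count + 1)]

print(harmonics(124, 5))  # [124, 248, 372, 496, 620]
```

If the lines in your spectrogram are evenly spaced like this on a linear frequency axis, the spacing itself equals your pitch.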

If you see some color between the harmonics it can mean a few different things. The most likely scenarios are that a) there is background or microphone noise, or b) there is breathiness in the voice. The picture below shows a breathy voice first and then no breathiness. Note that, compared to earlier pictures, I changed the frequency axis to linear to see details between frequencies up to 2kHz better than in the logarithmic scale.

Larynx height

It is possible to keep track of how high or low you can move the larynx using a spectrogram. The pictures below show what the big dog small dog (BDSD) and whisper siren exercises look like.

If you want to feminize the voice, you want to raise the larynx, which moves the bright spots in the pictures higher. For masculinization you lower the larynx, which moves the spots lower. A drill for raising the larynx is the BDSD, where you start by panting like a big dog, then try to pant like a small dog. The picture on the left shows this; the black stripes between the purple stripes are the moments in time between the pants, where barely any sound is produced. You can notice two brighter spots in each of the thick vertical purple lines. These show where the first and second resonances of your vocal tract are on the frequency scale. Raising the larynx raises these resonances. NOTE: other movements, such as changing the tongue position, might also move the resonances, so be sure to isolate the larynx as much as possible if your goal is to track your progress with larynx height. On my first pant (time 0.4s in the picture) the first resonance was at 700Hz. On the last pant (time 4.2s) that resonance was at around 900Hz. With more practice raising the larynx I could probably reach higher than 900Hz while keeping other parts of the vocal tract constant.

The picture on the right is the whisper siren, in which you start whispering some vowel, and try to change the perceived pitch of that whisper. Just as for BDSD, the brighter spots vertically show the frequencies which are the first two resonances (you will see more of them if your frequency axis has a bigger range). The first resonance was around 600Hz at the beginning, and it ended up at roughly 900Hz.

Spectrogram configuration

Various spectrogram apps have different settings. In general modifying the following is useful in certain situations:

  • Frequency axis: minimum and maximum frequency, logarithmic or linear (or other) scale. These will help with making the various harmonic lines clearer, by zooming in or out, or packing the frequencies close together to see more global properties.
  • Time axis: what can be changed here is just the length of time in the past that’s being displayed. Making this shorter makes the spectrogram move faster and vice versa.
  • Loudness / energy / PSD axis: Here it’s often possible to decide which loudness or power level corresponds to the maximum and minimum color. The main use is for removing background noise from the display or for making the peaks of the harmonics brighter or more clearly defined.

There are other options too, for example technical parameters related to how the spectrogram is calculated. These are advanced topics with very specific applications for voice practice, so they will not be described here.

Okay, here are the effects of just one technical parameter: Friture and Spectroid have an option called “FFT size”. Increasing this makes the spectrogram more precise in frequency, but less responsive in time. This is useful for analyzing situations where the voice does not change over time. Decreasing it does the opposite. You cannot have ultimate precision in both time and frequency. It’s just like Heisenberg’s uncertainty principle.
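The trade-off can be put into numbers. In the Python sketch below, the FFT size is the number of audio samples analyzed at once, so the spacing between frequency rows is the sample rate divided by the FFT size, and the stretch of time blended into one column is the FFT size divided by the sample rate. The 48000 samples per second is an assumed, commonly used value; check what your own app and microphone use. The function name `fft_tradeoff` is made up for this example.

```python
def fft_tradeoff(fft_size, sample_rate=48000):
    """Return (frequency step in Hz, time span in seconds) for one FFT.

    sample_rate defaults to 48000, an assumed value for this sketch.
    """
    frequency_step_hz = sample_rate / fft_size
    time_span_s = fft_size / sample_rate
    return frequency_step_hz, time_span_s

for size in (1024, 4096, 16384):
    step, span = fft_tradeoff(size)
    print(f"FFT size {size}: {step:.2f} Hz per row, {span * 1000:.0f} ms per column")
```

Multiplying the two numbers always gives 1, whatever FFT size you pick: making one smaller necessarily makes the other larger.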

Conclusion

The spectrogram shows how the energy of your voice is distributed among the different frequencies contained in the voice, and shows how that has changed in the past several seconds. The various shapes displayed by it can reveal the pitch of the voice, breathiness (or background noise) and changes in larynx height. Many other voice parameters can be analyzed too. The main way to do this is to produce your unmodified voice and, right after that, the modified one. Look at the difference in the spectrogram, and see if the difference increases when the voice is further modified. Not all voice modifications can be seen in the spectrogram as easily as the above, but many can be seen with enough practice.