
Understanding the basics of audio data with Python code



A huge amount of audio data is generated every day in almost every organization. Audio data yields substantial strategic insights when it is easily accessible to data scientists for fuelling AI engines and analytics. Organizations that have already realized the power and importance of the information in their audio data are leveraging AI (Artificial Intelligence) transcribed conversations to improve staff training and customer service, and to enhance the overall customer experience.

On the other hand, there are organizations that are not able to put their audio data to better use because of the following barriers: 1. They are not capturing it. 2. The quality of the data is bad. These barriers can limit the potential of the Machine Learning solutions (AI engines) they are going to implement. It is really important to capture all possible data, and to capture it in good quality.

This article provides a step-wise guide to getting started with audio data processing. Though this will help you get started with basic analysis, it is never a bad idea to get a basic understanding of sound waves and basic signal processing techniques before jumping into this field.

Basics of sound waves

Humans are born with incredible abilities, and hearing is one of the most awesome abilities we have. Sound makes our life easy: it makes us aware of our surroundings and of possible dangers. Blind individuals use their ears to see the world. Sound makes communication easy for us, and listening to music keeps us entertained. Let’s learn more about the sound wave.


A disturbance traveling through a medium is called a wave. A disturbance can be understood as something that changes the orientations and positions of the particles (initially at equilibrium) of a particular medium. A medium is something that carries a wave (a disturbance) from one place to another. The medium could be any substance or material, such as water, air, or steel. Remember, the medium is not responsible for creating waves; it just helps the waves transfer energy from one place to another.


There are many different types of waves like Mechanical waves, electromagnetic waves, matter waves, etc. Electromagnetic waves are a special type of wave that does not require any medium to travel, although they can travel through different types of mediums also. For example, light waves are electromagnetic waves that can pass through air, water as well as vacuum.

Looking at their way of propagation in the medium, waves can be classified into three major categories:

  1. Transverse Waves
  2. Longitudinal Waves
  3. Surface Waves

Transverse Waves


If the disturbance caused in the medium is perpendicular to the direction of wave propagation, the wave is called a transverse wave. Medium particles oscillate at 90 degrees to the direction of the wave. Examples are light waves and radio waves.

Longitudinal Waves


If the disturbance of the medium is in the parallel (same or opposite) direction to the direction of wave propagation, the wave is said to be a longitudinal wave. Medium particles oscillate in the same or opposite direction to the direction of the wave. Examples are sound waves and ultrasound waves.

Surface Waves

Surface waves travel along the interface between two different mediums. They often travel in a circular manner over the interface. Examples are ocean waves generated by gravity on the surface of the water and waves generated by throwing a stone into still water.

Sound Waves

Sound waves are mechanical waves as they require a medium. Sound waves transfer energy from one place to another. The flow of energy is always in the same direction as the wave. Sound waves can travel through air, liquid, and also solid mediums. Sound travels slowest in the air, much faster in the liquids, and fastest through the solids as medium particles are much closer (tightly-packed) in liquids and solids compared to air particles.

Sound waves travel as longitudinal waves in air and liquid mediums, while in solid mediums they can travel as longitudinal as well as transverse waves. Moving forward, we will only talk about sound waves passing through the air.

Sound Waves in the Air

When we speak, we change the pressure of the air closest to our mouth. This change in the pressure (disturbance of the medium for this case) travels in the atmosphere and is called a sound wave. Because this change of pressure in the air travels in the same direction as the sound wave, it makes it a longitudinal wave.

Three important properties of sound waves:
1. Mechanical wave
2. Pressure wave
3. Longitudinal wave

The definition of a sound wave would be :

1. Sound is energy carried by vibrations in the air. 

2. Sound is a longitudinal pressure wave. It is made up of compressions and de-compressions (also called rarefaction) which travel in the atmosphere.

Sound Generation

To generate a sound wave, you need to compress/put pressure on the air. Anything which vibrates or is capable of changing the air pressure can create a sound wave. For example, when we clap, slap or smash, we disturb the air nearby and this disturbance moves as a sound wave.

Sound Speed in Air

When a sound wave is generated in the air, the pressure disturbance (compressions and decompressions in the air) travels at roughly 330 to 340 metres per second. This is called the speed of sound in air. It depends strongly on the atmospheric temperature and increases as the temperature rises, because the gas molecules have more energy at higher temperatures.
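The temperature dependence can be made concrete with a small sketch. The linear formula used here is a common textbook approximation for dry air near everyday temperatures, not something taken from this article, and the function name `speed_of_sound` is made up for illustration.

```python
def speed_of_sound(temp_celsius: float) -> float:
    """Approximate speed of sound in dry air, in metres per second.

    Assumption: the textbook linear approximation v = 331.3 + 0.606 * T,
    which is only reasonable near everyday temperatures.
    """
    return 331.3 + 0.606 * temp_celsius


for temp in (0, 20, 35):
    print(f"{temp:>3} degrees C -> {speed_of_sound(temp):.1f} m/s")
```

At 0 °C this gives about 331 m/s and at 20 °C about 343 m/s, close to the range quoted above.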

Sound Frequency

The number of air compression-decompression pairs done by the disturbance in one second is called the frequency of a given sound wave. Something vibrating at a certain frequency generates a sound wave with equal frequency. But in reality, it’s hard to find a wave with just a single frequency because there are multiple unknown factors that cause pressure-change in real life. This kind of additional pressure change due to unknown (unwanted) factors is termed noise.

Human Speech

When we speak (talk), we generate multiple sound waves with multiple frequencies simultaneously. A collective pressure change caused by all these waves and surrounding noise travels in the atmosphere. When this collective pressure wave reaches our ears, we are able to hear it.

Recording Sound Waves

We know that sound is nothing but a continuous change in air pressure. If we want to record it, all we need to do is measure and record the air pressure in the atmosphere with time. Now there are two challenges in doing that:

  1. How to measure it → Microphones
  2. How frequently do we need to measure it → Sampling Rate


inside view of a microphone

Microphones are devices capable of converting the mechanical energy of sound waves into electric energy. When a pressure wave (sound wave) hits the diaphragm (usually made of thin plastic) inside the microphone, the diaphragm also moves along with the disturbance with the same frequency. A metal coil, which is in contact with a fixed magnet and also attached to the diaphragm, moves back and forth with it. Because this movement of the coil cuts through the magnetic field generated by the fixed magnet, an electric current flows through the coil. Now we can record this electric current, and our job is done.

Pulse Code Modulation

We now have a way to record sound waves as electric current using microphones, but storing this wave is still a problem because it is continuous. In order to store it, we need to convert this continuous signal into a discrete signal. Once we have discrete values of the electric voltage at regular time intervals, we can directly write them into a file and save them. This way of storing an analog signal as a digital signal is called Pulse Code Modulation (PCM).
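As a rough illustration of the idea, the sketch below samples a stand-in "voltage" at regular intervals and rounds each value to a 16-bit integer. The 440 Hz tone, the 8000 Hz sampling rate, and the variable names are all illustrative choices, not values from a real recording.

```python
import numpy as np

sample_rate = 8000      # samples per second (illustrative choice)
num_samples = 80        # 10 ms of signal at this rate
t = np.arange(num_samples) / sample_rate   # the regular sampling instants

# Stand-in for the microphone voltage: a 440 Hz tone in the range [-1, 1]
voltage = np.sin(2 * np.pi * 440 * t)

# Quantize each sample to a signed 16-bit integer, as 16-bit PCM does
pcm = np.round(voltage * 32767).astype(np.int16)

print(len(pcm), pcm.dtype, pcm[:5])
```

The array `pcm` is the kind of sequence that an uncompressed audio file stores, together with the sampling rate needed to interpret it.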


Sampling Rate / Sampling Frequency

The frequency at which we capture these electric voltages (amplitudes) is called the sampling rate of the sound file. In other words, the number of voltage values noted down in one second is the sampling rate, or sampling frequency, of the recorded file. What sampling rate is good enough for recording songs? Nyquist’s theorem answers that.

Nyquist-sampling theorem

According to Nyquist, your sampling rate should be at least twice the maximum frequency you want to capture from the given signal. In other words, if you are given a signal with frequencies ranging from 1 to f Hz and you do not want to lose any information (frequency), your sampling rate (F) should satisfy F ≥ 2 * f.


Because the human hearing range is roughly 20 to 20,000 Hz, there is no point in capturing frequencies greater than 20 kHz. So, according to Nyquist, if we sample at ≥ 40 kHz, we won’t lose any information within our hearing range. This is why most songs are sampled at 44.1 kHz.
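What goes wrong below the Nyquist rate can be shown with a tiny sketch. The numbers are chosen purely for illustration: a 7 Hz tone sampled at only 10 Hz (Nyquist limit 5 Hz) produces exactly the same samples as a sign-flipped 3 Hz tone, so the two cannot be told apart after sampling.

```python
import numpy as np

sample_rate = 10                       # Hz, so the Nyquist limit is 5 Hz
n = np.arange(20)                      # sample indices covering 2 seconds
t = n / sample_rate

too_fast = np.sin(2 * np.pi * 7 * t)   # 7 Hz tone, above the Nyquist limit
alias = -np.sin(2 * np.pi * 3 * t)     # 3 Hz tone with inverted sign

# The sampled values coincide: the 7 Hz tone "aliases" to 3 Hz
print(np.allclose(too_fast, alias))
```

This effect is called aliasing, and it is the information loss the theorem warns about.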

Playing Recorded Sound

Remember, when we recorded the sound we converted vibrations of the air into electric current. Here we need to do just the opposite: take electric current as input and convert it into vibrations. A loudspeaker (commonly just called a speaker) is the device that does exactly this.


Inside view of a loudspeaker

Speakers are made up of three basic things: a coil, a magnet, and a diaphragm. When a changing electric current is passed through the coil of metal wire, it creates a magnetic field around it. This magnetic field interacts with the magnetic field created by a fixed magnet. Because the magnet is fixed, only the coil moves back and forth (because of the interaction between the two magnetic fields). The diaphragm disc connected to the coil moves with it. This movement of the diaphragm pushes the air back and forth and results in a sound wave that we can hear.

Storing Sound Efficiently

Human voices and songs are sampled at a very high frequency. Every second we get tens of thousands of amplitudes (equal to the sampling rate) to store. Each amplitude takes 1 byte if stored as an 8-bit integer, or 2 bytes if stored as a 16-bit integer. So at a sampling rate of 44.1 kHz (generally used to record songs), each minute of single-channel audio takes ~5 MB of memory (at 2 bytes per amplitude value), which is huge.
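The arithmetic behind that estimate, as a short sketch. It assumes a single (mono) channel; stereo audio would double the figure.

```python
sample_rate = 44_100      # samples per second
bytes_per_sample = 2      # 16-bit integer samples
seconds = 60
channels = 1              # assumption: mono; stereo doubles the result

size_bytes = sample_rate * bytes_per_sample * seconds * channels
print(f"{size_bytes / 1_000_000:.2f} MB per minute")
```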

Looking at recorded audio, we see that a good percentage of the amplitudes are 0 (silence). As a ‘zero’ can be represented in a single bit, why are we wasting the other 15 bits? The idea is to design an algorithm that can store audio data using the minimum number of bits without losing audio quality.

Popular audio codecs

There are quite a few algorithms to store audio data efficiently. These algorithms are known as audio codecs, for example, MP3, WMA, etc. These codecs can be classified into two categories:

  1. Lossless Codecs
  2. Lossy Codecs

Lossless Codecs

An audio codec is said to be lossless if it preserves all the information of the original audio. In other words, when compressed data is decompressed it produces the same quality as it was before compression. Examples:

  1. Free Lossless Audio Codec (FLAC)
  2. Windows Media Lossless (from Microsoft)
  3. Apple Lossless
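The lossless property can be demonstrated with a general-purpose compressor. Note the swap: `zlib` from the Python standard library is not an audio codec and is used here only as a stand-in, and the signal (one second of silence followed by a tone) is made up for illustration. The point is that decompression returns the original bytes exactly, and that long runs of silence compress very well.

```python
import array
import math
import zlib

sample_rate = 8000
# One second of silence followed by one second of a 440 Hz tone, as 16-bit PCM
silence = [0] * sample_rate
tone = [round(20000 * math.sin(2 * math.pi * 440 * i / sample_rate))
        for i in range(sample_rate)]
pcm = array.array("h", silence + tone).tobytes()

compressed = zlib.compress(pcm)          # stand-in for a lossless codec
restored = zlib.decompress(compressed)

# Sizes before and after, and whether the round trip is bit-exact
print(len(pcm), len(compressed), restored == pcm)
```

Real lossless audio codecs such as FLAC use models tailored to audio and typically do better than a generic compressor on real recordings.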

Lossy Codecs

Lossy codecs discard some information in order to make the compressed audio smaller. Ideally, human ears will not notice the difference in quality. These codecs make the file smaller, so storage and transfer over the internet become much faster. Popular lossy codecs are:

  1. MP3 (MPEG-1 Audio Layer III, from the Moving Picture Experts Group)
  2. Windows Media Audio (WMA, from Microsoft)
  3. Advanced Audio Coding (AAC)

Processing Audio data

Some key terms in audio processing.

  • Amplitude: perceived as loudness.
  • Frequency: perceived as pitch.
  • Sample rate: how many samples of the sound are taken per second. A sample rate of 22000 Hz means 22000 samples are taken each second.
  • Bit depth: the number of bits used to store each sample. It determines the resolution of the recording, much like colour depth for the pixels of an image, so 24-bit sound is of better quality than 16-bit.
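To put numbers on bit depth, the sketch below counts the distinct amplitude levels at each depth and the resulting theoretical dynamic range. The helper name `dynamic_range_db` is made up for illustration, and the figure ignores real-world effects such as noise and dithering.

```python
import math


def dynamic_range_db(bit_depth: int) -> float:
    """Ratio between full scale and one quantization step, in decibels."""
    return 20 * math.log10(2 ** bit_depth)


for bits in (8, 16, 24):
    levels = 2 ** bits
    print(f"{bits}-bit: {levels} levels, about {dynamic_range_db(bits):.0f} dB")
```

Each extra bit adds roughly 6 dB, which is why 24-bit audio can represent much quieter detail than 16-bit audio.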

Here I have used the sound of a piano key from freesound.org

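import librosa, librosa.display
import matplotlib.pyplot as plt
# `file` is the path to the piano-key recording
signal, sample_rate = librosa.load(file, sr=22050)
# waveshow() replaces waveplot(), which newer librosa releases removed
librosa.display.waveshow(signal, sr=sample_rate, alpha=0.4)
plt.xlabel("Time (s)")
plt.savefig("waveform.png", dpi=100)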

To move a wave from the time domain to the frequency domain we need to perform a Fast Fourier Transform on the data. The Fourier transform decomposes a periodic sound into a sum of sine waves that all oscillate at different frequencies. It is quite incredible: we can describe a very complex sound, as long as it is periodic, as the superposition of a bunch of sine waves at different frequencies.

Below I have shown how two sine waves of different amplitude and frequency are combined into one.

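import numpy as np
import matplotlib.pyplot as plt

# perform Fourier transform (signal and sample_rate come from librosa.load)
fft = np.fft.fft(signal)
# calculate abs values on complex numbers to get magnitude
spectrum = np.abs(fft)
# create frequency variable: bin k sits at k * sample_rate / N
f = np.linspace(0, sample_rate, len(spectrum), endpoint=False)
# take half of the spectrum and frequency (the other half mirrors it)
left_spectrum = spectrum[:len(spectrum) // 2]
left_f = f[:len(spectrum) // 2]
# plot spectrum
plt.plot(left_f, left_spectrum, alpha=0.4)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.title("Magnitude spectrum")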

Applying the Fourier transform moves us into the frequency domain: the x-axis now shows frequency, and the magnitude is a function of frequency. In doing so we lose all information about time. The spectrum is a static snapshot of every component that contributes to the sound, taken over the whole clip. It tells us how strong each frequency is overall, but not when it occurs. That is a problem, because audio is a time series and we want to know how things change over time.

The solution is the Short-Time Fourier Transform (STFT). The STFT computes several Fourier transforms at successive intervals, and in doing so preserves information about time and about the way the sound evolves. The interval is set by the frame size, where a frame is a fixed number of samples. For example, we consider only 200 samples and compute the Fourier transform there, then shift along the waveform and repeat for the next frame. The result is a spectrogram, which gives us time, frequency, and magnitude together.

# STFT -> spectrogram
hop_length = 512 # in num. of samples
n_fft = 2048 # window in num. of samples
# calculate duration hop length and window in seconds
hop_length_duration = float(hop_length)/sample_rate
n_fft_duration = float(n_fft)/sample_rate
print("STFT hop length duration is: {}s".format(hop_length_duration))
print("STFT window duration is: {}s".format(n_fft_duration))
# perform stft
stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
# calculate abs values on complex numbers to get magnitude
spectrogram = np.abs(stft)
# display spectrogram
librosa.display.specshow(spectrogram, sr=sample_rate, hop_length=hop_length)
# apply logarithm to cast amplitude to Decibels
log_spectrogram = librosa.amplitude_to_db(spectrogram)
# the original call was cut off here; hop_length is passed as in the call above
librosa.display.specshow(log_spectrogram, sr=sample_rate, hop_length=hop_length)
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram (dB)")

We have time on the x-axis and frequency on the y-axis, plus a third axis given by the color. The color tells us how much of a given frequency is present in the sound at a given time. For example, here we see that low-frequency sound dominates most of the audio.

Mel-Frequency Cepstral Coefficients (MFCCs) capture many aspects of a sound. If a guitar and a flute play the same melody, the frequency and rhythm are more or less the same (depending on the performance), but the quality of the sound differs, and MFCCs are capable of capturing that difference. To extract MFCCs we perform a Fourier transform and move from the time domain to the frequency domain, so MFCCs are basically frequency-domain features. Their great advantage over spectrograms is that they approximate the human auditory system: they try to model the way we perceive frequency. That matters for deep learning, where we want data that represents the way we process audio. The result of extracting MFCCs is a vector of coefficients, and you can specify how many coefficients to compute. Music applications usually use between 13 and 39 coefficients. The coefficients are calculated for each frame, so MFCCs evolve over time.

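# extract 13 MFCCs (recent librosa releases require y and sr as keywords)
MFCCs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_fft=n_fft, hop_length=hop_length, n_mfcc=13)
# display MFCCs (the original call was cut off; hop_length matches the STFT above)
librosa.display.specshow(MFCCs, sr=sample_rate, hop_length=hop_length)
plt.ylabel("MFCC coefficients")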

So here I have 13 MFCC coefficients represented on the y-axis, time on the x-axis, and the more the red, the more the value of that coefficient in that time frame.

MFCCs are used in a number of audio applications. They were originally introduced for speech recognition, but they are also used in music recognition, musical instrument classification, and music genre classification.
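The "mel" in MFCC refers to the mel scale, a frequency scale designed to follow perceived pitch. A minimal sketch of one widely used conversion formula is shown below; this is an assumption about which variant to illustrate, since libraries differ in the variant they default to, and the function name `hz_to_mel` is chosen here for illustration.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


for f in (500, 1000, 4000, 8000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

The scale is close to linear at low frequencies and compresses high frequencies: going from 4 kHz to 8 kHz adds far fewer mels than going from 0 to 4 kHz, mirroring how our hearing resolves low frequencies more finely.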

Reading Audio Files


LibROSA is a Python library that has almost every utility you are going to need while working with audio data. This rich library comes with a large number of different functionalities. Here is a quick look at the features:

  1. Loading and displaying characteristics of an audio file.
  2. Spectral representations
  3. Feature extraction and Manipulation
  4. Time-Frequency conversions
  5. Temporal Segmentation
  6. Sequential Modeling

As this library is huge, we are not going to talk about all the features it carries. We are just going to use a few common features for our understanding. Here is how you can install this library real quick:

# pypi: 
pip install librosa
# conda: 
conda install -c conda-forge librosa

Loading Audio into Python

Librosa supports lots of audio codecs, although .wav (lossless) is the most widely used for audio data analysis. Once you have successfully installed and imported libROSA in your Jupyter notebook, you can read a given audio file by simply passing the file_path to the librosa.load() function.

The librosa.load() function returns two things: 1. An array of amplitudes. 2. The sampling rate. The sampling rate refers to the ‘sampling frequency’ used while recording the audio file. If you pass sr=None, it will load your audio file at its original sampling rate; otherwise librosa resamples to its default of 22050 Hz. (Note: You can specify a custom sampling rate as per your requirement; libROSA can upsample or downsample the signal for you.)

sampling_rate = 16k says that this audio was recorded (sampled) with a sampling frequency of 16k. In other words, while recording this file we were capturing 16000 amplitudes every second. Thus, If we want to know the duration of the audio, we can simply divide the number of samples (amplitudes) by the sampling rate as shown below:
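A minimal sketch of the duration calculation follows. To keep it runnable without an audio file, a placeholder array stands in for the loaded clip; in practice `samples` and `sampling_rate` would be the two values returned by `librosa.load(file_path, sr=None)`.

```python
import numpy as np

sampling_rate = 16000                     # samples per second
# Placeholder for the loaded amplitudes: 48000 samples of silence
samples = np.zeros(3 * sampling_rate)

# Duration in seconds = number of samples / samples per second
duration_seconds = len(samples) / sampling_rate
print(duration_seconds)
```

With 48000 samples at 16 kHz, the clip is 3 seconds long.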

You can play the audio inside your jupyter-notebook. IPython gives us a widget to play audio files through the notebook.

Visualizing Audio

We have the amplitudes and the sampling rate from librosa, so we can easily plot these amplitudes against time. LibROSA provides a utility function for this: waveplot() in older releases, replaced by waveshow() in newer ones.

This visualization is called the time-domain representation of a given signal. This shows us the loudness (amplitude) of the sound wave changing with time. Here amplitude = 0 represents silence. (From the definition of sound waves this amplitude is actually the amplitude of air particles that are oscillating because of the pressure change in the atmosphere due to sound).

These amplitudes are not very informative, as they only talk about the loudness of the audio recording. To better understand the audio signal, it is necessary to transform it into the frequency domain. The frequency-domain representation of a signal tells us what different frequencies are present in the signal. Fourier Transform is a mathematical concept that can convert a continuous signal from time-domain to frequency-domain. Let’s learn more about Fourier Transform.

Fourier Transform (FT)

An audio signal is a complex signal composed of multiple ‘single-frequency sound waves’ which travel together as a disturbance (pressure change) in the medium. When sound is recorded we only capture the resultant amplitudes of those multiple waves. The Fourier Transform is a mathematical concept that can decompose a signal into its constituent frequencies. It does not just give the frequencies present in the signal; it also gives the magnitude of each frequency present.

The Inverse Fourier Transform is just the opposite of the Fourier Transform. It takes the frequency-domain representation of a given signal as input and mathematically synthesizes the original signal. Let’s see how we can use the Fourier transform to convert our audio signal into its frequency components.
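The round trip can be checked numerically. This sketch uses NumPy's discrete transforms on an arbitrary random signal (an illustrative choice): transforming to the frequency domain and back recovers the original samples up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(256)           # arbitrary real-valued test signal

spectrum = np.fft.fft(signal)               # time domain -> frequency domain
reconstructed = np.fft.ifft(spectrum).real  # frequency domain -> time domain

print(np.allclose(signal, reconstructed))
```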

Fast Fourier Transform (FFT)

The Fast Fourier Transform (FFT) is an algorithm that calculates the Discrete Fourier Transform (DFT) of a given sequence. The difference between the FT (Fourier Transform) and the DFT is that the FT considers a continuous signal while the DFT takes a discrete signal as input. The DFT converts a sequence (discrete signal) into its frequency constituents just like the FT does for a continuous signal. In our case, we have a sequence of amplitudes that were sampled from a continuous audio signal. The FFT algorithm can convert this discrete time-domain signal into the frequency domain.

FFT algorithm overview

Simple Sine Wave to Understand FFT

To understand the output of FFT, let’s create a simple sine wave. The following piece of code creates a sine wave with a sampling rate = 100, amplitude = 1, and frequency = 3. Amplitude values are calculated every 1/100th second (sampling rate) and stored into an array called y1. We will pass these discrete amplitude values to calculate the DFT of this signal using the FFT algorithm.
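The original code for this step is not reproduced here, so the following is a sketch under the stated parameters (sampling rate 100, amplitude 1, frequency 3). The one-second duration is an assumption.

```python
import numpy as np

sampling_rate = 100      # samples per second
frequency = 3            # Hz
amplitude = 1

# Sampling instants for one second of signal: 0, 1/100, 2/100, ...
t = np.arange(sampling_rate) / sampling_rate
y1 = amplitude * np.sin(2 * np.pi * frequency * t)

print(len(y1), y1[:4].round(3))
```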

If you plot these discrete values(y1) keeping the sample number on the x-axis and amplitude value on the y-axis, it generates a nice sine wave plot as the following screenshot shows:

Now we have a sequence of amplitudes stored in list y1. We will pass this sequence to the FFT algorithm implemented by scipy. This algorithm returns a list yf of complex-valued amplitudes of the frequencies found in the signal. The first half of this list returns positive-frequency-terms, and the other half returns negative-frequency-terms which are similar to the positive ones. You can pick out any one half and calculate absolute values to represent the frequencies present in the signal. The following function takes samples as input and plots the frequency graph:
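The function itself is not reproduced in this text, so here is a hedged sketch of the numeric part. It uses `numpy.fft` rather than scipy's implementation, and the name `fft_spectrum` is made up for illustration; plotting the result is a single `plt.plot(freqs, amps)` call on its return values.

```python
import numpy as np


def fft_spectrum(samples, sampling_rate):
    """Return (frequencies, amplitudes) for the positive-frequency half."""
    n = len(samples)
    yf = np.fft.fft(samples)            # complex-valued DFT of the samples
    half = n // 2                       # keep only positive-frequency terms
    freqs = np.arange(half) * sampling_rate / n
    # Scale by 2/n so a sine of amplitude A shows up as a peak of height A
    # (the bin at 0 Hz would need 1/n instead; ignored in this sketch)
    amps = 2.0 / n * np.abs(yf[:half])
    return freqs, amps


# Usage with the 3 Hz, amplitude-1 sine wave sampled at 100 Hz
t = np.arange(100) / 100
y1 = np.sin(2 * np.pi * 3 * t)
freqs, amps = fft_spectrum(y1, 100)
peak = int(np.argmax(amps))
print(freqs[peak], round(float(amps[peak]), 3))
```

The largest peak sits at 3 Hz with height 1, matching the frequency and amplitude used to build the wave.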

In the following graph, we have plotted the frequencies for our sine wave using the above fft_plot function. You can see this plot clearly shows the single frequency value present in our sine wave, which is 3. Also, it shows amplitude related to this frequency which we kept 1 for our sine wave.

To check out the output of FFT for a signal having more than one frequency, Let’s create another sine wave. This time we will keep sampling rate = 100, amplitude = 2 and frequency value = 11. The following code generates this signal and plots the sine wave:

Generated sine wave looks like the below graph. It would have been smoother if we had increased the sampling rate. We have kept the sampling rate = 100 because later we are going to add this signal to our old sine wave.

Obviously, the FFT function will show a single spike at frequency = 11 for this wave as well. But we want to see what happens if we add these two signals, which have the same sampling rate but different frequency and amplitude values. Here the sequence y3 represents the resultant signal.

If we plot the signal y3, it looks something like this:

If we pass this sequence (y3) to our fft_plot function, it generates the following frequency graph. It shows two spikes for the two frequencies present in our resultant signal, so the presence of one frequency does not affect the other frequency in the signal. Also notice that the magnitudes of the frequencies are in line with our generated sine waves.
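The same experiment can be reproduced numerically. This is a sketch with `numpy.fft` under the parameters given above; the one-second duration is an assumption.

```python
import numpy as np

sampling_rate = 100
t = np.arange(sampling_rate) / sampling_rate   # one second of samples

y1 = 1 * np.sin(2 * np.pi * 3 * t)     # amplitude 1, frequency 3 Hz
y2 = 2 * np.sin(2 * np.pi * 11 * t)    # amplitude 2, frequency 11 Hz
y3 = y1 + y2                           # resultant signal

# Amplitude of each positive-frequency bin, scaled to the sine amplitudes
amps = 2.0 / len(y3) * np.abs(np.fft.rfft(y3))
freqs = np.fft.rfftfreq(len(y3), d=1 / sampling_rate)

# Keep only the bins that carry significant energy
peaks = freqs[amps > 0.5]
print(peaks, amps[amps > 0.5].round(3))
```

Only two bins stand out, at 3 Hz and 11 Hz, with heights 1 and 2 respectively.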

FFT on our Audio signal

Now that we have seen how the FFT algorithm gives us all the frequencies in a given signal, let’s pass our original audio signal into this function. We are using the same audio clip we loaded into Python earlier, with a sampling rate of 16000.

Now, look at the following frequency plot. This ‘3-second long’ signal is composed of thousands of different frequencies. Magnitudes of frequency values > 2000 are very small as most of these frequencies are probably due to noise. We are plotting frequencies ranging from 0 to 8kHz because our signal was sampled at a 16k sampling rate and according to the Nyquist sampling theorem, it should only possess frequencies ≤ 8000Hz (16000/2).

Strong frequencies are ranging from 0 to 1kHz only because this audio clip was human speech. We know that in a typical human speech this range of frequencies dominates.

We got frequencies, but where is the time information?


why spectrogram

Suppose you are working on a Speech Recognition task. You have an audio file in which someone is speaking a phrase (for example: How are you). Your recognition system should be able to predict these three words in the same order (1. ‘how’, 2. ‘are’, 3. ‘you’). If you remember, in the previous exercise we broke our signal into its frequency values which will serve as features for our recognition system. But when we applied FFT to our signal, it gave us only frequency values and we lost the track of time information. Now our system won’t be able to tell what was spoken first if we use these frequencies as features. We need to find a different way to calculate features for our system such that it has frequency values along with the time at which they were observed. Here Spectrograms come into the picture.

Visual representation of frequencies of a given signal with time is called Spectrogram. In a spectrogram representation plot one axis represents the time, the second axis represents frequencies and the colors represent the magnitude (amplitude) of the observed frequency at a particular time. The following screenshot represents the spectrogram of the same audio signal we discussed earlier. Bright colors represent strong frequencies. Similar to the earlier FFT plot, smaller frequencies ranging from (0–1kHz) are strong(bright).

Creating and Plotting the spectrogram

The idea is to break the audio signal into smaller frames (windows) and calculate the DFT (or FFT) for each window. This way we get the frequencies for each window, and the window number represents the time: window 1 comes first, window 2 next, and so on. It’s good practice to keep these windows overlapping, otherwise we might lose a few frequencies. The window size depends on the problem you are solving.

For a typical speech recognition task, a window of 20 to 30ms long is recommended. A human can’t possibly speak more than one phoneme in this time window. So keeping the window this much smaller we won’t lose any phonemes while classifying. The frame (window) overlap can vary from 25% to 75% as per your need, generally, it is kept 50% for speech recognition.

In our spectrogram calculation, we will keep the window duration 20ms and an overlap of 50% among the windows. Because our signal is sampled at 16k frequency, each window is going to have (16000 * 20 * 0.001) = 320 amplitudes. For an overlap of 50%, we need to go forward by (320/2) = 160 amplitude values to get to the next window. Thus our stride value is 160.
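The same arithmetic as a short sketch. The 3-second clip length is taken from the audio example used earlier; the variable names are illustrative.

```python
sample_rate = 16000       # samples per second
window_ms = 20            # window duration in milliseconds
overlap = 0.5             # 50% overlap between consecutive windows

window_size = sample_rate * window_ms // 1000    # samples per window
stride = int(window_size * (1 - overlap))        # samples between window starts

# Number of full windows that fit into a 3-second clip
num_samples = 3 * sample_rate
num_windows = (num_samples - window_size) // stride + 1

print(window_size, stride, num_windows)
```

For these values each window holds 320 samples, the stride is 160 samples, and a 3-second clip yields 299 windows.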

Have a look at the spectrogram function below. It builds a weighting window (Hanning) and multiplies it with the amplitudes of each frame before passing them to the FFT function. The weighting window is used to handle the discontinuity at the edges of this small signal (the signal from a single frame) before it is passed to the DFT algorithm.

A python function to calculate spectrogram features:

import numpy as np

def spectrogram(samples, sample_rate, stride_ms = 10.0, 
                          window_ms = 20.0, max_freq = None, eps = 1e-14):
    # Default to the Nyquist frequency; comparing against None would fail below
    if max_freq is None:
        max_freq = sample_rate / 2

    stride_size = int(0.001 * sample_rate * stride_ms)
    window_size = int(0.001 * sample_rate * window_ms)

    # Extract strided windows (one window per column, without copying data)
    truncate_size = (len(samples) - window_size) % stride_size
    samples = samples[:len(samples) - truncate_size]
    nshape = (window_size, (len(samples) - window_size) // stride_size + 1)
    nstrides = (samples.strides[0], samples.strides[0] * stride_size)
    windows = np.lib.stride_tricks.as_strided(samples, 
                                          shape = nshape, strides = nstrides)
    assert np.all(windows[:, 1] == samples[stride_size:(stride_size + window_size)])

    # Window weighting, squared Fast Fourier Transform (fft), scaling
    weighting = np.hanning(window_size)[:, None]
    fft = np.fft.rfft(windows * weighting, axis=0)
    fft = np.absolute(fft)
    fft = fft**2
    scale = np.sum(weighting**2) * sample_rate
    fft[1:-1, :] *= (2.0 / scale)
    fft[(0, -1), :] /= scale
    # Prepare fft frequency list
    freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
    # Compute spectrogram feature: keep bins up to max_freq, take the log power
    ind = np.where(freqs <= max_freq)[0][-1] + 1
    specgram = np.log(fft[:ind, :] + eps)
    return specgram

The output of the FFT algorithm is a list of complex numbers (size = window_size/2 + 1 for the real-input FFT used above) which represent amplitudes of different frequencies within the window. For our window of size 320, we get 161 frequency bins, which represent frequencies from 0 Hz to 8 kHz (as our sampling rate is 16k).
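The bin count and frequency range can be checked directly with NumPy. This sketch uses an all-zero placeholder frame, since only the shapes and bin frequencies matter here.

```python
import numpy as np

window_size = 320
sample_rate = 16000

frame = np.zeros(window_size)            # placeholder for one 20 ms window
bins = np.fft.rfft(frame)                # real-input FFT of a single frame
freqs = np.fft.rfftfreq(window_size, d=1 / sample_rate)

# Number of bins, plus the first, second, and last bin frequency in Hz
print(len(bins), freqs[0], freqs[1], freqs[-1])
```

The real-input FFT returns window_size/2 + 1 = 161 bins, spaced 50 Hz apart and running from 0 Hz up to the Nyquist frequency of 8000 Hz.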

Going forward, the absolute values of those complex-valued amplitudes are squared, scaled, and passed through a logarithm. The resulting 2D matrix is your spectrogram. In the matrix returned by the function above, rows represent frequency bins and columns represent window (frame) numbers, while the values represent the strength of the frequencies.

Speech Recognition using Spectrogram Features

We know how to generate a spectrogram now, which is a 2D matrix representing the frequency magnitudes along with time for a given signal. Now think of this spectrogram as an image. You have converted your audio file into the following image.

This reduces it to an image classification problem. The image represents your spoken phrase from left to right in time order. Or consider it as an image where your phrase is written from left to right, and all you need to do is identify those hidden English characters.

Given a parallel corpus of English text, we can train a deep learning model and build a speech recognition system of our own. Here are two well-known open-source datasets to try out:

Popular open source datasets:
1. LibriSpeech ASR corpus
2. Common Voice Massively-Multilingual Speech Corpus

Popular choices of deep learning architectures are described in the following research papers:

  1. Wav2Letter (Facebook Research)
  2. Deep Speech, Deep Speech 2, and Deep Speech 3 (Baidu Research)
  3. Listen, Attend and Spell (Google Brain)

Feature extraction from Audio signal

Every audio signal consists of many features. However, we must extract the characteristics that are relevant to the problem we are trying to solve. The process of extracting features to use them for analysis is called feature extraction. Let us study a few of the features in detail.

Spectral features (frequency-based features) are obtained by converting the time-domain signal into the frequency domain using the Fourier transform. Examples include the fundamental frequency, frequency components, spectral centroid, spectral flux, spectral density, and spectral roll-off.

1. Spectral Centroid

The spectral centroid indicates the frequency around which the energy of a spectrum is centered; in other words, it indicates where the "center of mass" of a sound is located. It is a weighted mean of the frequencies present in the signal:

$$f_c = \frac{\sum_k S(k)\, f(k)}{\sum_k S(k)}$$

where S(k) is the spectral magnitude at frequency bin k and f(k) is the frequency at bin k.
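
The weighted mean described here can be computed by hand for a single frame. The following is a minimal NumPy sketch, not librosa's implementation; the helper name `spectral_centroid_frame` and the synthetic test tones are made up for illustration.

```python
import numpy as np

def spectral_centroid_frame(frame, sample_rate):
    """Magnitude-weighted mean frequency of a single frame."""
    magnitudes = np.abs(np.fft.rfft(frame))                    # S(k)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)   # f(k)
    return np.sum(magnitudes * freqs) / np.sum(magnitudes)

sample_rate = 16000
t = np.arange(2048) / sample_rate

# Two equally loud tones; both frequencies fall exactly on FFT bins
low = np.sin(2 * np.pi * 1000 * t)
high = np.sin(2 * np.pi * 3000 * t)

centroid_low = spectral_centroid_frame(low, sample_rate)
centroid_mix = spectral_centroid_frame(low + high, sample_rate)
print(centroid_low, centroid_mix)
```

The single tone has its centroid at roughly 1000 Hz, and adding an equally loud 3000 Hz tone pulls the centroid to roughly the midpoint, 2000 Hz. This is why brighter sounds have a higher spectral centroid.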

librosa.feature.spectral_centroid computes the spectral centroid for each frame in a signal:

import sklearn.preprocessing
# x and sr come from the earlier librosa.load call; recent librosa releases need y= as a keyword
spectral_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)[0]
# Computing the time variable for visualization
plt.figure(figsize=(12, 4))
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames, sr=sr)
# Normalising the spectral centroid for visualisation
def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)
# Plotting the spectral centroid along the waveform
# (waveshow is called waveplot in librosa versions before 0.9)
librosa.display.waveshow(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_centroids), color='b')

librosa.feature.spectral_centroid returns an array with one column per frame in your sample.

There is a rise in the spectral centroid in the beginning.

2. Spectral Rolloff

Spectral rolloff is a measure of the shape of the signal's spectrum. It is the frequency below which a specified percentage of the total spectral energy lies (85% by default in librosa), which roughly marks where the high frequencies decline toward 0.
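
As a rough illustration of the 85% rule, the sketch below finds the rolloff frequency of a single frame with NumPy. It is a simplified stand-in, not librosa's code: the helper name `spectral_rolloff_frame` is hypothetical, and it assumes spectral magnitudes are accumulated from the lowest bin upward.

```python
import numpy as np

def spectral_rolloff_frame(frame, sample_rate, roll_percent=0.85):
    """Lowest bin frequency at which the cumulative magnitude reaches roll_percent of the total."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    cumulative = np.cumsum(magnitudes)
    threshold = roll_percent * cumulative[-1]
    index = np.searchsorted(cumulative, threshold)  # first bin with cumulative >= threshold
    return freqs[index]

sample_rate = 16000
t = np.arange(2048) / sample_rate
strong_low = np.sin(2 * np.pi * 1000 * t)
weak_high = 0.1 * np.sin(2 * np.pi * 3000 * t)
equal_high = np.sin(2 * np.pi * 3000 * t)

# Most of the magnitude sits at 1 kHz, so the 85% point is reached there
rolloff_mostly_low = spectral_rolloff_frame(strong_low + weak_high, sample_rate)
# With equally strong tones, the 85% point is only reached at 3 kHz
rolloff_balanced = spectral_rolloff_frame(strong_low + equal_high, sample_rate)
print(rolloff_mostly_low, rolloff_balanced)
```

A weak high-frequency component does not move the rolloff, while a strong one does, which is what makes the feature useful for telling dull sounds from bright ones.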

librosa.feature.spectral_rolloff computes the rolloff frequency for each frame in a signal:

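# Recent librosa releases require the signal to be passed as y=
spectral_rolloff = librosa.feature.spectral_rolloff(y=x+0.01, sr=sr)[0]
plt.figure(figsize=(12, 4))
# waveshow is called waveplot in librosa versions before 0.9
librosa.display.waveshow(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_rolloff), color='r')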

3. Spectral Bandwidth

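The spectral bandwidth measures how widely the spectrum is spread around its centroid. The order-p spectral bandwidth is

$$\left(\sum_k S(k)\,\lvert f(k) - f_c\rvert^{p}\right)^{1/p}$$

where S(k) is the (normalized) spectral magnitude at frequency bin k, f(k) is the frequency at bin k, and f_c is the spectral centroid. For p = 2 this behaves like a weighted standard deviation of the frequencies.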

librosa.feature.spectral_bandwidth computes the order-p spectral bandwidth:

# Recent librosa releases require the signal to be passed as y=
spectral_bandwidth_2 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr)[0]
spectral_bandwidth_3 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr, p=3)[0]
spectral_bandwidth_4 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr, p=4)[0]
plt.figure(figsize=(15, 9))
# waveshow is called waveplot in librosa versions before 0.9
librosa.display.waveshow(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_bandwidth_2), color='r')
plt.plot(t, normalize(spectral_bandwidth_3), color='g')
plt.plot(t, normalize(spectral_bandwidth_4), color='y')
plt.legend(('p = 2', 'p = 3', 'p = 4'))
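
To see what the order-p bandwidth measures, here is a minimal NumPy sketch for a single frame. It is an illustration under simplifying assumptions (no windowing, magnitudes normalised to sum to 1), and the helper name `spectral_bandwidth_frame` is hypothetical.

```python
import numpy as np

def spectral_bandwidth_frame(frame, sample_rate, p=2):
    """Order-p spread of the spectrum around its centroid, for one frame."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    weights = magnitudes / np.sum(magnitudes)   # normalise so the weights sum to 1
    centroid = np.sum(weights * freqs)
    return np.sum(weights * np.abs(freqs - centroid) ** p) ** (1.0 / p)

sample_rate = 16000
t = np.arange(2048) / sample_rate
narrow = np.sin(2 * np.pi * 1000 * t)                                # one tone
wide = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 3000 * t)   # two tones 2 kHz apart

bw_narrow = spectral_bandwidth_frame(narrow, sample_rate)
bw_wide = spectral_bandwidth_frame(wide, sample_rate)
print(bw_narrow, bw_wide)
```

A single tone has almost no spread, while two equally loud tones at 1 kHz and 3 kHz sit 1000 Hz on either side of their 2 kHz centroid, giving a bandwidth of roughly 1000 Hz for p = 2.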

4. Zero-Crossing Rate

A very simple way of measuring the smoothness of a signal is to count the zero crossings within a segment of that signal. A voiced signal oscillates slowly (for example, a 100 Hz tone crosses zero about 200 times per second, twice per cycle), whereas an unvoiced fricative can have 3000 zero crossings per second.

The zero-crossing rate usually has higher values for highly percussive sounds like those in metal and rock. Now let us visualize it and see how we calculate the zero crossing rate.

x, sr = librosa.load('/../../gruesome.wav')
# Plot the signal (waveshow is called waveplot in librosa versions before 0.9):
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)

Zooming in

n0 = 9000
n1 = 9100
plt.figure(figsize=(14, 5))
# Plot only the 100 samples between n0 and n1
plt.plot(x[n0:n1])
plt.grid()

There appear to be 16 zero crossings. Let’s verify it with Librosa.

zero_crossings = librosa.zero_crossings(x[n0:n1], pad=False)
# zero_crossings is a boolean array; summing it counts the crossings
print(sum(zero_crossings))
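
As a sanity check on the counting itself, the sketch below counts sign changes with plain NumPy on synthetic tones. The helper `count_zero_crossings` and the test signals are made up for illustration; a sinusoid crosses zero twice per cycle, so the count grows with frequency.

```python
import numpy as np

def count_zero_crossings(signal):
    """Number of sign changes between consecutive samples."""
    signs = np.signbit(signal)
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate   # one second of samples

# A small phase offset keeps every sample away from exactly zero
tone_100hz = np.sin(2 * np.pi * 100 * t + 0.1)
tone_1khz = np.sin(2 * np.pi * 1000 * t + 0.1)

crossings_100hz = count_zero_crossings(tone_100hz)
crossings_1khz = count_zero_crossings(tone_1khz)
print(crossings_100hz, crossings_1khz)
```

Over one second the 100 Hz tone produces about 200 crossings and the 1 kHz tone about 2000, so dividing the count by the segment length gives a cheap proxy for how much high-frequency content a segment holds.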

5. Mel-Frequency Cepstral Coefficients(MFCCs)

The Mel-frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10–20) that concisely describe the overall shape of a spectral envelope. They model the characteristics of the human voice.

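# sr is the sampling rate returned by librosa.load
mfccs = librosa.feature.mfcc(y=x, sr=sr)
print(mfccs.shape)  # (20, 97) in the original run: 20 coefficients for each of 97 frames
# Displaying the MFCCs:
plt.figure(figsize=(15, 7))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')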

6. Chroma feature

A chroma feature or vector is typically a 12-element feature vector indicating how much energy of each pitch class, {C, C#, D, D#, E, …, B}, is present in the signal. In short, it provides a robust way to describe a similarity measure between music pieces.

librosa.feature.chroma_stft is used for the computation of Chroma features.

hop_length = 512  # assumed value (librosa's default); not defined in the original snippet
chromagram = librosa.feature.chroma_stft(y=x, sr=sr, hop_length=hop_length)
plt.figure(figsize=(15, 5))
librosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm')
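
To show where the 12 pitch classes come from, here is a deliberately simplified NumPy sketch that folds one frame's spectrum into pitch-class buckets. It is not how librosa computes chroma; the helpers `pitch_class` and `simple_chroma` are hypothetical, and the mapping assumes equal temperament with A4 = 440 Hz.

```python
import numpy as np

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_class(freq_hz):
    """Index (0 = C ... 11 = B) of the nearest equal-tempered note, assuming A4 = 440 Hz."""
    midi_note = int(np.round(69 + 12 * np.log2(freq_hz / 440.0)))
    return midi_note % 12

def simple_chroma(frame, sample_rate):
    """Sum the spectral magnitude of one frame into 12 pitch-class buckets."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    chroma = np.zeros(12)
    for freq, mag in zip(freqs[1:], magnitudes[1:]):   # skip the 0 Hz bin
        chroma[pitch_class(freq)] += mag
    return chroma / chroma.sum()

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate   # one second, so FFT bins are 1 Hz apart
# A 440 Hz tone (A) plus a quieter 660 Hz tone (close to E)
a_and_e = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

chroma = simple_chroma(a_and_e, sample_rate)
strongest = PITCH_CLASSES[int(np.argmax(chroma))]
print(strongest, np.round(chroma, 2))
```

Because notes an octave apart land in the same bucket, the vector describes harmony independently of register, which is what makes chroma useful for comparing pieces of music.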

We now understand how to work with audio data and extract important features using Python. In the following section, we are going to use these features to build an ANN model for music genre classification.

Music genre classification using ANN

The GTZAN dataset used here comes from the well-known genre classification paper “Musical genre classification of audio signals” by G. Tzanetakis and P. Cook, IEEE Transactions on Speech and Audio Processing, 2002.

The dataset consists of 1000 audio tracks each 30 seconds long. It contains 10 genres, each represented by 100 tracks. The tracks are all 22050 Hz monophonic 16-bit audio files in .wav format.

The dataset can be downloaded from the Marsyas website.

The 10 genres are:

  • Blues
  • Classical
  • Country
  • Disco
  • Hip-hop
  • Jazz
  • Metal
  • Pop
  • Reggae
  • Rock

Each genre contains 100 songs. Total dataset: 1000 songs.

Before moving ahead, I would recommend using Google Colab for doing everything related to Neural networks because it is free and provides GPUs and TPUs as runtime environments.


First of all, we convert the audio files into PNG images (spectrograms). We then extract meaningful features from the audio, i.e. MFCCs, Spectral Centroid, Zero Crossing Rate, Chroma Frequencies, and Spectral Roll-off.

Once the features have been extracted, they can be appended to a CSV file so that an ANN can be used for classification.

So let’s begin.

1. Extract and load your data to Google Drive, then mount the drive in Colab.


Google Colab directory structure after data is loaded.

2. Import all the required libraries.

import librosa
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os
from PIL import Image
import pathlib
import csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import keras
from keras import layers
from keras.models import Sequential
import warnings

3. Now convert the audio files into PNG images, that is, extract the spectrogram for every audio file.

cmap = plt.get_cmap('inferno')
genres = 'blues classical country disco hiphop jazz metal pop reggae rock'.split()
for g in genres:
    # One output folder per genre
    pathlib.Path(f'img_data/{g}').mkdir(parents=True, exist_ok=True)
    for filename in os.listdir(f'./drive/My Drive/genres/{g}'):
        songname = f'./drive/My Drive/genres/{g}/{filename}'
        # Load only the first 5 seconds of each track
        y, sr = librosa.load(songname, mono=True, duration=5)
        plt.specgram(y, NFFT=2048, Fs=2, Fc=0, noverlap=128, cmap=cmap, sides='default', mode='default', scale='dB')
        plt.savefig(f'img_data/{g}/{filename[:-3].replace(".", "")}.png')
        # Clear the figure so spectrograms do not pile up across iterations
        plt.clf()

Sample spectrogram of a song having genre as blues.



With a spectrogram image saved for every audio file, we can move on to extracting features.

4. Creating a header for our CSV file.

header = 'filename chroma_stft rmse spectral_centroid spectral_bandwidth rolloff zero_crossing_rate'
for i in range(1, 21):
    header += f' mfcc{i}'
header += ' label'
header = header.split()

5. Extracting features: using librosa on each audio file, we will extract Mel-frequency cepstral coefficients (MFCCs), Spectral Centroid, Spectral Bandwidth, RMS energy, Zero Crossing Rate, Chroma Frequencies, and Spectral Roll-off.

# Create the CSV file and write the header row once
with open('dataset.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(header)
genres = 'blues classical country disco hiphop jazz metal pop reggae rock'.split()
for g in genres:
    for filename in os.listdir(f'./drive/My Drive/genres/{g}'):
        songname = f'./drive/My Drive/genres/{g}/{filename}'
        y, sr = librosa.load(songname, mono=True, duration=30)
        # librosa.feature.rmse was renamed to librosa.feature.rms in librosa 0.7
        rmse = librosa.feature.rms(y=y)
        chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)
        spec_cent = librosa.feature.spectral_centroid(y=y, sr=sr)
        spec_bw = librosa.feature.spectral_bandwidth(y=y, sr=sr)
        rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
        zcr = librosa.feature.zero_crossing_rate(y=y)
        mfcc = librosa.feature.mfcc(y=y, sr=sr)
        # Summarise every feature by its mean over all frames of the track
        to_append = f'{filename} {np.mean(chroma_stft)} {np.mean(rmse)} {np.mean(spec_cent)} {np.mean(spec_bw)} {np.mean(rolloff)} {np.mean(zcr)}'
        for e in mfcc:
            to_append += f' {np.mean(e)}'
        to_append += f' {g}'
        # Append one row per track
        with open('dataset.csv', 'a', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(to_append.split())

6. Data preprocessing: this involves loading the CSV data, encoding the labels, scaling the features, and splitting the data into training and test sets.

data = pd.read_csv('dataset.csv')
# Dropping unnecessary columns
data = data.drop(['filename'], axis=1)
# Encoding the labels
genre_list = data.iloc[:, -1]
encoder = LabelEncoder()
y = encoder.fit_transform(genre_list)
# Scaling the feature columns
scaler = StandardScaler()
X = scaler.fit_transform(np.array(data.iloc[:, :-1], dtype=float))
# Dividing data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
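
To make the two transforms less of a black box, the sketch below reproduces their effect with plain NumPy on toy data. The values are invented for illustration. It relies on two facts about scikit-learn: LabelEncoder numbers the classes in sorted order, and StandardScaler subtracts each column's mean and divides by its population standard deviation.

```python
import numpy as np

# Toy stand-ins for the label column and two feature columns (illustrative values only)
labels = np.array(['rock', 'blues', 'jazz', 'blues', 'rock'])
features = np.array([[0.10, 2000.0],
                     [0.30, 1500.0],
                     [0.20, 1800.0],
                     [0.40, 1200.0],
                     [0.50, 2500.0]])

# Label encoding: each class name becomes its index in the sorted list of classes
classes, encoded = np.unique(labels, return_inverse=True)

# Standardisation: subtract each column's mean and divide by its standard deviation
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

print(classes)               # sorted class names
print(encoded)               # integer label for every row
print(scaled.mean(axis=0))   # approximately 0 for every column
print(scaled.std(axis=0))    # approximately 1 for every column
```

Scaling matters here because the raw columns live on very different ranges (a zero-crossing rate below 1 next to a centroid in the thousands of Hz), and an unscaled network would be dominated by the largest columns.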

7. Building an ANN model.

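model = Sequential()
model.add(layers.Dense(256, activation='relu', input_shape=(X_train.shape[1],)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
# The compile call is cut off in the source after optimizer='adam'; the loss and metric
# below are assumptions chosen to match the integer labels and the accuracy reported later
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])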
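
For a sense of the model's size, the following sketch counts the trainable parameters of the four Dense layers by hand. It assumes 26 input features, which is what the CSV header from step 4 yields once the filename and label columns are set aside (6 summary features plus 20 MFCC means).

```python
# 26 input features: 6 summary features plus 20 MFCC means, as listed in the CSV header
n_features = 26
layer_sizes = [n_features, 256, 128, 64, 10]

# A Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases
params_per_layer = [n_in * n_out + n_out
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(params_per_layer)

print(params_per_layer)
print(total_params)
```

With only 1000 tracks in the dataset, a network with tens of thousands of parameters can overfit easily, so it is worth comparing training accuracy with accuracy on the held-out test set.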

8. Fit the model

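# The call is cut off in the source; y_train and epochs=100 are assumed from the surrounding text
classifier = model.fit(X_train, y_train, epochs=100)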

After 100 epochs, Accuracy: 0.67


This article shows how to deal with audio data and a few audio analysis techniques from scratch. It also gives a starting point for building speech recognition systems. Although the research cited above shows very promising results for recognition systems, many still don’t see speech recognition as a solved problem because of the following pitfalls:

  1. Speech recognition models presented by the researchers are really big (complex), which makes them hard to train and deploy.
  2. These systems don’t work well when multiple people are talking.
  3. These systems don’t work well when the quality of the audio is bad.
  4. They are really sensitive to the accent of the speaker and thus require training for every different accent.

There are huge opportunities in this field of research. Improvements can be made from the data preparation point of view (by creating better features) and also from the model architecture point of view (by presenting a more robust and scalable deep learning architecture).






Amir Masoud Sefidian
Machine Learning Engineer
