Implementing the librosa audio processing library's melspectrogram and mfcc in C/C++

Table of contents

1. Project structure

2. Dependencies

3. C++ implementation of the librosa audio processing library

(1) Aligning audio file reading

(2) Aligning melspectrogram

(3) Aligning MFCC

4. Running the demo

5. Downloading the librosa C++ source code

librosa is a widely used audio processing library in deep-learning speech processing, but it is currently available only for Python. Speech recognition algorithms often rely on audio features such as the Mel spectrogram (melspectrogram) and Mel-frequency cepstral coefficients (MFCC), so C/C++ versions of melspectrogram and MFCC are needed. Many C/C++ implementations of melspectrogram and MFCC already exist online, but in testing their results differed considerably from those of Python's librosa. After several rounds of optimization and testing, this project implements the load, melspectrogram, and mfcc functions of librosa in C/C++. Its output is largely aligned with the following three functions of the Python librosa library:

librosa.load: reads audio files
librosa.feature.melspectrogram: computes the Mel spectrogram
librosa.feature.mfcc: computes the Mel-frequency cepstral coefficients (MFCC)

[Please respect the original work and cite the source when reposting] https://blog.csdn.net/guyuealian/article/details/132077896

1. Project structure

2. Dependencies

The project requires both Python and C/C++ dependencies.

The Python dependencies can be installed with pip install:

numpy==1.16.3
matplotlib==3.1.0
Pillow==6.0.0
easydict==1.9
opencv-contrib-python==4.5.2.52
opencv-python==4.5.1.48
pandas==1.1.5
PyYAML==5.3.1
scikit-image==0.17.2
scikit-learn==0.24.0
scipy==1.5.4
seaborn==0.11.2
tqdm==4.55.1
xmltodict==0.12.0
pybaseutils==0.7.6
librosa==0.8.1
pyaudio==0.2.11
pydub==0.23.1

C++ dependencies: mainly Eigen3 and OpenCV

Eigen3: used for matrix computation; the project already ships with Eigen3 support, so no installation is needed
OpenCV: used to display images; for installation, see the article "Installing opencv and opencv_contrib on Ubuntu 18.04"

3. C++ implementation of the librosa audio processing library

Features commonly used in speech processing are the Mel spectrogram and the Mel-frequency cepstral coefficients (MFCC). Reference article: https://www.cnblogs.com/Ge-ronimo/p/17281385.html

(1) Aligning audio file reading

In Python, audio files can be read with librosa.load:

data, sr = librosa.load(path, sr=sr, mono=mono)

Reading an audio file in Python:

# -*-coding: utf-8 -*-
import numpy as np
import librosa


def read_audio(audio_file, sr=16000, mono=True):
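    """
    Read an audio file with librosa.
    By default librosa.load downmixes multi-channel audio to mono and returns a 1-D array;
    pass mono=False to keep all channels, in which case it returns a 2-D array
    that is flattened below into interleaved samples.
    :param audio_file: path to the audio file
    :param sr: sampling rate (None keeps the file's native rate)
    :param mono: True for mono; False keeps all channels (e.g. stereo)
    :return: (audio_data, sr)
    """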
    audio_data, sr = librosa.load(audio_file, sr=sr, mono=mono)
    audio_data = audio_data.T.reshape(-1)
    return audio_data, sr


def print_vector(name, data):
    np.set_printoptions(precision=7, suppress=False)
    print("------------------------%s------------------------\n" % name)
    print("{}".format(data.tolist()))


if __name__ == '__main__':
    sr = None
    audio_file = "data/data_s1.wav"
    data, sr = read_audio(audio_file, sr=sr, mono=False)
    print("sr         = %d, data size=%d" % (sr, len(data)))
    print_vector("audio data", data)

Reading audio files in C/C++: the data must be decoded according to the audio file format (see the article "Parsing the WAV file format in C"). This project implements a C/C++ version of audio reading that supports both mono and two-channel (stereo) audio.

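/**
 * Read an audio file; currently only WAV files are supported
 * @param filename path to the WAV file
 * @param out output audio data
 * @param sr output sampling rate
 * @param mono true for mono, otherwise two-channel (stereo)
 * @return status code (negative on failure)
 */
int read_audio(const char *filename, vector<float> &out, int *sr, bool mono = true);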

#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include "librosa/audio_utils.h"
#include "librosa/librosa.h"

using namespace std;

int main() {
    int sr = -1;
    string audio_file = "../data/data_s1.wav";
    vector<float> data;
    int res = read_audio(audio_file.c_str(), data, &sr, false);
    if (res < 0) {
        printf("read wav file error: %s\n", audio_file.c_str());
        return -1;
    }
    // data.size() returns size_t, so print it with %zu rather than %d
    printf("sr         = %d, data size=%zu\n", sr, data.size());
    print_vector("audio data", data);
    return 0;
}
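
To make the decoding step concrete, here is a minimal sketch of reading a 16-bit PCM WAV file using only Python's standard library. It is not the project's C++ code: the function name read_wav_pcm16 is hypothetical, only 16-bit PCM is handled, and the scaling by 1/32768 is an assumption about how integer samples are mapped to floats so that they are comparable with the float output of librosa.load.

```python
import struct
import wave


def read_wav_pcm16(path, mono=True):
    """Read a 16-bit PCM WAV file and return (samples, sample_rate).

    Samples are floats in [-1.0, 1.0). With mono=True the channels of each
    frame are averaged; otherwise the interleaved samples are returned as-is.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    # WAV stores little-endian signed 16-bit integers, interleaved by channel.
    count = len(raw) // 2
    ints = struct.unpack("<%dh" % count, raw[:count * 2])
    # Assumed scaling: dividing by 32768 maps int16 onto [-1.0, 1.0).
    samples = [v / 32768.0 for v in ints]

    if mono and channels > 1:
        # Downmix by averaging the channels of each frame.
        samples = [sum(samples[i:i + channels]) / channels
                   for i in range(0, len(samples), channels)]
    return samples, sample_rate
```

With mono=False the result is a flat, interleaved list, which mirrors what the Python read_audio above produces after audio_data.T.reshape(-1).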

The Python and C++ versions were tested and compared on the same audio file. After several rounds of testing, the difference between the sample values read by the two is very small, so the C++ version is largely aligned with librosa.load() from the Python librosa library.

[Figure: numerical comparison of the audio data read by the C++ version and the Python version]
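
One simple way to quantify "very small" is the maximum absolute difference between the two outputs. The sketch below is only an illustration: the helper names and the tolerance of 1e-5 are assumptions, not values taken from the project.

```python
def max_abs_diff(a, b):
    """Return the largest absolute element-wise difference between two sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def is_aligned(a, b, tol=1e-5):
    """Treat two outputs as aligned when every element differs by at most tol.

    The default tolerance is a hypothetical choice for float32 audio data.
    """
    return max_abs_diff(a, b) <= tol
```

To use it, the values printed by the C++ print_vector and by the Python print_vector can be pasted into two lists and passed to is_aligned.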

(2) Aligning the Mel spectrogram (melspectrogram)

For the principles behind the Mel spectrogram, see the article "Audio Signal Classification and Recognition Based on Mel Spectrum (Pytorch)".

Python's librosa library provides the librosa.feature.melspectrogram() function, which returns a two-dimensional array that can be displayed with OpenCV:

def librosa_feature_melspectrogram(y,
                                   sr=16000,
                                   n_mels=128,
                                   n_fft=2048,
                                   hop_length=256,
                                   win_length=None,
                                   window="hann",
                                   center=True,
                                   pad_mode="reflect",
                                   power=2.0,
                                   fmin=0.0,
                                   fmax=None,
                                   **kwargs):
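    """
    Compute the Mel spectrogram of an audio signal.
    :param y: audio time series
    :param sr: sampling rate
    :param n_mels: number of Mel bands to generate
    :param n_fft: length of the FFT window
    :param hop_length: number of samples between successive frames (frame shift)
    :param win_length: window length; defaults to n_fft when None
    :param window: window function, e.g. "hann"
    :param center: if True, pad y so that frame t is centered at y[t * hop_length];
                   if False, frame t begins at y[t * hop_length]
    :param pad_mode: padding mode used when center=True
    :param power: exponent for the magnitude spectrogram, e.g. 1 for energy, 2 for power
    :param fmin: lowest frequency (Hz)
    :param fmax: highest frequency (Hz); if None, fmax = sr / 2.0 is used
    :param kwargs: extra keyword arguments passed to librosa
    :return: Mel spectrogram with shape=(n_mels, n_frames); n_mels is the Mel frequency
             dimension (frequency axis) and n_frames is the number of time frames (time axis)
    """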
    mel = librosa.feature.melspectrogram(y=y,
                                         sr=sr,
                                         S=None,
                                         n_mels=n_mels,
                                         n_fft=n_fft,
                                         hop_length=hop_length,
                                         win_length=win_length,
                                         window=window,
                                         center=center,
                                         pad_mode=pad_mode,
                                         power=power,
                                         fmin=fmin,
                                         fmax=fmax,
                                         **kwargs)
    return mel
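
The n_frames dimension of the returned array follows from n_fft, hop_length, and center. As a rough sketch of that relationship (the function name num_frames is hypothetical, and returning 0 for a too-short signal is a simplification of this sketch rather than librosa's behavior):

```python
def num_frames(n_samples, n_fft, hop_length, center=True):
    """Number of STFT frames for a signal of n_samples samples.

    With center=True the signal is first padded by n_fft // 2 on both sides,
    so every original sample can sit at the center of some frame.
    """
    if center:
        n_samples += 2 * (n_fft // 2)
    if n_samples < n_fft:
        # Simplification for this sketch: no complete frame fits.
        return 0
    return 1 + (n_samples - n_fft) // hop_length


# One second of audio at 16 kHz (an assumed input) with n_fft=400, hop_length=160:
print(num_frames(16000, 400, 160, center=False))  # 98
print(num_frames(16000, 400, 160, center=True))   # 101
```

Comparing this count between the Python and C++ outputs is a quick first check before comparing individual values.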

Based on the Python librosa.feature.melspectrogram(), the project implements a C++ version of melspectrogram:

/***
 * compute mel spectrogram similar to librosa.feature.melspectrogram
 * @param x      input audio signal
 * @param sr     sample rate of 'x'
 * @param n_fft  FFT size
 * @param n_hop  number of samples between successive frames
 * @param win    window function. currently only supports 'hann'
 * @param center same as librosa
 * @param mode   pad mode. supports "reflect","symmetric","edge"
 * @param power  exponent for the magnitude melspectrogram
 * @param n_mels number of mel bands
 * @param fmin   lowest frequency (in Hz)
 * @param fmax   highest frequency (in Hz)
 * @return   mel spectrogram matrix
 */
static std::vector<std::vector<float>> melspectrogram(std::vector<float> &x, int sr,
                                                      int n_fft, int n_hop, const std::string &win, bool center,
                                                      const std::string &mode,
                                                      float power, int n_mels, int fmin, int fmax);
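
The three pad modes named in the declaration behave like the NumPy padding modes of the same names. The following sketch illustrates them on a plain Python list; pad_signal is a hypothetical helper, not the project's code, and unlike NumPy it rejects pad widths longer than the signal allows.

```python
def pad_signal(x, pad, mode="reflect"):
    """Pad the list x with `pad` samples on each side.

    reflect:   mirror around the edge sample without repeating it
    symmetric: mirror around the edge, repeating the edge sample
    edge:      repeat the first and last sample
    """
    n = len(x)
    if n == 0:
        raise ValueError("cannot pad an empty signal")
    if mode == "edge":
        left = [x[0]] * pad
        right = [x[-1]] * pad
    elif mode == "reflect":
        if pad > n - 1:
            raise ValueError("pad too large for reflect in this sketch")
        left = [x[i] for i in range(pad, 0, -1)]
        right = [x[n - 1 - i] for i in range(1, pad + 1)]
    elif mode == "symmetric":
        if pad > n:
            raise ValueError("pad too large for symmetric in this sketch")
        left = [x[i] for i in range(pad - 1, -1, -1)]
        right = [x[n - 1 - i] for i in range(pad)]
    else:
        raise ValueError("unsupported pad mode: %s" % mode)
    return left + list(x) + right


print(pad_signal([1, 2, 3, 4], 2, "reflect"))    # [3, 2, 1, 2, 3, 4, 3, 2]
print(pad_signal([1, 2, 3, 4], 2, "symmetric"))  # [2, 1, 1, 2, 3, 4, 4, 3]
print(pad_signal([1, 2, 3, 4], 2, "edge"))       # [1, 1, 1, 2, 3, 4, 4, 4]
```

When center is true, the signal would be padded this way with pad = n_fft / 2 before framing, so a mismatch in pad mode shows up mainly in the first and last few frames.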

The Python and C++ versions of melspectrogram were tested and compared: the difference between their return values is very small, and the visualized Mel spectrograms are essentially the same.

[Table: numerical comparison of the melspectrogram output, C++ version vs. Python version]
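
One detail that often explains large differences between a hand-written implementation and librosa is the Mel scale itself. librosa defaults to the Slaney-style scale (htk=False), which is linear below 1000 Hz and logarithmic above, whereas many textbook implementations use the HTK formula. The sketch below shows both conversions; whether the project's C++ code is structured this way is not stated in the article, so treat it as an illustration only.

```python
import math

F_SP = 200.0 / 3.0                 # Hz per mel in the linear region
MIN_LOG_HZ = 1000.0                # start of the logarithmic region (Hz)
MIN_LOG_MEL = MIN_LOG_HZ / F_SP    # the same point in mels (about 15)
LOGSTEP = math.log(6.4) / 27.0     # step size in the logarithmic region


def hz_to_mel(hz, htk=False):
    """Convert a frequency in Hz to mels (Slaney scale by default)."""
    if htk:
        return 2595.0 * math.log10(1.0 + hz / 700.0)
    if hz >= MIN_LOG_HZ:
        return MIN_LOG_MEL + math.log(hz / MIN_LOG_HZ) / LOGSTEP
    return hz / F_SP


def mel_to_hz(mel, htk=False):
    """Convert mels back to Hz (inverse of hz_to_mel)."""
    if htk:
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    if mel >= MIN_LOG_MEL:
        return MIN_LOG_HZ * math.exp(LOGSTEP * (mel - MIN_LOG_MEL))
    return mel * F_SP
```

The filter bank edges are obtained by spacing n_mels + 2 points evenly in mels between hz_to_mel(fmin) and hz_to_mel(fmax) and mapping them back with mel_to_hz, so the two scales place the filters at different frequencies.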

(3) Aligning the Mel-frequency cepstral coefficients (MFCC)

In Python, MFCC (Mel-frequency cepstral coefficients) can be computed with librosa.feature.mfcc from the librosa library:

def librosa_feature_mfcc(y,
                         sr=16000,
                         n_mfcc=128,
                         n_mels=128,
                         n_fft=2048,
                         hop_length=256,
                         win_length=None,
                         window="hann",
                         center=True,
                         pad_mode="reflect",
                         power=2.0,
                         fmin=0.0,
                         fmax=None,
                         dct_type=2,
                         **kwargs):
    """
    Compute the MFCC of an audio signal.
    :param y: audio time series
    :param sr: sampling rate
    :param n_mfcc: number of MFCCs to return
    :param n_mels: number of Mel bands to generate
    :param n_fft: length of the FFT window
    :param hop_length: number of samples between successive frames (frame shift)
    :param win_length: window length; defaults to n_fft when None
    :param window: window function, e.g. "hann"
    :param center: if True, pad y so that frame t is centered at y[t * hop_length];
                   if False, frame t begins at y[t * hop_length]
    :param pad_mode: padding mode used when center=True
    :param power: exponent for the magnitude spectrogram, e.g. 1 for energy, 2 for power
    :param fmin: lowest frequency (Hz)
    :param fmax: highest frequency (Hz); if None, fmax = sr / 2.0 is used
    :param kwargs: extra keyword arguments passed to librosa
    :return: MFCC with shape=(n_mfcc, n_frames)
    """
    # MFCC: Mel-frequency cepstral coefficients
    mfcc = librosa.feature.mfcc(y=y,
                                sr=sr,
                                S=None,
                                n_mfcc=n_mfcc,
                                n_mels=n_mels,
                                n_fft=n_fft,
                                hop_length=hop_length,
                                win_length=win_length,
                                window=window,
                                center=center,
                                pad_mode=pad_mode,
                                power=power,
                                fmin=fmin,
                                fmax=fmax,
                                dct_type=dct_type,
                                **kwargs)
    return mfcc
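
Internally, librosa.feature.mfcc applies a discrete cosine transform along the Mel axis of the log-power Mel spectrogram and keeps the first n_mfcc coefficients. The sketch below shows a direct (slow but readable) orthonormal DCT-II for a single frame; the function names are hypothetical and the O(N^2) loop is chosen for clarity rather than speed.

```python
import math


def dct2_ortho(x):
    """Orthonormal DCT-II of a sequence (same scaling as norm="ortho")."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        # The k = 0 term gets a smaller scale factor so the basis is orthonormal.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def mfcc_from_log_mel(log_mel_frame, n_mfcc):
    """Keep the first n_mfcc DCT coefficients of one log-Mel frame."""
    return dct2_ortho(log_mel_frame)[:n_mfcc]
```

Because the basis is orthonormal, the transform preserves the sum of squares of the frame, which gives a cheap sanity check when porting the DCT to another language.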

Based on the Python librosa.feature.mfcc(), the project implements a C++ version of MFCC:

/***
 * compute mfcc similar to librosa.feature.mfcc
 * @param x      input audio signal
 * @param sr     sample rate of 'x'
 * @param n_fft  FFT size
 * @param n_hop  number of samples between successive frames
 * @param win    window function. currently only supports 'hann'
 * @param center same as librosa
 * @param mode   pad mode. supports "reflect","symmetric","edge"
 * @param power  exponent for the magnitude melspectrogram
 * @param n_mels number of mel bands
 * @param fmin   lowest frequency (in Hz)
 * @param fmax   highest frequency (in Hz)
 * @param n_mfcc number of mfccs
 * @param norm   ortho-normal dct basis
 * @param type   dct type. currently only supports 'type-II'
 * @return mfcc matrix
 */
static std::vector<std::vector<float>> mfcc(std::vector<float> &x, int sr,
                                            int n_fft, int n_hop, const std::string &win, bool center, const std::string &mode,
                                            float power, int n_mels, int fmin, int fmax,
                                            int n_mfcc, bool norm, int type);
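
Before the DCT, librosa converts the Mel power spectrogram to decibels with librosa.power_to_db. A port that skips or alters this step will not match. The sketch below reproduces that conversion on a flat list using librosa's default parameters (ref=1.0, amin=1e-10, top_db=80.0); the function here is an illustration, not the project's C++ code.

```python
import math


def power_to_db(values, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert power values to decibels relative to ref.

    `values` is a flat list; for a 2-D spectrogram, flatten it first so that
    the top_db floor is computed from the maximum of the whole matrix.
    """
    offset = 10.0 * math.log10(max(amin, ref))
    # amin guards against log10(0) for silent bins.
    db = [10.0 * math.log10(max(amin, v)) - offset for v in values]
    if top_db is not None and db:
        # Clip everything more than top_db below the peak.
        floor = max(db) - top_db
        db = [max(d, floor) for d in db]
    return db
```

For example, the powers 1.0, 0.1, and 1e-12 map to roughly 0 dB, -10 dB, and -80 dB: the last value would be -100 dB after the amin guard, but it is clipped to 80 dB below the peak.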

The Python and C++ versions of MFCC were tested and compared: the difference between their return values is very small, and the visualized MFCC images are essentially the same.

[Table: numerical comparison of the MFCC output, C++ version vs. Python version]

4. Running the demo

To run the C++ test demo, open a terminal in the project root directory and run: bash build.sh

#!/usr/bin/env bash
if [ ! -d "build/" ];then
  mkdir "build"
else
  echo "exist build"
fi
cd build
cmake ..
make -j4
sleep 1

./main

The main function:

/****
 *   @Author : [email protected]
 *   @E-mail :
 *   @Date   :
 *   @Brief  : C/C++ implementation of melspectrogram and MFCC
 */
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include "librosa/audio_utils.h"
#include "librosa/librosa.h"
#include "librosa/cv_utils.h"

using namespace std;


int main() {
    int sr = -1;
    int n_fft = 400;
    int hop_length = 160;
    int n_mel = 64;
    int fmin = 80;
    int fmax = 7600;
    int n_mfcc = 64;
    int dct_type = 2;
    float power = 2.f;
    bool center = false;
    bool norm = true;
    string window = "hann";
    string pad_mode = "reflect";

    //string audio_file = "../data/data_d2.wav";
    string audio_file = "../data/data_s1.wav";
    vector<float> data;
    int res = read_audio(audio_file.c_str(), data, &sr, false);
    if (res < 0) {
        printf("read wav file error: %s\n", audio_file.c_str());
        return -1;
    }
    printf("n_fft      = %d\n", n_fft);
    printf("n_mel      = %d\n", n_mel);
    printf("hop_length = %d\n", hop_length);
    printf("fmin, fmax = (%d,%d)\n", fmin, fmax);
    // data.size() returns size_t, so print it with %zu rather than %d
    printf("sr         = %d, data size=%zu\n", sr, data.size());
    //print_vector("audio data", data);


    // compute the mel spectrogram
    vector<vector<float>> mels_feature = librosa::Feature::melspectrogram(data, sr, n_fft, hop_length, window,
                                                                          center, pad_mode, power, n_mel, fmin, fmax);
    int mels_w = (int) mels_feature.size();
    int mels_h = (int) mels_feature[0].size();
    cv::Mat mels_image = vector2mat<float>(get_vector(mels_feature), 1, mels_h);
    print_feature("mels_feature", mels_feature);
    printf("mels_feature size(n_frames,n_mels)=(%d,%d)\n", mels_w, mels_h);
    image_show("mels_feature(C++)", mels_image, 10);

    // compute MFCC
    vector<vector<float>> mfcc_feature = librosa::Feature::mfcc(data, sr, n_fft, hop_length, window, center, pad_mode,
                                                                power, n_mel, fmin, fmax, n_mfcc, norm, dct_type);
    int mfcc_w = (int) mfcc_feature.size();
    int mfcc_h = (int) mfcc_feature[0].size();
    cv::Mat mfcc_image = vector2mat<float>(get_vector(mfcc_feature), 1, mfcc_h);
    print_feature("mfcc_feature", mfcc_feature);
    printf("mfcc_feature size(n_frames,n_mfcc)=(%d,%d)\n", mfcc_w, mfcc_h);
    image_show("mfcc_feature(C++)", mfcc_image, 10);


    cv::waitKey(0);
    printf("finish...");
    return 0;
}
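
Note that the C++ demo reports its features as (n_frames, n_mels), whereas librosa returns arrays shaped (n_mels, n_frames). To compare the two outputs element by element, one of them has to be transposed first. A minimal sketch on nested lists, with hypothetical helper names:

```python
def transpose(matrix):
    """Swap the rows and columns of a rectangular nested list."""
    return [list(column) for column in zip(*matrix)]


def max_abs_diff_2d(a, b):
    """Largest absolute element-wise difference between two equal-shaped matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return max((abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)),
               default=0.0)


# cpp_feature is laid out as (n_frames, n_mels); py_feature as (n_mels, n_frames).
cpp_feature = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
py_feature = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
print(max_abs_diff_2d(transpose(cpp_feature), py_feature))  # 0.0
```

The toy matrices above are made up for illustration; in practice the values would come from the print_feature output of each demo.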

To run the Python test demo, open a terminal in the project root directory and run: python main.py

# -*-coding: utf-8 -*-
"""
    @Author :
    @E-mail : 
    @Date   : 2023-08-01 22:27:56
    @Brief  :
"""
import cv2
import numpy as np
import librosa


def cv_show_image(title, image, use_rgb=False, delay=0):
    """
    Display an image with OpenCV.
    :param title: window title
    :param image: input image
    :param use_rgb: True if the input image is RGB, False if it is BGR
    :param delay: delay=0 waits for a key press; delay>0 waits for delay milliseconds
    :return: the displayed image
    """
    img = image.copy()
    if img.shape[-1] == 3 and use_rgb:
        img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)  # convert RGB to BGR for OpenCV
    # cv2.namedWindow(title, flags=cv2.WINDOW_AUTOSIZE)
    cv2.namedWindow(title, flags=cv2.WINDOW_NORMAL)
    cv2.imshow(title, img)
    cv2.waitKey(delay)
    return img


def librosa_feature_melspectrogram(y,
                                   sr=16000,
                                   n_mels=128,
                                   n_fft=2048,
                                   hop_length=256,
                                   win_length=None,
                                   window="hann",
                                   center=True,
                                   pad_mode="reflect",
                                   power=2.0,
                                   fmin=0.0,
                                   fmax=None,
                                   **kwargs):
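    """
    Compute the Mel spectrogram of an audio signal.
    :param y: audio time series
    :param sr: sampling rate
    :param n_mels: number of Mel bands to generate
    :param n_fft: length of the FFT window
    :param hop_length: number of samples between successive frames (frame shift)
    :param win_length: window length; defaults to n_fft when None
    :param window: window function, e.g. "hann"
    :param center: if True, pad y so that frame t is centered at y[t * hop_length];
                   if False, frame t begins at y[t * hop_length]
    :param pad_mode: padding mode used when center=True
    :param power: exponent for the magnitude spectrogram, e.g. 1 for energy, 2 for power
    :param fmin: lowest frequency (Hz)
    :param fmax: highest frequency (Hz); if None, fmax = sr / 2.0 is used
    :param kwargs: extra keyword arguments passed to librosa
    :return: Mel spectrogram with shape=(n_mels, n_frames); n_mels is the Mel frequency
             dimension (frequency axis) and n_frames is the number of time frames (time axis)
    """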
    mel = librosa.feature.melspectrogram(y=y,
                                         sr=sr,
                                         S=None,
                                         n_mels=n_mels,
                                         n_fft=n_fft,
                                         hop_length=hop_length,
                                         win_length=win_length,
                                         window=window,
                                         center=center,
                                         pad_mode=pad_mode,
                                         power=power,
                                         fmin=fmin,
                                         fmax=fmax,
                                         **kwargs)
    return mel


def librosa_feature_mfcc(y,
                         sr=16000,
                         n_mfcc=128,
                         n_mels=128,
                         n_fft=2048,
                         hop_length=256,
                         win_length=None,
                         window="hann",
                         center=True,
                         pad_mode="reflect",
                         power=2.0,
                         fmin=0.0,
                         fmax=None,
                         dct_type=2,
                         **kwargs):
    """
    Compute the MFCC of an audio signal.
    :param y: audio time series
    :param sr: sampling rate
    :param n_mfcc: number of MFCCs to return
    :param n_mels: number of Mel bands to generate
    :param n_fft: length of the FFT window
    :param hop_length: number of samples between successive frames (frame shift)
    :param win_length: window length; defaults to n_fft when None
    :param window: window function, e.g. "hann"
    :param center: if True, pad y so that frame t is centered at y[t * hop_length];
                   if False, frame t begins at y[t * hop_length]
    :param pad_mode: padding mode used when center=True
    :param power: exponent for the magnitude spectrogram, e.g. 1 for energy, 2 for power
    :param fmin: lowest frequency (Hz)
    :param fmax: highest frequency (Hz); if None, fmax = sr / 2.0 is used
    :param kwargs: extra keyword arguments passed to librosa
    :return: MFCC with shape=(n_mfcc, n_frames)
    """
    # MFCC: Mel-frequency cepstral coefficients
    mfcc = librosa.feature.mfcc(y=y,
                                sr=sr,
                                S=None,
                                n_mfcc=n_mfcc,
                                n_mels=n_mels,
                                n_fft=n_fft,
                                hop_length=hop_length,
                                win_length=win_length,
                                window=window,
                                center=center,
                                pad_mode=pad_mode,
                                power=power,
                                fmin=fmin,
                                fmax=fmax,
                                dct_type=dct_type,
                                **kwargs)
    return mfcc


def read_audio(audio_file, sr=16000, mono=True):
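    """
    Read an audio file with librosa.
    By default librosa.load downmixes multi-channel audio to mono and returns a 1-D array;
    pass mono=False to keep all channels, in which case it returns a 2-D array
    that is flattened below into interleaved samples.
    :param audio_file: path to the audio file
    :param sr: sampling rate (None keeps the file's native rate)
    :param mono: True for mono; False keeps all channels (e.g. stereo)
    :return: (audio_data, sr)
    """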
    audio_data, sr = librosa.load(audio_file, sr=sr, mono=mono)
    audio_data = audio_data.T.reshape(-1)
    return audio_data, sr


def print_feature(name, feature):
    h, w = feature.shape[:2]
    np.set_printoptions(precision=7, suppress=True, linewidth=(11 + 3) * w)
    print("------------------------{}------------------------".format(name))
    for i in range(w):
        v = feature[:, i].reshape(-1)
        print("data[{:0=3d},:]={}".format(i, v))


def print_vector(name, data):
    np.set_printoptions(precision=7, suppress=False)
    print("------------------------%s------------------------\n" % name)
    print("{}".format(data.tolist()))


if __name__ == '__main__':
    sr = None
    n_fft = 400
    hop_length = 160
    n_mel = 64
    fmin = 80
    fmax = 7600
    n_mfcc = 64
    dct_type = 2
    power = 2.0
    center = False
    norm = True
    window = "hann"
    pad_mode = "reflect"
    audio_file = "data/data_s1.wav"
    data, sr = read_audio(audio_file, sr=sr, mono=False)
    print("n_fft      = %d" % n_fft)
    print("n_mel      = %d" % n_mel)
    print("hop_length = %d" % hop_length)
    print("fmin, fmax = (%d,%d)" % (fmin, fmax))
    print("sr         = %d, data size=%d" % (sr, len(data)))
    # print_vector("audio data", data)
    mels_feature = librosa_feature_melspectrogram(y=data,
                                                  sr=sr,
                                                  n_mels=n_mel,
                                                  n_fft=n_fft,
                                                  hop_length=hop_length,
                                                  win_length=None,
                                                  fmin=fmin,
                                                  fmax=fmax,
                                                  window=window,
                                                  center=center,
                                                  pad_mode=pad_mode,
                                                  power=power)
    print_feature("mels_feature", mels_feature)
    print("mels_feature size(n_frames,n_mels)=({},{})".format(mels_feature.shape[1], mels_feature.shape[0]))
    cv_show_image("mels_feature(Python)", mels_feature, delay=10)

    mfcc_feature = librosa_feature_mfcc(y=data,
                                        sr=sr,
                                        n_mfcc=n_mfcc,
                                        n_mels=n_mel,
                                        n_fft=n_fft,
                                        hop_length=hop_length,
                                        win_length=None,
                                        fmin=fmin,
                                        fmax=fmax,
                                        window=window,
                                        center=center,
                                        pad_mode=pad_mode,
                                        power=power,
                                        dct_type=dct_type)
    print_feature("mfcc_feature", mfcc_feature)
    print("mfcc_feature size(n_frames,n_mfcc)=({},{})".format(mfcc_feature.shape[1], mfcc_feature.shape[0]))
    cv_show_image("mfcc_feature(Python)", mfcc_feature, delay=10)

    cv2.waitKey(0)

5. Downloading the librosa C++ source code

Download link for the project code: C/C++ implements librosa audio processing library melspectrogram and mfcc

The project source code includes:

A C++ read_audio() function for reading audio files; it currently supports only WAV files, with mono or two-channel audio

A C++ librosa::Feature::melspectrogram() that implements the melspectrogram function

A C++ librosa::Feature::mfcc() that implements the MFCC function

An OpenCV-based display of the computed feature maps

The project demo comes with test data and can be run as soon as the build completes