Welcome to the Norbert API documentation!


Norbert is an implementation of the multichannel Wiener filter, a popular method for filtering multichannel audio in the time-frequency domain, with applications notably in speech enhancement and source separation.

This filtering method assumes you have some way of estimating the (nonnegative) spectrograms for all the audio sources composing a mixture. If you only have a model for some target sources, and not for the rest, you may use norbert.contrib.residual_model() to let Norbert create a residual model for you.

Given all source spectrograms and the mixture time-frequency representation, this repository can build and apply the filter appropriate for separation, optimally exploiting multichannel information (as in stereo signals). This is done through an iterative procedure, Expectation-Maximization (EM), which alternates between filtering and re-estimation of the model parameters.
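As a concrete illustration, here is a minimal sketch of a typical workflow, with random arrays standing in for a real mixture STFT and for the spectrogram estimates that a separation model would produce:

    import numpy as np
    import norbert

    nb_frames, nb_bins, nb_channels, nb_sources = 100, 1025, 2, 4

    # complex STFT of the stereo mixture (in practice, e.g. from scipy.signal.stft)
    x = (np.random.randn(nb_frames, nb_bins, nb_channels)
         + 1j * np.random.randn(nb_frames, nb_bins, nb_channels))

    # nonnegative spectrogram estimates for each source, e.g. from a DNN
    v = np.abs(np.random.randn(nb_frames, nb_bins, nb_channels, nb_sources))

    # build and apply the multichannel Wiener filter, with one EM iteration
    y = norbert.wiener(v, x, iterations=1)
    print(y.shape)  # (100, 1025, 2, 4): one complex STFT per source

The estimated source STFTs in y can then be inverted back to the time domain with the inverse STFT of your choice.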

The core functions implemented in Norbert are:

norbert.wiener(v, x[, iterations, …])

Wiener-based separation for multichannel audio.

norbert.contrib.residual_model(v, x[, …])

Compute a model for the residual based on spectral subtraction.

norbert.softmask(v, x[, logit, eps])

Separates a mixture with a ratio mask, using the provided source spectrogram estimates.

API documentation

norbert.expectation_maximization(y, x, iterations=2, verbose=0, eps=None)

Expectation maximization algorithm, for refining source separation estimates.

This algorithm improves source separation results by enforcing multichannel consistency of the estimates, which usually translates into better perceptual quality with respect to spatial artifacts.

The implementation follows the details presented in [1], taking inspiration from the original EM algorithm proposed in [2] and its weighted refinement proposed in [3], [4]. It works by iterating over two steps:

  • Re-estimating the source parameters (power spectral densities and spatial covariance matrices) through get_local_gaussian_model().

  • Separating the mixture again with the new parameters: first computing the new modelled mixture covariance matrices with get_mix_model(), then preparing the Wiener filters through wiener_gain() and applying them with apply_filter().

Parameters
y: np.ndarray [shape=(nb_frames, nb_bins, nb_channels, nb_sources)]

initial estimates for the sources

x: np.ndarray [shape=(nb_frames, nb_bins, nb_channels)]

complex STFT of the mixture signal

iterations: int [scalar]

number of iterations for the EM algorithm.

verbose: boolean

display some information if True

eps: float or None [scalar]

The epsilon value to use for regularization and filters. If None, the machine epsilon of the dtype of np.real(x) is used.

Returns
y: np.ndarray [shape=(nb_frames, nb_bins, nb_channels, nb_sources)]

estimated sources after iterations

v: np.ndarray [shape=(nb_frames, nb_bins, nb_sources)]

estimated power spectral densities

R: np.ndarray [shape=(nb_bins, nb_channels, nb_channels, nb_sources)]

estimated spatial covariance matrices
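Example (a minimal sketch, using random data in place of a real mixture STFT, and crude initial estimates that simply share the mixture equally among the sources):

    import numpy as np
    import norbert

    nb_frames, nb_bins, nb_channels, nb_sources = 50, 513, 2, 3

    x = (np.random.randn(nb_frames, nb_bins, nb_channels)
         + 1j * np.random.randn(nb_frames, nb_bins, nb_channels))

    # crude initial estimates: share the mixture equally among the sources
    y0 = np.tile(x[..., None], (1, 1, 1, nb_sources)) / nb_sources

    # two EM iterations refine the estimates and return the model parameters
    y, v, R = norbert.expectation_maximization(y0, x, iterations=2)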

References

1

S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji. “Improving music source separation based on deep neural networks through data augmentation and network blending.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.

2

N.Q. Duong, E. Vincent, and R. Gribonval. “Under-determined reverberant audio source separation using a full-rank spatial covariance model.” IEEE Transactions on Audio, Speech, and Language Processing 18.7 (2010): 1830-1840.

3

A. Nugraha, A. Liutkus, and E. Vincent. “Multichannel audio source separation with deep neural networks.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 24.9 (2016): 1652-1664.

4

A. Nugraha, A. Liutkus, and E. Vincent. “Multichannel music separation with deep neural networks.” 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016.

5

A. Liutkus, R. Badeau, and G. Richard. “Kernel additive models for source separation.” IEEE Transactions on Signal Processing 62.16 (2014): 4298-4310.

norbert.wiener(v, x, iterations=1, use_softmask=True, eps=None)

Wiener-based separation for multichannel audio.

The method uses the (possibly multichannel) spectrograms v of the sources to separate the (complex) Short-Time Fourier Transform x of the mix. Separation is done sequentially by:

  • Getting an initial estimate. This can be done in two ways: either by directly using the spectrograms with the mixture phase, or by using softmask().

  • Refining these initial estimates through a call to expectation_maximization().

This implementation also allows specifying the epsilon value used for regularization. It is based on [1], [2], [3], [4].

Parameters
v: np.ndarray [shape=(nb_frames, nb_bins, {1,nb_channels}, nb_sources)]

spectrograms of the sources. This is a nonnegative tensor, usually the output of the user's actual separation method. The spectrograms may be mono, but they need to be 4-dimensional in all cases.

x: np.ndarray [complex, shape=(nb_frames, nb_bins, nb_channels)]

STFT of the mixture signal.

iterations: int [scalar]

number of iterations for the EM algorithm

use_softmask: boolean

  • if False, the mixture phase will be used directly with the spectrograms as initial estimates.

  • if True, a softmasking strategy will be used as described in softmask().

eps: {None, float}

Epsilon value to use for computing the separations. This is used whenever division by a model energy is performed, i.e. when softmasking and when iterating the EM. It can be understood as the energy of the additional white noise that is taken out when separating. If None, the default value is taken as np.finfo(np.real(x[0])).eps.

Returns
y: np.ndarray [complex, shape=(nb_frames, nb_bins, nb_channels, nb_sources)]

STFT of estimated sources
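Example (a minimal sketch with random stand-ins for the mixture STFT and the source spectrograms, contrasting the two initialization strategies):

    import numpy as np
    import norbert

    nb_frames, nb_bins, nb_channels, nb_sources = 100, 513, 2, 4

    x = (np.random.randn(nb_frames, nb_bins, nb_channels)
         + 1j * np.random.randn(nb_frames, nb_bins, nb_channels))
    v = np.abs(np.random.randn(nb_frames, nb_bins, nb_channels, nb_sources))

    # mixture phase combined directly with the spectrograms as initialization
    y_direct = norbert.wiener(v, x, iterations=1, use_softmask=False)

    # softmask-based initialization, followed by the same EM refinement
    y_soft = norbert.wiener(v, x, iterations=1, use_softmask=True)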

References

1

S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji. “Improving music source separation based on deep neural networks through data augmentation and network blending.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.

2

A. Nugraha, A. Liutkus, and E. Vincent. “Multichannel audio source separation with deep neural networks.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 24.9 (2016): 1652-1664.

3

A. Nugraha, A. Liutkus, and E. Vincent. “Multichannel music separation with deep neural networks.” 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016.

4

A. Liutkus, R. Badeau, and G. Richard. “Kernel additive models for source separation.” IEEE Transactions on Signal Processing 62.16 (2014): 4298-4310.

norbert.softmask(v, x, logit=None, eps=None)

Separates a mixture with a ratio mask, using the provided source spectrogram estimates. It additionally allows compressing the mask with a logit function for soft binarization. The filter does not take multichannel correlations into account.

The masking strategy can be traced back to the work of N. Wiener in the case of power spectrograms [1]. In the case of fractional spectrograms such as magnitudes, this filter is often referred to as a “ratio mask”, and it has been shown to be the optimal separation procedure under alpha-stable assumptions [2].

Parameters
v: np.ndarray [shape=(nb_frames, nb_bins, nb_channels, nb_sources)]

spectrograms of the sources

x: np.ndarray [shape=(nb_frames, nb_bins, nb_channels)]

mixture signal

logit: {None, float between 0 and 1}

enable a compression of the filter. If not None, it is the threshold value for the logit function: a softmask above this threshold is brought closer to 1, and a softmask below is brought closer to 0.

Returns
ndarray, shape=(nb_frames, nb_bins, nb_channels, nb_sources)

estimated sources
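Example (a minimal sketch with random stand-ins):

    import numpy as np
    import norbert

    nb_frames, nb_bins, nb_channels, nb_sources = 100, 513, 2, 3

    x = (np.random.randn(nb_frames, nb_bins, nb_channels)
         + 1j * np.random.randn(nb_frames, nb_bins, nb_channels))
    v = np.abs(np.random.randn(nb_frames, nb_bins, nb_channels, nb_sources))

    # plain ratio mask
    y = norbert.softmask(v, x)

    # soft binarization: mask values above 0.5 are pushed towards 1,
    # values below 0.5 towards 0
    y_logit = norbert.softmask(v, x, logit=0.5)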

References

1

N. Wiener. “Extrapolation, Interpolation, and Smoothing of Stationary Time Series.” 1949.

2

A. Liutkus and R. Badeau. “Generalized Wiener filtering with fractional power spectrograms.” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015.

norbert.wiener_gain(v_j, R_j, inv_Cxx)

Compute the Wiener gain for separating one source, given all parameters. It is the matrix applied to the mix to get the posterior mean of the source, as in [1].

Parameters
v_j: np.ndarray [shape=(nb_frames, nb_bins)]

power spectral density of the target source.

R_j: np.ndarray [shape=(nb_bins, nb_channels, nb_channels)]

spatial covariance matrix of the target source

inv_Cxx: np.ndarray [shape=(nb_frames, nb_bins, nb_channels, nb_channels)]

inverse of the mixture covariance matrices

Returns
G: np.ndarray [shape=(nb_frames, nb_bins, nb_channels, nb_channels)]

Wiener filtering matrices, to apply to the mix, e.g. through apply_filter(), to get the target source estimate.
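For reference, the gain amounts to G = v_j · R_j · inv_Cxx at each time-frequency bin. A sketch of that product in numpy, assuming v_j has the (nb_frames, nb_bins) shape returned by get_local_gaussian_model(), with identity matrices as innocuous stand-ins for the covariance quantities (the index convention is an assumption):

    import numpy as np

    nb_frames, nb_bins, nb_channels = 10, 257, 2

    v_j = np.abs(np.random.randn(nb_frames, nb_bins))
    R_j = np.tile(np.eye(nb_channels, dtype=complex), (nb_bins, 1, 1))
    inv_Cxx = np.tile(np.eye(nb_channels, dtype=complex),
                      (nb_frames, nb_bins, 1, 1))

    # G[t, f] = v_j[t, f] * R_j[f] @ inv_Cxx[t, f]
    G = v_j[..., None, None] * np.einsum('fac,tfcb->tfab', R_j, inv_Cxx)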

References

1

N.Q. Duong, E. Vincent, and R. Gribonval. “Under-determined reverberant audio source separation using a full-rank spatial covariance model.” IEEE Transactions on Audio, Speech, and Language Processing 18.7 (2010): 1830-1840.

norbert.apply_filter(x, W)

Applies a filter to the mixture. This simply corresponds to a matrix multiplication at each time-frequency bin.

Parameters
x: np.ndarray [shape=(nb_frames, nb_bins, nb_channels)]

STFT of the signal on which to apply the filter.

W: np.ndarray [shape=(nb_frames, nb_bins, nb_channels, nb_channels)]

filtering matrices, as returned, e.g. by wiener_gain()

Returns
y_hat: np.ndarray [shape=(nb_frames, nb_bins, nb_channels)]

filtered signal
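In numpy terms, this is a batched matrix-vector product over the channel axes; a sketch (the exact index convention is an assumption):

    import numpy as np

    nb_frames, nb_bins, nb_channels = 10, 257, 2

    x = (np.random.randn(nb_frames, nb_bins, nb_channels)
         + 1j * np.random.randn(nb_frames, nb_bins, nb_channels))
    W = (np.random.randn(nb_frames, nb_bins, nb_channels, nb_channels)
         + 1j * np.random.randn(nb_frames, nb_bins, nb_channels, nb_channels))

    # y_hat[t, f, i] = sum_j W[t, f, i, j] * x[t, f, j]
    y_hat = np.einsum('tfij,tfj->tfi', W, x)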

norbert.get_mix_model(v, R)

Compute the model covariance of a mixture based on local Gaussian models. This simply adds up all the v[…, j] * R[…, j].

Parameters
v: np.ndarray [shape=(nb_frames, nb_bins, nb_sources)]

Power spectral densities for the sources

R: np.ndarray [shape=(nb_bins, nb_channels, nb_channels, nb_sources)]

Spatial covariance matrices of each source

Returns
Cxx: np.ndarray [shape=(nb_frames, nb_bins, nb_channels, nb_channels)]

Covariance matrix for the mixture
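A sketch of that sum in numpy:

    import numpy as np

    nb_frames, nb_bins, nb_channels, nb_sources = 10, 257, 2, 3

    v = np.abs(np.random.randn(nb_frames, nb_bins, nb_sources))
    R = (np.random.randn(nb_bins, nb_channels, nb_channels, nb_sources)
         + 1j * np.random.randn(nb_bins, nb_channels, nb_channels, nb_sources))

    # Cxx[t, f] = sum_j v[t, f, j] * R[f, ..., j]
    Cxx = np.einsum('tfj,fabj->tfab', v, R)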

norbert.get_local_gaussian_model(y_j, eps=1.0)

Compute the local Gaussian model [1] for a source, given its complex STFT. It first computes the power spectral density, and then the spatial covariance matrix, as done in [1], [2].

Parameters
y_j: np.ndarray [shape=(nb_frames, nb_bins, nb_channels)]

complex STFT of the source.

eps: float [scalar]

regularization term

Returns
v_j: np.ndarray [shape=(nb_frames, nb_bins)]

power spectral density of the source

R_j: np.ndarray [shape=(nb_bins, nb_channels, nb_channels)]

Spatial covariance matrix of the source
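A minimal sketch of both quantities, consistent with the shapes above (the exact weighting used for the spatial covariance is an assumption based on [1]):

    import numpy as np

    nb_frames, nb_bins, nb_channels = 10, 257, 2

    y_j = (np.random.randn(nb_frames, nb_bins, nb_channels)
           + 1j * np.random.randn(nb_frames, nb_bins, nb_channels))

    eps = 1.0
    # power spectral density: channel-averaged squared magnitude
    v_j = np.mean(np.abs(y_j) ** 2, axis=2)
    # spatial covariance: outer products accumulated over frames,
    # normalized by the regularized total source energy in each bin
    R_j = (np.einsum('tfa,tfb->fab', y_j, np.conj(y_j))
           / (eps + v_j.sum(axis=0))[:, None, None])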

References

1

N.Q. Duong, E. Vincent, and R. Gribonval. “Under-determined reverberant audio source separation using a full-rank spatial covariance model.” IEEE Transactions on Audio, Speech, and Language Processing 18.7 (2010): 1830-1840.

2

A. Liutkus, R. Badeau, and G. Richard. “Low bitrate informed source separation of realistic mixtures.” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013.

norbert.contrib.residual_model(v, x, alpha=1, autoscale=False)

Compute a model for the residual based on spectral subtraction.

The method consists of two steps:

  • The provided spectrograms are summed up to obtain an input model for the mixture. This input model is scaled frequency-wise to best fit the actual observed mixture spectrogram.

  • The residual model is obtained through spectral subtraction of the input model from the mixture spectrogram, with flooring to 0.

Parameters
v: np.ndarray [shape=(nb_frames, nb_bins, {1, nb_channels}, nb_sources)]

Estimated spectrograms for the sources

x: np.ndarray [shape=(nb_frames, nb_bins, nb_channels)]

complex mixture

alpha: float [scalar]

exponent for the spectrograms v. For instance, if alpha==1, then v must be homogeneous to magnitudes, and if alpha==2, v must be homogeneous to squared magnitudes.

autoscale: boolean

if you know the spectrograms will not have the right magnitude, enable this option so that the models are scaled and the residual is correctly estimated.

Returns
v: np.ndarray [shape=(nb_frames, nb_bins, nb_channels, nb_sources+1)]

Spectrograms of the sources, with an appended one for the residual.
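Example (a minimal sketch where only one target, say vocals, is modelled and the accompaniment is left to the residual):

    import numpy as np
    import norbert

    nb_frames, nb_bins, nb_channels = 100, 513, 2

    x = (np.random.randn(nb_frames, nb_bins, nb_channels)
         + 1j * np.random.randn(nb_frames, nb_bins, nb_channels))

    # a magnitude model for the vocals only
    v_vocals = np.abs(np.random.randn(nb_frames, nb_bins, nb_channels, 1))

    # append a residual model covering everything the vocals model misses
    v_all = norbert.contrib.residual_model(v_vocals, x)
    print(v_all.shape[-1])  # 2: vocals + residual

    # separate vocals and accompaniment with the completed model
    y = norbert.wiener(v_all, x)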

norbert.contrib.smooth(v, width=1, temporal=False)

Smooths an ndarray with a Gaussian blur.

Parameters
v: np.ndarray [shape=(nb_frames, …)]

input array

width: int [scalar]

length scale of the Gaussian blur

temporal: boolean

if True, smooths only along the time axis with a 1D blur; otherwise, uses a multidimensional Gaussian blur.

Returns
result: np.ndarray [shape=(nb_frames, …)]

filtered array
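Example (a minimal sketch on a random nonnegative array):

    import numpy as np
    import norbert

    v = np.abs(np.random.randn(100, 513))  # e.g. a magnitude spectrogram

    v_blur = norbert.contrib.smooth(v, width=3)  # multidimensional blur
    v_tblur = norbert.contrib.smooth(v, width=3, temporal=True)  # time axis only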

norbert.contrib.reduce_interferences(v, thresh=0.6, slope=15)

Reduction of interferences between spectrograms.

The objective of the method is to redistribute the energy of the input in order to “sparsify” spectrograms along the “source” dimension. This is motivated by the fact that sources are somewhat sparse and it is hence unlikely that they are all energetic at the same time-frequency bins.

The method is inspired by [1], with ad-hoc modifications.

Parameters
v: np.ndarray [shape=(…, nb_sources)]

non-negative data on which to apply interference reduction

thresh: float [scalar]

threshold for the compression; should be between 0 and 1. The closer to 1, the stronger the interference reduction, at the price of more distortion.

slope: float [scalar]

the slope at which binarization is done. The higher it is, the more abrupt the binarization.

Returns
v: np.ndarray [same shape as input]

v with reduced interferences
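Example (a minimal sketch on random nonnegative data):

    import numpy as np
    import norbert

    # spectrograms for 4 sources; the last axis is the source dimension
    v = np.abs(np.random.randn(100, 513, 2, 4))

    # same shape as the input, with energy redistributed across sources
    v_sparse = norbert.contrib.reduce_interferences(v, thresh=0.6, slope=15)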

References

1

Thomas Prätzlich, Rachel Bittner, Antoine Liutkus, and Meinard Müller. “Kernel additive modeling for interference reduction in multichannel music recordings.” Proc. of ICASSP 2015.

norbert.contrib.compress_filter(W, thresh=0.6, slope=15)

Applies a logit compression to a filter. This makes it possible to “binarize” a separation filter, reducing interferences at the price of distortion.

In the case of multichannel filters, the method decomposes them as the cascade of a pure beamformer (the selection of one direction in space) followed by a single-channel mask. Compression is then applied to the mask only.

Parameters
W: ndarray, shape=(…, nb_channels, nb_channels)

filter on which to apply logit compression.

thresh: float

threshold for the compression; should be between 0 and 1. The closer to 1, the fewer the interferences, but the more the distortion.

slope: float

the slope at which binarization is done. The higher it is, the more abrupt the binarization.

Returns
W: np.ndarray [same shape as input]

Compressed filter
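The single-channel logit compression at the heart of the method can be sketched as follows (logit_compress is a hypothetical helper for illustration; the exact expression Norbert uses is not shown here):

    import numpy as np

    def logit_compress(mask, thresh=0.6, slope=15):
        # push mask values above `thresh` towards 1, values below towards 0
        # (a hypothetical illustration of logit compression on a
        # single-channel mask)
        return 1.0 / (1.0 + np.exp(-slope * (mask - thresh)))

    mask = np.linspace(0, 1, 5)
    print(logit_compress(mask))  # values pulled towards 0 or 1 around 0.6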
