# Literature

In this article, we describe the basic ideas of existing methods for musical source separation (and specifically Lead/Accompaniment Separation) classified into three main categories: signal processing, audio modeling and probability theory. The interested reader is strongly encouraged to delve into the many online courses or textbooks available for a more detailed presentation of these topics, such as [12], [13] for signal processing, [9] for speech modeling, and [14], [15] for probability theory.

CITE

This article is based on a publication in the IEEE Journal of Transactions (opens new window). If you want to cite this article, please use the following reference.

@ARTICLE{rafii18,
  author={Z. Rafii and A. Liutkus and
          F. R. Stöter and S. I. Mimilakis
          and D. FitzGerald and B. Pardo},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={An Overview of Lead and Accompaniment Separation in Music},
  year={2018},
  volume={26},
  number={8},
  pages={1307-1335},
  doi={10.1109/TASLP.2018.2825440},
  ISSN={2329-9290},
  month={Aug}
}

# Signal processing

Sound is a series of pressure waves in the air. It is recorded as a waveform, a time-series of measurements of the displacement of the microphone diaphragm in response to these pressure waves. Sound is reproduced if a loudspeaker diaphragm is moved according to the recorded waveform. Multichannel signals simply consist of several waveforms, captured by more than one microphone. Typically, music signals are stereophonic, containing two waveforms.

Microphone displacement is typically measured at a fixed sampling frequency. In music processing, it is common to have sampling frequencies of (44.1) kHz (the sample frequency on a compact disc) or (48) kHz, which are higher than the typical sampling rates of (16) kHz or (8) kHz used for speech in telephony. This is because musical signals contain much higher frequency content than speech and the goal is aesthetic beauty in addition to basic intelligibility.

A time-frequency (TF) representation of sound is a matrix that encodes the time-varying spectrum of the waveform. Its entries are called TF bins and encode the varying spectrum of the waveform for all time frames and frequency channels. The most commonly-used TF representation is the short time Fourier transform (STFT) [16], which has complex entries: the angle accounts for the phase, i.e., the actual shift of the corresponding sinusoid at that time bin and frequency bin, and the magnitude accounts for the amplitude of that sinusoid in the signal. The magnitude (or power) of the STFT is called spectrogram. When the mixture is multichannel, the TF representation for each channel is computed, leading to a three-dimensional array: frequency, time and channel.

A TF representation is typically used as a first step in processing the audio because sources tend to be less overlapped in the TF representation than in the waveform [17]. This makes it easier to select portions of a mixture that correspond to only a single source. An STFT is typically used because it can be inverted back to the original waveform. Therefore, modifications made to the STFT can be used to create a modified waveform. Generally, a linear mixing process is considered, i.e., the mixture signal is equal to the sum of the source signals. Since the Fourier transform is a linear operation, this equality holds for the STFT. While that is not the case for the magnitude (or power) of the STFT, it is commonly assumed that the spectrograms of the sources sum to the spectrogram of the mixture.

In many methods, the separated sources are obtained by filtering the mixture. This can be understood as performing some equalization on the mixture, where each frequency is attenuated or kept intact. Since both the lead and the accompaniment signals change over time, the filter also changes. This is typically done using a TF mask, which, in its simplest form, is defined as the gain between (0) and (1) to apply on each element of the TF representation of the mixture (e.g., an STFT) in order to estimate the desired signal. Loosely speaking, it can be understood as an equalizer whose setting changes every few milliseconds. After multiplication of the mixture by a mask, the separated signal is recovered through an inverse TF transform. In the multichannel setting, more sophisticated filters may be designed that incorporate some delay and combine different channels; this is usually called beamforming. In the frequency domain, this is often equivalent to using complex matrices to multiply the mixture TF representation with, instead of just scalars between (0) and (1).

In practice, masks can be designed to filter the mixture in several ways. One may estimate the spectrogram for a single source or component, e.g., the accompaniment, and subtract it from the mixture spectrogram, e.g., in order to estimate the lead [18]. Another way would be to estimate separate spectrograms for both lead and accompaniment and combine them to yield a mask. For instance, a TF mask for the lead can be taken as the proportion of the lead spectrogram over the sum of both spectrograms, at each TF bin. Such filters are often called Wiener filters [19] or ratio masks. How they are calculated may involve some additional techniques like exponentiation and may be understood according to assumptions regarding the underlying statistics of the sources. For recent work in this area, and many useful pointers in designing such masks, the reader is referred to [20].

# Audio and speech modeling

It is typical in audio processing to describe audio waveforms as belonging to one of two different categories, which are sinusoidal signals — or pure tones — and noise. Actually, both are just the two extremes in a continuum of varying predictability: on the one hand, the shape of a sinusoidal wave in the future can reliably be guessed from previous samples. On the other hand, white noise is defined as an unpredictable signal and its spectrogram has constant energy everywhere. Different noise profiles may then be obtained by attenuating the energy of some frequency regions. This in turn induces some predictability in the signal, and in the extreme case where all the energy content is concentrated in one frequency, a pure tone is obtained.

A waveform may always be modeled as some filter applied on some excitation signal. Usually, the filter is assumed to vary smoothly across frequencies, hence modifying only what is called the spectral envelope of the signal, while the excitation signal comprises the rest. This is the basis for the source-filter model [21], which is of great importance in speech modeling, and thus also in vocal separation. As for speech, the filter is created by the shape of the vocal tract. The excitation signal is made of the glottal pulses generated by the vibration of the vocal folds. This results into voiced speech sounds made of time-varying harmonic/sinusoidal components. The excitation signal can also be the air flow passing through some constriction of the vocal tract. This results into unvoiced, noise-like, speech sounds. In this context, vowels are said to be voiced and tend to feature many sinusoids, while some phonemes such as fricatives are unvoiced and noisier.

A classical tool for dissociating the envelope from the excitation is the cepstrum [22]. It has applications for estimating the fundamental frequency [23], [24], for deriving the Mel-frequency cepstral coefficients (MFCC) [25], or for filtering signals through a so-called liftering operation [26] that enables modifications of either the excitation or the envelope parts through the source-filter paradigm.

An advantage of the source-filter model approach is indeed that one can dissociate the pitched content of the signal, embodied by the position of its harmonics, from its TF envelope which describes where the energy of the sound lies. In the case of vocals, it yields the ability to distinguish between the actual note being sung (pitch content) and the phoneme being uttered (mouth and vocal tract configuration), respectively. One key feature of vocals is they typically exhibit great variability in fundamental frequency over time. They can also exhibit larger vibratos (fundamental frequency modulations) and tremolos (amplitude modulations) in comparison to other instruments. A particularity of musical signals is that they typically consist of sequences of pitched notes. A sound gives the perception of having a pitch if the majority of the energy in the audio signal is at frequencies located at integer multiples of some fundamental frequency. These integer multiples are called harmonics. When the fundamental frequency changes, the frequencies of these harmonics also change, yielding the typical comb spectrograms of harmonic signals. Another noteworthy feature of sung melodies over simple speech is that their fundamental frequencies are, in general, located at precise frequency values corresponding to the musical key of the song. These very peculiar features are often exploited in separation methods. For simplicity reasons, we use the terms pitch and fundamental frequency interchangeably throughout the paper.

# Probability theory

Probability theory [14], [27] is an important framework for designing many data analysis and processing methods. Many of the methods described in this article use it and it is far beyond the scope of this paper to present it rigorously. For our purpose, it will suffice to say that the observations consist of the mixture signals. On the other hand, the parameters are any relevant feature about the source signal (such as pitch or time-varying envelope) or how the signals are mixed (e.g., the panning position). These parameters can be used to derive estimates about the target lead and accompaniment signals.

We understand a probabilistic model as a function of both the observations and the parameters: it describes how likely the observations are, given the parameters. For instance, a flat spectrum is likely under the noise model, and a mixture of comb spectrograms is likely under a harmonic model with the appropriate pitch parameters for the sources. When the observations are given, variation in the model depends only on the parameters. For some parameter value, it tells how likely the observations are. Under a harmonic model for instance, pitch may be estimated by finding the pitch parameter that makes the observed waveform as likely as possible. Alternatively, we may want to choose between several possible models such as voiced or unvoiced. In such cases, model selection methods are available, such as the Bayesian information criterion (BIC) [28].

Given these basic ideas, we briefly mention two models that are of particular importance. Firstly, the hidden Markov model (HMM) [15], [29] is relevant for time-varying observations. It basically defines several states, each one related to a specific model and with some probabilities for transitions between them. For instance, we could define as many states as possible notes played by the lead guitar, each one associated with a typical spectrum. The Viterbi algorithm is a dynamic programming method which actually estimates the most likely sequence of states given a sequence of observations [30]. Secondly, the Gaussian mixture model (GMM) [31] is a way to approximate any distribution as a weighted sum of Gaussians. It is widely used in clustering, because it works well with the celebrated Expectation-Maximization (EM) algorithm [32] to assign one particular cluster to each data point, while automatically estimating the clusters parameters. As we will see later, many methods work by assigning each TF bin to a given source in a similar way.

# Modeling the lead signal: harmonicity

# The approaches based on a harmonic assumption for vocals. In a first analysis step, the fundamental frequency of the lead signal is extracted. From it, a separation is obtained either by resynthesis (Section 3.1), or by filtering the mixture (Section 3.2).](figures/Figure2.pdf)

As mentioned in Section 2.2, one particularity of vocals is their production by the vibration of the vocal folds, further filtered by the vocal tract. As a consequence, sung melodies are mostly harmonic and therefore have a fundamental frequency. If one can track the pitch of the vocals, one can then estimate the energy at the harmonics of the fundamental frequency and reconstruct the voice. This is the basis of the oldest methods (as well as some more recent methods) we are aware of for separating the lead signal from a musical mixture.

Such methods are summarized in Figure [fig:methods_harmonicity]. In a first step, the objective is to get estimates of the time-varying fundamental frequency for the lead at each time frame. A second step in this respect is then to track this fundamental frequency over time, in other words, to find the best sequence of estimates, in order to identify the melody line. This can done either by a suitable pitch detection method, or by exploiting the availability of the score. Such algorithms typically assume that the lead corresponds to the harmonic signal with strongest amplitude. For a review on the particular topic of melody extraction, the reader is referred to [33].

From this starting point, we can distinguish between two kinds of approaches, depending on how they exploit the pitch information.

# Analysis-synthesis approaches

The first option to obtain the separated lead signal is to resynthesize it using a sinusoidal model. A sinusoidal model decomposes the sound with a set of sine waves of varying frequency and amplitude. If one knows the fundamental frequency of a pitched sound (like a singing voice), as well as the spectral envelope of the recording, then one can reconstruct the sound by making a set of sine waves whose frequencies are those of the harmonics of the fundamental frequency, and whose amplitudes are estimated from the spectral envelope of the audio. While the spectral envelope of the recording is generally not exactly the same as the spectral envelope of the target source, it can be a reasonable approximation, especially assuming that different sources do not overlap too much with each other in the TF representation of the mixture.

This idea allows for time-domain processing and was used in the earliest methods we are aware of. In 1973, Miller proposed in [34] to use the homomorphic vocoder [35] to separate the excitation function and impulse response of the vocal tract. Further refinements include segmenting parts of the signal as voiced, unvoiced, or silences using a heuristic program and manual interaction. Finally, cepstral liftering [26] was exploited to compensate for the noise or accompaniment.

Similarly, Maher used an analysis-synthesis approach in [36], assuming the mixtures are composed of only two harmonic sources. In his case, pitch detection was performed on the STFT and included heuristics to account for possibly colliding harmonics. He finally resynthesized each musical voice with a sinusoidal model.

Wang proposed instantaneous and frequency-warped techniques for signal parameterization and source separation, with application to voice separation in music [37], [38]. He introduced a frequency-locked loop algorithm which uses multiple harmonically constrained trackers. He computed the estimated fundamental frequency from a maximum-likelihood weighting of the tracking estimates. He was then able to estimate harmonic signals such as voices from complex mixtures.

Meron and Hirose proposed to separate singing voice and piano accompaniment [39]. In their case, prior knowledge consisting of musical scores was considered. Sinusoidal modeling as described in [40] was used.

Ben-Shalom and Dubnov proposed to filter an instrument or a singing voice out in such a way [41]. They first used a score alignment algorithm [42], assuming a known score. Then, they used the estimated pitch information to design a filter based on a harmonic model [43] and performed the filtering using the linear constraint minimum variance approach [44]. They additionally used a heuristic to deal with the unvoiced parts of the singing voice.

Zhang and Zhang proposed an approach based on harmonic structure modeling [45], [46]. They first extracted harmonic structures for singing voice and background music signals using a sinusoidal model [43], by extending the pitch estimation algorithm in [47]. Then, they used the clustering algorithm in [48] to learn harmonic structure models for the background music signals. Finally, they extracted the harmonic structures for all the instruments to reconstruct the background music signals and subtract them from the mixture, leaving only the singing voice signal.

More recently, Fujihara et al. proposed an accompaniment reduction method for singer identification [49], [50]. After fundamental frequency estimation using [51], they extracted the harmonic structure of the melody, i.e., the power and phase of the sinusoidal components at fundamental frequency and harmonics. Finally, they resynthesized the audio signal of the melody using the sinusoidal model in [52].

Similarly, Mesaros et al. proposed a vocal separation method to help with singer identification [53]. They first applied a melody transcription system [54] which estimates the melody line with the corresponding MIDI note numbers. Then, they performed sinusoidal resynthesis, estimating amplitudes and phases from the polyphonic signal.

In a similar manner, Duan et al. proposed to separate harmonic sources, including singing voices, by using harmonic structure models [55]. They first defined an average harmonic structure model for an instrument. Then, they learned a model for each source by detecting the spectral peaks using a cross-correlation method [56] and quadratic interpolation [57]. Then, they extracted the harmonic structures using BIC and a clustering algorithm [48]. Finally, they separated the sources by re-estimating the fundamental frequencies, re-extracting the harmonics, and reconstructing the signals using a phase generation method [58].

Lagrange et al. proposed to formulate lead separation as a graph partition problem [59], [60]. They first identified peaks in the spectrogram and grouped the peaks into clusters by using a similarity measure which accounts for harmonically related peaks, and the normalized cut criterion [61] which is used for segmenting graphs in computer vision. They finally selected the cluster of peaks which corresponds to a predominant harmonic source and resynthesized it using a bank of sinusoidal oscillators.

Ryynänen et al. proposed to separate accompaniment from polyphonic music using melody transcription for karaoke applications [62]. They first transcribed the melody into a MIDI note sequence and a fundamental frequency trajectory, using the method in [63], an improved version of the earlier method [54]. Then, they used sinusoidal modeling to estimate, resynthesize, and remove the lead vocals from the musical mixture, using the quadratic polynomial-phase model in [64].

# Comb-filtering approaches

Using sinusoidal synthesis to generate the lead signal suffers from a typical metallic sound quality, which is mostly due to discrepancies between the estimated excitation signals of the lead signal compared to the ground truth. To address this issue, an alternative approach is to exploit harmonicity in another way, by filtering out everything from the mixture that is not located close to the detected harmonics.

Li and Wang proposed to use a vocal/non-vocal classifier and a predominant pitch detection algorithm [65], [66]. They first detected the singing voice by using a spectral change detector [67] to partition the mixture into homogeneous portions, and GMMs on MFCCs to classify the portions as vocal or non-vocal. Then, they used the predominant pitch detection algorithm in [68] to detect the pitch contours from the vocal portions, extending the multi-pitch tracking algorithm in [69]. Finally, they extracted the singing voice by decomposing the vocal portions into TF units and labeling them as singing or accompaniment dominant, extending the speech separation algorithm in [70].

Han and Raphael proposed an approach for desoloing a recording of a soloist with an accompaniment given a musical score and its time alignment with the recording [71]. They derived a mask [72] to remove the solo part after using an EM algorithm to estimate its melody, that exploits the score as side information.

Hsu et al. proposed an approach which also identifies and separates the unvoiced singing voice [73], [74]. Instead of processing in the STFT domain, they use the perceptually motivated gammatone filter-bank as in [66], [70]. They first detected accompaniment, unvoiced, and voiced segments using an HMM and identified voice-dominant TF units in the voiced frames by using the singing voice separation method in [66], using the predominant pitch detection algorithm in [75]. Unvoiced-dominant TF units were identified using a GMM classifier with MFCC features learned from training data. Finally, filtering was achieved with spectral subtraction [76].

Raphael and Han then proposed a classifier-based approach to separate a soloist from accompanying instruments using a time-aligned symbolic musical score [77]. They built a tree-structured classifier [78] learned from labeled training data to classify TF points in the STFT as belonging to solo or accompaniment. They additionally constrained their classifier to estimate masks having a connected structure.

Cano et al. proposed various approaches for solo and accompaniment separation. In [79], they separated saxophone melodies from mixtures with piano and/or orchestra by using a melody line detection algorithm, incorporating information about typical saxophone melody lines. In [80]–[82], they proposed to use the pitch detection algorithm in [83]. Then, they refined the fundamental frequency and the harmonics, and created a binary mask for the solo and accompaniment. They finally used a post-processing stage to refine the separation. In [84], they included a noise spectrum in the harmonic refinement stage to also capture noise-like sounds in vocals. In [85], they additionally included common amplitude modulation characteristics in the separation scheme.

Bosch et al. proposed to separate the lead instrument using a musical score [86]. After a preliminary alignment of the score to the mixture, they estimated a score confidence measure to deal with local misalignments and used it to guide the predominant pitch tracking. Finally, they performed low-latency separation based on the method in [87], by combining harmonic masks derived from the estimated pitch and additionally exploiting stereo information as presented later in Section 7.

Vaneph et al. proposed a framework for vocal isolation to help spectral editing [88]. They first used a voice activity detection process based on a deep learning technique [89]. Then, they used pitch tracking to detect the melodic line of the vocal and used it to separate the vocal and background, allowing a user to provide manual annotations when necessary.

# Shortcomings

As can be seen, explicitly assuming that the lead signal is harmonic led to an important body of research. While the aforementioned methods show excellent performance when their assumptions are valid, their performance can drop significantly in adverse, but common situations.

Firstly, vocals are not always purely harmonic as they contain unvoiced phonemes that are not harmonic. As seen above, some methods already handle this situation. However, vocals can also be whispered or saturated, both of which are difficult to handle with a harmonic model.

Secondly, methods based on the harmonic model depend on the quality of the pitch detection method. If the pitch detector switches from following the pitch of the lead (e.g., the voice) to another instrument, the wrong sound will be isolated from the mix. Often, pitch detectors assume the lead signal is the loudest harmonic sound in the mix. Unfortunately, this is not always the case. Another instrument may be louder or the lead may be silent for a passage. The tendency to follow the pitch of the wrong instrument can be mitigated by applying constraints on the pitch range to estimate and by using a perceptually relevant weighting filter before performing pitch tracking. Of course, these approaches do not help when the lead signal is silent.

# Modeling the accompaniment: redundancy

In the previous section, we presented methods whose main focus was the modeling of a harmonic lead melody. Most of these studies did not make modeling the accompaniment a core focus. On the contrary, it was often dealt with as adverse noise to which the harmonic processing method should be robust to.

In this section, we present another line of research which concentrates on modeling the accompaniment under the assumption it is somehow more redundant than the lead signal. This assumption stems from the fact that musical accompaniments are often highly structured, with elements being repeated many times. Such repetitions can occur at the note level, in terms of rhythmic structure, or even from a harmonic point of view: instrumental notes are often constrained to have their pitch lie in a small set of frequencies. Therefore, modeling and removing the redundant elements of the signal are assumed to result in removal of the accompaniment.

In this paper, we identify three families of methods that exploit the redundancy of the accompaniment for separation.

# Grouping low-rank components

# The approaches based on a low-rank assumption. Non-negative matrix factorization (NMF) is used to identify components from the mixture, that are subsequently clustered into lead or accompaniment. Additional constraints may be incorporated.

The first set of approaches we consider is the identification of redundancy in the accompaniment through the assumption that its spectrogram may be well represented by only a few components. Techniques exploiting this idea then focus on algebraic methods that decompose the mixture spectrogram into the product of a few template spectra activated over time. One way to do so is via non-negative matrix factorization (NMF) [90], [91], which incorporates non-negative constraints. In Figure [fig:methods_low_rank], we picture methods exploiting such techniques. After factorization, we obtain several spectra, along with their activations over time. A subsequent step is the clustering of these spectra (and activations) into the lead or the accompaniment. Separation is finally performed by deriving Wiener filters to estimate the lead and the accompaniment from the mixture. For related applications of NMF in music analysis, the reader is referred to [92]–[94].

Vembu and Baumann proposed to use NMF (and also ICA [95]) to separate vocals from mixtures [96]. They first discriminated between vocal and non-vocal sections in a mixture by using different combinations of features, such as MFCCs [25], perceptual linear predictive (PLP) coefficients [97], and log frequency power coefficients (LFPC) [98], and training two classifiers, namely neural networks and support vector machines (SVM). They then applied redundancy reduction techniques on the TF representation of the mixture to separate the sources [99], by using NMF (or ICA). The components were then grouped as vocal and non-vocal by reusing a vocal/non-vocal classifier with MFCC, LFPC, and PLP coefficients.

Chanrungutai and Ratanamahatana proposed to use NMF with automatic component selection[100], [101]. They first decomposed the mixture spectrogram using NMF with a fixed number of basis components. They then removed the components with brief rhythmic and long-lasting continuous events, assuming that they correspond to instrumental sounds. They finally used the remaining components to reconstruct the singing voice, after refining them using a high-pass filter.

Marxer and Janer proposed an approach based on a Tikhonov regularization [102] as an alternative to NMF, for singing voice separation [103]. Their method sacrificed the non-negativity constraints of the NMF in exchange for a computationally less expensive solution for spectrum decomposition, making it more interesting in low-latency scenarios.

Yang et al. proposed a Bayesian NMF approach [104], [105]. Following the approaches in [106] and [107], they used a Poisson distribution for the likelihood function and exponential distributions for the model parameters in the NMF algorithm, and derived a variational Bayesian EM algorithm [32] to solve the NMF problem. They also adaptively determined the number of bases from the mixture. They finally grouped the bases into singing voice and background music by using a k-means clustering algorithm [108] or an NMF-based clustering algorithm.

In a different manner, Smaragdis and Mysore proposed a user-guided approach for removing sounds from mixtures by humming the target sound to be removed, for example a vocal track [109]. They modeled the mixture using probabilistic latent component analysis (PLCA) [110], another equivalent formulation of NMF. One key feature of exploiting user input was to facilitate the grouping of components into vocals and accompaniment, as humming helped to identify some of the parameters for modeling the vocals.

Nakamuray and Kameoka proposed an (L_p)-norm NMF [111], with (p) controlling the sparsity of the error. They developed an algorithm for solving this NMF problem based on the auxiliary function principle [112], [113]. Setting an adequate number of bases and (p) taken as small enough allowed them to estimate the accompaniment as the low-rank decomposition, and the singing voice as the error of the approximation, respectively. Note that, in this case, the singing voice was not explicitly modeled as a sparse component but rather corresponded to the error which happened to be constrained as sparse. The next subsection will actually deal with approaches that explicitly model the vocals as the sparse component.

# Low-rank accompaniment, sparse vocals

# The approaches based on a low-rank accompaniment, sparse vocals assumption. As opposed to methods based on NMF, methods based on robust principal component analysis (RPCA) assume the lead signal has a sparse and non-structured spectrogram.

The methods presented in the previous section first compute a decomposition of the mixture into many components that are sorted a posteriori as accompaniment or lead. As can be seen, this means they make a low-rank assumption for the accompaniment, but typically also for the vocals. However the spectrogram for the vocals do exhibit much more freedom than accompaniment, and experience shows they are not adequately described by a small number of spectral bases. For this reason, another track of research depicted in Figure [fig:methods_rpca] focused on using a low-rank assumption on the accompaniment only, while assuming the vocals are sparse and not structured. This loose assumption means that only a few coefficients from their spectrogram should have significant magnitude, and that they should not feature significant redundancy. Those ideas are in line with robust principal component analysis (RPCA) [114], which is the mathematical tool used by this body of methods, initiated by Huang et al. for singing voice separation [115] . It decomposes a matrix into a sparse and low-rank component.

Sprechmann et al. proposed an approach based on RPCA for online singing voice separation [116]. They used ideas from convex optimization [117], [118] and multi-layer neural networks [119]. They presented two extensions of RPCA and robust NMF models [120]. They then used these extensions in a multi-layer neural network framework which, after an initial training stage, allows online source separation.

Jeong and Lee proposed two extensions of the RPCA model to improve the estimation of vocals and accompaniment from the sparse and low-rank components [121]. Their first extension included the Schatten (p) and (\ell_{p}) norms as generalized nuclear norm optimizations [122]. They also suggested a pre-processing stage based on logarithmic scaling of the mixture TF representation to enhance the RPCA.

Yang also proposed an approach based on RPCA with dictionary learning for recovering low-rank components [123]. He introduced a multiple low-rank representation following the observation that elements of the singing voice can also be recovered by the low-rank component. He first incorporated online dictionary learning methods [124] in his methodology to obtain prior information about the structure of the sources and then incorporated them into the RPCA model.

Chan and Yang then extended RPCA to complex and quaternionic cases with application to singing voice separation [125]. They extended the principal component pursuit (PCP) [114] for solving the RPCA problem by presenting complex and quaternionic proximity operators for the (\ell_{1}) and trace-norm regularizations to account for the missing phase information.

# Repetitions within the accompaniment

While the rationale behind low-rank methods for lead-accompaniment separation is to exploit the idea that the musical background should be redundant, adopting a low-rank model is not the only way to do it. An alternate way to proceed is to exploit the musical structure of songs, to find repetitions that can be utilized to perform separation. Just like in RPCA-based methods, the accompaniment is then assumed to be the only source for which repetitions will be found. The unique feature of the methods described here is they combine music structure analysis [126]–[128] with particular ways to exploit the identification of repeated parts of the accompaniment.

# The approaches based on a repetition assumption for accompaniment. In a first analysis step, repetitions are identified. Then, they are used to build an estimate for the accompaniment spectrogram and proceed to separation.

Rafii et al. proposed the REpeating Pattern Extraction Technique (REPET) to separate the accompaniment by assuming it is repeating [129]–[131], which is often the case in popular music. This approach, which is representative of this line of research, is represented on Figure [fig:methods_repet]. First, a repeating period is extracted by a music information retrieval system, such as a beat spectrum [132] in this case. Then, this extracted information is used to estimate the spectrogram of the accompaniment through an averaging of the identified repetitions. From this, a filter is derived.

Seetharaman et al.[133] leveraged the two dimensional Fourier transform (2DFT) of the spectrogram to create an algorithm very similar to REPET. The properties of the 2DFT let them separate the periodic background from the non-periodic vocal melody by deleting peaks in the 2DFT. This eliminated the need to create an explicit model of the periodic audio and without the need to find the period of repetition, both of which are required in REPET.

Liutkus et al. adapted the REPET approach in [129], [130] to handle repeating structures varying along time by modeling the repeating patterns only locally [131], [134]. They first identified a repeating period for every time frame by computing a beat spectrogram as in [132]. Then they estimated the spectrogram of the accompaniment by averaging the time frames in the mixture spectrogram at their local period rate, for every TF bin. From this, they finally extracted the repeating structure by deriving a TF mask.

Rafii et al. further extended the REPET approaches in [129], [130] and [134] to handle repeating structures that are not periodic. To do this, they proposed the REPET-SIM method in [131], [135] to identify repeating frames for every time frame by computing a self-similarity matrix, as in [136]. Then, they estimated the accompaniment spectrogram at every TF bin by averaging the neighbors identified thanks to that similarity matrix. An extension for real-time processing was presented in [137] and a version exploiting user interaction was proposed in [138]. A method close to REPET-SIM was also proposed by FitzGerald in [139].

Liutkus et al. proposed the Kernel Additive modeling (KAM) [140], [141] as a framework which generalizes the REPET approaches in [129]–[131], [134], [135]. They assumed that a source at a TF location can be modeled using its values at other locations through a specified kernel which can account for features such as periodicity, self-similarity, stability over time or frequency, etc. This notably enabled modeling of the accompaniment using more than one repeating pattern. Liutkus et al. also proposed a light version using a fast compression algorithm to make the approach more scalable [142]. The approach was also used for interference reduction in music recordings [143], [144].

With the same idea of exploiting intra-song redundancies for singing voice separation, but through a very different methodology, Moussallam et al. assumed in [145] that all the sources can be decomposed sparsely in the same dictionary and used a matching pursuit greedy algorithm [146] to solve the problem. They integrated the separation process in the algorithm by modifying the atom selection criterion and adding a decision to assign a chosen atom to the repeated source or to the lead signal.

Deif et al. proposed to use multiple median filters to separate vocals from music recordings [147]. They augmented the approach in [148] with diagonal median filters to improve the separation of the vocal component. They also investigated different filter lengths to further improve the separation.

Lee et al. also proposed to use the KAM approach [149]–[152]. They applied the (\beta)-order minimum mean square error (MMSE) estimation [153] to the back-fitting algorithm in KAM to improve the separation. They adaptively calculated a perceptually weighting factor (\alpha) and the singular value decomposition (SVD)-based factorized spectral amplitude exponent (\beta) for each kernel component.

# Shortcomings

While methods focusing on harmonic models for the lead often fall short in their expressive power for the accompaniment, the methods we reviewed in this section are often observed to suffer exactly from the converse weakness, namely they do not provide an adequate model for the lead signal. Hence, the separated vocals often will feature interference from unpredictable parts from the accompaniment, such as some percussion or effects which occur infrequently.

Furthermore, even if the musical accompaniment will exhibit more redundancy, the vocals part will also be redundant to some extent, which is poorly handled by these methods. When the lead signal is not vocals but played by some lead instrument, its redundancy is even more pronounced, because the notes it plays lie in a reduced set of fundamental frequencies. Consequently, such methods would include the redundant parts of the lead within the accompaniment estimate, for example, a steady humming by a vocalist.

# Joint models for lead and accompaniment

In the previous sections, we reviewed two important bodies of literature, focused on modeling either the lead or the accompaniment parts of music recordings, respectively. While each approach showed its own advantages, it also featured its own drawbacks. For this reason, some researchers devised methods combining ideas for modeling both the lead and the accompaniment sources, and thus benefiting from both approaches. We now review this line of research.

# Using music structure analysis to drive learning

The first idea we find in the literature is to augment methods for accompaniment modeling with the prior identification of sections where the vocals are present or absent. In the case of the low rank models discussed in Sections 4.1 and 4.2, such a strategy indeed dramatically improves performance.

Raj et al. proposed an approach in [154] that is based on the PLCA formulation of NMF [155], and extends their prior work [156]. The parameters for the frequency distribution of the background music are estimated from the background music-only segments, and the rest of the parameters from the singing voice+background music segments, assuming a priori identified vocal regions.

Han and Chen also proposed a similar approach for melody extraction based on PLCA [157], which includes a further estimate of the melody from the vocals signal by an autocorrelation technique similar to [158].

Gómez et al. proposed to separate the singing voice from the guitar accompaniment in flamenco music to help with melody transcription [159]. They first manually segmented the mixture into vocal and non-vocal regions. They then learned percussive and harmonic bases from the non-vocal regions by using an unsupervised NMF percussive/harmonic separation approach [93], [160]. The vocal spectrogram was estimated by keeping the learned percussive and harmonic bases fixed.

Papadopoulos and Ellis proposed a signal-adaptive formulation of RPCA which incorporates music content information to guide the recovery of the sparse and low-rank components [161]. Prior musical knowledge, such as predominant melody, is used to regularize the selection of active coefficients during the optimization procedure.

In a similar manner, Chan et al. proposed to use RPCA with vocal activity information [162]. They modified the RPCA algorithm to constraint parts of the input spectrogram to be non-sparse to account for the non-vocal parts of the singing voice.

A related method was proposed by Jeong and Lee in [163], using RPCA with a weighted (l_1)-norm. They replaced the uniform weighting between the low-rank and sparse components in the RPCA algorithm by an adaptive weighting based on the variance ratio between the singing voice and the accompaniment. One key element of the method is to incorporate vocal activation information in the weighting.

# Factorization with a known melody

While using only the knowledge of vocal activity as described above already yields an increase of performance over methods operating blindly, many authors went further to also incorporate the fact that vocals often have a strong melody line. Some redundant model is then assumed for the accompaniment, while also enforcing a harmonic model for the vocals.

# Factorization informed with the melody. First, melody extraction is performed on the mixture. Then, this information is used to drive the estimation of the accompaniment: TF bins pertaining to the lead should not be taken into account for estimating the accompaniment model.

An early method to achieve this is depicted in Figure [fig:NMF_known_melody] and was proposed by Virtanen et al. in [164]. They estimated the pitch of the vocals in the mixture by using a melody transcription algorithm [63] and derived a binary TF mask to identify where vocals are not present. They then applied NMF on the remaining non-vocal segments to learn a model for the background.

Wang and Ou also proposed an approach which combines melody extraction and NMF-based soft masking [165]. They identified accompaniment, unvoiced, and voiced segments in the mixture using an HMM model with MFCCs and GMMs. They then estimated the pitch of the vocals from the voiced segments using the method in [166] and an HMM with the Viterbi algorithm as in [167]. They finally applied a soft mask to separate voice and accompaniment.

Rafii et al. investigated the combination of an approach for modeling the background and an approach for modeling the melody [168]. They modeled the background by deriving a rhythmic mask using the REPET-SIM algorithm [135] and the melody by deriving a harmonic mask using a pitch-based algorithm [169]. They proposed a parallel and a sequential combination of those algorithms.

Venkataramani et al. proposed an approach combining sinusoidal modeling and matrix decomposition, which incorporates prior knowledge about singer and phoneme identity [170]. They applied a predominant pitch algorithm on annotated sung regions [171] and performed harmonic sinusoidal modeling [172]. Then, they estimated the spectral envelope of the vocal component from the spectral envelope of the mixture using a phoneme dictionary. After that, a spectral envelope dictionary representing sung vowels from song segments of a given singer was learned using an extension of NMF [173], [174]. They finally estimated a soft mask using the singer-vowel dictionary to refine and extract the vocal component.

Ikemiya et al. proposed to combine RPCA with pitch estimation[175], [176]. They derived a mask using RPCA [115] to separate the mixture spectrogram into singing voice and accompaniment components. They then estimated the fundamental frequency contour from the singing voice component based on [177] and derived a harmonic mask. They integrated the two masks and resynthesized the singing voice and accompaniment signals. Dobashi et al. then proposed to use that singing voice separation approach in a music performance assistance system [178].

Hu and Liu proposed to combine approaches based on matrix decomposition and pitch information for singer identification[179]. They used non-negative matrix partial co-factorization [173], [180] which integrates prior knowledge about the singing voice and the accompaniment, to separate the mixture into singing voice and accompaniment portions. They then identified the singing pitch from the singing voice portions using [181] and derived a harmonic mask as in [182], and finally reconstructed the singing voice using a missing feature method [183]. They also proposed to add temporal and sparsity criteria to their algorithm [184].

That methodology was also adopted by Zhang et al. in [185], that followed the framework of the pitch-based approach in [66], by performing singing voice detection using an HMM classifier, singing pitch detection using the algorithm in [186], and singing voice separation using a binary mask. Additionally, they augmented that approach by analyzing the latent components of the TF matrix using NMF in order to refine the singing voice and accompaniment.

Zhu et al. [187] proposed an approach which is also representative of this body of literature, with the pitch detection algorithm being the one in [181] and binary TF masks used for separation after NMF.

# Joint factorization and melody estimation

The methods presented above put together the ideas of modeling the lead (typically the vocals) as featuring a melodic harmonic line and the accompaniment as redundant. As such, they already exhibit significant improvement over approaches only applying one of these ideas as presented in Sections 3 and 4, respectively. However, these methods above are still restricted in the sense that the analysis performed on each side cannot help improve the other one. In other words, the estimation of the models for the lead and the accompaniment are done sequentially. Another idea is to proceed jointly.

# Joint estimation of the lead and accompaniment, the former one as a source-filter model and the latter one as an NMF model.

A seminal work in this respect was done by Durrieu et al. using a source-filter and NMF model [188]–[190], depicted in Figure [fig:methods_sourcefilter]. Its core idea is to decompose the mixture spectrogram as the sum of two terms. The first term accounts for the lead and is inspired by the source-filter model described in Section 2: it is the element-wise product of an excitation spectrogram with a filter spectrogram. The former one can be understood as harmonic combs activated by the melodic line, while the latter one modulates the envelope and is assumed low-rank because few phonemes are used. The second term accounts for the accompaniment and is modeled with a standard NMF. In [188]–[190], they modeled the lead by using a GMM-based model [191] and a glottal source model [192], and the accompaniment by using an instantaneous mixture model [193] leading to an NMF problem [94]. They jointly estimated the parameters of their models by maximum likelihood estimation using an iterative algorithm inspired by [194] with multiplicative update rules developed in [91]. They also extracted the melody by using an algorithm comparable to the Viterbi algorithm, before re-estimating the parameters and finally performing source separation using Wiener filters [195]. In [196], they proposed to adapt their model for user-guided source separation.

The joint modeling of the lead and accompaniment parts of a music signal was also considered by Fuentes et al. in [197], that introduced the idea of using a log-frequency TF representation called the constant-Q transform (CQT) [198]–[200]. The advantage of such a representation is that a change in pitch corresponds to a simple translation in the TF plane, instead of a scaling as in the STFT. This idea was used along the creation of a user interface to guide the decomposition, in line with what was done in [196].

Joder and Schuller used the source-filter NMF model in [201], additionally exploiting MIDI scores [202]. They synchronized the MIDI scores to the audio using the alignment algorithm in [203]. They proposed to exploit the score information through two types of constraints applied in the model. In a first approach, they only made use of the information regarding whether the leading voice is present or not in each frame. In a second approach, they took advantage of both time and pitch information on the aligned score.

Zhao et al. proposed a score-informed leading voice separation system with a weighting scheme [204]. They extended the system in [202], which is based on the source-filter NMF model in [201], by using a Laplacian or a Gaussian-based mask on the NMF activation matrix to enhance the likelihood of the score-informed pitch candidates.

Jointly estimating accompaniment and lead allowed for some research in correctly estimating the unvoiced parts of the lead, which is the main issue with purely harmonic models, as highlighted in Section 3.3. In [201], [205], Durrieu et al. extended their model to account for the unvoiced parts by adding white noise components to the voice model.

In the same direction, Janer and Marxer proposed to separate unvoiced fricative consonants using a semi-supervised NMF [206]. They extended the source-filter NMF model in [201] using a low-latency method with timbre classification to estimate the predominant pitch [87]. They approximated the fricative consonants as an additive wideband component, training a model of NMF bases. They also used the transient quality to differentiate between fricatives and drums, after extracting transient time points using the method in [207].

Similarly, Marxer and Janer then proposed to separately model the singing voice breathiness [208]. They estimated the breathiness component by approximating the voice spectrum as a filtered composition of a glottal excitation and a wideband component. They modeled the magnitude of the voice spectrum using the model in [209] and the envelope of the voice excitation using the model in [192]. They estimated the pitch using the method in [87]. This was all integrated into the source-filter NMF model.

The body of research initiated by Durrieu et al. in [188] consists of using algebraic models more sophisticated than one simple matrix product, but rather inspired by musicological knowledge. Ozerov et al. formalized this idea through a general framework and showed its application for singing voice separation [210]–[212].

Finally, Hennequin and Rigaud augmented their model to account for long-term reverberation, with application to singing voice separation [213]. They extended the model in [214] which allows extraction of the reverberation of a specific source with its dry signal. They combined this model with the source-filter NMF model in [189].

# Different constraints for different sources

Algebraic methods that decompose the mixture spectrogram as the sum of the lead and accompaniment spectrograms are based on the minimization of a cost or loss function which measures the error between the approximation and the observation. While the methods presented above for lead and accompaniment separation did propose more sophisticated models with parameters explicitly pertaining to the lead or the accompaniment, another option that is also popular in the dedicated literature is to modify the cost function of an optimization algorithm for an existing algorithm (e.g., RPCA), so that one part of the resulting components would preferentially account for one source or another.

This approach can be exemplified by the harmonic-percussive source separation method (HPSS), presented in [160], [215], [216]. It consists in filtering a mixture spectrogram so that horizontal lines go in a so-called harmonic source, while its vertical lines go into a percussive source. Separation is then done with TF masking. Of course, such a method is not adequate for lead and accompaniment separation per se, because all the harmonic content of the accompaniment is classified as harmonic. However, it shows that nonparametric approaches are also an option, provided the cost function itself is well chosen for each source.

This idea was followed by Yang in [217] who proposed an approach based on RPCA with the incorporation of harmonicity priors and a back-end drum removal procedure to improve the decomposition. He added a regularization term in the algorithm to account for harmonic sounds in the low-rank component and used an NMF-based model trained for drum separation [211] to eliminate percussive sounds in the sparse component.

Jeong and Lee proposed to separate a vocal signal from a music signal [218], extending the HPSS approach in [160], [215]. Assuming that the spectrogram of the signal can be represented as the sum of harmonic, percussive, and vocal components, they derived an objective function which enforces the temporal and spectral continuity of the harmonic and percussive components, respectively, similarly to [160], but also the sparsity of the vocal component. Assuming non-negativity of the components, they then derived iterative update rules to minimize the objective function. Ochiai et al. extended this work in [219], notably by imposing harmonic constraints for the lead.

Watanabe et al. extended RPCA for singing voice separation [220]. They added a harmonicity constraint in the objective function to account for harmonic structures, such as in vocal signals, and regularization terms to enforce the non-negativity of the solution. They used the generalized forward-backward splitting algorithm [221] to solve the optimization problem. They also applied post-processing to remove the low frequencies in the vocal spectrogram and built a TF mask to remove time frames with low energy.

Going beyond smoothness and harmonicity, Hayashi et al. proposed an NMF with a constraint to help separate periodic components, such as a repeating accompaniment [222]. They defined a periodicity constraint which they incorporated in the objective function of the NMF algorithm to enforce the periodicity of the bases.

# Cascaded and iterated methods

In their effort to propose separation methods for the lead and accompaniment in music, some authors discovered that very different methods often have complementary strengths. This motivated the combination of methods. In practice, there are several ways to follow this line of research.

One potential route to achieve better separation is to cascade several methods. This is what FitzGerald and Gainza proposed in [216] with multiple median filters [148]. They used a median-filter based HPSS approach at different frequency resolutions to separate a mixture into harmonic, percussive, and vocal components. They also investigated the use of STFT or CQT as the TF representation and proposed a post-processing step to improve the separation with tensor factorization techniques [223] and non-negative partial co-factorization [180].

The two-stage HPSS system proposed by Tachibana et al. in [224] proceeds the same way. It is an extension of the melody extraction approach in [225] and was applied for karaoke in [226]. It consists in using the optimization-based HPSS algorithm from [160], [215], [227], [228] at different frequency resolutions to separate the mixture into harmonic, percussive, and vocal components.

# Cascading source separation methods. The results from method A is improved by applying methods B and C on its output, which are specialized in reducing interferences from undesired sources in each signal.

HPSS was not the only separation module considered as the building block of combined lead and accompaniment separation approaches. Deif et al. also proposed a multi-stage NMF-based algorithm [229], based on the approach in [230]. They used a local spectral discontinuity measure to refine the non-pitched components obtained from the factorization of the long window spectrogram and a local temporal discontinuity measure to refine the non-percussive components obtained from factorization of the short window spectrogram.

Finally, this cascading concept was considered again by Driedger and Müller in [231], that introduces a processing pipeline for the outputs of different methods [115], [164], [232], [233] to obtain an improved separation quality. Their core idea is depicted in Figure [fig:methods_cascading] and combines the output of different methods in a specific order to improve separation.

Another approach for improving the quality of separation when using several separation procedures is not to restrict the number of such iterations from one method to another, but rather to iterate them many times until satisfactory results are obtained. This is what is proposed in Hsu et al. in [234], extending the algorithm in [235]. They first estimated the pitch range of the singing voice by using the HPSS method in [160], [225]. They separated the voice given the estimated pitch using a binary mask obtained by training a multilayer perceptron [236] and re-estimated the pitch given the separated voice. Voice separation and pitch estimation are then iterated until convergence.

As another iterative method, Zhu et al. proposed a multi-stage NMF [230], using harmonic and percussive separation at different frequency resolutions similar to [225] and [216]. The main originality of their contribution was to iterate the refinements instead of applying it only once.

An issue with such iterated methods lies in how to decide whether convergence is obtained, and it is not clear whether the quality of the separated signals will necessarily improve. For this reason, Bryan and Mysore proposed a user-guided approach based on PLCA, which can be applied for the separation of the vocals [237]–[239]. They allowed a user to make annotations on the spectrogram of a mixture, incorporated the feedback as constraints in a PLCA model [110], [156], and used a posterior regularization technique [240] to refine the estimates, repeating the process until the user is satisfied with the results. This is similar to the way Ozerov et al. proposed to take user input into account in [241].

# Fusion of separation methods. The output of many separation methods is fed into a fusion system that combines them to produce a single estimate.

A principled way to aggregate the result of many source separation systems to obtain one single estimate that is consistently better than all of them was presented by Jaureguiberry et al. in their fusion framework, depicted in Figure [fig:methods_fusion]. It takes advantage of multiple existing approaches, and demonstrated its application to singing voice separation [242]–[244]. They investigated fusion methods based on non-linear optimization, Bayesian model averaging [245], and deep neural networks (DNN).

As another attempt to design an efficient fusion method, McVicar et al. proposed in [246] to combine the outputs of RPCA [115], HPSS [216], Gabor filtered spectrograms [247], REPET [130] and an approach based on deep learning [248]. To do this, they used different classification techniques to build the aggregated TF mask, such as a logistic regression model or a conditional random field (CRF) trained using the method in [249] with time and/or frequency dependencies.

Manilow et al. trained a neural network to predict quality of source separation for three source separation algorithms, each leveraging a different cue - repetition, spatialization, and harmonicity/pitch proximity [250]. The method estimates separation quality of the lead vocals for each algorithm, using only the original audio mixture and separated source output. These estimates were used to guide switching between algorithms along time.

# Source-dependent representations

In the previous section, we stated that some authors considered iterating separation at different frequency resolutions, i.e., using different TF representations [216], [224], [229]. This can be seen as a combination of different methods. However, this can also be seen from another perspective as based on picking specific representations.

Wolf et al. proposed an approach using rigid motion segmentation, with application to singing voice separation [251], [252]. They introduced harmonic template models with amplitude and pitch modulations defined by a velocity vector. They applied a wavelet transform [253] on the harmonic template models to build an audio image where the amplitude and pitch dynamics can be separated through the velocity vector. They then derived a velocity equation, similar to the optical flow velocity equation used in images [254], to segment velocity components. Finally, they identified the harmonic templates which model different sources in the mixture and separated them by approximating the velocity field over the corresponding harmonic template models.

Yen et al. proposed an approach using spectro-temporal modulation features [255], [256]. They decomposed a mixture using a two-stage auditory model which consists of a cochlear module [257] and cortical module [258]. They then extracted spectro-temporal modulation features from the TF units and clustered the TF units into harmonic, percussive, and vocal components using the EM algorithm and resynthesized the estimated signals.

Chan and Yang proposed an approach using an informed group sparse representation [259]. They introduced a representation built using a learned dictionary based on a chord sequence which exhibits group sparsity [260] and which can incorporate melody annotations. They derived a formulation of the problem in a manner similar to RPCA and solved it using the alternating direction method of multipliers [261]. They also showed a relation between their representation and the low-rank representation in [123], [262].

# Shortcomings

The large body of literature we reviewed in the preceding sections is concentrated on choosing adequate models for the lead and accompaniment parts of music signals in order to devise effective signal processing methods to achieve separation. From a higher perspective, their common feature is to guide the separation process in a model-based way: first, the scientist has some idea regarding characteristics of the lead signal and/or the accompaniment, and then an algorithm is designed to exploit this knowledge for separation.

Model-based methods for lead and accompaniment separation are faced with a common risk that their core assumptions will be violated for the signal under study. For instance, the lead to be separated may not be harmonic but saturated vocals or the accompaniment may not be repetitive or redundant, but rather always changing. In such cases, model-based methods are prone to large errors and poor performance.

# Data-driven approaches

A way to address the potential caveats of model-based separation behaving badly in case of violated assumptions is to avoid making assumptions altogether, but rather to let the model be learned from a large and representative database of examples. This line of research leads to data-driven methods, for which researchers are concerned about directly estimating a mapping between the mixture and either the TF mask for separating the sources, or their spectrograms to be used for designing a filter.

As may be foreseen, this strategy based on machine learning comes with several challenges of its own. First, it requires considerable amounts of data. Second, it typically requires a high-capacity learner (many tunable parameters) that can be prone to over-fitting the training data and therefore not working well on the audio it faces when deployed.

# Algebraic approaches

A natural way to exploit a training database was to learn some parts of the model to guide the estimation process into better solutions. Work on this topic may be traced back to the suggestion of Ozerov et al. in [276] to learn spectral template models based on a database of isolated sources, and then to adapt this dictionary of templates on the mixture using the method in [277].

The exploitation of training data was formalized by Smaragdis et al. in [110] in the context of source separation within the supervised and semi-supervised PLCA framework. The core idea of this probabilistic formulation, equivalent to NMF, is to learn some spectral bases from the training set which are then kept fixed at separation time.

In the same line, Ozerov et al. proposed an approach using Bayesian models [191]. They first segmented a song into vocal and non-vocal parts using GMMs with MFCCs. Then, they adapted a general music model on the non-vocal parts of a particular song by using the maximum a posteriori (MAP) adaptation approach in [278]

Ozerov et al. later proposed a framework for source separation which generalizes several approaches given prior information about the problem and showed its application for singing voice separation [210]–[212]. They chose the local Gaussian model [279] as the core of the framework and allowed the prior knowledge about each source and its mixing characteristics using user-specified constraints. Estimation was performed through a generalized EM algorithm [32].

Rafii et al. proposed in [280] to address the main drawback of the repetition-based methods described in Section 4.3, which is the weakness of the model for vocals. For this purpose, they combined the REPET-SIM model [135] for the accompaniment with a NMF-based model for singing voice learned from a voice dataset.

As yet another example of using training data for NMF, Boulanger-Lewandowski et al. proposed in [281] to exploit long-term temporal dependencies in NMF, embodied using recurrent neural networks (RNN) [236]. They incorporated RNN regularization into the NMF framework to temporally constrain the activity matrix during the decomposition, which can be seen as a generalization of the non-negative HMM in [282]. Furthermore, they used supervised and semi-supervised NMF algorithms on isolated sources to train the models, as in [110].

# Deep neural networks

# General architecture for methods exploiting deep learning. The network inputs the mixture and outputs either the sources spectrograms or a TF mask. Methods usually differ in their choice for a network architecture and the way it is learned using the training data.

Taking advantage of the recent availability of sufficiently large databases of isolated vocals along with their accompaniment, several researchers investigated the use of machine learning methods to directly estimate a mapping between the mixture and the sources. Although end-to-end systems inputting and outputting the waveforms have already been proposed in the speech community [283], they are not yet available for music source separation. This may be due to the relative small size of music separation databases, at most 10 h today. Instead, most systems feature pre and post-processing steps that consist in computing classical TF representations and building TF masks, respectively. Although such end-to-end systems will inevitably be proposed in the near future, the common structure of deep learning methods for lead and accompaniment separation usually corresponds for now to the one depicted in Figure [fig:methods_dnn]. From a general perspective, we may say that most current methods mainly differ in the structure picked for the network, as well as in the way it is learned.

Providing a thorough introduction to deep neural networks is out of the scope of this paper. For our purpose, it suffices to mention that they consist of a cascade of several possibly non-linear transformations of the input, which are learned during a training stage. They were shown to effectively learn representations and mappings, provided enough data is available for estimating their parameters [284]–[286]. Different architectures for neural networks may be combined/cascaded together, and many architectures were proposed in the past, such as feedforward fully-connected neural networks (FNN), convolutional neural networks (CNN), or RNN and variants such as the long short-term memory (LSTM) and the gated-recurrent units (GRU). Training of such functions is achieved by stochastic gradient descent [287] and associated algorithms, such as backpropagation [288] or backpropagation through time [236] for the case of RNNs.

To the best of our knowledge, Huang et al. were the first to propose deep neural networks, RNNs here [289], [290], for singing voice separation in [248], [291]. They adapted their framework from [292] to model all sources simultaneously through masking. Input and target functions were the mixture magnitude and a joint representation of the individual sources. The objective was to estimate jointly either singing voice and accompaniment music, or speech and background noise from the corresponding mixtures.

Modeling the temporal structures of both the lead and the accompaniment is a considerable challenge, even when using DNN methods. As an alternative to the RNN approach proposed by Huang et al. in [248], Uhlich et al. proposed the usage of FNNs [293] whose input consists of supervectors of a few consecutive frames from the mixture spectrogram. Later in [294], the same authors considered the use of bi-directional LSTMs for the same task.

In an effort to make the resulting system less computationally demanding at separation time but still incorporating dynamic modeling of audio, Simpson et al. proposed in [295] to predict binary TF masks using deep CNNs, which typically utilize fewer parameters than the FNNs. Similarly, Schlueter proposed a method trained to detect singing voice using CNNs [296]. In that case, the trained network was used to compute saliency maps from which TF masks can be computed for singing voice separation. Chandna et al. also considered CNNs for lead separation in [297], with a particular focus on low-latency.

The classical FNN, LSTM and CNN structures above served as baseline structures over which some others tried to improve. As a first example, Mimilakis et al. proposed to use a hybrid structure of FNNs with skip connections to separate the lead instrument for purposes of remixing jazz recordings [298]. Such skip connections allow to propagate the input spectrogram to intermediate representations within the network, and mask it similarly to the operation of TF masks. As advocated, this enforces the networks to approximate a TF masking process. Extensions to temporal data for singing voice separation were presented in [299], [300]. Similarly, Jansson et al. proposed to propagate the spectral information computed by convolutional layers to intermediate representations [301]. This propagation aggregates intermediate outputs to proceeding layer(s). The output of the last layer is responsible for masking the input mixture spectrogram. In the same vein, Takahashi et al. proposed to use skip connections via element-wise addition through representations computed by CNNs [302].

Apart from the structure of the network, the way it is trained, comprising how the targets are computed, has a tremendous impact on performance. As we saw, most methods operate on defining TF masks or estimating magnitude spectrograms. However, other methods were proposed based on deep clustering [303], [304], where TF mask estimation is seen as a clustering problem. Luo et al. investigated both approaches in [305] by proposing deep bidirectional LSTM networks capable of outputting both TF masks or features to use as in deep clustering. Kim and Smaragdis proposed in [306] another way to learn the model, in a denoising auto-encoding fashion [307], again utilizing short segments of the mixture spectrogram as an input to the network, as in [293].

As the best network structure may vary from one track to another, some authors considered a fusion of methods, in a manner similar to the method [242] presented above. Grais et. al [308], [309] proposed to aggregate the results from an ensemble of feedforward DNNs to predict TF masks for separation. An improvement was presented in [310], [311] where the inputs to the fusion network were separated signals, instead of TF masks, aiming at enhancing the reconstruction of the separated sources.

As can be seen the use of deep learning methods for the design of lead and accompaniment separation has already stimulated a lot of research, although it is still in its infancy. Interestingly, we also note that using audio and music specific knowledge appears to be fundamental in designing effective systems. As an example of this, the contribution from Nie et al. in [312] was to include the construction of the TF mask as an extra non-linearity included in a recurrent network. This is an exemplar of where signal processing elements, such as filtering through masking, are incorporated as a building block of the machine learning method.

The network structure is not the only thing that can benefit from audio knowledge for better separation. The design of appropriate features is another. While we saw that supervectors of spectrogram patches offered the ability to effectively model time-context information in FNNs [293], Sebastian and Murthy [313] proposed the use of the modified group delay feature representation [314] in their deep RNN architecture. They applied their approach for both singing voice and vocal-violin separation.

Finally, as with other methods, DNN-based separation techniques can also be combined with others to yield improved performance. As an example, Fan et al. proposed to use DNNs to separate the singing voice and to also exploit vocal pitch estimation [315]. They first extracted the singing voice using feedforward DNNs with sigmoid activation functions. They then estimated the vocal pitch from the extracted singing voice using dynamic programming.

# Shortcomings

Data-driven methods are nowadays the topic of important research efforts, particularly those based on DNNs. This is notably due to their impressive performance in terms of separation quality, as can, for instance, be noticed below in Section 8. However, they also come with some limitations.

First, we highlighted that lead and accompaniment separation in music has the very specific problem of scarce data. Since it is very hard to gather large amounts of training data for that application, it is hard to fully exploit learning methods that require large training sets. This raises very specific challenges in terms of machine learning.

Second, the lack of interpretability of model parameters is often mentioned as a significant shortcoming when it comes to applications. Indeed, music engineering systems are characterized by a strong importance of human-computer interactions, because they are used in an artistic context that may require specific needs or results. As of today, it is unclear how to provide user interaction for controlling the millions of parameters of DNN-based systems.

# Including multichannel information

In describing the above methods, we have not discussed the fact that music signals are typically stereophonic. On the contrary, the bulk of methods we discussed focused on designing good spectrogram models for the purpose of filtering mixtures that may be monophonic. Such a strategy is called single-channel source separation and is usually presented as more challenging than multichannel source separation. Indeed, only TF structure may then be used to discriminate the accompaniment from the lead. In stereo recordings, one further so-called spatial dimension is introduced, which is sometimes referred to as pan, that corresponds to the perceived position of a source in the stereo field. Devising methods to exploit this spatial diversity for source separation has also been the topic of an important body of research that we review now.

# Extracting the lead based on panning

In the case of popular music signals, a fact of paramount practical importance is that the lead signal — such as vocals — is very often mixed in the center, which means that its energy is approximately the same in left and right channels. On the contrary, other instruments are often mixed at positions to the left or right of the stereo field.

# Separation of the lead based on panning information. A stereo cue called panning allows to design a TF mask.

The general structure of methods extracting the lead based on stereo cues is displayed on Figure [fig:methods_panning], introduced by Avendano, who proposed to separate sources in stereo mixtures by using a panning index [316]. He derived a two-dimensional map by comparing left and right channels in the TF domain to identify the different sources based on their panning position [317]. The same methodology was considered by Barry et al. in [318] in his Azimuth Discrimination and Resynthesis (ADRess) approach, with panning indexes computed with differences instead of ratios.

Vinyes et al. also proposed to unmix commercially produced music recordings thanks to stereo cues [319]. They designed an interface similar to [318] where a user can set some parameters to generate different TF filters in real time. They showed applications for extracting various instruments, including vocals.

Cobos and López proposed to separate sources in stereo mixtures by using TF masking and multilevel thresholding [320]. They based their approach on the Degenerate Unmixing Estimation Technique (DUET) [321]. They first derived histograms by measuring the amplitude relationship between TF points in left and right channels. Then, they obtained several thresholds using the multilevel extension of Otsu’s method [322]. Finally, TF points were assigned to their related sources to produce TF masks.

Sofianos et al. proposed to separate the singing voice from a stereo mixture using ICA [323]–[325]. They assumed that most commercial songs have the vocals panned to the center and that they dominate the other sources in amplitude. In [323], they proposed to combine a modified version of ADRess with ICA to filter out the other instruments. In [324], they proposed a modified version without ADRess.

Kim et al. proposed to separate centered singing voice in stereo music by exploiting binaural cues, such as inter-channel level and inter-channel phase difference [326]. To this end, they build the pan-based TF mask through an EM algorithm, exploiting a GMM model on these cues.

# Augmenting models with stereo

As with using only a harmonic model for the lead signal, using stereo cues in isolation is not always sufficient for good separation, as there can often be multiple sources at the same spatial location. Combining stereo cues with other methods improves performance in these cases.

Cobos and López proposed to extract singing voice by combining panning information and pitch tracking [327]. They first obtained an estimate for the lead thanks to a pan-based method such as [316], and refined the singing voice by using a TF binary mask based on comb-filtering method as in Section 3.2. The same combination was proposed by Marxer et al. in [87] in a low-latency context, with different methods used for the binaural cues and pitch tracking blocks.

FitzGerald proposed to combine approaches based on repetition and panning to extract stereo vocals [328]. He first used his nearest neighbors median filtering algorithm [139] to separate vocals and accompaniment from a stereo mixture. He then used the ADRess algorithm [318] and a high-pass filter to refine the vocals and improve the accompaniment. In a somewhat different manner, FitzGerald and Jaiswal also proposed to combine approaches based on repetition and panning to improve stereo accompaniment recovery [329]. They presented an audio inpainting scheme [330] based on the nearest neighbors and median filtering algorithm [139] to recover TF regions of the accompaniment assigned to the vocals after using a source separation algorithm based on panning information.

In a more theoretically grounded manner, several methods based on a probabilistic model were generalized to the multichannel case. For instance, Durrieu et al. extended their source-filter model in [201], [205] to handle stereo signals, by incorporating the panning coefficients as model parameters to be estimated.

Ozerov and Févotte proposed a multichannel NMF framework with application to source separation, including vocals and music [331], [332]. They adopted a statistical model where each source is represented as a sum of Gaussian components [193], and where maximum likelihood estimation of the parameters is equivalent to NMF with the Itakura-Saito divergence [94]. They proposed two methods for estimating the parameters of their model, one that maximized the likelihood of the multichannel data using EM, and one that maximized the sum of the likelihoods of all channels using a multiplicative update algorithm inspired by NMF [90].

Ozerov et al. then proposed a multichannel non-negative tensor factorization (NTF) model with application to user-guided source separation [333]. They modeled the sources jointly by a 3-valence tensor (time/frequency/source) as in [334] which extends the multichannel NMF model in [332]. They used a generalized EM algorithm based on multiplicative updates [335] to minimize the objective function. They incorporated information about the temporal segmentation of the tracks and the number of components per track. Ozerov et al. later proposed weighted variants of NMF and NTF with application to user-guided source separation, including separation of vocals and music [241], [336].

Sawada et al. also proposed multichannel extensions of NMF, tested for separating stereo mixtures of multiple sources, including vocals and accompaniment [337]–[339]. They first defined multichannel extensions of the cost function, namely, Euclidean distance and Itakura-Saito divergence, and derived multiplicative update rules accordingly. They then proposed two techniques for clustering the bases, one built into the NMF model and one performing sequential pair-wise merges.

Finally, multichannel information was also used with DNN models. Nugraha et al. addressed the problem of multichannel source separation for speech enhancement [340], [341] and music separation [342], [343]. In this framework, DNNs are still used for the spectrograms, while more classical EM algorithms [344], [345] are used for estimating the spatial parameters.

# Shortcomings

When compared to simply processing the different channels independently, incorporating spatial information in the separation method often comes at the cost of additional computational complexity. The resulting methods are indeed usually more demanding in terms of computing power, because they involve the design of beamforming filters and inversion of covariance matrices. While this is not really an issue for stereophonic music, this may become prohibiting in configurations with higher numbers of channels

# References

[1] R. Kalakota and M. Robinson, E-business 2.0: Roadmap for success. Addison-Wesley Professional, 2000.

[2] C. K. Lam and B. C. Tan, “The Internet is changing the music industry,” Communications of the ACM, vol. 44, no. 8, pp. 62–68, 2001.

[3] P. Common and C. Jutten, Handbook of blind source separation. Academic Press, 2010.

[4] G. R. Naik and W. Wang, Blind source separation. Springer-Verlag Berlin Heidelberg, 2014.

[5] A. Hyvärinen, “Fast and robust fixed-point algorithm for independent component analysis,” IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, May 1999.

[6] A. Hyvärinen and E. Oja, “Independent component analysis: Algorithms and applications,” Neural Networks, vol. 13, nos. 4-5, pp. 411–430, Jun. 2000.

[7] S. Makino, T.-W. Lee, and H. Sawada, Blind speech separation. Springer Netherlands, 2007.

[8] E. Vincent, T. Virtanen, and S. Gannot, Audio source separation and speech enhancement. Wiley, 2018.

[9] P. C. Loizou, Speech enhancement: Theory and practice. CRC Press, 1990.

[10] A. Liutkus, J.-L. Durrieu, L. Daudet, and G. Richard, “An overview of informed audio source separation,” in 14th international workshop on image analysis for multimedia interactive services, 2013.

[11] E. Vincent, N. Bertin, R. Gribonval, and F. Bimbot, “From blind to guided audio source separation: How models and side information can improve the separation of sound,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 107–115, May 2014.

[12] U. Zölzer, DAFX - digital audio effects. Wiley, 2011.

[13] M. Müller, Fundamentals of music processing: Audio, analysis, algorithms, applications. Springer, 2015.

[14] E. T. Jaynes, Probability theory: The logic of science. Cambridge university press, 2003.

[15] O. Cappé, E. Moulines, and T. Ryden, Inference in hidden markov models (springer series in statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2005.

[16] R. J. McAulay and T. F. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 34, no. 4, pp. 744–754, Aug. 1986.

[17] S. Rickard and O. Yilmaz, “On the approximate w-disjoint orthogonality of speech,” in IEEE international conference on acoustics, speech, and signal processing, 2002.

[18] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on acoustics, speech, and signal processing, vol. 27, no. 2, pp. 113–120, 1979.

[19] N. Wiener, “Extrapolation, interpolation, and smoothing of stationary time series,” 1975.

[20] A. Liutkus and R. Badeau, “Generalized Wiener filtering with fractional power spectrograms,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[21] G. Fant, Acoustic theory of speech production. Walter de Gruyter, 1970.

[22] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, “The quefrency alanysis of time series for echoes: Cepstrum pseudo-autocovariance, cross-cepstrum, and saphe cracking,” Proceedings of a symposium on time series analysis, pp. 209–243, 1963.

[23] A. M. Noll, “Short-time spectrum and ‘cepstrum’ techniques for vocal-pitch detection,” Journal of the Acoustical Society of America, vol. 36, no. 2, pp. 296–302, 1964.

[24] A. M. Noll, “Cepstrum pitch determination,” Journal of the Acoustical Society of America, vol. 41, no. 2, pp. 293–309, 1967.

[25] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980.

[26] A. V. Oppenheim, “Speech analysis-synthesis system based on homomorphic filtering,” Journal of the Acoustical Society of America, vol. 45, no. 2, pp. 458–465, 1969.

[27] R. Durrett, Probability: Theory and examples. Cambridge university press, 2010.

[28] G. Schwarz, “Estimating the dimension of a model,” Annals of Statistics, vol. 6, no. 2, pp. 461–464, Mar. 1978.

[29] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.

[30] A. J. Viterbi, “A personal history of the Viterbi algorithm,” IEEE Signal Processing Magazine, vol. 23, no. 4, pp. 120–142, 2006.

[31] C. Bishop, Neural networks for pattern recognition. Clarendon Press, 1996.

[32] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1–38, 1977.

[33] J. Salamon, E. Gómez, D. Ellis, and G. Richard, “Melody extraction from polyphonic music signals: Approaches, applications and challenges,” IEEE Signal Processing Magazine, vol. 31, 2014.

[34] N. J. Miller, “Removal of noise from a voice signal by synthesis,” Utah University, 1973.

[35] A. V. Oppenheim and R. W. Schafer, “Homomorphic analysis of speech,” IEEE Transactions on Audio and Electroacoustics, vol. 16, no. 2, pp. 221–226, Jun. 1968.

[36] R. C. Maher, “An approach for the separation of voices in composite musical signals,” PhD thesis, University of Illinois at Urbana-Champaign, 1989.

[37] A. L. Wang, “Instantaneous and frequency-warped techniques for auditory source separation,” PhD thesis, Stanford University, 1994.

[38] A. L. Wang, “Instantaneous and frequency-warped techniques for source separation and signal parametrization,” in IEEE workshop on applications of signal processing to audio and acoustics, 1995.

[39] Y. Meron and K. Hirose, “Separation of singing and piano sounds,” in 5th international conference on spoken language processing, 1998.

[40] T. F. Quatieri, “Shape invariant time-scale and pitch modification of speech,” IEEE Transactions on Signal Processing, vol. 40, no. 3, pp. 497–510, Mar. 1992.

[41] A. Ben-Shalom and S. Dubnov, “Optimal filtering of an instrument sound in a mixed recording given approximate pitch prior,” in International computer music conference, 2004.

[42] S. Shalev-Shwartz, S. Dubnov, N. Friedman, and Y. Singer, “Robust temporal and spectral modeling for query by melody,” in 25th annual international acm sigir conference on research and development in information retrieval, 2002.

[43] X. Serra, “Musical sound modeling with sinusoids plus noise,” in Musical signal processing, Swets & Zeitlinger, 1997, pp. 91–122.

[44] B. V. Veen and K. M. Buckley, “Beamforming techniques for spatial filtering,” in The digital signal processing handbook, CRC Press, 1997, pp. 1–22.

[45] Y.-G. Zhang and C.-S. Zhang, “Separation of voice and music by harmonic structure stability analysis,” in IEEE international conference on multimedia and expo, 2005.

[46] Y.-G. Zhang and C.-S. Zhang, “Separation of music signals by harmonic structure modeling,” in Advances in neural information processing systems 18, MIT Press, 2006, pp. 1617–1624.

[47] E. Terhardt, “Calculating virtual pitch,” Hearing Research, vol. 1, no. 2, pp. 155–182, Mar. 1979.

[48] Y.-G. Zhang, C.-S. Zhang, and S. Wang, “Clustering in knowledge embedded space,” in Machine learning: ECML 2003, Springer Berlin Heidelberg, 2003, pp. 480–491.

[49] H. Fujihara, T. Kitahara, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno, “Singer identification based on accompaniment sound reduction and reliable frame selection,” in 6th international conference on music information retrieval, 2005.

[50] H. Fujihara, M. Goto, T. Kitahara, and H. G. Okuno, “A modeling of singing voice robust to accompaniment sounds and its application to singer identification and vocal-timbre-similarity-based music information retrieval,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 638–648, Mar. 2010.

[51] M. Goto, “A real-time music-scene-description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals,” Speech Communication, vol. 43, no. 4, pp. 311–329, Sep. 2004.

[52] J. A. Moorer, “Signal processing aspects of computer music: A survey,” Proceedings of the IEEE, vol. 65, no. 8, pp. 1108–1137, Aug. 2005.

[53] A. Mesaros, T. Virtanen, and A. Klapuri, “Singer identification in polyphonic music using vocal separation and pattern recognition methods,” in 7th international conference on music information retrieval, 2007.

[54] M. Ryynänen and A. Klapuri, “Transcription of the singing melody in polyphonic music,” in 7th international conference on music information retrieval, 2006.

[55] Z. Duan, Y.-F. Zhang, C.-S. Zhang, and Z. Shi, “Unsupervised single-channel music source separation by average harmonic structure modeling,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 766–778, May 2008.

[56] X. Rodet, “Musical sound signal analysis/synthesis: Sinusoidal+Residual and elementary waveform models,” in IEEE time-frequency and time-scale workshop, 1997.

[57] J. O. Smith and X. Serra, “PARSHL: An analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation,” in International computer music conference, 1987.

[58] M. Slaney, D. Naar, and R. F. Lyon, “Auditory model inversion for sound separation,” in IEEE international conference on acoustics, speech and signal processing, 1994.

[59] M. Lagrange and G. Tzanetakis, “Sound source tracking and formation using normalized cuts,” in IEEE international conference on acoustics, speech and signal processing, 2007.

[60] M. Lagrange, L. G. Martins, J. Murdoch, and G. Tzanetakis, “Normalized cuts for predominant melodic source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 278–290, Feb. 2008.

[61] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, Aug. 2000.

[62] M. Ryynänen, T. Virtanen, J. Paulus, and A. Klapuri, “Accompaniment separation and karaoke application based on automatic melody transcription,” in IEEE international conference on multimedia and expo, 2008.

[63] M. Ryynänen and A. Klapuri, “Automatic transcription of melody, bass line, and chords in polyphonic music,” Computer Music Journal, vol. 32, no. 3, pp. 72–86, Sep. 2008.

[64] Y. Ding and X. Qian, “Processing of musical tones using a combined quadratic polynomial-phase sinusoid and residual (QUASAR) signal model,” Journal of the Audio Engineering Society, vol. 45, no. 7/8, pp. 571–584, Jul. 1997.

[65] Y. Li and D. Wang, “Singing voice separation from monaural recordings,” in 7th international conference on music information retrieval, 2006.

[66] Y. Li and D. Wang, “Separation of singing voice from music accompaniment for monaural recordings,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1475–1487, May 2007.

[67] C. Duxbury, J. P. Bello, M. Davies, and M. Sandler, “Complex domain onset detection for musical signals,” in 6th international conference on digital audio effects, 2003.

[68] Y. Li and D. Wang, “Detecting pitch of singing voice in polyphonic audio,” in IEEE international conference on acoustics, speech and signal processing, 2005.

[69] M. Wu, D. Wang, and G. J. Brown, “A multipitch tracking algorithm for noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 11, no. 3, pp. 229–241, May 2003.

[70] G. Hu and D. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” IEEE Transactions on Neural Networks, vol. 15, no. 5, pp. 1135–1150, Sep. 2002.

[71] Y. Han and C. Raphael, “Desoloing monaural audio using mixture models,” in 7th international conference on music information retrieval, 2007.

[72] S. T. Roweis, “One microphone source separation,” in Advances in neural information processing systems 13, MIT Press, 2001, pp. 793–799.

[73] C.-L. Hsu, J.-S. R. Jang, and T.-L. Tsai, “Separation of singing voice from music accompaniment with unvoiced sounds reconstruction for monaural recordings,” in AES 125th convention, 2008.

[74] C.-L. Hsu and J.-S. R. Jang, “On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 310–319, Feb. 2010.

[75] K. Dressler, “Sinusoidal extraction using an efficient implementation of a multi-resolution FFT,” in 9th international conference on digital audio effects, 2006.

[76] P. Scalart and J. V. Filho, “Speech enhancement based on a priori signal to noise estimation,” in IEEE international conference on acoustics, speech and signal processing, 1996.

[77] C. Raphael and Y. Han, “A classifier-based approach to score-guided music audio source separation,” Computer Music Journal, vol. 32, no. 1, pp. 51–59, 2008.

[78] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and regression trees. Chapman; Hall/CRC, 1984.

[79] E. Cano and C. Cheng, “Melody line detection and source separation in classical saxophone recordings,” in 12th international conference on digital audio effects, 2009.

[80] S. Grollmisch, E. Cano, and C. Dittmar, “Songs2See: Learn to play by playing,” in AES 41st conference: Audio for games, 2011, pp. P2–3.

[81] C. Dittmar, E. Cano, J. Abeßer, and S. Grollmisch, “Music information retrieval meets music education,” in Multimodal music processing, Dagstuhl Publishing, 2012, pp. 95–120.

[82] E. Cano, C. Dittmar, and G. Schuller, “Efficient implementation of a system for solo and accompaniment separation in polyphonic music,” in 20th european signal processing conference, 2012.

[83] K. Dressler, “Pitch estimation by the pair-wise evaluation of spectral peaks,” in 42nd aes conference on semantic audio, 2011.

[84] E. Cano, C. Dittmar, and G. Schuller, “Re-thinking sound separation: Prior information and additivity constraints in separation algorithms,” in 16th international conference on digital audio effects, 2013.

[85] E. Cano, G. Schuller, and C. Dittmar, “Pitch-informed solo and accompaniment separation towards its use in music education applications,” EURASIP Journal on Advances in Signal Processing, vol. 2014, no. 23, Sep. 2014.

[86] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, “Score-informed and timbre independent lead instrument separation in real-world scenarios,” in 20th european signal processing conference, 2012.

[87] R. Marxer, J. Janer, and J. Bonada, “Low-latency instrument separation in polyphonic audio using timbre models,” in 10th international conference on latent variable analysis and signal separation, 2012.

[88] A. Vaneph, E. McNeil, and F. Rigaud, “An automated source separation technology and its practical applications,” in 140th aes convention, 2016.

[89] S. Leglaive, R. Hennequin, and R. Badeau, “Singing voice detection with deep recurrent neural networks,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[90] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788–791, Oct. 1999.

[91] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in neural information processing systems 13, MIT Press, 2001, pp. 556–562.

[92] P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music transcription,” in IEEE workshop on applications of signal processing to audio and acoustics, 2003.

[93] T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066–1074, Mar. 2007.

[94] C. Févotte, “Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830, Mar. 2009.

[95] P. Common, “Independent component analysis, a new concept?” Signal Processing, vol. 36, no. 3, pp. 287–314, Apr. 1994.

[96] S. Vembu and S. Baumann, “Separation of vocals from polyphonic audio recordings,” in 6th international conference on music information retrieval, 2005.

[97] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, Apr. 1990.

[98] T. L. Nwe and Y. Wang, “Automatic detection of vocal segments in popular songs,” in 5th international conference for music information retrieval, 2004.

[99] M. A. Casey and A. Westner, “Separation of mixed audio sources by independent subspace analysis,” in International computer music conference, 2000.

[100] A. Chanrungutai and C. A. Ratanamahatana, “Singing voice separation for mono-channel music using non-negative matrix factorization,” in International conference on advanced technologies for communications, 2008.

[101] A. Chanrungutai and C. A. Ratanamahatana, “Singing voice separation in mono-channel music,” in International symposium on communications and information technologies, 2008.

[102] A. N. Tikhonov, “Solution of incorrectly formulated problems and the regularization method,” Soviet Mathematics, vol. 4, pp. 1035–1038, 1963.

[103] R. Marxer and J. Janer, “A Tikhonov regularization method for spectrum decomposition in low latency audio source separation,” in IEEE international conference on acoustics, speech and signal processing, 2012.

[104] P.-K. Yang, C.-C. Hsu, and J.-T. Chien, “Bayesian singing-voice separation,” in 15th international society for music information retrieval conference, 2014.

[105] J.-T. Chien and P.-K. Yang, “Bayesian factorization and learning for monaural source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 185–195, Jan. 2015.

[106] A. T. Cemgil, “Bayesian inference for nonnegative matrix factorisation models,” Computational Intelligence and Neuroscience, vol. 2009, no. 4, pp. 1–17, Jan. 2009.

[107] M. N. Schmidt, O. Winther, and L. K. Hansen, “Bayesian non-negative matrix factorization,” in 8th international conference on independent component analysis and signal separation, 2009.

[108] M. Spiertz and V. Gnann, “Source-filter based clustering for monaural blind source separation,” in 12th international conference on digital audio effects, 2009.

[109] P. Smaragdis and G. J. Mysore, “Separation by ‘humming’: User-guided sound extraction from monophonic mixtures,” in IEEE workshop on applications of signal processing to audio and acoustics, 2009.

[110] P. Smaragdis, B. Raj, and M. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” in 7th international conference on independent component analysis and signal separation, 2007.

[111] T. Nakamuray and H. Kameoka, “(L_p)-norm non-negative matrix factorization and its application to singing voice enhancement,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[112] J. M. Ortega and W. C. Rheinboldt, Iterative solution of nonlinear equations in several variables. Academic Press, 1970.

[113] H. Kameoka, M. Goto, and S. Sagayama, “Selective amplifier of periodic and non-periodic components in concurrent audio signals with spectral control envelopes,” Information Processing Society of Japan, 2006.

[114] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM, vol. 58, no. 3, pp. 1–37, May 2011.

[115] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, “Singing-voice separation from monaural recordings using robust principal component analysis,” in IEEE international conference on acoustics, speech and signal processing, 2012.

[116] P. Sprechmann, A. Bronstein, and G. Sapiro, “Real-time online singing voice separation from monaural recordings using robust low-rank modeling,” in 13th international society for music information retrieval conference, 2012.

[117] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Review, vol. 52, no. 3, pp. 471–501, Aug. 2010.

[118] B. Recht and C. Ré, “Parallel stochastic gradient algorithms for large-scale matrix completion,” Mathematical Programming Computation, vol. 5, no. 2, pp. 201–226, Jun. 2013.

[119] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in 27th international conference on machine learning, 2010.

[120] L. Zhang, Z. Chen, M. Zheng, and X. He, “Robust non-negative matrix factorization,” Frontiers of Electrical Electronic Engineering China, vol. 6, no. 2, pp. 192–200, Jun. 2011.

[121] I.-Y. Jeong and K. Lee, “Vocal separation using extended robust principal component analysis with Schatten (P)/(L_p)-norm and scale compression,” in International workshop on machine learning for signal processing, 2014.

[122] F. Nie, H. Wang, and H. Huang, “Joint Schatten (p)-norm and (l_p)-norm robust matrix completion for missing value recovery,” Knowledge and Information Systems, vol. 42, no. 3, pp. 525–544, Mar. 2015.

[123] Y.-H. Yang, “Low-rank representation of both singing voice and music accompaniment via learned dictionaries,” in 14th international society for music information retrieval conference, 2013.

[124] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online dictionary learning for sparse coding,” in 26th annual international conference on machine learning, 2009.

[125] T.-S. T. Chan and Y.-H. Yang, “Complex and quaternionic principal component pursuit and its application to audio separation,” IEEE Signal Processing Letters, vol. 23, no. 2, pp. 287–291, Feb. 2016.

[126] G. Peeters, “Deriving musical structures from signal analysis for music audio summary generation: "Sequence" and "state" approach,” in International symposium on computer music multidisciplinary research, 2003.

[127] R. B. Dannenberg and M. Goto, “Music structure analysis from acoustic signals,” in Handbook of signal processing in acoustics, Springer New York, 2008, pp. 305–331.

[128] J. Paulus, M. Müller, and A. Klapuri, “Audio-based music structure analysis,” in 11th international society for music information retrieval conference, 2010.

[129] Z. Rafii and B. Pardo, “A simple music/voice separation system based on the extraction of the repeating musical structure,” in IEEE international conference on acoustics, speech and signal processing, 2011.

[130] Z. Rafii and B. Pardo, “REpeating Pattern Extraction Technique (REPET): A simple method for music/voice separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 1, pp. 73–84, Jan. 2013.

[131] Z. Rafii, A. Liutkus, and B. Pardo, “REPET for background/foreground separation in audio,” in Blind source separation, Springer Berlin Heidelberg, 2014, pp. 395–411.

[132] J. Foote and S. Uchihashi, “The beat spectrum: A new approach to rhythm analysis,” in IEEE international conference on multimedia and expo, 2001.

[133] P. Seetharaman, F. Pishdadian, and B. Pardo, “Music/voice separation using the 2D Fourier transform,” in IEEE workshop on applications of signal processing to audio and acoustics, 2017.

[134] A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, “Adaptive filtering for music/voice separation exploiting the repeating musical structure,” in IEEE international conference on acoustics, speech and signal processing, 2012.

[135] Z. Rafii and B. Pardo, “Music/voice separation using the similarity matrix,” in 13th international society for music information retrieval conference, 2012.

[136] J. Foote, “Visualizing music and audio using self-similarity,” in 7th acm international conference on multimedia, 1999.

[137] Z. Rafii and B. Pardo, “Online REPET-SIM for real-time speech enhancement,” in IEEE international conference on acoustics, speech and signal processing, 2013.

[138] Z. Rafii, A. Liutkus, and B. Pardo, “A simple user interface system for recovering patterns repeating in time and frequency in mixtures of sounds,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[139] D. FitzGerald, “Vocal separation using nearest neighbours and median filtering,” in 23rd iet irish signals and systems conference, 2012.

[140] A. Liutkus, Z. Rafii, B. Pardo, D. FitzGerald, and L. Daudet, “Kernel spectrogram models for source separation,” in 4th joint workshop on hands-free speech communication microphone arrays, 2014.

[141] A. Liutkus, D. FitzGerald, Z. Rafii, B. Pardo, and L. Daudet, “Kernel additive models for source separation,” IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4298–4310, Aug. 2014.

[142] A. Liutkus, D. FitzGerald, and Z. Rafii, “Scalable audio separation with light kernel additive modelling,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[143] T. Prätzlich, R. Bittner, A. Liutkus, and M. Müller, “Kernel additive modeling for interference reduction in multi-channel music recordings,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[144] D. F. Yela, S. Ewert, D. FitzGerald, and M. Sandler, “Interference reduction in music recordings combining kernel additive modelling and non-negative matrix factorization,” in IEEE international conference on acoustics, speech and signal processing, 2017.

[145] M. Moussallam, G. Richard, and L. Daudet, “Audio source separation informed by redundancy with greedy multiscale decompositions,” in 20th european signal processing conference, 2012.

[146] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415, Dec. 1993.

[147] H. Deif, D. FitzGerald, W. Wang, and L. Gan, “Separation of vocals from monaural music recordings using diagonal median filters and practical time-frequency parameters,” in IEEE international symposium on signal processing and information technology, 2015.

[148] D. FitzGerald and M. Gainza, “Single channel vocal separation using median filtering and factorisation techniques,” ISAST Transactions on Electronic and Signal Processing, vol. 4, no. 1, pp. 62–73, Jan. 2010.

[149] J.-Y. Lee and H.-G. Kim, “Music and voice separation using log-spectral amplitude estimator based on kernel spectrogram models backfitting,” Journal of the Acoustical Society of Korea, vol. 34, no. 3, pp. 227–233, 2015.

[150] J.-Y. Lee, H.-S. Cho, and H.-G. Kim, “Vocal separation from monaural music using adaptive auditory filtering based on kernel back-fitting,” in Interspeech, 2015.

[151] H.-S. Cho, J.-Y. Lee, and H.-G. Kim, “Singing voice separation from monaural music based on kernel back-fitting using beta-order spectral amplitude estimation,” in 16th international society for music information retrieval conference, 2015.

[152] H.-G. Kim and J. Y. Kim, “Music/voice separation based on kernel back-fitting using weighted (\beta)-order MMSE estimation,” ETRI Journal, vol. 38, no. 3, pp. 510–517, Jun. 2016.

[153] E. Plourde and B. Champagne, “Auditory-based spectral amplitude estimators for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1614–1623, Nov. 2008.

[154] B. Raj, P. Smaragdis, M. Shashanka, and R. Singh, “Separating a foreground singer from background music,” in International symposium on frontiers of research on speech and music, 2007.

[155] P. Smaragdis and B. Raj, “Shift-invariant probabilistic latent component analysis,” MERL, 2006.

[156] B. Raj and P. Smaragdis, “Latent variable decomposition of spectrograms for single channel speaker separation,” in IEEE workshop on applications of signal processing to audio and acoustics, 2005.

[157] J. Han and C.-W. Chen, “Improving melody extraction using probabilistic latent component analysis,” in IEEE international conference on acoustics, speech and signal processing, 2011.

[158] P. Boersma, “Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound,” in IFA proceedings 17, 1993.

[159] E. Gómez, F. J. C. Quesada, J. Salamon, J. Bonada, P. V. Candea, and P. C. Molero, “Predominant fundamental frequency estimation vs singing voice separation for the automatic transcription of accompanied flamenco singing,” in 13th international society for music information retrieval conference, 2012.

[160] N. Ono, K. Miyamoto, J. L. Roux, H. Kameoka, and S. Sagayama, “Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram,” in 16th european signal processing conference, 2008.

[161] H. Papadopoulos and D. P. Ellis, “Music-content-adaptive robust principal component analysis for a semantically consistent separation of foreground and background in music audio signals,” in 17th international conference on digital audio effects, 2014.

[162] T.-S. Chan et al., “Vocal activity informed singing voice separation with the iKala dataset,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[163] I.-Y. Jeong and K. Lee, “Singing voice separation using RPCA with weighted (l_1)-norm,” in 13th international conference on latent variable analysis and signal separation, 2017.

[164] T. Virtanen, A. Mesaros, and M. Ryynänen, “Combining pitch-based inference and non-negative spectrogram factorization in separating vocals from polyphonic music,” in ISCA tutorial and research workshop on statistical and perceptual audition, 2008.

[165] Y. Wang and Z. Ou, “Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio,” in IEEE international conference on acoustics, speech and signal processing, 2011.

[166] A. Klapuri, “Multiple fundamental frequency estimation by summing harmonic amplitudes,” in 7th international conference on music information retrieval, 2006.

[167] C.-L. Hsu, L.-Y. Chen, J.-S. R. Jang, and H.-J. Li, “Singing pitch extraction from monaural polyphonic songs by contextual audio modeling and singing harmonic enhancement,” in 10th international society for music information retrieval conference, 2009.

[168] Z. Rafii, Z. Duan, and B. Pardo, “Combining rhythm-based and pitch-based methods for background and melody separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1884–1893, Sep. 2014.

[169] Z. Duan and B. Pardo, “Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2121–2133, Nov. 2010.

[170] S. Venkataramani, N. Nayak, P. Rao, and R. Velmurugan, “Vocal separation using singer-vowel priors obtained from polyphonic audio,” in 15th international society for music information retrieval conference, 2014.

[171] V. Rao and P. Rao, “Vocal melody extraction in the presence of pitched accompaniment in polyphonic music,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2145–2154, Nov. 2010.

[172] V. Rao, C. Gupta, and P. Rao, “Context-aware features for singing voice detection in polyphonic music,” in International workshop on adaptive multimedia retrieval, 2011.

[173] M. Kim, J. Yoo, K. Kang, and S. Choi, “Nonnegative matrix partial co-factorization for spectral and temporal drum source separation,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1192–1204, Oct. 2011.

[174] L. Zhang, Z. Chen, M. Zheng, and X. He, “Nonnegative matrix and tensor factorizations: An algorithmic perspective,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 54–65, May 2014.

[175] Y. Ikemiya, K. Yoshii, and K. Itoyama, “Singing voice analysis and editing based on mutually dependent F0 estimation and source separation,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[176] Y. Ikemiya, K. Itoyama, and K. Yoshii, “Singing voice separation and vocal F0 estimation based on mutual combination of robust principal component analysis and subharmonic summation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2084–2095, Nov. 2016.

[177] D. J. Hermes, “Measurement of pitch by subharmonic summation,” Journal of the Acoustical Society of America, vol. 83, no. 1, pp. 257–264, Jan. 1988.

[178] A. Dobashi, Y. Ikemiya, K. Itoyama, and K. Yoshii, “A music performance assistance system based on vocal, harmonic, and percussive source separation and content visualization for music audio signals,” in 12th sound and music computing conference, 2015.

[179] Y. Hu and G. Liu, “Separation of singing voice using nonnegative matrix partial co-factorization for singer identification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 4, pp. 643–653, Apr. 2015.

[180] J. Yoo, M. Kim, K. Kang, and S. Choi, “Nonnegative matrix partial co-factorization for drum source separation,” in IEEE international conference on acoustics, speech and signal processing, 2010.

[181] P. Boersma, “PRAAT, a system for doing phonetics by computer,” Glot International, vol. 5, no. 9/10, pp. 341–347, Dec. 2001.

[182] Y. Li, J. Woodruff, and D. Wang, “Monaural musical sound separation based on pitch and common amplitude modulation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 7, pp. 1361–1371, Sep. 2009.

[183] B. Raj, M. L. Seltzer, and R. M. Stern, “Reconstruction of missing features for robust speech recognition,” Speech Communication, vol. 43, no. 4, pp. 275–296, Sep. 2004.

[184] Y. Hu and G. Liu, “Monaural singing voice separation by non-negative matrix partial co-factorization with temporal continuity and sparsity criteria,” in 12th international conference on intelligent computing, 2016.

[185] X. Zhang, W. Li, and B. Zhu, “Latent time-frequency component analysis: A novel pitch-based approach for singing voice separation,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[186] A. de Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, Apr. 2002.

[187] B. Zhu, W. Li, and L. Li, “Towards solving the bottleneck of pitch-based singing voice separation,” in 23rd acm international conference on multimedia, 2015.

[188] J.-L. Durrieu, G. Richard, and B. David, “Singer melody extraction in polyphonic signals using source separation methods,” in IEEE international conference on acoustics, speech and signal processing, 2008.

[189] J.-L. Durrieu, G. Richard, and B. David, “An iterative approach to monaural musical mixture de-soloing,” in IEEE international conference on acoustics, speech and signal processing, 2009.

[190] J.-L. Durrieu, G. Richard, B. David, and C. Févotte, “Source/filter model for unsupervised main melody extraction from polyphonic audio signals,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 564–575, Mar. 2010.

[191] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, “Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1564–1578, Jul. 2007.

[192] D. H. Klatt and L. C. Klatt, “Analysis, synthesis, and perception of voice quality variations among female and male talkers,” Journal of the Acoustical Society of America, vol. 87, no. 2, pp. 820–857, Feb. 1990.

[193] L. Benaroya, L. Mcdonagh, F. Bimbot, and R. Gribonval, “Non negative sparse representation for Wiener based source separation with a single sensor,” in IEEE international conference on acoustics, speech and signal processing, 2003.

[194] I. S. Dhillon and S. Sra, “Generalized nonnegative matrix approximations with Bregman divergences,” in Advances in neural information processing systems 18, MIT Press, 2005, pp. 283–290.

[195] L. Benaroya, F. Bimbot, and R. Gribonval, “Audio source separation with a single sensor,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 191–199, Jan. 2006.

[196] J.-L. Durrieu and J.-P. Thiran, “Musical audio source separation based on user-selected F0 track,” in 10th international conference on latent variable analysis and signal separation, 2012.

[197] B. Fuentes, R. Badeau, and G. Richard, “Blind harmonic adaptive decomposition applied to supervised source separation,” in Signal processing conference (eusipco), 2012 proceedings of the 20th european, 2012, pp. 2654–2658.

[198] J. C. Brown, “Calculation of a constant Q spectral transform,” Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, Jan. 1991.

[199] J. C. Brown and M. S. Puckette, “An efficient algorithm for the calculation of a constant Q transform,” Journal of the Acoustical Society of America, vol. 92, no. 5, pp. 2698–2701, Nov. 1992.

[200] C. Schörkhuber and A. Klapuri, “Constant-Q transform toolbox,” in 7th sound and music computing conference, 2010.

[201] J.-L. Durrieu, B. David, and G. Richard, “A musically motivated mid-level representation for pitch estimation and musical audio source separation,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1180–1191, Oct. 2011.

[202] C. Joder and B. Schuller, “Score-informed leading voice separation from monaural audio,” in 13th international society for music information retrieval conference, 2012.

[203] C. Joder, S. Essid, and G. Richard, “A conditional random field framework for robust and scalable audio-to-score matching,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2385–2397, Nov. 2011.

[204] R. Zhao, S. Lee, D.-Y. Huang, and M. Dong, “Soft constrained leading voice separation with music score guidance,” in 9th international symposium on chinese spoken language, 2014.

[205] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, “Main instrument separation from stereophonic audio signals using a source/filter model,” in 17th european signal processing conference, 2009.

[206] J. Janer and R. Marxer, “Separation of unvoiced fricatives in singing voice mixtures with semi-supervised NMF,” in 16th international conference on digital audio effects, 2013.

[207] J. Janer, R. Marxer, and K. Arimoto, “Combining a harmonic-based NMF decomposition with transient analysis for instantaneous percussion separation,” in IEEE international conference on acoustics, speech and signal processing, 2012.

[208] R. Marxer and J. Janer, “Modelling and separation of singing voice breathiness in polyphonic mixtures,” in 16th international conference on digital audio effects, 2013.

[209] G. Degottex, A. Roebel, and X. Rodet, “Pitch transposition and breathiness modification using a glottal source model and its adapted vocal-tract filter,” in IEEE international conference on acoustics, speech and signal processing, 2011.

[210] A. Ozerov, E. Vincent, and F. Bimbot, “A general modular framework for audio source separation,” in 9th international conference on latent variable analysis and signal separation, 2010.

[211] A. Ozerov, E. Vincent, and F. Bimbot, “A general flexible framework for the handling of prior information in audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1118–1133, May 2012.

[212] Y. Salaün et al., “The flexible audio source separation toolbox version 2.0,” in IEEE international conference on acoustics, speech and signal processing, 2014.

[213] R. Hennequin and F. Rigaud, “Long-term reverberation modeling for under-determined audio source separation with application to vocal melody extraction,” in 17th international society for music information retrieval conference, 2016.

[214] R. Singh, B. Raj, and P. Smaragdis, “Latent-variable decomposition based dereverberation of monaural and multi-channel signals,” in IEEE international conference on acoustics, speech and signal processing, 2010.

[215] N. Ono, K. Miyamoto, H. Kameoka, and S. Sagayama, “A real-time equalizer of harmonic and percussive components in music signals,” in 9th international conference on music information retrieval, 2008.

[216] D. FitzGerald, “Harmonic/percussive separation using median filtering,” in 13th international conference on digital audio effects, 2010.

[217] Y.-H. Yang, “On sparse and low-rank matrix decomposition for singing voice separation,” in 20th acm international conference on multimedia, 2012.

[218] I.-Y. Jeong and K. Lee, “Vocal separation from monaural music using temporal/spectral continuity and sparsity constraints,” IEEE Signal Processing Letters, vol. 21, no. 10, pp. 1197–1200, Jun. 2014.

[219] E. Ochiai, T. Fujisawa, and M. Ikehara, “Vocal separation by constrained non-negative matrix factorization,” in Asia-pacific signal and information processing association annual summit and conference, 2015.

[220] T. Watanabe, T. Fujisawa, and M. Ikehara, “Vocal separation using improved robust principal component analysis and post-processing,” in IEEE 59th international midwest symposium on circuits and systems, 2016.

[221] H. Raguet, J. Fadili, and and Gabriel Peyré, “A generalized forward-backward splitting,” SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1199–1226, Jul. 2013.

[222] A. Hayashi, H. Kameoka, T. Matsubayashi, and H. Sawada, “Non-negative periodic component analysis for music source separation,” in Asia-pacific signal and information processing association annual summit and conference, 2016.

[223] D. FitzGerald, M. Cranitch, and E. Coyle, “Using tensor factorisation models to separate drums from polyphonic music,” in 12th international conference on digital audio effects, 2009.

[224] H. Tachibana, N. Ono, and S. Sagayama, “Singing voice enhancement in monaural music signals based on two-stage harmonic/percussive sound separation on multiple resolution spectrograms,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 1, pp. 228–237, Jan. 2014.

[225] H. Tachibana, T. Ono, N. Ono, and S. Sagayama, “Melody line estimation in homophonic music audio signals based on temporal-variability of melodic source,” in IEEE international conference on acoustics, speech and signal processing, 2010.

[226] H. Tachibana, N. Ono, and S. Sagayama, “A real-time audio-to-audio karaoke generation system for monaural recordings based on singing voice suppression and key conversion techniques,” Journal of Information Processing, vol. 24, no. 3, pp. 470–482, May 2016.

[227] N. Ono et al., “Harmonic and percussive sound separation and its application to MIR-related tasks,” in Advances in music information retrieval, Springer Berlin Heidelberg, 2010, pp. 213–236.

[228] H. Tachibana, H. Kameoka, N. Ono, and S. Sagayama, “Comparative evaluations of multiple harmonic/percussive sound separation techniques based on anisotropic smoothness of spectrogram,” in IEEE international conference on acoustics, speech and signal processing, 2012.

[229] H. Deif, W. Wang, L. Gan, and S. Alhashmi, “A local discontinuity based approach for monaural singing voice separation from accompanying music with multi-stage non-negative matrix factorization,” in IEEE global conference on signal and information processing, 2015.

[230] B. Zhu, W. Li, R. Li, and X. Xue, “Multi-stage non-negative matrix factorization for monaural singing voice separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2096–2107, Oct. 2013.

[231] J. Driedger and M. Müller, “Extracting singing voice from music recordings by cascading audio decomposition techniques,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[232] J. Driedger, M. Müller, and S. Disch, “Extending harmonic-percussive separation of audio signals,” in 15th international society for music information retrieval conference, 2014.

[233] R. Talmon, I. Cohen, and S. Gannot, “Transient noise reduction using nonlocal diffusion filters,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 19, no. 6, pp. 1584–1599, Aug. 2011.

[234] C.-L. Hsu, D. Wang, J.-S. R. Jang, and K. Hu, “A tandem algorithm for singing pitch extraction and voice separation from music accompaniment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 5, pp. 1482–1491, Jul. 2012.

[235] G. Hu and D. Wang, “A tandem algorithm for pitch estimation and voiced speech segregation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2067–2079, Nov. 2010.

[236] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1, MIT Press Cambridge, 1986, pp. 318–362.

[237] N. J. Bryan and G. J. Mysore, “Interactive user-feedback for sound source separation,” in International conference on intelligent user-interfaces, workshop on interactive machine learning, 2013.

[238] N. J. Bryan and G. J. Mysore, “An efficient posterior regularized latent variable model for interactive sound source separation,” in 30th international conference on machine learning, 2013.

[239] N. J. Bryan and G. J. Mysore, “Interactive refinement of supervised and semi-supervised sound source separation estimates,” in IEEE international conference on acoustics, speech, and signal processing, 2013.

[240] K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar, “Posterior regularization for structured latent variable models,” Journal of Machine Learning Research, vol. 11, pp. 2001–2049, Mar. 2010.

[241] A. Ozerov, N. Duong, and L. Chevallier, “Weighted nonnegative tensor factorization: On monotonicity of multiplicative update rules and application to user-guided audio source separation,” Technicolor, 2013.

[242] X. Jaureguiberry, G. Richard, P. Leveau, R. Hennequin, and E. Vincent, “Introducing a simple fusion framework for audio source separation,” in IEEE international workshop on machine learning for signal processing, 2013.

[243] X. Jaureguiberry, E. Vincent, and G. Richard, “Variational Bayesian model averaging for audio source separation,” in IEEE workshop on statistical signal processing workshop, 2014.

[244] X. Jaureguiberry, E. Vincent, and G. Richard, “Fusion methods for speech enhancement and audio source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1266–1279, Jul. 2016.

[245] J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky, “Bayesian model averaging: A tutorial,” Statistical Science, vol. 14, no. 4, pp. 382–417, Nov. 1999.

[246] M. McVicar, R. Santos-Rodriguez, and T. D. Bie, “Learning to separate vocals from polyphonic mixtures via ensemble methods and structured output prediction,” in IEEE international conference on acoustics, speech and signal processing, 2016.

[247] A. K. Jain and F. Farrokhnia, “Unsupervised texture segmentation using Gabor filters,” in IEEE international conference on systems, man and cybernetics, 1990.

[248] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Singing-voice separation from monaural recordings using deep recurrent neural networks,” in 15th international society for music information retrieval conference, 2014.

[249] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher, “Block-coordinate Frank-Wolfe optimization for structural SVMs,” in 30th international conference on machine learning, 2013.

[250] E. Manilow, P. Seetharaman, F. Pishdadian, and B. Pardo, “Predicting algorithm efficacy for adaptive, multi-cue source separation,” in IEEE workshop on applications of signal processing to audio and acoustics, 2017.

[251] G. Wolf, S. Mallat, and S. Shamma, “Audio source separation with time-frequency velocities,” in IEEE international workshop on machine learning for signal processing, 2014.

[252] G. Wolf, S. Mallat, and S. Shamma, “Rigid motion model for audio source separation,” IEEE Transactions on Signal Processing, vol. 64, no. 7, pp. 1822–1831, Apr. 2016.

[253] J. Andén and S. Mallat, “Deep scattering spectrum,” IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4114–4128, Aug. 2014.

[254] C. P. Bernard, “Discrete wavelet analysis for fast optic flow computation,” Applied and Computational Harmonic Analysis, vol. 11, no. 1, pp. 32–63, Jul. 2001.

[255] F. Yen, Y.-J. Luo, and T.-S. Chi, “Singing voice separation using spectro-temporal modulation features,” in 15th international society for music information retrieval conference, 2014.

[256] F. Yen, M.-C. Huang, and T.-S. Chi, “A two-stage singing voice separation algorithm using spectro-temporal modulation features,” in Interspeech, 2015.

[257] T. Chi, P. Rub, and S. A. Shamma, “Multiresolution spectrotemporal analysis of complex sounds,” Journal of the Acoustical Society of America, vol. 118, no. 2, pp. 887–906, Aug. 2005.

[258] T. Chi, Y. Gao, M. C. Guyton, P. Ru, and S. Shamma, “Spectro-temporal modulation transfer functions and speech intelligibility,” Journal of the Acoustical Society of America, vol. 106, no. 5, pp. 2719–2732, Nov. 1999.

[259] T.-S. T. Chan and Y.-H. Yang, “Informed group-sparse representation for singing voice separation,” IEEE Signal Processing Letters, vol. 24, no. 2, pp. 156–160, Feb. 2017.

[260] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society Series B, vol. 68, no. 1, pp. 49–67, Dec. 2006.

[261] S. Ma, “Alternating proximal gradient method for convex minimization,” Journal of Scientific Computing, vol. 68, no. 2, pp. 546–572, Aug. 2016.

[262] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, Jan. 2007.

[263] A. Varga and H. J. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Communication, vol. 12, no. 3, pp. 247–251, Jul. 1993.

[264] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon technical report n. 1993.

[265] N. Sturmel et al., “Linear mixing models for active listening of music productions in realistic studio conditions,” in 132nd aes convention, 2012.

[266] M. Vinyes, “MTG MASS database.” 2008.

[267] E. Vincent, S. Araki, and P. Bofill, “The 2008 signal separation evaluation campaign: A community-based approach to large-scale evaluation,” in 8th international conference on independent component analysis and signal separation, 2009.

[268] S. Araki et al., “The 2010 signal separation evaluation campaign (SiSEC2010): - audio source separation -,” in 9th international conference on latent variable analysis and signal separation, 2010.

[269] S. Araki et al., “The 2011 signal separation evaluation campaign (SiSEC2011): - audio source separation -,” in 10th international conference on latent variable analysis and signal separation, 2012.

[270] E. Vincent et al., “The signal separation evaluation campaign (2007-2010): Achievements and remaining challenges,” Signal Processing, vol. 92, no. 8, pp. 1928–1936, Aug. 2012.

[271] N. Ono, Z. Rafii, D. Kitamura, N. Ito, and A. Liutkus, “The 2015 signal separation evaluation campaign,” in 12th international conference on latent variable analysis and signal separation, 2015.

[272] A. Liutkus et al., “The 2016 signal separation evaluation campaign,” in 13th international conference on latent variable analysis and signal separation, 2017.

[273] A. Liutkus, R. Badeau, and G. Richard, “Gaussian processes for underdetermined source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 59, no. 7, pp. 3155–3167, Feb. 2011.

[274] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and and Juan P. Bello, “MedleyDB: A multitrack dataset for annotation-intensive mir research,” in 15th international society for music information retrieval conference, 2014.

[275] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “MUSDB18, a dataset for audio source separation.” Dec-2017.

[276] A. Ozerov, P. Philippe, R. Gribonval, and F. Bimbot, “One microphone singing voice separation using source-adapted models,” in IEEE workshop on applications of signal processing to audio and acoustics, 2005.

[277] W.-H. Tsai, D. Rogers, and H.-M. Wang, “Blind clustering of popular music recordings based on singer voice characteristics,” Computer Music Journal, vol. 28, no. 3, pp. 68–78, 2004.

[278] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 2, no. 2, pp. 291–298, Apr. 1994.

[279] E. Vincent, M. Jafari, S. Abdallah, M. Plumbley, and M. Davies, “Probabilistic modeling paradigms for audio source separation,” in Machine audition: Principles, algorithms and systems, IGI Global, 2010, pp. 162–185.

[280] Z. Rafii, D. L. Sun, F. G. Germain, and G. J. Mysore, “Combining modeling of singing voice and background music for automatic separation of musical mixtures,” in 14th international society for music information retrieval conference, 2013.

[281] N. Boulanger-Lewandowski, G. J. Mysore, and M. Hoffman, “Exploiting long-term temporal dependencies in NMF using recurrent neural networks with application to source separation,” in IEEE international conference on acoustics, speech and signal processing, 2014.

[282] G. J. Mysore, P. Smaragdis, and B. Raj, “Non-negative hidden Markov modeling of audio with application to source separation,” in 9th international conference on latent variable analysis and signal separation, 2010.

[283] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson, “Speech enhancement using bayesian wavenet,” Proc. Interspeech 2017, pp. 2013–2017, 2017.

[284] L. Deng and D. Yu, “Deep learning: Methods and applications,” Foundations and Trends in Signal Processing, vol. 7, nos. 3-4, pp. 197–387, Jun. 2014.

[285] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.

[286] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016.

[287] H. Robbins and S. Monro, “A stochastic approximation method,” Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, Sep. 1951.

[288] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, Oct. 1986.

[289] M. Hermans and B. Schrauwen, “Training and analysing deep recurrent neural networks,” in 26th international conference on neural information processing systems, 2013.

[290] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural networks,” in International conference on learning representations, 2014.

[291] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, 2015.

[292] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Deep learning for monaural speech separation,” in IEEE international conference on acoustics, speech and signal processing, 2014.

[293] S. Uhlich, F. Giron, and Y. Mitsufuji, “Deep neural network based instrument extraction from music,” in IEEE international conference on acoustics, speech and signal processing, 2015.

[294] S. Uhlich et al., “Improving music source separation based on deep neural networks through data augmentation and network blending,” in IEEE international conference on acoustics, speech and signal processing, 2017.

[295] A. J. R. Simpson, G. Roma, and M. D. Plumbley, “Deep karaoke: Extracting vocals from musical mixtures using a convolutional deep neural network,” in 12th international conference on latent variable analysis and signal separation, 2015.

[296] J. Schlüter, “Learning to pinpoint singing voice from weakly labeled examples,” in 17th international society for music information retrieval conference, 2016.

[297] P. Chandna, M. Miron, J. Janer, and E. Gómez, “Monoaural audio source separation using deep convolutional neural networks,” in 13th international conference on latent variable analysis and signal separation, 2017.

[298] S. I. Mimilakis, E. Cano, J. Abeßer, and G. Schuller, “New sonorities for jazz recordings: Separation and mixing using deep neural networks,” in 2nd aes workshop on intelligent music production, 2016.

[299] S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, “A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation,” in IEEE international workshop on machine learning for signal processing, 2017.

[300] S. I. Mimilakis, K. Drossos, J. F. Santos, G. Schuller, T. Virtanen, and Y. Bengio, “Monaural singing voice separation with skip-filtering connections and recurrent inference of time-frequency mask,” in IEEE international conference on acoustics, speech and signal processing, 2018.

[301] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-Net convolutional networks,” in 18th international society for music information retrieval conferenceng, 2017.

[302] N. Takahashi and Y. Mitsufuji, “Multi-scale multi-band densenets for audio source separation,” in IEEE workshop on applications of signal processing to audio and acoustics, 2017.

[303] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE international conference on acoustics, speech and signal processing, 2016.

[304] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multispeaker separation using deep clustering,” in Interspeech, 2016.

[305] Y. Luo, Z. Chen, J. R. Hershey, J. L. Roux, and N. Mesgarani, “Deep clustering and conventional networks for music separation: Stronger together,” in IEEE international conference on acoustics, speech and signal processing, 2017.

[306] M. Kim and P. Smaragdis, “Adaptive denoising autoencoders: A fine-tuning scheme to learn from test mixtures,” in 12th international conference on latent variable analysis and signal separation, 2015.

[307] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, Dec. 2010.

[308] E. M. Grais, G. Roma, A. J. R. Simpson, and M. D. Plumbley, “Single channel audio source separation using deep neural network ensembles,” in 140th aes convention, 2016.

[309] E. M. Grais, G. Roma, A. J. R. Simpson, and M. D. Plumbley, “Combining mask estimates for single channel audio source separation using deep neural networks,” in Interspeech, 2016.

[310] E. M. Grais, G. Roma, A. J. R. Simpson, and M. D. Plumbley, “Discriminative enhancement for single channel audio source separation using deep neural networks,” in 13th international conference on latent variable analysis and signal separation, 2017.

[311] E. M. Grais, G. Roma, A. J. R. Simpson, and M. D. Plumbley, “Two-stage single-channel audio source separation using deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 9, pp. 1773–1783, Sep. 2017.

[312] S. Nie et al., “Joint optimization of recurrent networks exploiting source auto-regression for source separation,” in Interspeech, 2015.

[313] J. Sebastian and H. A. Murthy, “Group delay based music source separation using deep recurrent neural networks,” in International conference on signal processing and communications, 2016.

[314] B. Yegnanarayana, H. A. Murthy, and V. R. Ramachandran, “Processing of noisy speech using modified group delay functions,” in IEEE international conference on acoustics, speech and signal processing, 1991.

[315] Z.-C. Fan, J.-S. R. Jang, and C.-L. Lu, “Singing voice separation and pitch extraction from monaural polyphonic audio music via DNN and adaptive pitch tracking,” in IEEE international conference on multimedia big data, 2016.

[316] C. Avendano, “Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications,” in IEEE workshop on applications of signal processing to audio and acoustics, 2003.

[317] C. Avendano and J.-M. Jot, “Frequency domain techniques for stereo to multichannel upmix,” in AES 22nd international conference, 2002.

[318] D. Barry, B. Lawlor, and E. Coyle, “Sound source separation: Azimuth discrimination and resynthesis,” in 7th international conference on digital audio effects, 2004.

[319] M. Vinyes, J. Bonada, and A. Loscos, “Demixing commercial music productions via human-assisted time-frequency masking,” in 120th aes convention, 2006.

[320] M. Cobos and J. J. López, “Stereo audio source separation based on time-frequency masking and multilevel thresholding,” Digital Signal Processing, vol. 18, no. 6, pp. 960–976, Nov. 2008.

[321] Ö. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, Jul. 2004.

[322] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, Jan. 1979.

[323] S. Sofianos, A. Ariyaeeinia, and R. Polfreman, “Towards effective singing voice extraction from stereophonic recordings,” in IEEE international conference on acoustics, speech and signal processing, 2010.

[324] S. Sofianos, A. Ariyaeeinia, and R. Polfreman, “Singing voice separation based on non-vocal independent component subtraction,” in 13th international conference on digital audio effects, 2010.

[325] S. Sofianos, A. Ariyaeeinia, R. Polfreman, and R. Sotudeh, “H-semantics: A hybrid approach to singing voice separation,” Journal of the Audio Engineering Society, vol. 60, no. 10, pp. 831–841, Oct. 2012.

[326] M. Kim, S. Beack, K. Choi, and K. Kang, “Gaussian mixture model for singing voice separation from stereophonic music,” in AES 43rd conference, 2011.

[327] M. Cobos and J. J. López, “Singing voice separation combining panning information and pitch tracking,” in AES 124th convention, 2008.

[328] D. FitzGerald, “Stereo vocal extraction using ADRess and nearest neighbours median filtering,” in 16th international conference on digital audio effects, 2013.

[329] D. FitzGerald and R. Jaiswal, “Improved stereo instrumental track recovery using median nearest-neighbour inpainting,” in 24th iet irish signals and systems conference, 2013.

[330] A. Adler, V. Emiya, M. G. Jafari, M. Elad, R. Gribonval, and M. D. Plumbley, “Audio inpainting,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 922–932, Mar. 2012.

[331] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures with application to blind audio source separation,” in IEEE international conference on acoustics, speech and signal processing, 2009.

[332] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550–563, Mar. 2010.

[333] A. Ozerov, C. Févotte, R. Blouet, and J.-L. Durrieu, “Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation,” in IEEE international conference on acoustics, speech and signal processing, 2011.

[334] A. Liutkus, R. Badeau, and G. Richard, “Informed source separation using latent components,” in 9th international conference on latent variable analysis and signal separation, 2010.

[335] C. Févotte and A. Ozerov, “Notes on nonnegative tensor factorization of the spectrogram for audio source separation: Statistical insights and towards self-clustering of the spatial cues,” in 7th international symposium on computer music modeling and retrieval, 2010.

[336] A. Ozerov, N. Duong, and L. Chevallier, “On monotonicity of multiplicative update rules for weighted nonnegative tensor factorization,” in International symposium on nonlinear theory and its applications, 2014.

[337] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, “New formulations and efficient algorithms for multichannel NMF,” in IEEE workshop on applications of signal processing to audio and acoustics, 2011.

[338] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, “Efficient algorithms for multichannel extensions of Itakura-Saito nonnegative matrix factorization,” in IEEE international conference on acoustics, speech and signal processing, 2012.

[339] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, “Multichannel extensions of non-negative matrix factorization with complex-valued data,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 971–982, May 2013.

[340] S. Sivasankaran et al., “Robust ASR using neural network based speech enhancement and feature simulation,” in IEEE automatic speech recognition and understanding workshop, 2015.

[341] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1652–1664, Sep. 2016.

[342] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” Inria, 2015.

[343] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel music separation with deep neural networks,” in 24th european signal processing conference, 2016.

[344] N. Q. K. Duong, E. Vincent, and R. Gribonval, “Under-determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1830–1840, Sep. 2010.

[345] A. Ozerov, A. Liutkus, R. Badeau, and G. Richard, “Informed source separation: Source coding meets source separation,” in IEEE workshop on applications of signal processing to audio and acoustics, 2011.

[346] E. Zwicker and H. Fastl, Psychoacoustics: Facts and models. Springer-Verlag Berlin Heidelberg, 2013.

[347] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in IEEE international conference on acoustics, speech and signal processing, 2001.

[348] Z. Wang and A. C. Bovik, “Mean squared error: Love it or leave it? A new look at signal fidelity measures,” IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, Jan. 2009.

[349] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in IEEE workshop on automatic speech recognition and understanding, 2015.

[350] I. Recommendation, “Bs. 1534-1. method for the subjective assessment of intermediate sound quality (MUSHRA),” International Telecommunications Union, Geneva, 2001.

[351] E. Vincent, M. Jafari, and M. Plumbley, “Preliminary guidelines for subjective evaluation of audio source separation algorithms,” in ICA research network international workshop, 2006.

[352] E. Cano, C. Dittmar, and G. Schuller, “Influence of phase, magnitude and location of harmonic components in the perceived quality of extracted solo signals,” in AES 42nd conference on semantic audio, 2011.

[353] C. Févotte, R. Gribonval, and E. Vinvent, “BSS_EVAL toolbox user guide - revision 2.0,” IRISA, 2005.

[354] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, Jul. 2006.

[355] B. Fox, A. Sabin, B. Pardo, and A. Zopf, “Modeling perceptual similarity of audio signals for blind source separation evaluation,” in 7th international conference on latent variable analysis and signal separation, 2007.

[356] B. Fox and B. Pardo, “Towards a model of perceived quality of blind audio source separation,” in IEEE international conference on multimedia and expo, 2007.

[357] J. Kornycky, B. Gunel, and A. Kondoz, “Comparison of subjective and objective evaluation methods for audio source separation,” Journal of the Acoustical Society of America, vol. 4, no. 1, 2008.

[358] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, “Multi-criteria subjective and objective evaluation of audio source separation,” in 38th international aes conference, 2010.

[359] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, “Subjective and objective quality assessment of audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, Sep. 2011.

[360] E. Vincent, “Improved perceptual metrics for the evaluation of audio source separation,” in 10th international conference on latent variable analysis and signal separation, 2012.

[361] M. Cartwright, B. Pardo, G. J. Mysore, and M. Hoffman, “Fast and easy crowdsourced perceptual audio evaluation,” in IEEE international conference on acoustics, speech and signal processing, 2016.

[362] U. Gupta, E. Moore, and A. Lerch, “On the perceptual relevance of objective source separation measures for singing voice separation,” in IEEE workshop on applications of signal processing to audio and acoustics, 2005.

[363] F.-R. Stöter, A. Liutkus, R. Badeau, B. Edler, and P. Magron, “Common fate model for unison source separation,” in IEEE international conference on acoustics, speech and signal processing, 2016.

[364] G. Roma, E. M. Grais, A. J. Simpson, I. Sobieraj, and M. D. Plumbley, “Untwist: A new toolbox for audio source separation,” in 17th international society on music information retrieval conference, 2016.