Deep Learning for Music Unmixing

Fabian-Robert Stöter & Antoine Liutkus
Inria and LIRMM, Montpellier

fabian-robert.stoter@inria.fr
faroit

September 5th, 2018

Music Unmixing/Separation

Applications

  • Automatic Karaoke
  • Creative Music Production
  • Personal Remixing
  • Music Education

Methods: the big picture

[Figures: the mixture spectrogram shown alongside the vocals, drums, and bass spectrograms]

Classical approach: an inverse problem

Old and new

Paradigm shift: data-centered design

  • Defining sources through examples (2015+)
  • Exploiting deep source-models in signal processing pipelines

Datasets

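| Dataset  | Year | Reference(s)          | Tracks | Duration (s) | Full / stereo? |
|----------|------|-----------------------|--------|--------------|----------------|
| MASS     | 2008 | (Vinyes 2008)         | 9      | 16 ± 7       | ❌ / ✔️        |
| MIR-1K   | 2010 | (Hsu and Jang 2010)   | 1,000  | 8            | ❌ / ❌        |
| QUASI    | 2011 | (Liutkus et al. 2011) | 5      | 206 ± 21     | ✔️ / ✔️        |
| ccMixter | 2014 | (Liutkus et al. 2014) | 50     | 231 ± 77     | ✔️ / ✔️        |
| MedleyDB | 2014 | (Bittner et al. 2014) | 63     | 206 ± 121    | ✔️ / ✔️        |
| iKala    | 2015 | (Chan et al. 2015)    | 206    | 30           | ❌ / ❌        |
| DSD100   | 2015 | (Ono et al. 2015)     | 100    | 251 ± 60     | ✔️ / ✔️        |
| MUSDB18  | 2017 | (Rafii et al. 2017)   | 150    | 236 ± 95     | ✔️ / ✔️        |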

Music separation as a machine learning problem

Generative or discriminative

Classification ...

Binary Masking

... or regression ?

Softmask

Magnitude Spectrogram
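
The two views above can be illustrated with a minimal NumPy sketch. The function names, the toy values, and the assumption that source magnitudes add up to the mixture magnitude (which only holds approximately for real audio) are ours, not taken from the slides.

```python
import numpy as np

def binary_mask(target_mag, residual_mag):
    """Classification view: 1 where the target dominates a bin, else 0."""
    return (target_mag > residual_mag).astype(float)

def soft_mask(target_mag, residual_mag, eps=1e-10):
    """Regression view: ratio mask with values in [0, 1]."""
    return target_mag / (target_mag + residual_mag + eps)

# Toy magnitude spectrograms, shape (frames, frequency).
vocals = np.array([[4.0, 0.0], [1.0, 3.0]])
accomp = np.array([[1.0, 2.0], [1.0, 1.0]])
mixture = vocals + accomp  # assumption: magnitudes are additive

bm = binary_mask(vocals, accomp)
sm = soft_mask(vocals, accomp)

# Either mask is applied to the mixture by element-wise multiplication.
vocals_from_binary = bm * mixture
vocals_from_soft = sm * mixture
```

In this idealized setting the soft mask recovers the vocals magnitude exactly, while the binary mask assigns each bin entirely to one source.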

Supervised ...

  • Single I/O: modeling sources independently
  • Multiple I/O: modeling sources jointly
  • Siamese networks, Chimera Networks

... or unsupervised ? (open direction)

Modeling fixed-sized spectrograms ... ?

  • Separating chunks: straightforward reuse of image models
  • Batching over chunks
  • Fully connected, etc

... or learning dynamic models ?

  • Very long-term dependencies !
  • LSTM, CNN, etc

A Baseline System

Pre-Processing

  • Time-Frequency Transform: «pre-whitening»
  • Normalization: Gain Variation
  • Standardization: Scale Frequency Bands
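
A rough sketch of these pre-processing steps using only NumPy. The window size, hop size, peak normalization, and log compression are illustrative assumptions; the slides do not specify them.

```python
import numpy as np

def stft_magnitude(x, n_fft=1024, hop=256):
    """Magnitude STFT with a Hann window, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def standardize_bands(mag, eps=1e-8):
    """Zero mean and unit variance per frequency band (statistics over frames)."""
    mean = mag.mean(axis=0, keepdims=True)
    std = mag.std(axis=0, keepdims=True)
    return (mag - mean) / (std + eps), mean, std

rng = np.random.default_rng(0)
x = rng.standard_normal(44100)        # one second of noise as a stand-in signal
x = x / (np.abs(x).max() + 1e-8)      # assumed peak normalization against gain variation
mag = stft_magnitude(x)               # time-frequency transform
features, mean, std = standardize_bands(np.log1p(mag))  # log compression is an assumption
```

In practice the band statistics would be computed once on the training set and reused at test time, rather than recomputed per signal as done here for brevity.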

Post-processing

  • Mono Models, Filtering Stereo Signals
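
One simple way to use a mono model on stereo input is to estimate a single mask from the downmix and apply it to each channel's complex STFT. The sketch below assumes exactly that; `filter_stereo` and `toy_model` are hypothetical names, and the actual baseline may use a more elaborate multichannel filter.

```python
import numpy as np

def filter_stereo(stereo_stft, mono_model):
    """Apply a mask predicted from the mono downmix to every channel.

    stereo_stft: complex array of shape (channels, frames, frequency).
    mono_model:  callable mapping a magnitude spectrogram (frames, frequency)
                 to a mask of the same shape with values in [0, 1].
    """
    downmix_mag = np.abs(stereo_stft.mean(axis=0))
    mask = mono_model(downmix_mag)
    return mask[None, :, :] * stereo_stft  # the mixture phase is kept unchanged

def toy_model(mag):
    """Hypothetical stand-in for a trained network: keeps the lower half of the bins."""
    mask = np.zeros_like(mag)
    mask[:, : mag.shape[1] // 2] = 1.0
    return mask

rng = np.random.default_rng(0)
stereo = rng.standard_normal((2, 20, 8)) + 1j * rng.standard_normal((2, 20, 8))
estimate = filter_stereo(stereo, toy_model)
```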

Sampling for Training

  • Slicing Temporal Context
    • Full tracks too large (vanishing gradient)
    • Context usually 1-10 seconds
  • Batch from different Tracks
  • Data Augmentation
    • Image augmentations don't work
    • Apply Random Gains

Architectures I

  • Denoising Auto-Encoder [Uhlich 2014]
  • FCN [Chandna 2017]

Architectures II

  • Bidirectional LSTM [Huang 2014, Uhlich 2015, Takahashi 2018]
  • Sequence2Sequence
  • Input (mix): (sample, frames, frequency)
  • Output (targets): (sample, frames, frequency, source)
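
To make the tensor shapes concrete, here is an untrained, NumPy-only forward pass of a single bidirectional LSTM layer followed by a sigmoid mask layer. The layer sizes, the random weights, and the mask-based output are illustrative assumptions; a real system would use a deep learning framework and train the weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(x, W, U, b, reverse=False):
    """One LSTM direction: (sample, frames, features) -> (sample, frames, hidden)."""
    n_samples, n_frames, _ = x.shape
    hidden = U.shape[0]
    h = np.zeros((n_samples, hidden))
    c = np.zeros((n_samples, hidden))
    out = np.zeros((n_samples, n_frames, hidden))
    order = range(n_frames - 1, -1, -1) if reverse else range(n_frames)
    for t in order:
        z = x[:, t] @ W + h @ U + b          # all four gates at once
        i, f, o, g = np.split(z, 4, axis=1)  # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[:, t] = h
    return out

def init_lstm(rng, n_in, hidden):
    """Small random weights for one direction (untrained)."""
    return (0.1 * rng.standard_normal((n_in, 4 * hidden)),
            0.1 * rng.standard_normal((hidden, 4 * hidden)),
            np.zeros(4 * hidden))

rng = np.random.default_rng(0)
n_freq, hidden, n_sources = 16, 8, 4
fwd = init_lstm(rng, n_freq, hidden)
bwd = init_lstm(rng, n_freq, hidden)
W_out = 0.1 * rng.standard_normal((2 * hidden, n_freq * n_sources))

mix = np.abs(rng.standard_normal((2, 10, n_freq)))  # (sample, frames, frequency)
states = np.concatenate(
    [lstm_pass(mix, *fwd), lstm_pass(mix, *bwd, reverse=True)], axis=-1
)
masks = sigmoid(states @ W_out).reshape(2, 10, n_freq, n_sources)
estimates = masks * mix[..., None]                  # (sample, frames, frequency, source)
```

Because the backward pass reads the sequence from the last frame to the first, every output frame depends on the whole excerpt, which is what makes the model sequence-to-sequence.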

SiSEC 2018

DEMO

How trendy is DNN-based source separation?

  • Fully Convolutional Networks
  • Batch Normalization
  • Skip-Connections
  • GAN
  • End-to-End Time-Domain (WaveNet)
  • Capsule Networks
  • Attention
  • Reinforcement Learning
  • ...

Opening considerations

  • Convergence of signal processing, probability theory and DL
  • Learning with limited amounts of data
  • Modeling long-term dependencies
  • Representation learning for sound and music
  • Exploiting domain knowledge, user interaction
  • Unsupervised learning?

Resources