Fabian-Robert Stöter & Antoine Liutkus

Inria and LIRMM, Montpellier

fabian-robert.stoter@inria.fr

faroit

antoine.liutkus@inria.fr

September 23rd, 2018

- Automatic Karaoke
- Creative Music Production
- Active listening
- Upmixing (stereo $\Rightarrow$ 5.1)
- Music Education
- Pre-processing for MIR

- Intense past research
- Many evaluation campaigns: MIREX, SiSEC
- Recent breakthroughs: **separation works**

- Difficult topic
- Signal processing skills
- Deep learning experience

- Objectives of this tutorial
- Basics of DNN, basics of signal processing
- Understand an open-source state of the art system
- Implement it!

- Introduction
- Our vanilla model
- Choosing the right representation
- Tuning the DNN Structure
- Training tricks
- Testing tricks
- Conclusion

- Time-frequency representations
- Filtering
- A brief history of separation
- Datasets
- Deep neural networks

- Start the notebook session
- For one track, display waveforms, play some audio
- Display spectrogram of mixture

- Frames too short: not diagonalized
- Frames too long: not stationary

- Get spectrograms of the sources
- Display the corresponding soft-mask for vocals
- Apply it on the mixture, reconstruct and listen to the result

- Pitch detection
- Clean voices
- "Metallic" artifacts

- Spectral templates
- Low-rank assumptions
- Bad generalization

- Low-rank for music
- Vocals as unstructured
- Strong interferences in general

- Repetitive music
- Non-repetitive vocals
- Solos in vocals

- Harmonic vocals
- Low-rank music
- Poor generalization

- Combining methods
- Handcrafted systems
- Poor generalization

- Combining in a data-driven way
- Doing better than all of them
- Computationally demanding

Name | Year | Reference | #Tracks | Track dur. (s) | Full/stereo? | Total length
---|---|---|---|---|---|---
MASS | 2008 | (Vinyes) | 9 | 16 $\pm$ 7 | ❌ / ✔️ | 2m24s
MIR-1K | 2010 | (Hsu and Jang) | 1,000 | 8 | ❌ / ❌ | 2h13m20s
QUASI | 2011 | (Liutkus et al.) | 5 | 206 $\pm$ 21 | ✔️ / ✔️ | 17m10s
ccMixter | 2014 | (Liutkus et al.) | 50 | 231 $\pm$ 77 | ✔️ / ✔️ | 3h12m30s
MedleyDB | 2014 | (Bittner et al.) | 63 | 206 $\pm$ 121 | ✔️ / ✔️ | 3h36m18s
iKala | 2015 | (Chan et al.) | 206 | 30 | ❌ / ❌ | 1h43m
DSD100 | 2015 | (Ono et al.) | 100 | 251 $\pm$ 60 | ✔️ / ✔️ | 6h58m20s
MUSDB18 | 2017 | (Rafii et al.) | 150 | 236 $\pm$ 95 | ✔️ / ✔️ | 9h50m

- 100 train / 50 test full tracks
- Mastered with professional digital audio workstations
- Parser and evaluation tools at https://sigsep.github.io/datasets/musdb.html

- **SDR**: Source to Distortion Ratio. *Error in the estimate*.
- **SIR**: Source to Interference Ratio. *Presence of other sources*.
- **SAR**: Source to Artifacts Ratio. *Amount of artificial noise*.

- **Better**: matching filters computed track-wise
- **Faster**: 10x

- Loop over some musdb tracks
- Evaluate our separation system on musdb
- Compare to state of the art (SiSEC18)
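For the evaluation itself, `museval` implements the full BSSEval v4 metrics used in SiSEC18. As a sketch of what SDR measures, here is a plain, filter-free version (BSSEval additionally allows a short time-invariant filter on the reference, which this skips):

```python
import numpy as np

def sdr(reference, estimate, eps=1e-10):
    """Plain source-to-distortion ratio in dB: energy of the reference
    over energy of the estimation error."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10 * np.log10(num / den + eps)

rng = np.random.default_rng(0)
ref = rng.standard_normal(44100)
est = ref + 0.1 * rng.standard_normal(44100)   # estimate with 10% noise
score = sdr(ref, est)                          # about 20 dB
```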

- Cascading linear and non-linear operations augments expressive power
- 7 million parameters in our case

- $loss\leftarrow \sum_{(x,y)\in batch}cost\left(y_\Theta\left(x\right), y\right)$
- Update $\Theta$ to reduce the loss!
- We can compute $\frac{\partial loss}{\partial\Theta_{i}}$ for any parameter $\Theta_i$
- "The influence of $\Theta_i$ on the error"
- It's the **gradient**
- Computed through **backpropagation**

- A simple optimization: $\Theta_i\leftarrow \Theta_i - \lambda \frac{\partial loss}{\partial\Theta_{i}}$
- It's the **stochastic gradient descent**
- $\lambda$ is the **learning rate**
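One SGD update written out by hand in pytorch, on a toy least-squares problem (sizes and learning rate are illustrative):

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 3)                   # toy batch
y = torch.randn(8, 1)
theta = torch.zeros(3, 1, requires_grad=True)

lam = 0.01                              # the learning rate lambda
loss = torch.sum((x @ theta - y) ** 2)  # batch loss
loss.backward()                         # backpropagation fills theta.grad
with torch.no_grad():
    theta -= lam * theta.grad           # theta_i <- theta_i - lam * dloss/dtheta_i
    theta.grad.zero_()

new_loss = torch.sum((x @ theta - y) ** 2)  # smaller than loss
```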

- Batching is important: the random batches are what make the descent *stochastic*

There are many other optimization algorithms...

- $y_{t}=f\left(linear\left\{ x_{t},y_{t-1}\right\} \right)$
- Similar to a Markov model
- Exponential decay of information
- Vanishing or exploding gradient for training

- Limited for long-term dependencies
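The recurrence above in a few lines of numpy (weights are random stand-ins). Each step reuses the previous output, which is also why unrolled gradients shrink or blow up over many steps:

```python
import numpy as np

rng = np.random.default_rng(0)
Wx = 0.5 * rng.standard_normal((4, 2))   # input weights (illustrative sizes)
Wy = 0.5 * rng.standard_normal((4, 4))   # recurrent weights

def step(x_t, y_prev):
    # y_t = f(linear{x_t, y_{t-1}}) with f = tanh
    return np.tanh(Wx @ x_t + Wy @ y_prev)

y = np.zeros(4)
for x_t in rng.standard_normal((10, 2)):  # unroll over 10 time steps
    y = step(x_t, y)
```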

- LSTMs are causal systems
- Predict the future from the past

- We can use anti-causal LSTM
- Different predictions!

- Independent forward and backward
- Outputs can be concatenated
- Outputs can be summed
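In pytorch, the concatenation variant is the built-in behaviour of a bidirectional LSTM (sizes here are illustrative):

```python
import torch
import torch.nn as nn

# forward and backward LSTMs run independently; their outputs are
# concatenated along the feature axis
blstm = nn.LSTM(input_size=16, hidden_size=32,
                batch_first=True, bidirectional=True)

x = torch.randn(2, 100, 16)   # (batch, time, features)
out, _ = blstm(x)             # (2, 100, 64): 32 forward + 32 backward
```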

- Model
- Spectrogram sampling
- Test

- One LSTM
- One fully connected
- 6 million parameters

- Implement the model in pytorch
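A minimal sketch of such a model, assuming magnitude spectrogram frames in and a soft mask out (layer sizes are illustrative, not the exact tutorial ones):

```python
import torch
import torch.nn as nn

class Vanilla(nn.Module):
    """One LSTM over spectrogram frames + one fully connected layer."""
    def __init__(self, n_bins=2049, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_bins)

    def forward(self, spec):                # spec: (batch, time, n_bins)
        h, _ = self.lstm(spec)
        return torch.sigmoid(self.fc(h))    # soft mask in [0, 1]

model = Vanilla()
mask = model(torch.randn(1, 50, 2049))
```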

- Build a naive data sampler
- Start training the vanilla!

- Use the vanilla model for separation

- Input dimensionality reduction
- Fourier transforms parameters
- Standardization

- Compensate different features scales
- Classical pre-processing
- Either dataset stats or trainable
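The dataset-statistics variant can be sketched as: per-frequency mean and standard deviation estimated on training frames, then applied everywhere (toy data stands in for real spectrograms):

```python
import numpy as np

rng = np.random.default_rng(0)
train_frames = 10.0 * rng.random((1000, 2049))   # stand-in training frames

mean = train_frames.mean(axis=0)        # dataset statistics,
std = train_frames.std(axis=0) + 1e-8   # one value per frequency bin

def standardize(frames):
    return (frames - mean) / std

z = standardize(train_frames)           # zero mean, unit variance per bin
```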

- Input/output scaling always lead to better loss
- Trainable is better than fixed
- Good loss needs scaling, but no influence on SDR

- We use scaling all the time

- Network is always fed 360ms of context

- **Frames**: 92 ms = 4096 samples @ 44.1 kHz
- **Overlap**: 75%

- Long frames, large overlap

- Reduce number of parameters?
- From 6 million to 500k

- Fixed or trainable reduction
- Should we handcraft features?

- Reducing dimension
- Reduces model size
- Gives better performance

- Don't handcraft features, train them

- Network dimensions
- LSTM vs BLSTM
- Skip connection

- Context length not so important
- LSTM unable to model long-term musical contexts?

- Hidden size (model dimension) has a strong influence

- Large models are good

- Moderate impact on SDR
- Strong impact on SIR
- Improves separation much
- But not a huge impact as in loss

- Loss on spectrograms is **not** audio quality

- Loss is similar
- BLSTM are better metric-wise

- In practice: BLSTMs require the **same context length at train and test**
- $\Rightarrow$ chop the data into batches at test time!

- Should reduce vanishing gradient
- Much used in very deep nets

- Improves loss slightly
- No effect on overall metrics

- Better bass and drums

- Cost function
- Training tricks
- Data augmentation
- Sampling strategy

- Wide variety of cost functions $d\left(a,b\right)$
- squared loss $\left|a-b\right|^2$
- absolute loss $\left|a-b\right|$
- Kullback Leibler loss $a\log\frac{a}{b}-a+b$
- Itakura Saito loss $\frac{a}{b}-\log\frac{a}{b}-1$
- Cauchy, alpha divergence, ...

- Applied on $Y$, $Y^2$, any $Y^\alpha$, $\log Y$, ...
- Theoretical groundings for all
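The listed divergences side by side (the epsilon guards against division by zero and log of zero; all vanish when $a = b$):

```python
import numpy as np

EPS = 1e-10

def squared(a, b):
    return (a - b) ** 2

def absolute(a, b):
    return np.abs(a - b)

def kullback_leibler(a, b):
    return a * np.log((a + EPS) / (b + EPS)) - a + b

def itakura_saito(a, b):
    r = (a + EPS) / (b + EPS)
    return r - np.log(r) - 1

a = np.array([1.0, 2.0, 3.0])
losses = {f.__name__: float(f(a, a).sum())   # each is 0 when a == b
          for f in (squared, absolute, kullback_leibler, itakura_saito)}
```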

- We check for $\left[0.01, 0.001\right]$

- High learning rates don't work

- Keep default parameters for optim

- batchnorm fails in our case
- train batch: many songs
- test batch: one song

- layernorm forces matching
- test batch: normalized as in training

- batchnorm behaves wildly, avoid it

- Improves isolation: better SIR
- Adds distortion: worse SAR

- Parts of the net randomly set to 0
- No unit should be critical: *regularization*
- Probabilistic interpretation

- Regularization makes things worse

- Non unique tracks in batch
- Not all samples per epoch

- Unique tracks in batch
- Not all samples per epoch

- Non unique tracks in batch
- All samples per epoch

- Unique tracks in batch
- All samples per epoch

- Implement the 4 strategies with pescador
- Apply them on spectrograms
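The four strategies combine two independent choices. A plain-Python sketch of the "unique vs non-unique tracks per batch" axis (the tutorial uses pescador streamers for this; track contents here are toy data):

```python
import random

tracks = {f"track{i}": list(range(10)) for i in range(5)}  # toy excerpts

def batch_with_replacement(batch_size):
    """Non-unique tracks per batch: the same track may appear twice."""
    names = list(tracks)
    return [random.choice(tracks[random.choice(names)])
            for _ in range(batch_size)]

def batch_unique_tracks(batch_size):
    """Unique tracks per batch: sample track names without replacement."""
    names = random.sample(list(tracks), batch_size)
    return [random.choice(tracks[n]) for n in names]

random.seed(0)
b1 = batch_with_replacement(4)
b2 = batch_unique_tracks(4)
```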

- Unique tracks per batch is slower
- All samples per epoch is faster

- Basic augmentation: overlap samples within each track
- There are more advanced strategies

- Basic augmentation helps a bit (0.5dB)

- Not shown: new tracks are better!

- Representation
- Mono filter tricks
- Multichannel Gaussian model
- The multichannel Wiener filter
- Testing: evaluation

- The first source of poor results: inverse STFT!
- Verify perfect reconstruction
- Better: use established libraries, like `librosa`, `scipy`, ...

- If the mask is 0.8... just put 1
- If the mask is 0.2... just put 0
- Cheap interference reduction
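This hardening of the mask can be sketched as follows (the 0.2/0.8 thresholds are the illustrative values above):

```python
import numpy as np

def harden(mask, low=0.2, high=0.8):
    """Push confident mask values to 0/1: cheap interference reduction."""
    out = mask.copy()
    out[mask >= high] = 1.0   # e.g. 0.9 pushed to 1
    out[mask <= low] = 0.0    # e.g. 0.1 pushed to 0
    return out

hardened = harden(np.array([0.1, 0.5, 0.9]))
```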

- Sources and mixtures are jointly Gaussian
- We observe the mix, what can we say about the sources?
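Under that Gaussian model the answer is the Wiener filter: the posterior mean of each source is its share of the variance times the mixture. A single-channel sketch with toy data (the multichannel version replaces per-bin variances by spatial covariance matrices; the `norbert` library implements the full filter):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.random((2, 2049, 10)) + 1e-10     # power spectrograms of 2 sources
mix = (rng.standard_normal((2049, 10))    # complex mixture STFT
       + 1j * rng.standard_normal((2049, 10)))

gains = v / v.sum(axis=0, keepdims=True)  # Wiener gains: masks summing to 1
estimates = gains * mix                   # posterior mean of each source
```

Because the gains sum to one, the source estimates add back up to the mixture exactly: the filter is conservative.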

- Iterations improve SIR $\Rightarrow$ greatly reduces interferences
- Iterations worsen SAR $\Rightarrow$ introduces distortion
- logit has good SIR $\Rightarrow$ cheap interference reduction

- Resulting baseline
- What was kept out
- What is promising
- Ending remarks

- Exotic representations

- Alternative structures
- The convolutional neural network (CNN)
- The U-NET
- The MM-densenet
- Deep clustering

- Generative approaches
- Generative adversarial nets
- (Variational) auto encoders

- Deep clustering

- Full grid search over parameters (fund us!)
- Advanced data augmentation (naive=+0.3dB SDR)

- More data
- Even more data
- Did we mention more data?

- Structures with more parameters work better...
- Better signal processing helps

- We got 3dB SDR improvement with no publishable contribution
$\Rightarrow$ evaluating the real impact of a contribution is difficult

- Convergence of signal processing, probability theory and DL
- Learning with limited amount of data
- Model long term dependency
- Representation learning for sound and music
- Exploiting knowledge domain, user interaction
- Unsupervised learning?

- References and Software tools: sigsep.github.io
- SiSEC 2018 Website: sisec18.unmix.app

Deep Learning for Music Unmixing