
[1/31/23] Google makes a language model for music

Text-to-music is finally here, and perhaps it even works.



Late last week, Google dropped a model that lit up Twitter with claims that they’d “solved music generation”. A look at their demos certainly seems to suggest that. Seriously, go take a look. It’s wild.

How did they pull this off? Let’s dive into the paper to understand.

Introduction and Motivation

There have been previous attempts to create text-to-audio models by casting audio synthesis as a language modelling task. This approach allows models like AudioLM to achieve high fidelity and long-term coherence.
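To make “casting audio synthesis as language modelling” concrete, here’s a minimal, hypothetical sketch (not AudioLM’s actual code): once audio has been quantized into discrete codebook tokens, generation becomes plain next-token prediction with a causal Transformer. The vocabulary size and model shape below are made up.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: audio quantized into a discrete codebook becomes a
# token sequence, and "generating audio" is ordinary next-token prediction.
VOCAB_SIZE = 1024   # assumed codebook size
EMBED_DIM = 256

class TinyAudioLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.encoder(self.embed(tokens), mask=causal_mask)
        return self.head(hidden)                      # logits for the next audio token

model = TinyAudioLM()
tokens = torch.randint(0, VOCAB_SIZE, (2, 128))       # stand-in audio token sequences
logits = model(tokens)
# Same objective as a text LM: predict token t+1 from tokens 1..t.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
```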

Constraints: It is hard to map audio to text. Summarizing music is hard, and since music is a temporal medium, what is a good descriptor for minute 1 might not be good for minute 8. There simply isn’t enough paired training data.

How can we train on a lot of unlabelled music if we only have a little bit of labelled music?

Solution: Project music and its corresponding text description to nearby representations in a shared embedding space. During training, use the audio embeddings; during inference, use the text embeddings. That way, part of the training can happen using only audio.
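Here’s a rough sketch of that idea in the spirit of CLIP-style contrastive training. The embeddings, shapes, and names below are made-up stand-ins, not the paper’s code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Pull matching (audio, text) pairs together in the shared space,
    push mismatched pairs apart."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))             # the i-th audio matches the i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Stand-ins for encoder outputs on a batch of paired (audio, caption) examples.
audio_emb = torch.randn(8, 128)
text_emb = torch.randn(8, 128)
loss = contrastive_loss(audio_emb, text_emb)
```

Because both modalities land in the same space, the generator can be conditioned on audio embeddings during training (plentiful, unlabelled music) and on text embeddings at inference time.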

Development Details

First, the model relies on three sets of pre-trained embeddings: a music BERT, a text BERT, and MuLan, which maps audio and text into a joint embedding space.

During training, the jointly trained text and audio embeddings from MuLan are used. Then, there are two training phases:

  1. Semantic Modelling: Learn a mapping from audio tokens to the semantic tokens.

  2. Acoustic Modelling: Learn a mapping from audio and semantic tokens to acoustic tokens.

Some clarifications: Think of the semantic tokens as an intermediate representation that sits between the text and the audio, capturing the long-term structure of the music (things like melody and rhythm). The MuLan audio tokens summarize a clip at a high level, while the acoustic tokens are the low-level representation of sound that can be decoded back into an actual waveform.

Inference is much simpler: the input text is converted to a text embedding and then to semantic tokens, and audio is generated conditioned on both (mirroring the second phase mentioned above).
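Putting the pieces together, here’s a purely illustrative data-flow sketch of that pipeline. The three functions below just return random tokens; they exist only to show which tokens condition which stage, and none of the names come from the paper:

```python
import torch

def mulan_tokens_from_text(prompt: str) -> torch.Tensor:
    # At inference the conditioning comes from the text prompt embedded by the
    # shared text/audio model; during training it comes from the audio instead.
    return torch.randint(0, 1024, (12,))

def semantic_stage(mulan_tokens: torch.Tensor) -> torch.Tensor:
    # Phase 1: generate semantic tokens conditioned on the MuLan tokens.
    return torch.randint(0, 1024, (250,))

def acoustic_stage(mulan_tokens: torch.Tensor, semantic_tokens: torch.Tensor) -> torch.Tensor:
    # Phase 2: generate acoustic tokens conditioned on both MuLan and semantic tokens.
    return torch.randint(0, 1024, (750,))

def generate(prompt: str) -> torch.Tensor:
    m = mulan_tokens_from_text(prompt)
    s = semantic_stage(m)
    a = acoustic_stage(m, s)
    # A neural audio codec decoder would turn the acoustic tokens back into a
    # waveform; that step is omitted here.
    return a

acoustic_tokens = generate("an upbeat funk groove with a slap bass line")
```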

Don’t get lost in the technical jargon. At a high level, what they’re doing is:

  1. Create a vector space that text and sound both map to. Use your limited text-to-audio dataset to create this.

  2. Train on sound tokens only, using the previously created space to embed the sounds, so that the text-to-audio relationships learned in step 1 carry over.

    1. For example, unlabelled “funk” music would hopefully be embedded close to the labelled funk music, and you can leverage this.

  3. For inference, cast whatever text you want to embeddings, and find the most probable acoustic tokens.

Evaluation

Here are their results against the other baselines:

Some explanation is warranted: FAD (Fréchet Audio Distance) serves as a proxy for human-perceived audio quality, with lower being better. The KL divergence measures how far the generated music strays from a provided reference, so lower again means more similar. “Wins” counts how often human raters preferred a model’s output in head-to-head comparisons.
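For the curious, FAD is the Fréchet distance between Gaussians fit to embeddings of reference audio and of generated audio (the embeddings come from a pretrained audio classifier). A rough sketch of the computation, assuming the embeddings have already been extracted:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two sets of audio embeddings.

    real_emb, gen_emb: arrays of shape (num_clips, embedding_dim), e.g. from a
    pretrained audio classifier. Lower is better.
    """
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))

# Stand-in embeddings for reference and generated clips.
real_emb = np.random.randn(200, 128)
gen_emb = np.random.randn(200, 128)
print(frechet_audio_distance(real_emb, gen_emb))
```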

While not a landslide, MusicLM does improve on all metrics mentioned, and has the most qualitative wins.

Finally, they demonstrate that the model’s outputs bear little resemblance to the closest clips in the training data, suggesting that it isn’t just copying the training set, but learning patterns and creating new music.

Limitations and Future Work

  1. The model misunderstands negations and does not adhere to precise temporal ordering described in the text.

  2. The qualitative evaluations provided in the paper that attempt to demonstrate the quality of the model are quite subjective, and “wins” is a very coarse metric. Further evaluation is needed.

  3. Many of the quantitative metrics are also biased. MCC, for example, relies on the MuLan model, which makes it unsurprisingly favorable to MusicLM, itself a derivative of MuLan (see the sketch after this list).

  4. The vocal generation is the model’s weakest link, and perhaps generating lyrics alongside the music would assist the model in generating music with specific words. There is exciting future work in this domain.
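To make point 3 concrete, an MCC-style score boils down to cosine similarity, in the shared MuLan space, between the prompt’s text embedding and the generated clip’s audio embedding. A rough sketch with random stand-ins for the MuLan tower outputs:

```python
import torch
import torch.nn.functional as F

def mcc_score(text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """Average cosine similarity between prompt embeddings and the embeddings
    of the audio generated from those prompts."""
    return F.cosine_similarity(text_emb, audio_emb, dim=-1).mean()

# Stand-ins for the MuLan text-tower and audio-tower outputs.
text_emb = torch.randn(4, 128)
audio_emb = torch.randn(4, 128)
print(mcc_score(text_emb, audio_emb))
# Since the same embedding space also drives MusicLM's conditioning, a high
# score partly reflects agreement with the model's own training signal.
```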

To summarize – MusicLM is definitely a state-of-the-art model by Google, and the results speak for themselves. My intuition is that a simpler model, trained with more data, would outperform MusicLM’s complex training strategy. I also suspect that a temporal sequence like music is a better fit for a state space model or a diffusion model.

I’ll definitely be keeping an eye on this industry in the future — the potential is massive!
