
[1/8/23] Chinchilla is the strongest animal/model in the world

Training Compute-Optimal Large Language Models | NeurIPS 2022 Outstanding Paper, DeepMind


If you’ve ever heard of the “Chinchilla” paper in online AI conversations, look no further — this is what they’re talking about.

Introduction and Motivation

Goal: In production machine learning, we typically have a fixed compute budget, measured in FLOPs (floating-point operations, a unit of compute). Given a fixed number of FLOPs, how does one trade off parameter count (model size) against token count (data size)?
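
A common approximation (used in the paper's own analysis and widely in the scaling-law literature) ties these quantities together: training compute C ≈ 6 · N · D FLOPs for a model with N parameters trained on D tokens. Here is a minimal sketch, treating the 6·N·D rule purely as a ballpark estimate:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Ballpark training compute via the common C ≈ 6 * N * D approximation."""
    return 6.0 * n_params * n_tokens

# Example: a 1B-parameter model trained on 20B tokens
print(f"{training_flops(1e9, 20e9):.1e} FLOPs")  # ~1.2e+20
```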

Previous research: Kaplan et al. (OpenAI, 2020) [2] suggested scaling laws that favored growing parameter count much faster than token count. This is a big part of why model sizes exploded over the past ~3 years.

Core Insight: Empirical evidence in this paper suggests that parameter count and token count should be scaled in equal proportion: for every doubling of model size, the number of training tokens should also double.

Implication: Most modern large language models are significantly under-trained relative to their size.
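
Combining the equal-scaling insight with the C ≈ 6·N·D approximation gives a simple recipe for splitting a compute budget. The ~20 tokens-per-parameter ratio used below is a heuristic commonly read off the paper's results (Chinchilla itself is 70B parameters trained on 1.4T tokens), not an exact prescription from the paper; this is a sketch under those assumptions:

```python
import math

def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a compute budget between model size and data size, assuming
    C ≈ 6 * N * D and D ≈ tokens_per_param * N (Chinchilla-style heuristic)."""
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A Chinchilla-sized budget (~5.9e23 FLOPs) recovers roughly 70B params / 1.4T tokens
n, d = compute_optimal_allocation(5.9e23)
print(f"params ≈ {n:.1e}, tokens ≈ {d:.1e}")
```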

The Data That Led to This Conclusion

The authors trained over 400 language models (ranging from 70M to over 16B parameters, on 5B to 500B tokens); here’s what the data says:

First, they demonstrate that:

  • The compute-optimal parameter count (model size) grows as a power law of the compute budget (roughly ∝ C^0.5), which appears linear on a log-log plot.

  • The compute-optimal token count grows as a power law of the compute budget with roughly the same exponent (≈ 0.5).

Here, "compute-optimal" means the allocation of parameters and tokens that minimizes training loss for a given compute budget.

The paper’s iso-loss contours demonstrate a trade-off between model size and training FLOPs: a smaller model (say, 100M parameters), given enough training FLOPs (and tokens), can match the loss of a larger model (say, 1B parameters) trained on less data.
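
One way to make this trade-off concrete is the paper's parametric loss fit, L(N, D) = E + A/N^α + B/D^β. The constants below are the approximate values reported in the paper (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); treat the snippet as an illustrative sketch, not an exact reproduction of the fit:

```python
def approx_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss form from the paper, L(N, D) = E + A/N**alpha + B/D**beta,
    with approximate fitted constants."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Trading size for data: a smaller model on more tokens lands near the same loss contour
print(round(approx_loss(1e9, 20e9), 3))    # 1B params, 20B tokens   -> ~2.58
print(round(approx_loss(400e6, 60e9), 3))  # 400M params, 60B tokens -> ~2.57
```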

Finally, they train Chinchilla, a 70B-parameter model that is a smaller but far more heavily trained counterpart to the 280B-parameter Gopher (roughly the same compute budget, spent on 1.4T tokens instead of 300B). They demonstrate that, trained at this compute-optimal point, Chinchilla outperforms Gopher and comparable models, including on MMLU, a standard LLM benchmark.
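
As a quick sanity check on the "same compute, different allocation" comparison, again using the rough C ≈ 6·N·D approximation (so these are ballpark figures, not the paper's exact accounting):

```python
# Gopher: 280B params on 300B tokens; Chinchilla: 70B params on 1.4T tokens
gopher_flops = 6 * 280e9 * 300e9      # ~5.0e23 FLOPs
chinchilla_flops = 6 * 70e9 * 1.4e12  # ~5.9e23 FLOPs
print(f"Gopher:     {gopher_flops:.1e} FLOPs")
print(f"Chinchilla: {chinchilla_flops:.1e} FLOPs")
# Comparable budgets; Chinchilla has 4x fewer parameters trained on ~4.7x more tokens.
```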

Future work: The paper did not consider models trained for multiple epochs. The recent success of models like Galactica [3] suggests there is value in multi-epoch training.

References

[1] Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022): https://arxiv.org/abs/2203.15556

[2] Kaplan et al., "Scaling Laws for Neural Language Models" (2020): https://arxiv.org/abs/2001.08361

[3] Taylor et al., "Galactica: A Large Language Model for Science" (2022): https://arxiv.org/abs/2211.09085