โœ๏ธ
The Daily Ink
  • ๐Ÿ™‡โ€โ™‚๏ธAI research is getting crazy...
  • Paper Summaries
    • ๐Ÿ‘จโ€โš•๏ธ[2/25/23] Anthropic makes AI that teaches itself ethics
    • ๐Ÿช„[2/22/23] Models can magically learn new skills at scale
    • *๏ธ[2/21/23] Discovering a better optimization algorithm with evolution
    • ๐Ÿ”ฎ[2/17/23] Talking to models requires special prompts that help them think sequentially
    • ๐Ÿน[2/15/23] Teaching LLMs to use tools and not suck at math
    • โž—[2/13/23] English is just math in prettier clothing
    • ๐Ÿ‘จโ€๐Ÿ’ป[2/8/23] The secret to good writing is editing
    • ๐Ÿ’ข[2/6/23] Solving context length constraints by distillation
    • ๐Ÿ”ญ[2/3/23] A Large Language Model for SCIENCE
  • ๐ŸŽ‡[2/1/23] Optimal parallelism in ML training is possible, says ALPA
  • ๐ŸŽผ[1/31/23] Google makes a language model for music
  • ๐Ÿš’[1/27/23] Google's LaMDA model is too convincing, and a researcher is fired
  • ๐Ÿค”[1/25/23] Teaching computers to think in abstractions
  • ๐ŸŽฉ[1/23/23] The secret sauce behind ChatGPT
  • ๐Ÿ“ธ[1/20/23] FlashAttention challenges ML researchers to think about systems-level improvements
  • โœ‚๏ธ[1/18/23] Make models smarter not larger, with data pruning
  • ๐Ÿ™…[1/16/23] DeepMind attempts to make AI that can do anything
  • ๐Ÿฃ[1/8/23] Chinchilla is the strongest animal/model in the world
  • โฌ‡๏ธ[1/5/23] Gradient descent-ing gradient descent
  • ๐Ÿฅ˜[1/3/23] Cramming: Training a Language Model on a Single GPU in One Day
  • ๐Ÿ—ƒ๏ธ[1/1/23] A Neural Corpus Indexer for Document Retrieval
  • ๐Ÿ‘‹[12/27/22] Can We Teach Machines to Act Like Humans? (And Should We?)
Powered by GitBook
On this page
  • Introduction and Motivation
  • Development Details
  • Evaluation
  • Limitations and Future Work
  • In Summary
  1. Paper Summaries

[2/22/23] Models can magically learn new skills at scale

The mysterious emergent abilities of LLMs

Have you ever wondered how GPT-3 can do such wonderful things? How ChatGPT is somehow better? In fact, let's run some tests real fast:

The classic GPT-3 writes poems like I did in high school, while ChatGPT is much better, and seems to understand the concept of rhymes!

Those amongst you who've been following this series for a while and read my ChatGPT breakdown would call me out: this isn't about scale, it's because ChatGPT uses reinforcement learning! And you'd be right!

So let's take another example: arithmetic! We compare a small GPT-3 model, Ada, to a larger model, Babbage, and to OpenAI's largest, Davinci. None of these are trained with RL, and yet:

Ada's spitting out nonsense. Babbage gets closer, but is still not quite there. And then, BAM! Davinci gets it in one go. All of these are trained on the same data! What's going on here?
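If you want to poke at this yourself, here's a minimal sketch of the comparison above, assuming the legacy OpenAI Python SDK (pre-1.0). The base-model names (ada, babbage, davinci) and the Completion endpoint reflect the API as it existed in early 2023 and have since been deprecated, so treat this as illustrative rather than a recipe:

```python
# Compare base GPT-3 models of increasing size on a simple
# arithmetic prompt, using the legacy (pre-1.0) openai SDK.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = "Q: What is 123 + 456?\nA:"

for model in ["ada", "babbage", "davinci"]:
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=8,
        temperature=0,  # greedy decoding, so results are repeatable
    )
    print(f"{model:>8}: {response.choices[0].text.strip()}")
```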

Today, we discuss "Emergent Abilities of Large Language Models".

Introduction and Motivation

Core Claim: As models are scaled up, the downstream performance on tasks does not grow continuously but has a sudden jump. This means that we cannot predict what larger models will be able to do by extrapolating from smaller models.

Development Details

What is scale? ML models are parameterized by three properties of their training:

  1. Model size: Number of parameters in the model

  2. Data size: Number of tokens in the training data

  3. Compute size: Number of FLOPs (floating-point operations) used to train the model

These metrics are not the same, but they are heavily correlated: bigger models are trained with more compute, for example. The paper defines scale as some combination of these.
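Since the paper uses training FLOPs as its main x-axis, it helps to have a feel for the numbers. A common back-of-the-envelope rule from the scaling-laws literature (Kaplan et al., not something this paper defines) ties the three axes together as C ≈ 6ND:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via the C ~= 6 * N * D rule of
    thumb: ~2 FLOPs per parameter per token for the forward pass,
    ~4 for the backward pass."""
    return 6 * n_params * n_tokens

# Example: 175B parameters x 300B tokens (roughly GPT-3's reported
# setup) lands near 3e23 FLOPs, the same order of magnitude as the
# emergence thresholds discussed below.
print(f"{train_flops(175e9, 300e9):.2e}")  # -> 3.15e+23
```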

Evaluation

The core of this paper is the data collected, so let's view it!

Across many tasks, we notice that performance hovers at random-chance levels until we cross a threshold of model scale (here, training FLOPs). Then performance jumps to, on average, ~20% above random (statistically significant!). Around 5 × 10^23 FLOPs seems to be the magic number for many of these thresholds.

The Word in Context (bottom-right) benchmark is very interesting, however. GPT-3 and Chinchilla (two SOTA models) fail to rise above random even past the magic threshold. While this might cast doubt on the power of scaling, above-random performance eventually emerged when PaLM was scaled to 2.5 × 10^24 FLOPs (540B parameters), much larger than GPT-3 and Chinchilla.

Prompting and finetuning techniques like scratchpad, instruction finetuning, and chain-of-thought reasoning show similar patterns: smaller models don't benefit from them at all until a threshold scale, after which they demonstrate massive returns.
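To make chain-of-thought concrete: instead of asking for the answer directly, the prompt includes a worked example with its intermediate reasoning spelled out. A minimal sketch, using the canonical exemplar from the chain-of-thought paper (Wei et al., 2022):

```python
# Few-shot chain-of-thought prompt: the exemplar demonstrates its
# reasoning step by step, nudging the model to do the same before
# committing to an answer.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A:"""
```

Below the threshold, models tend to ignore the demonstrated reasoning; above it, they follow the pattern and accuracy jumps.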

Some might argue that smaller models do show these abilities and that our metrics simply fail to capture their gradual progress. The simple counter is that classification tasks, where scoring is inherently all or nothing (the label is either right or wrong), also show emergent jumps rather than smooth improvement.
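To see what "all or nothing" means in practice, here's a toy comparison (these helpers are mine, not the paper's code): exact match gives zero credit for an almost-right answer, while a smoother metric would reveal gradual progress if it existed:

```python
def exact_match(pred: str, target: str) -> float:
    """All-or-nothing scoring, as used for most emergent-ability tasks."""
    return float(pred.strip() == target.strip())

def per_char_credit(pred: str, target: str) -> float:
    """A smoother alternative: the fraction of aligned characters
    that match, so an almost-right answer earns partial credit."""
    matches = sum(p == t for p, t in zip(pred, target))
    return matches / max(len(pred), len(target))

# An almost-right answer scores 0.0 on exact match but 0.75 here;
# this is the sense in which harsh metrics could hide gradual gains.
print(exact_match("1234", "1239"))      # 0.0
print(per_char_credit("1234", "1239"))  # 0.75
```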

Limitations and Future Work

  1. Further scaling: If scale is magical, the obvious next step is to scale it up even more to see if and when we stop seeing emergent abilities. This is challenging, however, since current models already train on a good chunk of the internet. This means bigger models as well as more data are likely in our future.

  2. There are some tasks at which these models still don't perform any better than random, often problems involving abstract reasoning. Further research can investigate why these abilities haven't emerged yet.

  3. An investigation of cross-entropy loss, generative task metrics (like BLEU or ROUGE) and type of tasks was not enough to develop a theory for why emergence happens. Understanding emergence is an important direction of future work.

A Big Caveat: Hello and welcome back to "What Can Arushi Find in the Appendix?" Today, we turn our attention all the way to Appendix F, where the authors highlight many tasks on which PaLM 62B is emergent but the bigger models GPT-3 (175B) and LaMDA (137B) are not. I'm not sure why this is in the appendix, since it is a crucial caveat to the claimed correlation between scale and emergence.

In Summary

I try not to be negative about papers: research is fun and important and should be encouraged! However, I firmly feel that this is half a paper. It provides a lot of data, but makes no claim about what that data means. It suggests a correlation between scale and emergence, but buries counterexamples and does not try to explain exceptions. It suggests metrics to define emergence (like cross-entropy loss), but moves those to the appendix and admits that the correlation is not statistically significant.

Despite my complaints, it is still extremely valuable to highlight that these models have "emergent abilities". Understanding how emergence occurs will be crucial to improving model performance once we hit the metaphorical cap on training FLOPs. In the meantime, if you've ever wondered why companies like Google and OpenAI are willing to throw this much money at bigger and bigger models, now you know.

Until next time!
