
[2/1/23] Optimal parallelism in ML training is possible, says ALPA

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning


You might have heard of data parallelism in model training before. The core idea: keep multiple copies of your model, feed each copy a different slice of the data, average the gradients, and apply the same update to every copy's weights.
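As a toy illustration (mine, not from the paper), here is that loop simulated in plain NumPy with a made-up linear model, four replicas, and an in-process stand-in for the all-reduce step:

```python
# Toy simulation of data parallelism on one process (illustrative only).
# Each "replica" holds identical weights, computes gradients on its own data
# shard, then the gradients are averaged and every replica applies the same update.
import numpy as np

def grad_linear_mse(w, X, y):
    """Gradient of mean squared error for a linear model y_hat = X @ w."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1024, 16)), rng.normal(size=1024)
w = np.zeros(16)                                # identical weights on every replica
shards = np.array_split(np.arange(1024), 4)     # 4 replicas, 4 data shards

for step in range(100):
    # each replica computes a gradient on its own shard (in parallel in reality)
    grads = [grad_linear_mse(w, X[idx], y[idx]) for idx in shards]
    g = np.mean(grads, axis=0)                  # "all-reduce": average across replicas
    w -= 0.1 * g                                # every replica applies the same update
```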

That is not what we're talking about today. We're talking about model parallelism. Most ML models that you've heard about (ChatGPT, Claude, Cohere, Adept, LaMDA and so on...) are not trained on one machine. GPT-3 (175B params) famously needs multiple GPUs just to fit the model's weights in memory.
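A rough back-of-the-envelope estimate (my numbers, not the paper's) shows why: the weights alone in 16-bit precision are around 350 GB, far beyond the 40-80 GB of memory on a single modern GPU, before gradients and optimizer state are even counted.

```python
# Back-of-envelope sizing (assumptions mine: fp16 weights/gradients, Adam with
# an fp32 master copy and two fp32 moment buffers).
params = 175e9
weights_gb   = params * 2 / 1e9             # fp16 weights             ~350 GB
grads_gb     = params * 2 / 1e9             # fp16 gradients           ~350 GB
optimizer_gb = params * (4 + 4 + 4) / 1e9   # fp32 copy + Adam moments ~2100 GB
print(weights_gb, grads_gb, optimizer_gb)   # no single GPU comes close
```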

Introduction and Motivation

Since the model needs to be divided into many parts spread over many machines, what is the best way to split the model?

There are two core ways to do it (a toy sketch follows the list):

  1. Inter-operator parallelism: If there are two operations A and B that are part of the computation graph, A goes on one machine and B goes on another.

  2. Intra-operator parallelism: Given an operation A, half of it could be computed on one machine and the other half on another.
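To make the two flavors concrete, here is a toy NumPy sketch (my illustration, not Alpa's code). Inter-operator parallelism places whole operations on different machines and ships activations between them; intra-operator parallelism splits a single matrix multiply across machines and stitches the pieces back together.

```python
# Toy illustration of the two parallelism flavors, using plain NumPy arrays
# to stand in for per-device shards (shapes and values are made up).
import numpy as np

x = np.random.randn(8, 64)     # a batch of activations
W_a = np.random.randn(64, 64)  # weights of operation A
W_b = np.random.randn(64, 32)  # weights of operation B

# Inter-operator parallelism: A runs on "device 0", B runs on "device 1";
# only A's output activations need to be sent between the two devices.
a_out = x @ W_a                # device 0
b_out = a_out @ W_b            # device 1 (after receiving a_out)

# Intra-operator parallelism: split A's weight matrix column-wise so each
# device computes half of the output, then concatenate the halves.
a_left  = x @ W_a[:, :32]      # device 0 holds the first 32 output columns
a_right = x @ W_a[:, 32:]      # device 1 holds the last 32 output columns
a_out_sharded = np.concatenate([a_left, a_right], axis=1)

assert np.allclose(a_out, a_out_sharded)
```

In a real system that concatenation would be a collective communication step (an all-gather), and the whole point of Alpa is deciding which of these two patterns to apply where.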

Goal: Given a computation graph that represents a neural network, what combination of intra- and inter-operator parallelism gives us the fastest training setup?

Core Insight: Machines within a cluster communicate quickly; machines across clusters communicate slowly. Intra-operator parallelism has to communicate a lot more than inter-operator parallelism. Therefore, use intra-operator parallelism within a cluster and inter-operator parallelism across clusters.
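A quick back-of-the-envelope comparison (my numbers, purely illustrative) of why the communication volumes differ so much: inter-operator parallelism only sends activations across a stage boundary, while intra-operator parallelism needs a collective (e.g., an all-reduce of activations) for roughly every sharded layer.

```python
# Illustrative byte counts (assumptions mine): batch 1024, hidden size 8192,
# 16 sharded layers per group, fp16 activations (2 bytes each).
b, h, layers, bytes_per = 1024, 8192, 16, 2

# Inter-operator: activations cross the stage boundary once forward, once backward.
inter_mb = 2 * b * h * bytes_per / 1e6

# Intra-operator: roughly one all-reduce of the activations per sharded layer,
# in both the forward and backward pass.
intra_mb = 2 * layers * b * h * bytes_per / 1e6

print(f"inter-op boundary traffic  : ~{inter_mb:.0f} MB per step")
print(f"intra-op collective traffic: ~{intra_mb:.0f} MB per step")
```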

Development Details

At a high level, Alpa does the following:

  1. First, Alpa uses dynamic programming to split the computation graph across clusters, putting different sets of operations in different clusters (inter-operator parallelism).

  2. Next, Alpa uses integer linear programming to optimize the intra-operator parallelism within each cluster, considering only the subgraph assigned to it in the first step.

These are co-optimized: the DP algorithm searches for the division whose per-cluster ILP solutions give the lowest overall cost, i.e., the fastest training step across all the clusters.
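Here is a drastically simplified sketch (mine, not Alpa's actual algorithm, which also assigns device meshes and uses a richer pipeline cost model) of that nested structure: an outer dynamic program cuts a chain of operators into contiguous stages, one per cluster, asks an inner solver standing in for the ILP what each candidate stage would cost, and minimizes the bottleneck stage.

```python
# Drastically simplified sketch of the two-level structure: an outer DP assigns
# contiguous slices of an operator chain to clusters, while an inner "solver"
# stands in for the per-stage ILP that picks sharding within a cluster.
from functools import lru_cache

op_costs = [4.0, 1.0, 3.0, 2.0, 6.0, 2.0]   # made-up per-operator compute costs
num_clusters = 3

def intra_op_cost(i, j):
    """Stand-in for the ILP: cost of running operators i..j-1 inside one cluster.
    The real ILP chooses a sharding for every operator; here we just sum costs."""
    return sum(op_costs[i:j])

@lru_cache(maxsize=None)
def best_split(i, clusters_left):
    """Cut operators i..end into `clusters_left` contiguous stages, minimizing
    the cost of the bottleneck (slowest) stage."""
    n = len(op_costs)
    if clusters_left == 1:
        return intra_op_cost(i, n), ((i, n),)
    best_cost, best_stages = float("inf"), None
    for j in range(i + 1, n - clusters_left + 2):   # leave >= 1 op per later stage
        rest_cost, rest_stages = best_split(j, clusters_left - 1)
        cost = max(intra_op_cost(i, j), rest_cost)
        if cost < best_cost:
            best_cost, best_stages = cost, ((i, j),) + rest_stages
    return best_cost, best_stages

print(best_split(0, num_clusters))   # (bottleneck cost, chosen stage boundaries)
```

In the actual paper, the outer DP optimizes full pipeline latency rather than just the bottleneck stage, and the inner ILP accounts for communication and resharding costs, not just compute.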

Evaluation

Their testbed consisted of 8 nodes and 64 GPUs. Ideally, throughput should keep growing as more GPUs are used for the model. When tested against state-of-the-art, expert-written parallelism systems, Alpa is competitive and on occasion better.

They also show that casting intra-operator parallelism as an ILP problem was the right call: it scales better than previous techniques.

Limitations and Future Work

  1. I would have liked to see more testing against models other than GPT at larger scales. While running out of memory is an unrecoverable error for the baselines, comparisons where the baseline simply crashes don't make for the fairest testing mechanism.

  2. In the ILP formulation for intra-operator parallelism, a lot of smaller operators were discarded to simplify the problem to the point that the ILP calculation is tractable (it completes within an hour). There is room to improve accuracy here by allowing variable granularity in the ILP.

  3. Currently, the paper assumes that machines within a cluster communicate quickly and cross-cluster communication is slow. While this view works for them, real networking is a lot more complicated than this simple binary. Future work could profile the network and use that information to build a more realistically optimal system.

  4. In practice, the set of machines that a model is trained on is rarely constant, and the networking between them is even more variable. There is work to be done on evaluating automatic parallelism online, instead of keeping this step offline.

In summary: I think Alpa is a big step in the right direction. Parallelism doesn't need to be heuristic-driven or expert-written. Casting the parallelism problem as a dynamic programming problem is quite inspired. I think this is a great first step, and I see a lot of potential for exciting future work. There is still a ways to go before such a system yields a truly optimal parallelism strategy, but I believe we are well on our way there.

Thanks
