[2/13/23] English is just math in prettier clothing

Word2Vec and other ways of casting English to vector form


Here's a simple premise: neural networks operate on floating-point numbers. English is not represented as floating-point numbers (duh). So how do we tackle NLP problems and teach neural nets to output English, or any other language?

We need some way to relate words to vectors. We could just count: assign "aardvark" the number 0 and go upward from there. But an integer ID captures nothing about the relationship between two words, like "man" and "woman". Ideally, we want that relationship reflected in the numbers themselves.

Enter: embeddings. The paper we discuss today is old โ€” all the way back from 2013. It introduced โ€œWord2Vecโ€, a technique to convert words to vectors while retaining their semantic meaning.
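
To see what an embedding buys you over an integer ID, here is a minimal sketch with made-up 3-dimensional vectors (real word2vec embeddings are learned from large corpora and typically have hundreds of dimensions): similarity between vectors is meaningful in a way that distance between arbitrary IDs never is.

```python
import numpy as np

# Toy embedding table: each word maps to a dense vector.
# These vectors are invented for illustration only.
embeddings = {
    "man":      np.array([0.9, 0.1, 0.0]),
    "woman":    np.array([0.8, 0.2, 0.1]),
    "aardvark": np.array([0.0, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    """Similarity of direction, ignoring magnitude: 1.0 means identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["man"], embeddings["woman"]))    # high (~0.98)
print(cosine_similarity(embeddings["man"], embeddings["aardvark"])) # low  (~0.09)
```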

What a classic. Letโ€™s get into it!

Introduction and Motivation

Goal: Capture not only which words are similar, but multiple degrees of similarity (semantic and syntactic), while scaling to vocabularies with millions of words and training sets of billions of words.

Given linear relationships between vectors, like the following:

$$\text{vector("king")} - \text{vector("man")} + \text{vector("woman")}$$

If the embeddings are good, the vector closest to the result of this operation should be vector("queen"). The paper develops new model architectures that try to maximize the accuracy of exactly these kinds of vector operations.
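
Here is a minimal sketch of that arithmetic, using two invented dimensions (roughly "royalty" and "gender") rather than learned embeddings, just to show the mechanics. The paper scores analogies with cosine similarity; plain Euclidean distance works for this toy example.

```python
import numpy as np

# Invented 2-D embeddings (dimensions ~ "royalty", "gender"); real
# word2vec vectors are learned, not hand-written like this.
vec = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "apple":  np.array([-1.0, 0.0]),
    "banana": np.array([-1.0, 0.5]),
}

target = vec["king"] - vec["man"] + vec["woman"]   # -> [1.0, -1.0]

# Find the nearest word to the result, excluding the query words
# themselves (the paper's analogy test does the same exclusion).
candidates = {w: v for w, v in vec.items() if w not in {"king", "man", "woman"}}
answer = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))
print(answer)  # queen
```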

Letโ€™s discuss the two primary architectures below!

Development Details

  1. Continuous Bag of Words (CBOW): given the surrounding context words in a window, we attempt to predict the word in the middle.

  2. Skip-gram: given the word in the middle, we attempt to predict each of the surrounding context words.

In the paper's architecture diagram, the context window of each model spans five words (the center word plus two on either side), but larger windows can be used in training.
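
To make "context window" concrete, here is a small sketch (not the paper's code) of how (center, context) training pairs fall out of such a window: skip-gram trains on these pairs directly, while CBOW groups all the context words at a position and predicts the center word from them.

```python
def training_pairs(tokens, window=2):
    """Yield (center_word, context_word) pairs from a token list.

    `window` is how many words to look at on each side of the center,
    so window=2 corresponds to the five-word span described above.
    """
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps".split()
print(list(training_pairs(sentence)))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...]
```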

Evaluation

The models are evaluated both 1) semantically and 2) syntactically. First, the authors show that their models significantly outperform prior work on a word relationship test set.
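
If you want to try these relationship queries yourself, pretrained word2vec vectors can be loaded with gensim. A sketch, assuming you have downloaded a word2vec-format file (the filename below refers to the commonly distributed GoogleNews vectors and is used here as a placeholder):

```python
from gensim.models import KeyedVectors

# Assumes a pretrained word2vec-format file has been downloaded locally;
# the path below is a placeholder.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman -> ?  (gensim excludes the query words for you)
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```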

Part of the paper's claim is that their models scale better than prior approaches. They demonstrate this with a 1000-dimensional embedding space and 6B training words.

Another interesting observation: models trained for more epochs but with lower dimensionality do substantially better.

The impact of epochs on training accuracy remains an open question in ML research.

Limitations and Future Work

  • On their comparison in high-dimensional vector space against NNLM (old SOTA), they give NNLM 100-dimension space and their models 1000-dimension space. They do not justify this โ€” I suspect something like NNLM would not be able to train a high-dimensional space like that. I would have liked to see a more just comparison.

  • 100% accuracy is likely impossible using the given model since the current models do not have any input information about word morphology. Future work can incorporate information about the structure of words, which would specially increase syntactic accuracy.

  • Having been written before the attention/transformer era of neural nets, the paper uses RNNs. A transformer model with an attention block would probably do better in learning these embeddings, and has indeed been used in future work.

In summary: what an amazing paper to read. In retrospect, these ideas look so simple; you could implement word2vec in an evening if you wanted. But it's hard to predict when scale will produce a breakthrough on a problem, and the authors bet on it here and succeeded. A parting thought: what other problems haven't been scaled up yet that might improve significantly if they were?

Until next time!
