
[2/15/23] Teaching LLMs to use tools and not suck at math

And maybe then they'll figure out how to do 2 + 3


Language models cannot do basic arithmetic, and are therefore very often wrong. For all their magical abilities, it feels surprising that much smaller models outperform them at this by such a wide margin.

In other models like LaMDA, we have seen the introduction of a “toolbox”, letting the LLM delegate the tasks it performs worst at to calculators and external APIs. Toolformer introduces a standardized way to do so.

Introduction and Motivation

Goal: Create a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.

Problem: Existing approaches rely on large amounts of human annotation or limit tool use to task-specific settings. We want a self-supervised way for the model to learn tool use.

Constraint: The LM should not lose any of its generality; it should cleverly decide when and how to use which tool. If tool use makes the model worse overall, the tools aren’t worth it.

Solution: Use in-context learning to generate entire datasets from scratch using the model itself.
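
To make this concrete, here is a minimal sketch of how in-context learning could be used to get the model to annotate text with candidate API calls. The prompt wording, the worked example, and the `lm.generate` interface are illustrative assumptions, not the paper's actual prompt or code.

```python
# Sketch of the self-annotation step: a frozen LM is shown a few human-written
# demonstrations and asked to insert "[Calculator(...)]" calls into new text.
# The prompt, the example, and the `lm.generate` interface are illustrative only.
ANNOTATION_PROMPT = """Your task is to add calls to a Calculator API to a piece of text.
You can call the API by writing "[Calculator(expression)]" wherever a calculation
would help you complete the text.

Input: The population grew from 5,000 to 8,000 people.
Output: The population grew from 5,000 to 8,000 people, an increase of [Calculator(8000 - 5000)] 3,000.

Input: {text}
Output:"""


def annotate_with_api_calls(lm, text: str) -> str:
    """Ask the LM itself to propose candidate API calls for `text`."""
    return lm.generate(ANNOTATION_PROMPT.format(text=text))
```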

Development Details

Here's how the model generates its self-annotated dataset:

  1. Give the model a handful of human-written examples of API usage.

  2. Let the LM annotate a huge language dataset with potential calls.

  3. Use a self-supervised loss to determine which calls help the model predict future tokens: between keeping a call and dropping it, whichever minimizes the loss over the following tokens is chosen (a code sketch of this check follows the list).

  4. Finetune the LM itself on API calls that it considers useful.
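
Here is a minimal sketch of that usefulness check, assuming an HF-style causal LM whose forward pass returns `.logits` with shape (batch, seq, vocab). The tensor plumbing and the `tau` margin are illustrative simplifications of the paper's filtering step, not its exact procedure.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def keep_api_call(lm, prefix_ids, call_with_result_ids, continuation_ids, tau=1.0):
    """Keep an annotated API call only if prepending the call (plus its result)
    to the prefix lowers the LM's loss on the tokens that follow it."""

    def continuation_loss(context_ids: torch.Tensor) -> torch.Tensor:
        ids = torch.cat([context_ids, continuation_ids]).unsqueeze(0)
        logits = lm(ids).logits  # assumed shape: (1, seq_len, vocab)
        ctx_len = context_ids.numel()
        # Score only the continuation tokens, each predicted from the token before it.
        pred = logits[:, ctx_len - 1 : -1, :]
        tgt = ids[:, ctx_len:]
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)), tgt.reshape(-1))

    loss_without = continuation_loss(prefix_ids)
    loss_with = continuation_loss(torch.cat([prefix_ids, call_with_result_ids]))
    # The call earns its place only if it reduces the loss by at least `tau`.
    return (loss_without - loss_with) >= tau
```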

They give a good example demonstrating how the model works:
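
The example itself is not reproduced here, but in the spirit of the paper's illustration, the model interleaves bracketed tool calls directly into the text it generates. Below is a toy sketch of that inline format, assuming a "[Calculator(expr)]" marker and a post-hoc step that executes the call and splices the result back in; in the real model the result is inserted during decoding, and every function name here is hypothetical.

```python
import re


def run_calculator(expression: str) -> str:
    """Toy calculator tool (illustrative only; a real tool would not use eval)."""
    return str(round(eval(expression), 2))


def execute_inline_calls(text: str) -> str:
    """Replace '[Calculator(expr)]' markers emitted by the LM with
    '[Calculator(expr) -> result]', mimicking the inline call format."""
    pattern = re.compile(r"\[Calculator\(([^)]*)\)\]")
    return pattern.sub(
        lambda m: f"[Calculator({m.group(1)}) -> {run_calculator(m.group(1))}]",
        text,
    )


print(execute_inline_calls(
    "Out of 1400 participants, 400 (or [Calculator(400 / 1400)] 29%) passed the test."
))
# Out of 1400 participants, 400 (or [Calculator(400 / 1400) -> 0.29] 29%) passed the test.
```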

Evaluation

On the LAMA benchmark (particularly the subset where the model is tasked with completing a short statement with a missing fact), they demonstrate that Toolformer, at 6.7B parameters, is competitive with much larger models.

Similarly, it clearly outperforms much larger models on mathematical problems, where it can lean on the calculator tool.

However, it surprisingly does not outperform GPT-3 on question-answering datasets.

The paper attributes this to the simplicity of their search engine, which yields many false positives, and to Toolformer's inability to interact with it, for example by reformulating its query if the results are not helpful.

Limitations and Future Work

  • I think the Toolformer model could have benefited from demonstrating that making it better at tasks that use tools does not make it worse at tasks where it should not use tools. An email-rewording task, for example, requires no API calls, but can suffer from worse performance on a model biased toward making such calls. This is crucial to prove their “generalization retained” idea.

  • The Toolformer model still cannot use tools in a chain, i.e. using the output of one tool as the input to another. This would allow the model to perform more detailed computations.

  • The model also cannot interact with a tool, which matters especially for tools like search engines that could potentially return hundreds of different results. Something like query rewording and result browsing would benefit the model greatly.

  • The model was also found to be extremely sensitive to the prompt. This could be a consequence of the model's smaller size or of its in-context learning mechanism; any prompt variation tends to bias models.

  • The process is also very sample-inefficient: processing more than a million documents yields only a few thousand examples of useful calls to the calculator API.

In Summary: I like this paper a lot. It presents a relatively simple idea in a straightforward way and, incredibly, owns up to its flaws and weaknesses. While I imagine budgetary constraints were a factor, I would have loved to see Toolformer scaled up so that it could be compared directly to GPT-3. I'm also not a fan of the need for prompt engineering in fine-tuning and beyond, as mentioned in the paper. However, I am extremely impressed by even a small model's new-found capabilities, and I look forward to reading more work in this domain.

๐Ÿน
memed on