New Study Investigates Limitations of Transformers on Compositional Reasoning
A new study explores the capabilities and limitations of large language models like GPT-3 on compositional reasoning tasks. The research, conducted by Nouha Dziri, Ximing Lu, Melanie Sclar and colleagues, rigorously tested Transformers on tasks like multi-digit multiplication, logic grid puzzles, and dynamic programming problems. These tasks require combining multiple reasoning steps to arrive at a correct solution.
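To make "combining multiple reasoning steps" concrete, here is a minimal Python sketch (an illustration in the spirit of the study's computation-graph framing, not the authors' code) that decomposes long multiplication into the chain of single-digit multiplications and running additions a model must get entirely right:

```python
def multiply_with_steps(a: int, b: int):
    """Long multiplication of a * b, returning the answer and every
    intermediate (step, value) pair a model would need to compute correctly."""
    steps = []
    partials = []
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * (10 ** place)   # one single-digit multiplication
        steps.append((f"{a} * {digit} * 10^{place}", partial))
        partials.append(partial)
    total = 0
    for partial in partials:                  # chain of additions
        total += partial
        steps.append((f"running sum += {partial}", total))
    return total, steps

answer, steps = multiply_with_steps(987, 654)
assert answer == 987 * 654                    # 645498
for description, value in steps:
    print(f"{description:28} -> {value}")
```

Each printed line is one node in the computation graph; an error at any node corrupts everything downstream, which is what makes the task compositional rather than a single lookup.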
The team found that while Transformers achieve near-perfect accuracy on in-domain examples seen during training, their performance declines sharply on more complex or out-of-distribution examples. For instance, GPT-3 scored only 55-59% on 3-digit by 3-digit multiplication despite exhaustive training. The researchers posit that Transformers reduce multi-step reasoning to pattern matching against subgraphs of computations seen in training, rather than systematically applying the core operations.
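As a rough sketch of how such a digit-length breakdown can be measured, the harness below groups random multiplication problems by operand length; `model_predict` is a hypothetical placeholder where a real LLM query would go (here it is a perfect oracle so the harness itself runs as written):

```python
import random

def model_predict(a: int, b: int) -> int:
    """Placeholder for a model's answer; swap in an actual LLM call here."""
    return a * b  # perfect oracle, so this sketch always reports 1.00

def accuracy_by_digits(n_digits: int, trials: int = 200) -> float:
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    correct = 0
    for _ in range(trials):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        correct += (model_predict(a, b) == a * b)
    return correct / trials

for n in range(1, 6):  # small operands in-domain, larger ones out-of-distribution
    print(f"{n}-digit x {n}-digit accuracy: {accuracy_by_digits(n):.2f}")
```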
Through computation graph analysis, the study breaks down how errors accumulate and propagate through the reasoning process. While Transformers can learn isolated operations, they struggle to chain multiple steps together correctly. Theoretical results further show that the probability of producing a fully correct answer decays exponentially as reasoning depth increases, meaning errors compound rather than cancel.
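A toy calculation makes the intuition concrete. Under the simplifying assumption (ours, not the paper's exact theorem) that each step succeeds independently with probability p, a d-step chain succeeds with probability p**d:

```python
# Simplified model of error compounding: independent per-step accuracy p
# implies whole-chain accuracy p**d for a chain of depth d.
per_step_accuracy = 0.95
for depth in (1, 2, 5, 10, 20, 50):
    chain_accuracy = per_step_accuracy ** depth
    print(f"depth {depth:3d}: P(all steps correct) = {chain_accuracy:.3f}")
```

Even at 95% per-step accuracy, a 50-step chain succeeds less than 8% of the time, matching the qualitative picture of errors compounding with reasoning depth.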
Overall, the research indicates fundamental limitations of current Transformers for advanced compositional reasoning. Lead author Nouha Dziri states, "Transformers could be inherently limited in solving compositionally complex tasks out-of-the-box." However, the authors suggest promising directions, such as pairing Transformers with planning modules.
With growing real-world deployment, it is critical to characterize Transformer capabilities rigorously. This study contributes a careful empirical and theoretical analysis of where those capabilities break down. The results underscore the need for models that extrapolate systematically beyond their training data when tackling multifaceted tasks.