Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

This is a Plain English Papers summary of a research paper called Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. If you like this kind of analysis, you should subscribe to the AImodels.fyi newsletter or follow me on Twitter.

Introduction

The paper introduces a method for evaluating the ability of large language models (LMs) to generalize to new task variants, called counterfactual tasks. While LMs have shown impressive performance on various benchmarks, it is unclear whether this success stems from their general reasoning abilities or from recognizing and recalling specific tasks seen during training.

The authors propose creating counterfactual tasks by altering the conditions or rules of tasks that LMs perform well on, while keeping the general reasoning procedure the same. For example, instead of performing arithmetic in base-10 (the default), the counterfactual task would be to perform arithmetic in base-9.

They designed a suite of 11 counterfactual evaluation tasks across different categories and domains, including traditional NLP tasks, code generation, drawing, and spatial reasoning. Each counterfactual variant keeps the underlying reasoning procedure the same as its default counterpart while changing the assumed conditions, so performance should transfer if the models have learned the general procedure.

The authors evaluated several large language models, including GPT-4, GPT-3.5, Claude, and PaLM-2, on both the default and counterfactual tasks. The models exhibited above-random performance on counterfactual tasks, indicating some degree of task generalizability. However, their performance consistently and substantially degraded on the counterfactual variants compared to the default settings.

These results suggest that the models' success is at least partially supported by non-transferable, default-condition-specific behaviors rather than abstract, generalizable reasoning skills. The paper also discusses surprising relationships between model behavior on the default and counterfactual tasks, such as the correlation between default and counterfactual performance, the varying effectiveness of zero-shot chain-of-thought prompting, and interactions between task-level and instance-level frequency effects.

Overall, the authors conclude that even small variations on the default instantiations of tasks are challenging for models, indicating that the success of existing LMs should not be attributed solely to a fully general capacity for the target task.

Counterfactual Tasks

The paper describes an approach to evaluate language models on their ability to generalize to new task variants or conditions, rather than just new inputs. Each task is conceptualized as a function that maps an input to an output under certain assumptions or a "world model." For example, arithmetic operations assume a number base like base-10.

The authors refer to the typical set of assumptions, like base-10 for arithmetic, as the "default world." They introduce counterfactual worlds that deviate from these default conditions, like using a different number base. By evaluating language models on tasks in both the default and counterfactual worlds, they can measure if the models have overfit to the default conditions.
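
To make this framing concrete, here is a minimal Python sketch (my illustration, not code from the paper) in which the number base plays the role of the "world model": the digit-by-digit addition procedure is identical in every world, and only the assumed base changes.

```python
def add_in_base(x: str, y: str, base: int = 10) -> str:
    """Add two non-negative integers written as digit strings in `base`."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    result, carry = [], 0
    # Walk both numbers from least- to most-significant digit, carrying
    # whenever a digit sum reaches the base -- the same procedure in any world.
    for i in range(max(len(x), len(y))):
        dx = digits.index(x[-1 - i]) if i < len(x) else 0
        dy = digits.index(y[-1 - i]) if i < len(y) else 0
        total = dx + dy + carry
        result.append(digits[total % base])
        carry = total // base
    if carry:
        result.append(digits[carry])
    return "".join(reversed(result))

# Same inputs, different worlds: the default (base 10) and a counterfactual (base 9).
print(add_in_base("27", "65", base=10))  # -> 92
print(add_in_base("27", "65", base=9))   # -> 103
```

Here the base is an explicit argument, so the "default world" is just one setting of that argument; the question the paper asks is whether a language model's learned procedure transfers across settings in the same way.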

To ensure the models understand the specified counterfactual conditions, the authors use "counterfactual comprehension checks" (CCCs): simple auxiliary tasks that test whether the model grasps the counterfactual world. For example, a check for base-9 arithmetic verifies that the model carries whenever a digit sum reaches 9.
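
The paper's exact CCC prompts appear in its appendix; as a rough illustration of what such a check probes (my example, not the authors'), a base-9 comprehension check comes down to knowing that a digit sum of nine or more produces a carry:

```python
# Hypothetical comprehension probe (not the paper's exact CCC prompt):
# a model credited with understanding base 9 should know these carry facts.
assert int("10", 9) == 9       # "10" in base 9 denotes nine
assert int("13", 9) == 8 + 4   # 8 + 4 (twelve) is written "13" in base 9
print("base-9 carry facts check out")
```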

Tasks

The paper evaluates large language models (LMs) on various tasks under both default and counterfactual settings to assess their robustness and understanding. The tasks include:

  • Arithmetic in different number bases
  • Programming with different indexing conventions (sketched below)
  • Syntactic reasoning with different word orders
  • Logical reasoning with premises that violate common sense
  • Spatial reasoning with transformed coordinate systems
  • Drawing objects with rotations and flips
  • Music tasks with altered tunings and transpositions
  • Chess openings with swapped piece positions
  • The SET card game with rule variations
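
For the programming task, the gist of the counterfactual condition is a switch from Python's usual 0-based indexing to 1-based indexing. As a rough sketch (my illustration, not the paper's prompt format), the same expression has different correct answers under the two conventions:

```python
letters = ["a", "b", "c", "d"]

# Default world: Python's 0-based indexing, so letters[2] is the third element.
print(letters[2])                      # -> c

# Counterfactual world: the same expression read under 1-based indexing
# (as in Lua or MATLAB) should return the second element instead. Emulated
# here by shifting the index before the native lookup.
def one_based_getitem(xs, i):
    return xs[i - 1]

print(one_based_getitem(letters, 2))   # -> b
```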

For each task, the models are given a counterfactual comprehension check (CCC) to verify their understanding of the altered setting. The results show that while LMs perform well on the default tasks, their performance consistently drops in the counterfactual conditions, despite often passing the CCC. This suggests that LMs may lack robust understanding and instead rely on patterns from their training data. The paper aims to disentangle memorization effects from true conceptual understanding through these counterfactual evaluations.

Results

The researchers evaluated several large language models (GPT-4, GPT-3.5, Claude, and PaLM-2) on various tasks under default and counterfactual conditions. They found the models performed substantially worse on the counterfactual task variants, even when prompted to reason step by step. While the models exhibited some ability to solve the counterfactual tasks, the large performance gaps between the default and counterfactual conditions suggest the models have overfit to the default conditions, and that their grasp of the tasks is less abstract than the default results imply. For example, the models sometimes failed completely on counterfactual arithmetic problems despite near-perfect default performance. The researchers conclude that the models rely on condition-specific implementations rather than a general comprehension of the tasks.

Analysis

The paper explores how various factors affect the performance gap between default and counterfactual conditions for language models on different tasks. Some key findings:

  • More common or familiar counterfactual conditions (e.g. base 8, drop-D guitar tuning) lead to smaller performance drops compared to uncommon ones.

  • Performance drops are smaller when the counterfactual condition is closer to the default (e.g. base 9 vs 11 for arithmetic).

  • Across tasks and models, there is a strong correlation between default and counterfactual performance: models that perform better on the default tasks also tend to perform better on the counterfactual variants.

  • However, larger models sometimes fail more on counterfactuals that contradict their training data, suggesting over-reliance on memorization.

  • Few-shot demonstrations help but do not fully eliminate the performance gap between default and counterfactual conditions.

  • Qualitative analysis shows even when models can complete counterfactual tasks, the outputs are often simplified or contain errors compared to default conditions.

The findings suggest the models rely on a mix of memorization and reasoning, with the two lying on a continuum rather than forming an either/or dichotomy.

Discussion

The paper discusses whether large language models (LMs) trained on text data can generalize their reasoning abilities to unfamiliar counterfactual conditions, or if their performance is specific to the familiar default task setup they were trained on.

The authors hypothesize that while humans may initially struggle with unfamiliar counterfactual scenarios under time constraints, they ultimately have the competence and causal reasoning abilities to generalize to novel situations given sufficient time.

However, the authors argue that replicating human intelligence is not necessarily the goal for LMs. Their aim is to evaluate if LMs genuinely learn generalizable reasoning capabilities from training data, rather than just memorizing patterns specific to the default tasks.

The paper finds a notable performance gap between default and counterfactual task variants for LMs across several reasoning tasks. While more informative prompting can reduce this gap, it does not fully eliminate it, suggesting LMs may overfit to frequent patterns during pretraining rather than learning generalizable abstractions.

The authors posit that an ideal learner should structure representations to implement general-purpose abstractions that allow generalizing to new contexts, akin to how humans and scientific theories can extrapolate to drastically different scenarios. Their study indicates current LMs still struggle with such generalization.

Limitations

The paper discusses potential factors that could lead to underestimation or overestimation of a language model's true reasoning ability when using counterfactual tasks for evaluation.

Underestimation:

  • It is challenging to construct counterfactual tasks with the same difficulty as the default tasks. An objective difficulty measure may not exist.
  • Some counterfactual tasks involve additional steps, making them harder than the original task. This increased difficulty, rather than overfitting, could explain poor performance.

Overestimation:

  • It is difficult to ensure that counterfactual conditions are truly unseen during pretraining. If models have encountered some counterfactual conditions, the gap between default and counterfactual performance could be smaller than expected.

The paper distinguishes between counterfactual perturbations that fundamentally change the world model (e.g., arithmetic base) and more superficial perturbations where models could potentially exploit shortcuts (e.g., word replacements). For some tasks, models might identify simple mappings to revert the input to default conditions and leverage memorization, rather than reasoning under counterfactual conditions.
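
As a concrete but hypothetical illustration of such a shortcut (my example, not one from the paper): if a counterfactual is stated as a simple word substitution, a model can translate the input back into the default world and answer from memorized default behavior, without ever reasoning under the counterfactual condition.

```python
# Hypothetical shortcut: suppose the counterfactual is stated as a plain word
# substitution, e.g. "left" and "right" are swapped. A model can map the input
# back into the default world and answer from memorized default behavior.
SWAPS = {"left": "right", "right": "left"}

def revert_to_default(prompt: str) -> str:
    return " ".join(SWAPS.get(word, word) for word in prompt.split())

print(revert_to_default("the lamp is to the left of the sofa"))
# -> the lamp is to the right of the sofa  (now solvable with default knowledge)
```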

Finally, the counterfactual comprehension check (CCC) itself involves some reasoning, making it difficult to isolate the model's understanding of the counterfactual conditions from its reasoning ability.

Related Work

The paper discusses evaluating how well large language models (LMs) can reason about and understand conceptual structures in hypothetical or counterfactual scenarios that deviate from the real world. Previous research has shown LMs can acquire grounded understanding of certain concepts, like color and size, through text training alone. However, this study found that while LMs can generalize to new concepts, they struggle with the reasoning process required for hypothetical scenarios.

The paper relates this issue to causal reasoning - LMs fail to robustly learn the causal effects of different world states on the evaluated tasks. It notes that "counterfactual" is an informal term in NLP referring to different types of perturbations or hypothetical scenarios. Some past work looked at counterfactuals still licensed within a default world model, finding LMs struggle with consistent reasoning in such cases. Other work examined using counterfactual data to test model robustness.

Most similar to this study, one paper showed that while LMs seem capable of some reasoning in counterfactual worlds, this is largely due to surface cues rather than true understanding. The current paper reveals that even recent LMs exhibit difficulties with reasoning about conceptual structures in scenarios deviating from the real world.

Conclusion

The researchers conducted an evaluation using counterfactual scenarios across 11 tasks. They found a significant decrease in the performance of language models when faced with counterfactual conditions, as opposed to the default task variants. This performance gap is attributed to the language models overfitting to the default task variants, which may be more prevalent in the pretraining data. The authors recommend that future analyses of language models should consider their abstract task ability separately from their observed task performance, especially when the evaluated task variants might be overrepresented in the pretraining corpora.

Acknowledgments

The authors express gratitude to various individuals for their contributions. They thank several people, listed alphabetically, for helpful discussions and feedback on the work. Additionally, they acknowledge a group of annotators who made the drawing evaluation possible. One author expresses personal thanks for guitar lessons relevant to components of the paper. The figure in the paper uses icons from a website, and the study received funding from multiple sources, including the MIT-IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation.

Appendix A Full Setups

The paper describes several tasks used to evaluate large language models' (LMs) ability to reason under counterfactual conditions.

For arithmetic, models are asked to perform operations in different number bases. For programming, models generate code or execute provided code under different indexing conventions (0-based vs 1-based).

For syntactic reasoning, models identify subjects and verbs in sentences with different word orders. For logical reasoning, models determine if conclusions follow from counterfactual premises.

For spatial reasoning, models locate objects under different coordinate system orientations. For drawing, models generate code to draw objects with different rotations/flips.
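
As a rough sketch of the kind of transformation involved (my illustration; the paper's drawing format and exact conventions are not reproduced here), rotating or flipping a set of 2D points is a small change of coordinates that nonetheless changes every answer:

```python
import math

def rotate(point, degrees):
    """Rotate a 2D point counter-clockwise about the origin."""
    x, y = point
    t = math.radians(degrees)
    return (round(x * math.cos(t) - y * math.sin(t), 6),
            round(x * math.sin(t) + y * math.cos(t), 6))

def flip_horizontal(point):
    """Mirror a 2D point across the vertical axis."""
    x, y = point
    return (-x, y)

unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print([rotate(p, 90) for p in unit_square])        # the square rotated 90 degrees
print([flip_horizontal(p) for p in unit_square])   # the square mirrored left-right
```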

For music tasks, models provide fret placements for alternate instrument tunings and identify notes in different musical keys.
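
As a rough sketch of why retuning changes the answers (my illustration, not the paper's prompt format), a fret placement is just the semitone distance between the open string's pitch and the target note:

```python
# The 12 pitch classes in ascending semitone order.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def fret_for_note(open_string: str, target: str) -> int:
    """Lowest fret on a string tuned to `open_string` that sounds `target`."""
    return (NOTES.index(target) - NOTES.index(open_string)) % 12

# Default world: standard tuning, the low string is an E.
print(fret_for_note("E", "G"))   # -> 3
# Counterfactual world: drop-D tuning, the same string is now a D.
print(fret_for_note("D", "G"))   # -> 5
```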

For chess, models validate move sequences under swapped piece positions. For the SET game, models identify set completions with inverted rules.

The paper also describes the counterfactual comprehension checks (CCCs) used to verify that models understand the counterfactual conditions.

Appendix B Prompts

The paper discusses the specific prompts used to query language models, presented in Tables 1 to 17. Instead of providing templates, the authors include concrete prompts that embed test instances. Minor design decisions related to the prompts are explained in the respective table captions. The system message field was not utilized for any of the language models evaluated.

Appendix C Raw Results

Tables 18 to 34 present the raw numerical results of the study in tabular form.

If you enjoyed this summary, consider subscribing to the AImodels.fyi newsletter or following me on Twitter for more AI and machine learning content.
