Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression

Qiming Bao, Xiaoxuan Fu
December 6, 2025
cs.AI, cs.CL, cs.LG

Abstract

Large language models (LLMs) achieve strong performance on many natural language tasks, yet their generalisation under structured perturbations of logical rule systems remains insufficiently characterised. We present a controlled evaluation framework that probes reasoning reliability through four stress tests: (1) rule deletion, removing redundant versus essential rules from a multi-step inference chain; (2) contradictory evidence injection; (3) logic-preserving rewrites based on equivalence laws (contraposition, double negation, implication-to-disjunction, De Morgan, identity, and commutativity); and (4) multi-law equivalence stacking that composes 2--5 transformations. Across three representative model families -- BERT, Qwen2, and LLaMA-like models -- all models attain Acc$=1.0000$ on the base split and show no degradation under redundant rule deletion. In contrast, essential rule deletion yields a pronounced decrease to near-chance performance, and injecting explicit contradictions reduces accuracy to 0.0000. Under logic-preserving rewrites, accuracy is largely preserved for single-law transformations with only small degradations in a few cases, whereas multi-law stacking exposes model-dependent sensitivity: BERT matches the base condition, TinyLlama shows only marginal degradation, and Qwen2 exhibits a substantial drop. Overall, the results indicate that contemporary LLMs are generally stable under semantic-preserving reformulations, yet remain brittle to missing or inconsistent evidence and may degrade under composed logical transformations depending on the model family. The proposed framework provides a concise diagnostic tool for isolating these failure modes and for evaluating logical generalisation beyond surface-form variation.
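As an illustration of the logic-preserving rewrites named in the abstract, the following minimal sketch checks each listed equivalence law by exhaustive truth table and then checks one stacked rewrite of an implication. It is not the paper's evaluation code. The helper names (`implies`, `is_equivalent`, `LAWS`, `stacked`) and the particular propositional form chosen for each law are assumptions made for this example.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication, defined directly from its truth table."""
    return b if a else True


# Hypothetical encoding: each law maps to an (original, rewritten) pair
# of Boolean functions over two propositional variables p and q.
LAWS = {
    "contraposition": (
        lambda p, q: implies(p, q),
        lambda p, q: implies(not q, not p),
    ),
    "double_negation": (
        lambda p, q: p,
        lambda p, q: not (not p),
    ),
    "implication_to_disjunction": (
        lambda p, q: implies(p, q),
        lambda p, q: (not p) or q,
    ),
    "de_morgan": (
        lambda p, q: not (p and q),
        lambda p, q: (not p) or (not q),
    ),
    "identity": (
        lambda p, q: p and True,
        lambda p, q: p,
    ),
    "commutativity": (
        lambda p, q: p and q,
        lambda p, q: q and p,
    ),
}


def is_equivalent(lhs, rhs) -> bool:
    """True if both formulas agree on every assignment of p and q."""
    return all(
        lhs(p, q) == rhs(p, q)
        for p, q in product([False, True], repeat=2)
    )


# Single-law rewrites: every law should preserve the truth table.
results = {name: is_equivalent(lhs, rhs) for name, (lhs, rhs) in LAWS.items()}

# A stacked rewrite of "p -> q" using three laws in sequence:
#   contraposition:             (not q) -> (not p)
#   implication-to-disjunction: not (not q) or (not p)
#   double negation:            q or (not p)
stacked = lambda p, q: q or (not p)
stacked_ok = is_equivalent(lambda p, q: implies(p, q), stacked)

# A rewrite that is NOT logic-preserving, for contrast: the converse.
converse_ok = is_equivalent(
    lambda p, q: implies(p, q),
    lambda p, q: implies(q, p),
)

if __name__ == "__main__":
    for name, ok in results.items():
        print(f"{name}: {'preserved' if ok else 'changed'}")
    print("stacked rewrite preserved:", stacked_ok)
    print("converse preserved:", converse_ok)
```

A stacked rewrite leaves the truth table unchanged however many laws are composed, so any accuracy drop under stacking reflects sensitivity to surface form rather than a change in the underlying logic.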


Code Implementations (2)

openai/evals | Official | 100%

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

18,350 | 2,937 | HTML, Shell | Jan 23, 2023 | 1 month ago | NOASSERTION

14H034160212/lemo | Official | 100%

Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors

0 | 0 | TeX, Lean | Dec 2, 2025 | 1 month ago

Cite this paper

@article{bao2025less,
  title  = {Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression},
  author = {Qiming Bao and Xiaoxuan Fu},
  year   = {2025},
  eprint = {2512.06393},
  archivePrefix = {arXiv},
  url    = {http://arxiv.org/abs/2512.06393v2}
}
