General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Junlin Liu, Shengnan An, Shuang Zhou, Dan Ma, Shixiong Luo +8 more
4/13/2026
cs.CL, cs.AI

Abstract

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts--often termed general reasoning--remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io


Code Implementations (4)

An AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, web scraping, and large language models. The goal of this repo is to provide the simplest implementation of a deep research agent — i.e., an agent that can refine its research direction over time and deep dive into a topic.

18,367 stars · 1,895 forks · Created Feb 4, 2025 · Updated 7 months ago · MIT License
Tags: agent, ai, gpt, o3-mini, research

Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"

13,195 stars · 881 forks · Created Jun 18, 2021 · Updated 1 year ago · MIT License
Tags: adaptation, deberta, deep-learning, gpt-2, gpt-3, +5 more

Plug-and-play implementation of "Tree of Thoughts: Deliberate Problem Solving with Large Language Models," which elevates model reasoning by at least 70%.

4,559 stars · 376 forks · Created May 21, 2023 · Updated 8 months ago · Apache-2.0 License
Tags: artificial-intelligence, chatgpt, deep-learning, gpt4, multimodal, +4 more

We introduce MKQA, an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. For details, please refer to our paper, MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering.

189 stars · 19 forks · Created Jul 30, 2020 · Updated 3 years ago · Apache-2.0 License
Tags: dataset, multilingual-evaluation, nlp
