Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Haoyuan Li¹*, Qihang Cao²*, Tao Tang¹, Kun Xiang¹, Zihan Guo¹,³, Jianhua Han⁴, Hang Xu⁴, Xiaodan Liang¹,⁵†

¹Sun Yat-sen University ²Shanghai Jiao Tong University ³Shanghai Innovation Institute ⁴Yingwang Intelligent Technology Co., Ltd. ⁵MBZUAI

* Equal contribution. † Corresponding author.

Key Contributions

Active Perception

We propose GeoThinker, which enables MLLMs to actively retrieve and integrate geometry conditioned on their internal reasoning needs, rather than passively fusing a uniformly exposed geometry stream.

SOTA Performance

GeoThinker achieves state-of-the-art results on spatial intelligence benchmarks, including a peak score of 72.6 on VSI-Bench.

Robust & Transferable

GeoThinker remains robust under debiased and long-video evaluation settings, and transfers effectively to diverse downstream scenarios such as embodied referring and autonomous driving.

Abstract

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused indiscriminately, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of mixing features uniformly, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands.

Comprehensive evaluation shows that GeoThinker sets a new state of the art in spatial intelligence, achieving a peak score of 72.6 on VSI-Bench. Furthermore, GeoThinker generalizes robustly and significantly improves spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence.

Model Overview

Our framework features a decoupled interaction mechanism in which the VGGT geometry encoder is integrated through Spatial-Grounded Fusion layers. Using Importance Gating, the model predicts a localized attention bias that dynamically modulates the injection of geometric textures. This design ensures that rich structural details are queried only when they are contextually relevant to the semantic reasoning process.
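To make the gating idea concrete, the following is a minimal sketch of one way an importance-gated attention bias could work. It is an illustration under stated assumptions, not the authors' implementation: the function names (`importance_gate`, `gated_geometry_fusion`), the sigmoid gate with parameters `w` and `b`, the log-gate bias, and the residual addition are all hypothetical choices made for this example. Semantic tokens act as queries over geometry tokens, and a per-token gate score is converted into an additive bias on the attention logits, so geometry tokens with a low gate score contribute little to the fused output.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def importance_gate(geo_tokens, w, b):
    """Hypothetical gate: one sigmoid score in (0, 1) per geometry token."""
    return [1.0 / (1.0 + math.exp(-(dot(w, g) + b))) for g in geo_tokens]


def gated_geometry_fusion(sem_tokens, geo_tokens, w, b, eps=1e-6):
    """Cross-attention from semantic tokens to geometry tokens.

    Assumption for this sketch: the gate score is turned into an additive
    attention bias log(score + eps), so low-importance geometry tokens are
    suppressed before the softmax.
    """
    d = len(geo_tokens[0])
    gate = importance_gate(geo_tokens, w, b)
    bias = [math.log(s + eps) for s in gate]

    fused = []
    for q in sem_tokens:
        # Scaled dot-product logits plus the per-token importance bias.
        logits = [dot(q, g) / math.sqrt(d) + bj
                  for g, bj in zip(geo_tokens, bias)]
        attn = softmax(logits)
        # Weighted sum of geometry tokens retrieved for this query.
        retrieved = [sum(a * g[k] for a, g in zip(attn, geo_tokens))
                     for k in range(d)]
        # Residual injection into the semantic stream.
        fused.append([qk + rk for qk, rk in zip(q, retrieved)])
    return fused, gate


# Toy usage: the gate parameters favor the first geometry token.
semantic = [[1.0, 0.0]]
geometry = [[1.0, 0.0], [0.0, 1.0]]
fused, gate = gated_geometry_fusion(semantic, geometry, w=[10.0, -10.0], b=0.0)
```

In this toy setup the second geometry token receives a gate score near zero, so nearly all attention mass falls on the first token. A real system would learn the gate parameters jointly with the model and operate on batched tensors rather than Python lists.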

Visualization

Heatmaps illustrate that GeoThinker naturally learns to prioritize salient object boundaries and structural edges while suppressing non-informative regions like floors or walls.

BibTeX

@article{li2026thinking,
  title={Thinking with Geometry: Active Geometry Integration for Spatial Reasoning},
  author={Haoyuan, Li and Qihang, Cao and Tao, Tang and Kun, Xiang and Zihan, Guo and Jianhua, Han and JiaWang, Bian and Hang, Xu and Xiaodan, Liang},
  year={2026}
}
