Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Haoyuan Li1*, Qihang Cao2*, Tao Tang1, Kun Xiang1, Zihan Guo1,3, Jianhua Han4, Hang Xu4, Xiaodan Liang1,5
1Sun Yat-sen University 2Shanghai Jiao Tong University 3Shanghai Innovation Institute 4Yingwang Intelligent Technology Co., Ltd. 5MBZUAI
* Equal contribution. Corresponding author.
Teaser

LEFT (a) Passive Fusion: Conventional MLLMs indiscriminately incorporate a global stream of geometric features, which leads to significant information redundancy and semantic-texture misalignment.
(b) Active Perception (GeoThinker): Our framework shifts the paradigm by empowering the model to discern and selectively retrieve spatial cues guided by its internal reasoning demands.
RIGHT Active perception yields superior performance across diverse spatial intelligence benchmarks.

Key Contributions

Active Perception

We propose GeoThinker, which enables MLLMs to actively retrieve and integrate geometry conditioned on their internal reasoning needs, rather than passively fusing a uniformly exposed geometry stream.

SOTA Performance

GeoThinker achieves SOTA results on spatial intelligence benchmarks, including the best score on VSI-Bench (72.6).

Robust & Transferable

GeoThinker remains robust under debiased and long-video evaluation settings, and transfers effectively to diverse downstream scenarios such as embodied referring and autonomous driving.

Abstract

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands.

Comprehensive evaluation shows that GeoThinker sets a new state of the art in spatial intelligence, achieving a peak score of 72.6 on VSI-Bench. GeoThinker also generalizes robustly and significantly improves spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence.

Model Overview

Our framework uses a decoupled interaction mechanism in which VGGT is integrated through Spatial-Grounded Fusion layers. With Importance Gating, the model predicts a localized attention bias that dynamically modulates the injection of geometric textures. This design ensures that rich structural details are queried only when they are contextually relevant to the semantic reasoning process.
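
To make the idea of a gate-predicted attention bias concrete, here is a minimal sketch in plain Python. It is not the GeoThinker implementation: the function name `gated_geometry_attention`, the gate parameters `gate_w`, the choice of a log-sigmoid bias, and the residual update are all assumptions made for illustration. The sketch shows one query token attending over a handful of geometry tokens, with a per-token bias added to the attention logits so that tokens the gate judges irrelevant receive almost no weight.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)); the result is always <= 0."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_geometry_attention(query, geo_tokens, gate_w):
    """Attend from one query token over geometry tokens with a gated bias.

    All names and the gate form are hypothetical (illustrative only).
    Returns (updated_query, attention_weights).
    """
    d = len(query)
    logits = []
    for g in geo_tokens:
        # Standard scaled dot-product relevance between query and geometry.
        relevance = dot(query, g) / math.sqrt(d)
        # Assumed gate: a linear score on the query-geometry interaction.
        gate_logit = dot(gate_w, [q * x for q, x in zip(query, g)])
        # The gate enters as an additive, non-positive attention bias.
        logits.append(relevance + log_sigmoid(gate_logit))
    weights = softmax(logits)
    # Weighted sum of geometry tokens, injected residually into the query.
    retrieved = [sum(w * g[k] for w, g in zip(weights, geo_tokens))
                 for k in range(d)]
    updated = [q + r for q, r in zip(query, retrieved)]
    return updated, weights


# Two geometry tokens that are equally relevant to the query.
query = [1.0, 1.0]
geo_tokens = [[2.0, 0.0], [0.0, 2.0]]

# A zero gate biases every token equally, so attention stays uniform.
_, uniform_weights = gated_geometry_attention(query, geo_tokens, [0.0, 0.0])

# A non-zero gate favours the first token and suppresses the second.
updated, gated_weights = gated_geometry_attention(query, geo_tokens, [4.0, -4.0])
```

With the zero gate the two tokens share the attention equally, while the non-zero gate shifts nearly all of the weight onto the first token. This is the selective behaviour the paragraph above describes, here reduced to a single query and a hand-set gate.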

Visualization

Heatmaps illustrate that GeoThinker naturally learns to prioritize salient object boundaries and structural edges while suppressing non-informative regions like floors or walls.

BibTeX

@article{li2026thinking,
  title={Thinking with Geometry: Active Geometry Integration for Spatial Reasoning},
  author={Li, Haoyuan and Cao, Qihang and Tang, Tao and Xiang, Kun and Guo, Zihan and Han, Jianhua and Bian, JiaWang and Xu, Hang and Liang, Xiaodan},
  year={2026}
}