We propose GeoThinker, which enables MLLMs to actively retrieve and integrate geometry conditioned on their internal reasoning needs, rather than passively fusing a uniformly exposed geometry stream.
GeoThinker achieves state-of-the-art results on spatial intelligence benchmarks, including the best score on VSI-Bench.
GeoThinker remains robust under debiased and long-video evaluation settings, and transfers effectively to diverse downstream scenarios such as embodied referring and autonomous driving.
Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused indiscriminately, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of mixing features uniformly, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands.
Comprehensive evaluations show that GeoThinker sets a new state of the art in spatial intelligence, achieving a peak score of 72.6 on VSI-Bench. Furthermore, GeoThinker generalizes robustly and shows significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence.
Our framework features a decoupled interaction mechanism in which VGGT is integrated via Spatial-Grounded Fusion layers. Through Importance Gating, the model predicts a localized attention bias that dynamically modulates the injection of geometric textures. This design ensures that rich structural details are queried only when they are contextually relevant to the semantic reasoning process.
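As a rough illustration of the idea, the sketch below shows one way a gated attention bias could modulate how much geometry each semantic token retrieves. It is not the GeoThinker implementation: the function `gated_geometry_attention`, the parameters `gate_w` and `gate_b`, the choice of a log-sigmoid bias, and the residual addition are all assumptions made for this example, and it is written in plain Python for readability rather than speed.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_geometry_attention(queries, geo_tokens, gate_w, gate_b=0.0):
    """Hypothetical sketch of importance-gated cross-attention.

    queries:    semantic token vectors (lists of floats, length d)
    geo_tokens: geometry token vectors (same length d), used as keys and values
    gate_w, gate_b: assumed parameters of a linear importance gate
    """
    d = len(geo_tokens[0])
    # Importance gate: one scalar in (0, 1) per geometry token.
    gates = [sigmoid(dot(gate_w, g) + gate_b) for g in geo_tokens]
    # Attention bias: log of the gate, so low-importance tokens are suppressed.
    bias = [math.log(g) for g in gates]

    fused, weights = [], []
    for q in queries:
        logits = [dot(q, g) / math.sqrt(d) + b for g, b in zip(geo_tokens, bias)]
        attn = softmax(logits)
        # Retrieve a weighted mix of geometry tokens for this query.
        retrieved = [sum(a * g[i] for a, g in zip(attn, geo_tokens)) for i in range(d)]
        # Residual injection: semantic token plus retrieved geometry.
        fused.append([qi + ri for qi, ri in zip(q, retrieved)])
        weights.append(attn)
    return fused, weights, gates

# Toy usage: two geometry tokens, the gate favors the first one.
queries = [[0.0, 0.0]]
geo_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused, weights, gates = gated_geometry_attention(queries, geo_tokens, gate_w=[4.0, -4.0])
```

With a zero query the content term vanishes, so the attention weights are determined entirely by the gate. That makes the effect of the bias easy to inspect in isolation.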
Heatmaps illustrate that GeoThinker naturally learns to prioritize salient object boundaries and structural edges while suppressing non-informative regions like floors or walls.
@article{li2026thinking,
title={Thinking with Geometry: Active Geometry Integration for Spatial Reasoning},
author={Li, Haoyuan and Cao, Qihang and Tang, Tao and Xiang, Kun and Guo, Zihan and Han, Jianhua and Bian, JiaWang and Xu, Hang and Liang, Xiaodan},
year={2026}
}