
Vision-language models (VLMs) are computational models designed to process both images and written text and to make predictions that draw on the two combined. Among other things, these models could be used to improve the capabilities of robots, helping them interpret their surroundings accurately and interact with human users more effectively.
A team of researchers from the Italian Institute of Technology (IIT) and the University of Aberdeen has recently introduced a new conceptual framework, along with a dataset of computationally generated data, that could be used to train VLMs on spatial reasoning tasks. The framework and dataset, presented in a paper posted to the arXiv preprint server, could contribute to the development of embodied artificial intelligence (AI) systems that are better equipped to navigate real-world environments and communicate with humans.
This research is an outcome of the FAIR* project and stems from a recent collaboration between the Social Cognition in Human-Robot Interaction (S4HRI) research line at IIT, led by Prof. Agnieszka Wykowska, and the Action Prediction Lab at the University of Aberdeen, led by Prof. Patric Bach.
“Our research group investigates how human social cognition mechanisms are engaged during interactions with artificial agents,” Davide De Tommaso, technologist at IIT and co-senior author of the paper, told Tech Xplore. “Our previous studies indicated that, under specific conditions, people attribute intentionality to robots and interact with them in ways that closely resemble interactions with other social partners.
“Therefore, understanding these mechanisms, particularly the role of nonverbal cues such as gaze, gestures, and spatial behaviors, is crucial for developing effective computational models of social cognition in robots.”
Visual perspective taking (VPT), the ability to understand what a visual scene looks like from another’s point of view, could be greatly advantageous for robotic systems, as it could allow them to make sense of instructions they are given, cooperate with other agents and successfully complete missions. De Tommaso and his colleagues have recently been trying to reproduce this key ability in robots, while also ensuring that the robots can apply it across a wide range of contexts.
“Our primary objective was to enable robots to reason effectively about what other agents (human or artificial) can or cannot perceive from their vantage points within shared environments,” said De Tommaso. “For example, robots should accurately assess whether text is readable from another person’s viewpoint, if an object is hidden behind an obstacle, or whether an object is suitably oriented for a human to grasp or point to it.
“Despite current foundational models often lacking sophisticated spatial reasoning capabilities, we strongly believe that harnessing large-language models for scene understanding, alongside synthetic scene representations, holds significant promise for modeling human-like VPT capabilities in embodied artificial agents.”
To improve the VPT capabilities of VLMs, the researchers compiled a dataset that could support their training on spatial reasoning tasks. Using NVIDIA’s Omniverse Replicator, a platform for generating synthetic data, they created a new “artificial world”: a simple scene containing a cube, viewed from different angles and distances.
They then captured images of the cube in this synthetic world, adding to each a natural language description and a 4×4 transformation matrix, a mathematical structure that represents the position and orientation of the cube relative to the camera. The dataset was published online and can be used by other teams to train their VLMs.
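For readers unfamiliar with the notation, a 4×4 homogeneous transformation matrix packs a 3×3 rotation and a 3-element translation into a single matrix. The sketch below is illustrative only and is not taken from the paper’s code; it assumes a standard camera-to-object convention and uses NumPy.

```python
import numpy as np

def camera_to_object_transform(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector translation.

    The camera-to-object convention used here is an assumption for illustration;
    the published dataset may define its frames differently.
    """
    T = np.eye(4)
    T[:3, :3] = rotation      # orientation of the cube relative to the camera
    T[:3, 3] = translation    # position of the cube relative to the camera
    return T

# Example: a cube 1.5 m in front of the camera, rotated 45 degrees about the vertical axis.
theta = np.deg2rad(45)
R_yaw = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
print(camera_to_object_transform(R_yaw, np.array([0.0, 0.0, 1.5])))
```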
“Each image captured by the virtual camera comes with a text prompt containing the cube’s dimensions, and a precise transformation matrix that encodes the spatial relationship between the camera and the object, the kind of data robots use to plan movements and interact with the world,” explained Joel Currie, the first author of the paper, who is a Ph.D. student at the University of Aberdeen and a Research Fellow at the Italian Institute of Technology.
“Because the environment is synthetic, we control every aspect and generate tens of thousands of image-matrix pairs quickly (something nearly impossible with real-world setups). It’s a way of teaching robots to not just see, but to understand space like a physical being would.”
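As a rough illustration of what one such image-matrix pair might look like once serialized, the record below is hypothetical: the field names, prompt wording, and file layout are assumptions for the sake of example, not the published dataset’s actual schema. In a synthetic pipeline, a loop over randomized camera poses would emit tens of thousands of records of this kind.

```python
import json
import numpy as np

def make_record(image_path: str, cube_size_m: float, T_cam_obj: np.ndarray) -> dict:
    """Pair one rendered image with a text prompt and its flattened 4x4 transform.

    Field names and prompt wording are illustrative assumptions, not the
    dataset's actual schema.
    """
    return {
        "image": image_path,
        "prompt": f"A cube with edge length {cube_size_m} m seen from the camera's viewpoint.",
        "camera_to_object_matrix": T_cam_obj.reshape(-1).tolist(),  # 16 numbers, row-major
    }

# Example: one synthetic sample with the identity pose.
record = make_record("renders/sample_000000.png", 0.1, np.eye(4))
print(json.dumps(record, indent=2))
```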
So far, the framework introduced by the researchers is primarily conceptual, but it could open new possibilities for training real VLMs. The researchers themselves could next assess its potential by training a model on the dataset they compiled, or on similar synthetically generated data.
“What we’ve done is fundamentally conceptual,” Currie said. “We’re proposing a new way for AI to learn space, not just from its own viewpoint, but from someone else’s. Instead of hardcoded geometry, we treat Visual Perspective Taking as something the model can learn using vision and language. It’s a step toward embodied cognition—robots that don’t just see the world, but can imagine how it looks to others. We see this as foundational for true social intelligence in machines.”
The recent work by De Tommaso, Currie, Migno and their colleagues could inspire the creation of similar synthetic datasets for training VLMs on spatial reasoning tasks. Collectively, these efforts could contribute to improving humanoid robots and other embodied AI agents, potentially facilitating their deployment in real-world settings.
“Our next step will be to make the virtual environment as realistic as possible, bringing the distance between a scene from the simulated space and the real world closer,” added Gioele Migno, who graduated in Artificial Intelligence and Robotics from Sapienza University of Rome and recently joined the S4HRI research unit at IIT as a Research Fellow.
“This step is crucial to transfer the knowledge acquired by the model in simulation into the real world, and to make it possible for an embodied robot to exploit spatial reasoning. Once this is achieved, we are then interested in investigating how these capabilities can make interactions with humans more effective in scenarios where they share a spatial understanding of the scene.”
More information:
Joel Currie et al, Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds, arXiv (2025). DOI: 10.48550/arxiv.2505.14366
© 2025 Science X Network