A good spatial dataset is not just a set of images with coordinates attached. It is a record of how a world exists and changes over time.
Most progress in AI has followed data. Text data led to language models. Image data led to visual recognition. Video data added motion. Spatial intelligence is different. It is about understanding where things are, how they relate to each other, and how they persist even when we do not see them. Many current datasets do not teach this. They focus on appearance, not structure.
The first question a spatial dataset must answer is what the model is learning about. In language, the unit is a word or token. In vision, it is a pixel or a patch. In spatial intelligence, the unit is the state of the world. A good dataset should show the same world across time and from different viewpoints. Objects should be treated as things that continue to exist, not as detections that appear and disappear with each frame.
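To make this concrete, here is a minimal sketch of what such a record could look like. Every name here (WorldFrame, ObjectState, the kitchen_03 world, the mug_7 identifier) is hypothetical; the only point is that identity lives in the data itself, rather than being re-detected in each frame.

```python
from dataclasses import dataclass

@dataclass
class ObjectState:
    object_id: str    # stable identity, reused in every frame of this world
    pose: tuple       # (x, y, z, yaw) in world coordinates
    visible: bool     # the object may be occluded, but it still exists

@dataclass
class WorldFrame:
    world_id: str     # the same world across the whole sequence
    timestamp: float
    camera_pose: tuple    # the viewpoint changes; the world does not
    objects: list         # list of ObjectState, keyed by persistent identity

# Two frames of the same world: the mug keeps its identity even when hidden.
frame_0 = WorldFrame("kitchen_03", 0.0, (0.0, 0.0, 1.5, 0.0),
                     [ObjectState("mug_7", (1.2, 0.4, 0.9, 0.0), visible=True)])
frame_1 = WorldFrame("kitchen_03", 0.5, (0.3, 0.0, 1.5, 0.4),
                     [ObjectState("mug_7", (1.2, 0.4, 0.9, 0.0), visible=False)])
```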
This is why static images are not enough. A single image can teach recognition, but not understanding. Even short video clips often fall short. Spatial understanding comes from interaction. When the camera moves, the view changes. When an object is pushed, its position changes. Actions have consequences. A good spatial dataset must include this link between perception and action.
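One way to encode that link is to store transitions rather than isolated frames: what was seen, what was done, and what was seen afterwards. The schema below is an illustrative assumption, not a description of any existing dataset; the field names and action types are made up.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One perception-action-consequence step.

    observation / next_observation could be image paths, depth maps, or
    rendered views; action is whatever the agent or camera did in between.
    """
    observation: str        # e.g. path to the rendered view at time t
    action: dict            # e.g. {"type": "push", "object_id": "mug_7", "dx": 0.1}
    next_observation: str   # the view at time t+1, after the action took effect

# A step where moving the camera changes the view,
# and a step where pushing an object changes the world.
steps = [
    Transition("frame_000.png", {"type": "move_camera", "dyaw": 0.2}, "frame_001.png"),
    Transition("frame_001.png", {"type": "push", "object_id": "mug_7", "dx": 0.1}, "frame_002.png"),
]
```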
Consistency matters more than size. Very large datasets with inconsistent geometry or weak physical structure teach models to ignore space. Smaller datasets with clean, coherent worlds can teach much more. In this sense, synthetic data is not a shortcut. It is often the right tool. Simulation lets us control geometry, motion, and cause and effect. What matters is not whether data is real or synthetic, but whether the world it represents makes sense.
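Geometric coherence is also something a dataset pipeline can check directly. The sketch below shows one such check under standard pinhole-camera assumptions: a world point, reprojected through the annotated camera poses, should land where the annotations say it does. The intrinsics, poses, and point here are toy values, not drawn from any particular dataset.

```python
import numpy as np

def reproject(point_world, cam_pose, K):
    """Project a 3D world point into a camera with pose (R, t) and intrinsics K."""
    R, t = cam_pose
    p_cam = R @ point_world + t
    p_img = K @ p_cam
    return p_img[:2] / p_img[2]

# Toy check: the same world point, seen from two annotated poses, should land
# where the dataset claims. Large residuals signal incoherent geometry.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
point = np.array([1.0, 0.2, 4.0])
pose_a = (np.eye(3), np.zeros(3))
pose_b = (np.eye(3), np.array([-0.3, 0.0, 0.0]))  # second camera shifted along x

uv_a = reproject(point, pose_a, K)
uv_b = reproject(point, pose_b, K)
print(uv_a, uv_b)  # consistent annotations should match these within a few pixels
```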
Time horizon is another key factor. Many datasets reset the scene every few seconds. Real spatial reasoning does not work this way. It depends on memory. An agent must remember where something was seen before and reason about where it might be now. A good spatial dataset should support long sequences and stable world identity, so that models are encouraged to build internal maps.
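One way to encourage internal maps is to derive memory probes from long sequences: questions about objects that have been out of view for a while. The helper below is a hypothetical sketch; it assumes objects stay put while unobserved, and a simulator could instead supply the true current position as the answer.

```python
def make_memory_probes(frames, min_gap=50):
    """Turn a long sequence into 'where is it now?' memory probes.

    frames: list of dicts like {"t": int, "objects": {object_id: (x, y, z)}}
    listing only the objects visible at time t. A probe asks, at query_t, for
    an object that has not been seen for at least min_gap steps.
    """
    last_seen = {}   # object_id -> (time, position) of the last visible sighting
    probed = set()   # (object_id, last_seen_t) pairs already turned into probes
    probes = []
    for frame in frames:
        t, visible = frame["t"], frame["objects"]
        for obj_id, (seen_t, seen_pos) in last_seen.items():
            if obj_id not in visible and t - seen_t >= min_gap and (obj_id, seen_t) not in probed:
                # Assumes the object stayed put while unobserved.
                probes.append({"query_t": t, "object_id": obj_id,
                               "last_seen_t": seen_t, "answer": seen_pos})
                probed.add((obj_id, seen_t))
        for obj_id, pos in visible.items():
            last_seen[obj_id] = (t, pos)
    return probes

# A mug visible for three frames, then hidden for the rest of the sequence.
frames = [{"t": t, "objects": ({"mug_7": (1.2, 0.4, 0.9)} if t < 3 else {})}
          for t in range(100)]
print(make_memory_probes(frames, min_gap=50))
```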
Partial observation is also essential. In the real world, we never see everything at once. Objects are hidden, views are limited, and sensors are noisy. A spatial dataset should include these limits by design. Learning to reason under uncertainty is part of spatial intelligence, not an edge case.
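In synthetic pipelines, those limits can be imposed deliberately when observations are generated. The sketch below assumes a toy 2D setup and a hypothetical `observe` function: a field-of-view cone, a range limit, and Gaussian noise on whatever remains visible.

```python
import math
import random

def observe(agent_pose, objects, fov_deg=90.0, max_range=5.0, noise_std=0.05):
    """Return only what the agent could plausibly see from its pose.

    agent_pose: (x, y, heading_rad); objects: {object_id: (x, y)}.
    Objects outside the field of view or beyond max_range are dropped,
    and the rest are reported as noisy (range, bearing) measurements.
    """
    ax, ay, heading = agent_pose
    visible = {}
    for obj_id, (ox, oy) in objects.items():
        dx, dy = ox - ax, oy - ay
        dist = math.hypot(dx, dy)
        bearing = math.atan2(dy, dx) - heading
        bearing = math.atan2(math.sin(bearing), math.cos(bearing))  # wrap to [-pi, pi]
        if dist > max_range or abs(bearing) > math.radians(fov_deg) / 2:
            continue  # hidden: the object still exists, it just is not observed
        visible[obj_id] = (dist + random.gauss(0, noise_std),
                           bearing + random.gauss(0, noise_std))
    return visible

print(observe((0.0, 0.0, 0.0), {"mug_7": (2.0, 0.5), "chair_2": (-1.0, 0.0)}))
```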
Annotations should focus on structure rather than labels. Knowing that something is a chair is less important than knowing that it supports weight, blocks movement, or connects two spaces. Relationships and constraints matter more than names. The goal is to understand how the world works, not just what things are called.
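Concretely, such annotations look less like class labels and more like a graph of relations and constraints. The structure and relation names below are invented for illustration; the point is that the edges, not the node names, carry the useful information.

```python
# A relational annotation: edges describe what objects do, not just what they are.
scene_graph = {
    "nodes": {
        "table_1": {"supports_weight": True, "movable": False},
        "mug_7":   {"supports_weight": False, "movable": True},
        "door_3":  {"connects": ["kitchen", "hallway"], "blocks_when_closed": True},
    },
    "edges": [
        ("mug_7", "rests_on", "table_1"),      # support relation: move the table, the mug moves
        ("table_1", "blocks_path", "door_3"),  # constraint: the table obstructs the doorway
    ],
}
```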
Evaluation should reflect these goals. Accuracy on single frames is not enough. We should ask whether a model can predict future states, plan over multiple steps, and remain consistent over time. If we do not measure memory, causality, and long-term reasoning, our datasets will not encourage them.
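A minimal version of such an evaluation scores predicted future states against the simulator's ground truth at several horizons, penalising forgotten objects instead of skipping them. The function and the miss penalty below are illustrative assumptions, not an established benchmark metric.

```python
import math

def future_state_error(predicted, ground_truth, miss_cost=1.0):
    """Mean position error between predicted and true object states at a future step.

    predicted / ground_truth: {object_id: (x, y, z)}. Objects the model forgot
    entirely incur a fixed miss_cost (a hypothetical penalty, in metres).
    """
    errors = []
    for obj_id, true_pos in ground_truth.items():
        if obj_id not in predicted:
            errors.append(miss_cost)
            continue
        errors.append(math.dist(predicted[obj_id], true_pos))
    return sum(errors) / len(errors)

# Scored at a future step, not on the current frame: one object drifted slightly,
# the other was forgotten entirely and draws the miss penalty.
true_future = {"mug_7": (1.2, 0.4, 0.9), "chair_2": (0.0, 2.0, 0.0)}
predicted   = {"mug_7": (1.1, 0.5, 0.9)}
print(future_state_error(predicted, true_future))
```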
Spatial intelligence will not emerge by accident. It will come from data that is designed to teach persistence, interaction, and structure. When datasets reflect how the world actually behaves, models will begin to understand space not as images, but as reality.