Three‐Dimensional Object Perception Can Emerge From Predictive Learning

John Day, Tushar Arora, Jirui Liu, Li Erran Li, Ming Bo Cai

Published online on June 04, 2026

Abstract

Developmental Science, Volume 29, Issue 4, July 2026.

How do infants develop the ability to perceive objects in a 3D world? The theory of core knowledge suggests that infants employ a few principles, such as cohesion, continuity, rigidity, and contact, to guide the inference of objects. However, studies of infant behavior alone make it difficult to determine how object perception can be learned under constraints similar to those infants face, and whether these principles are sufficient for that learning. We hypothesize that the construct of objects emerges to serve efficient prediction. We tested the computational sufficiency of these principles with a deep neural network model trained to predict future visual input in a simplified virtual environment that mimics key constraints faced by the infant brain. The model simultaneously learns three fundamental perceptual capabilities without supervision: depth perception, object segmentation, and 3D localization of objects from single images. Its internal representation of objects reflects their shapes and textures. Of the core knowledge principles, cohesion, continuity, and rigidity form a sufficient subset that allows our model to learn object perception in the environment tested, without incorporating the contact principle. Relaxing the assumption of rigidity harms depth perception and 3D localization of objects, but preserves 2D object segmentation. Our findings suggest that predictive learning is a candidate mechanism for driving the emergence of object perception in early development.

Summary

Predictive learning drives the emergence of object perception in a novel neural network without supervision.
The model learns depth perception, object segmentation, and 3D localization in a virtual environment with only visual inputs and self‐motion information.
We tested the computational sufficiency of the assumptions of cohesion, continuity, and rigidity in core knowledge theory under constraints similar to those infants face.
Relaxing the rigidity assumption impairs depth perception and 3D localization, but not 2D object segmentation.