CObL: Toward Zero-Shot Ordinal Layering without User Prompting

Abstract

Vision benefits from grouping pixels into objects and understanding their spatial relationships, both laterally and in depth. We capture this with a scene representation comprising an occlusion-ordered stack of "object layers," each containing an isolated and amodally-completed object.

To infer this representation from an image, we introduce a diffusion-based architecture named Concurrent Object Layers (CObL). CObL generates a stack of object layers in parallel using Stable Diffusion as a prior for natural objects and inference-time guidance to ensure the inferred layers composite back to the input image. We train CObL using a few thousand synthetically-generated images of multi-object tabletop scenes, and we find that it zero-shot generalizes to photographs of real-world tabletops with varying numbers of novel objects.
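The constraint that the inferred layers composite back to the input image can be illustrated with a minimal sketch. It assumes, purely for illustration, that each layer is an RGBA array ordered from farthest to nearest and that a separate background image is available. The names `composite_layers` and `reconstruction_error` are hypothetical and not taken from the CObL code; the actual guidance term used at inference time may differ.

```python
import numpy as np

def composite_layers(layers, background):
    """Alpha-composite RGBA layers over a background, back to front.

    layers:     list of (H, W, 4) float arrays in [0, 1], ordered from the
                farthest object to the nearest (assumed convention).
    background: (H, W, 3) float array in [0, 1].
    """
    image = background.astype(np.float64).copy()
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        # Standard "over" operator: the nearer layer hides what is behind it.
        image = alpha * rgb + (1.0 - alpha) * image
    return image

def reconstruction_error(layers, background, target):
    """Mean squared error between the composite and a target image."""
    diff = composite_layers(layers, background) - target
    return float(np.mean(diff ** 2))

# Toy 2x2 scene: a red object in the left column, partly hidden by a
# nearer blue object covering the top row.
background = np.zeros((2, 2, 3))
far = np.zeros((2, 2, 4))
far[:, 0] = [1.0, 0.0, 0.0, 1.0]
near = np.zeros((2, 2, 4))
near[0, :] = [0.0, 0.0, 1.0, 1.0]

image = composite_layers([far, near], background)
error = reconstruction_error([far, near], background, image)
```

In this toy scene the far layer stores the whole red object, including the pixel hidden by the blue one, while the composite shows only its visible part. A guidance signal in this spirit would penalize layer stacks whose composite drifts from the input photograph.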

In contrast to recent models for amodal object completion, CObL reconstructs multiple occluded objects without user prompting and without knowing the number of objects beforehand. Unlike previous models for unsupervised object-centric representation learning, CObL is not limited to the world it was trained in.

Concurrent Object Layers (CObL)

CObL discovers ordered object layers from images containing occlusions. Each object layer contains a single object, and CObL attempts to estimate and complete its occluded regions without requiring user input such as text prompts or object masks. We train additional modules attached to fixed instances of Stable Diffusion to extract object layers from synthetic scenes, and find that our model generalizes to object layer discovery in real-world scenes.

Ordinal object layers

The object layer representation generated by CObL encompasses many other common output representations, including but not limited to amodal completion of the occluded objects, occlusion-ordered layering, amodal segmentation and panoptic segmentation.
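As a rough sketch of how these outputs follow from an ordered layer stack, the snippet below derives amodal masks, visible (modal) masks, and an instance-id map from per-layer alpha channels. It assumes the layers are ordered from farthest to nearest and that thresholding alpha at 0.5 gives a usable mask. The function name `derive_outputs` is hypothetical, and the id map carries no class labels, so it is only panoptic-like.

```python
import numpy as np

def derive_outputs(alphas, threshold=0.5):
    """Derive masks from an ordered stack of layer alphas.

    alphas: (N, H, W) float array, layers ordered far to near (assumed).
    Returns (amodal, visible, labels):
      amodal:  (N, H, W) bool, full extent of each object
      visible: (N, H, W) bool, part not hidden by nearer objects
      labels:  (H, W) int, 0 for background, i + 1 for layer i
    """
    amodal = alphas > threshold
    visible = np.zeros_like(amodal)
    covered = np.zeros(amodal.shape[1:], dtype=bool)
    # Walk from the nearest layer to the farthest, tracking covered pixels.
    for i in range(amodal.shape[0] - 1, -1, -1):
        visible[i] = amodal[i] & ~covered
        covered |= amodal[i]
    labels = np.zeros(amodal.shape[1:], dtype=np.int64)
    for i in range(amodal.shape[0]):
        labels[visible[i]] = i + 1
    return amodal, visible, labels

# Toy 2x3 scene: the far object spans columns 0-1, the near one columns 1-2.
alphas = np.zeros((2, 2, 3))
alphas[0, :, 0:2] = 1.0
alphas[1, :, 1:3] = 1.0

amodal, visible, labels = derive_outputs(alphas)
```

Here the far object keeps its full four-pixel amodal mask, while its visible mask shrinks to the two pixels of column 0 that the near object does not cover. The layer index itself gives the occlusion order.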

Learning from synthetic data

We introduce an efficient procedural generation scheme that uses Blender and ControlNet-depth to generate diverse synthetic images and ground-truth object layers from 3D assets. We focus on the domain of tabletop scenes, but this pipeline can be extended to other scene layouts.

We find experimentally that this data is sufficiently realistic to bridge the sim-to-real gap. When CObL is trained on this data, it generalizes to captured photographs without finetuning.

TableTop: A real-world object-layer evaluation benchmark

We create TableTop to test how well object-layer inference generalizes to real-world objects. By iteratively placing objects in a scene, we can record ground-truth object layers.
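One plausible way to turn iterative placement into layers, offered only as an illustration and not as the procedure actually used for TableTop, is to photograph the scene after each placement from a fixed camera and difference consecutive frames. The sketch assumes objects are placed back to front, so each new object is fully visible at the moment it is added. The function `layers_from_placements` and its `threshold` parameter are hypothetical.

```python
import numpy as np

def layers_from_placements(frames, threshold=0.1):
    """Estimate one RGBA layer per placed object by frame differencing.

    frames: list of (H, W, 3) float arrays in [0, 1]; frames[0] is the
            empty scene and frames[k] is taken after placing object k.
    Assumes a fixed camera, stable lighting, and back-to-front placement.
    """
    layers = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Pixels that changed belong to the newly placed object.
        changed = np.abs(curr - prev).max(axis=-1) > threshold
        rgba = np.zeros(curr.shape[:2] + (4,), dtype=np.float64)
        rgba[..., :3] = np.where(changed[..., None], curr, 0.0)
        rgba[..., 3] = changed
        layers.append(rgba)
    return layers

# Toy 2x2 capture sequence: empty scene, then a red object in the left
# column, then a blue object across the top row.
empty = np.zeros((2, 2, 3))
after_first = empty.copy()
after_first[:, 0] = [1.0, 0.0, 0.0]
after_second = after_first.copy()
after_second[0, :] = [0.0, 0.0, 1.0]

layers = layers_from_placements([empty, after_first, after_second])
```

Simple differencing of this kind is sensitive to shadows, reflections, and objects whose color matches what lies behind them, so a real capture pipeline would likely need more careful masking.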

BibTeX

@article{damaraju2025cobl,
      author    = {Damaraju, Aneel and Hazineh, Dean and Zickler, Todd},
      title     = {CObL: Toward Zero-Shot Ordinal Layering without User Prompting},
      journal   = {ICCV},
      year      = {2025},
    }