SAI3D: Segment Any Instance in 3D Scenes


CVPR 2024



Yingda Yin1,2*      Yuzheng Liu2,3*      Yang Xiao4*      Daniel Cohen-Or5      Jingwei Huang6      Baoquan Chen2,3

1School of Computer Science, Peking University     2National Key Lab of General AI, China    
3School of Intelligence Science and Technology, Peking University    
4Ecole des Ponts ParisTech     5Tel-Aviv University     6Tencent    

* equal contribution


Abstract


Advancements in 3D instance segmentation have traditionally been tethered to the availability of annotated datasets, limiting their application to a narrow spectrum of object categories. Recent efforts have sought to harness vision-language models like CLIP for open-set semantic reasoning, yet these methods struggle to distinguish between objects of the same category and rely on specific prompts that are not universally applicable. In this paper, we introduce SAI3D, a novel zero-shot 3D instance segmentation approach that synergistically leverages geometric priors and semantic cues derived from the Segment Anything Model (SAM). Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations that are consistent with the multi-view SAM masks. Moreover, we design a hierarchical region-growing algorithm with a dynamic thresholding mechanism, which substantially improves the robustness of fine-grained 3D scene parsing. Empirical evaluations on ScanNet and the more challenging ScanNet++ dataset demonstrate the superiority of our approach. Notably, SAI3D outperforms existing open-vocabulary baselines and even surpasses fully-supervised methods in class-agnostic segmentation on ScanNet++.



Method




Our approach combines geometric priors with the capabilities of 2D foundation models. We over-segment the 3D point cloud into superpoints (top-left) and generate 2D image masks using SAM (bottom-left). We then construct a scene graph whose edges quantify the pairwise affinity scores between superpoints (middle). Finally, we apply progressive region growing to gradually merge 3D superpoints into the final 3D instance segmentation masks (right).
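The pipeline described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes each superpoint has already been projected into every view and assigned the id of the single SAM mask it falls in (or -1 when not visible), it takes affinity to be the fraction of shared views in which two superpoints land in the same mask, and it uses a hypothetical descending threshold schedule for the merging stage. The names `pairwise_affinity`, `region_affinity`, and `grow_regions` are illustrative only.

```python
from itertools import combinations

def pairwise_affinity(view_labels):
    """view_labels[v][i]: 2D mask id of superpoint i in view v, or -1 if unseen (assumed input)."""
    n = len(view_labels[0])
    affinity = {}
    for i, j in combinations(range(n), 2):
        # Only views in which both superpoints are visible contribute.
        shared = [(v[i], v[j]) for v in view_labels if v[i] >= 0 and v[j] >= 0]
        if shared:
            affinity[(i, j)] = sum(a == b for a, b in shared) / len(shared)
    return affinity

def region_affinity(region_a, region_b, affinity):
    """Mean pairwise affinity between two sets of superpoints (0.0 if no pair was co-visible)."""
    scores = [affinity[(min(i, j), max(i, j))]
              for i in region_a for j in region_b
              if (min(i, j), max(i, j)) in affinity]
    return sum(scores) / len(scores) if scores else 0.0

def grow_regions(n, edges, affinity, thresholds=(0.9, 0.7, 0.5)):
    """Merge adjacent regions level by level; the threshold schedule is an assumption."""
    regions = {i: {i} for i in range(n)}  # region id -> member superpoints
    owner = list(range(n))                # superpoint -> region id
    for t in thresholds:                  # strict merges first, looser ones later
        merged = True
        while merged:
            merged = False
            for i, j in edges:            # edges encode 3D adjacency of superpoints
                a, b = owner[i], owner[j]
                if a != b and region_affinity(regions[a], regions[b], affinity) >= t:
                    regions[a] |= regions.pop(b)
                    for k in regions[a]:
                        owner[k] = a
                    merged = True
    return owner

# Toy scene: 4 superpoints in a chain, observed in 3 views.
views = [[0, 0, 1, 1], [2, 2, 3, -1], [5, 5, 5, 6]]
edges = [(0, 1), (1, 2), (2, 3)]
aff = pairwise_affinity(views)
instances = grow_regions(4, edges, aff)
```

In this toy scene, superpoints 0 and 1 share a mask in every view and merge at the strictest level, while 2 and 3 only merge once the threshold relaxes. Starting with a high threshold means confident merges happen first, so later region-level affinities are averaged over larger, more reliable groups.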



Qualitative Results on ScanNet++/ScanNet


[Interactive viewer omitted: per-scene 3D instance segmentation results, with comparisons shown from two viewpoints (View1, View2).]



Contact


Please feel free to contact Yingda Yin or Yuzheng Liu.