CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

1 University of Trento, Italy
2 MDSR Labs Adobe, India
3 Indian Institute of Technology Bombay, India
4 Fondazione Bruno Kessler, Italy

Comparison of CLIPoint3D with SOTA methods on GraspNetPC-10.
Encoder-based 3D UDA methods (e.g., PointDAN, GAST, MLSP) are accurate but computationally expensive, while CLIP-based extensions fail to bridge the synthetic-real gap.
CLIPoint3D achieves a +16.4% improvement with minimal overhead.

Abstract

Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines.
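The entropy-guided view sampling mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each projected view already has class logits (for example, from CLIP image-text similarities), and the function name `select_confident_views`, the top-k selection rule, and the mean fusion of the kept views are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def select_confident_views(view_logits, k):
    """Pick the k views whose class predictions have the lowest entropy.

    view_logits: array of shape (V, C), one row of class logits per view.
    Returns (indices of the kept views, per-view entropies).
    Hypothetical helper; the selection rule is an assumption.
    """
    probs = softmax(view_logits, axis=-1)
    # Shannon entropy per view; the small constant avoids log(0).
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    keep = np.argsort(entropy)[:k]
    return keep, entropy

# Toy logits for 4 views and 3 classes (made-up numbers).
view_logits = np.array([
    [4.0, 0.0, 0.0],  # confident
    [0.1, 0.0, 0.1],  # nearly uniform
    [0.0, 3.0, 0.0],  # fairly confident
    [0.0, 0.0, 0.0],  # exactly uniform
])
keep, entropy = select_confident_views(view_logits, k=2)

# Fuse only the confident views (simple mean of their probabilities).
fused = softmax(view_logits[keep], axis=-1).mean(axis=0)
```

In this toy case the two peaked views are kept and the near-uniform ones are discarded; how the retained views are actually weighted or fused in CLIPoint3D is not specified on this page.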

The model architecture for CLIPoint3D
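As a rough illustration of the first stage described in the abstract (projecting a 3D sample into multiple depth maps), the sketch below renders a point cloud from several viewpoints. It is a simplified stand-in under stated assumptions, not the projection used by CLIPoint3D: it uses an orthographic camera, rotations about the vertical axis only, six views, and a 32x32 resolution. The names `rotation_y` and `render_depth_map` are hypothetical.

```python
import numpy as np

def rotation_y(angle):
    """Rotation matrix about the y (vertical) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def render_depth_map(points, angle, size=32):
    """Orthographic depth map of an (N, 3) point cloud in [-1, 1]^3.

    The cloud is rotated by `angle` about y, then viewed along +z.
    Pixel values are 'closeness' in [0, 1]: nearer points get larger
    values, and 0 marks empty pixels (or points on the far plane).
    """
    rotated = points @ rotation_y(angle).T
    # Map x, y from [-1, 1] to integer pixel coordinates.
    uv = ((rotated[:, :2] + 1.0) * 0.5 * (size - 1)).round().astype(int)
    uv = np.clip(uv, 0, size - 1)
    closeness = np.clip((1.0 - rotated[:, 2]) * 0.5, 0.0, 1.0)
    depth = np.zeros((size, size))
    # Keep the nearest point per pixel (largest closeness).
    np.maximum.at(depth, (uv[:, 1], uv[:, 0]), closeness)
    return depth

# Toy cloud: random points scaled to fit inside the unit sphere.
rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))
points /= np.linalg.norm(points, axis=1).max()

# Six evenly spaced viewpoints (the view count is an assumption).
angles = np.linspace(0.0, 2.0 * np.pi, 6, endpoint=False)
views = np.stack([render_depth_map(points, a) for a in angles])
```

Each depth map in `views` could then be resized and passed to an image encoder; the preprocessing CLIPoint3D applies before the CLIP backbone is not described on this page.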

Results


Domain adaptation performance on the PointDA-10 benchmark.




Domain adaptation performance on the GraspNetPC-10 benchmark.


t-SNE visualization of CLIPoint3D's performance before and after adaptation.


LLM attributes generation

BibTeX


Copyright: CC BY-NC-SA 4.0 © Mainak Singha | Last updated: 09 Jul 2024 | Template Credit: Nerfies