CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

¹University of Trento, Italy
²MDSR Labs Adobe, India
³Indian Institute of Technology Bombay, India
⁴Fondazione Bruno Kessler, Italy

Abstract

Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines.
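The abstract does not spell out how entropy-guided view sampling is computed, so the following is only a minimal sketch of one plausible reading: score each projected view by the Shannon entropy of its class distribution and keep the lowest-entropy (most confident) views. The function names, the top-k selection rule, and the toy logits are assumptions made for illustration and are not taken from the CLIPoint3D implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def view_entropy(logits):
    """Shannon entropy (in nats) of each view's class distribution.

    logits: array of shape (num_views, num_classes), e.g. per-view
    image-text similarity scores (hypothetical input for this sketch).
    """
    probs = softmax(logits, axis=-1)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def select_confident_views(logits, k):
    """Return indices of the k lowest-entropy views, plus all entropies.

    Assumption: "confident" means low predictive entropy; the actual
    selection rule used by the authors may differ.
    """
    entropy = view_entropy(logits)
    keep = np.argsort(entropy)[:k]
    return keep, entropy

# Toy example: 4 depth-map views of one point cloud, 3 classes.
toy_logits = np.array([
    [8.0, 0.0, 0.0],   # sharply peaked: very confident
    [1.0, 1.0, 1.0],   # uniform: maximally uncertain
    [3.0, 2.5, 0.0],   # two competing classes
    [0.0, 6.0, 0.0],   # peaked, slightly less than view 0
])
keep, entropy = select_confident_views(toy_logits, k=2)
print("kept views:", keep.tolist())
```

In this toy case the two sharply peaked views are kept, and the uniform view is discarded because its entropy is the maximum possible for three classes (ln 3).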

Results

Domain adaptation performance on the PointDA-10 benchmark.

Domain adaptation performance on the GraspNetPC-10 benchmark.

t-SNE visualization of CLIPoint3D's performance before and after adaptation.

LLM attributes generation

BibTeX

Comparison of CLIPoint3D with SOTA methods on GraspNetPC-10.
Encoder-based 3D UDA methods (e.g., PointDAN, GAST, MLSP) are accurate but computationally expensive, while CLIP-based extensions fail to bridge the synthetic-real gap.
CLIPoint3D achieves a 16.4% improvement with minimal overhead.

The model architecture for CLIPoint3D
