hh.se Publications
DILF: Differentiable rendering-based multi-view Image–Language Fusion for zero-shot 3D shape understanding
Chinese Academy Of Sciences, Beijing, China. ORCID iD: 0000-0001-7897-1673
Chinese Academy Of Sciences, Beijing, China; University Of Chinese Academy Of Sciences, Beijing, China. ORCID iD: 0000-0002-3425-1153
Old Dominion University, Norfolk, United States. ORCID iD: 0000-0002-4323-2632
Chinese Academy Of Sciences, Beijing, China. ORCID iD: 0000-0001-9668-2883
Show others and affiliations
2024 (English). In: Information Fusion, ISSN 1566-2535, E-ISSN 1872-6305, Vol. 102, p. 1-12, article id 102033. Article in journal (Refereed). Published.
Abstract [en]

Zero-shot 3D shape understanding aims to recognize “unseen” 3D categories that are not present in the training data. Recently, Contrastive Language–Image Pre-training (CLIP) has shown promising open-world performance on zero-shot 3D shape understanding tasks through information fusion between the language and 3D modalities. The approach first renders 3D objects into multiple 2D image views and then learns the semantic relationships between textual descriptions and images, enabling the model to generalize to new, unseen categories. However, existing studies in zero-shot 3D shape understanding rely on predefined rendering parameters, resulting in repetitive, redundant, and low-quality views. This limitation hinders the model's ability to fully comprehend 3D shapes and adversely impacts text–image fusion in a shared latent space. To this end, we propose a novel approach called Differentiable rendering-based multi-view Image–Language Fusion (DILF) for zero-shot 3D shape understanding. Specifically, DILF leverages large language models (LLMs) to generate textual prompts enriched with 3D semantics and designs a differentiable renderer with learnable rendering parameters to produce representative multi-view images. These rendering parameters are iteratively updated using a text–image fusion loss, which regresses them toward the optimal viewpoint positions for each 3D object. A group-view mechanism is then introduced to model interdependencies across views, enabling efficient information fusion for a more comprehensive 3D shape understanding. Experimental results demonstrate that DILF outperforms state-of-the-art methods on zero-shot 3D classification while maintaining competitive performance on standard 3D classification. The code is available at https://github.com/yuzaiyang123/DILP. © 2023 The Author(s)
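The abstract's core idea — iteratively regressing learnable rendering parameters (e.g. camera azimuth and elevation) with a text–image fusion loss — can be sketched in miniature. The sketch below is a hypothetical toy, not the paper's implementation: `render_features` stands in for a differentiable renderer plus image encoder, `text_feat` for a CLIP-style text embedding, and the gradient is taken numerically rather than through a real autodiff renderer such as the one DILF designs.

```python
import numpy as np

def render_features(view):
    """Toy stand-in: maps a viewpoint (azimuth, elevation) to a 4-D
    'image feature'. A real pipeline would render the 3D shape from
    this view and encode the image with a frozen CLIP image encoder."""
    az, el = view
    return np.array([np.cos(az), np.sin(az), np.cos(el), np.sin(el)])

# Fixed stand-in for the text embedding of an LLM-generated prompt.
text_feat = np.array([1.0, 0.0, 0.0, 1.0])
text_feat /= np.linalg.norm(text_feat)

def fusion_loss(view):
    """Text-image fusion loss: 1 - cosine similarity between the
    rendered-view feature and the text feature."""
    img = render_features(view)
    img = img / np.linalg.norm(img)
    return 1.0 - img @ text_feat

def grad(view, eps=1e-5):
    """Central-difference gradient of the loss w.r.t. the view."""
    g = np.zeros_like(view)
    for i in range(len(view)):
        hi, lo = view.copy(), view.copy()
        hi[i] += eps
        lo[i] -= eps
        g[i] = (fusion_loss(hi) - fusion_loss(lo)) / (2 * eps)
    return g

# Start from an arbitrary viewpoint and regress the rendering
# parameters by gradient descent on the fusion loss.
view = np.array([2.0, 2.0])
for _ in range(200):
    view -= 0.1 * grad(view)

print(fusion_loss(view))  # loss shrinks toward 0 as the view improves
```

In this toy the loss is minimized when the viewpoint aligns the rendered feature with the text feature; in DILF the same loop would instead drive a differentiable renderer so each 3D object ends up photographed from its most text-consistent viewpoints.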

Place, publisher, year, edition, pages
Amsterdam: Elsevier, 2024. Vol. 102, p. 1-12, article id 102033
Keywords [en]
Differentiable rendering, Information fusion, Text–image fusion, Zero-shot 3D shape understanding
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:hh:diva-51777
DOI: 10.1016/j.inffus.2023.102033
ISI: 001084085200001
Scopus ID: 2-s2.0-85172076357
OAI: oai:DiVA.org:hh-51777
DiVA, id: diva2:1812539
Note

Funding agencies:
National Natural Science Foundation of China (NSFC), grant number 6237334
Beijing Natural Science Foundation, grant number L233036

Available from: 2023-11-16. Created: 2023-11-16. Last updated: 2025-10-01. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Tiwari, Prayag

Search in DiVA

By author/editor
Ning, Xin; Yu, Zaiyang; Li, Lusi; Li, Weijun; Tiwari, Prayag
By organisation
School of Information Technology
In the same journal
Information Fusion
Natural Language Processing

Search outside of DiVA

Google
Google Scholar
