Swin-MGNet: Swin Transformer based Multi-view Grouping Network for 3D Object Recognition
2025 (English). In: IEEE Transactions on Artificial Intelligence, ISSN 2691-4581, Vol. 6, no 3, p. 747-758. Article in journal (Refereed), Published
Abstract [en]
Recent developments in the Swin Transformer have shown its great potential in various computer vision tasks, including image classification, semantic segmentation, and object detection. However, it is challenging to achieve the desired performance by directly employing the Swin Transformer in multi-view 3D object recognition, since it extracts the features of each view independently and relies heavily on a subsequent fusion strategy to unify the multi-view information. As a result, the interdependencies between the multi-view images are insufficiently captured. To this end, we propose an aggregation strategy integrated into the Swin Transformer that reinforces the connections between internal features across multiple views, leading to a complete interpretation of the otherwise isolated features extracted by the Swin Transformer. Specifically, we use the Swin Transformer to learn view-level feature representations from multi-view images and then compute a discrimination score for each view. These scores are used to assign the view-level features to different groups. Finally, a grouping and fusion network aggregates the features at the view and group levels. Experimental results indicate that our method attains state-of-the-art performance on multi-view 3D object recognition tasks. The source code is available at https://github.com/Qishaohua94/DEST. © 2024 IEEE.
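The grouping-and-fusion step described in the abstract (per-view features, then view discrimination scores, then score-based grouping, then group-level aggregation) can be illustrated with a short sketch. The code below is not the authors' implementation (that is available at the GitHub link above); it is a minimal PyTorch sketch in which the scoring head, the number of groups, the score-quantization rule, and the equal-weight group fusion are all illustrative assumptions. Any backbone producing one D-dimensional feature vector per view (e.g., a Swin Transformer) could supply its input.

```python
import torch
import torch.nn as nn

class GroupingFusion(nn.Module):
    """Minimal sketch of score-based view grouping and fusion (assumed design)."""

    def __init__(self, dim: int, num_groups: int = 4):
        super().__init__()
        self.num_groups = num_groups
        # Hypothetical scoring head: one discrimination score in (0, 1) per view.
        self.score_head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (B, V, D) view-level features from a shared backbone.
        scores = self.score_head(view_feats).squeeze(-1)              # (B, V)
        # Assign each view to a group by quantizing its score into bins.
        bins = torch.clamp((scores * self.num_groups).long(),
                           max=self.num_groups - 1)                   # (B, V)
        group_feats = []
        for g in range(self.num_groups):
            mask = (bins == g).float().unsqueeze(-1)                  # (B, V, 1)
            denom = mask.sum(dim=1).clamp(min=1.0)                    # avoid /0
            group_feats.append((view_feats * mask).sum(dim=1) / denom)
        # Group-level fusion: plain averaging here; the paper's fusion
        # network is more elaborate.
        return torch.stack(group_feats, dim=1).mean(dim=1)            # (B, D)

# Usage with stand-in features: B=2 objects, V=12 views, D=768 dims.
feats = torch.randn(2, 12, 768)
print(GroupingFusion(dim=768)(feats).shape)  # torch.Size([2, 768])
```

In practice, view_feats would come from running a shared Swin backbone over each of the V rendered views of an object and stacking the resulting feature vectors.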
Place, publisher, year, edition, pages
Piscataway, NJ: IEEE, 2025. Vol. 6, no 3, p. 747-758
Keywords [en]
3D Object Classification, 3D Object Retrieval, Feature Fusion, Grouping Mechanism, Multi-view Learning, Swin Transformer
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:hh:diva-54973
DOI: 10.1109/TAI.2024.3492163
Scopus ID: 2-s2.0-85208686299
OAI: oai:DiVA.org:hh-54973
DiVA, id: diva2:1915950
Funder
Vinnova
Swedish Research Council
Note
This work is supported by the National Natural Science Foundation of China (No. 62373343), the Beijing Natural Science Foundation (No. L233036), the Swedish Research Council (VR), and the Swedish Innovation Agency (Vinnova).
Available from: 2024-11-26 Created: 2024-11-26 Last updated: 2025-10-01 Bibliographically approved