Zero-shot 3D shape understanding aims to recognize “unseen” 3D categories that are not present in the training data. Recently, Contrastive Language–Image Pre-training (CLIP) has shown promising open-world performance in zero-shot 3D shape understanding tasks by fusing information across the language and 3D modalities. CLIP-based methods first render 3D objects into multiple 2D image views and then learn the semantic relationships between the textual descriptions and the images, enabling the model to generalize to new and unseen categories. However, existing studies in zero-shot 3D shape understanding rely on predefined rendering parameters, resulting in repetitive, redundant, and low-quality views. This limitation hinders the model's ability to fully comprehend 3D shapes and adversely impacts the text–image fusion in a shared latent space. To this end, we propose a novel approach called Differentiable rendering-based multi-view Image–Language Fusion (DILF) for zero-shot 3D shape understanding. Specifically, DILF leverages large-scale language models (LLMs) to generate textual prompts enriched with 3D semantics and designs a differentiable renderer with learnable rendering parameters to produce representative multi-view images. These rendering parameters are iteratively updated using a text–image fusion loss, which guides the regression of the parameters and allows the model to determine the optimal viewpoint positions for each 3D object. A group-view mechanism is then introduced to model interdependencies across views, enabling efficient information fusion and a more comprehensive 3D shape understanding. Experimental results demonstrate that DILF outperforms state-of-the-art methods for zero-shot 3D classification while maintaining competitive performance for standard 3D classification. The code is available at https://github.com/yuzaiyang123/DILP. © 2023 The Author(s)
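The CLIP-style matching step described in the abstract above (render several views, embed them, compare against text embeddings) can be sketched as follows. This is a minimal illustration and not the DILF implementation: the toy 3-dimensional vectors stand in for real encoder outputs, mean pooling over views is a simple stand-in for the group-view mechanism, and the function name `zero_shot_classify` and the temperature value are assumptions made for the example.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]

def zero_shot_classify(view_embeddings, text_embeddings, temperature=0.01):
    """Pick the label whose text embedding best matches the pooled views.

    view_embeddings: one embedding per rendered 2D view (hypothetical values).
    text_embeddings: dict mapping a label to its prompt embedding.
    temperature: softmax temperature (assumed value, CLIP-like scale).
    """
    views = [normalize(v) for v in view_embeddings]
    dim = len(views[0])
    # Assumption: mean pooling across views, then re-normalization.
    pooled = normalize([sum(v[j] for v in views) / len(views) for j in range(dim)])
    # Cosine similarity to each text prompt, scaled by the temperature.
    logits = {
        label: sum(a * b for a, b in zip(pooled, normalize(t))) / temperature
        for label, t in text_embeddings.items()
    }
    # Numerically stable softmax over the labels.
    top = max(logits.values())
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs

# Toy embeddings for three rendered views of one object (made-up numbers).
views = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.0, 0.3]]
# Toy text embeddings for three candidate category prompts (made-up numbers).
texts = {"chair": [1.0, 0.0, 0.0], "lamp": [0.0, 1.0, 0.0], "table": [0.0, 0.0, 1.0]}
label, probs = zero_shot_classify(views, texts)
```

In a real pipeline the view embeddings would come from an image encoder applied to rendered views and the text embeddings from a text encoder applied to category prompts; no category-specific training is needed at this step, which is what makes the classification zero-shot.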
Few-Shot Class-Incremental Learning (FSCIL) aims to learn new classes incrementally with a limited number of samples per class. It faces two issues: forgetting previously learned classes and overfitting on few-shot classes. An efficient strategy is to learn features that are discriminative in both base and incremental sessions. Current methods improve discriminability by manually designing inter-class margins based on empirical observations, which can be suboptimal. The emerging Neural Collapse (NC) theory provides a theoretically optimal inter-class margin for classification, serving as a basis for adaptively computing the margin. Yet, it is designed for closed, balanced data, not for sequential or few-shot imbalanced data. To address this gap, we propose a Meta-learning- and NC-based FSCIL method, MetaNC-FSCIL, to compute the optimal margin adaptively and maintain it at each incremental session. Specifically, we first compute the theoretically optimal margin based on the NC theory. Then we introduce a novel loss function whose value is minimized precisely when the inter-class margin reaches its theoretical optimum. Motivated by the intuition that “learning how to preserve the margin” matches the meta-learning goal of “learning how to learn”, we embed the loss function in base-session meta-training to preserve the margin for future meta-testing sessions. Experimental results demonstrate the effectiveness of MetaNC-FSCIL, which achieves superior performance on multiple datasets. The code is available at https://github.com/qihangran/metaNC-FSCIL. © 2024 The Author(s)
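The geometry usually associated with Neural Collapse is the simplex equiangular tight frame (ETF): for K classes, unit-norm class prototypes whose pairwise cosine similarity is exactly -1/(K-1), the largest equal separation K vectors can have. The sketch below builds such prototypes and checks that value. Whether MetaNC-FSCIL uses this exact construction is an assumption; the helper names `simplex_etf` and `cosine` are made up for illustration, and the paper's loss function is not reproduced here.

```python
import math

def simplex_etf(num_classes):
    """Return K unit-norm prototypes in R^K that form a simplex ETF.

    Prototype i is sqrt(K / (K - 1)) * (e_i - (1 / K) * ones).
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

num_classes = 5
prototypes = simplex_etf(num_classes)
# Cosine similarity for every distinct pair of class prototypes.
pairwise = [
    cosine(prototypes[i], prototypes[j])
    for i in range(num_classes)
    for j in range(i + 1, num_classes)
]
target = -1.0 / (num_classes - 1)  # -0.25 for K = 5
```

Every entry of `pairwise` equals `target` up to floating-point error, which is the equal-angle property a margin-preserving loss would try to maintain as new classes arrive.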
Three-dimensional human pose and shape estimation aims to compute a full 3D human mesh from a single image. The contamination of features caused by occlusion usually degrades its performance significantly. Recent progress in this field has typically addressed the occlusion problem implicitly. By contrast, in this paper, we address it explicitly using a simple yet effective de-occlusion multi-task learning network. Our key insight is that the features used for mesh parameter regression should be noiseless. Thus, in the feature space, our method disentangles the occludee, which represents the noiseless human feature, from the occluder. Specifically, a spatial regularization and an attention mechanism are imposed in the backbone of our network to disentangle the features into different channels. Furthermore, two segmentation tasks are proposed to supervise the de-occlusion process. The final mesh model is regressed from the disentangled occlusion-aware features. Experiments on both occlusion and non-occlusion datasets show that our method is superior to the state-of-the-art methods on two occlusion datasets, while achieving competitive performance on a non-occlusion dataset. We also demonstrate that the proposed de-occlusion strategy is the main factor in improving robustness against occlusion. The code is available at https://github.com/qihangran/De-occlusion_MTL_HMR. © 2023
Large deep learning models are impressive, but they struggle when real-time data is not available. Few-shot class-incremental learning (FSCIL) poses a significant challenge for deep neural networks: learning new tasks from just a few labeled samples without forgetting the previously learned ones. This setup can easily lead to catastrophic forgetting and overfitting, severely affecting model performance. Studying FSCIL helps overcome the limitations of deep learning models on data volume and acquisition time, while improving the practicality and adaptability of machine learning models. This paper provides a comprehensive survey on FSCIL. Unlike previous surveys, we aim to synthesize few-shot learning and incremental learning, introducing FSCIL from two perspectives while reviewing over 30 theoretical research studies and more than 20 applied research studies. From the theoretical perspective, we provide a novel categorization approach that divides the field into five subcategories: traditional machine learning methods, meta learning-based methods, feature and feature space-based methods, replay-based methods, and dynamic network structure-based methods. We also evaluate the performance of recent theoretical research on benchmark datasets of FSCIL. From the application perspective, FSCIL has achieved impressive results in various fields of computer vision such as image classification, object detection, and image segmentation, as well as in natural language processing and graph learning. We summarize the important applications. Finally, we point out potential future research directions, including applications, problem setups, and theory development. Overall, this paper offers a comprehensive analysis of the latest advances in FSCIL from methodological, performance, and application perspectives. © 2023 The Author(s)
The incremental learning paradigm in machine learning has consistently been a focus of academic research. It is similar to the way in which biological systems learn, and it reduces energy consumption by avoiding excessive retraining. Existing studies utilize the powerful feature extraction capabilities of pre-trained models to address incremental learning, but the feature knowledge of the neural network remains insufficiently utilized. To address this issue, this paper proposes a novel method called Pre-trained Model Knowledge Distillation (PMKD), which combines knowledge distillation of neural network representations with replay. This paper designs a loss function based on centered kernel alignment to transfer representation knowledge from the pre-trained model to the incremental model layer by layer. Additionally, a memory buffer for Dark Experience Replay helps the model retain past knowledge better. Experiments show that PMKD achieves superior performance across various datasets and buffer sizes. Compared with other methods, PMKD reaches the best class-incremental learning accuracy. The open-source code is published at https://github.com/TianSongS/PMKD-IL. © 2023 The Author(s)
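Centered kernel alignment (CKA) measures how similar two sets of representations are for the same inputs, and it is invariant to rotation and isotropic scaling of the features. The sketch below computes linear CKA and turns it into a layer-wise loss of the form 1 - CKA. Both the linear kernel and the 1 - CKA loss form are assumptions chosen for illustration, the names `linear_cka` and `cka_distillation_loss` are hypothetical, and the actual PMKD loss may differ.

```python
import math

def center_columns(feats):
    """Subtract the per-feature mean over the batch (rows are samples)."""
    n, d = len(feats), len(feats[0])
    means = [sum(row[j] for row in feats) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in feats]

def gram(feats):
    """Linear-kernel Gram matrix: inner products between all sample pairs."""
    return [[sum(a * b for a, b in zip(u, v)) for v in feats] for u in feats]

def frob_inner(a, b):
    """Frobenius inner product of two equally sized matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same sample count."""
    kx = gram(center_columns(x))
    ky = gram(center_columns(y))
    return frob_inner(kx, ky) / math.sqrt(frob_inner(kx, kx) * frob_inner(ky, ky))

def cka_distillation_loss(teacher_layers, student_layers):
    """Sum of (1 - CKA) over paired layers (assumed loss form)."""
    return sum(1.0 - linear_cka(t, s) for t, s in zip(teacher_layers, student_layers))

# Toy activations: 4 samples, 2 features (made-up numbers).
teacher = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# Student activations: the teacher's rotated by 90 degrees and scaled by 3.
student = [[-3.0 * b, 3.0 * a] for a, b in teacher]
similarity = linear_cka(teacher, student)
loss = cka_distillation_loss([teacher], [student])
```

Because the Gram matrices are sample-by-sample, the two layers being compared may have different feature widths, which is convenient when the pre-trained and incremental models do not share layer sizes.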
Re-identification (ReID) of occluded persons is a challenging task due to the loss of information in scenes with occlusions. Most existing methods for occluded ReID use 2D-based network structures to directly extract representations from 2D RGB (red, green, and blue) images, which can result in reduced performance in occluded scenes. Because a person is a 3D non-grid object, learning semantic representations in a 2D space can limit the ability to accurately profile an occluded person. Therefore, it is crucial to explore alternative approaches that can effectively handle occlusions and leverage the full 3D nature of a person. To tackle these challenges, in this study, we employ a 3D view-based approach that fully utilizes the geometric information of 3D objects while leveraging advancements in 2D-based networks for feature extraction. Our study is the first to introduce a 3D view-based method in the areas of holistic and occluded ReID. To implement this approach, we propose a random rendering strategy that converts 2D RGB images into 3D multi-view images. We then use a 3D Multi-View Transformation Network for ReID (MV-ReID) to group and aggregate these images into a unified feature space. Compared to 2D RGB images, multi-view images can reconstruct occluded portions of a person in 3D space, enabling a more comprehensive understanding of occluded individuals. Experiments on benchmark datasets demonstrate that the proposed method achieves state-of-the-art results on occluded ReID tasks and exhibits competitive performance on holistic ReID tasks. These results also suggest that our approach has the potential to solve occlusion problems and contribute to the field of ReID. The source code and dataset are available at https://github.com/yuzaiyang123/MV-Reid. © 2023 Elsevier B.V.