Depressformer: Leveraging Video Swin Transformer and fine-grained local features for depression scale estimation
2024 (English). In: Biomedical Signal Processing and Control, ISSN 1746-8094, E-ISSN 1746-8108, Vol. 96, Part A, article id 106490. Article in journal (Refereed), Published.
Abstract [en]
Background and Objective: By 2030, depression is projected to become the predominant mental disorder. With the rising prominence of depression, a great number of affective computing studies have emerged, the majority emphasizing audiovisual methods for estimating depression scales. Existing studies often overlook the potential patterns in sequential data and do not exploit the fine-grained features of Transformers to model behavioral cues for video-based depression recognition (VDR). Methods: To address the above-mentioned gaps, we present an end-to-end sequential framework called Depressformer for VDR. This architecture comprises three components: the Video Swin Transformer (VST) for deep feature extraction, a module dedicated to depression-specific fine-grained local feature extraction (DFLFE), and a depression channel attention fusion (DCAF) module that fuses the latent local and global features. Using the VST as a backbone network makes it possible to discern pivotal features more effectively, while the DFLFE enriches this process by focusing on the nuanced local features indicative of depression. To enhance the modeling of the combined features pertinent to VDR, the DCAF module is also presented. Results: Our method was extensively validated on the AVEC2013/2014 depression databases. The empirical results underscore its efficacy, yielding a root mean square error (RMSE) of 7.47 and a mean absolute error (MAE) of 5.49 on the first dataset; for the second database, the corresponding values were 7.22 and 5.56, respectively. On the D-vlog dataset, the F1-score is 0.59. Conclusions: In summary, the experimental evaluations suggest that the Depressformer architecture delivers superior performance with stability and adaptability across tasks, making it capable of effectively identifying the severity of depression. Code will be released at: https://github.com/helang818/Depressformer/. © 2024 Elsevier Ltd
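The abstract describes the DCAF module as fusing latent local and global features via channel attention. As a rough illustration only (the paper's actual design is not given here), the sketch below implements a generic squeeze-and-excitation-style channel gate over concatenated per-channel descriptors; the function name, gate shape, and weights are all hypothetical.

```python
import math

def channel_attention_fuse(local_feat, global_feat, w1, w2):
    """Hypothetical sketch of channel-attention fusion in the spirit of DCAF:
    concatenate per-channel descriptors of the local and global branches,
    compute per-channel sigmoid gates with a small two-layer network
    (ReLU then sigmoid), and reweight the fused vector by those gates.
    `w1` (hidden x channels) and `w2` (channels x hidden) are illustrative
    weight matrices, not values from the paper."""
    fused = local_feat + global_feat  # channel-wise concatenation
    # "Excitation": tiny bottleneck MLP over the channel descriptors.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, fused))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Reweight each fused channel by its learned attention gate in (0, 1).
    return [g * x for g, x in zip(gates, fused)]
```

Because each gate lies strictly in (0, 1), the fusion softly suppresses uninformative channels rather than discarding them outright.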
Place, publisher, year, edition, pages
Amsterdam: Elsevier, 2024. Vol. 96, no Part A, article id 106490
Keywords [en]
Channel attention, Depressformer, Depression, Facial regions
National Category
Clinical Medicine
Identifiers
URN: urn:nbn:se:hh:diva-53799; DOI: 10.1016/j.bspc.2024.106490; ISI: 001248571700001; Scopus ID: 2-s2.0-85194918848; OAI: oai:DiVA.org:hh-53799; DiVA id: diva2:1870622
Note
Funding: This work is supported by the National Natural Science Foundation of China (grant 62376215), the Open Fund of the National Engineering Laboratory for Big Data System Computing Technology (grant SZU-BDSC-OF2024-16), the Humanities and Social Sciences Program of the Ministry of Education (22YJCZH048), the Key Research and Development Project of Shaanxi Province (2024GX-YBXM-137), the Open Fund of the Key Laboratory of Modern Teaching Technology, Ministry of Education, the National Natural Science Foundation of China (grant 62276210), the Shaanxi Provincial Social Science Foundation (grant 2021K015), the Shaanxi Provincial Natural Science Foundation (grants 2021JQ-824 and 2022JM-380), and the Key Research and Development Program of Shaanxi (No. 2022ZDLGY06-03).
2024-06-14, 2024-06-14, 2024-08-15. Bibliographically approved