A multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations
Zhengzhou University of Light Industry, Zhengzhou, China. ORCID iD: 0000-0002-5699-0176
Beijing Institute of Technology, Beijing, China.
Beijing Institute of Technology, Beijing, China.
Zhengzhou University of Light Industry, Zhengzhou, China.
2023 (English) In: Information Fusion, ISSN 1566-2535, E-ISSN 1872-6305, Vol. 93, p. 282-301. Article in journal (Refereed). Published.
Abstract [en]

Sarcasm, sentiment and emotion are tightly coupled: each helps in understanding the others. This makes their joint recognition in conversation a research focus in artificial intelligence (AI) and affective computing. Three main challenges exist: context dependency, multimodal fusion and multitask interaction. However, most existing works fail to explicitly leverage and model the relationships among related tasks. In this paper, we aim to address all three problems generically with a single multimodal joint framework. We propose a multimodal multitask learning model based on the encoder–decoder architecture, termed M2Seq2Seq. At the heart of the encoder module are two attention mechanisms, i.e., intramodal (Ia) attention and intermodal (Ie) attention. Ia attention captures the contextual dependency between adjacent utterances, while Ie attention models multimodal interactions. On the decoding side, we design two kinds of multitask learning (MTL) decoders, i.e., single-level and multilevel decoders, to explore their potential. The core of the single-level decoder is a masked outer-modal (Or) self-attention mechanism, whose main motivation is to explicitly model the interdependence among the tasks of sarcasm, sentiment and emotion recognition. The core of the multilevel decoder contains shared gating and task-specific gating networks. Comprehensive experiments on four benchmark datasets, MUStARD, Memotion, CMU-MOSEI and MELD, prove the effectiveness of M2Seq2Seq over state-of-the-art baselines (e.g., CM-GCN, A-MTL), with significant improvements of 1.9%, 2.0%, 5.0%, 0.8%, 4.3%, 3.1%, 2.8%, 1.0%, 1.7% and 2.8% in terms of Micro F1. © 2023 Elsevier B.V.

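To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch of the two encoder attentions (Ia, Ie) and the multilevel gated decoder. All module names, tensor shapes, class counts and the use of nn.MultiheadAttention are illustrative assumptions rather than the authors' implementation, and the single-level decoder's masked outer-modal (Or) self-attention is omitted for brevity.

import torch
import torch.nn as nn

class M2Seq2SeqSketch(nn.Module):
    """Illustrative sketch of the M2Seq2Seq ideas; not the published model."""

    def __init__(self, d_model=128, n_heads=4, class_counts=(2, 3, 7)):
        super().__init__()
        # Intramodal (Ia) attention: contextual dependency between adjacent
        # utterances. One module is shared across modalities here for brevity;
        # per-modality instances would be more faithful.
        self.ia_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Intermodal (Ie) attention: the text stream queries the concatenated
        # audio/visual stream to model multimodal interactions.
        self.ie_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Multilevel decoder: one shared gate plus one gate per task
        # (sarcasm, sentiment, emotion); class_counts are assumed label sizes.
        self.shared_gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.task_gates = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
            for _ in class_counts
        )
        self.heads = nn.ModuleList(nn.Linear(d_model, c) for c in class_counts)

    def forward(self, text, audio, visual):
        # Each input: (batch, utterances, d_model).
        t, _ = self.ia_attn(text, text, text)      # context within each modality
        a, _ = self.ia_attn(audio, audio, audio)
        v, _ = self.ia_attn(visual, visual, visual)
        av = torch.cat([a, v], dim=1)              # stack audio + visual keys
        fused, _ = self.ie_attn(t, av, av)         # cross-modal fusion
        shared = self.shared_gate(fused) * fused   # features common to all tasks
        logits = []
        for gate, head in zip(self.task_gates, self.heads):
            task_feat = gate(fused) * fused + shared  # task-specific + shared path
            logits.append(head(task_feat))            # per-utterance predictions
        return logits  # [sarcasm, sentiment, emotion] logits

# Example: 8 conversations, 10 utterances each, one feature stream per modality.
model = M2Seq2SeqSketch()
feats = [torch.randn(8, 10, 128) for _ in range(3)]
sarcasm_logits, sentiment_logits, emotion_logits = model(*feats)

Under these assumptions, a forward pass takes one (batch, utterances, d_model) tensor per modality and returns one per-utterance logit tensor per task, so the three task losses can simply be summed for joint multitask training.
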
Place, publisher, year, edition, pages
Amsterdam: Elsevier, 2023. Vol. 93, p. 282-301
Keywords [en]
Affective computing, Emotion recognition, Multimodal sarcasm recognition, Multitask learning, Sentiment analysis
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
URN: urn:nbn:se:hh:diva-49903
DOI: 10.1016/j.inffus.2023.01.005
ISI: 000923909900001
Scopus ID: 2-s2.0-85146144725
OAI: oai:DiVA.org:hh-49903
DiVA, id: diva2:1732586
Available from: 2023-01-31. Created: 2023-01-31. Last updated: 2023-08-21. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Tiwari, Prayag

Search in DiVA

By author/editor
Zhang, Yazhou; Tiwari, Prayag
By organisation
School of Information Technology
In the same journal
Information Fusion
Electrical Engineering, Electronic Engineering, Information Engineering
