Large language models (LLMs) and their variants have shown extraordinary efficacy across numerous downstream natural language processing tasks. Despite their remarkable performance in natural language generation, LLMs are not specifically tailored to emotion understanding. As a result, using LLMs for emotion recognition may yield suboptimal precision. Another limitation of current LLMs is that they are typically trained without leveraging multi-modal information. To overcome these limitations, we formally model emotion recognition as a text generation task and propose DialogueLLM, a context- and emotion-knowledge-tuned LLM obtained by fine-tuning foundation large language models. In particular, it is a context-aware model that can accurately capture the dynamics of emotions throughout a dialogue. We also prompt ERNIE Bot with expert-designed prompts to generate textual descriptions of the videos. To support the training of emotional LLMs, we create a large-scale dataset of over 24K utterances to serve as a knowledge corpus. Finally, we offer a comprehensive evaluation of DialogueLLM on three benchmark datasets, where it significantly outperforms 15 state-of-the-art baselines and 3 state-of-the-art LLMs. In an emotional intelligence test, DialogueLLM achieves a score of 109 and surpasses 72% of humans. Additionally, DialogueLLM-7B can be easily reproduced using LoRA on a 40 GB A100 GPU in 5 hours. © 2025 Elsevier Ltd