Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs to engage in back-and-forth audio dialogues with humans. This progression not only underscores the potential of LALMs but also broadens their applicability across a wide range of practical scenarios supported by audio dialogues. Despite these advancements, however, a comprehensive benchmark for evaluating the performance of LALMs in open-ended audio dialogue understanding is still absent. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs across 3 general scenarios, 12 skills, 9 multilingual languages, and 4 categories of ambiguity handling. Notably, we are the first to propose evaluating ambiguity handling in audio dialogues, where the same literal sentence can express different intentions, e.g., "Really!?" spoken with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments conducted on 14 LALMs, our analysis reveals that there is still considerable room for improvement in the audio dialogue understanding abilities of existing LALMs. In particular, they struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities arising from different phonetic elements, such as intonations, pause positions, and homophones. The benchmark used in this study can be accessed at https://adu-bench.github.io/.
Figure 1: ADU-Bench evaluates the open-ended audio dialogue understanding of LALMs, where users interact with LALMs directly through audio. Our ADU-Bench consists of 4 datasets, including (a) ADU-General dataset, (b) ADU-Skill dataset, (c) ADU-Multilingual dataset, and (d) ADU-Ambiguity dataset. In total, it encompasses 20,715 open-ended audio dialogues, comprising over 8,000 real-world recordings alongside synthetic audio samples.
Table 1: Data collection and statistics for the 4 datasets in our ADU-Bench, including dataset domains, data sources, and dataset sizes. In total, ADU-Bench consists of 20,715 open-ended audio dialogues for LALMs.
Since ADU-Bench focuses on open-ended audio dialogue understanding, traditional automatic metrics such as WER, ROUGE, and METEOR are not suitable for accurately measuring performance, as they have been shown to have low correlation with human judgments. To address this open-ended evaluation challenge, recent studies have demonstrated that LLM-based evaluation exhibits better alignment with human preferences. Consequently, we propose to adopt the advanced LLM, GPT-4, to evaluate the quality of the responses generated by LALMs.
For evaluation, LALMs first accept audio instructions and either generate textual responses directly or have their audio responses converted into textual format. Subsequently, we provide the textual transcriptions of the audio instructions, the textual references (expected ground truths) generated by GPT-4, and the textual responses generated by LALMs to the GPT-4 evaluator. Finally, the GPT-4 evaluator assigns an overall score on a scale of 0 to 10 for each data point. The judgment is based on criteria including helpfulness, relevance, accuracy, and comprehensiveness, comparing the reference and generated responses. A higher score indicates a stronger overall capability of the LALM in handling open-ended audio dialogues.
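As a concrete illustration of this scoring step, the minimal Python sketch below assumes the OpenAI Chat Completions API and plain-string inputs for one dialogue (the transcribed instruction, the GPT-4-generated reference, and the LALM's textual response). The judging prompt wording and criteria phrasing here are assumptions for illustration, not the exact prompt used by ADU-Bench.

# Hedged sketch of the GPT-4-based scoring step described above.
# Assumptions: the openai>=1.0 Python client; `transcription`, `reference`,
# and `lalm_response` are strings already prepared for one dialogue.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are an impartial evaluator of audio dialogue responses. Given the "
    "user's instruction (transcribed from audio), a reference answer, and a "
    "model's answer, rate the model's answer for helpfulness, relevance, "
    "accuracy, and comprehensiveness. Reply with a single integer from 0 to 10."
)

def score_response(transcription: str, reference: str, lalm_response: str) -> int:
    """Ask the GPT-4 judge for an overall 0-10 score for one dialogue."""
    user_msg = (
        f"[Instruction]\n{transcription}\n\n"
        f"[Reference answer]\n{reference}\n\n"
        f"[Model answer]\n{lalm_response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": user_msg},
        ],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())

Averaging such per-dialogue scores over a dataset yields the dataset-level scores reported in Table 2.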
Figure 2: The evaluation method in our ADU-Bench. To benchmark the open-ended audio dialogue understanding capabilities of LALMs, we adopt a GPT-4 evaluator that provides an evaluation score as the metric.
Table 2: Average evaluation scores for the audio dialogue understanding performance of 14 different LALMs on the 4 datasets in our ADU-Bench.
Figure 3: Evaluation scores across each domain in ADU-Bench, including (a) ADU-General dataset, (b) ADU-Skill dataset, (c) ADU-Multilingual dataset, and (d) ADU-Ambiguity dataset.
Figure 4: Ablation study on ADU-Bench. (a) Both real-world and synthetic audio can serve as evaluation sources. (b) The GPT-4 evaluator is aligned with human evaluation. (c) Scoring twice is necessary to eliminate position bias.
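Figure 4(c) refers to mitigating the evaluator's position bias. A common remedy, shown in the sketch below as an assumption about the exact procedure rather than a confirmed detail of ADU-Bench, is to query the judge twice with the reference and model answers presented in swapped order and average the two scores. The sketch reuses the `client` and `JUDGE_PROMPT` from the earlier example.

def score_twice(transcription: str, reference: str, lalm_response: str) -> float:
    """Query the judge twice with the two answers presented in swapped order,
    then average the scores to counteract position bias (cf. Figure 4c)."""
    ordered = [("[Reference answer]", reference), ("[Model answer]", lalm_response)]
    scores = []
    for blocks in (ordered, list(reversed(ordered))):
        user_msg = f"[Instruction]\n{transcription}\n\n" + "\n\n".join(
            f"{label}\n{text}" for label, text in blocks
        )
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": JUDGE_PROMPT},
                {"role": "user", "content": user_msg},
            ],
            temperature=0,
        )
        scores.append(int(completion.choices[0].message.content.strip()))
    return sum(scores) / len(scores)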
@article{adubench2025,
title={Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models},
author={Anonymous ACL submission},
journal={Under Review},
year={2025}
}