Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs to engage in back-and-forth audio dialogues with humans. This progression not only underscores the potential of LALMs but also broadens their applicability across a wide range of practical scenarios supported by audio dialogues. Despite these advancements, however, a comprehensive benchmark for evaluating the performance of LALMs in open-ended audio dialogue understanding is still absent. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs across 3 general scenarios, 12 skills, 9 multilingual languages, and 4 categories of ambiguity handling. Notably, we are the first to propose evaluating ambiguity handling in audio dialogues, where the same literal sentence can express different intentions, e.g., "Really!?" spoken with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments conducted on 14 LALMs, our analysis reveals that there is still considerable room for improvement in the audio dialogue understanding abilities of existing LALMs. In particular, they struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities arising from different phonetic elements, such as intonations, pause positions, and homophones. The benchmark used in this study can be accessed at https://adu-bench.github.io/.
Figure 1: ADU-Bench evaluates the open-ended audio dialogue understanding of LALMs, where users interact with LALMs directly through audio. Our ADU-Bench consists of 4 datasets, including (a) ADU-General dataset, (b) ADU-Skill dataset, (c) ADU-Multilingual dataset, and (d) ADU-Ambiguity dataset. In total, it encompasses 20,715 open-ended audio dialogues, comprising over 8,000 real-world recordings alongside synthetic audio samples.
Table 1: Data collection and statistics for the 4 datasets in our ADU-Bench, including dataset domains, data sources, and dataset sizes. In total, ADU-Bench consists of 20,715 open-ended audio dialogues for LALMs.
Since ADU-Bench focuses on open-ended audio dialogue understanding, traditional automatic metrics such as WER, ROUGE, and METEOR are not suitable for accurately measuring performance, as they have been shown to have low correlation with human judgments. To address this open-ended evaluation challenge, recent studies have demonstrated that LLM-based evaluation exhibits better alignment with human preferences. Consequently, we propose to adopt the advanced LLM, GPT-4, to evaluate the quality of the responses generated by LALMs.
For evaluation, LALMs first accept audio instructions and either generate textual responses directly or have their audio responses converted into textual format. Subsequently, we provide the textual transcriptions of the audio instructions, the textual references (expected ground truths) generated by GPT-4, and the textual responses generated by LALMs to the GPT-4 evaluator. Finally, the GPT-4 evaluator assigns an overall score on a scale of 0 to 10 for each data point. The judgment is based on criteria including helpfulness, relevance, accuracy, and comprehensiveness, comparing the reference and generated responses. A higher score indicates a stronger overall capability of the LALM in handling open-ended audio dialogues.
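As a concrete illustration of this scoring step, the minimal Python sketch below assumes the OpenAI Chat Completions API and plain-string inputs for one dialogue (the transcribed instruction, the GPT-4-generated reference, and the LALM's textual response). The judging prompt wording and criteria phrasing here are assumptions for illustration, not the exact prompt used by ADU-Bench.

# Hedged sketch of the GPT-4-based scoring step described above.
# Assumptions: the openai>=1.0 Python client; `transcription`, `reference`,
# and `lalm_response` are strings already prepared for one dialogue.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are an impartial evaluator of audio dialogue responses. Given the "
    "user's instruction (transcribed from audio), a reference answer, and a "
    "model's answer, rate the model's answer for helpfulness, relevance, "
    "accuracy, and comprehensiveness. Reply with a single integer from 0 to 10."
)

def score_response(transcription: str, reference: str, lalm_response: str) -> int:
    """Ask the GPT-4 judge for an overall 0-10 score for one dialogue."""
    user_msg = (
        f"[Instruction]\n{transcription}\n\n"
        f"[Reference answer]\n{reference}\n\n"
        f"[Model answer]\n{lalm_response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": user_msg},
        ],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())

Averaging such per-dialogue scores over a dataset yields the dataset-level scores reported in Table 2.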
Figure 2: The evaluation method in our ADU-Bench. To benchmark the open-ended audio dialogue understanding capabilities of LALMs, we adopt a GPT-4 evaluator that provides an evaluation score as the metric.
Table 2: Average evaluation scores for the audio dialogue understanding performance of 14 different LALMs on the 4 datasets in our ADU-Bench.
Figure 3: Evaluation scores across each domain in ADU-Bench, including (a) ADU-General dataset, (b) ADU-Skill dataset, (c) ADU-Multilingual dataset, and (d) ADU-Ambiguity dataset.
Figure 4: Ablation study on ADU-Bench. (a) Both real-world and synthetic audio can serve as evaluation sources. (b) The GPT-4 evaluator is aligned with human evaluation. (c) Scoring twice is necessary to eliminate position bias.
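Figure 4(c) refers to mitigating the evaluator's position bias. A common remedy, shown in the sketch below as an assumption about the exact procedure rather than a confirmed detail of ADU-Bench, is to query the judge twice with the reference and model answers presented in swapped order and average the two scores. The sketch reuses the `client` and `JUDGE_PROMPT` from the earlier example.

def score_twice(transcription: str, reference: str, lalm_response: str) -> float:
    """Query the judge twice with the two answers presented in swapped order,
    then average the scores to counteract position bias (cf. Figure 4c)."""
    ordered = [("[Reference answer]", reference), ("[Model answer]", lalm_response)]
    scores = []
    for blocks in (ordered, list(reversed(ordered))):
        user_msg = f"[Instruction]\n{transcription}\n\n" + "\n\n".join(
            f"{label}\n{text}" for label, text in blocks
        )
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": JUDGE_PROMPT},
                {"role": "user", "content": user_msg},
            ],
            temperature=0,
        )
        scores.append(int(completion.choices[0].message.content.strip()))
    return sum(scores) / len(scores)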
@article{adubench2025,
title={Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models},
author={Anonymous ACL submission},
journal={Under Review},
year={2025}
}