Casablanca: Data and Models for Multidialectal Arabic Speech Recognition
In spite of the recent progress in speech processing and synthesis, the majority of world languages and dialects remain uncovered. This situation only furthers an already wide technological divide, thereby hindering technological and socioeconomic inclusion. This challenge is largely due to the absence of datasets that can empower diverse speech systems. In this paper, we seek to mitigate this obstacle for a number of Arabic dialects by describing a community effort to collect and transcribe a new dataset, Casablanca. We also develop a number of strong baselines exploiting Casablanca.

Features
What you get with Casablanca
Arabic Transcription
Casablanca provides high-quality, manual transcriptions of Arabic speech from various dialects, ensuring accurate phonetic representation across multiple regional accents.
Gender
The dataset includes balanced representation from male and female speakers, enabling the development of gender-aware speech models.
Dialect
Casablanca covers a wide range of Arabic dialects, including Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni varieties. This diversity allows for the training and evaluation of dialect-specific speech recognition systems that address the unique characteristics of each dialect.
Code-Switching (Latin)
The dataset captures instances of code-switching between Arabic and English or French, with the switched words written in Latin script, reflecting common speech patterns in multilingual environments.
Code-Switching (Transliterated)
In addition to code-switching in Latin script, Casablanca includes code-switched words that have been transliterated into Arabic script.
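As a rough illustration of how the Latin-script code-switching annotation could be used, the sketch below pulls Latin-script tokens out of a mixed Arabic/Latin transcription and computes the share of code-switched tokens. The helper names `latin_tokens` and `code_switch_ratio` and the sample string are hypothetical; they are not part of any Casablanca tooling, and the character ranges are an assumption that covers English and accented French letters.

```python
import re

# Assumption: code-switched words use basic Latin letters plus the accented
# letters found in French. The ranges skip the multiplication and division signs.
_LATIN = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F"

# A run of Latin letters, optionally joined by hyphens or apostrophes
# (for example "multi-nationale").
LATIN_RUN = re.compile(rf"[{_LATIN}]+(?:[-'][{_LATIN}]+)*")


def latin_tokens(transcription: str) -> list[str]:
    """Return the Latin-script words found in a transcription, in order."""
    return LATIN_RUN.findall(transcription)


def code_switch_ratio(transcription: str) -> float:
    """Return the fraction of whitespace-separated tokens containing Latin script."""
    tokens = transcription.split()
    if not tokens:
        return 0.0
    switched = sum(1 for token in tokens if LATIN_RUN.search(token))
    return switched / len(tokens)


# Illustrative sample only, not taken from the dataset.
sample = "هذا international يعني professional"
print(latin_tokens(sample))
print(code_switch_ratio(sample))
```

The same functions return an empty list and a ratio of 0.0 for a transcription written entirely in Arabic script, which is what one would expect for the transliterated variant of a code-switched utterance.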
Segment Start-End
Each speech sample is segmented with start and end timestamps, which makes it easy to download the segments and align them with their transcriptions.
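To show how start and end timestamps can be turned into an audio clip, here is a minimal sketch that cuts one segment out of a longer recording using only Python's standard `wave` module. It assumes the recording is available locally as an uncompressed WAV file and that timestamps are given in seconds; the function name `extract_segment` and the file paths are hypothetical, and the dataset's actual distribution format may differ.

```python
import wave


def extract_segment(src_path: str, dst_path: str, start_s: float, end_s: float) -> int:
    """Copy the audio between start_s and end_s (in seconds) into a new WAV file.

    Returns the number of audio frames written.
    """
    if start_s < 0 or end_s <= start_s:
        raise ValueError("expected 0 <= start_s < end_s")

    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert seconds to frame indices and clamp them to the file length.
        start_frame = min(int(round(start_s * rate)), total)
        end_frame = min(int(round(end_s * rate)), total)

        src.setpos(start_frame)
        frames = src.readframes(end_frame - start_frame)

        with wave.open(dst_path, "wb") as dst:
            # Keep the source format so the clip plays back unchanged.
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)

    return end_frame - start_frame
```

A call such as `extract_segment("episode.wav", "segment_0001.wav", 12.4, 17.9)` would write the clip for one transcribed segment. Timestamps that run past the end of the file are clamped rather than treated as errors, which is a design choice of this sketch.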
Speech
Examples
Dialect | Transcription | Transcription CS | Gender |
---|---|---|---|
Moroccan | ูุงุฏูู ูุง ู ููุชู ูุงุตูููุงู ูู ูููุช ูุถุฑุชููู ุนูููุง ุฎุงุตู ุชุนุฑู ุจูู ูุงุน ุงูุดุฑูุงุช ุงููุจุงุฑ ูู ูุงูุฏูุฑู ูู ุฒุงููุฑ ุงููุจุงุฑ ุจุญุงููุง ูุงูุชู ูุงู ุดู ููุงุฑ ูุณููู ู ุนุงูุง ุนูุฏ ูุงุญุฏ | ูุงุฏูู la multi-nationale ูู ูููุช ูุถุฑุชููู ุนูููุง ุฎุงุตู ุชุนุฑู ุจูู ูุงุน ุงูุดุฑูุงุช ุงููุจุงุฑ ูู ูุงูุฏูุฑู les affaires ุงููุจุงุฑ ุจุญุงููุง ูุงูุชู ูุงู ุดู ููุงุฑ ูุณููู ู ุนุงูุง ุนูุฏ ูุงุญุฏ | Male | |
Egyptian | ูู ููููู ูุง ู ุนูู . | you welcome ูุง ู ุนูู . | Male | |
Jordanian | ุงูู ูุฐุง ุงูู ุทุฑุจ ุงูููุฏู ุงูุชุฑูุงุดููุงู ูุนูู ุจุฑูููุดูุงู | ุงูู ูุฐุง ุงูู ุทุฑุจ ุงูููุฏู international ูุนูู professional | Female | |
Algerian | ุจูู ูุชูู ุง ุงู ุจูุณูุจู ุชุญูู ู ูุงุฏ ุงูุจูุณุช , ุจุตุญ ุชูุฌู ู ุชุฎุฏู ู ุฎุฏู ุฉ ูุญุฏูุฎุฑู . ูุงู ุฌูุฑุงู , ูุฐุง ุนูุฏู ุงูุจุงู ุ ู ุง ุนูุฏูุด | Bon ูุชูู ุง impossible ุชุญูู ู ูุงุฏ poste , ุจุตุญ ุชูุฌู ู ุชุฎุฏู ู ุฎุฏู ุฉ ูุญุฏูุฎุฑู . ูุงู ุฌูุฑุงู , ูุฐุง ุนูุฏู ุงูุจุงู ุ ู ุง ุนูุฏูุด | Male | |
Palestinian | ู ูุงูู ู ูุงูู ุจุณ ุนูู ุดุฑุท ุณูุชูู ูุงูู ู ุณูุชูู ูุงูู ูุนูู ูููุชู ูููุชู | ู ูุงูู ู ูุงูู ุจุณ ุนูู ุดุฑุท ุณูุชูู ูุงูู ู ุณูุชูู ูุงูู ูุนูู fifty fifty | Male | |
Emirati | ุนูู ุฃูุง ุจุนุฏ ูู ุนุฉุ ู ุง ุจุซูู ุนููู ุจุณ ูุงุชูู ุฑููุงู ุฌูู ูุฐุง ุฒููุ ู ุจ ุจุณ ุญููุ ุญูู ู ุญู ุฃุจูู | ุนูู ุฃูุง ุจุนุฏ ูู ุนุฉุ ู ุง ุจุซูู ุนููู ุจุณ ูุงุชูู royal jelly ูุฐุง ุฒููุ ู ุจ ุจุณ ุญููุ ุญูู ู ุญู ุฃุจูู | Female | |
Yemeni | ุงููู ุ ุฌูุก ู ู ุงุฌู ุงุดุฑุญ ูู ูู ุดู | ok ุ ุฌูุก ู ู ุงุฌู ุงุดุฑุญ ูู ูู ุดู | Male | |
Mauritanian | ูุฐุง ุจุนุฏ ูุงุน ุฏู ุจูุด ู ุนุฑูู ุงูู ุณูู ุจูุชูู. | ูุฐุง ุจุนุฏ ูุงุน deux poches ู ุนุฑูู une seule button. | Male |
Meet
Our team

Abdul-Mageed

Jarrar

Shehata

Berrada

Talafha

Kadaoui

Magdy

Habiboullah

Mohamed

El-Shangiti

Zayed

Cheikh

Alhamouri

Assi

Alraeesi

Mohamed

Alwajih

Mohamed


Nagoudi

Benelhadj

Alsayadi

Al-Dhabyani

Shatnawi

Ech-chammakhy

Makouar

Berrachedi