Casablanca: Data and Models for Multidialectal Arabic Speech Recognition
In spite of the recent progress in speech processing and synthesis, the majority of world languages and dialects remain uncovered. This situation only furthers an already wide technological divide, thereby hindering technological and socioeconomic inclusion. This challenge is largely due to the absence of datasets that can empower diverse speech systems. In this paper, we seek to mitigate this obstacle for a number of Arabic dialects by describing a community effort to collect and transcribe a new dataset, Casablanca. We also develop a number of strong baselines exploiting Casablanca.
Features
What you get with Casablanca
Arabic Transcription
Casablanca provides high-quality, manual transcriptions of Arabic speech from various dialects, ensuring accurate phonetic representation across multiple regional accents.
Gender
The dataset includes balanced representation from male and female speakers, enabling the development of gender-aware speech models.
Dialect
Casablanca covers a wide range of Arabic dialects, including Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni varieties. This diversity allows for the training and evaluation of dialect-specific speech recognition systems that address the unique characteristics of each dialect.
Code-Switching (Latin)
The dataset captures code-switching between Arabic and English or French, with the foreign words written in Latin script, reflecting common speech patterns in multilingual environments.
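As a rough illustration of how Latin-script code-switching could be located in a transcription, the sketch below pulls out runs of Latin letters from a mixed-script string. The function name `latin_tokens` and the sample sentence are assumptions made for this example; they are not part of the Casablanca release. Accented letters (common in French) are included through an explicit character range.

```python
import re

# Matches a run of basic or accented Latin letters, allowing inner
# hyphens and apostrophes (e.g. "multi-nationale").
_LATIN_WORD = re.compile(r"[A-Za-z\u00C0-\u00FF]+(?:[-'][A-Za-z\u00C0-\u00FF]+)*")

def latin_tokens(text):
    """Return the Latin-script words found in a mixed-script transcription."""
    return _LATIN_WORD.findall(text)

# Illustrative sentence (hypothetical, not copied from the dataset).
sample = "يعني fifty fifty"
print(latin_tokens(sample))
```

This only finds code-switching that is written in Latin script; words transliterated into Arabic characters cannot be detected this way and would need the dedicated annotation.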
Code-Switching (Transliterated)
In addition to code-switching in Latin script, Casablanca includes code-switched words that have been transliterated into Arabic characters.
Segment Start-End
Each speech sample is segmented with start and end timestamps, making it easy to download the segments and align them with the transcriptions.
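To show how start and end timestamps can be used, here is a minimal sketch that cuts one segment out of a longer recording held as a list of samples. The helper `slice_segment`, the 16 kHz sample rate, and the timestamp values are assumptions chosen for illustration; the actual field names and audio format in Casablanca may differ.

```python
def slice_segment(samples, sample_rate, start_s, end_s):
    """Return the samples that fall between start_s and end_s (in seconds)."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    start = int(round(start_s * sample_rate))
    # Clamp the end index so a timestamp past the recording does not fail.
    end = min(int(round(end_s * sample_rate)), len(samples))
    return samples[start:end]

# One second of silence at an assumed 16 kHz stands in for a real recording.
audio = [0.0] * 16000
clip = slice_segment(audio, 16000, 0.25, 0.75)
print(len(clip))
```

The same index arithmetic applies when the audio is loaded as an array by a speech library; only the container type changes.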
Speech
Examples
| Dialect | Transcription | Transcription CS | Gender |
|---|---|---|---|
| Moroccan | ูุงุฏูู ูุง ู ููุชู ูุงุตูููุงู ูู ูููุช ูุถุฑุชููู ุนูููุง ุฎุงุตู ุชุนุฑู ุจูู ูุงุน ุงูุดุฑูุงุช ุงููุจุงุฑ ูู ูุงูุฏูุฑู ูู ุฒุงููุฑ ุงููุจุงุฑ ุจุญุงููุง ูุงูุชู ูุงู ุดู ููุงุฑ ูุณููู ู ุนุงูุง ุนูุฏ ูุงุญุฏ | ูุงุฏูู la multi-nationale ูู ูููุช ูุถุฑุชููู ุนูููุง ุฎุงุตู ุชุนุฑู ุจูู ูุงุน ุงูุดุฑูุงุช ุงููุจุงุฑ ูู ูุงูุฏูุฑู les affaires ุงููุจุงุฑ ุจุญุงููุง ูุงูุชู ูุงู ุดู ููุงุฑ ูุณููู ู ุนุงูุง ุนูุฏ ูุงุญุฏ | Male |
| Egyptian | ูู ููููู ูุง ู ุนูู . | you welcome ูุง ู ุนูู . | Male |
| Jordanian | ุงูู ูุฐุง ุงูู ุทุฑุจ ุงูููุฏู ุงูุชุฑูุงุดููุงู ูุนูู ุจุฑูููุดูุงู | ุงูู ูุฐุง ุงูู ุทุฑุจ ุงูููุฏู international ูุนูู professional | Female |
| Algerian | ุจูู ูุชูู ุง ุงู ุจูุณูุจู ุชุญูู ู ูุงุฏ ุงูุจูุณุช , ุจุตุญ ุชูุฌู ู ุชุฎุฏู ู ุฎุฏู ุฉ ูุญุฏูุฎุฑู . ูุงู ุฌูุฑุงู , ูุฐุง ุนูุฏู ุงูุจุงู ุ ู ุง ุนูุฏูุด | Bon ูุชูู ุง impossible ุชุญูู ู ูุงุฏ poste , ุจุตุญ ุชูุฌู ู ุชุฎุฏู ู ุฎุฏู ุฉ ูุญุฏูุฎุฑู . ูุงู ุฌูุฑุงู , ูุฐุง ุนูุฏู ุงูุจุงู ุ ู ุง ุนูุฏูุด | Male |
| Palestinian | ู ูุงูู ู ูุงูู ุจุณ ุนูู ุดุฑุท ุณูุชูู ูุงูู ู ุณูุชูู ูุงูู ูุนูู ูููุชู ูููุชู | ู ูุงูู ู ูุงูู ุจุณ ุนูู ุดุฑุท ุณูุชูู ูุงูู ู ุณูุชูู ูุงูู ูุนูู fifty fifty | Male |
| Emirati | ุนูู ุฃูุง ุจุนุฏ ูู ุนุฉุ ู ุง ุจุซูู ุนููู ุจุณ ูุงุชูู ุฑููุงู ุฌูู ูุฐุง ุฒููุ ู ุจ ุจุณ ุญููุ ุญูู ู ุญู ุฃุจูู | ุนูู ุฃูุง ุจุนุฏ ูู ุนุฉุ ู ุง ุจุซูู ุนููู ุจุณ ูุงุชูู royal jelly ูุฐุง ุฒููุ ู ุจ ุจุณ ุญููุ ุญูู ู ุญู ุฃุจูู | Female |
| Yemeni | ุงููู ุ ุฌูุก ู ู ุงุฌู ุงุดุฑุญ ูู ูู ุดู | ok ุ ุฌูุก ู ู ุงุฌู ุงุดุฑุญ ูู ูู ุดู | Male |
| Mauritanian | ูุฐุง ุจุนุฏ ูุงุน ุฏู ุจูุด ู ุนุฑูู ุงูู ุณูู ุจูุชูู. | ูุฐุง ุจุนุฏ ูุงุน deux poches ู ุนุฑูู une seule button. | Male |
Meet
Our team
Abdul-Mageed
Jarrar
Shehata
Berrada
Talafha
Kadaoui
Magdy
Habiboullah
Mohamed
El-Shangiti
Zayed
Cheikh
Alhamouri
Assi
Alraeesi
Mohamed
Alwajih
Mohamed
Nagoudi
Benelhadj
Alsayadi
Al-Dhabyani
Shatnawi
Ech-chammakhy
Makouar
Berrachedi