In a context where a growing number of languages are in danger of extinction and linguists are in dire need of efficient language documentation tools, Breaking the Unwritten Language Barrier (BULB) aims to support the documentation of unwritten languages with the help of modern natural language processing technologies, in particular automatic speech recognition (ASR) and machine translation (MT).
This ANR/DFG project relies on a strong German-French cooperation between linguists and computer scientists from ZAS (F. Hamlaoui), the KIT (S. Stüker) and the University of Stuttgart (S. Zerbian) on the German side, as well as the LPP (M. Adda-Decker, A. Rialland), the LLACAN (M. van de Velde, D. Idiatov), the LIMSI (L. Lamel and F. Yvon), the LIG (L. Besacier) and the IMMI-CNRS (G. Adda) on the French side. These researchers and their local teams are bringing together their expertise to address the documentation of three mostly unwritten and generally under-resourced African languages of the Bantu family: Basaa (Cameroon), Myene (Gabon) and Embosi (Republic of Congo).
The first phase of the project consists of collecting large speech corpora (at least 100 hours per language) using a three-step, resource-economical methodology designed by S. Bird and M. Liberman:
- Step 1: collection of elicited and natural speech (stories, dialogs, radio/TV broadcasts)
- Step 2: careful respeaking by some reference speakers to ensure more accurate automatic phonetic transcriptions
- Step 3: oral translation into a major language (here, French) to accelerate the documentation process.
This phase is coordinated by F. Hamlaoui and primarily involves the linguist partners at ZAS (E.-M. Makasso, J. Engelmann, C. Ngo Sohna and H. Salfner), at LLACAN, LPP, LIG and at the University of Stuttgart.
The LIMSI and KIT teams will work on the development of language-independent phonetic recognition systems to automatically produce accurate transcriptions in the source (Basaa/Embosi/Myene) and target (French) languages. Alignments between source and target languages will subsequently be performed by the IMMI-CNRS and KIT teams, using and improving existing statistical machine translation techniques. These alignments will be highly valuable to linguists and phoneticians for large-scale acoustic-phonetic studies, phonological and prosodic data mining, and studies of dialectal variation, as well as for morphological studies and dictionary compilation.
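To make the alignment step more concrete, here is a minimal sketch of one classic statistical machine translation technique, IBM Model 1, trained with expectation-maximization. It is an illustration only and is not taken from the project: the choice of model, the function names `train_ibm1` and `align`, and the toy data are all assumptions. The source tokens `u1`, `u2`, `u3` are placeholders standing in for recognized phonetic units, not real Basaa, Myene or Embosi data.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t[(s, f)] = P(source unit s | target word f) with EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Hypothetical helper for illustration, not the project's method.
    """
    src_vocab = {s for src, _ in pairs for s in src}
    # Start from a uniform distribution over source units.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per target word
        for src, tgt in pairs:
            for s in src:
                # E-step: share one unit of mass for s among the target words.
                norm = sum(t[(s, f)] for f in tgt)
                for f in tgt:
                    frac = t[(s, f)] / norm
                    count[(s, f)] += frac
                    total[f] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float, {(s, f): c / total[f] for (s, f), c in count.items()})
    return t


def align(src, tgt, t):
    """Link each source unit to its most probable target word."""
    return [(s, max(tgt, key=lambda f: t[(s, f)])) for s in src]


# Toy parallel data: placeholder source units paired with French words.
pairs = [
    (["u1", "u2"], ["la", "maison"]),
    (["u1", "u3"], ["la", "fleur"]),
    (["u2"], ["maison"]),
]
t = train_ibm1(pairs)
print(align(["u1", "u2"], ["la", "maison"], t))
```

On this toy data the third pair ties `u2` to "maison", which in turn pushes `u1` toward "la" in the first pair. Real recordings would involve noisy, unsegmented phone sequences, which is part of why existing techniques would need improving.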
Beyond the positive outcomes for the documentary linguistics community, BULB more broadly aims to contribute to the preservation of linguistic and cultural diversity by providing communities with tools (e.g. writing systems, dictionaries, grammars) that will heighten the perceived value of their unwritten languages, facilitate the use of these languages in a wider array of settings, and thus help prevent them from disappearing.