BRITISH COLUMBIA: People around the world have been impressed with ChatGPT, a new chatbot developed by OpenAI that uses natural language processing (NLP) to seemingly answer any question.
It works far better than similar technologies that came before it, provided you prompt it in English.
Ife Adebara is working to make sure less widely spoken languages from her home continent of Africa aren’t left behind as NLP technology makes these giant leaps. Adebara is a UBC PhD student in the deep learning and natural language processing group, a diverse team of researchers led by associate professor Muhammad Abdul-Mageed, Canada Research Chair in natural language processing and machine learning.
We spoke with Adebara about her team's work to democratize technology.
What is NLP?
"In simple terms, NLP is about teaching a computer to process human language. For example, Alexa uses NLP to have conversations with people. Google Translate uses NLP to understand a source language and translate it into a target language. When you’re writing an email and the system gives you suggestions of what to write next, it’s using NLP."
Could somebody build a ChatGPT for Swahili or Zulu or Yoruba?
"There's barely enough data for any of those languages, which is why we call them 'Low-resource' languages. If there isn't enough data, you're not able to build a robust system. Most African countries use English, French or Portuguese as the official language, so every conversation in government, education and health care is documented in one of those languages. More than 7,000 languages are spoken across the world, and ChatGPT primarily supports English and minimally knows about 20 other languages. It works really well for English because there is a lot of data available for the language."
What else makes it challenging to bring this technology to low-resource African languages?
"Even when data is available, it may not be of high quality or properly prepared for a task such as machine translation. Also, these can be very different types of languages. At least 80 per cent of African languages are tone languages. You might have the same sequence of alphabet or letters, but the way the words are pronounced makes a difference in meaning. They're just so different from English and other Indo-European languages that they might need to be processed differently."
What specifically have you been working on?
"With support from Google and Advanced Micro Devices, I built a language identification system called AfroLID. It works for 517 African languages and is publicly available. It's a precursor to a lot of other AI models. If you want to build a language translation model, you need to be able to first identify which language it is before you can translate it. I want to keep improving the model so that every language can be identified to the highest accuracy possible. Most of these languages have not been used for any AI work before, and bringing them into the world of AI is really important.
I'm also working on policy: trying to encourage more people to use their languages, and engaging governments and policymakers in African countries to ensure that African languages are used for official business."
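AfroLID itself is a publicly released model covering 517 African languages. The toy sketch below is not its implementation; it is only a conceptual illustration, using scikit-learn and a few made-up training sentences, of what language identification involves: mapping a piece of text to a language label before downstream steps such as translation.

    # Conceptual sketch of language identification, the task AfroLID performs
    # across 517 African languages. This toy classifier uses character n-grams
    # and a handful of illustrative sentences; it is not AfroLID's actual model.
    # Assumes: pip install scikit-learn
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny illustrative training set: sentences paired with ISO-style language codes.
    train_texts = [
        "Habari za asubuhi, karibu nyumbani",   # Swahili
        "Asante sana kwa msaada wako",          # Swahili
        "Bawo ni, e ku aaro o",                 # Yoruba
        "E se pupo fun iranlowo re",            # Yoruba
    ]
    train_labels = ["swa", "swa", "yor", "yor"]

    # Character n-grams capture spelling patterns that differ between languages.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_texts, train_labels)

    # Identify the language of new text before handing it to a downstream
    # system such as a translation model. Expected output here: ['swa']
    print(model.predict(["Karibu sana rafiki yangu"]))

Real systems like AfroLID are trained on vastly more data and languages, but the workflow is the same: text goes in, a language label comes out.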
What motivates you to do this?
"I was raised in Nigeria and went to school there. I understand the need for people to have access to technology in their languages. A lot of people can't speak English or any foreign languages, and they're not able to use technology because of language barriers.
Another problem is that many languages are on the verge of endangerment. People feel pressured to use technology, so they use their own languages less and invest more in other languages, to the detriment of their own. The death of a language is like the death of a species, because you lose all the wisdom that comes with that language.
The other issue is that a whole group of people who are not able to access technology may be shut out from entire global conversations. This could amplify existing biases through the over-representation and under-representation of certain groups of people in the data used to train models. That’s not good for them, and it’s not good for the rest of the world."