Inception, an Abu Dhabi-based subsidiary of G42, has released an Arabic large language model (LLM) as open source. The new model, called Jais, has 13 billion parameters, a rough measure of its size and capacity. Parameters can be thought of as the coefficients in a series of algebraic equations.
During the learning phase, the values of the parameters are derived from the training data and saved as part of the neural network, which is then used for the inference phase. The inference phase is when the model is deployed – taking questions and commands from users and producing answers.
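The split between learning and inference can be illustrated with a toy model that has only two parameters. The sketch below is not how Jais is built: it swaps in a closed-form straight-line fit for the iterative training an LLM requires, and the function names and data are hypothetical, chosen only to show parameters being derived from training data, saved, and then reused to answer a new input.

```python
# Toy illustration only: a two-parameter model, y = a * x + b.
# An LLM such as Jais has billions of parameters and is trained
# iteratively, but the learn-then-infer split is the same idea.

def train(xs, ys):
    """Learning phase: derive the parameters (a, b) from training data
    using the closed-form least-squares fit for a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance
    b = mean_y - a * mean_x
    return {"a": a, "b": b}  # the saved parameters


def infer(params, x):
    """Inference phase: apply the saved parameters to a new input."""
    return params["a"] * x + params["b"]


# Hypothetical training data generated from y = 2x + 1.
training_xs = [0.0, 1.0, 2.0, 3.0]
training_ys = [1.0, 3.0, 5.0, 7.0]

params = train(training_xs, training_ys)   # learning phase
prediction = infer(params, 10.0)           # inference phase
print(params, prediction)
```

Once `train` has run, the training data is no longer needed: `infer` relies only on the saved parameters, which is why a trained model can be distributed and deployed on its own.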
On a worldwide scale, Jais is a respectably large model, fitting between GPT-2, which has 1.5 billion parameters, and GPT-3, which has 175 billion. GPT-4 is reported to be far ahead of the rest, with an estimated 1.7 trillion parameters, although OpenAI has not disclosed the figure.
How Jais was developed
Named after Jebel Jais, the UAE’s highest mountain, the LLM was developed by Cerebras Systems, Inception, and Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) – the world’s first graduate research university dedicated to artificial intelligence (AI). Jais was trained on Condor Galaxy, the multi-exaFLOP AI supercomputer recently announced by G42 and Cerebras.
One of the challenges in training an LLM is getting enough text for input. That’s relatively easy for English, by far the most prevalent language on the internet. According to Statista, as of January 2023, 58.8% of web content was in English, with Russian running a distant second at 5.3%. Arabic-language text accounts for only 0.9% of the content on the World Wide Web.
“Once we began lifting our heads up beyond English, we saw that not having enough data is also a problem for other languages,” says Andrew Feldman, CEO and co-founder of Cerebras Systems. “Even when the number of speakers of a language is very large, the amount of text on the internet may be small. This is true for Spanish, for example. There is a continent of Spanish speakers, but the amount of text on the internet is relatively small.
“It’s also true for Hindi and Mandarin, each with hundreds of millions of speakers. Even though the Chinese government spent a huge amount of time and money to remedy this problem, there still isn’t necessarily enough Mandarin text to feed a data-hungry AI algorithm.
“There are other challenges with Arabic. The text that is available is often a poor translation from English or it may be too formal. In Arabic, some of the writing on the internet is religious writings or poetry, which is important, but not particularly useful if you want to build a chatbot. You have to find modern versions of the language in a conversational style.”
To bridge the gap, a 398 billion-word Arabic and English dataset was developed specifically to train Jais and other AI models. Some aspects of an LLM can be trained using data from other languages – in this case, English. For example, the model can learn to summarise by examining content and summaries of that same content, independently of the language.
Another challenge with Arabic is the number of dialects. “No two people in the Arab world outside of the media speak to each other in formal Arabic,” says Andrew Jackson, CEO of Inception. “They use one of the dialects. We have been gathering as many conversational datasets as possible and using them to introduce the tokens to our model. Once you have a broad set of different dialects, you tweak the model on the output side so it can decide that when this chatbot is used in Lebanon, the response is given in the Lebanese dialect.”
The significance of Jais to Arabic-speaking people
“At G42, we’ve always had bold ambitions and the drive to pursue them,” says Jackson. “We’re trying to contribute as much as possible to the global development of AI by providing meaningful input.
“We’re very firm believers that within the next decade, AGI [artificial general intelligence] will become real, and we want to contribute to that and make sure it’s done in a safe way. We want to make sure AI works for the industries that are important to the region, including the government, healthcare, energy, and financial sectors.”
The new LLM responds to one of the important needs in the region, which is sovereign control. Nobody wants to depend on outside help for such a critical technology as AI. Jais encourages a fully in-house approach, where developers download the model and integrate it into their applications.
This inherent sovereignty reduces dependency on external resources, allowing organisations across the Middle East to run the model within their own infrastructures, maintaining complete control over usage and fine-tuning the model for their own purposes.
Jais gives the more than 400 million Arabic-speaking people in the world more direct access to the powers of AI, and the LLM is a step forward for Abu Dhabi in its ambitions to become a world-leading hub for AI.
Inception chose to release Jais as open source to promote the budding ecosystem around Arabic language AI and to specifically target the scientific, academic, and developer communities. The company also hopes to serve as an example for native speakers of other languages that are currently underrepresented in mainstream AI.
Several organisations have already begun using Jais. These include the UAE Ministry of Foreign Affairs, the UAE Ministry of Industry and Advanced Technology, the Department of Health – Abu Dhabi, the Abu Dhabi National Oil Company (ADNOC), Etihad Airways, and e&. Independent software developers have also taken an interest. Within a day of its release, Jais had already been downloaded from Hugging Face thousands of times.
“This is not the be-all and end-all for us,” says Jackson. “We want to fine-tune our foundational model for proprietary data sets so companies in different industries can use it for their specific needs.”