AI finds itself at the center of an increasing number of conversations, both professional and otherwise. While a significant amount of what is said about AI is, essentially, marketing, some AI tools, used properly, have the potential to make certain processes more efficient and certain kinds of information more accessible.
For that accessibility to be realized, however, the large language models (LLMs) on which AI chatbots and agents are built must accurately reflect the linguistic diversity of the planet.
That is not the case currently.
The most popular large language models are all too often trained predominantly on English data. This reflects the undue dominance of English in certain domains, and it risks reinforcing that dominance, since these models generate new content predominantly in English.
To counter that trend, intentionally non-English-centric and sovereign language models have been gaining popularity.
Below is a list of LLMs that were intentionally trained on non-English data.
This list is a work in progress and will keep growing. If you know of (or have created yourself) an LLM that was specifically trained on non-English data, I would love to hear from you. Contact me at hello@aliftoomega.com