Pablo Haya Coll
Researcher at the Computational Linguistics Laboratory of the Autonomous University of Madrid (UAM) and director of Business & Language Analytics (BLA) at the Institute of Knowledge Engineering (IIC)
I think this is good news: it highlights the value of the PERTE [strategic project for economic recovery and transformation] on the new language economy and serves as a calling card for the new team at the Secretary of State for Digitalisation and Artificial Intelligence (SEDIA). It is an action that aligns developments in natural language processing (NLP) for Spanish and the co-official languages within the framework of the National Artificial Intelligence Strategy (ENIA).
Existing large language models (LLMs, also called foundation models) have been trained on huge collections of documents (corpora) extracted mainly from public web pages. These corpora include documents in multiple languages, but the distribution is heavily skewed towards English. For example, the HPLT project (funded by the European Union) has collected and published 7 petabytes of documents extracted from the web. Looking at the distribution, there is about 1,000 times more data in English than in Spanish. For the co-official languages, the disproportion is even more pronounced.
It should be noted that, despite this disproportion in the training data, multilingual models perform reasonably well in Spanish on general-purpose tasks. There is still room for improvement, and a model adapted to Spanish will certainly perform better. But technological advances in NLP are currently occurring at breakneck speed, which requires moving fast.