
According to an article published by the Spanish daily El País, the Spanish government and the Basque autonomous government will invest 10.5 million euros by 2028 to ensure the place of Euskara in artificial intelligence technologies.
The Spanish government and the Basque autonomous government have signed an agreement providing for an investment of 10.5 million euros by 2028 to ensure the future of the Basque language in the realm of artificial intelligence. The agreement, published in the Spanish Official Gazette (BOE), starts from a blunt observation: "Basque is a language at risk of digital extinction."
The goal is to strengthen the presence of Euskara in digital tools – smartphones, tablets, smartwatches, voice assistants, and artificial intelligence systems – by creating a vast linguistic corpus accessible to researchers and developers.
To achieve this, thousands of hours of audio recordings and millions of text segments will be collected, annotated, and then used to train machine learning models. These resources will notably enable the development of voice recognition systems, machine translation, and conversational assistants capable of functioning in Basque.
"Phones, tablets, smartwatches, applications, and digital assistants will thus be able to interact with their users in Basque," specifies the agreement signed by the Spanish Minister of Digital Transformation and the Basque Industry Counselor.
A strategic project
The project is led by the Euskorpora association, which brings together public and private stakeholders, including Vicomtech, Euskaltzaindia (the Academy of the Basque Language), Euskaltel, Kutxabank, Iberdrola, CAF, Petronor, and the Mondragón group.
"We know that Basque will be part of the new digital environment. Either we will be actors in this new world, or we will be condemned to a secondary role that does not suit us and that we do not wish for." - Lehendakari Imanol Pradales
The creation of the corpus will take place in three phases and will result in the open-source availability of linguistic resources and language models usable by companies, researchers, and European platforms.
The public model ALIA
The corpus will also feed into "ALIA," the large language model developed by the Spanish state. Unlike major international models such as ChatGPT, Gemini, or Copilot, which are trained primarily on English-language data, ALIA is designed from the outset to integrate the languages of the Spanish state: Castilian, Catalan, Galician, Valencian, and Basque. Its training data comes from numerous public sources, such as parliamentary debates and scientific publications. The Spanish government plans to invest an additional 10 million euros in ALIA. The ambition is for the model to understand idiomatic expressions, cultural references, and contexts specific to each of these languages.
The ministry says it wants to develop an open and transparent model, even though its training also relies, like that of most current major models, on Common Crawl, a vast corpus made up of content accessible on the Internet.
The Basque project could also serve as a model for other European minority languages. For Breton, building high-quality linguistic corpora now appears to be a strategic necessity if the language is to avoid the same risk of "digital extinction." The approximately 60,000 articles published by ABP could, in the long term, constitute a valuable public resource, alongside content produced by media, institutions, and collaborative projects like Wikipedia in Breton. However, funding would still be needed to translate them into Breton and proofread them.
Breton also needs ambitious investments to ensure its digital future. The Spanish initiative shows that a proactive public policy is possible.