TL;DR
Italy’s Minerva project trained an LLM from scratch with significant Italian data, yet the model scored near chance on an Italian academic benchmark. The result calls into question common assumptions about scale and native-language investment, challenges the from-scratch European sovereign-LLM approach, and feeds the ongoing debate about which strategy works best.
Italy’s Minerva-3B is a sovereign language model trained entirely from scratch on 2.5 trillion tokens, approximately 50% of them Italian. It scored just 4.9% on the INVALSI Italian school-exam benchmark, a result that challenges assumptions about scale and language-specific investment in AI development.
The Minerva project, led by Sapienza University of Rome and supported by Italy’s national research infrastructure, trained models ranging from 350 million to 7 billion parameters using the CINECA supercomputer, with weights and data openly published from inception. Despite this substantial effort and a focus on native Italian data, the 3B model’s performance on academic tests was near chance, indicating that large-scale training alone does not guarantee complex language understanding.
Researchers concluded that, while the dataset size and parameter count are crucial, they are not sufficient for mastering complex language tasks such as academic assessments. The empirical results suggest that even significant native-language investment may need to be scaled further to achieve desired proficiency levels, challenging the assumption that more data and larger models automatically lead to better performance.
Minerva. The opposite path.
Italy spent years building a European sovereign LLM from scratch. Then Minerva-3B scored 4.9% on the INVALSI Italian school exam.
Where AMÁLIA layered Portuguese specialization onto a multilingual foundation, Minerva trained from scratch on 2.5 trillion tokens with approximately 50% Italian content. Where AMÁLIA’s weights are not yet public, Minerva published its weights, training data, and code openly from day one. By every institutional measure, the Italian approach worked. But the empirical results contain a finding that press coverage has been quiet about, and its implications extend well beyond Italy.
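To make the scale of that commitment concrete, here is a minimal back-of-envelope sketch that takes the figures quoted above at face value (2.5 trillion tokens, roughly 50% Italian). Two things are assumptions and not from the project itself: the nominal parameter count of 3e9 inferred from the name "Minerva-3B", and the widely used 6 x parameters x tokens rule of thumb for dense-transformer training compute. The function names are hypothetical.

```python
# Back-of-envelope arithmetic on the figures quoted in the article.
# Nothing here is an official Minerva number beyond the two quoted inputs.

TOTAL_TOKENS = 2.5e12   # "2.5 trillion tokens" (from the article)
ITALIAN_SHARE = 0.50    # "approximately 50% Italian content" (from the article)
PARAMS_3B = 3e9         # ASSUMPTION: nominal size implied by the name "Minerva-3B"


def italian_tokens(total_tokens: float, italian_share: float) -> float:
    """Number of Italian tokens implied by a total and a language share."""
    if not 0.0 <= italian_share <= 1.0:
        raise ValueError("italian_share must be between 0 and 1")
    return total_tokens * italian_share


def tokens_per_parameter(total_tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    if params <= 0:
        raise ValueError("params must be positive")
    return total_tokens / params


def approx_training_flops(params: float, total_tokens: float) -> float:
    """ASSUMPTION: the common C ~= 6 * N * D estimate for dense transformers."""
    return 6.0 * params * total_tokens


if __name__ == "__main__":
    print(f"Italian tokens:       {italian_tokens(TOTAL_TOKENS, ITALIAN_SHARE):.3e}")
    print(f"Tokens per parameter: {tokens_per_parameter(TOTAL_TOKENS, PARAMS_3B):.1f}")
    print(f"Approx. train FLOPs:  {approx_training_flops(PARAMS_3B, TOTAL_TOKENS):.3e}")
```

These are order-of-magnitude estimates only. They are useful for comparing the from-scratch path with continuation approaches, not for reproducing the project's actual budget.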
Same problem. Opposite path.
European sovereign-LLM development has two primary architectural approaches. Italy chose training from scratch on a substantial native-language foundation. Portugal chose continued pre-training of a multilingual model. Comparing the two shows what each commitment actually requires operationally.
The comparison is not “Italy did it better than Portugal.” Both projects respond to the same structural problem with different architectural strategies under different institutional and economic constraints. Italy’s national-AI investment is larger by an order of magnitude, and Minerva is the visible artifact of that scale.

4.9% on INVALSI. The bitter lesson surfaces.
In June 2024, researchers evaluated Minerva-3B on the INVALSI Italian school-exam benchmark. The result was unambiguous: 4.9%. This is not a critique of Minerva; it is a critique of the public discourse around what Minerva’s empirical results actually demonstrate.
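Whether a score counts as "near chance" depends on the benchmark's item formats, which the article does not specify. The sketch below shows one way to compute a random-guessing baseline for a mixed-format test and compare an observed score against it. The item mix in the usage example is a purely hypothetical placeholder, not the real INVALSI composition, and the function names are illustrative.

```python
from fractions import Fraction


def chance_accuracy(item_formats: dict[int, int]) -> float:
    """Expected accuracy of uniform random guessing.

    item_formats maps the number of answer options to the number of items
    with that many options. Use 0 options for open-ended items, where
    random guessing is treated as scoring zero.
    """
    total_items = sum(item_formats.values())
    if total_items == 0:
        raise ValueError("benchmark has no items")
    expected_correct = sum(
        Fraction(count, options)
        for options, count in item_formats.items()
        if options > 0
    )
    return float(expected_correct / total_items)


def gap_to_chance(observed_accuracy: float, baseline: float) -> float:
    """Observed accuracy minus the guessing baseline (negative = below chance)."""
    return observed_accuracy - baseline


if __name__ == "__main__":
    # HYPOTHETICAL mix for illustration only: 60 four-option items,
    # 20 true/false items, 20 open-ended items.
    hypothetical_mix = {4: 60, 2: 20, 0: 20}
    baseline = chance_accuracy(hypothetical_mix)
    observed = 0.049  # the 4.9% score reported in the article
    print(f"Guessing baseline: {baseline:.3f}")
    print(f"Gap to chance:     {gap_to_chance(observed, baseline):+.3f}")
```

The sign and size of the gap depend entirely on the real item mix, so the actual counts should come from the benchmark's own documentation before drawing any conclusion.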

350M to 7B. Four parameter scales, one architecture.
The Minerva model family covers four parameter tiers, each with specific training corpora. Each scale level reveals what the from-scratch path actually requires at different operating points.
[Table: training corpus by model tier. Surviving entries: Italian + English; 100B English; ~50% English; + 200B code.]

Three answers. Same question.
Minerva, AMÁLIA, and OpenEuroLLM represent the three operational answers to the European sovereign-LLM question. Each makes different architectural and institutional bets. The strategic discourse benefits from treating all three as data points in the same empirical experiment.

Three standards the movement should adopt.
The structural critique generalizes beyond Minerva. The European sovereign-LLM movement would benefit from internalizing these lessons in every subsequent national project. Italy modeled the openness standard; the movement should adopt it as the norm.
Minerva is one valid answer to the European sovereign-LLM question. AMÁLIA is another. OpenEuroLLM is potentially a third. The strategic discourse benefits from treating all three as data points in the same empirical experiment rather than as competing national-prestige projects. More analysis like this is needed. Not less.
Implications for European Sovereign-LLM Strategies
The results from Minerva suggest that a high share of native-language data and a large model may still fall short of deep language understanding at the scale European projects currently pursue. This raises important questions about the necessary investment levels and the feasibility of developing truly capable country-specific LLMs within current resource constraints. European efforts may need to reconsider their scale and resource commitments to meet their strategic AI goals, moving beyond the binary of ‘from scratch or continuation’ toward a more nuanced understanding of scale requirements.
Background on European Sovereign LLM Efforts
European nations have debated strategies for developing sovereign language models, with approaches ranging from continuation training of multilingual models (e.g., Portugal’s AMÁLIA) to training from scratch. Italy’s Minerva project, launched with extensive national funding and infrastructure, exemplifies the ‘from scratch’ approach, emphasizing native data and open weights. While these efforts have yielded promising technical results, their performance on complex tasks has raised questions about the sufficiency of current investment levels and model sizes.
Previous benchmarks and research indicated that larger datasets and more parameters generally improve language understanding, but empirical evidence from Minerva suggests that this relationship is not straightforward, especially for complex academic language tasks.
Unresolved Questions About Investment and Performance
It remains unclear whether further scaling—more data, larger models, or different training methodologies—can significantly improve Minerva’s performance on complex language tasks. The long-term implications for European sovereign-LLM strategies are still developing, and ongoing research will clarify whether these results are an anomaly or indicative of broader limitations.
Next Steps in Evaluating and Scaling European LLMs
The Minerva team plans to continue iterative training and evaluation, exploring larger models and alternative training approaches. Policymakers and researchers are expected to reassess resource allocations and strategic priorities based on these findings, with potential shifts toward more ambitious scaling or hybrid approaches combining multilingual and native-language models. Further benchmarking and cross-national collaboration will be critical to understanding the full implications.
Key Questions
Why did Minerva-3B perform poorly on the Italian academic benchmark?
Minerva-3B was trained at substantial scale on extensive native Italian data, yet its score suggests that scale alone is insufficient for mastering complex language tasks. That points to a need for even larger models or different training strategies.
Does this mean European sovereign-LLMs are not viable?
Not necessarily. The results highlight challenges and suggest that current scale levels may be inadequate. They do not preclude future success but call for reevaluating investment and scaling strategies.
What does this mean for other European projects like AMÁLIA?
It indicates that approaches relying solely on continuation training or smaller native datasets may need to be complemented with larger-scale efforts to achieve desired performance levels.
Are there alternative methods to improve performance besides scaling?
Yes. Approaches such as targeted fine-tuning, architectural innovations, or hybrid training strategies could potentially improve language understanding without relying solely on increased scale.
Source: ThorstenMeyerAI.com