Author reactions

Víctor Etxebarria

Professor of Systems Engineering and Automatic Control at the University of the Basque Country (UPV/EHU)

This paper demonstrates, mathematically and rigorously, that generative AIs can malfunction if they are trained on AI-generated data. The effect the authors propose to call ‘model collapse’ is real: the large language models (LLMs) on which current generative AIs are based genuinely collapse (they stop working, respond badly and give incorrect information). It is a statistical effect, demonstrated in the article and illustrated with examples and experiments, that arises whenever LLMs are trained recursively, that is, when a generative AI is given training data that was itself produced by a generative AI. In this sense, the paper shows that generative AIs trained in this way are in fact degenerative.
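
The statistical mechanism can be illustrated with a toy simulation; this is a minimal sketch, not the paper's actual experiments, and the function name and parameters are illustrative. Each generation fits a Gaussian to a finite sample drawn from the previous generation's fitted Gaussian, so some information is lost at every step and the fitted variance drifts towards zero, the tail-loss behaviour the paper describes for recursively trained models:

```python
import numpy as np

def collapse_demo(n_samples=50, n_generations=200, seed=0):
    """Toy 'model collapse': generation 0 is the real data distribution;
    every later generation is fitted only to samples from the previous
    generation's fitted model (recursive training on synthetic data)."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0                           # generation 0: 'human' data
    for gen in range(n_generations + 1):
        if gen % 40 == 0:
            print(f"generation {gen:3d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
        data = rng.normal(mu, sigma, n_samples)    # sample from current model
        mu, sigma = data.mean(), data.std()        # refit on synthetic data only

collapse_demo()
```

Run as written, the printed sigma typically decays steadily towards zero over the generations: the rare ‘tail’ events of the original distribution are the first information to disappear, which is the degeneration the paper formalises.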

AIs are trained on huge amounts of data from the internet, produced by people who hold authorship rights over their material. To avoid lawsuits, or simply to cut costs, technology companies use data generated by their own AIs to keep training their machines. This increasingly widespread practice renders AIs unfit for any truly reliable function. It makes them not just useless as tools for solving our problems, but potentially harmful, if we base our decisions on incorrect information.

The authors of this excellent article recommend that the AI industry train its models on genuinely intelligent (that is, human) data. They also acknowledge that pre-filtering automatically generated data to avoid degeneration is not necessarily impossible, but that it will require a great deal of serious research.
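
A rough intuition for why access to human data matters can be added to the toy simulation above; again this is a hedged sketch, not a method from the paper, and the fraction and names are illustrative assumptions. Here each generation's training batch keeps a fixed share of genuine generation-0 data alongside the synthetic samples:

```python
import numpy as np

def mitigated_demo(n_samples=50, n_generations=200, human_frac=0.2, seed=0):
    """Same recursive loop as before, but every training batch retains a
    fixed fraction of real ('human') data, a crude stand-in for the kind
    of data curation the authors call for."""
    rng = np.random.default_rng(seed)
    human = rng.normal(0.0, 1.0, 10_000)           # pool of real data
    mu, sigma = 0.0, 1.0
    n_human = int(human_frac * n_samples)
    for gen in range(n_generations + 1):
        if gen % 40 == 0:
            print(f"generation {gen:3d}: sigma = {sigma:.3f}")
        synthetic = rng.normal(mu, sigma, n_samples - n_human)
        batch = np.concatenate([synthetic, rng.choice(human, n_human)])
        mu, sigma = batch.mean(), batch.std()      # refit on the mixed batch

mitigated_demo()
```

In this toy setting the fitted variance stays anchored near its true value instead of drifting to zero, consistent with the commentary's point that genuinely human data is what keeps recursive training stable.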
