Reaction author(s)

Pablo Haya Coll

Researcher at the Computational Linguistics Laboratory of the Autonomous University of Madrid (UAM) and director of Business & Language Analytics (BLA) at the Institute of Knowledge Engineering (IIC)

The study provides deeper insight into the reliability of large language models (LLMs), challenging the assumption that scaling up and tuning these models always improves their accuracy and alignment. On the one hand, the authors observe that, although larger, fine-tuned models tend to be more stable and give more correct answers, they are also more prone to serious errors that go unnoticed, because they rarely decline to answer. On the other hand, they identify a phenomenon they call 'difficulty discordance': even in the most advanced models, errors can appear in any type of task, regardless of its difficulty. In other words, errors persist even in tasks considered simple.

Unfortunately, the journal published the article more than a year after receiving it (June 2023), so the LLMs analysed in the study correspond to 2023 versions. Two new OpenAI models are now available, GPT-4o and o1, as well as a new model from Meta, Llama 3. It would not be unreasonable to assume that the study's conclusions extrapolate to GPT-4o and Llama 3, given that both maintain a technical approach similar to that of their predecessors. However, OpenAI's o1 series of models is based on a new training and inference paradigm, specifically designed to address the kinds of problems present in the test sets used in the study. In fact, manually testing o1-preview with the example prompts described in the paper already shows a significant improvement on the problems where the study indicates that GPT-4 fails. Review and acceptance times in journals should therefore be adjusted to keep pace with technological advances in LLMs, so that results are not already out of date when published.
