Author reactions

Josep Curto

Academic Director of the Master's Degree in Business Intelligence and Big Data at the Open University of Catalonia (UOC) and Adjunct Professor at IE Business School

After reviewing the article, we can say that it is a rigorous piece of work that offers a different view and will generate controversy regarding the evolution of LLMs [large language models]. It is not the first article to question the benchmarks used to compare different models (either against previous versions from the same manufacturer or against competitors). A complementary approach is LiveBench: A Challenging, Contamination-Free LLM Benchmark (the ranking can be found here), which starts from the assumption that training datasets may contain the benchmark answers, so that reported results look better than they actually are.
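To illustrate what contamination means in practice, here is a minimal Python sketch (not taken from the article or from LiveBench; the function names and the threshold are hypothetical). It flags a benchmark item as potentially contaminated when a large fraction of its word n-grams reappear in a document that could have been part of a model's training data.

# Illustrative sketch, assuming contamination is approximated by n-gram overlap.
# Not the method used by LiveBench or the article under review.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)

# Hypothetical usage: flag items whose n-grams largely reappear in crawled text.
item = "What is the capital of France? The capital of France is Paris."
doc = "... a crawled page stating that the capital of France is Paris ..."
if contamination_score(item, doc, n=4) > 0.5:
    print("possible contamination")

Contamination-free benchmarks such as LiveBench sidestep this issue by continually refreshing questions so that answers cannot already be present in training corpora.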

One of the big challenges in the context of LLMs is interpretability and explainability (to humans). Unfortunately, as the architecture grows in complexity, so does the explanation, which can quickly exceed our ability to understand it.

[The research] offers a novel approach to evaluating LLMs that hopefully can be extended further in future work.

[In terms of limitations] As discussed in the article, the humans involved are not experts in the field. Another limitation is the absence of GPT-4o, o1 and other newer versions, but given that new LLMs appear every week (each promising better performance than the rest), it is difficult to conduct a study of this kind without fixing the set of LLMs under study.
