Alfonso Valencia
ICREA professor and director of Life Sciences at the Barcelona National Supercomputing Centre (BSC).
ChatGPT is a natural language processing system built by OpenAI on top of GPT-3.5 (GPT stands for Generative Pretrained Transformer). The GPT model has been trained on large amounts of text to correlate words in context, for which it uses about 175 billion parameters. ChatGPT has been further refined to answer questions by stringing words together according to that internal correlation model.
ChatGPT neither "reasons" nor "thinks"; it just produces text based on a huge and very sophisticated probability model.
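To make the idea of "stringing words together" from a probability model more concrete, here is a deliberately tiny sketch in Python. It is not how GPT works internally (GPT uses a large neural network, not a lookup table of word pairs); it only counts which word follows which in a made-up corpus and then samples the next word from those counts. The corpus and the function names `train_bigram_model`, `next_word_probabilities` and `generate` are invented for this illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, how often each other word follows it."""
    words = text.lower().split()
    follow_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1
    return follow_counts


def next_word_probabilities(model, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    counts = model.get(word, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}


def generate(model, start_word, max_words=8, seed=0):
    """Produce text by repeatedly sampling a likely next word."""
    rng = random.Random(seed)
    output = [start_word]
    for _ in range(max_words - 1):
        probs = next_word_probabilities(model, output[-1])
        if not probs:
            break  # the last word was never followed by anything
        candidates = list(probs)
        weights = [probs[w] for w in candidates]
        output.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(output)


# Made-up toy corpus, for illustration only.
corpus = (
    "the patient has a fever . the patient has a cough . "
    "the doctor examines the patient ."
)
model = train_bigram_model(corpus)

# After "patient", the word "has" gets probability 2/3 and "." gets 1/3.
print(next_word_probabilities(model, "patient"))
print(generate(model, "the"))
```

The generated sentence can look plausible even though nothing in the program represents what a patient or a fever is, which is the point being made above, on a vastly smaller scale.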
The test has three levels, aimed at: a) second-year medical students who have done about 300 hours of study, b) fourth-year medical students with about two years of clinical rotations behind them, and c) students who have completed more than half a year of postgraduate education.
The test included three types of questions adapted for submission to the ChatGPT system:
- Open-ended questions, e.g. “In your opinion, what is the reason for the patient’s pupillary asymmetry?”
- Multiple-choice questions without justification. A typical case would be a question such as: “The patient’s condition is mostly caused by which of the following pathogens?”
- Multiple-choice questions with justification, such as: “Which of the following is the most likely reason for the patient’s nocturnal symptoms? Explain your rationale for each choice.”
The results were evaluated by two experienced doctors, and any discrepancies between them were resolved by a third expert.
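The evaluation scheme just described (two raters, with a third stepping in only when they disagree) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's actual procedure: the labels "accurate", "inaccurate" and "indeterminate", the rule that the third expert's verdict is final, and all the ratings below are hypothetical.

```python
def adjudicate(rating_a, rating_b, tiebreak_rating=None):
    """Return the agreed label, or the third expert's label on disagreement."""
    if rating_a == rating_b:
        return rating_a
    if tiebreak_rating is None:
        raise ValueError("raters disagree and no third rating was given")
    # Assumption: the third expert's verdict is taken as final.
    return tiebreak_rating


def accuracy(final_labels):
    """Fraction of answers whose final label is 'accurate'."""
    return sum(label == "accurate" for label in final_labels) / len(final_labels)


# Made-up ratings: (doctor 1, doctor 2, third expert if one was needed).
ratings = [
    ("accurate", "accurate", None),
    ("accurate", "inaccurate", "accurate"),
    ("inaccurate", "inaccurate", None),
    ("indeterminate", "inaccurate", "inaccurate"),
]

final_labels = [adjudicate(a, b, c) for a, b, c in ratings]
print(final_labels)
print(accuracy(final_labels))
```

Raising an error when the raters disagree and no third rating exists is a design choice for the sketch: it keeps an unresolved disagreement from being silently counted either way.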
In summary, the answers were about as accurate as those of the lowest-scoring human candidates who passed that year.
There are several interesting observations:
- It is striking that, in just a few months, the system has improved significantly, partly because the model itself has gotten better and partly because the amount of biomedical data has increased considerably.
- The system performs better than others trained on scientific texts alone. The reason must be that its statistical model is more thorough.
- There is an interesting correlation between the quality of the results (accuracy), the quality of the explanations (concordance) and the ability to produce non-trivial explanations (insight). The explanation may be that, when the system is working on a case for which it has a lot of data, the correlation model is better and so produces better and more coherent explanations. This seems to give some insight into the inner workings of the system and the importance of the structure of the data it relies on.
The study is careful in key areas, such as checking that the questions and answers were not openly available on the web, and so could not have been used to train the system, and that the system did not retain any memory of previous answers. It also has limitations, such as a limited sample size (350 questions: 119, 102 and 122 for levels 1, 2 and 3, respectively). The study also covers a limited scenario, as it works only with text. In fact, 26 questions containing images or other non-textual information were removed.
What does this tell us?
- Exams should not be in written form, since it is possible to answer them without "understanding" either the questions or the answers. In other words, such written exams are useful neither for assessing the knowledge of a student (be it a machine or a human being) nor for measuring their ability to respond to a real case (which is nil in the case of the machine).
- Natural language processing systems based on "Transformers" are reaching very impressive levels of writing, basically comparable to those of humans.
- Humans are still exploring how to use these new tools.