Author(s) of reactions

Beatriz Seoane Bartolomé

Lecturer in the Department of Theoretical Physics and member of the Dynamics of Disordered Systems group at the Complutense University of Madrid

The problem of predicting the three-dimensional structure of a protein from its amino acid sequence, known as the 'protein folding problem', has been a central challenge not only for biology but also for chemistry and physics. Its importance lies in the fact that understanding how proteins fold is crucial to understanding their function in organisms and, by extension, in life itself. Furthermore, this understanding has significant practical applications, such as the design of optimized enzymes for industrial processes and the development of antibodies to combat various diseases.

Knowing the three-dimensional structure of a protein is so important because its function depends primarily on its shape and not just on the specific amino acid sequence. Very different sequences may lead to similar shapes with practically identical functions, just as small changes in a protein sequence can denature it and destroy its function. For decades, physicists have tried to predict these structures by modeling the interactions between amino acids. However, the challenge is twofold: first, it is necessary to accurately model these interactions, which requires very well-calibrated force fields; second, even with good modeling, finding the minimum-energy structure (i.e., the equilibrium state) is extremely slow from a computational perspective. This is because protein folding is a highly complex optimization problem, with many interactions that can compete with one another. To date, molecular dynamics simulations have only been able to effectively reproduce the structures of very small proteins.

In the last decade, the approach to the protein folding problem has radically changed, primarily due to the massive accumulation of protein sequences in databases, made possible by the drastic reduction in the costs of genomic sequencing. The new idea was simple but innovative: although we do not fully understand how to model the interactions between amino acids, we now have access to a vast amount of data on protein sequences and their viable mutational variations, meaning those that have survived evolutionary pressure.

Instead of trying to model the interactions at a physical level, researchers began to statistically study families of 'homologous proteins', that is, sequences with similar functions in different but evolutionarily related organisms. From this data, they were able to infer two key things: first, which amino acids could not mutate in isolation without denaturing the protein; and second, which pairs of amino acids needed to be in contact in the three-dimensional structure, as a mutation in one would destabilize those critical contacts and, consequently, the structure.
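The two statistical signals described above can be illustrated with a minimal sketch. It assumes a tiny, made-up alignment (`msa`) and uses column entropy and mutual information as simple stand-ins for conservation and co-variation; the function names are hypothetical, and the methods actually used in the field rely on more sophisticated global statistical models than this pairwise measure.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Hypothetical toy alignment (not real data): each row is one homologous
# sequence, each column is one aligned position.
msa = [
    "MKAEL",
    "MRAEV",
    "MKGDL",
    "MRGDV",
    "MKAEL",
    "MRGDV",
]

def column(msa, i):
    """Return the symbols found at aligned position i across all sequences."""
    return [seq[i] for seq in msa]

def entropy(symbols):
    """Shannon entropy in bits; 0 means the position never varies."""
    n = len(symbols)
    return sum((c / n) * log2(n / c) for c in Counter(symbols).values())

def mutual_information(msa, i, j):
    """Mutual information in bits between positions i and j.

    High values mean the two positions tend to change together, which is
    the kind of signal interpreted as a possible structural contact.
    """
    col_i, col_j = column(msa, i), column(msa, j)
    joint = list(zip(col_i, col_j))
    return entropy(col_i) + entropy(col_j) - entropy(joint)

length = len(msa[0])

# Signal 1: positions that do not vary at all in the alignment.
conserved = [i for i in range(length) if entropy(column(msa, i)) == 0.0]
print("conserved positions:", conserved)

# Signal 2: pairs of positions ranked by how strongly they co-vary.
ranked = sorted(
    combinations(range(length), 2),
    key=lambda pair: mutual_information(msa, *pair),
    reverse=True,
)
for i, j in ranked[:3]:
    print(f"positions {i} and {j}: MI = {mutual_information(msa, i, j):.3f} bits")
```

In this toy alignment, position 0 never changes, while positions 2 and 3 always change together, so they receive the highest possible score for a two-letter column. Real alignments contain thousands of sequences and require corrections for sampling bias and indirect correlations, which this sketch ignores.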

This bioinformatics approach, completely 'data-driven', combined with improved models that allowed for the identification of correlations beyond pairs of amino acids, enabled effective learning of 'important mutational couplings', that is, the constraints on how amino acids could change without altering the function of the protein. Subsequently, this strategy was combined with supervised 'machine learning' methods, in which models were trained on proteins whose three-dimensional structures were already known and learned to predict those structures from the sequences.

This approach culminated in a historic milestone in 2020 during the CASP (Critical Assessment of Structure Prediction) competition, when AlphaFold2 was able to predict, with great accuracy, the structures of proteins that had never before been experimentally resolved. Surprisingly, this included proteins with very different sequences from those studied previously, where traditional methods failed spectacularly. Thus, the protein folding problem was practically solved, not through detailed physical modeling of its components, but by imitating patterns from stored evolutionary data.

This achievement has truly revolutionized computational biology, where the combination of large volumes of data with the power of artificial intelligence has surpassed decades of attempts based solely on physical models.
