The Next Step for AI in Biology Is to Predict How Proteins Behave in the Body

Proteins are often called the building blocks of life.

While true, the analogy evokes images of Lego-like pieces snapping together to form intricate but rigid blocks that combine into muscles and other tissues. In reality, proteins are more like flexible tumbleweeds—highly sophisticated structures with “spikes” and branches protruding from a central frame—that morph and change with their environment.

This shapeshifting controls the biological processes of living things—for example, opening the protein tunnels dotted along neurons or driving cancerous growth. But it also makes understanding protein behavior and developing drugs that interact with proteins a challenge.

While recent AI breakthroughs in the prediction (and even generation) of protein structures are a huge advance 50 years in the making, they still only offer snapshots of proteins. To capture whole biological processes—and identify which lead to diseases—we need predictions of protein structures in multiple “poses” and, more importantly, how each of these poses changes a cell’s inner functions. And if we’re to rely on AI to solve the challenge, we need more data.

Thanks to a new protein atlas published this month in Nature, we now have a great start.

A collaboration between MIT, Harvard Medical School, Yale School of Medicine, and Weill Cornell Medical College, the study focused on a specific chemical change in proteins, called phosphorylation, that’s known to act as a protein on-off switch and, in many cases, to promote or inhibit cancer.

The atlas will help scientists dig into how signaling goes awry in tumors. But to Sean Humphrey and Elise Needham, doctors at the Royal Children’s Hospital and the University of Cambridge, respectively, who were not involved in the work, the atlas may also begin to help turn static AI predictions of protein shapes into more fluid predictions of how proteins behave in the body.

Let’s Talk About PTMs (Huh?)

After they’re manufactured, the surfaces of proteins are “dotted” with small chemical groups—like adding toppings to an ice cream cone. These toppings either enhance or turn off the protein’s activity. In other cases, parts of the protein get chopped off to activate it. Protein tags in neurons drive brain development; other tags plant red flags on proteins ready for disposal.

All these tweaks are called post-translational modifications (PTMs).

PTMs essentially transform proteins into biological microprocessors. They’re an efficient way for the cell to regulate its inner workings without needing to alter its DNA or epigenetic makeup. PTMs often dramatically change the structure and function of proteins, and in some cases, they could contribute to Alzheimer’s, cancer, stroke, and diabetes.

For Elisa Fadda at Maynooth University in Ireland and Jon Agirre at the University of York, it’s high time we incorporated PTMs into AI protein predictors like AlphaFold. While AlphaFold is changing the way we do structural biology, they said, “the algorithm does not account for essential modifications that affect protein structure and function, which gives us only part of the picture.”

The King PTM

So, what kinds of PTMs should we first incorporate into an AI?

Let me introduce you to phosphorylation. This PTM adds a chemical group, phosphate, to specific locations on proteins. It’s a “regulatory mechanism that is fundamental to life,” said Humphrey and Needham.

The protein hotspots for phosphorylation are well-known: two amino acids, serine and threonine. Roughly 99 percent of all phosphorylation sites are due to the duo, and previous studies have identified roughly 100,000 potential spots. The problem is identifying which proteins (dubbed kinases, of which there are hundreds) add the chemical groups to which hotspots.

In the new study, the team first screened over 300 kinases that specifically grab onto over 100 targets. Each target is a short string of amino acids that contains serine and threonine, the “bull’s-eye” for phosphorylation, surrounded by different amino acids. The goal was to see how effective each kinase is at its job at every target, almost like a kinase matchmaking game.

This allowed the team to find the most preferred motif—sequence of amino acids—for each kinase. Surprisingly, “almost two-thirds of phosphorylation sites could be assigned to one of a small handful of kinases,” said Humphrey and Needham.
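
To make the matchmaking idea concrete, here is a minimal Python sketch. It is not the study's actual pipeline: the kinase names, positions, and signal values are invented, and the log-based scoring is an assumption chosen for illustration. It only shows how per-position preferences from a screen could yield a preferred motif for each kinase and a ranking of kinases for a given site.

```python
import math

# Hypothetical toy screen: relative signal for a residue at an offset from
# the central serine/threonine (offset 0). All values are invented.
TOY_SCREEN = {
    "kinase_A": {-3: {"R": 8.0, "K": 4.0}, 1: {"P": 1.0}},
    "kinase_B": {-3: {"R": 1.0}, 1: {"P": 9.0}},
}
BASELINE = 1.0  # assumed signal for any residue the toy screen does not list


def preferred_motif(profile):
    """Return the most favored residue at each measured offset."""
    return {pos: max(res, key=res.get) for pos, res in sorted(profile.items())}


def score_site(profile, peptide, center):
    """Sum log2 preferences over the offsets the profile covers."""
    total = 0.0
    for offset, residues in profile.items():
        idx = center + offset
        if 0 <= idx < len(peptide):
            total += math.log2(residues.get(peptide[idx], BASELINE))
    return total


def rank_kinases(screen, peptide, center):
    """Order kinases from best to worst match for one candidate site."""
    scores = {name: score_site(p, peptide, center) for name, p in screen.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Serine sits at index 3 of this made-up peptide.
    print(preferred_motif(TOY_SCREEN["kinase_A"]))
    print(rank_kinases(TOY_SCREEN, "RKLSAAA", 3))
```

In this toy setup, a site with arginine three positions before the serine favors the hypothetical kinase_A, while a proline right after the serine would tip the balance toward kinase_B.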

A Rosetta Stone

Based on their findings, the team grouped the kinases into 38 different motif-based classes, each with an appetite for a particular protein target. In theory, the kinases can phosphorylate over 90,000 known sites in proteins.

“This atlas of kinase motifs now lets us decode signaling networks,” said study author Michael Yaffe of MIT.

In a proof-of-concept test, the team used the atlas to hunt down cellular signals that differ between healthy cells and those exposed to radiation. The test found 37 potential phosphorylation targets of a single kinase, most of which were previously unknown.
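
The article does not spell out the statistics behind this test, so the Python sketch below is only a rough stand-in. It assumes each phosphorylation site has already been assigned a single best-matching kinase (the hypothetical `top_kinase` mapping) and asks which kinases are over-represented among sites that changed after treatment. The site names, kinase names, and the simple share ratio with a pseudocount are all illustrative assumptions, not the study's method.

```python
from collections import Counter


def kinase_enrichment(changed_sites, background_sites, top_kinase):
    """Compare each kinase's share of changed sites with its background share."""
    changed = Counter(top_kinase[s] for s in changed_sites if s in top_kinase)
    background = Counter(top_kinase[s] for s in background_sites if s in top_kinase)
    n_changed = sum(changed.values())
    n_background = sum(background.values())

    ratios = {}
    for kinase, count in changed.items():
        # Add-one pseudocount so a kinase absent from the background
        # does not cause a division by zero.
        background_share = (background[kinase] + 1) / (n_background + 1)
        ratios[kinase] = (count / n_changed) / background_share
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Invented example: six sites, each mapped to a hypothetical kinase.
    top_kinase = {"s1": "K1", "s2": "K1", "s3": "K2",
                  "s4": "K2", "s5": "K2", "s6": "K3"}
    print(kinase_enrichment(["s1", "s2", "s3"], ["s4", "s5", "s6"], top_kinase))
```

A real analysis would use a proper statistical test and correct for multiple comparisons. The sketch only conveys the logic of going from changed sites back to the kinases likely responsible.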

OK, So What?

The study’s method can be used to track down other PTMs to begin building a comprehensive atlas of the cellular signals and networks that drive our basic biological functions.

The dataset, when fed into AlphaFold, RoseTTAFold, their variants, or other emerging protein structure prediction algorithms, could help them better predict how proteins dynamically change shape and interact in cells. This would be far more useful for drug discovery than today’s static protein snapshots. Scientists may also be able to use such tools to tackle the kinase “dark universe.” This subset of more than 100 kinases has no discernible protein targets. In other words, we have no idea how these powerful proteins work inside the body.

“This possibility should motivate researchers to venture ‘into the dark’, to better characterize these elusive proteins,” said Humphrey and Needham.

The team acknowledges there’s a long road ahead, but they hope their atlas and methodology can influence others to build new databases. In the end, they said, they hope “our comprehensive motif-based approach will be uniquely equipped to unravel the complex signaling that underlies human disease progressions, mechanisms of cancer drug resistance, dietary interventions and other important physiological processes.”

Image Credit: DeepMind