Artificial intelligence (AI) has already revolutionized the study of how proteins fold up into their 3D shapes, an achievement honored by last year’s Nobel Prize in Chemistry. Now, AI is transforming protein sequencing—identifying proteins from the sequence of amino acids that make them up. AI is often faster than conventional methods. It also enables researchers to sequence proteins they have never seen before, a common challenge in medical diagnostics, environmental studies, and archaeology.

In the latest advance, European researchers reported this week in Nature Machine Intelligence that an AI known as InstaNova can identify pathogenic proteins in wounds and unknown proteins produced by the brew of microbes in seawater samples. InstaNova isn’t alone. Over the past 4 years, researchers have unveiled more than two dozen protein sequencing AIs. “It seems clear that this is where the field is going to go,” says William Noble, a proteomics AI developer at the University of Washington.

Researchers in other areas are eager to apply the tools. Evolutionary biologists, for example, are using them to identify ancient proteins that could reveal insights into the differences between modern humans and our extinct relatives. “It’s already helpful,” says Enrico Cappellini, a paleoproteomics expert at the University of Copenhagen. “And it’s just going to get better and better.”

The world of proteins is far more complex than that of their genetic blueprints, DNA and RNA. The human genome, for example, contains roughly 20,000 genes, but those genes can give rise to 10 million different proteins, because of changes that can occur as DNA is copied into RNA or as RNA is translated into proteins, which themselves can be appended with myriad chemical modifications.

Biologists traditionally identify proteins by breaking them up into short fragments called peptides, each made up of between five and 20 amino acids. Scientists then weigh those fragments in a mass spectrometer, match the weights to those of known peptides in one of dozens of databases to determine their identity, and then piece together the fragments into the full molecule.
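The lookup step can be illustrated with a toy sketch. It assumes a made-up three-entry database (`KNOWN_PEPTIDES`), a hypothetical `database_search` helper, and an arbitrary tolerance of 0.01 daltons. Only the residue masses are standard monoisotopic values. Real search engines compare whole fragment spectra rather than a single weight, so this shows the idea only.

```python
# Standard monoisotopic residue masses in daltons (subset of the 20 amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259,
}
WATER = 18.01056  # one water is added for the free ends of the peptide


def peptide_mass(seq):
    """Weight of a peptide: the sum of its residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


# Hypothetical database: peptide sequence -> source protein (names are invented).
KNOWN_PEPTIDES = {
    "GAVLK": "toy protein A",
    "SPEAK": "toy protein B",
    "LEVEL": "toy protein C",
}


def database_search(observed_mass, known=KNOWN_PEPTIDES, tol=0.01):
    """Return every known peptide whose weight is within tol of the observation."""
    return [
        (pep, source)
        for pep, source in known.items()
        if abs(peptide_mass(pep) - observed_mass) <= tol
    ]


if __name__ == "__main__":
    print(database_search(peptide_mass("GAVLK")))  # a peptide the database knows
    print(database_search(peptide_mass("GGGGG")))  # a peptide it has never seen
```

The second query returns an empty list: a peptide that is absent from the database cannot be identified this way, however clean the measurement.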

But there are problems with this approach. For starters, up to 70% of peptides found by mass spectrometry aren’t in any databases. “Traditional proteomics is a bit like a Google search. If it’s not there, you will not find it,” says Timothy Patrick Jenkins, a proteomics expert at the Technical University of Denmark. And as the databases of peptides continue to grow, it’s taking ever more computer time to spot hits.

The new AI sequencers don’t bother searching for matches among known peptides. Instead, they calculate the weights of all the potential peptide fragments that could result from chemical modifications to a peptide of a given length. If the AI comes up with fragments that match ones from the actual sample, it tries to assemble them into full-length proteins.
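A database-free search can be sketched in a few lines, under heavy simplifying assumptions. The four-letter alphabet, the `candidates_for_mass` helper, the length cap, and the tolerance are all hypothetical. The brute-force enumeration stands in for what the AI tools do with neural networks; it is not their actual method.

```python
# Standard monoisotopic residue masses in daltons, restricted to four
# amino acids so that the exhaustive search stays tiny.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}
WATER = 18.01056


def candidates_for_mass(target_mass, tol=0.01, max_len=6):
    """Enumerate every sequence whose computed weight matches the target."""
    results = []

    def extend(prefix, residue_sum):
        # Record the current sequence if its full weight matches.
        if prefix and abs(residue_sum + WATER - target_mass) <= tol:
            results.append(prefix)
        if len(prefix) == max_len:
            return
        for aa, mass in RESIDUE_MASS.items():
            # Prune branches that are already too heavy.
            if residue_sum + mass + WATER <= target_mass + tol:
                extend(prefix + aa, residue_sum + mass)

    extend("", 0.0)
    return results


if __name__ == "__main__":
    observed = sum(RESIDUE_MASS[aa] for aa in "GAS") + WATER
    print(sorted(candidates_for_mass(observed)))
```

Every ordering of the same three residues weighs the same, so a single weight yields several candidates. Choosing among them takes extra information, such as the weights of the smaller pieces a peptide breaks into.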

To increase their accuracy, the protein sequencing AIs are trained on millions of known peptides and how they assemble into known proteins. This allows the AIs to learn the most common ways amino acid chains combine. The approach, Jenkins says, is similar to the way large language models (LLMs) such as ChatGPT train on vast bodies of text to learn the rules of syntax. Just as an LLM learns that “the boy bounces a ball” is more likely to be a valid sentence than “bounces a ball the boy,” the proteomics algorithms learn a kind of protein syntax, which provides the most likely sequence for a given set of peptides.
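The language analogy can be made concrete with a deliberately tiny stand-in: a model that counts which amino acid tends to follow which in known sequences, then ranks candidates by how typical they look. The training peptides, the `train_bigram`, `log_score`, and `rank` helpers, and the add-one smoothing are all illustrative assumptions. The real tools use deep neural networks, not pair counts.

```python
import math
from collections import Counter

ALPHABET = "GASV"  # toy alphabet, assumed for illustration


def train_bigram(peptides):
    """Count how often each amino acid is directly followed by another."""
    counts = Counter()
    for pep in peptides:
        for first, second in zip(pep, pep[1:]):
            counts[(first, second)] += 1
    return counts


def log_score(seq, counts):
    """Log-probability of a sequence under the pair counts (add-one smoothing)."""
    total = 0.0
    for first, second in zip(seq, seq[1:]):
        row_total = sum(counts[(first, nxt)] for nxt in ALPHABET)
        total += math.log((counts[(first, second)] + 1) / (row_total + len(ALPHABET)))
    return total


def rank(candidates, counts):
    """Order candidates from most to least typical of the training data."""
    return sorted(candidates, key=lambda seq: log_score(seq, counts), reverse=True)


if __name__ == "__main__":
    # Made-up "known" peptides in which G-A and A-S pairs are common.
    counts = train_bigram(["GASV", "GASG", "VGAS", "AGAS"])
    print(rank(["AGS", "ASG", "GAS", "GSA", "SAG", "SGA"], counts))
```

With these toy counts, "GAS" ranks first among the six orderings of the same residues, much as a language model prefers a familiar word order.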

In 2021, Noble and his colleagues unveiled Casanovo, the first protein sequencing AI to use a deep neural network like the one that powers ChatGPT. In a 2024 paper in Nature Communications, Noble’s team reported that the AI proved adept at identifying novel sequences of peptides that weren’t in the training data. Additional experiments showed that Casanovo excelled at identifying the cell-surface peptides that the immune system targets when it attacks cancer, as well as unknown proteins in seawater samples.

Now, Jenkins and his colleagues have built on these results with InstaNova. It, too, uses a deep learning neural network. But unlike previous AI protein sequencing models, it adds a strategy called diffusion, an approach that has supercharged AI imagemaking models such as DALL-E, and protein structure models such as RoseTTAFold or AlphaFold. Diffusion models initially add random noise to the input data and then remove it to see how the procedure sharpens the output. Based on the outcome, they then apply noise removal more broadly to further sharpen the result. In their Nature Machine Intelligence paper, Jenkins and his colleagues report that in a head-to-head test with Casanovo, InstaNova, coupled with a refinement called InstaNova+, identified 42% more peptides in a labmade brew of proteins from nine organisms.

When the team applied its AI to real-world proteomics challenges, it found, among other results, that the model identified 1225 peptides unique to the blood protein albumin in infected leg wounds, 10 times more than conventional database searches did. Of those, 254 were new peptides not in the databases. The researchers also mapped other peptides to 52 bacterial proteins. These and other results show that InstaNova “can analyze complex samples and come up with answers,” says Catrine Soiberg, who heads R&D for Atlas Antibodies, a firm that helps researchers map proteins throughout tissues. Noble, who got an early look at InstaNova and has already put it through its paces, calls it “a real advance.”

Others are running with it as well. Matthew Collins, a proteomics researcher at the University of Cambridge, has recently been testing several AI protein sequencing tools to analyze archaeological samples. In most cases the proteins in the samples have undergone extensive chemical changes after eons underground or came from extinct plants and animals, so they are unlikely to be represented in conventional protein and peptide databases. The models, Collins says, “are particularly good for messy environments [where] you don’t know what’s there.”

Already the AI tools have enabled his team to spot signatures of rabbit proteins in Neanderthal sites and fish muscle proteins in ancient Brazilian pots. “[The models] are so useful, we have switched all our research to work with them,” Collins says. “In my mind it’s a step change.”

More: https://www.science.org/content/article/ai-revolution-comes-protein-sequencing