New preprint🚨
Imagine (re)designing a protein via inverse folding. AF2 folds the designed sequence into a structure with pLDDT 94 & you get 1.8 Å RMSD to the input. Perfect design?
What if I told u that the structure has 4 solvent-exposed Trp and 3 Pro where a Gly should be?
Why you should be wary🧵👇
Anecdotes about this kind of biophysics-ignorant confidence of AlphaFold & Co in certain sequences have been around for a while. We have now systematically assessed this and other undesirable behavior of folding models in the context of (de novo) protein design.
1/19
Led by @kerlenkorbeld.bsky.social and Seva Viliuga, we started this project under the premise of the field's most common protein design evaluation workflow: the self-consistency pipeline (function->backbone->sequence->evaluation), where folding models are used for the last step.
2/19
After refolding the designed sequence, two metrics are typically computed: pLDDT (structure confidence) and scRMSD (backbone similarity to the input). If both are favorable, i.e. the design fulfills the “designability” criteria, the sequence (or seq-struc combo) is considered a good design.
3/19
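A minimal sketch of such a designability check, for illustration only. The class and function names are hypothetical, and the cutoffs (pLDDT >= 80, scRMSD <= 2.0 Å) are assumed defaults, not values from the preprint; as noted in the next post, thresholds differ between groups.

```python
from dataclasses import dataclass


@dataclass
class RefoldResult:
    """Metrics from refolding one designed sequence (hypothetical container)."""
    plddt: float    # mean pLDDT of the refolded structure, 0-100
    sc_rmsd: float  # backbone RMSD to the input structure, in Å


def is_designable(result, min_plddt=80.0, max_sc_rmsd=2.0):
    """Return True if both metrics pass the (assumed) cutoffs.

    The default thresholds are illustrative assumptions only.
    """
    return result.plddt >= min_plddt and result.sc_rmsd <= max_sc_rmsd


# The example from the first post: pLDDT 94 and 1.8 Å RMSD pass this filter,
# even though the filter never looks at the sequence itself.
example = RefoldResult(plddt=94.0, sc_rmsd=1.8)
print(is_designable(example))
```

Note that the check only sees two numbers, which is exactly why a sequence with implausible residues can slip through.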
While widely used, we've not seen a systematic analysis of how successful this evaluation step is. Are folding models indeed good at this task? Does it really sort out bad designs? Can you quantify how “good” a sequence is? And why is everyone using different thresholds?!
4/19
We probed this for ESMfold, and for AF2, where two modes can be employed: the default MSA mode (often seen for redesign of native proteins) and single sequence mode (often seen for de novo protein design)
First, we checked how good they are at identifying clearly bad designs
5/19
This is easy: just take natural protein seqs & randomly swap letters. In reality, you get a non-folder after a few exchanges. Yet, folding models very stubbornly insist that such seqs fold into the same structure despite nonsensical residues all over the place.
@sokrypton.org saw this for Ala
6/19

Sergey Ovchinnikov on X: "I've previously hypothesized reason why it's easy to come up with adversarial sequence for alphafold (for design) is because it only looks at a subset of input positions. For a talk this morning, I figured I should just do the test and show this. 😅 (2/2) https://t.co/AcRB9nNkNL"
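The random-exchange test in post 6 can be sketched roughly as below. This reads "swap letters" as random point substitutions at distinct positions, which is an assumption; the function name and the example sequence are made up for illustration and are not from the preprint.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def mutate(seq, n_mutations, rng):
    """Replace n_mutations distinct positions with a different random residue."""
    if n_mutations > len(seq):
        raise ValueError("more mutations requested than positions available")
    residues = list(seq)
    # sample() picks distinct positions, so no position is mutated twice
    for pos in rng.sample(range(len(seq)), n_mutations):
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


# Build a series of increasingly scrambled variants of one (made-up) sequence.
rng = random.Random(0)  # fixed seed for reproducibility
native = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
variants = {n: mutate(native, n, rng) for n in (1, 5, 10, 20)}
for n, variant in variants.items():
    print(n, variant)
```

Each variant would then be refolded and its pLDDT / scRMSD compared to the unmutated sequence; that step depends on the folding model and is left out here.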
We found that this trend is exacerbated by the availability of evolutionary info to models: AF2 MSA behaves the worst, ESMfold is a bit better, and AF2 ss is the best (out of a bad bunch).
As others noted: the signal from these MSAs / pLM embeddings overrules “reason”
7/19
Dec 16, 2025 15:23
Worse: If you repeat this experiment starting from sequences designed with ProteinMPNN, the effect is even stronger. The likely reason is that ProteinMPNN designs very strongly and unambiguously encode the intended structure, which can make folding models overconfident about their predictions.
8/19
We next looked at a set of literature-reported experimentally tested designs and compared folding models’ ability to act as “oracles”. Again, evo info was detrimental. MSA mode’s poor performance can be obfuscated though, for seqs where the MSA is very shallow / empty.
Duh: empty MSA == ss mode
9/19
So, AF2 ss mode wins. Problem solved? Nope. Besides still being overly confident, it also has another major issue: it generally does not work very well. Even for small natural proteins, AF2 ss can barely fold sequences well enough to pass the commonly used pLDDT- and scRMSD-based “designability” criteria.
10/19
Sequences from design models like ProteinMPNN boost folding success, but for anything beyond medium-sized proteins, you can almost never fold accurately in ss mode.
11/19
To sum up so far:
Designers: AF & co are bad at spotting poor designs. If u use ss mode, false positives go down, but maybe no design at all will fold
Devs: if ur new seq design algo makes seqs with “regular” seq-struc mapping (as in nature), refolding gets worse, and users may *think* your tool is 💩
12/19
We also question whether the refolding pipeline is always a robust evaluation metric to begin with. As noted e.g. in @moalquraishi.bsky.social's Genie2 paper, the designability of PDB structures is on average much higher than that of AFDB structures. Now, we have a pretty good guess why:
13/19
Realizing that this gap is RMSD-driven, not pLDDT-driven, we speculated the cause to be a PDB artefact: in x-ray strucs, flexible termini often lack density / unstructured parts were truncated to begin with. As a result, the designability of PDB structures gets very high compared to full-length AFDB entries
14/19
Indeed, we show that if you truncate AFDB strucs to their corresponding PDB entry length, you get the same designability.
Without a PDB to compare to, you can also rescue via a simple trick: just compute statistical outliers from an initial alignment, realign without them, and use median RMSD
15/19
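A rough NumPy sketch of the trick in post 15, under several assumptions: the outlier rule (upper fence at Q3 + 1.5 x IQR of per-residue deviations), the reading of "median RMSD" as the median per-residue deviation after realignment, and all function names are illustrative choices, not the preprint's implementation. Inputs are assumed to be matched N x 3 coordinate arrays (e.g. Cα atoms).

```python
import numpy as np


def kabsch_superpose(mobile, target, mask=None):
    """Superpose mobile onto target, fitting only on masked positions.

    Both inputs are (N, 3) arrays with matched rows. Returns all of
    mobile's coordinates after the fitted rotation and translation.
    """
    if mask is None:
        mask = np.ones(len(mobile), dtype=bool)
    mob_center = mobile[mask].mean(axis=0)
    tgt_center = target[mask].mean(axis=0)
    p = mobile[mask] - mob_center
    q = target[mask] - tgt_center
    u, _, vt = np.linalg.svd(p.T @ q)
    # Guard against an improper rotation (reflection)
    d = -1.0 if np.linalg.det(vt.T @ u.T) < 0 else 1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return (mobile - mob_center) @ rot.T + tgt_center


def trimmed_median_deviation(mobile, target, iqr_factor=1.5):
    """Align, drop outlier positions, realign on the rest, report the median."""
    aligned = kabsch_superpose(mobile, target)
    dev = np.linalg.norm(aligned - target, axis=1)
    q1, q3 = np.percentile(dev, [25, 75])
    keep = dev <= q3 + iqr_factor * (q3 - q1)  # assumed outlier rule
    realigned = kabsch_superpose(mobile, target, mask=keep)
    dev = np.linalg.norm(realigned - target, axis=1)
    # Median over all positions; it is insensitive to a few floppy residues
    return float(np.median(dev))
```

With this reading, a handful of badly placed terminal residues no longer distorts the fit or dominates the score, which is the effect described above.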
Interestingly, even though this demonstrates that the low designability of AFDB is essentially an artefact from RMSD calcs after alignment, using more sophisticated aligners (TM-align, Sheba) does not achieve the same rescue.
16/19
These results reveal a broader issue: scRMSD is a very inflexible metric (pun intended). If not-being-as-stable-as-a-rock is a design goal of yours (as it should be, if you want to eventually design enzymes), this metric makes little sense as is.
17/19
Taken together, we hope our results highlight the current limitations of the self-consistency evaluation that is so commonly used in the field, and thus encourage the field to establish new/additional criteria, or at least to be more aware of the downsides of AF & co metrics when assessing designs.
18/19
Thanks to Seva and Kerlen for their tireless effort to get the data as watertight as possible.
Here is the preprint link again:
www.biorxiv.org/content/10.6...
19/19

Limitations of the refolding pipeline for de novo protein design
With the emergence of powerful deep learning-based tools, computational protein design has become a widely accessible technique. Nowadays, it is possible to perform both sequence and structure design ...
20/19
Or, you know, if bioRxiv is down / extremely slow again, download the pdf here:
https://www.fuerstlab.com/uploads/2025.12.09.693122v1.full.pdf