Protein Fold Redundancy: Why Scaling Natural Sequences Hits Diminishing Returns

Ligo Biosciences has a problem. They train generative models to design enzymes, and they want more structural data. The obvious move: fold millions of natural protein sequences with AlphaFold2 and use those predicted structures as training examples. But when they tried, they discovered that natural protein folds are far more redundant than sequence counts suggest.

The core finding: the AlphaFold Database (AFDB) contains over 200 million predicted structures, but clustering cleaned fragments reveals only ~25,000 reusable structural neighborhoods, not the 2.3 million non-singleton clusters reported by a fast Foldseek pass. That is roughly a 100x overcount (2.3 million / 25,000 ≈ 92).

The Sequence-Fold Mismatch

Modern biomolecular models like AlphaFold3 rely on sequence scale. DeepMind's "all-PDB" approach used structure prediction to convert massive sequence databases (MGnify, etc.) into 3D structures. The assumption: more sequences → more structural diversity. But Ligo's analysis shows that evolution reuses stable folds heavily.

Example from their AFDB fragment clusters: three proteins share the same fold (TM-score > 0.75) despite only 23.9–28.3% sequence identity. One is a 3-oxoacyl-[acyl-carrier-protein] reductase from bacteria, another a NAD-binding protein from fungi, and a third a short-chain dehydrogenase from a different bacterium. Different sequences, same fold.

This matters because when you scale predicted structures, you're not adding independent examples — you're adding sequence variants of the same fold families.

Clustering the Predicted Protein Universe

Foldseek previously clustered the AFDB into 2.3 million non-singleton clusters. But Ligo argues this is an overestimate for two reasons:

1. Predicted structures include disordered regions. AlphaFold predicts the whole chain, including floppy tails, linkers, and signal peptides, even when those regions have low confidence (low pLDDT). Clustering on full chains merges domains that should be separate.
2. Multi-domain proteins create ambiguous clusters. Two proteins might share one domain but differ in others. A fast clustering pass may split or merge them incorrectly.
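
The first point can be made concrete with a small sketch. Assuming per-residue pLDDT scores are available as a plain list, the function below keeps only contiguous runs at or above a cutoff. The cutoff of 65 matches the filter described later in this article; the function name `confident_segments` and the `min_len` parameter are hypothetical, introduced here for illustration and not taken from Ligo's tooling.

```python
def confident_segments(plddt, threshold=65.0, min_len=10):
    """Return half-open (start, end) index ranges of confident residue runs.

    plddt: per-residue confidence scores for one chain.
    min_len: hypothetical minimum run length; shorter runs are dropped.
    """
    segments = []
    start = None  # index where the current confident run began, if any
    for i, score in enumerate(plddt):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the chain
    if start is not None and len(plddt) - start >= min_len:
        segments.append((start, len(plddt)))
    return segments


# Toy chain: disordered tail, ordered domain, low-confidence linker,
# second ordered domain, then a short confident stub that is too small to keep
toy_plddt = [30] * 5 + [90] * 20 + [40] * 8 + [85] * 16 + [50] * 3 + [95] * 4
print(confident_segments(toy_plddt))  # [(5, 25), (33, 49)]
```

Trimming this way only removes disorder. It does nothing for two domains joined by a confident linker, which is the second problem in the list above.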

To get cleaner data, Ligo developed a graph-theoretic splitting method. First, they discard residues with pLDDT < 65. Then they build a k-nearest-neighbor graph (k=15) on the remaining C-alpha atoms and apply spectral bisection to cut at the weakest connection point, typically where a high-confidence linker bridges two domains.

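# Illustrative sketch of Ligo's spectral splitting
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import NearestNeighbors

def spectral_split(coords, plddt, threshold=65, k=15):
    """Bisect the confident residues of one chain.

    coords: (N, 3) array of C-alpha coordinates.
    plddt: (N,) array of per-residue confidence scores.
    Returns a boolean array over the residues that pass the pLDDT filter,
    giving the side of the cut each one falls on.
    """
    # Keep only high-confidence residues
    mask = plddt >= threshold
    coords = coords[mask]
    n = len(coords)
    if n < 3:
        raise ValueError("need at least 3 confident residues to split")

    # Build the k-NN graph. Ask for k + 1 neighbors because each point
    # comes back as its own nearest neighbor at distance 0.
    k = min(k, n - 1)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    distances, indices = nbrs.kneighbors(coords)

    # Weight edges by inverse distance, skipping self-edges
    W = np.zeros((n, n))
    for i in range(n):
        for j, d in zip(indices[i], distances[i]):
            if i != j:
                W[i, j] = 1.0 / (d + 1e-6)
    # k-NN relations are one-directional; symmetrize so that the
    # Laplacian is symmetric, as eigh requires
    W = np.maximum(W, W.T)

    # Fiedler vector: eigenvector of the second-smallest eigenvalue
    L = laplacian(W, normed=True)
    eigenvalues, eigenvectors = eigh(L, subset_by_index=[0, 1])
    fiedler = eigenvectors[:, 1]

    # Split at the sign change of the Fiedler vector
    split = fiedler > 0
    return split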

This method yields clean domain fragments. For example, a BRCA1 isoform (P38398-3) that spans 1863 residues gets split into multiple ordered domains, discarding low-confidence tails.
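
A single bisection yields two pieces, so getting several domains out of a long chain presumably means applying the cut repeatedly. The article does not say how Ligo decides when to stop, so the sketch below makes two assumptions of its own: a fragment is kept whole if a split would leave a piece smaller than `min_size` residues, or if the second-smallest Laplacian eigenvalue exceeds `max_lambda2` (a large value suggests there is no weak link to cut). Both parameters, their defaults, and all function names are hypothetical. It uses NumPy only and dense matrices, which is reasonable for a single chain.

```python
import numpy as np

def knn_affinity(coords, k=15):
    """Symmetric inverse-distance k-NN affinity matrix (dense)."""
    n = len(coords)
    k = min(k, n - 1)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)             # a residue is never its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]  # k closest residues per row
    rows = np.repeat(np.arange(n), k)
    cols = nearest.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = 1.0 / (dist[rows, cols] + 1e-6)
    return np.maximum(W, W.T)                  # symmetrize the directed k-NN graph

def fiedler_cut(W):
    """Return (lambda_2, side labels) from the normalized graph Laplacian."""
    inv_sqrt_deg = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(W)) - inv_sqrt_deg[:, None] * W * inv_sqrt_deg[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigenvalues in ascending order
    return eigenvalues[1], eigenvectors[:, 1] > 0

def recursive_split(coords, k=15, min_size=30, max_lambda2=0.05):
    """Return a list of index arrays, one per fragment of coords."""
    def recurse(idx):
        if len(idx) < 2 * min_size:
            return [idx]                       # too small to yield two fragments
        lambda2, side = fiedler_cut(knn_affinity(coords[idx], k))
        left, right = idx[side], idx[~side]
        # Hypothetical stopping rule: no weak cut, or a piece that is too small
        if lambda2 > max_lambda2 or min(len(left), len(right)) < min_size:
            return [idx]
        return recurse(left) + recurse(right)
    return recurse(np.arange(len(coords)))

# Synthetic "dumbbell": two compact blobs joined by a thin linker along x
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 2.0, size=(40, 3))
blob_b = rng.normal(0.0, 2.0, size=(40, 3)) + np.array([60.0, 0.0, 0.0])
linker = np.column_stack([np.arange(6.0, 55.0, 4.0), np.zeros(13), np.zeros(13)])
coords = np.vstack([blob_a, linker, blob_b])

# max_lambda2=2.0 switches off the eigenvalue check for this toy case
fragments = recursive_split(coords, max_lambda2=2.0)
print([len(f) for f in fragments])
```

One caveat: if the filtered graph is already disconnected, the second eigenvalue is zero and the sign pattern of its eigenvector is not reliable, so a production version would likely separate connected components before bisecting.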

The Real Number: ~25,000 Structural Neighborhoods

After cleaning, Ligo clustered the fragments. Their current analysis suggests the true number of reusable structural neighborhoods is closer to 25,000, not 2.3 million. That's roughly a 100x reduction.

This has direct implications for generative enzyme design. If you treat 2.3 million clusters as independent examples, you risk overfitting to redundant folds. The model spends capacity on sequence variation within the same structural family rather than learning to explore new geometries.

What This Means for Scaling Biomolecular Models

DeepMind's AlphaFold3 succeeded partly by scaling data — training on the entire PDB plus MGnify-folded structures. But Ligo's work suggests that beyond a certain point, adding more natural sequences yields diminishing returns in structural diversity.

The practical takeaway: curated structural diversity matters more than raw sequence count. For generative models, you want balanced representation across fold space, not millions of near-duplicates of common folds.
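
One common way to get balanced representation is to sample training examples with probability inversely proportional to the size of their cluster, so every cluster carries the same total weight. The sketch below assumes you already have a mapping from fragment ID to cluster ID; the function names, the toy IDs, and the choice of this particular weighting are illustrative assumptions, not something the article attributes to Ligo.

```python
import random
from collections import Counter

def cluster_balanced_weights(cluster_of):
    """Map each fragment to a sampling weight of 1 / (n_clusters * cluster_size).

    cluster_of: dict of fragment ID -> cluster ID.
    The weights sum to 1, and each cluster receives equal total mass.
    """
    sizes = Counter(cluster_of.values())
    n_clusters = len(sizes)
    return {
        frag: 1.0 / (n_clusters * sizes[cluster])
        for frag, cluster in cluster_of.items()
    }

def sample_batch(cluster_of, batch_size, seed=0):
    """Draw fragment IDs with replacement using cluster-balanced weights."""
    weights = cluster_balanced_weights(cluster_of)
    fragments = list(weights)
    rng = random.Random(seed)
    return rng.choices(
        fragments, weights=[weights[f] for f in fragments], k=batch_size
    )

# Toy data: one common fold with 8 fragments, one rare fold with 2
toy_clusters = {f"common_{i}": "fold_A" for i in range(8)}
toy_clusters.update({f"rare_{i}": "fold_B" for i in range(2)})

toy_weights = cluster_balanced_weights(toy_clusters)
print(toy_weights["common_0"], toy_weights["rare_0"])  # 0.0625 0.25
print(sample_batch(toy_clusters, batch_size=6))
```

With uniform sampling the common fold would make up 80% of each batch; with these weights the two folds are drawn about equally often.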

Ligo's next step is to release their clustering dataset and splitting tool as open source. They hope it will help the community build better training sets for protein design models.

The Bottom Line

If you're training a generative model for protein design, don't just fold every sequence you can find. Cluster first. You'll get better generalization with a fraction of the data.