The Problem: Optimal ApplicativeDo is Too Slow

GHC has a flag -foptimal-applicative-do that produces the best possible scheduling of independent statements in do notation. It's off by default because the algorithm is O(n³). A 1000-statement block takes 55 seconds to compile with the optimal flag, vs under 2 seconds with the default greedy heuristic.

Ian Duncan set out to fix this. The core problem: given a sequence of statements with dependencies, find the tree of Seq (sequential) and Par (parallel) combinators that minimizes the number of rounds (sequential depth). This is a dynamic programming problem over spans [i,j], with O(n²) spans and O(n) split points each, yielding O(n³).

The Insight: Borrow from RNA Folding

The recurrence for optimal cost is identical to Nussinov's algorithm for RNA secondary structure prediction. In RNA, bases pair to minimize free energy; in ApplicativeDo, statements group to minimize rounds. Both use the same triangular table and combine-over-split-point logic.

Duncan realized that for "tangled" spans (where no parallel grouping is possible), the optimal split must be at one of the two extremes: peel off the first statement or the last. This is because the cost function is monotonic — adding statements only increases cost. So any interior split would sum two larger sub-costs, which can't be cheaper than peeling a single statement. The only exception is when both extremes tie, requiring further search.

The Optimization: Bounding with Longest Chain

When a tie occurs, the optimal cost is either c or c+1. If the longest chain through the span equals c+1, the lower bound meets the upper bound, and the search terminates immediately. For a pure chain of dependencies, every span ties, but the longest chain equals its length, which is always c+1. So the algorithm performs a single check per span instead of O(n) splits.

Combined with the paper's existing optimizations (extreme-cut shortcut for non-tied spans), the total cut evaluations drop dramatically:

  • 100-chain: 166,650 naive → 4,950 optimized
  • 100 realistic: 4,544 → 1,426
  • 200 realistic: 47,471 → 3,583

The trees produced are identical to the naive optimal — same schedule, less work.

Residual Worst Case

Not all inputs benefit. A "hub" pattern (every statement feeds one final sink) still requires full O(n³) because every span ties with longest chain = 2, never closing the bound. Such shapes are rare in real code. For those, Duncan suggests a follow-up algorithm (not detailed here).

Practical Impact

This optimization makes -foptimal-applicative-do viable for everyday use. Developers using Haxl or other batching monads can now write plain do notation without worrying about manual <$> and <*> refactoring, and get optimal parallelism without 55-second compile times.

The patch is not yet merged into GHC; Duncan is working on it. But the technique — borrowing DP from computational biology — is a neat example of cross-domain algorithm transfer.

Code Example

Consider a block with two independent fetches:

numCommonFriends x y = do
  fx <- friendsOf x
  fy <- friendsOf y
  return (length (intersect fx fy))

With -foptimal-applicative-do, GHC desugars this to:

(\fx fy -> length (intersect fx fy)) <$> friendsOf x <*> friendsOf y

Instead of the sequential:

friendsOf x >>= \fx -> friendsOf y >>= \fy -> return (length (intersect fx fy))

This reduces two network round-trips to one.