The Problem: Optimal ApplicativeDo is Too Slow
GHC has a flag -foptimal-applicative-do that produces the best possible scheduling of independent statements in do notation. It's off by default because the algorithm is O(n³). A 1000-statement block takes 55 seconds to compile with the optimal flag, vs under 2 seconds with the default greedy heuristic.
Ian Duncan set out to fix this. The core problem: given a sequence of statements with dependencies, find the tree of Seq (sequential) and Par (parallel) combinators that minimizes the number of rounds (sequential depth). This is a dynamic programming problem over spans [i,j], with O(n²) spans and O(n) split points each, yielding O(n³).
The Insight: Borrow from RNA Folding
The recurrence for optimal cost is identical to Nussinov's algorithm for RNA secondary structure prediction. In RNA, bases pair to minimize free energy; in ApplicativeDo, statements group to minimize rounds. Both use the same triangular table and combine-over-split-point logic.
Duncan realized that for "tangled" spans (where no parallel grouping is possible), the optimal split must be at one of the two extremes: peel off the first statement or the last. This is because the cost function is monotonic — adding statements only increases cost. So any interior split would sum two larger sub-costs, which can't be cheaper than peeling a single statement. The only exception is when both extremes tie, requiring further search.
The Optimization: Bounding with Longest Chain
When a tie occurs, the optimal cost is either c or c+1. If the longest chain through the span equals c+1, the lower bound meets the upper bound, and the search terminates immediately. For a pure chain of dependencies, every span ties, but the longest chain equals its length, which is always c+1. So the algorithm performs a single check per span instead of O(n) splits.
Combined with the paper's existing optimizations (extreme-cut shortcut for non-tied spans), the total cut evaluations drop dramatically:
- 100-chain: 166,650 naive → 4,950 optimized
- 100 realistic: 4,544 → 1,426
- 200 realistic: 47,471 → 3,583
The trees produced are identical to the naive optimal — same schedule, less work.
Residual Worst Case
Not all inputs benefit. A "hub" pattern (every statement feeds one final sink) still requires full O(n³) because every span ties with longest chain = 2, never closing the bound. Such shapes are rare in real code. For those, Duncan suggests a follow-up algorithm (not detailed here).
Practical Impact
This optimization makes -foptimal-applicative-do viable for everyday use. Developers using Haxl or other batching monads can now write plain do notation without worrying about manual <$> and <*> refactoring, and get optimal parallelism without 55-second compile times.
The patch is not yet merged into GHC; Duncan is working on it. But the technique — borrowing DP from computational biology — is a neat example of cross-domain algorithm transfer.
Code Example
Consider a block with two independent fetches:
numCommonFriends x y = do
fx <- friendsOf x
fy <- friendsOf y
return (length (intersect fx fy))
With -foptimal-applicative-do, GHC desugars this to:
(\fx fy -> length (intersect fx fy)) <$> friendsOf x <*> friendsOf y
Instead of the sequential:
friendsOf x >>= \fx -> friendsOf y >>= \fy -> return (length (intersect fx fy))
This reduces two network round-trips to one.




