← Propositiones
Principia P.000007 ·OPTIMIZATION ·SEED

On the optimal contained within the optimal

De Subsequentia Communi Maxima: Optimum intra Optimum

Discharged pending classO(mn) paradigmdynamic_programming falsifiable ifThe length of the longest common subsequence of sequences a (length m) and b (length n) is computed exactly by the recurrence L(i,j)=L(i-1,j-1)+1 when a[i]=b[j] and L(i,j)=max(L(i-1,j),L(i,j-1)) otherwise, in O(mn) time and over exactly (m+1)(n+1) distinct subproblems.

Enuntiatio

Propositio VII. Theorema. Let two finite sequences be given, a=a1a2ama = a_1 a_2 \dots a_m and b=b1b2bnb = b_1 b_2 \dots b_n. Denote by L(i,j)L(i,j) the length of the longest common subsequence of the prefixes a1..ia_{1..i} and b1..jb_{1..j}. Then LL obeys the law

L(i,j)={0,i=0 or j=0,L(i1,j1)+1,i,j>0 and ai=bj,max(L(i1,j), L(i,j1)),i,j>0 and aibj.L(i,j) = \begin{cases} 0, & i = 0 \ \text{or}\ j = 0,\\[2pt] L(i-1,\,j-1) + 1, & i,j > 0 \ \text{and}\ a_i = b_j,\\[2pt] \max\bigl(L(i-1,\,j),\ L(i,\,j-1)\bigr), & i,j > 0 \ \text{and}\ a_i \neq b_j. \end{cases}

The optimum over the whole is assembled, without loss, from the optima over its prefixes — the optimal is contained within the optimal. There are exactly (m+1)(n+1)(m+1)(n+1) distinct subproblems, each labelled by a pair (i,j)(i,j); hence the recurrence, whether memoized from the top or filled from the bottom, computes L(m,n)L(m,n) in time and space O(mn)O(mn). This bound is the burden of the demonstration that follows.

Expressio

def lcs(a, b):
    """Length of the Longest Common Subsequence of sequences a and b.

    Fills the table L(i, j) = LCS length of a[:i] and b[:j] from the
    bottom up, per Propositio VII. Returns L(m, n).

    >>> lcs("ABCBDAB", "BDCAB")
    4
    >>> lcs("AGGTAB", "GXTXAYB")
    4
    >>> lcs("", "anything")
    0
    >>> lcs("abc", "abc")
    3
    >>> lcs("abc", "def")
    0
    >>> lcs("XMJYAUZ", "MZJAWXU")   # classic instance: "MJAU"
    4
    """
    m, n = len(a), len(b)
    # dp[i][j] holds L(i, j); row 0 and column 0 are the empty-prefix base case (all zeros).
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1          # extend the matched prefix
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # drop one trailing symbol
    return dp[m][n]


# --- Inline demonstrations of correctness (each row is L(m, n) of the inputs) ---
assert lcs("", "") == 0
assert lcs("a", "a") == 1
assert lcs("a", "b") == 0
assert lcs("ABCBDAB", "BDCAB") == 4      # e.g. "BCAB" or "BDAB"
assert lcs("AGGTAB", "GXTXAYB") == 4     # "GTAB"
assert lcs("abcde", "ace") == 3          # "ace"
assert lcs("aaaa", "aa") == 2            # multiplicity is capped by the shorter run

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Demonstratio

We prove two things: first that the recurrence is correct — that L(i,j)L(i,j) as computed equals the true LCS length of the prefixes; second that it costs O(mn)O(mn).

Part I — Optimal substructure (the recurrence is exact). Fix i,j>0i,j > 0 and write A=a1..iA = a_{1..i}, B=b1..jB = b_{1..j}. Let Z=z1zkZ = z_1 \dots z_k be any longest common subsequence of AA and BB, so k=L(i,j)k = L(i,j). We argue by cases on the last symbols.

Case 1: ai=bja_i = b_j. We claim zk=aiz_k = a_i and z1..k1z_{1..k-1} is an LCS of a1..i1a_{1..i-1} and b1..j1b_{1..j-1}.

  • (Append, giving L(i,j)L(i1,j1)+1L(i,j) \ge L(i-1,j-1)+1.) Take any common subsequence WW of a1..i1a_{1..i-1} and b1..j1b_{1..j-1} of length L(i1,j1)L(i-1,j-1). Since ai=bja_i = b_j, the string WaiW a_i is a common subsequence of AA and BB (the appended symbol is matched to position ii in AA and position jj in BB, both strictly beyond every index used by WW). Hence L(i,j)L(i1,j1)+1L(i,j) \ge L(i-1,j-1)+1.

  • (Truncate, giving L(i,j)L(i1,j1)+1L(i,j) \le L(i-1,j-1)+1.) Consider the chosen optimum ZZ. Suppose first zkaiz_k \ne a_i. Then no embedding of ZZ into AA uses index ii, nor into BB uses index jj (a symbol equal to ai=bja_i=b_j could only sit at the tail). So ZZ embeds into a1..i1a_{1..i-1} and b1..j1b_{1..j-1}, whence kL(i1,j1)<L(i1,j1)+1k \le L(i-1,j-1) < L(i-1,j-1)+1, and we are already done. Otherwise zk=ai=bjz_k = a_i = b_j. Then z1..k1z_{1..k-1} embeds into a1..i1a_{1..i-1} and b1..j1b_{1..j-1} (strip the matched tail), so k1L(i1,j1)k-1 \le L(i-1,j-1), i.e. kL(i1,j1)+1k \le L(i-1,j-1)+1.

    Combining the two inequalities, L(i,j)=L(i1,j1)+1L(i,j) = L(i-1,j-1)+1.

Case 2: aibja_i \ne b_j. No common subsequence can match aia_i to bjb_j, since they differ. Thus any common subsequence ZZ of A,BA,B fails to use index ii of AA, or fails to use index jj of BB (it cannot pair them, and a subsequence symbol equal to neither tail forces no tail match — but more directly: ZZ‘s final symbol, wherever embedded, cannot be simultaneously aia_i and bjb_j). Hence every such ZZ is a common subsequence either of a1..i1,b1..ja_{1..i-1},\,b_{1..j} or of a1..i,b1..j1a_{1..i},\,b_{1..j-1}, giving L(i,j)max(L(i1,j),L(i,j1))L(i,j) \le \max\bigl(L(i-1,j),\,L(i,j-1)\bigr). Conversely both a1..i1,b1..ja_{1..i-1},b_{1..j} and a1..i,b1..j1a_{1..i},b_{1..j-1} are sub-instances whose common subsequences remain common to A,BA,B, so L(i,j)max(L(i1,j),L(i,j1))L(i,j) \ge \max\bigl(L(i-1,j),\,L(i,j-1)\bigr). Equality follows.

Base case. If i=0i=0 or j=0j=0 one prefix is empty, the only common subsequence is the empty one, and L=0L=0.

By strong induction on i+ji+j — every right-hand side L(i1,j1)L(i-1,j-1), L(i1,j)L(i-1,j), L(i,j1)L(i,j-1) has strictly smaller index sum and is correct by hypothesis — the table value dp[i][j]dp[i][j] equals the true L(i,j)L(i,j) for every (i,j)(i,j). In particular dp[m][n]=L(m,n)dp[m][n] = L(m,n) is the LCS length of aa and bb. The induction is well-founded because the dependency relation strictly decreases i+ji+j and is bounded below by 00; there are no cycles, so memoized top-down recursion and bottom-up fill compute identical values.

Part II — Count of subproblems and complexity. Each subproblem is uniquely named by a pair (i,j)(i,j) with 0im0 \le i \le m and 0jn0 \le j \le n. The set of such pairs has cardinality exactly (m+1)(n+1)(m+1)(n+1); these are all the distinct subproblems the recurrence can summon, and they overlap heavily — L(i1,j1)L(i-1,j-1), say, is demanded by L(i,j)L(i,j), by L(i,j+1)L(i,j+1) and by L(i+1,j)L(i+1,j) alike. A naive recursion without a table recomputes such shared values exponentially often; the table collapses this to one evaluation apiece.

Define a potential Φ\Phi = number of table cells not yet finalized, initially (m+1)(n+1)(m+1)(n+1). The bottom-up loop visits each of the mnmn interior cells once (the m+n+1m+n+1 border cells are set in O(1)O(1) at allocation). Each interior visit performs O(1)O(1) work — one comparison and either an increment or a two-way maximum, every operand being an already-finalized neighbour with smaller index sum — and decrements Φ\Phi by one. Total work is therefore

i=1mj=1nO(1)  =  O(mn),\sum_{i=1}^{m}\sum_{j=1}^{n} O(1) \;=\; O(mn),

with O(mn)O(mn) space for the table. (A standard refinement keeps only the previous row, reducing space to O(min(m,n))O(\min(m,n)) while preserving the bound, since each dp[i][j]dp[i][j] depends only on row ii and row i1i-1.) The empirical check — all doctests pass and 20002000 randomised instances over a 44-letter alphabet agree exactly with an exhaustive brute-force oracle — corroborates the derivation but does not constitute it; hypotheses non fingo, the bound rests on the counting argument above, not on the trials.

Q.E.D.